Test evidence for regulated teams: what auditors actually want from your CI

Auditors don't care that your tests passed. They care that you can prove which tests ran, against which code, and that the results weren't quietly massaged before you shipped.

BuildPulse Team

June 24, 2026

The audit question that catches teams off guard

You're three days into a SOC 2 Type II audit. The auditor asks to see evidence that your change-management controls work — specifically, that a code change touching payment processing can't reach production without passing a defined test suite. You pull up your CI dashboard. Green checkmarks everywhere. You feel fine.

Then she asks: "Can you show me the actual test results artifact for this specific deployment from six weeks ago? And can you show me which test cases map to PCI requirement 6.3.2?"

Most engineering teams freeze here. Not because they don't test — they do — but because how they test, what they retain, and how they handle intermittent failures was never designed with this question in mind.

This post is about closing that gap before you're sitting across from that auditor.

What auditors actually want (it's not a green badge)

Compliance frameworks like SOC 2, ISO 27001, PCI DSS, and HIPAA's Security Rule all share a common thread when it comes to software change controls: they want a traceable, tamper-evident record that testing happened, what it covered, and what it concluded.

That breaks down into four concrete asks:

Artifact retention: The raw test output — JUnit XML, TAP output, whatever your runner produces — stored somewhere auditors can access, tied to a specific commit SHA and build ID. A screenshot of a green pipeline is not an artifact.
Coverage of defined requirements: For regulated scope (auth flows, data encryption, audit-log writes), auditors want to see tests labeled to those requirements, not just passing.
Immutability: The evidence record can't be a file someone could overwrite after the fact. S3 with Object Lock, artifact storage in your CI platform with retention policies, or a signed attestation — something that demonstrates the record hasn't changed.
Honest pass/fail semantics: If a test failed, then passed on retry, that is not a clean pass. That's a conditional pass with an unexplained failure event in the middle. The difference matters enormously in a regulated context.

The fourth point is where flaky tests go from an annoyance to a liability.

How reruns quietly corrupt your evidence trail

Here's a scenario I've watched play out at a mid-size fintech. Their CI pipeline was configured with retry: 2 on all test jobs — a reasonable default to avoid blocking engineers over transient infrastructure hiccups. The problem: every flaky test in the suite was silently absorbing those retries. A test that failed, then failed again, then passed on the third attempt would report as green. The JUnit XML artifact uploaded to S3 reflected only the final passing run.

From an audit perspective, the evidence showed a clean pass. From a reality perspective, a test covering a critical funds-transfer validation was failing two out of three times on a specific code path — and nobody knew.

This is the quiet corruption. The evidence trail said "tests passed." The underlying behavior said "something is wrong, and we're hiding it by running the test until it agrees with us."

When an auditor asks "did your tests pass for this release?" and the honest answer is "well, they passed eventually," you have a problem. Change-management controls in frameworks like PCI DSS expect testing to demonstrate fitness for release, not merely to have produced a passing result after an undisclosed number of attempts.

The retry behavior also breaks traceability. If your JUnit XML is generated from the final retry attempt, it doesn't contain the failure data from earlier attempts. You've overwritten evidence.
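One way to keep that failure data is to retain the JUnit XML from every attempt and derive the verdict from all of them. The sketch below is a minimal illustration, not a feature of any particular CI platform: the function names and the verdict labels (`clean_pass`, `passed_on_retry`, `failed`) are assumptions made up for this example, and it presumes you already save one report per attempt.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def outcomes(junit_xml: str) -> dict[str, bool]:
    """Map 'classname.name' to True if that test case passed in this report."""
    root = ET.fromstring(junit_xml)
    result = {}
    for case in root.iter("testcase"):
        if case.find("skipped") is not None:
            continue  # a skipped case is neither a pass nor a failure
        key = f"{case.get('classname')}.{case.get('name')}"
        failed = case.find("failure") is not None or case.find("error") is not None
        result[key] = not failed
    return result


def classify(attempt_reports: list[str]) -> dict[str, str]:
    """Classify each test across all attempts, given reports in attempt order."""
    history = defaultdict(list)
    for report in attempt_reports:
        for key, passed in outcomes(report).items():
            history[key].append(passed)

    verdicts = {}
    for key, runs in history.items():
        if all(runs):
            verdicts[key] = "clean_pass"       # never failed in any attempt
        elif runs[-1]:
            verdicts[key] = "passed_on_retry"  # green only after a failure
        else:
            verdicts[key] = "failed"
    return verdicts
```

A test that failed twice and passed on the third attempt comes out as `passed_on_retry` rather than green, which is the distinction the evidence record needs to preserve.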

What your JUnit XML should actually contain

JUnit XML is the lingua franca of CI test evidence. Nearly every framework can emit it. Nearly every CI platform can ingest it. And it's what most auditors will accept as the raw artifact, provided it's retained correctly.

A well-structured JUnit XML artifact for a regulated environment looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<testsuites name="payment-service" time="14.321" tests="42" failures="0" errors="0" skipped="1">
  <testsuite name="TransferAuthorizationTest" timestamp="2025-04-15T18:32:01Z"
             hostname="runner-abc123" tests="8" failures="0" time="3.204">
    <!-- requirement traceability via custom properties -->
    <properties>
      <property name="requirement" value="PCI-DSS-6.3.2"/>
      <property name="git.sha" value="a3f9c1d"/>
      <property name="build.id" value="ci-4821"/>
    </properties>
    <testcase name="authorize_transfer_above_threshold_requires_mfa"
              classname="TransferAuthorizationTest" time="0.412">
    </testcase>
    <testcase name="authorize_transfer_with_expired_token_is_rejected"
              classname="TransferAuthorizationTest" time="0.389">
    </testcase>
    <!-- remaining test cases in this suite, and the other suites, omitted for brevity -->
  </testsuite>
</testsuites>

A few things worth calling out:

timestamp and hostname place the test run in time and on a specific runner. Auditors use these to correlate with deployment logs.
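properties with requirement tags create the traceability link. You can attach the tags through your test framework's metadata features: JUnit 5 has @Tag, pytest has pytest.mark, and RSpec has user-defined metadata on examples and groups. Tags do not necessarily reach the JUnit XML on their own, so confirm that your reporter writes them into <properties>. The discipline is adding them consistently for regulated-scope tests.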
git.sha and build.id in properties tie the artifact to a specific code state and pipeline execution. Without these, you have test results floating in a void.

One thing this artifact doesn't capture: retry history. If your CI platform retried this suite before producing a green result, that information is gone unless you capture it explicitly.
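If your pipeline knows how each attempt ended, one way to capture that explicitly is to write it into the suite's `<properties>` before the report is uploaded. This is a rough sketch under that assumption: the property names `ci.attempts` and `ci.attempt_results` are invented for illustration and are not part of any JUnit XML convention, and it supplements rather than replaces keeping each attempt's raw report.

```python
import xml.etree.ElementTree as ET


def add_retry_history(junit_xml: str, attempt_results: list[str]) -> str:
    """Return the report with the per-attempt results recorded on every suite.

    attempt_results is ordered, e.g. ["failed", "failed", "passed"].
    """
    root = ET.fromstring(junit_xml)
    # The root may be a single <testsuite> or a <testsuites> wrapper.
    suites = [root] if root.tag == "testsuite" else root.findall("testsuite")
    for suite in suites:
        props = suite.find("properties")
        if props is None:
            props = ET.Element("properties")
            suite.insert(0, props)
        ET.SubElement(props, "property",
                      name="ci.attempts", value=str(len(attempt_results)))
        ET.SubElement(props, "property",
                      name="ci.attempt_results", value=",".join(attempt_results))
    return ET.tostring(root, encoding="unicode")
```

Calling `add_retry_history(report, ["failed", "failed", "passed"])` leaves the test cases untouched and adds two properties that make the retries visible to anyone reading the artifact later.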

Retention: where most teams are misconfigured

GitHub Actions artifacts expire in 90 days by default. GitLab CI artifacts have configurable expiry that teams frequently leave at the platform default or set short to manage storage costs. CircleCI artifact retention policies are similarly easy to overlook.

For a SOC 2 Type II audit, auditors typically examine an observation period of up to 12 months. For PCI DSS, audit logs must be retained for at least 12 months, with the most recent 3 months immediately available. If your JUnit XML artifacts expire after 90 days, most of that window has no evidence behind it.

The fix is straightforward but requires intentionality:

# GitHub Actions — upload JUnit XML with extended retention
- name: Upload test results
  uses: actions/upload-artifact@v4
  if: always()  # critical: run even if tests fail
  with:
    name: test-results-${{ github.sha }}-${{ github.run_id }}
    path: test-results/**/*.xml
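    # > 12 months. 400 is the ceiling for private repositories (90 for public),
    # and repository or organization settings can cap it lower.
    retention-days: 400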

Two things here beyond retention days. First, if: always() — if you only upload artifacts on success, you have no evidence for failed runs. Auditors want to see the full record, including failures and their remediation. Second, including both github.sha and github.run_id in the artifact name makes retrieval by commit or by pipeline run tractable when you're digging through six months of history.

For longer-term storage, many regulated teams pipe artifacts to S3 with Object Lock in Compliance mode:

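import os
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client('s3')

# Hypothetical inputs: the variable names and report path depend on your pipeline
git_sha = os.environ['GIT_SHA']
build_id = os.environ['BUILD_ID']
with open('test-results/junit.xml', 'rb') as f:
    junit_xml_content = f.read()

# Upload with Object Lock; the bucket must have Object Lock (and versioning) enabled
s3.put_object(
    Bucket='ci-audit-artifacts',
    Key=f'test-results/{git_sha}/{build_id}/junit.xml',
    Body=junit_xml_content,
    ChecksumAlgorithm='SHA256',  # S3 needs Content-MD5 or a checksum on Object Lock uploads
    ObjectLockMode='COMPLIANCE',
    # The retain-until date must be in the future, so compute it instead of hardcoding
    ObjectLockRetainUntilDate=datetime.now(timezone.utc) + timedelta(days=400),
)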

Object Lock in Compliance mode means no one, not even the bucket owner or the account's root user, can delete or overwrite the locked object version, or shorten its retention, before the retain-until date. That's the tamper-evident property auditors are looking for.
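Object Lock stops the stored copy from being removed, but it helps to also be able to show that the file you hand over is byte-for-byte what CI produced. A lightweight complement, sketched here as my own assumption rather than something S3 or any framework requires, is to record a SHA-256 digest at upload time and check it at retrieval. The function names and the manifest fields are hypothetical.

```python
import hashlib


def manifest_entry(key: str, body: bytes, git_sha: str, build_id: str) -> dict:
    """Describe one uploaded artifact so it can be verified later."""
    return {
        "key": key,
        "git_sha": git_sha,
        "build_id": build_id,
        "size": len(body),
        "sha256": hashlib.sha256(body).hexdigest(),
    }


def verify(entry: dict, body: bytes) -> bool:
    """True if the retrieved bytes match the digest recorded at upload time."""
    return hashlib.sha256(body).hexdigest() == entry["sha256"]
```

Storing the manifest somewhere separate from the artifacts (and locking it the same way) means a mismatch between the two is detectable rather than silent.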

The flaky test problem is a traceability problem

Here's where this connects directly to flaky test management and not just DevOps housekeeping.

A test that's flaky — genuinely nondeterministic, intermittently failing for reasons unrelated to the code under test — produces ambiguous evidence. If that test covers a regulated requirement, every run it's involved in produces evidence of uncertain quality.

Regulated teams often respond to this by quarantining known-flaky tests from their compliance-tagged suites. That's a defensible approach, but it requires actually knowing which tests are flaky, tracking them systematically, and having a documented remediation process. "We excluded this test because it's been flaky" is an acceptable answer to an auditor only if you also have a documented plan to fix it and a history of the flaky behavior that justifies the exclusion.

This is one of the places where a platform like BuildPulse earns its keep in regulated environments — not just flagging flaky tests for developer convenience, but giving you the historical record of failure patterns that makes a quarantine decision auditable. "We quarantined this test on this date because it had a 34% failure rate over 90 days, and here's the evidence" is a very different posture than "we turned it off because it was annoying."
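To make the arithmetic behind a statement like that concrete, here is a minimal sketch of a windowed failure rate. It is not how BuildPulse computes anything; the function name, the input shape (a list of run dates with pass/fail flags), and the 90-day default are assumptions for illustration.

```python
from datetime import date, timedelta


def failure_rate(runs: list[tuple[date, bool]], as_of: date,
                 window_days: int = 90) -> float:
    """Fraction of runs that failed in the window ending on as_of.

    Each run is (day it ran, True if it passed).
    """
    start = as_of - timedelta(days=window_days)
    recent = [passed for day, passed in runs if start < day <= as_of]
    if not recent:
        raise ValueError("no runs in the window, so there is no rate to report")
    return recent.count(False) / len(recent)
```

Refusing to return a number when there are no runs is deliberate: a quarantine justified by "0% of zero runs" is not evidence of anything.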

Building that paper trail intentionally is worth the effort. If you want a deeper look at how quarantine policies should be structured, this post on flaky test quarantine strategies covers the mechanics.

Requirement traceability: the missing layer

Most teams have tests. Fewer have tests labeled to requirements. Almost none have an automated check that every item in their requirements traceability matrix (RTM) has at least one passing test in the current build.

For fintech and healthtech, an RTM isn't optional — it's often a direct audit deliverable. The pattern I'd recommend:

Define your regulated-scope requirements as tags in your test framework (@pytest.mark.pci_6_3_2, @Tag("hipaa-164.312.a.1")).
Emit those tags into your JUnit XML <properties> at test run time.
Add a CI step that parses the JUnit XML and validates that every required tag appears in the results with zero failures. Fail the build if coverage is missing.
Retain that validation artifact alongside the test results.

This makes the traceability check part of the gate, not a separate manual exercise before each audit.
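The validation step can be small. The sketch below assumes the layout shown earlier, where each `<testsuite>` carries a `requirement` property, and reports which required tags are missing from the results and which appear on a suite with failures or errors. The function name and the return shape are hypothetical; a real gate would also need to decide how to treat skipped tests, which here simply don't count as coverage.

```python
import xml.etree.ElementTree as ET


def check_traceability(junit_xml: str, required: set[str]) -> dict[str, list[str]]:
    """Compare requirement tags found in a JUnit XML report against the RTM."""
    root = ET.fromstring(junit_xml)
    seen, failing = set(), set()
    for suite in root.iter("testsuite"):
        tags = {
            prop.get("value")
            for prop in suite.findall("properties/property")
            if prop.get("name") == "requirement"
        }
        executed = int(suite.get("tests", "0")) - int(suite.get("skipped", "0"))
        broken = int(suite.get("failures", "0")) + int(suite.get("errors", "0"))
        if executed > 0:
            seen |= tags      # at least one test for this tag actually ran
        if broken > 0:
            failing |= tags   # any failure or error taints the tag
    return {
        "missing": sorted(required - seen),
        "failing": sorted(required & failing),
    }
```

A CI step would call this with the set of tags from the RTM and exit non-zero if either list is non-empty, then upload the returned dictionary as the validation artifact.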

The CI signal question, applied

Compliance work and test reliability work are often treated as separate concerns — one belongs to security and GRC, the other to the engineering team's developer experience charter. That separation is artificial and expensive.

When a flaky test fires in a compliance-tagged suite and gets silently retried to green, the CI signal failed at two levels: it failed as a reliability indicator (something is wrong with this test or the code), and it failed as compliance evidence (the record no longer honestly represents what happened).

The teams that handle audits well aren't the ones with the most tests. They're the ones that can answer "can I trust my CI signal?" with a yes — and then hand an auditor a traceable, retained, immutable artifact that backs that claim up.

That's not a compliance project. That's just engineering done right, with the retention policies turned on.
