A flaky test detector identifies flaky tests before they erode your team's trust in CI.
Flaky tests fail intermittently for the same code, consume hours of developer
time investigating non-issues, and block merges on tests that would pass if re-run.
Here's how to detect flaky tests automatically, manage them once found, and
make sure a passing result reflects a reliable test, not a lucky run.
What makes a test flaky
A flaky test is one that produces different results for the same code. The
classic pattern is:
Run 1: PASS
Run 2: FAIL
Run 3: PASS
Run 4: FAIL
Same code, different result. Common causes include:
- Timing dependencies: Tests that depend on async operations completing within a certain time
- Shared state: Tests that modify global state and don't clean up
- Network flakiness: Tests that make external API calls
- Resource contention: Tests that compete for CPU, memory, or database connections
- Random data: Tests that use non-deterministic inputs
The key characteristic is that the failure isn't caused by a bug in the code
under test — it's caused by something in the test infrastructure.
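For instance, a timing-dependent flaky test often looks something like this minimal
Jest sketch; the fake asynchronous save and the 50ms wait are invented for
illustration, not taken from a real suite:

// Minimal sketch of a timing-dependent flaky test (Jest + TypeScript).
// The fake async save stands in for any operation whose completion time
// varies between CI runs.
const saved: string[] = [];
function saveUser(name: string): void {
  // finishes after 20-80ms in this sketch, depending on "machine load"
  setTimeout(() => saved.push(name), 20 + Math.random() * 60);
}

test("user appears after saving", async () => {
  saveUser("Ada");
  await new Promise((resolve) => setTimeout(resolve, 50)); // hope 50ms is enough
  expect(saved).toContain("Ada"); // passes or fails depending on timing
});

On a fast runner the write finishes inside the 50ms window and the test passes;
under load it doesn't, and the identical code fails. Awaiting the operation
directly, or polling until it completes, removes the race.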
Why flaky tests are expensive
Flaky tests are more than just annoying. They have real costs:
- Investigation time: Developers spend hours looking for bugs that don't exist
- CI cycle waste: Re-runs to check if it's "real" waste time and money
- Trust erosion: When tests fail too often for no real reason, teams start ignoring test failures
- Blocker creation: A flaky test on main can block merging for the entire team
The cost multiplies with team size. A 10-person team with 2 flaky tests loses
2-4 hours per week to flaky-related overhead. That's $720-$1,800 per week in
lost productivity.
The detection pattern
The classic flaky test signal is an alternating pass/fail pattern. Here's
how to detect it. Deviera's CI Intelligence runs this analysis continuously
across your CI history, without any manual querying.
Step 1: Collect test history
Query your CI provider for test results across multiple runs. You're looking
for:
- Test name
- Run timestamp
- Pass/fail status
- Branch
- Duration
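As a rough sketch of this step, here's a small parser that turns JUnit XML reports
(the format most providers can emit or accept as an artifact) into per-test records.
The record shape and the regex-based extraction are simplifications, not any
provider's official API:

import { readFileSync } from "node:fs";

interface TestResult {
  name: string;
  status: "pass" | "fail";
  timestamp: string;   // ISO date of the CI run
  branch: string;
  durationSec: number;
}

// Very rough JUnit XML extraction: one <testcase> element per test, with a
// nested <failure> or <error> element when it failed. Real reports vary,
// so a proper XML parser is the better choice in practice.
function parseJUnitReport(path: string, timestamp: string, branch: string): TestResult[] {
  const xml = readFileSync(path, "utf8");
  const cases = xml.match(/<testcase\b[\s\S]*?(?:\/>|<\/testcase>)/g) ?? [];
  return cases.map((tc) => ({
    name: /name="([^"]*)"/.exec(tc)?.[1] ?? "unknown",
    status: tc.includes("<failure") || tc.includes("<error") ? "fail" : "pass",
    timestamp,
    branch,
    durationSec: Number(/time="([^"]*)"/.exec(tc)?.[1] ?? 0),
  }));
}

Append these records to a store keyed by test name, keeping a rolling window of
the last 20-50 runs per test.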
Step 2: Analyze for patterns
For each test, look for the flaky pattern:
Recent runs: [PASS, FAIL, PASS, FAIL, PASS]
Pattern: alternating
Flaky: YES
Or more complex patterns:
Recent runs: [PASS, PASS, FAIL, PASS, PASS, FAIL, PASS]
Pattern: intermittent (2 failures in 7 runs)
Flaky: YES
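A minimal sketch of that check over a test's recent results; the flip count and
15% threshold are illustrative starting points, not canonical values:

// Given a test's recent outcomes (oldest to newest), flag it when results
// flip back and forth or fail intermittently without a code change.
function looksFlaky(results: Array<"pass" | "fail">): { flaky: boolean; reason: string } {
  const failures = results.filter((r) => r === "fail").length;
  if (failures === 0 || failures === results.length) {
    return { flaky: false, reason: "consistent" }; // always passes or always fails
  }
  // Count PASS->FAIL and FAIL->PASS transitions.
  let flips = 0;
  for (let i = 1; i < results.length; i++) {
    if (results[i] !== results[i - 1]) flips++;
  }
  if (flips >= 3) return { flaky: true, reason: "alternating" };
  if (failures / results.length > 0.15) {
    return { flaky: true, reason: `intermittent (${failures} failures in ${results.length} runs)` };
  }
  return { flaky: false, reason: "isolated failure" };
}

looksFlaky(["pass", "fail", "pass", "fail", "pass"]); // { flaky: true, reason: "alternating" }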
Step 3: Calculate a score
Not all intermittent failures are equal. Calculate a "flakiness score" based on:
- Failure rate: What percentage of recent runs failed?
- Recency: Are failures happening now, or were they months ago?
- Branch distribution: Does it fail on main, feature branches, or both?
A test that failed 4 of the last 5 times on main is more urgent than a test
that failed 1 of the last 20 runs on feature branches.
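One way to sketch such a score; the recency decay and the double weight for main
are arbitrary starting points rather than a standard formula:

interface ScoredRun {
  status: "pass" | "fail";
  daysAgo: number;   // how long ago the run happened
  branch: string;    // e.g. "main" or a feature branch
}

// Weighted flakiness score in [0, 1]: recent failures on main count most.
function flakinessScore(runs: ScoredRun[]): number {
  if (runs.length === 0) return 0;
  let weightedFailures = 0;
  let totalWeight = 0;
  for (const run of runs) {
    const recency = 1 / (1 + run.daysAgo / 7);          // a week-old run counts half as much as today's
    const branchWeight = run.branch === "main" ? 2 : 1; // main failures matter more
    const weight = recency * branchWeight;
    totalWeight += weight;
    if (run.status === "fail") weightedFailures += weight;
  }
  return weightedFailures / totalWeight;
}

With weights like these, a test that failed 4 of its last 5 runs on main scores
near the top of the list, while a single months-old failure on a feature branch
scores close to zero.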
What to do when you detect a flaky test
Detection is only useful if it leads to action. When a flaky test is detected:
1. Create a tracking issue
Automatically create a ticket in your issue tracker with:
- Test name and file path
- Failure history (when it started, how often it fails)
- Likely cause (if determinable)
- Recommended fix
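A hedged sketch of that step against GitHub's issues REST API
(POST /repos/{owner}/{repo}/issues); the OWNER/REPO placeholder and the
GITHUB_TOKEN environment variable are assumptions, and any tracker with an API
works the same way:

// Create a tracking issue for a confirmed flaky test.
// Placeholders: OWNER/REPO and the GITHUB_TOKEN env var.
async function createFlakyTestIssue(testName: string, filePath: string, history: string): Promise<void> {
  const res = await fetch("https://api.github.com/repos/OWNER/REPO/issues", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.GITHUB_TOKEN}`,
      Accept: "application/vnd.github+json",
    },
    body: JSON.stringify({
      title: `Flaky test: ${testName}`,
      body: `File: ${filePath}\n\nRecent history:\n${history}\n\nLikely cause and recommended fix: to be filled in.`,
      labels: ["flaky-test"],
    }),
  });
  if (!res.ok) throw new Error(`Issue creation failed: ${res.status}`);
}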
2. Mark it as known-flaky
Configure your CI to allow known-flaky tests to fail without blocking merges.
This prevents a single flaky test from stopping the entire team.
3. Add to tech debt backlog
Track flaky tests in a dedicated "flaky tests" category. Prioritize fixing
the ones that fail most frequently on main.
4. Auto-retry
As a short-term fix, configure your CI to auto-retry flaky tests. If it passes
on retry, don't block the merge. This isn't a fix — it's a band-aid while
you fix the root cause.
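In Jest, for example, jest.retryTimes gives you per-file retries (it requires
Jest's default jest-circus runner); the retry count of 2 is just an example:

// Retry failing tests in this file up to 2 extra times before reporting a failure.
jest.retryTimes(2);

test("flaky checkout flow", async () => {
  // ...existing test body...
});

Retries hide the flake from the merge queue, but the detector should still record
the failed first attempt as a flakiness signal.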
Measuring flaky test health
Track these metrics over time:
- Flaky test count: How many tests are currently marked as flaky?
- Flaky test ratio: What percentage of your test suite is flaky?
- Flaky test impact: How many builds were blocked by flaky tests?
- Fix rate: How many flaky tests are being fixed vs. added?
The goal: a test suite with 0 flaky tests, or as close to 0 as possible.
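A small sketch of how these numbers can be computed from the detector's records;
the record shape here is an assumption carried over from the sketches above:

interface FlakyRecord {
  testName: string;
  detectedAt: Date;
  fixedAt?: Date;        // unset while the test is still flaky
  blockedBuilds: number; // builds this test blocked or forced to re-run
}

function flakyHealth(records: FlakyRecord[], totalTests: number, since: Date) {
  const open = records.filter((r) => !r.fixedAt);
  return {
    flakyCount: open.length,
    flakyRatio: open.length / totalTests,
    blockedBuilds: records.reduce((sum, r) => sum + r.blockedBuilds, 0),
    fixedThisPeriod: records.filter((r) => r.fixedAt && r.fixedAt >= since).length,
    addedThisPeriod: records.filter((r) => r.detectedAt >= since).length,
  };
}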
Track your progress weekly. The CI Health Score Calculator gives you a single
score across flakiness, pass rate, and recovery time — a fast diagnostic before
diving into raw metrics.
Tools that detect flaky tests automatically
Several CI platforms have built-in or bolt-on flaky test detection. Here's how
the main options compare, and where each one falls short.
GitHub Actions — native retry
GitHub Actions can tolerate failing tests with continue-on-error: true at the
step or job level, and retry them via third-party actions like nick-fields/retry.
This reduces noise but doesn't detect flakiness — it just masks it. No
flakiness score, no historical pattern analysis, no automatic ticket creation.
Datadog CI Visibility
Datadog's CI Visibility product tracks test runs across your pipeline and flags
tests with inconsistent results. It produces a per-test flakiness signal and
can integrate with APM traces to help diagnose timing-related failures. The
limitation: it's a visibility layer, not an action layer. Detecting the flaky
test and creating the ticket are still separate manual steps.
Gradle Develocity
Develocity includes predictive test selection and flakiness detection that works
well for Gradle and Maven builds. It aggregates test outcomes across builds and
surfaces flaky tests in its dashboard. Scope is narrow — it's JVM-first and
tied to the Develocity build scan ecosystem.
trunk.io
trunk.io offers dedicated flaky test detection that plugs into GitHub Actions
and other CI providers. It tracks test history, assigns a flakiness score, and
can quarantine flaky tests automatically by skipping them in PR builds.
Strongest standalone offering in this category.
Azure DevOps Test Analytics
Azure DevOps includes a Test Analytics panel that surfaces flaky tests based on
historical run data. It marks tests as flaky when they fail and pass on
identical commits, and supports automatic retry at the pipeline level. Best
suited to teams fully in the Azure DevOps ecosystem.
Where these tools stop short
Every tool above detects. None of them close the loop automatically. When a
flaky test is identified, someone still has to open a ticket, assign it, and
route it to the right engineer. For high-velocity teams, that manual step is
where flaky tests go to be forgotten.
Deviera adds the automation layer on top: when a flaky test pattern is confirmed,
it automatically creates a structured ticket in Linear, Jira, or ClickUp — with
the test name, file path, failure history, and suggested owner — so detection
turns into a fix.
How to quarantine a flaky test
Quarantining a flaky test means moving it out of your blocking CI suite so it
can't hold up merges, while keeping it visible enough that it still gets fixed.
This is the correct short-term response — not deletion, not ignoring, not
unlimited retries.
GitHub Actions: continue-on-error
Split your test suite into two jobs — tests-required (blocking)
and tests-flaky (non-blocking):
tests-flaky:
  runs-on: ubuntu-latest
  steps:
    - name: Run flaky test suite
      run: npm run test:flaky
      continue-on-error: true
The overall workflow passes even if the flaky suite fails. The failure is still
logged — it just doesn't block the merge.
GitLab CI: allow_failure
In GitLab CI, use allow_failure: true on the quarantined job:
test:flaky:
  script: npm run test:flaky
  allow_failure: true
You can also restrict the flaky job to scheduled (nightly) pipelines only,
keeping the signal without merge friction.
Jest: test.skip with a ticket link
For Jest projects, skip the flaky test at the source level and link to the
tracking ticket:
// Quarantined: race condition — see LINEAR-1234
test.skip("should process payment within timeout", () => {
  ...
});
A test.skip with no context becomes permanent. Linked to an open
ticket, it has a resolution path.
When to unquarantine
A test leaves quarantine when the root cause is identified and fixed — not when
it happens to pass a few times. Run the formerly-flaky test across 20+ CI runs
after the fix. Promote it back to the blocking suite only if the failure rate
drops to zero across those runs.
How a flaky test detector works
A flaky test detector is a system that continuously analyzes CI test data to identify
tests whose results are unreliable — not because the code is broken, but because the
test itself is non-deterministic. Here's how the detection pipeline works end to end.
Step 1: Collect test data from CI
The detector ingests test data from your CI provider across every run: test name,
file path, pass/fail status, run timestamp, branch, and duration. This test data
is stored per-test over time — not just the most recent run, but a rolling history
of 20–50 runs per test.
Most CI providers expose this via API (GitHub Actions, CircleCI, GitLab CI all
have test reporting endpoints). Some require a JUnit XML report or a test summary
artifact uploaded as part of the pipeline.
Step 2: Identify flaky tests by pattern
The detector scans each test's history for the flakiness signal: results that alternate
or vary without a corresponding code change. A test that produces PASS, FAIL, PASS, FAIL
on the same commit is identified as flaky. So is a test with a failure rate above 15%
on main that has no correlation with specific code changes.
Good detectors also weight recency: a test that failed yesterday is more urgent than
one that had a rough month six weeks ago and has been stable since.
Step 3: Manage flaky tests with automated routing
Detection is only useful if it leads to action. Once a test is identified as flaky,
the detector should automatically manage flaky tests by:
- Creating a structured ticket in your issue tracker with the test name, file path, and failure history
- Marking the test as known-flaky in your CI configuration so it stops blocking merges
- Assigning it to the last engineer who touched the test file (see the sketch after this list)
- Tracking it in a dedicated backlog so it has a resolution path
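For the assignment step, the simplest heuristic is to ask git who last modified
the test file. A rough sketch, assuming the detector runs inside a checkout of
the repository; the suggested owner is a starting point, not a guarantee of who
should fix it:

import { execFileSync } from "node:child_process";

// Suggest an owner: the author of the last commit that touched the test file.
function suggestOwner(testFilePath: string): string {
  return execFileSync(
    "git",
    ["log", "-1", "--format=%ae", "--", testFilePath],
    { encoding: "utf8" },
  ).trim();
}

suggestOwner("src/checkout/payment.test.ts"); // e.g. "dev@example.com"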
Step 4: Ensure a pass is reliable, not lucky
After a flaky test is fixed, validate that a passing result actually means the test
is stable, not that it merely happened to pass on the next run. Promote fixed tests
back to the blocking suite only after they run cleanly across 20+ consecutive CI runs,
not just once or twice. A detector that tracks "confirmed stable" status prevents the common
pattern of fixing a flaky test on paper while the root cause is still present.
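A minimal sketch of that confirmed-stable gate, reusing the per-test history idea
from earlier; the 20-run threshold mirrors the guidance above and is a configurable
default, not a magic number:

// A fixed test is promoted back to the blocking suite only after a clean
// streak of post-fix runs. `results` is the test's history, oldest to newest.
function confirmedStable(
  results: Array<{ status: "pass" | "fail"; timestamp: Date }>,
  fixedAt: Date,
  requiredCleanRuns = 20,
): boolean {
  const postFix = results.filter((r) => r.timestamp > fixedAt);
  return postFix.length >= requiredCleanRuns && postFix.every((r) => r.status === "pass");
}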
Frequently asked questions
What is a flaky test?
A flaky test produces different results — pass or fail — for the same code,
without any code changes between runs. The failure isn't caused by a bug in
the feature being tested. It's caused by something in the test environment:
timing, shared state, network calls, or non-deterministic data. The defining
characteristic is inconsistency: the same input produces different outputs
depending on when or where the test runs.
How do I find flaky tests in my CI pipeline?
The most reliable method is historical analysis: pull the last 20–30 runs for
each test across the same codebase revision, and flag any test where results
alternate or where the failure rate exceeds 10–20% without a corresponding code
change. Most CI platforms expose test results via API (GitHub Actions, CircleCI,
GitLab CI all have test reporting endpoints). Enabling test retry in CI and
treating "failed then passed on retry" as a flakiness signal is also highly
reliable — that pattern is nearly always a flaky test, not a real failure.
Should I delete flaky tests or fix them?
Fix them, not delete them. A flaky test almost always covers real behavior —
the test logic is usually correct, and the flakiness is a symptom of a
legitimate problem in the test setup (race condition, missing cleanup,
unreliable external dependency). Deleting the test removes coverage without
fixing the underlying issue. The right sequence: quarantine it so it stops
blocking merges, investigate the root cause, fix the test or the system under
test, then return it to the blocking suite. Only delete a test if it covers
behavior that no longer exists in the codebase.