A flaky test detector identifies flaky tests before they erode your team's trust in CI.
Flaky tests fail intermittently for the same code, consume hours of developer
time investigating non-issues, and block merges on tests that would pass if re-run.
Here's how to detect flaky tests automatically, manage them once found, and
make sure a passing result reflects a reliable test, not a lucky run.
What makes a test flaky
A flaky test is one that produces different results for the same code. The
classic pattern is:
Run 1: PASS
Run 2: FAIL
Run 3: PASS
Run 4: FAIL
Same code, different result. Common causes include:
- Timing dependencies: Tests that depend on async operations completing within a certain time
- Shared state: Tests that modify global state and don't clean up
- Network flakiness: Tests that make external API calls
- Resource contention: Tests that compete for CPU, memory, or database connections
- Random data: Tests that use non-deterministic inputs
The key characteristic is that the failure isn't caused by a bug in the code
under test — it's caused by something in the test infrastructure.
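For instance, a timing-dependent flaky test often looks something like this minimal
Jest sketch; the fake asynchronous save and the 50ms wait are invented for
illustration, not taken from a real suite:

// Minimal sketch of a timing-dependent flaky test (Jest + TypeScript).
// The fake async save stands in for any operation whose completion time
// varies between CI runs.
const saved: string[] = [];
function saveUser(name: string): void {
  // finishes after 20-80ms in this sketch, depending on "machine load"
  setTimeout(() => saved.push(name), 20 + Math.random() * 60);
}

test("user appears after saving", async () => {
  saveUser("Ada");
  await new Promise((resolve) => setTimeout(resolve, 50)); // hope 50ms is enough
  expect(saved).toContain("Ada"); // passes or fails depending on timing
});

On a fast runner the write finishes inside the 50ms window and the test passes;
under load it doesn't, and the identical code fails. Awaiting the operation
directly, or polling until it completes, removes the race.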
Why flaky tests are expensive
Flaky tests are more than just annoying. They have real costs:
- Investigation time: Developers spend hours looking for bugs that don't exist
- CI cycle waste: Re-runs to check if it's "real" waste time and money
- Trust erosion: When tests fail too often for no real reason, teams start ignoring test failures
- Blocker creation: A flaky test on main can block merging for the entire team
The cost multiplies with team size. A 10-person team with 2 flaky tests loses
2-4 hours per week to flaky-related overhead. That's $720-$1,800 per week in
lost productivity.
The detection pattern
The classic flaky test signal is an alternating pass/fail pattern. Here's
how to detect it. Deviera's CI Intelligence runs this analysis continuously
across your CI history, without any manual querying.
Step 1: Collect test history
Query your CI provider for test results across multiple runs. You're looking
for:
- Test name
- Run timestamp
- Pass/fail status
- Branch
- Duration
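As a rough sketch of this step, here's a small parser that turns JUnit XML reports
(the format most providers can emit or accept as an artifact) into per-test records.
The record shape and the regex-based extraction are simplifications, not any
provider's official API:

import { readFileSync } from "node:fs";

interface TestResult {
  name: string;
  status: "pass" | "fail";
  timestamp: string;   // ISO date of the CI run
  branch: string;
  durationSec: number;
}

// Very rough JUnit XML extraction: one <testcase> element per test, with a
// nested <failure> or <error> element when it failed. Real reports vary,
// so a proper XML parser is the better choice in practice.
function parseJUnitReport(path: string, timestamp: string, branch: string): TestResult[] {
  const xml = readFileSync(path, "utf8");
  const cases = xml.match(/<testcase\b[\s\S]*?(?:\/>|<\/testcase>)/g) ?? [];
  return cases.map((tc) => ({
    name: /name="([^"]*)"/.exec(tc)?.[1] ?? "unknown",
    status: tc.includes("<failure") || tc.includes("<error") ? "fail" : "pass",
    timestamp,
    branch,
    durationSec: Number(/time="([^"]*)"/.exec(tc)?.[1] ?? 0),
  }));
}

Append these records to a store keyed by test name, keeping a rolling window of
the last 20-50 runs per test.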
Step 2: Analyze for patterns
For each test, look for the flaky pattern:
Recent runs: [PASS, FAIL, PASS, FAIL, PASS]
Pattern: alternating
Flaky: YES
Or more complex patterns:
Recent runs: [PASS, PASS, FAIL, PASS, PASS, FAIL, PASS]
Pattern: intermittent (2 failures in 7 runs)
Flaky: YES
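A minimal sketch of that check over a test's recent results; the flip count and
15% threshold are illustrative starting points, not canonical values:

// Given a test's recent outcomes (oldest to newest), flag it when results
// flip back and forth or fail intermittently without a code change.
function looksFlaky(results: Array<"pass" | "fail">): { flaky: boolean; reason: string } {
  const failures = results.filter((r) => r === "fail").length;
  if (failures === 0 || failures === results.length) {
    return { flaky: false, reason: "consistent" }; // always passes or always fails
  }
  // Count PASS->FAIL and FAIL->PASS transitions.
  let flips = 0;
  for (let i = 1; i < results.length; i++) {
    if (results[i] !== results[i - 1]) flips++;
  }
  if (flips >= 3) return { flaky: true, reason: "alternating" };
  if (failures / results.length > 0.15) {
    return { flaky: true, reason: `intermittent (${failures} failures in ${results.length} runs)` };
  }
  return { flaky: false, reason: "isolated failure" };
}

looksFlaky(["pass", "fail", "pass", "fail", "pass"]); // { flaky: true, reason: "alternating" }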
Step 3: Calculate a score
Not all intermittent failures are equal. Calculate a "flakiness score" based on:
- Failure rate: What percentage of recent runs failed?
- Recency: Are failures happening now, or were they months ago?
- Branch distribution: Does it fail on main, feature branches, or both?
A test that failed 4 of the last 5 times on main is more urgent than a test
that failed 1 of the last 20 runs on feature branches.
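One way to sketch such a score; the recency decay and the double weight for main
are arbitrary starting points rather than a standard formula:

interface ScoredRun {
  status: "pass" | "fail";
  daysAgo: number;   // how long ago the run happened
  branch: string;    // e.g. "main" or a feature branch
}

// Weighted flakiness score in [0, 1]: recent failures on main count most.
function flakinessScore(runs: ScoredRun[]): number {
  if (runs.length === 0) return 0;
  let weightedFailures = 0;
  let totalWeight = 0;
  for (const run of runs) {
    const recency = 1 / (1 + run.daysAgo / 7);          // a week-old run counts half as much as today's
    const branchWeight = run.branch === "main" ? 2 : 1; // main failures matter more
    const weight = recency * branchWeight;
    totalWeight += weight;
    if (run.status === "fail") weightedFailures += weight;
  }
  return weightedFailures / totalWeight;
}

With weights like these, a test that failed 4 of its last 5 runs on main scores
near the top of the list, while a single months-old failure on a feature branch
scores close to zero.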
What to do when you detect a flaky test
Detection is only useful if it leads to action. When a flaky test is detected:
1. Create a tracking issue
Automatically create a ticket in your issue tracker with:
- Test name and file path
- Failure history (when it started, how often it fails)
- Likely cause (if determinable)
- Recommended fix
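A hedged sketch of that step against GitHub's issues REST API
(POST /repos/{owner}/{repo}/issues); the OWNER/REPO placeholder and the
GITHUB_TOKEN environment variable are assumptions, and any tracker with an API
works the same way:

// Create a tracking issue for a confirmed flaky test.
// Placeholders: OWNER/REPO and the GITHUB_TOKEN env var.
async function createFlakyTestIssue(testName: string, filePath: string, history: string): Promise<void> {
  const res = await fetch("https://api.github.com/repos/OWNER/REPO/issues", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.GITHUB_TOKEN}`,
      Accept: "application/vnd.github+json",
    },
    body: JSON.stringify({
      title: `Flaky test: ${testName}`,
      body: `File: ${filePath}\n\nRecent history:\n${history}\n\nLikely cause and recommended fix: to be filled in.`,
      labels: ["flaky-test"],
    }),
  });
  if (!res.ok) throw new Error(`Issue creation failed: ${res.status}`);
}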
2. Mark it as known-flaky
Configure your CI to allow known-flaky tests to fail without blocking merges.
This prevents a single flaky test from stopping the entire team.
3. Add to tech debt backlog
Track flaky tests in a dedicated "flaky tests" category. Prioritize fixing
the ones that fail most frequently on main.
4. Auto-retry
As a short-term fix, configure your CI to auto-retry flaky tests. If it passes
on retry, don't block the merge. This isn't a fix — it's a band-aid while
you fix the root cause.
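In Jest, for example, jest.retryTimes gives you per-file retries (it requires
Jest's default jest-circus runner); the retry count of 2 is just an example:

// Retry failing tests in this file up to 2 extra times before reporting a failure.
jest.retryTimes(2);

test("flaky checkout flow", async () => {
  // ...existing test body...
});

Retries hide the flake from the merge queue, but the detector should still record
the failed first attempt as a flakiness signal.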
Measuring flaky test health
Track these metrics over time:
- Flaky test count: How many tests are currently marked as flaky?
- Flaky test ratio: What percentage of your test suite is flaky?
- Flaky test impact: How many builds were blocked by flaky tests?
- Fix rate: How many flaky tests are being fixed vs. added?
The goal: a test suite with 0 flaky tests, or as close to 0 as possible.
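A small sketch of how these numbers can be computed from the detector's records;
the record shape here is an assumption carried over from the sketches above:

interface FlakyRecord {
  testName: string;
  detectedAt: Date;
  fixedAt?: Date;        // unset while the test is still flaky
  blockedBuilds: number; // builds this test blocked or forced to re-run
}

function flakyHealth(records: FlakyRecord[], totalTests: number, since: Date) {
  const open = records.filter((r) => !r.fixedAt);
  return {
    flakyCount: open.length,
    flakyRatio: open.length / totalTests,
    blockedBuilds: records.reduce((sum, r) => sum + r.blockedBuilds, 0),
    fixedThisPeriod: records.filter((r) => r.fixedAt && r.fixedAt >= since).length,
    addedThisPeriod: records.filter((r) => r.detectedAt >= since).length,
  };
}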
Track your progress weekly. The CI Health Score Calculator gives you a single
score across flakiness, pass rate, and recovery time — a fast diagnostic before
diving into raw metrics.
Tools that detect flaky tests automatically
Several CI platforms have built-in or bolt-on flaky test detection. Here's how
the main options compare, and where each one falls short.
GitHub Actions — native retry
GitHub Actions can tolerate failing tests with continue-on-error: true at the
step or job level, and retry them via third-party actions like nick-fields/retry.
This reduces noise but doesn't detect flakiness — it just masks it. No
flakiness score, no historical pattern analysis, no automatic ticket creation.
Datadog CI Visibility
Datadog's CI Visibility product tracks test runs across your pipeline and flags
tests with inconsistent results. It produces a per-test flakiness signal and
can integrate with APM traces to help diagnose timing-related failures. The
limitation: it's a visibility layer, not an action layer. Detecting the flaky
test and creating the ticket are still separate manual steps.
Gradle Develocity
Develocity includes predictive test selection and flakiness detection that works
well for Gradle and Maven builds. It aggregates test outcomes across builds and
surfaces flaky tests in its dashboard. Scope is narrow — it's JVM-first and
tied to the Develocity build scan ecosystem.
trunk.io
trunk.io offers dedicated flaky test detection that plugs into GitHub Actions
and other CI providers. It tracks test history, assigns a flakiness score, and
can quarantine flaky tests automatically by skipping them in PR builds.
Strongest standalone offering in this category.
Azure DevOps Test Analytics
Azure DevOps includes a Test Analytics panel that surfaces flaky tests based on
historical run data. It marks tests as flaky when they fail and pass on
identical commits, and supports automatic retry at the pipeline level. Best
suited to teams fully in the Azure DevOps ecosystem.
Where these tools stop short
Every tool above detects. None of them close the loop automatically. When a
flaky test is identified, someone still has to open a ticket, assign it, and
route it to the right engineer. For high-velocity teams, that manual step is
where flaky tests go to be forgotten.
Deviera adds the automation layer on top: when a flaky test pattern is confirmed,
it automatically creates a structured ticket in Linear, Jira, or ClickUp — with
the test name, file path, failure history, and suggested owner — so detection
turns into a fix.
How to quarantine a flaky test
Quarantining a flaky test means moving it out of your blocking CI suite so it
can't hold up merges, while keeping it visible enough that it still gets fixed.
This is the correct short-term response — not deletion, not ignoring, not
unlimited retries.
GitHub Actions: continue-on-error
Split your test suite into two jobs — tests-required (blocking)
and tests-flaky (non-blocking):
tests-flaky:
  runs-on: ubuntu-latest
  steps:
    - name: Run flaky test suite
      run: npm run test:flaky
      continue-on-error: true
The overall workflow passes even if the flaky suite fails. The failure is still
logged — it just doesn't block the merge.
GitLab CI: allow_failure
In GitLab CI, use allow_failure: true on the quarantined job:
test:flaky:
  script: npm run test:flaky
  allow_failure: true
You can also restrict the flaky job to scheduled (nightly) pipelines only,
keeping the signal without merge friction.
Jest: test.skip with a ticket link
For Jest projects, skip the flaky test at the source level and link to the
tracking ticket:
// Quarantined: race condition — see LINEAR-1234
test.skip("should process payment within timeout", () => {
  ...
});
A test.skip with no context becomes permanent. Linked to an open
ticket, it has a resolution path.
When to unquarantine
A test leaves quarantine when the root cause is identified and fixed — not when
it happens to pass a few times. Run the formerly-flaky test across 20+ CI runs
after the fix. Promote it back to the blocking suite only if the failure rate
drops to zero across those runs.
How a flaky test detector works
A flaky test detector is a system that continuously analyzes CI test data to identify
tests whose results are unreliable — not because the code is broken, but because the
test itself is non-deterministic. Here's how the detection pipeline works end to end.
Step 1: Collect test data from CI
The detector ingests test data from your CI provider across every run: test name,
file path, pass/fail status, run timestamp, branch, and duration. This test data
is stored per-test over time — not just the most recent run, but a rolling history
of 20–50 runs per test.
Most CI providers expose this via API (GitHub Actions, CircleCI, GitLab CI all
have test reporting endpoints). Some require a JUnit XML report or a test summary
artifact uploaded as part of the pipeline.
Step 2: Identify flaky tests by pattern
The detector scans each test's history for the flakiness signal: results that alternate
or vary without a corresponding code change. A test that produces PASS, FAIL, PASS, FAIL
on the same commit is identified as flaky. So is a test with a failure rate above 15%
on main that has no correlation with specific code changes.
Good detectors also weight recency: a test that failed yesterday is more urgent than
one that had a rough month six weeks ago and has been stable since.
Step 3: Manage flaky tests with automated routing
Detection is only useful if it leads to action. Once a test is identified as flaky,
the detector should automatically manage flaky tests by:
- Creating a structured ticket in your issue tracker with the test name, file path, and failure history
- Marking the test as known-flaky in your CI configuration so it stops blocking merges
- Assigning it to the last engineer who touched the test file (see the sketch after this list)
- Tracking it in a dedicated backlog so it has a resolution path
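For the assignment step, the simplest heuristic is to ask git who last modified
the test file. A rough sketch, assuming the detector runs inside a checkout of
the repository; the suggested owner is a starting point, not a guarantee of who
should fix it:

import { execFileSync } from "node:child_process";

// Suggest an owner: the author of the last commit that touched the test file.
function suggestOwner(testFilePath: string): string {
  return execFileSync(
    "git",
    ["log", "-1", "--format=%ae", "--", testFilePath],
    { encoding: "utf8" },
  ).trim();
}

suggestOwner("src/checkout/payment.test.ts"); // e.g. "dev@example.com"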
Step 4: Ensure a pass is reliable, not lucky
After a flaky test is fixed, validate that a passing result actually means the test
is stable, not that it merely happened to pass on the next run. Promote fixed tests
back to the blocking suite only after they run cleanly across 20+ consecutive CI runs,
not just once or twice. A detector that tracks "confirmed stable" status prevents the common
pattern of fixing a flaky test on paper while the root cause is still present.
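A minimal sketch of that confirmed-stable gate, reusing the per-test history idea
from earlier; the 20-run threshold mirrors the guidance above and is a configurable
default, not a magic number:

// A fixed test is promoted back to the blocking suite only after a clean
// streak of post-fix runs. `results` is the test's history, oldest to newest.
function confirmedStable(
  results: Array<{ status: "pass" | "fail"; timestamp: Date }>,
  fixedAt: Date,
  requiredCleanRuns = 20,
): boolean {
  const postFix = results.filter((r) => r.timestamp > fixedAt);
  return postFix.length >= requiredCleanRuns && postFix.every((r) => r.status === "pass");
}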
Frequently asked questions
What is a flaky test?
A flaky test produces different results — pass or fail — for the same code,
without any code changes between runs. The failure isn't caused by a bug in
the feature being tested. It's caused by something in the test environment:
timing, shared state, network calls, or non-deterministic data. The defining
characteristic is inconsistency: the same input produces different outputs
depending on when or where the test runs.
How do I find flaky tests in my CI pipeline?
The most reliable method is historical analysis: pull the last 20–30 runs for
each test across the same codebase revision, and flag any test where results
alternate or where the failure rate exceeds 10–20% without a corresponding code
change. Most CI platforms expose test results via API (GitHub Actions, CircleCI,
GitLab CI all have test reporting endpoints). Enabling test retry in CI and
treating "failed then passed on retry" as a flakiness signal is also highly
reliable — that pattern is nearly always a flaky test, not a real failure.
Should I delete flaky tests or fix them?
Fix them, not delete them. A flaky test almost always covers real behavior —
the test logic is usually correct, and the flakiness is a symptom of a
legitimate problem in the test setup (race condition, missing cleanup,
unreliable external dependency). Deleting the test removes coverage without
fixing the underlying issue. The right sequence: quarantine it so it stops
blocking merges, investigate the root cause, fix the test or the system under
test, then return it to the blocking suite. Only delete a test if it covers
behavior that no longer exists in the codebase.