Ultimate Guide to Fixing Flaky Tests

Flaky Tests

Mar 28, 2024

You have flaky tests? Where do you start? You can start here.

This is where BuildPulse shines - we automatically catalog flaky tests as they appear across pull requests or periodic builds, assess their impact in terms of disruptiveness and developer time consumed, mitigate that impact through test quarantining, surface relevant context clues to determine root cause, and verify whether flakiness has truly been fixed.

We see all kinds of flaky tests. This is a comprehensive guide to test flakiness: setting up workflows and SOPs around flaky tests, mitigating their impact, and fixing them. It applies to both unit and integration tests.

What is a Flaky Test?

A flaky test is an unreliable test that sometimes passes and sometimes fails even though no changes were made to the code. How can this happen? There’s usually some form of nondeterminism at play, where the behavior or execution of the code changes each time it runs. This unpredictability makes flaky tests hard to spot and debug - they don’t fail reliably, so how are you supposed to reproduce the failure? Left unchecked, they force developers to keep re-running builds, which wastes time and increases CI costs. The test itself doesn’t work, so it masks issues that could lead to incidents in production - not to mention that flaky tests undermine confidence in the testing process itself.

Keeping track of flaky tests is crucial for maintaining a healthy engineering team and product, ensuring that tests reliably indicate the state of the codebase.

Reasons They Occur

The root cause of a flaky test almost always contains an element of nondeterminism: race conditions, external resource access, leaked state, or timing-related issues. Let’s go a bit more into these:

  • Race Conditions
    A race condition occurs when two actions are taken on the same resource at the same time, and the order in which they happen changes the outcome of the program. For example, if both my friend and I reach for the same piece of cake, a different person may end up with the cake on each occurrence (a short code sketch of this pattern follows the list).

  • External Resource Access
    If I rely on an external resource that’s out of my control, I’m relying on it to behave a certain way. If the behavior changes intermittently, then my assumptions are broken. One example of this is an API request to another service that’s not maintained by my team.

  • Leaked State
    State leaks when one action on a resource unintentionally breaks the precondition assumptions another action relies on. For example, say I have a tennis ball and two operations: pick up and throw. If I pick up the ball first and throw it, it ends up far away from me. If I do a throwing motion first and then pick up the ball, the ball hasn’t gone anywhere. Picking up the ball was a precondition for being able to throw it (see the sketch after this list).

  • Timing Issue
    Timing issues can occur when you rely on an event happening within a certain window of time and the event originates from an external system. An example would be a test that checks whether a popup appears after a button click.

  • Deprecated Functionality
    Old or unused code that has persisted over time, usually with no directly responsible maintainer.

  • Brittle Tests
    Tests that are overfitted to the implementation rather than to the functionality. We will do another post on this :)
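
To make race conditions and leaked state concrete, here is a minimal, contrived pytest-style sketch in Python - the test names and the shared CART and counter values are hypothetical, not taken from a real suite. The first pair of tests leaks state through a module-level list, so the order in which they run decides the outcome; the last test races two threads on an unsynchronized counter.

    # flaky_patterns.py - contrived examples of leaked state and a race condition
    import threading

    CART = []  # module-level state shared by both cart tests

    def test_add_item_to_cart():
        CART.append("apple")       # mutates shared state and never cleans it up
        assert CART == ["apple"]

    def test_cart_starts_empty():
        assert CART == []          # passes if it runs before test_add_item_to_cart,
                                   # fails if it runs after - test order decides

    def test_concurrent_increment():
        counter = {"value": 0}

        def work():
            for _ in range(100_000):
                counter["value"] += 1   # read-modify-write is not atomic

        threads = [threading.Thread(target=work) for _ in range(2)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

        # Interleaved updates can be lost, so this sometimes comes up short -
        # and that unpredictability is exactly what makes the test flaky.
        assert counter["value"] == 200_000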

Why they’re hard to capture, catalog, and fix

Capturing and cataloging flaky tests is challenging due to their intermittent nature - they might only fail under specific conditions that are hard to replicate. Because they occur sporadically, they’re often ignored - especially when the developer who hits the failure doesn’t own the test.

Without a dedicated system for tracking these failures, identifying a pattern becomes difficult, especially in large organizations where information might be siloed.

Fixing flaky tests is complicated because the root cause is not always clear. It requires thorough investigation to replicate the flaky behavior consistently, which can take time - often in short supply. The intermittent nature of these tests means that a passing test doesn't always indicate a problem has been resolved.

Impact on the Organization If They’re Not Fixed

These tests are continually re-run until they pass, which consumes a significant amount of time in aggregate. Aside from being a waste of time and CI cost, code that’s hard to merge slows down releases and hurts developer experience. At the end of the day, these tests are broken and don’t test anything, which creates opportunities for issues to slide into production. And those issues tend to be just as tough to reproduce as the test failures themselves.

All this is great, but I have flaky tests today. Where do I start?

  1. Cataloging and Finding Impact

    The first step in addressing flaky tests is to catalog them systematically. This involves creating a repository or tracking system (spreadsheet, tickets, etc.) where each flaky test is recorded along with its characteristics - such as when it fails, under what conditions, and how often. The catalog should be accessible to all team members to contribute to and consult. The purpose of cataloging is not just to enumerate the tests and track their status, but also to provide context for reproducing them and to create ownership.

    Some tests may be more critical than others, affecting key features or frequently used code paths. By assessing the impact, teams can prioritize which flaky tests require immediate attention, ensuring that efforts are focused where they can have the most significant effect on improving the stability and reliability of the product.

  2. Impact Mitigation

    Once flaky tests are identified and prioritized, the next step is to mitigate their impact until they are fixed - so you don’t suffer in the meantime. One common strategy is to retry tests: you can configure your test framework to retry failed tests. This is a more granular form of retrying the whole build, and it can save time. One downside of this approach is that it hides which tests may be flaky.

    Another strategy is to simply disable the most disruptive tests temporarily until a fix can be implemented. It’s crucial, however, that this is only a temporary measure and that disabled tests are tracked - flakiness is best solved right away, while the state of the application and test is fresh and ownership is clear. Although these strategies don’t fix the underlying issue, they can help avoid unnecessary retries. A minimal retry configuration is sketched below.
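
    As a sketch of what retries can look like - assuming a Python suite run with pytest plus the pytest-rerunfailures plugin (swap in your own framework's equivalent) - you can retry every failure from the command line, or mark only the known-flaky tests so the retries stay visible and targeted:

      # Retry every failed test up to 2 times, waiting 1 second between attempts:
      #   pytest --reruns 2 --reruns-delay 1

      # Or scope retries to a single known-flaky test with a marker:
      import pytest

      @pytest.mark.flaky(reruns=3, reruns_delay=2)
      def test_fetches_user_profile():  # hypothetical test name
          ...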

  3. Delegation and Test Ownership

    Assigning ownership of tests to specific team members or sub-teams is critical in managing flaky tests effectively. Ownership means that specific individuals are responsible for maintaining the test, tracking its performance, and addressing any issues that arise. This approach ensures that tests do not become orphaned, which often leads to neglect. Such neglect is understandable: no single team can understand every component of a system, and every team has its own deadlines - which makes it easy to abdicate responsibility.

    The owners are also responsible for making the tough calls—such as whether a test needs to be rewritten, removed, or fixed. This level of responsibility and accountability encourages a more proactive approach to maintaining test health and ensures that flaky tests are addressed in a timely manner.

  4. Debugging

    1. Surfacing Relevant Context

      Debugging flaky tests begins with gathering as much context as possible about the test and its environment. This includes finding the commit where the test first started flaking, the error messages for flakes (as opposed to true failures), how disruptive it is (for prioritization), the timing of occurrences, and the environments where flakiness is observed. Surfacing this context requires tools and practices that capture detailed logs and system state - or combing through build logs by hand. This detailed information is important for understanding the complex interplay of factors that contributes to a test’s flakiness.

    2. Reproducing Flakiness

      The next step is being able to reproduce the conditions under which the test fails - this is the first step in determining how the test fails. There are a number of strategies you can try to reproduce the failure:

      • Rerunning the individual test in isolation: If you see failures here, then the issue is likely contained within the test.

      • Running your tests serially and in different orders: If you see failures here, the issue is likely due to shared state leaking between two tests. For example, if one test alters rows in a database, and another test assumes the rows are unaltered, the ordering of the tests matters.

      • Running your tests in parallel: If you see failures here, the issue could be leaked state, but could also be due to a race condition.

      • Using a debugger with the strategies above and stepping through to check state along the way

      It’s important to keep an open mind, as any of the common issues can crop up as you increase the scope of execution.
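
      As a rough way to apply the first strategy - assuming a pytest suite, with a hypothetical test id standing in for your flaky test - you can hammer the test in a loop and count how often it fails:

        # repro_flake.py - rerun one test repeatedly and count how often it fails
        import subprocess
        import sys

        TEST_ID = "tests/test_checkout.py::test_applies_discount"  # hypothetical
        RUNS = 100

        failures = 0
        for _ in range(RUNS):
            # run each attempt in a fresh process so reruns can't share state
            result = subprocess.run([sys.executable, "-m", "pytest", "-q", TEST_ID])
            if result.returncode != 0:
                failures += 1

        print(f"{failures}/{RUNS} isolated runs failed")

      If isolated reruns never fail, point the same harness at the whole suite and vary ordering or parallelism (for example with plugins such as pytest-randomly or pytest-xdist) to flush out leaked state and race conditions.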

    3. Reproducing Stability

      After identifying failure cases, it’s important to identify success cases as well, and how they occur. This is easier once you have the failure cases in hand, because you can control individual variables to force the test to pass - giving context clues about what the root cause might be.

      This might involve adjustments to the test itself, such as refining assertions or eliminating race conditions, as well as changes to the testing infrastructure to simplify testing conditions. 

    4. Diagnosing why (root cause)

      The final step in debugging flaky tests is diagnosing why the test was flaky in the first place. This requires a deep dive into the test's failure and success patterns, exploring both the test code and the application code it interacts with. You want to look for key levers that influence the outcome of tests, while using a debugger to check state along the way - whether it's a timing issue, a resource contention problem, or a hidden bug in the application.

      If you’re running integration tests, looking at spans from traces can also help - you may find that different tests are accessing the same resources or data. Traces can also help identify the conditions under which flakiness is more likely to occur.

  5. Applying the fix

    Once the root cause is understood, you can use your context on the test and the application to make the changes necessary to prevent it from recurring.
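
    For example, a timing-related flake is often fixed by replacing a fixed sleep with a bounded wait that polls for the condition. Here is a rough sketch in Python - the wait_for helper and the popup_is_visible check are hypothetical stand-ins for your own code:

      import time

      def wait_for(condition, timeout=5.0, interval=0.1):
          """Poll `condition` until it returns True or `timeout` seconds pass."""
          deadline = time.monotonic() + timeout
          while time.monotonic() < deadline:
              if condition():
                  return True
              time.sleep(interval)
          return False

      # Before (flaky): the popup sometimes takes longer than the fixed sleep
      #   time.sleep(1)
      #   assert popup_is_visible()

      # After: deterministic up to an explicit, generous timeout
      #   assert wait_for(popup_is_visible, timeout=5.0)

    A bounded wait keeps the test fast in the common case while tolerating slow runs, and the explicit timeout makes genuine failures obvious instead of intermittent.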

  6. Verifying if the Test is Fixed

    After applying fixes to address flakiness, the critical next step is to verify if the test is indeed fixed. This verification process is not a one-time check but a rigorous, ongoing evaluation to ensure the test consistently behaves as expected across a variety of conditions. To effectively verify a test's stability, it should be run multiple times under different configurations, environments, and orders to mimic the range of scenarios it might encounter in real-world operations. This could involve varying network conditions, different database states, or changes in dependent services.

    Moreover, integrating the fixed test into the continuous integration and deployment (CI/CD) pipeline serves as an additional layer of validation. Observing the test's behavior over time provides valuable insights into its reliability and the success of the fixes applied. Teams should also consider using metrics and monitoring tools to track the test's performance, looking for any signs of regression or new instances of flakiness.

Wow, this is great! But it’s so much work.

You’re right. Dealing with flakiness requires a significant amount of effort that is often in short supply, given constraints on time, knowledge, and impact. Although you can do all of this in a spreadsheet, at a certain scale the process becomes prone to human error.

This is where BuildPulse shines - we help you automatically catalog flaky tests as they appear across pull requests or periodic builds, assess their impact in terms of disruptiveness and aggregate developer time consumed, mitigate that impact through test quarantining, surface relevant context clues to determine root cause, and verify whether flakiness has truly been fixed.

We provide reporting, alerts, and integrations with whichever tools you use as a part of your developer workflow. If you’re interested in increasing developer productivity, or need advice on conquering flakiness - get in touch!

FAQ

What is the difference between a flaky test and a false positive?

A false positive is a test failure in your test suite due to an actual error in the code being executed, or a mismatch in what the test expects from the code.

A flaky test is one that produces conflicting results for the same code. For example, if you see a test fail and then pass while running tests, but the code hasn’t changed, it’s a flaky test. There are many causes of flakiness.

What is an example of a flaky test?

A common example shows up in growing test suites: pull request builds fail for changes you haven’t made. Put differently, you see a test pass and fail without any code change. Those failing tests are flaky tests.

What are common causes of flakiness?

Broken assumptions in test automation and the development process can introduce flaky tests - for example, if test data is shared between different tests (whether they run asynchronously, with high concurrency, or sequentially), the results of one test can affect another.

Poorly written test code can also be a factor: improper polling, race conditions, improper handling of event dependencies, shared test data, or missing timeout handling for network requests and page loads can all lead to flaky test failures.

End-to-end tests that rely on internal API uptime can cause test flakiness and test failures.

What's the impact of flaky tests?

Flaky tests can wreak havoc on the development process - from developer time wasted on test retries, to bugs, product instability, and missed releases, time-consuming flaky tests can grind your development process to a halt.

What is the best way to resolve or fix flaky tests?

DevOps and software engineering teams often need to compare code changes, logs, and other context across test environments from before and after the instability started - adding retries or reruns can also help with debugging. Test detection and test execution tooling can help automate this process as well.

BuildPulse enables you to find, assess impact metrics, quarantine, and fix flaky tests.

What are some strategies for preventing flaky tests?

Paying attention to flaky tests and prioritizing them as they come up is a good way to keep them from becoming an issue. This is where testing culture is important - if an engineer spots a flaky test case, it should be logged right away. That takes a certain level of hygiene, however - BuildPulse can provide monitoring so flaky tests are caught right away.

What types of tests can be flaky?

Flaky tests can be seen across the testing process - unit tests, integration tests, end-to-end tests, UI tests, and acceptance tests.

What if I don't have that many flaky tests?

Flaky tests can be stealthy - engineers often ignore them and simply retry the test run, so they build up until they can’t be ignored anymore. These tests slow down developer productivity, impact functionality, and reduce confidence in test results and test suites. It’s better to get ahead of them while it’s easy and invest in test management.

It’s also important to catch flakiness early, while it’s still manageable, and to prevent regressions.

What languages and continuous integration providers does BuildPulse work with?

BuildPulse integrates with all continuous integration providers (including GitHub Actions, BitBucket Pipelines, and more), test frameworks, and workflows.

Combat non-determinism, drive test confidence, and provide the best experience you can to your developers!

How long does implementation/integration with BuildPulse take?

Implementation/integration takes 5 minutes!

Ready for Takeoff?
