Ultimate Guide to Fixing Flaky Tests

Flaky Tests

Mar 28, 2024

You have flaky tests? Where do you start? You can start here.

This is where BuildPulse shines - we automatically catalog flaky tests as they appear across pull requests or periodic builds, assess their impact in terms of disruptiveness and developer time consumed, mitigate that impact through test quarantining, surface relevant context clues to determine root cause, and verify whether flakiness has truly been fixed.

We see all kinds of flaky tests. This is a comprehensive guide to test flakiness: setting up workflows and SOPs around flaky tests, mitigating their impact, and fixing them. It applies to both unit and integration tests.

What is a Flaky Test?

A flaky test is an unreliable test that sometimes passes and sometimes fails even though no changes were made to the code. How can this happen? There’s usually some form of nondeterminism at play, where the behavior or execution of the code changes each time it runs. This unpredictability makes flaky tests hard to spot and debug - they don’t fail reliably, so how are you supposed to reproduce the failure? Left unchecked, they force developers to keep re-running builds, which wastes time and increases CI costs. The test itself doesn’t work, so it masks issues that could lead to incidents in production - not to mention that flaky tests undermine confidence in the testing process itself.

Keeping track of flaky tests is crucial for maintaining a healthy engineering team and product, ensuring that tests reliably indicate the state of the codebase.

Reasons They Occur

The root cause of a flaky test almost always contains an element of nondeterminism: race conditions, external resource access, leaked state, or timing-related issues. Let’s go a bit more into these:

  • Race Conditions
    A race condition occurs when two actions are taken on the same resource at the same time, and the order in which they happen changes the outcome of the program. For example, if both my friend and I reach for the same piece of cake, a different person may end up with the cake on each occurrence (a short code sketch of this pattern follows the list).

  • External Resource Access
    If I rely on an external resource that’s out of my control, I’m relying on it to behave a certain way. If the behavior changes intermittently, then my assumptions are broken. One example of this is an API request to another service that’s not maintained by my team.

  • Leaked State
    State leaks when one action on a resource unintentionally breaks the precondition assumptions another action relies on. For example, say I have a tennis ball and two operations: pick up and throw. If I pick up the ball first and throw it, it ends up far away from me. If I do a throwing motion first and then pick up the ball, the ball hasn’t gone anywhere. Picking up the ball was a precondition for being able to throw it (see the sketch after this list).

  • Timing Issue
    Timing issues can occur when you rely on an event happening within a certain window of time and the event originates from an external system. An example would be a test that checks whether a popup appears after a button click.

  • Deprecated Functionality
    Old or unused code that has persisted over time, usually with no directly responsible maintainer.

  • Brittle Tests
    Tests that are overfitted to the implementation rather than to the functionality. We will do another post on this :)
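
To make race conditions and leaked state concrete, here is a minimal, contrived pytest-style sketch in Python - the test names and the shared CART and counter values are hypothetical, not taken from a real suite. The first pair of tests leaks state through a module-level list, so the order in which they run decides the outcome; the last test races two threads on an unsynchronized counter.

    # flaky_patterns.py - contrived examples of leaked state and a race condition
    import threading

    CART = []  # module-level state shared by both cart tests

    def test_add_item_to_cart():
        CART.append("apple")       # mutates shared state and never cleans it up
        assert CART == ["apple"]

    def test_cart_starts_empty():
        assert CART == []          # passes if it runs before test_add_item_to_cart,
                                   # fails if it runs after - test order decides

    def test_concurrent_increment():
        counter = {"value": 0}

        def work():
            for _ in range(100_000):
                counter["value"] += 1   # read-modify-write is not atomic

        threads = [threading.Thread(target=work) for _ in range(2)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

        # Interleaved updates can be lost, so this sometimes comes up short -
        # and that unpredictability is exactly what makes the test flaky.
        assert counter["value"] == 200_000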

Why they’re hard to capture, catalog, and fix

Capturing and cataloging flaky tests is challenging due to their intermittent nature - they might only fail under specific conditions that are hard to replicate. Because they occur sporadically, they’re often ignored - especially when the developer who hits the failure doesn’t own the test.

Without a dedicated system for tracking these failures, identifying a pattern becomes difficult, especially in large organizations where information might be siloed.

Fixing flaky tests is complicated because the root cause is not always clear. It requires thorough investigation to replicate the flaky behavior consistently, which can take time - often in short supply. The intermittent nature of these tests means that a passing test doesn't always indicate a problem has been resolved.

Impact on the Organization If They’re Not Fixed

These tests are continually re-run until they pass, which consumes a significant amount of time in aggregate. Aside from being a waste of time and CI cost, code that’s hard to merge slows down releases and hurts developer experience. At the end of the day, these tests are broken and don’t test anything, which creates opportunities for issues to slide into production. And those issues tend to be just as tough to reproduce as the test failures themselves.

All this is great, but I have flaky tests today. Where do I start?

  1. Cataloging and Finding Impact

    The first step in addressing flaky tests is to catalog them systematically. This involves creating a repository or tracking system (spreadsheet, tickets, etc.) where each flaky test is recorded along with its characteristics - such as when it fails, under what conditions, and how often. The catalog should be accessible to all team members to contribute to and consult. The purpose of cataloging is not just to enumerate the tests and track their status, but also to provide context for reproducing them and to create ownership.

    Some tests may be more critical than others, affecting key features or frequently used code paths. By assessing the impact, teams can prioritize which flaky tests require immediate attention, ensuring that efforts are focused where they can have the most significant effect on improving the stability and reliability of the product.

  2. Impact Mitigation

    Once flaky tests are identified and prioritized, the next step is to mitigate their impact until they are fixed - so you don’t suffer in the meantime. One common strategy is to retry tests: you can configure your test framework to retry failed tests. This is a more granular form of retrying the whole build, and it can save time. One downside of this approach is that it hides which tests may be flaky.

    Another strategy is to simply disable the most disruptive tests temporarily until a fix can be implemented. It’s crucial, however, that this is only a temporary measure and that disabled tests are tracked - flakiness is best solved right away, while the state of the application and test is fresh and ownership is clear. Although these strategies don’t fix the underlying issue, they can help avoid unnecessary retries. A minimal retry configuration is sketched below.
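
    As a sketch of what retries can look like - assuming a Python suite run with pytest plus the pytest-rerunfailures plugin (swap in your own framework's equivalent) - you can retry every failure from the command line, or mark only the known-flaky tests so the retries stay visible and targeted:

      # Retry every failed test up to 2 times, waiting 1 second between attempts:
      #   pytest --reruns 2 --reruns-delay 1

      # Or scope retries to a single known-flaky test with a marker:
      import pytest

      @pytest.mark.flaky(reruns=3, reruns_delay=2)
      def test_fetches_user_profile():  # hypothetical test name
          ...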

  3. Delegation and Test Ownership

    Assigning ownership of tests to specific team members or sub-teams is critical in managing flaky tests effectively. Ownership means that specific individuals are responsible for maintaining the test, tracking its performance, and addressing any issues that arise. This approach ensures that tests do not become orphaned, which often leads to neglect. Such neglect is understandable: no single team can understand every component of a system, and every team has its own deadlines - which makes it easy to abdicate responsibility.

    The owners are also responsible for making the tough calls—such as whether a test needs to be rewritten, removed, or fixed. This level of responsibility and accountability encourages a more proactive approach to maintaining test health and ensures that flaky tests are addressed in a timely manner.

  4. Debugging

    1. Surfacing Relevant Context

      Debugging flaky tests begins with gathering as much context as possible about the test and its environment. This includes finding the commit where the test first started flaking, the error messages for flakes (as opposed to true failures), how disruptive it is (for prioritization), the timing of occurrences, and the environments where flakiness is observed. Surfacing this context requires tools and practices that capture detailed logs and system state - or combing through build logs by hand. This detailed information is important for understanding the complex interplay of factors that contributes to a test’s flakiness.

    2. Reproducing Flakiness

      The next step is being able to reproduce the conditions under which the test fails - this is the first step in determining how the test fails. There are a number of strategies you can try to reproduce the failure:

      • Rerunning the individual test in isolation: If you see failures here, then the issue is likely contained within the test.

      • Running your tests serially and in different orders: If you see failures here, the issue is likely due to shared state leaking between two tests. For example, if one test alters rows in a database, and another test assumes the rows are unaltered, the ordering of the tests matters.

      • Running your tests in parallel: If you see failures here, the issue could be leaked state, but could also be due to a race condition.

      • Using a debugger with the strategies above and stepping through to check state along the way

      It’s important to keep an open mind, as any of the common issues can crop up as you increase the scope of execution.
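
      As a rough way to apply the first strategy - assuming a pytest suite, with a hypothetical test id standing in for your flaky test - you can hammer the test in a loop and count how often it fails:

        # repro_flake.py - rerun one test repeatedly and count how often it fails
        import subprocess
        import sys

        TEST_ID = "tests/test_checkout.py::test_applies_discount"  # hypothetical
        RUNS = 100

        failures = 0
        for _ in range(RUNS):
            # run each attempt in a fresh process so reruns can't share state
            result = subprocess.run([sys.executable, "-m", "pytest", "-q", TEST_ID])
            if result.returncode != 0:
                failures += 1

        print(f"{failures}/{RUNS} isolated runs failed")

      If isolated reruns never fail, point the same harness at the whole suite and vary ordering or parallelism (for example with plugins such as pytest-randomly or pytest-xdist) to flush out leaked state and race conditions.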

    3. Reproducing Stability

      After identifying failure cases, it’s important to identify success cases as well, and how they occur. This is easier once you have the failure cases in hand, because you can control individual variables to force the test to pass - giving context clues about what the root cause might be.

      This might involve adjustments to the test itself, such as refining assertions or eliminating race conditions, as well as changes to the testing infrastructure to simplify testing conditions. 

    4. Diagnosing why (root cause)

      The final step in debugging flaky tests is diagnosing why the test was flaky in the first place. This requires a deep dive into the test's failure and success patterns, exploring both the test code and the application code it interacts with. You want to look for key levers that influence the outcome of tests, while using a debugger to check state along the way - whether it's a timing issue, a resource contention problem, or a hidden bug in the application.

      If you’re running integration tests, looking at spans from traces can also help - you may find that different tests are accessing the same resources or data. Traces can also help identify the conditions under which flakiness is more likely to occur.

  5. Applying the fix

    Once the root cause is understood, you can use your context on the test and the application to make the changes necessary to prevent it from recurring.
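
    For example, a timing-related flake is often fixed by replacing a fixed sleep with a bounded wait that polls for the condition. Here is a rough sketch in Python - the wait_for helper and the popup_is_visible check are hypothetical stand-ins for your own code:

      import time

      def wait_for(condition, timeout=5.0, interval=0.1):
          """Poll `condition` until it returns True or `timeout` seconds pass."""
          deadline = time.monotonic() + timeout
          while time.monotonic() < deadline:
              if condition():
                  return True
              time.sleep(interval)
          return False

      # Before (flaky): the popup sometimes takes longer than the fixed sleep
      #   time.sleep(1)
      #   assert popup_is_visible()

      # After: deterministic up to an explicit, generous timeout
      #   assert wait_for(popup_is_visible, timeout=5.0)

    A bounded wait keeps the test fast in the common case while tolerating slow runs, and the explicit timeout makes genuine failures obvious instead of intermittent.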

  6. Verifying if the Test is Fixed

    After applying fixes to address flakiness, the critical next step is to verify if the test is indeed fixed. This verification process is not a one-time check but a rigorous, ongoing evaluation to ensure the test consistently behaves as expected across a variety of conditions. To effectively verify a test's stability, it should be run multiple times under different configurations, environments, and orders to mimic the range of scenarios it might encounter in real-world operations. This could involve varying network conditions, different database states, or changes in dependent services.

    Moreover, integrating the fixed test into the continuous integration and deployment (CI/CD) pipeline serves as an additional layer of validation. Observing the test's behavior over time provides valuable insights into its reliability and the success of the fixes applied. Teams should also consider using metrics and monitoring tools to track the test's performance, looking for any signs of regression or new instances of flakiness.

Wow, this is great! But it’s so much work.

You’re right. Dealing with flakiness requires a significant amount of effort that is often in short supply, given constraints on time, knowledge, and impact. Although you can do all of this in a spreadsheet, at a certain scale the process becomes prone to human error.

This is where BuildPulse shines - we help you automatically catalog flaky tests as they appear across pull requests or periodic builds, assess their impact in terms of disruptiveness and aggregate developer time consumed, mitigate that impact through test quarantining, surface relevant context clues to determine root cause, and verify whether flakiness has truly been fixed.

We provide reporting, alerts, and integrations with whichever tools you use as a part of your developer workflow. If you’re interested in increasing developer productivity, or need advice on conquering flakiness - get in touch!

FAQ

What is the difference between a flaky test and a false positive?

A false positive is a test failure in your test suite due to an actual error in the code being executed, or a mismatch in what the test expects from the code.

A flaky test is one that produces conflicting results for the same code. For example, if you see a test fail and then pass while running tests, but the code hasn’t changed, it’s a flaky test. There are many causes of flakiness.

What is an example of a flaky test?

A common example shows up in growing test suites: pull request builds fail for changes you haven’t made. Put differently, you see a test pass and fail without any code change. Those failing tests are flaky tests.

What are common causes of flakiness?

Broken assumptions in test automation and the development process can introduce flaky tests - for example, if test data is shared between different tests (whether they run asynchronously, with high concurrency, or sequentially), the results of one test can affect another.

Poorly written test code can also be a factor: improper polling, race conditions, improper handling of event dependencies, shared test data, or missing timeout handling for network requests and page loads can all lead to flaky test failures.

End-to-end tests that rely on internal API uptime can cause test flakiness and test failures.

What's the impact of flaky tests?

Flaky tests can wreak havoc on the development process - from developer time wasted on test retries, to bugs, product instability, and missed releases, time-consuming flaky tests can grind your development process to a halt.

What is the best way to resolve or fix flaky tests?

DevOps and software engineering teams often need to compare code changes, logs, and other context across test environments from before and after the instability started - adding retries or reruns can also help with debugging. Test detection and test execution tooling can help automate this process as well.

BuildPulse enables you to find, assess impact metrics, quarantine, and fix flaky tests.

What are some strategies for preventing flaky tests?

Paying attention to flaky tests and prioritizing them as they come up is a good way to keep them from becoming an issue. This is where testing culture is important - if an engineer spots a flaky test case, it should be logged right away. That takes a certain level of hygiene, however - BuildPulse can provide monitoring so flaky tests are caught right away.

What types of tests can be flaky?

Flaky tests can be seen across the testing process - unit tests, integration tests, end-to-end tests, UI tests, and acceptance tests.

What if I don't have that many flaky tests?

Flaky tests can be stealthy - engineers often ignore them and simply retry the test run, so they build up until they can’t be ignored anymore. These tests slow down developer productivity, impact functionality, and reduce confidence in test results and test suites. It’s better to get ahead of them while it’s easy and invest in test management.

It’s also important to catch flakiness early, while it’s still manageable, and to prevent regressions.

What languages and continuous integration providers does BuildPulse work with?

BuildPulse integrates with all continuous integration providers (including GitHub Actions, BitBucket Pipelines, and more), test frameworks, and workflows.

Combat non-determinism, drive test confidence, and provide the best experience you can to your developers!

How long does implementation/integration with BuildPulse take?

Implementation/integration takes 5 minutes!

Ready for Takeoff?
