flaky.rst

Overview

This document discusses **flaky tests** and gives guidance on handling them when using the `pytest` testing framework. A flaky test is one that passes on some runs and fails on others with no apparent deterministic cause, which undermines continuous integration (CI) pipelines and developers' trust in test results.

The file serves as a comprehensive resource covering:

- The nature and impact of flaky tests.
- Common root causes and contributing factors.
- Relevant pytest features and plugins that help manage flaky tests.
- General strategies to identify, isolate, and fix flaky tests.
- References to academic research and industry resources on flaky tests.

This is a documentation page rather than executable code, so it focuses on conceptual explanations, practical advice, and curated external references rather than programmatic APIs.

Detailed Content Breakdown

Flaky Tests: Definition and Problems

**Purpose:** Explain what flaky tests are and why they are problematic, particularly in CI environments where test reliability is crucial for code integration.

**Key Points:**

- Flaky tests pass and fail nondeterministically.
- They erode developer confidence in test results.
- They waste time on reruns and investigations.
- They tend to be more frequent in higher-level or integration tests.

Potential Root Causes of Flaky Tests

This section categorizes common reasons flaky tests occur:

System State Issues
- Tests depend on external or shared system state that is not isolated.
- Parallel test execution (e.g., via pytest-xdist) can expose ordering dependencies.
- Tests failing to clean up after themselves cause side effects.
Overly Strict Assertions
- For example, exact floating-point comparisons or timing-sensitive checks.
- pytest.approx is recommended for tolerant numeric comparisons.
Thread Safety
- pytest itself is single-threaded, but tests may spawn threads.
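- Thread-related flakiness can arise if spawned threads are not joined, for example at the end of the test or during fixture teardown.
- pytest’s primitives (pytest.warns, pytest.raises) are not thread-safe and should not be called from multiple threads.
- A threaded test may implicitly rely on global state inside pytest itself, which is another possible source of flakiness.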
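
The three categories above can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the pytest documentation: the test names and the `worker` function are hypothetical, while `tmp_path`, `pytest.approx`, and the standard `threading` module are existing APIs.

```python
import threading

import pytest


def test_report_written_to_isolated_dir(tmp_path):
    # System state: tmp_path is a built-in pytest fixture that gives each
    # test its own directory, so files do not leak between tests.
    report = tmp_path / "report.txt"
    report.write_text("ok")
    assert report.read_text() == "ok"


def test_float_sum_is_tolerant():
    # Overly strict assertions: 0.1 + 0.2 is not exactly 0.3 in binary
    # floating point, so compare within a tolerance instead.
    assert 0.1 + 0.2 != 0.3
    assert 0.1 + 0.2 == pytest.approx(0.3)


def test_spawned_thread_is_joined():
    results = []

    def worker():  # hypothetical stand-in for real background work
        results.append(42)

    thread = threading.Thread(target=worker)
    thread.start()
    # Thread safety: wait for the thread before asserting, so the outcome
    # does not depend on scheduling and no thread outlives the test.
    thread.join(timeout=5)
    assert not thread.is_alive()
    assert results == [42]
```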

Related pytest Features and Plugins

xfail marker with strict=False:
Marks a test so that its failure does not break the build. This acts as a manual quarantine, but it is risky to leave in place permanently because a genuine regression in that test would go unnoticed.
Environment Variable PYTEST_CURRENT_TEST:
Set by pytest to the node ID and stage (setup, call, or teardown) of the test being run, which makes it useful for finding out which test a stuck process is executing.
Plugins to Mitigate Flaky Tests:
- pytest-rerunfailures: Automatically reruns failed tests a specified number of times.
- pytest-replay: Helps reproduce flaky test failures locally by replaying test runs.
- pytest-flakefinder: Runs each test multiple times to expose flaky behavior.
- Randomization plugins (pytest-random-order, pytest-randomly) to reveal hidden dependencies by varying test order.
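
A minimal sketch of the two core features listed above. The `fetch_status` function, the test, and the `current_test` helper are hypothetical names introduced for illustration; `pytest.mark.xfail(strict=False)` and the `PYTEST_CURRENT_TEST` environment variable are the real pytest features. The parsing assumes the value has the form `<node id> (<stage>)`.

```python
import os

import pytest


def fetch_status():
    """Hypothetical stand-in for a call that is sometimes slow or unavailable."""
    return "ready"


@pytest.mark.xfail(strict=False, reason="intermittent timeout, under investigation")
def test_service_becomes_ready():
    # With strict=False a failure is reported as XFAIL and an unexpected
    # pass as XPASS; neither outcome fails the run.
    assert fetch_status() == "ready"


def current_test(environ=os.environ):
    """Return (node_id, stage) for the running test, or None outside pytest."""
    value = environ.get("PYTEST_CURRENT_TEST")
    if value is None:
        return None
    # Assumed format: "path/to/test_file.py::test_name (call)"
    node_id, _, stage = value.rpartition(" ")
    return node_id, stage.strip("()")
```

If pytest-rerunfailures is installed, running `pytest --reruns 3` retries each failing test up to three times before reporting it as failed.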

General Strategies for Handling Flaky Tests

Split Test Suites:
Separate unit tests (fast, reliable) from integration or higher-level tests (slower, more prone to flakiness). Use unit tests as the main CI gate.
Capture Video/Screenshots on Failure:
Useful especially for UI tests to diagnose state at failure. Plugins like pytest-splinter can automate screenshots on failure.
Delete or Rewrite Tests:
Remove redundant flaky tests or rewrite them at a lower level to eliminate flakiness.
Quarantine Flaky Tests:
Isolate flaky tests temporarily while investigating them. The source page points to a blog post by Mark Lapierre on this practice.
CI Tools with Rerun Capabilities:
Examples include Azure Pipelines, which can detect and rerun flaky tests automatically.
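
One way to combine suite splitting with quarantine is a pair of custom markers in `conftest.py`. The sketch below is an assumption-laden illustration rather than an approach prescribed by the source: the marker names `integration` and `quarantine` are hypothetical, while the hooks and the `addinivalue_line`, `get_closest_marker`, and `add_marker` calls are existing pytest APIs.

```python
# conftest.py
import pytest


def pytest_configure(config):
    # Register the hypothetical markers so pytest does not warn about them.
    config.addinivalue_line("markers", "integration: slower, higher-level test")
    config.addinivalue_line("markers", "quarantine: known-flaky test under investigation")


def pytest_collection_modifyitems(config, items):
    # Quarantined tests still run and report, but as non-strict xfail,
    # so their failures do not break the build.
    for item in items:
        if item.get_closest_marker("quarantine"):
            item.add_marker(
                pytest.mark.xfail(strict=False, reason="quarantined flaky test")
            )
```

With these markers in place, `pytest -m "not integration"` could serve as the fast CI gate, and `pytest -m integration` could run as a separate job.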

Research and References

The document lists academic papers and industry whitepapers on flaky tests, covering detection techniques, root causes, and mitigation approaches. It also links to blog posts, talks, and case studies from organizations such as Microsoft, Google, Dropbox, and Uber that describe real-world experience managing flaky tests.

Implementation Details / Algorithms

Since this is a documentation page, it contains no algorithmic implementations or software classes/functions. Instead, it organizes knowledge about flaky tests in a structured format to help users understand and address flakiness in their test suites.

System Interaction

While not a code file, this documentation interacts with the broader pytest ecosystem by:

- Referencing pytest core features (pytest.approx, pytest.mark.xfail).
- Linking to pytest plugins that detect and manage flaky tests.
- Advising usage patterns that affect how tests integrate with CI pipelines and test runners.
- Describing the PYTEST_CURRENT_TEST environment variable, which helps debug test execution.

Usage Example

The document is intended to be read by developers and testers who want to understand flaky tests better and improve their test reliability. It can be used as:

- A knowledge base article within pytest documentation.
- A starting point for teams to adopt best practices against flaky tests.
- A guide to select appropriate pytest plugins and CI configurations.

Visual Diagram: Structure of flaky.rst

This diagram presents the hierarchical structure of the document’s topics and their relationships.

```mermaid
flowchart TD
    A[Flaky Tests Documentation] --> B[Definition & Problem]
    A --> C[Root Causes]
    C --> C1[System State]
    C --> C2[Strict Assertions]
    C --> C3[Thread Safety]
    A --> D[pytest Features & Plugins]
    D --> D1[xfail Marker]
    D --> D2[PYTEST_CURRENT_TEST]
    D --> D3[Plugins]
    D3 --> D3a[pytest-rerunfailures]
    D3 --> D3b[pytest-replay]
    D3 --> D3c[pytest-flakefinder]
    D3 --> D3d[Randomization Plugins]
    A --> E[General Strategies]
    E --> E1[Split Test Suites]
    E --> E2[Video/Screenshot on Failure]
    E --> E3[Delete or Rewrite Tests]
    E --> E4[Quarantine]
    E --> E5[CI Tools Rerun]
    A --> F[Research & References]
```

Summary

This file is a comprehensive guide focused on the identification, causes, and mitigation of flaky tests within pytest-based environments. It combines practical advice, tooling options, and references to external research and industry practices, supporting developers in improving test reliability and CI pipeline stability.