Many open source projects have test suites that include nondeterministic test cases with unpredictable behavior. Tests might be nondeterministic because they launch several parallel processes or threads that interact in an unpredictable manner, or because they depend on some activity in the operating system that has nondeterministic behavior. The presence of these tests can interfere with automated regression checking in CI/CD pipelines. This article shows how to automate the discovery of nondeterministic test cases using a short Python script based on the Bunsen test suite analysis toolkit.

The problem: Finding nondeterministic ("flaky") test cases

Test cases in an open source project's test suite can have nondeterministic behavior and produce different outcomes when run repeatedly. Such test cases are commonly referred to as flaky, and their presence in a test suite tends to complicate the evaluation of test results. Without additional investigation, a PASS or FAIL outcome for a nondeterministic test case doesn't conclusively prove the presence or absence of a problem.

Nondeterministic test cases are common in the test suites of projects such as SystemTap and the GNU Debugger (GDB), and they are kept because they still provide value when the project's functionality is tested under ideal conditions. Rewriting these test suites to eliminate nondeterminism would be a large, low-priority task that would tie up a great deal of scarce developer time. It's therefore worthwhile to develop tools that analyze a project's test results and identify nondeterministic test cases. A developer reading test results could use this analysis to recognize the nondeterministic test cases and interpret their outcomes separately from the outcomes of reliable test cases.

In a previous article, Automating the testing process for SystemTap, Part 2: Test result analysis with Bunsen, I described Bunsen, a tool that collects a set of test result log files from a project and stores them in a deduplicated Git repository together with an index in JSON format. Bunsen also provides a Python library for accessing the data in this repository. These capabilities can be used to implement a script to detect nondeterministic test cases.

Developing the script

The overall strategy of the script is to find test cases that have been run multiple times on the same system configuration with varying outcomes. Such test cases are likely to be nondeterministic.

Basic setup

The analysis script starts by importing and initializing the Bunsen library:

 1 #!/usr/bin/env python3
 2 info="""Detect nondeterministic testcases that yield different outcomes when tested
 3 multiple times on the same configuration."""
 4 from bunsen import Bunsen, BunsenOptions
 5 if __name__=='__main__':
 6     BunsenOptions.add_option('source_repo', group='source_repo',
 7         cmdline='source-repo', default=None,
 8         help_str="Use project commit history from Git repo <path>",
 9         help_cookie="<path>")
10     BunsenOptions.add_option('branch', group='source_repo', default=None,
11         help_str="Use project commit history from <branch> in source_repo",
12         help_cookie="<branch>")
13     BunsenOptions.add_option('project', group='filtering', default=None,
14         help_str="Restrict the analysis to testruns in <projects>",
15         help_cookie="<projects>")
16
17 import git
18 import tqdm
19 from common.utils import * # add_list, add_set
20 if __name__=='__main__':
21
22     b, opts = Bunsen.from_cmdline(info=info)
23     projects = opts.get_list('project', default=b.projects)
24     repo = git.Repo(opts.source_repo)

       # Iterate all testruns in projects:
       #     Collect information from the testrun
       # Print the results

A Bunsen analysis script is a Python program that imports the bunsen module. Lines 5-15 of the preceding script define the following options:

  • source_repo identifies a Git repository containing up-to-date source code for the project. The commit history of this repository identifies the relative version order of test runs.
  • branch identifies a branch within source_repo.
  • project names a project within the Bunsen repository, and is present because the Bunsen repository can store test results from more than one project. Test results from separate projects are stored in separate branches, and the analysis script can be instructed to scan and compare test results from a single project or from a subset of the projects. If this option is omitted, all test runs in the Bunsen repository will be scanned.

Options for an analysis script can be passed as command-line arguments or specified in the Bunsen repository's configuration file. For example, if the Bunsen repository is stored under /path/to/bunsen/.bunsen, the configuration file is located at /path/to/bunsen/.bunsen/config.
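For example, assuming the script above is saved as detect_flakes.py (a file name chosen here for illustration), the source repository might be passed on the command line along these lines; the source-repo spelling comes from the cmdline= argument in the option definition, and the exact syntax for the remaining options depends on BunsenOptions' command-line conventions:

./detect_flakes.py --source-repo=/path/to/systemtap.git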

The second part of the script (lines 20-24) instantiates the following objects:

  • b, an instance of the Bunsen class, providing access to the Bunsen repository
  • opts, an instance of the BunsenOptions class, providing access to the script's options
  • repo, an instance of the git.Repo class from the GitPython library, providing access to the version history of the project in the source_repo repository

Collecting the test results

A test case is considered to be nondeterministic if it was tested more than once on the same source_repo commit and system configuration with varying outcomes (for instance, a PASS outcome in one test run and a FAIL outcome in another). To determine which test cases produce varying outcomes, the script gathers the list of test runs for each commit and configuration, then compares each test case's outcomes across the test runs in each group. The script uses a dictionary named all_testruns to store the list of test runs corresponding to each commit and configuration:

26 all_testruns = {} # maps (commit, config) -> list(Testrun)
27
28 for testrun in b.testruns(projects):
29     commit, config = testrun.get_source_commit(), testrun.get_config()
30     if commit is None: continue
31     add_list(all_testruns, (commit,config), testrun)

# Iterate all (commit, config):
#     Iterate the set of testruns matching (commit, config),
#     and compare the outcome of each testcase to detect nondeterminism

# Print the results

An instance of the Testrun class in the Bunsen library represents a single test run. The instance provides access to the commit that was tested, the system configuration, and the outcomes of individual test cases. The all_testruns dictionary, defined on line 26, maps a (commit, config) pair to a list of Testrun instances.

For each test run, the loop calls the utility function add_list at line 31 to add the test run to the dictionary. add_list simply appends a value to the list stored at a specified key, creating the list on first use:

def add_list(d, k, v):
    # Append v to the list stored at d[k].
    if k not in d: d[k] = []
    d[k].append(v)
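The companion helper add_set, imported from the same common.utils module, is used later in the script (on line 54) to record commits in known_flakes. Its definition is not shown here; presumably it follows the same pattern, accumulating values into a set rather than a list. A minimal sketch:

def add_set(d, k, v):
    # Add v to the set stored at d[k], creating the set on first use.
    if k not in d: d[k] = set()
    d[k].add(v)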

Identifying the nondeterministic test cases

Next, the script iterates over the list of Testrun objects for each commit and configuration. To record the test cases that produced varying outcomes, the script uses a second dictionary named known_flakes, keyed by the tc_info tuples that describe failing test outcomes:

26 all_testruns = {} # maps (commit, config) -> list(Testrun)
27
28 for testrun in b.testruns(projects):
29     commit, config = testrun.get_source_commit(), testrun.get_config()
30     if commit is None: continue
31     add_list(all_testruns, (commit,config), testrun)
32
33 known_flakes = {} # maps tc_info -> set(commit)
34 # where tc_info is (name, subtest, outcome)
35
36 for commit, config in tqdm.tqdm(all_testruns, \
37     desc="Scanning configurations", unit="configs"):

       if len(all_testruns[commit, config]) <= 1:
           continue # no possibility of flakes
       commit_testcases = {} # maps tc_info -> list(Testrun)
       for testrun in all_testruns[commit, config]:
           # Gather the list of failing tc_info tuples appearing in testrun
       # For each tc_info tuple that appears in some testrun:
       #     Check whether the failing tuple appears in all testruns;
       #     if not, mark the tuple as a flake

# Print the results

The second loop, which iterates over the commits and configurations, could take a long time, so the script uses the Python tqdm library to display a progress bar (lines 36-37).

After the remaining code is filled in, the second loop appears as follows:

…

36 for commit, config in tqdm.tqdm(all_testruns, \
37     desc="Scanning configurations", unit="configs"):
38
39     if len(all_testruns[commit, config]) <= 1:
40         continue # no possibility of flakes
41
42     commit_testcases = {} # maps tc_info -> list(Testrun)
43
44     for testrun in all_testruns[commit, config]:
45         for tc in testrun.testcases:
46             if tc.is_pass(): continue
47             tc_info = (tc.name, tc.subtest, tc.outcome)
48             add_list(commit_testcases, tc_info, testrun)
49
50     n_testruns = len(all_testruns[commit, config])
51     for tc_info in commit_testcases:
52         if len(commit_testcases[tc_info]) < n_testruns:
53             # XXX tc_info didn't appear in all runs
54             add_set(known_flakes, tc_info, commit)

…

The second loop skips (commit, config) pairs for which only one test run was found (lines 39-40). For each remaining pair, the loop iterates over every test run's test case outcomes and gathers the test failures that appear in that run. Test case outcomes are represented by instances of Bunsen's Testcase class. In accordance with the DejaGnu framework's test result model, a Testcase object has fields called 'name' (the name of the top-level Expect file defining the test case), 'outcome' (one of the standard POSIX outcome codes, such as PASS, FAIL, or UNTESTED), and 'subtest' (a string providing additional information about the outcome).

A third dictionary named commit_testcases stores the failing test case outcomes. It maps the (name, subtest, outcome) tuple describing a test failure to the list of test runs in which that tuple was found to occur. The script assembles commit_testcases on lines 44-48 and iterates over it on lines 51-54 to collect every (name, subtest, outcome) tuple that appeared in some of the test runs but not all of them. Such a tuple fits our definition of a varying test outcome and is therefore stored in the known_flakes dictionary, which maps each tuple to the set of commits on which it was found to produce varying outcomes.
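To make the some-but-not-all rule concrete, here is a small self-contained sketch of the same check applied to toy data (all names are illustrative, not taken from a real test suite):

# Toy data: three testruns for one (commit, config) pair.
runs = ["run1", "run2", "run3"]
# Failing tc_info tuples mapped to the runs in which each failure appeared,
# mirroring the structure of commit_testcases above.
failures = {
    ("a.exp", "FAIL: a subtest\n", "FAIL"): ["run1", "run2", "run3"],  # consistent failure
    ("b.exp", "FAIL: b subtest\n", "FAIL"): ["run2"],                  # intermittent failure
}
for tc_info, seen_in in failures.items():
    # A failure that appears in some runs but not all is nondeterministic.
    if len(seen_in) < len(runs):
        print("nondeterministic:", tc_info)  # flags only the b.exp tuple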

Reporting the nondeterministic test cases

Having accumulated a list of suspected nondeterministic tests in the known_flakes dictionary, the script iterates through it and prints the nondeterministic tests' outcomes:

56 sorted_tc = []
57 for tc_info in known_flakes:
58     sorted_tc.append((tc_info, known_flakes[tc_info]))
59 sorted_tc.sort(reverse=True, key=lambda tup: len(tup[1]))
60 for tc_info, commits in sorted_tc:
61     print(len(commits),"commits have nondeterministic",tc_info)

The script sorts the test outcomes (on lines 56-59) in decreasing order of frequency: test cases found to produce varying outcomes on a larger number of commits are printed first. An additional loop can be added to print the commits on which the test outcomes were found to be nondeterministic:

60 for tc_info, commits in sorted_tc:
61     print(len(commits),"commits have nondeterministic",tc_info)
62     for hexsha in commits:
63         commit = repo.commit(hexsha)
64         print("*",commit.hexsha[:7],commit.summary)

Lines 63-64 use the GitPython library and the git.Repo object that was instantiated at the beginning of the script to retrieve a summary of the commit message.

The completed analysis script is less than 100 lines of Python code. When tested on a modest laptop (2.3 GHz Intel Core i3-6100U), the script took approximately 42 seconds, with a maximum resident memory size of 285 MB, to scan a Bunsen repository from the SystemTap project containing data from 4,158 test runs across 693 commits. Within that Bunsen repository, 368 (commit, config) pairs were tested by more than one test run and provided useful data for the analysis script. In practice, more complex analysis scripts that compare test case results over time (rather than within the same commit) will tend to have larger RAM requirements.

When run, the analysis script produces output similar to the following:

72 commits have nondeterministic ('systemtap.base/attach_detach.exp', 'FAIL: attach_detach (initial load) - EOF\n', 'FAIL')
72 commits have nondeterministic ('systemtap.base/attach_detach.exp', 'FAIL: attach_detach (initial load) - begin seen (1 0)\n', 'FAIL')
61 commits have nondeterministic ('systemtap.examples/check.exp', 'FAIL: systemtap.examples/io/ioblktime run\n', 'FAIL')
51 commits have nondeterministic ('systemtap.base/utrace_p5.exp', 'FAIL: UTRACE_P5_07 unexpected output (after passing output)\n', 'FAIL')
47 commits have nondeterministic ('systemtap.syscall/tp_syscall.exp', 'FAIL: 32-bit madvise tp_syscall\n', 'FAIL')
40 commits have nondeterministic ('systemtap.base/abort.exp', 'FAIL: abort: TEST 6: abort() in timer.profile (using globals): stdout: string should be "fire 3!\\nfire 2!\\nfire 1!\\n", but got "fire 2!\n', 'FAIL')
39 commits have nondeterministic ('systemtap.syscall/tp_syscall.exp', 'FAIL: 64-bit clock tp_syscall\n', 'FAIL')
39 commits have nondeterministic ('systemtap.syscall/tp_syscall.exp', 'FAIL: 32-bit clock tp_syscall\n', 'FAIL')
38 commits have nondeterministic ('systemtap.syscall/tp_syscall.exp', 'FAIL: 32-bit socket tp_syscall\n', 'FAIL')
37 commits have nondeterministic ('systemtap.onthefly/kprobes_onthefly.exp', 'FAIL: kprobes_onthefly - otf_start_disabled_iter_5 (invalid output)\n', 'FAIL')
37 commits have nondeterministic ('systemtap.onthefly/kprobes_onthefly.exp', 'FAIL: kprobes_onthefly - otf_timer_50ms (invalid output)\n', 'FAIL')
36 commits have nondeterministic ('systemtap.syscall/tp_syscall.exp', 'FAIL: 64-bit madvise tp_syscall\n', 'FAIL')
34 commits have nondeterministic ('systemtap.bpf/nonbpf.exp', 'FAIL: bigmap1.stp unexpected output\n', 'FAIL')
33 commits have nondeterministic ('systemtap.onthefly/kprobes_onthefly.exp', 'FAIL: kprobes_onthefly - otf_timer_10ms (invalid output)\n', 'FAIL')
33 commits have nondeterministic ('systemtap.bpf/bpf.exp', 'FAIL: timer2.stp incorrect result\n', 'FAIL')
33 commits have nondeterministic ('systemtap.bpf/bpf.exp', 'KFAIL: bigmap1.stp unexpected output (PRMS: BPF)\n', 'KFAIL')
33 commits have nondeterministic ('systemtap.bpf/bpf.exp', 'FAIL: stat3.stp incorrect result\n', 'FAIL')
33 commits have nondeterministic ('systemtap.onthefly/kprobes_onthefly.exp', 'FAIL: kprobes_onthefly - otf_timer_100ms (invalid output)\n', 'FAIL')
32 commits have nondeterministic ('systemtap.server/client.exp', 'FAIL: New trusted servers matches after reinstatement by ip address\n', 'FAIL')
32 commits have nondeterministic ('systemtap.unprivileged/unprivileged_myproc.exp', 'FAIL: unprivileged myproc: --unprivileged process.thread.end\n', 'FAIL')
32 commits have nondeterministic ('systemtap.base/procfs_bpf.exp', 'FAIL: PROCFS_BPF initial value: cat: /var/tmp/systemtap-root/PROCFS_BPF/command: No such file or directory\n', 'FAIL')
32 commits have nondeterministic ('systemtap.base/abort.exp', 'FAIL: abort: TEST 6: abort() in timer.profile (using globals): stdout: string should be "fire 3!\\nfire 2!\\nfire 1!\\n", but got "fire 3!\n', 'FAIL')
31 commits have nondeterministic ('systemtap.syscall/nd_syscall.exp', 'FAIL: 32-bit clock nd_syscall\n', 'FAIL')
31 commits have nondeterministic ('systemtap.onthefly/kprobes_onthefly.exp', 'FAIL: kprobes_onthefly - otf_start_enabled_iter_4 (invalid output)\n', 'FAIL')
31 commits have nondeterministic ('systemtap.onthefly/kprobes_onthefly.exp', 'FAIL: kprobes_onthefly - otf_start_enabled_iter_5 (invalid output)\n', 'FAIL')
31 commits have nondeterministic ('systemtap.onthefly/kprobes_onthefly.exp', 'FAIL: kprobes_onthefly - otf_start_disabled_iter_3 (invalid output)\n', 'FAIL')
30 commits have nondeterministic ('systemtap.syscall/syscall.exp', 'FAIL: 32-bit clock syscall\n', 'FAIL')

Conclusion

This article illustrates how Bunsen's Python library can be used to quickly develop analysis scripts that answer questions about a project's testing history. More generally, the example demonstrates the benefit of keeping a long-term archive of test results: the analysis is only possible because results from thousands of past test runs remain available for comparison.

Last updated: February 5, 2024