Improvements in GHC's testsuite infrastructure

Ben Gamari - 2019-07-08

GHC’s testsuite is our first line of defense against correctness regressions. However, as is often the case, the infrastructure that keeps it running has long been neglected. Our recent efforts to enforce CI cleanliness across all GHC builds have resulted in a few bits of work that I thought would be nice to share.

Improving testsuite driver maintainability

GHC’s testsuite is a collection of Makefiles and Haskell programs all glued together with a clump of Python known as the testsuite driver. The testsuite driver is a clever albeit quirky piece of software which, despite its implementation language, displays the marks of a codebase written by a functional programmer. It defines a small embedded DSL for describing tests, the user-facing interface of which is generally declarative. For instance, a simple test definition might look like:

test('T13618', normal, compile_and_run, ['-v0'])

To provide this succinct language, the implementation relies on a number of clever tricks, often involving global variables, functions returning functions, and a good helping of mutation.
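
As a simplified illustration of this style (hypothetical code, not the actual testlib.py implementation), a test modifier is typically a function returning a function, which the driver later applies to a mutable per-test options object:

# Hypothetical sketch of the modifier style; not the real testlib.py code.
class TestOptions:
    def __init__(self):
        self.skip = False          # should the harness skip this test?
        self.extra_hc_opts = ''    # extra compiler flags for this test

def skip_if(condition):
    # A modifier: return a function which the driver applies to the
    # test's mutable options object when the test is collected.
    def helper(name, opts):
        if condition:
            opts.skip = True
    return helper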

Of course, all of these clever tricks are implemented in a dynamically-typed language in a codebase that has evolved over the course of more than fifteen years (the first commit to testlib.py was by Simon Marlow in 2003). In its nearly 5kLoC implementation one will find integers used as booleans, magic strings, undeclared global variables, variables that are sometimes thread-local yet elsewhere considered global, heaps of string concatenation, and numerous other curiosities. Needless to say, comprehending, let alone modifying, the testsuite driver has been getting harder and harder with every accrued feature.

In recent years the Python community has gradually awoken to the promise of statically-checked types. Python 3’s type annotation syntax in conjunction with the mypy typechecker has given us a path to bringing order to this cleverness. mypy implements a pleasantly complete type system in which almost any Haskell 98 user would feel at home, with support for newtypes, sums, and parametric polymorphism.
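
For instance, mypy’s NewType, Union, and TypeVar provide serviceable analogues of Haskell’s newtypes, sums, and type variables (the definitions below are illustrative rather than taken from the driver):

from typing import List, NewType, Optional, TypeVar, Union

TestName = NewType('TestName', str)   # a zero-cost "newtype" over str

class Pass:
    pass

class Fail:
    def __init__(self, reason: str) -> None:
        self.reason = reason

Outcome = Union[Pass, Fail]           # a sum type

A = TypeVar('A')                      # parametric polymorphism

def first(xs: List[A]) -> Optional[A]:
    # A total version of xs[0]: returns None on an empty list.
    return xs[0] if xs else None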

Over the last few weeks the driver has grown hundreds of lines of type annotations, all checked by mypy. Not only has this made the codebase significantly more readable but the process unearthed a few latent bugs as well. Of course, to ensure that the testsuite driver doesn’t regress, it is now typechecked during the lint stage of GHC’s CI pipeline.

In addition to typechecking, we took this opportunity to perform some long-overdue refactoring to use modern Python interfaces (e.g. pathlib instead of strings, bool instead of int, use of None where appropriate) and added numerous assertions (revealing yet more unnoticed bugs, some of which were silently causing tests to be inappropriately skipped or run).
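
To give a flavour of these changes (again a hypothetical sketch, not actual driver code):

from pathlib import Path

# Before: paths via string concatenation, ints standing in for booleans:
#
#     def qualified_name(name, suffix):
#         return config.testdir + '/' + name + '.' + suffix
#     config.use_threads = 1
#
# After: pathlib.Path and honest booleans:

def qualified_name(testdir: Path, name: str, suffix: str) -> Path:
    return testdir / (name + '.' + suffix)

use_threads: bool = True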

Fragile and broken tests

GHC’s testsuite has a sizeable configuration space, with over 30 “ways” in which tests may be run (e.g. normal, profiled, with the threaded RTS, etc.) and a few “speeds” which select a subset of tests to be run.
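
Individual tests can opt in or out of particular ways with modifiers such as only_ways and omit_ways; for example (test and way names below are illustrative):

test('T99999', [only_ways(['threaded1', 'threaded2'])], compile_and_run, ['-v0'])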

For most of its existence GHC’s CI infrastructure (in its various forms) has tested only a small subset of the testsuite tests (namely the normal speed with around a dozen of the ways enabled). While we have periodically looked at the full slow testsuite output, rarely were we able to make significant headway in fixing the many issues we found.

Recently our new CI infrastructure has placed a renewed emphasis on improving testsuite coverage by regularly (at least nightly) running the entire testsuite (as well as nofib, our performance testsuite). This, of course, meant fixing the hundreds of failing tests in the full testsuite run.

These failures generally broke down into a few classes:

  1. tests that are themselves buggy
  2. tests on which GHC has regressed (but we hadn’t noticed due to the test not being run in the default testsuite configuration)
  3. tests which are broken in certain ways
  4. tests which fail non-deterministically in some or all ways

To handle cases (1-3) GHC has long had an expect_broken test modifier to mark a test as known to be broken due to a particular ticket, e.g.:

test('T13366',
     [when(opsys('darwin'), expect_broken(16083))],
     compile_and_run,
     ['-lstdc++ -v0'])

This modifier causes the test to be run, failing if it unexpectedly passes (ensuring that we notice if the test is inadvertently fixed). Moreover, it encodes the fact that #16083 is the ticket where the breakage is documented.
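
Conceptually the check reduces to comparing expected and actual outcomes, along these lines (a simplified sketch):

def check_result(expected: str, actual: str) -> str:
    # An expected failure that passes is itself reported as a failure,
    # prompting us to drop the expect_broken marking.
    if expected == 'fail' and actual == 'pass':
        return 'unexpected pass'
    if expected == actual:
        return 'as expected'
    return 'unexpected failure'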

Tracking fragile test outcomes

Sadly, the expect_broken mechanism is not appropriate for fragile tests, which may pass or fail nondeterministically. For this case we have introduced a new fragile modifier which runs the test but merely takes note of whether it passed. We can then report this information in two places:

  • In the testsuite report printed at the end of the run, ensuring that it shows up in the testsuite log
  • In the JUnit report produced by the testsuite run.
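
Marking a test as fragile looks much like marking it as broken, again referencing the relevant ticket (test name and ticket number below are illustrative):

test('T99999', [fragile(12345)], compile_and_run, ['-v0'])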

While fragile is a step up from simply skipping broken tests, we clearly want to ensure that the designation is eventually removed from tests that have been stabilized (intentionally or not). For a few weeks we tried checking for such tests by hand: manually examining recent testsuite logs and looking for patterns of fragile tests which routinely pass. However, not only was this time-consuming but it was also quite error-prone, as some of our fragile tests pass over 90% of the time.

For this reason we have developed a tiny bit of automation to help with this process: a GitLab webhook ingests the JUnit output from each testsuite run into a relational database for later analysis. This turns the previously-arduous task of identifying no-longer-fragile tests into a simple SQL query:

-- Fragile tests with no recorded fragile failure in the last four days:
SELECT *
FROM results_view AS x
WHERE message = 'fragile'
  AND NOT EXISTS (SELECT *
                  FROM results_view
                  WHERE test_name = x.test_name
                    AND other->>'reason' = 'fragile fail'
                    AND date > now() - interval '4 days');
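
The ingestion itself is a small affair. Here is a minimal sketch of the idea, using sqlite3 and a single results table for the sake of a self-contained example (the actual service stores considerably more detail in PostgreSQL):

import sqlite3
import xml.etree.ElementTree as ET

def ingest_junit(xml_path: str, db: sqlite3.Connection) -> None:
    # Record one row per test case; failed cases carry their message.
    root = ET.parse(xml_path).getroot()
    for case in root.iter('testcase'):
        failure = case.find('failure')
        db.execute('INSERT INTO results (test_name, message) VALUES (?, ?)',
                   (case.get('name'),
                    failure.get('message') if failure is not None else None))
    db.commit()

# Usage:
#   db = sqlite3.connect('results.db')
#   db.execute('CREATE TABLE IF NOT EXISTS results '
#              '(test_name TEXT, message TEXT)')
#   ingest_junit('junit.xml', db)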

Diagnosing fragile tests

One of the challenges in fixing fragile tests is characterising how they fail. Frequently, fragile tests have several failure modes. Knowing the differences and similarities between these modes can be remarkably helpful in localizing the root cause of the failure. Moreover, it’s not uncommon for failure modes to be shared by multiple fragile tests. Unfortunately, identifying and correlating these modes can be quite time-consuming, especially with infrequently-failing tests.

To aid in this we extended the testsuite driver’s JUnit output to include failing test output. This is then persisted in the test tracking database described above for later reference. This information has already proven to be incredibly useful.
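
The change itself is small: the failing test’s output travels in the body of the JUnit failure element, roughly as follows (simplified):

import xml.etree.ElementTree as ET

def failure_element(reason: str, output: str) -> ET.Element:
    # The 'message' attribute carries the short reason; the element body
    # carries the full test output for later correlation in the database.
    elem = ET.Element('failure', {'message': reason})
    elem.text = output
    return elem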

An unexpected clean-up

One unexpected source of improvement in our testsuite’s reliability arose from the recent addition of an Alpine Linux CI target (#14502). Alpine, unlike the other Linux platforms that we test on, uses the musl C runtime implementation. Due to differences in musl’s file flushing semantics, testing on Alpine immediately shed light on a few subtly fragile testcases which we had previously noticed but had not yet investigated.

This was a helpful reminder that writing and testing software for portability is not only good for users but it can also improve correctness by shedding light on subtly-flawed assumptions.

Closing

In this post we discussed a few of the measures we have taken to improve and maintain GHC’s correctness while lowering maintenance overhead. By tackling long-standing technical debt and putting in place tools to correlate test failures across testsuite runs, testcases, and time, we have gained significantly deeper insight into when, where, and how our tests are failing.

We look forward to sharing similar steps being taken on the compiler performance front in a future post.