Having recently worked quite a bit on some testing tools, including tools to parse TAP, I’ve become intimately familiar with its shortcomings. I’m going to write these up here in the hopes that future generations of test output format developers will not repeat these mistakes.

TAP is (mostly) well-suited for human consumption. It’s easy to read, and when tests fail, it’s easy to figure out what failed and why. But the simplicity that makes it easy to read also makes it really difficult to parse properly. This leads to heuristics, and heuristics are the devil.

No Connection Between Connected Output

What happens when this test code fails:

We get this output:

This is nice and readable. It’s very clear what went wrong. But from a parsing perspective it’s a nightmare. There is no indication that the diagnostic output (the lines starting with #) are connected to the previous not ok line. So you have to implement a heuristic in your parser along the lines of “diagnostic output after a test is connected to that test”. That might be correct, or it might not. Consider this code:

Our output looks like this:

Now our heuristic about diagnostic output is clearly wrong.

There are further heuristics we could try. We could parse the diagnostic content and look for leading space, but this really depends on the vagaries of how each test tool formats its output.

Figuring out what the output means is trivial for a human, but incredibly hard to code for.

Multiple Output Handles for One Stream

TAP sends output to both stdout and stderr in parallel. Most of of the output goes to stdout, but diagnostic output goes to stderr. This causes all sorts of problems.

The stdout handle is usually buffered but stderr is not. In practice, this means that the output gets weird, with stderr output sometimes appearing much earlier than its related stdout output. One way to fix this is to enable autoflush on STDOUT in the test code itself.

With prove you can also pass the -m flag to merge the streams, which generally produces saner interleaving. The downside of merging is that you can no longer tell whether a line like # foo is diagnostic-level output, because that same line could be sent to stdout as a note.

Needless to say, having multiple handles contributes to more parsing problems. The problem discussed above of grouping together lines of output becomes much harder when those lines are split across two handles.

The Epic Subtest Hack

The original versions of TAP had no concept of subtests. It may surprise you to know that modern TAP has no concept of subtests either!

Subtests are a clever hack that take advantage of the original TAP spec’s simplicity. A normal TAP parser only cares about lines that start with non-whitespace. For example:

A TAP parser should only see one test here. The second line is ignored because of its leading whitespace.

Subtests take advantage of this to present human-readable output that the TAP parser ignores. Take this code for example:

The output looks like this:

TAP parsers will ignore all of the indented output. All the parser sees is the first and last lines. The lines that start with whitespace are ignored.

This is a great hack, since it let test tools generate subtest output while parsing tools like Test::Harness continue to work unchanged.

Unfortunately, this also means that none of the parsing tools until Test2::Harness::Parser::TAP actually parsed subtest output. If you were trying to transform TAP output into something else you had to write your own subtest parsing, which is pretty hard to get right. I think Test2::Harness::Parser::TAP is pretty good at this point, but getting there was no mean feat.

None of This Matters Any More

With Test2 out, all of my complaints are more or less irrelevant. Internally, Test2 represents testing events as event objects. If we want to understand what’s happening in our tests, we can look at these events. If we want to run tests in a separate process and capture the event output, we can tell those test processes to use the Test2::Formatter::EventStream formatter instead of TAP. This formatter dumps events as JSON objects which we can easily parse and turn back into event objects in another process. As of the 1.302074 release of Test2, events include a cid (context id) attribute. Multiple events with the same cid are related, so a test ok/not ok event followed by diagnostics can be easily grouped together.

The future of testing tools is looking very bright with Test2! TAP still drives me crazy, but now it’s mostly doing that in theory rather than in practice.

3 thoughts on “TAP is Great Except When It’s Not

  1. I have started looking at Test2 after reading this. I have a question. Is there an easy way to run setup before each subtest? I’m a big fan of independent unit tests. Re-creating all setup before each separate test run prevents unpredictable errors due to undetected non-obvious dependencies. Classic JUnit has setUp() and tearDown() for this, and a huge number of other test frameworks have the same or something similar, including Test::Class.

    It’s not hard to do with Test::More type tests. It’s a fairly simple matter of redefining subtest(). Test::More::Hooks does this, adding before() and after() methods. It seems to work with Test2, except that it tries to check that Test::More is loaded.

    I can fool it this way:

    BEGIN { $INC{'Test/More.pm'} = 1; }
    use Test2::Bundle::More;
    use Test::More::Hooks;


Leave a reply

You may use basic HTML in your comments (links, strong, em, etc.). Paragraphs will be added automically.