Monday, August 29, 2011

Why Tests Don't Really Pass or Fail

Most testers think of tests as passing or failing. (We often compute pass/fail statistics as soon as a test suite is run.) We run a test. Either it found a bug or it didn't. Mark the checkbox and move on to the next test. This is especially true when we run batches of automated tests.

Unfortunately, experience repeatedly shows us that passing a test doesn't really mean there is no bug in the area being tested. If a test "passes," there will be no further action because there's nothing to look for. Further action is indicated only when the test results flag an error or we observe something unusual while running the test. Bugs can still exist in the feature being tested in spite of the passing test. The test may miss the conditions necessary to expose the bug, and it is also quite possible to fail to notice an error even though the test surfaces it. Passing a test really means that we didn't notice anything interesting. We'll probably never know when a test passes erroneously.
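To make that concrete, here is a small illustrative sketch (the function and tests are hypothetical, not from any real project): the code has a boundary bug, yet both tests "pass" because their inputs never hit the condition that would expose it.

```python
def total_cents(price_cents: int, quantity: int) -> int:
    """Total in cents, with a 10% bulk discount for orders of 10 or more items."""
    total = price_cents * quantity
    # Bug: this should be `quantity >= 10`, so an order of exactly 10 misses the discount.
    if quantity > 10:
        total -= total // 10
    return total


def test_bulk_discount_applied():
    # "Passes" -- but only because 11 happens to avoid the boundary where the bug hides.
    assert total_cents(500, 11) == 4950


def test_no_discount_for_small_orders():
    assert total_cents(500, 2) == 1000
```

Both tests report a pass, and a test report would show the discount feature as green, even though an order of exactly 10 items never gets the discount.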

Likewise, failing a test does not guarantee that a bug is present. The conclusion that further work is required is what most people call failing a test, but determining whether there is an underlying bug requires further investigation. There could be a bug in the test itself, a configuration problem, corrupted data, or a host of other explainable reasons that do not mean anything is wrong with the software being tested; it could be behaving exactly as expected given the circumstances. Failing really only means that something we noticed warrants further investigation. Because we follow up, we'll probably figure out (and possibly fix) the real root cause.
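Here is a similarly hypothetical sketch of a failure that doesn't point to a product bug: the code under test does exactly what it should, but the test depends on a configuration step (an assumed TEST_USER_NAME environment variable) that was never performed.

```python
import os


def greeting(name: str) -> str:
    """The code under test -- it behaves exactly as specified."""
    return f"Hello, {name}!"


def test_greeting_uses_configured_name():
    # Raises KeyError on any machine where the (hypothetical) TEST_USER_NAME
    # variable was never set -- a problem with the test's configuration
    # assumption, not a defect in greeting().
    expected_name = os.environ["TEST_USER_NAME"]
    assert greeting(expected_name) == f"Hello, {expected_name}!"
```

The runner marks this red, but only investigation tells us whether the red means a product defect, a missing setup step, or a broken test.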

Pass/Fail metrics don't really give us interesting information until the cause of every pass and fail is understood. Unless we validate the pass indications, we don't know which tests missed a bug in the system under test (SUT). Because we don't really know which passes are real (and we are unlikely to investigate to find out), any count of passes misrepresents the state of the SUT. Likewise, the count of failures is only meaningful after thorough investigation identifies which failures are due to SUT errors and which are not.
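One way I might keep that distinction visible (a sketch of my own, not any particular tool's reporting format) is to carry an investigation status alongside each raw result, so a summary can separate what the runner reported from what has actually been confirmed.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TestResult:
    name: str
    outcome: str            # "pass" or "fail", as reported by the runner
    investigated: bool = False
    root_cause: str = ""    # e.g. "product bug", "test bug", "config", "data"


def summarize(results: list[TestResult]) -> dict:
    """Report raw runner output separately from what investigation has confirmed."""
    raw = Counter(r.outcome for r in results)
    confirmed_product_bugs = sum(
        1 for r in results
        if r.outcome == "fail" and r.investigated and r.root_cause == "product bug"
    )
    awaiting_investigation = sum(
        1 for r in results if r.outcome == "fail" and not r.investigated
    )
    return {
        "raw_pass": raw["pass"],
        "raw_fail": raw["fail"],
        "confirmed_product_bugs": confirmed_product_bugs,
        "failures_awaiting_investigation": awaiting_investigation,
    }
```

The raw counts are still there, but they are labeled as raw; the numbers worth acting on are the ones that survived investigation.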

Skepticism is healthy when reporting test results. As much as we’d like to have definitive answers and absolutes, the results from our testing are inconclusive (especially before failures have been completely investigated). Initial test reports should be qualified with the fact that the numbers are subject to change as more information is gathered about failing tests. Just as the government regularly revises past economic indicators when new information becomes available, so should we treat passes and failures as estimates until we have gathered all the evidence we expect to glean from the tests.

(A paper I wrote on this is available at: Why Tests Don't Pass)
