How big can a unit test be? How small can an integration test be?
It’s easy to argue about whether a test is a ‘true’ unit test or not. If we test several classes together, is it still a unit test? If we use a small external API, must we call it an integration test?
The problem is that testing is a spectrum, but our terminology only allows discrete levels.
Putting A Number On It
I propose we treat testing as a numerical scale instead. Let’s introduce a concept of test height, from 0 to 100.
At 0, we have low-level, isolated unit tests. They run quickly, they run in process, and they’re extremely reliable. No setup or teardown is required.
At 100, we have high-level tests of full computer systems. We spin up an elaborate infrastructure of databases and external services, and exercise real protocols. Many processes run, runtime can vary widely, and flakiness is a continuing challenge.
Let’s look at some examples! We’ll look at some code for a wiki website, and discuss different tests we could write.
20: An Isolated Test
At the lowest level, we have code that only depends on its inputs.
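As a minimal sketch of such code, assume the wiki turns page titles into URL slugs. The name slugify_title and its exact rules are hypothetical, made up for illustration; the point is only that the result depends on nothing but the argument.

```python
import re


def slugify_title(title: str) -> str:
    """Turn a wiki page title into a URL slug (hypothetical helper)."""
    # Lowercase, then collapse every run of non-alphanumeric characters
    # into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    # Drop any hyphens left over at either end.
    return slug.strip("-")
```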
We can easily write a small unit test for a function like this. There are no dependencies on external resources, or even external libraries.
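A test at this height might look like the sketch below. The function under test, slugify_title, is a hypothetical example (it turns a page title into a URL slug) and is included in the block so the snippet runs on its own.

```python
import re
import unittest


def slugify_title(title: str) -> str:
    """Hypothetical function under test: page title in, URL slug out."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


class SlugifyTitleTests(unittest.TestCase):
    def test_spaces_and_punctuation_become_hyphens(self):
        self.assertEqual(slugify_title("Hello, World!"), "hello-world")

    def test_empty_title_gives_empty_slug(self):
        self.assertEqual(slugify_title(""), "")
```

Saved as a module, this runs with python -m unittest and needs no setup or teardown.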
This test only requires a single process, and our test assertion is simply looking at runtime values.
40: Running With A Scratch Database
Our wiki stores its data in a database. Let’s look at some view code that ultimately creates database rows.
We can test this after creating a new database to write to.
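Here is a rough, framework-free sketch of that shape, using SQLite from the standard library in place of a web framework and a database server. The create_page function and the pages table are hypothetical stand-ins for the view code. SQLite runs in-process, so this sketch does not add the second process a real database server would; it only shows the setup, teardown and assertion pattern.

```python
import os
import sqlite3
import tempfile
import unittest


def create_page(conn: sqlite3.Connection, title: str, body: str) -> int:
    """Hypothetical stand-in for a view: validate input, then insert a row."""
    if not title.strip():
        raise ValueError("title must not be empty")
    cursor = conn.execute(
        "INSERT INTO pages (title, body) VALUES (?, ?)", (title, body)
    )
    conn.commit()
    return cursor.lastrowid


class CreatePageTests(unittest.TestCase):
    def setUp(self):
        # Create a fresh scratch database for every test.
        fd, self.db_path = tempfile.mkstemp(suffix=".sqlite3")
        os.close(fd)
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT, body TEXT)"
        )

    def tearDown(self):
        # Clean up so no state leaks into the next test.
        self.conn.close()
        os.remove(self.db_path)

    def test_create_page_writes_a_row(self):
        page_id = create_page(self.conn, "Test Height", "From 0 to 100.")
        rows = self.conn.execute("SELECT id, title, body FROM pages").fetchall()
        self.assertEqual(rows, [(page_id, "Test Height", "From 0 to 100.")])

    def test_empty_title_is_rejected(self):
        with self.assertRaises(ValueError):
            create_page(self.conn, "   ", "No title.")
```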
We’re now running two processes during our tests, and we need to ensure we clean up properly after each test. Assertions are now verifying that data is being written to the database. Our test is definitely slower than the isolated test at level 20.
60: Full Protocol Test
Web frameworks make testing easy, and allow you to call view functions directly. If we want to exercise a full web request, we need something that speaks HTTP. This is usually a headless web browser.
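The sketch below shows the general shape of a test that goes over real HTTP. It swaps the headless browser for the standard library’s HTTP client, so it exercises the protocol but no rendering or JavaScript. WikiHandler is a hypothetical stand-in for the wiki’s web server, not real application code.

```python
import http.client
import threading
import unittest
from http.server import BaseHTTPRequestHandler, HTTPServer


class WikiHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the wiki's web server."""

    def do_GET(self):
        body = b"<html><body><h1>Home</h1></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # Keep test output quiet.


class HomepageOverHttpTests(unittest.TestCase):
    def setUp(self):
        # Port 0 asks the operating system for any free port.
        self.server = HTTPServer(("127.0.0.1", 0), WikiHandler)
        self.thread = threading.Thread(target=self.server.serve_forever)
        self.thread.start()

    def tearDown(self):
        self.server.shutdown()
        self.server.server_close()
        self.thread.join()

    def test_homepage_is_served_over_http(self):
        conn = http.client.HTTPConnection(
            "127.0.0.1", self.server.server_port, timeout=5
        )
        try:
            conn.request("GET", "/")
            response = conn.getresponse()
            self.assertEqual(response.status, 200)
            self.assertIn(b"<h1>Home</h1>", response.read())
        finally:
            conn.close()
```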
This test is slower still: we’re running the full webserver stack and driving a browser from our tests. We have to start worrying about asynchronous responses: when do we know that the browser has fully loaded the page, so that we can look for the content we expect?
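One common answer is to poll until a condition holds or a timeout expires. Browser automation tools usually ship their own wait helpers; the wait_until function below is a hypothetical, generic sketch of the idea rather than any particular library’s API.

```python
import time
from typing import Callable


def wait_until(
    condition: Callable[[], bool],
    timeout: float = 5.0,
    interval: float = 0.05,
) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            # Give up: the caller decides whether this fails the test.
            return False
        time.sleep(interval)
```

A test would pass in a callable that checks the page, and fail with a clear message if wait_until returns False. The timeout is exactly where the variable runtime and flakiness creep in.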
If this test fails, any errors or exceptions are less likely to be useful. We’ll probably need to open a browser ourselves and step through the process to see what happened.
These tests are definitely useful! They’re higher maintenance, but they exercise more of your code in a more realistic way.
80: System Test
Finally, we simulate the whole system. We could spin up VMs, configure HTTPS certificates, and automatically build the infrastructure for an entire website.
(But isn’t this just a staging area? In practice, the line between testing and staging is extremely blurry anyway.)
This can give a very high degree of confidence, but the costs are even worse than at level 60. It’s slower, takes even more maintenance, and is even flakier.
I like to have a couple of sanity check tests at this level. For example, spinning up a full server and checking that the homepage loads is a great way of catching infrastructure issues.
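A sanity check like that can be very small. The sketch below assumes the infrastructure is already up and reachable at some base URL; homepage_loads is a hypothetical helper, and any URL passed to it is a placeholder for wherever the system under test actually lives.

```python
import urllib.request


def homepage_loads(base_url: str, timeout: float = 10.0) -> bool:
    """Return True if a GET of `base_url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, DNS failures, TLS errors, timeouts
        # and HTTP error statuses (urllib's errors derive from OSError).
        return False
```

It says nothing about why the homepage failed to load, which is typical of tests this high up: they tell you something is wrong, not what.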
Why Don’t We Use A Scale?
If we think about tests on a scale, then we acknowledge there are extremes, and points in-between. This is why I don’t have any examples with a height of 0 (not even using the standard library?) or 100 (simulating an entire data centre?).
Testing a function that only uses the standard library could be 20, but if our function also calls other helper functions we’ve written, it might be 25.
If our test requires a database but never writes to it, we might only give it a height of 35. It requires no teardown and tests can even safely run in parallel.
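To make the numbers concrete, here is a sketch of how heights like these could be attached to tests and used to choose which ones to run. The height decorator and the select_by_height helper are hypothetical: they are not part of unittest, and the test bodies are placeholders.

```python
import unittest


def height(value: int):
    """Hypothetical decorator: record a 0-100 test height on a test method."""
    if not 0 <= value <= 100:
        raise ValueError("height must be between 0 and 100")

    def decorator(func):
        func.test_height = value
        return func

    return decorator


def select_by_height(test_case_class, max_height: int) -> unittest.TestSuite:
    """Build a suite of the tests whose height is at most `max_height`."""
    suite = unittest.TestSuite()
    for name in unittest.defaultTestLoader.getTestCaseNames(test_case_class):
        method = getattr(test_case_class, name)
        # Untagged tests are assumed to be the tallest, so they are only
        # selected when the caller asks for everything.
        if getattr(method, "test_height", 100) <= max_height:
            suite.addTest(test_case_class(name))
    return suite


class WikiTests(unittest.TestCase):
    @height(20)
    def test_pure_function(self):
        self.assertEqual("Wiki".lower(), "wiki")

    @height(35)
    def test_read_only_database(self):
        self.assertTrue(True)  # Placeholder body for the sketch.
```

Calling select_by_height(WikiTests, 30) would pick only the first test, while a threshold of 35 or more picks both.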
Our tooling forces discrete levels upon us. The tests exist on a
scale, but we have discrete mocking features we can choose to use or
not. A Django test must decide whether it inherits from
django.test.TestCase, which does database setup/teardown, or
unittest.TestCase, which does not.
Sometimes test levels just mean performance. Higher level tests
are slower, so projects often split tests into
fast and slow groups. These are arbitrary
designations, chosen to make testing convenient. They’re not
Beyond Discrete Categories
I hope this scale is a useful tool for you to think about how you test. What’s right for you depends on what problems you’re facing, the number of services you depend on, and the complexity of your stack. Good luck.