How to (Easily) Add Tests to your ML Projects (Part 2)
Intro
In my previous post, we covered the steps for creating a package for an example ML model. It ended with a hypothetical conversation with Jane, the DevOps person, suggesting that the package may be what’s needed to get the project launched into production.
However, things in real life rarely go as smoothly as in stories, right? Most likely, Jane comes back with feedback after reviewing the state of the package that Frank prepared. It goes something like this:
Jane (the DevOps person): Hey Frank, we reviewed the package you put together
and it's quite useful, but it's not ready to be
deployed yet.
Frank (the ML person): Oh, ok. What's missing?
Jane: Well, for the package to be built automatically,
it needs to have tests and have them pass
before it can be deployed. We didn't find the
tests so we need to add them.
Frank: (surprised) But the model works! Did you try the
sample code I added to the Readme page?
Jane: (sighs) Yeah, that was helpful, but we need to
automate the testing as part of the deployment
pipeline, not copy-and-paste code manually. So we
need to convert that sample code into tests
before we can release it into production.
Frank: (not this again) How long would that take?
Jane: (frowns) Let's take this offline. We have a lot on
our plates right now so we just need more time.
Frank: (unhappy again)...
Is the DevOps side throwing down more gauntlets? Do ML folks now need to put on QA hats and write tests, in addition to their DevOps hats?
The Lowdown
Having tests in our ML models is insurance protecting us, the ML side, from being unnecessarily pulled into firefighting mode when something goes wrong in production. To put it another way, the more tests a component has, the less likely it is to become a scapegoat when something breaks.
Besides, writing tests isn’t that much work, especially once a project is set up with the basics, similar to the Python packaging we went over. Having tests not only helps with getting ML models deployed, it will be very helpful, if not essential, for future model refinements, to be sure we didn’t break something others depend on.
Another big benefit of tests is that they help guard against nasty surprises when dependencies change, usually components external to your ML model. These can be new PyTorch or Pandas versions, newer CUDA versions, or maybe even security patches or OS upgrades. It’d save a great deal of time and sanity to convert our sample code into a test (takes all of 10 minutes), so no one needs to manually verify that the ML model is still working correctly whenever someone updates a package somewhere in the company. Imagine if your company has tens or even hundreds of ML models and the time it’d take to manually test every change! Best to capture the knowledge about our model’s behaviors in tests so we don’t need to worry about it later.
Let’s get started.
Note: The resulting code for this ML project with tests can be found here.
Step 0: The Plan
Software testing is a huge topic on its own, if not an entire industry, so there is a lot of knowledge to build upon. Testing for ML models isn’t as mature in comparison, but a little goes a long way. The best part is that we can add more tests as a model matures, so it’s not one big lift but something built up incrementally over time.
For this initial phase, we will keep the tests as simple as possible, capturing the most essential functionality of our ML model. The goal is to ensure that the tests can tell that our ML model is working correctly, and also verify that it fails in an expected way when it should fail. Think of these tests as a way to ensure that the interface we present to the external world is preserved every time it’s deployed.
So the plan in pseudocode would be:
import franks_ml_model
add test_1:
- instantiate the model
- run prediction on a test image
- check the prediction results are as expected
add test_2:
- instantiate the model
- run prediction with an invalid input
- check that the expected error is returned or thrown
add test_3:
- instantiate the model
- run prediction on the inner PyTorch model
- check the raw outputs are as expected
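To make the plan concrete, here’s a minimal, self-contained sketch of tests 1 and 2 in pytest form. The StubModel class is a hypothetical stand-in for the real EfficientNetInfer, so the sketch runs on its own; the actual tests we build in the steps below use the real class.

```python
import pytest

# Hypothetical stand-in for franks_ml_model.EfficientNetInfer, so this
# sketch is runnable on its own; the real tests below use the actual class.
class StubModel:
    def infer_image(self, fn_image, topk=5):
        if 'missing' in fn_image:
            raise FileNotFoundError(fn_image)
        # Mimic the shape of the real output: [class_id, label, probability]
        return [[207, 'golden retriever', 0.56]] * topk

def test_happy_path():
    model = StubModel()
    top_predictions = model.infer_image('dog.jpg')
    assert top_predictions[0][0] == 207  # expected top class id
    assert len(top_predictions) == 5     # default topk

def test_invalid_input():
    model = StubModel()
    with pytest.raises(FileNotFoundError):
        model.infer_image('missing.jpg')
```

Test 3, which exercises the inner PyTorch model, follows the same pattern and is covered in Step 7.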
Step 1: Setting up the Project for Tests
This is quite easy, since the Cookiecutter template we used last time already set up the basics.
The first part is a requirements_dev.txt file, which includes any additional package dependencies for development and testing purposes. For packages that’d only be used during testing and not in production, it’s best to add them to the _dev file so the production environment stays as slim and minimal as possible.
We will continue to build up Frank’s ML project for this tutorial, by first reviewing the contents of the requirements_dev.txt file:
-r requirements.txt
bumpversion
pytest
The first line includes any packages needed in all environments, and the last line is the pytest testing framework we’ll be using, though you’re certainly welcome to explore others as well, like unittest. We won’t need bumpversion for now, so ignore it until we come back to it in a future post.
The only other thing you’d do is add a tests directory in the project’s root, separate from the franks_ml_model directory (the “box” where we placed the contents of our project). This way the built package is as compact as possible, without the tests and test files we’ll be adding. Your project folder should look like the following:
├── README.md
├── franks_ml_model
│   ├── infer.py
│   ├── labels_map.txt
│   ├── model.py
│   └── utils.py
├── requirements.txt
├── requirements_dev.txt
└── tests
We’re ready to add our first test!
Step 2: Adding the First Test, the Happy Path
The first test will capture the expected behavior of our model when things are working properly. One thing to note before we start: these types of tests should be relatively quick to run, on the order of a minute or less per test, since large projects may end up with hundreds or more tests. The goal is to run all the tests before each deployment, so we don’t want them to take hours to run.
Back to our first test. We will effectively turn the sample code on the README page into an automated test, so no one needs to manually run the code and check that the top prediction is “golden retriever” every time something changes. Even better, when something goes wrong, such as a missing or broken dependency being installed, the deployment process is stopped and whoever broke the environment gets to fix it.
Let’s create a file named tests/test_infer.py, and import the necessary dependencies:
import pytest
from franks_ml_model import EfficientNetInfer
The body of our first test is pretty much the same as the sample code, with some minor changes. We will define each test as a function prefixed by test_, which will be picked up by pytest and run as an individual test. Let’s give our first test a descriptive function name so we can more easily tell which tests succeeded and which failed.
def test_infer_dog_image_happy_path():
    model = EfficientNetInfer()
    fn_image = 'tests/test_files/dog.jpg'
    top_predictions = model.infer_image( fn_image )
This looks very familiar for a good reason: we want our tests to exercise the model through the same interfaces our clients would use. The only thing we changed here is which image is used. We want a known image with known predictions, to validate that the model is working as expected.
For this tutorial we are using the same image of a dog (source unknown, sorry), a golden retriever. It’s added to the tests/test_files folder so the test can run locally.
As we saw from the previous packaging post, the expected outputs should look like the following list:
[
[207, 'golden retriever', 0.5610832571983337],
[213, 'Irish setter, red setter', 0.22866328060626984],
...
]
In the second half of our test, we check the returned top_predictions against what we expect. We could check every element of the list, but since the probabilities of ML models can change, we’ll start simple and just check the first element as a sanity check. We can do these checks with simple assert statements (though there are other options, like hamcrest).
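Since exact probabilities can drift slightly (for example, across library versions), pytest.approx offers a middle ground between an exact equality check and no check at all. A small sketch using the sample values above:

```python
import pytest

# Top prediction from the sample output above: [class_id, label, probability]
top = [207, 'golden retriever', 0.5610832571983337]

assert top[0] == 207                            # class id: exact match
assert top[2] == pytest.approx(0.56, abs=0.05)  # probability: within +/-0.05
```

This lets the test tolerate small numerical drift while still failing loudly if the probability changes meaningfully.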
For this example image, we can check that the top prediction has class id 207, and that its probability is over 0.5. One could also check that the label is the string golden retriever, but we will leave that out for now. Lastly, we can check that we got 5 predictions, since the default topk for infer_image() is 5.
    assert top_predictions[0][0] == 207  # class_id for golden retriever
    assert top_predictions[0][2] > 0.5
    assert len(top_predictions) == 5  # default for topk is 5
Pretty simple, right? The idea is that if any of the assert statements evaluates to False, the test will fail at that statement (and go no further). We will see how that works in a bit.
Step 3: Running the Test
Now that we have our first test, let’s run it and see how it goes. Before we can run it, though, first run
# in case you haven't done already
source ~/.virtualenvs/test_franks_model/bin/activate
pip install -r requirements_dev.txt
or you will get a message like python: No module named pytest when trying to run the test.
To invoke the test, run this from the project root:
python -m pytest tests/test_infer.py
and if the test ran correctly, you should see something similar to the following:
============================= test session starts ==============================
platform darwin -- Python 3.7.6, pytest-5.3.5, py-1.8.1, pluggy-0.13.1
rootdir: /Users/gerald/projects/franks_ml_model
collected 1 item
tests/test_infer.py . [100%]
========================= 1 passed, 1 warning in 0.76s =========================
The period . is what you want to see next to test_infer.py (subtle, but more obvious as we add more tests), with no failures at the end. You can also check that the exit code is zero, which is what automated build systems check, via
echo $?
0
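That zero/non-zero exit-code convention is exactly what build systems key off: zero means proceed, anything else stops the pipeline. Here’s a minimal, self-contained sketch of the convention, using python -c in place of a real pytest run:

```python
import subprocess
import sys

# Exit code 0 signals success; any non-zero code signals failure.
# A CI pipeline runs the test suite and checks exactly this value.
passed = subprocess.run([sys.executable, "-c", "raise SystemExit(0)"])
failed = subprocess.run([sys.executable, "-c", "raise SystemExit(1)"])

print(passed.returncode, failed.returncode)  # 0 1
```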
Step 4: Break the Test
This may sound strange, but we want to intentionally make tests fail, to ensure that they are actually testing the right parts of the code. For now we just want to see what it’d look like if a test we wrote is incorrect, by adding an assert to the test we just wrote and having it expect the second prediction to have a class id of 0, which should never happen:
    assert top_predictions[1][0] == 0  # incorrect expectation
Let’s run the test again and see what happens:
============================= test session starts ==============================
collected 1 item
tests/test_infer.py F [100%]
=================================== FAILURES ===================================
_______________________ test_infer_dog_image_happy_path ________________________
    def test_infer_dog_image_happy_path():
        model = EfficientNetInfer()
        fn_image = 'tests/test_files/dog.jpg'
        top_predictions = model.infer_image( fn_image )
        assert top_predictions[0][0] == 207  # class_id for golden retriever
        assert top_predictions[0][2] > 0.5
        assert len(top_predictions) == 5  # default for topk is 5
>       assert top_predictions[1][0] == 0  # incorrect expectation
E       assert 213 == 0

tests/test_infer.py:14: AssertionError
----------------------------- Captured stdout call -----------------------------
Loaded pretrained weights for efficientnet-b0
========================= 1 failed, 1 warning in 0.88s =========================
Failures are quite obvious, and that’s good. As part of showing the context of which assert statement failed, pytest shows what value was actually observed (213 in this case). Conveniently, pytest also captures the stdout and stderr of failed tests to help with debugging, as well as the stack trace if applicable.
Step 5: Expecting the Unexpected: Adding Tests for Error Cases
Just as it’s important to know what to expect when the ML model is working correctly, it’s equally important to capture its behavior when errors occur. For our simple interface, we want to let the client know what to expect when the image file passed in cannot be loaded. We could return no error and just a None from infer_image, but that would mask what the problem was. Instead, it’s better to throw an exception, a FileNotFoundError in this case.
With pytest, you can capture expected exceptions with the with pytest.raises(<expected exception>) construct, so the test passes only if the <expected exception> was actually thrown. If the call does not raise the expected exception, pytest will report a failure instead, letting the tester know this error interface is broken.
Here’s the actual test:
def test_infer_non_existent_image_expected_error():
    model = EfficientNetInfer()
    fn_image = 'tests/test_files/not_here.jpg'
    with pytest.raises(FileNotFoundError):
        top_predictions = model.infer_image( fn_image )
and when you run the tests now, there will be two dots next to test_infer.py. You can experiment with changing fn_image to the dog image in the first test and see how pytest reports the error, along the lines of
    with pytest.raises(FileNotFoundError):
>       top_predictions = model.infer_image( fn_image )
E       Failed: DID NOT RAISE <class 'FileNotFoundError'>
With this test, we guarantee our model’s behavior under this error condition, since our clients will be expecting and checking for this exception so they can handle such situations properly.
You can read up more about how to handle exceptions within pytest here.
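One refinement worth knowing about: pytest.raises accepts a match= argument, a regular expression matched against the exception’s message, so you can pin down the message as well as the type. A sketch with a hypothetical load_image helper:

```python
import pytest

# Hypothetical helper, just to demonstrate the match= parameter
def load_image(path):
    raise FileNotFoundError(f"no such file: {path}")

def test_error_message():
    # Passes only if FileNotFoundError is raised AND its message
    # matches the given regular expression
    with pytest.raises(FileNotFoundError, match=r"no such file"):
        load_image("not_here.jpg")

test_error_message()
```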
Step 6: Why Instantiate the Model per Test?
Some of you might notice that at the start of both tests, we have the model
instantiated per test, i.e., model = EfficientNetInfer()
, which might
be ok for a small model like this one, but what if it’s a much larger, 200M or
even billion+ parameters model? Running tests would be pretty slow for those.
With pytest, such shared initializations can be managed using “fixtures”, so they can be easily shared between tests. In our case, we create a model fixture at the top of our test file, and it is passed into any tests that need it. A fixture is created via the pytest.fixture decorator:
@pytest.fixture(scope="session")
def model():
    effnet_infer_model = EfficientNetInfer()
    return effnet_infer_model
The @pytest.fixture decorator effectively creates a global variable with the name of the function, model in this example, and the option scope="session" says to create it once and reuse it, so our model is only loaded once. We then simplify our two tests by having this model variable passed in.
Our first test now looks like this:
def test_infer_dog_image_happy_path( model ):
    # model is passed in
    fn_image = 'tests/test_files/dog.jpg'
    ...
and second would be:
def test_infer_non_existent_image_expected_error( model ):
    fn_image = 'tests/test_files/not_here.jpg'
    ...
You can re-run the tests and both should pass as before, but now the model is loaded only once. Faster tests are always better!
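To convince yourself that a session-scoped fixture really is built only once, here’s a self-contained sketch: it writes a tiny test file to a temporary directory and runs pytest on it programmatically (pytest.main returns the same exit code as the command line, with 0 meaning all tests passed):

```python
import os
import tempfile
import textwrap

import pytest

# A tiny test module: the fixture appends to `loads` each time it is built,
# and both tests assert it was built exactly once.
test_src = textwrap.dedent("""
    import pytest

    loads = []

    @pytest.fixture(scope="session")
    def model():
        loads.append(1)  # count how many times the "model" is constructed
        return object()

    def test_a(model):
        assert len(loads) == 1

    def test_b(model):
        assert len(loads) == 1  # still 1: the fixture was reused, not rebuilt
""")

with tempfile.TemporaryDirectory() as tmp_dir:
    test_path = os.path.join(tmp_dir, "test_fixture_demo.py")
    with open(test_path, "w") as f:
        f.write(test_src)
    exit_code = pytest.main([test_path, "-q"])

print(int(exit_code))  # 0 means both tests passed, so the fixture was built once
```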
Step 7: Test the Inner Workings Also
We have now added tests for the external-facing interfaces of our model, but we can add unit tests for its internals as well. Why? They can help diagnose issues when something goes wrong, by telling us whether only the external tests are failing, or the internal ones as well. If it’s the former, you can focus your investigation on the outer layers of the model. If the internal tests are failing, best to start there, since something is seriously amiss!
There are lots of other benefits to adding these unit tests, such as aiding refactoring and improving code structure. For now, let’s add a test for the internal EfficientNet model doing a forward pass on an image of all zeros. We can check that the output shape is what we expect, and that the top class id is consistent over time (467). The latter is not a guarantee though, since we currently do not have control over when EfficientNet models may get updated. That’s a topic saved for another post, but for now we can use this test as a proxy to detect whether the internal model has changed significantly from what we expect.
def test_effnet_model_forward( model ):
    import torch
    effnet_model = model.effnet_model
    image_size = model.img_size
    batch_zero_tensor = torch.zeros(1, 3, image_size, image_size)
    with torch.no_grad():
        outputs = effnet_model( batch_zero_tensor )
    assert outputs.size() == (1, 1000)  # test output shape
    assert outputs.argmax() == 467  # top index is 467
When you run these tests you should see 3 periods, and feel the satisfaction that your model is now infinitely more tested than when we started!
======================== 3 passed, 1 warning in 0.87s =========================
Conclusion
Now that you have seen the process, I hope you agree that adding tests is not that difficult, and can even be satisfying. I have to admit that personally, it took me a while to come around to the importance of tests, having come from a slap-dash background of academic software development. Once you have one test (or three) in place, more become natural to add as you continue to refine your model. Even if it stays at three tests, your ML project will be accepted as “good enough”, instead of being an unknown quantity that might blow up at any point.
I hope you can also see that the assertions we added to these tests are much easier for us (the ML side) to write, since we know the correct and expected behaviors of our own models better than anyone else. Had we handed the model to DevOps (or any other team, really) to write tests, they’d have to spend a lot more time and energy trying to understand what this model does, and to determine which aspects are important enough to test (and which to skip). Adding these tests ourselves is a win for everyone.
So perhaps the next conversation Frank has with Jane would go something like this:
Frank (the ML person): Hey Jane, based on our last conversation, I
looked into adding tests to my ML project
and now it has unit tests. Does that
help with getting it deployed?
Jane (the DevOps person): Yeah, most definitely! This means that we can
build the package, pass the tests, and upload
it via our continuous deployment pipeline. I'll
have my team take a look and see if we can get
the pipeline set up in the next sprint.
Frank: Sounds great! Looking forward to it.
Let’s find out what Jane comes back with in our next post.