In my previous post, we covered the steps in creating a package for an example ML model. It ended with a hypothetical conversation with Jane, the devops person, that the package may be what’s needed to get the project launched into production.
However, things in real life rarely go as smoothly as in stories, right? Most likely, Jane comes back with the following feedback, after reviewing the state of the package that Frank prepared. It goes something like this:
Jane (the DevOps person): Hey Frank, we reviewed the package you put together and it's quite useful, but it's not ready to be deployed yet.

Frank (the ML person): Oh, ok. What's missing?

Jane: Well, for the package to be built automatically, it needs to have tests and have them pass before it can be deployed. We didn't find the tests so we need to add them.

Frank: (surprised) But the model works! Did you try the sample code I added to the Readme page?

Jane: (sighs) Yeah, that was helpful, but we need to automate the testing as part of the deployment pipeline, not copy-and-paste code manually. So we need to convert that sample code into tests before we can release it into production.

Frank: (not this again) How long would that take?

Jane: (frowns) Let's take this offline. We have a lot on our plates right now so we just need more time.

Frank: (unhappy again)...
Is the DevOps side throwing up more hurdles? Do ML folks now need to put on QA hats and write tests, in addition to DevOps hats?
Having tests for our ML models is insurance that protects us, the ML side, from being unnecessarily pulled into firefighting mode when something goes wrong in production. To put it another way, the more tests a component has, the less likely it is to become a scapegoat when something breaks.
Besides, writing tests isn't that much work, especially once a project is set up with the basics, similar to the Python packaging we went over. Having tests not only helps with getting ML models deployed, it will be very helpful, if not essential, for future model refinements, to be sure we didn't break something others depend on.
Another big benefit of tests is that they help guard against nasty surprises when dependencies change, usually the components external to your ML model. These can be new PyTorch or Pandas versions, newer CUDA versions, or even security patches and OS upgrades. It'd save a great deal of time and sanity to convert our sample code into a test (it takes all of 10 minutes), so no one needs to manually verify that the ML model is still working correctly whenever someone updates a package somewhere in the company. Imagine if your company has tens or even hundreds of ML models, and the time it'd take to manually test every change! Best to capture the knowledge about our model's behaviors in tests so we don't need to worry about it later.
Let’s get started.
Note: The resulting code for this ML project with tests can be found here.
Step 0: The Plan
Software testing is a huge topic on its own, if not an entire industry, so there is a lot of knowledge to build upon. Testing for ML models isn't as mature in comparison, but a little goes a long way. The best part is that we can add more tests as a model matures, so it's not one big lift but something built up incrementally over time.
For this initial phase, we will keep the tests as simple as possible, capturing the most essential functionality of our ML model. The goal is to ensure that the tests can tell us that our ML model is working correctly, and also verify that it fails in an expected way when it should fail. Think of these tests as a way to ensure that the interface we present to the external world is preserved every time the model is deployed.
So the plan in pseudocode would be:
```
import franks_ml_model

add test_1:
  - instantiate the model
  - run prediction on a test image
  - check the prediction results are as expected

add test_2:
  - instantiate the model
  - run prediction with an invalid input
  - check the expected error is returned or thrown

add test_3:
  - instantiate the model
  - run prediction on the inner PyTorch model
  - check the raw outputs are as expected
```
Step 1: Setting up the Project for Tests
This is quite easy, since the Cookiecutter template we used last time already set up the basics.
The first part is a requirements_dev.txt file, which includes any additional package dependencies for development and testing purposes. Packages that are only used during testing and not in production are best added to this file, so the production environment stays as slim and minimal as possible.

We will continue to build up Frank's ML project for this tutorial, starting with a review of the contents of requirements_dev.txt:

```
-r requirements.txt
bumpversion
pytest
```
The first line includes the packages needed by all environments, and the last line is the pytest testing framework we'll be using, though you're certainly welcome to explore others as well, like unittest. We won't need bumpversion for now, so ignore it until we come back to it in a future post.
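For a taste of the unittest alternative, here is a minimal, self-contained sketch. The prediction data below is a stand-in for real model output (a real test would call the model instead); the main stylistic difference is that unittest uses assert* methods on a TestCase class rather than bare assert statements:

```python
import unittest

class TestTopPrediction(unittest.TestCase):
    def test_class_id(self):
        # stand-in prediction data in the same [class_id, label, probability]
        # format our model returns; a real test would call the model instead
        top_predictions = [[207, 'golden retriever', 0.56]]
        self.assertEqual(top_predictions[0][0], 207)

# load and run the test case programmatically
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestTopPrediction)
result = unittest.TextTestRunner().run(suite)
```

Conveniently, pytest can also discover and run unittest-style test classes, so mixing the two styles in one project works fine.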
The only other thing to do is add a tests directory in the project's root, separate from the franks_ml_model directory (the "box" where we placed the contents of our project). This way the built package stays as compact as possible, without the tests and test files we'll be adding. Your project folder should now look like the following:
```
├── README.md
├── franks_ml_model
│   ├── infer.py
│   ├── labels_map.txt
│   ├── model.py
│   └── utils.py
├── requirements.txt
├── requirements_dev.txt
└── tests
```
We’re ready to add our first test!
Step 2: Adding the First Test, the Happy Path
The first test captures the expected behavior of our model when things are working properly. One thing to note before we start is that these types of tests should be relatively quick to run, on the order of a minute or less per test, since some large projects may end up with hundreds or more tests. The goal is to run all the tests before each deployment, so we don't want them to take hours to run.
Back to our first test. We will effectively turn the sample code on the README page into an automated test, so we don't need to manually run the code and check that the top prediction reads "golden retriever" every time something changes. Even better, when something goes wrong, such as a missing dependency or a broken one being installed, the deployment process stops and whoever broke the environment gets to fix it.
Let's create a file named tests/test_infer.py, and import the necessary modules:

```python
import pytest
from franks_ml_model import EfficientNetInfer
```
The body of our first test is pretty much the same as the sample code, with some minor changes. We will define each test as a function prefixed with test_, which will be picked up by pytest and run as an individual test. Let's give our first test a descriptive function name so we can more easily tell which tests succeeded or failed:
```python
def test_infer_dog_image_happy_path():
    model = EfficientNetInfer()
    fn_image = 'tests/test_files/dog.jpg'
    top_predictions = model.infer_image( fn_image )
```
This looks very familiar for a good reason, since we want test writing to be as similar to how clients would use our models via the same interfaces. The only thing we changed here is which test image is used. We want to use a known image and known predictions to validate the model is working as expected.
For this tutorial we are using the same image of a dog (source unknown, sorry) that's a golden retriever. It's added to the tests/test_files folder so the test can load it.
As we saw from the previous packaging post, the expected outputs should look like the following list:
```
[
  [207, 'golden retriever', 0.5610832571983337],
  [213, 'Irish setter, red setter', 0.22866328060626984],
  ...
]
```
In the second half of our test, we can check the returned top_predictions against what we expect. We could check every element of the list, but since the probabilities of ML models can change, we can start simple and just check the first element as a sanity check. We can do these checks with simple assert statements (although there are other options, like hamcrest).

For this example image, we can check that the top prediction has class id 207, and that its probability is at least over 0.5. One could also check that the label is the string golden retriever, but we will leave that out for now. Lastly, we can check that we did get 5 predictions, since the default for infer_image() is 5.
```python
assert top_predictions[0][0] == 207  # class_id for golden retriever
assert top_predictions[0][2] > 0.5   # probability of the top prediction
assert len(top_predictions) == 5     # default for topk is 5
```
Pretty simple, right? The idea is that if any of the assert statements evaluates to False, the test will fail at that statement (and not go further). We will see how that works in a bit.
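As an aside, if we later want to pin the probability more tightly than a bare "greater than 0.5" threshold, pytest ships an approx() helper for comparing floats within a tolerance. A small self-contained sketch (the probability value is copied from the sample output above; the tolerance is our own choice):

```python
import pytest

# probability of the top prediction, taken from the sample output above
top_prob = 0.5610832571983337

# approx() passes when the value is within the given absolute tolerance,
# which is more robust than an exact float comparison
assert top_prob == pytest.approx(0.56, abs=0.01)
```

This keeps the test meaningful even as small numerical drift creeps in from library or hardware changes.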
Step 3: Running the Test
Now that we have our first test, let's run it and see how it goes. Before we can run it, though, first run:

```shell
# in case you haven't done so already
source ~/.virtualenvs/test_franks_model/bin/activate
pip install -r requirements_dev.txt
```
or you will get a message like
python: No module named pytest when trying to
run the test.
To invoke the test, run this from the project root:
python -m pytest tests/test_infer.py
and if the test ran correctly, you should see something similar to the following:
```
============================= test session starts ==============================
platform darwin -- Python 3.7.6, pytest-5.3.5, py-1.8.1, pluggy-0.13.1
rootdir: /Users/gerald/projects/franks_ml_model
collected 1 item

tests/test_infer.py .                                                    [100%]

========================= 1 passed, 1 warning in 0.76s =========================
```
The . is what you want to see next to the test file name (subtle now, but it becomes more obvious as we add more tests), along with no failures at the end. You can also check that the exit code is zero, which is what automated build systems check, via:

```shell
echo $?
0
```
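That zero exit code is exactly what CI systems key off of. As a self-contained illustration (using a trivial python -c command as a stand-in for the real pytest invocation), the stdlib subprocess module exposes the same return code:

```python
import subprocess
import sys

# run a child process the way a CI system would run pytest;
# 'python -c "pass"' is a stand-in command that succeeds
result = subprocess.run([sys.executable, '-c', 'pass'])

# a zero return code means success, just like a passing test run
assert result.returncode == 0
```

A failing test run would give a non-zero code instead, which is what halts an automated deployment.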
Step 4: Break the Test
This may sound strange, but we want to intentionally make tests fail, to ensure that they are actually testing the right parts of the code. For now we just want to see what it'd look like if a test we wrote is incorrect, by adding an assert to the test we just wrote that expects the second prediction to have class id 0, which should never happen:
```python
assert top_predictions[1][0] == 0  # incorrect expectation
```
Let’s run the test again and see what happens:
```
============================= test session starts ==============================
collected 1 item

tests/test_infer.py F                                                    [100%]

=================================== FAILURES ===================================
_______________________ test_infer_dog_image_happy_path ________________________

    def test_infer_dog_image_happy_path():
        model = EfficientNetInfer()
        fn_image = 'tests/test_files/dog.jpg'
        top_predictions = model.infer_image( fn_image )
        assert top_predictions[0][0] == 207  # class_id for golden retriever
        assert top_predictions[0][2] > 0.5
        assert len(top_predictions) == 5  # default for topk is 5
>       assert top_predictions[1][0] == 0  # incorrect expectation
E       assert 213 == 0

tests/test_infer.py:14: AssertionError
----------------------------- Captured stdout call -----------------------------
Loaded pretrained weights for efficientnet-b0
========================= 1 failed, 1 warning in 0.88s =========================
```
Failures are quite obvious, and that's good. As part of showing the context in which the assert statement failed, pytest shows what value was observed (213 in this case). Conveniently, pytest also captures the stdout of failed tests to help with debugging the error, as well as the stack trace if applicable.
Step 5: Expecting the Unexpected: Adding Tests for Error Cases
Just as it is important to know what to expect when the ML model is working correctly, it is equally important to capture the behavior when errors occur. For our simple interface, we want to let the client know what to expect when the image file it passed in cannot be loaded. We could swallow the error and return an empty result from infer_image, but that would mask what the problem was. Instead, it's better to throw an exception, FileNotFoundError in this case. In pytest, you can capture expected exceptions with the with pytest.raises(&lt;expected exception&gt;) construct, so the test passes only if &lt;expected exception&gt; was actually thrown. If the call does not raise the expected exception, pytest will report an error instead, letting the tester know this error interface is broken.
Here’s the actual test:
```python
def test_infer_non_existent_image_expected_error():
    model = EfficientNetInfer()
    fn_image = 'tests/test_files/not_here.jpg'
    with pytest.raises(FileNotFoundError):
        top_predictions = model.infer_image( fn_image )
```
When you run the tests now, there will be two dots next to test_infer.py. You can experiment with changing fn_image to the dog image in the first test and see how pytest reports the error, along the lines of:

```
    with pytest.raises(FileNotFoundError):
>       top_predictions = model.infer_image( fn_image )
E       Failed: DID NOT RAISE <class 'FileNotFoundError'>
```
With this test, our model's behavior under this error condition is locked in, since our clients will be expecting and checking for this exception so they can handle such situations properly.
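If we wanted to go a step further, pytest.raises can also hand back the exception object for inspection, so a test can check the error message as well as the type. A self-contained sketch (load_image here is a hypothetical stand-in for the file loading inside infer_image):

```python
import pytest

def load_image(path):
    # hypothetical stand-in for the file loading inside infer_image
    raise FileNotFoundError(f"No such file: {path}")

# pytest.raises also works outside a test run; 'as excinfo' captures
# the raised exception for further checks
with pytest.raises(FileNotFoundError) as excinfo:
    load_image('tests/test_files/not_here.jpg')

# the offending file name should appear in the error message
assert 'not_here.jpg' in str(excinfo.value)
```

Checking the message helps catch cases where the right exception type is raised but for the wrong reason.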
You can read up more about how to handle exceptions within tests in the pytest documentation.
Step 6: Why Instantiate the Model per Test?
Some of you might have noticed that at the start of both tests, we instantiate the model per test, i.e., model = EfficientNetInfer(). That might be ok for a small model like this one, but what if it's a much larger model, with 200M or even billions of parameters? Running tests would be pretty slow then.
In pytest, such shared initializations can be managed using "fixtures", so they can be easily shared between tests. In our case, we would create a model fixture at the top of our test file, and it gets passed into any tests that need it. A fixture is created via the @pytest.fixture decorator:
```python
@pytest.fixture(scope="session")
def model():
    effnet_infer_model = EfficientNetInfer()
    return effnet_infer_model
```
@pytest.fixture effectively creates a global variable with the name of the function, model in this example, and the option scope="session" says to create it once and reuse it, so our model is only loaded once. We can then simplify our two tests by passing model in as an argument.
Our first test now looks like this:
```python
def test_infer_dog_image_happy_path( model ):  # model is passed in
    fn_image = 'tests/test_files/dog.jpg'
    ...
```
and second would be:
```python
def test_infer_non_existent_image_expected_error( model ):
    fn_image = 'tests/test_files/not_here.jpg'
    ...
```
You can re-run the tests and both should pass as before, but now the model is loaded only once. Faster tests are always better!
Step 7: Test the Inner Workings Also
We have now added tests for the external-facing interface of our model, but we can add unit tests for the internals of our model as well. Why? They can help diagnose issues when something goes wrong, by showing whether only the external tests are failing, or the internal ones too. If it's the former, you can focus your investigation on the outer layers of the model. If the internal tests are failing, best to start there, since something is seriously amiss!
There are lots of other benefits to adding these unit tests, such as aiding refactoring and improving code structure. For now let's add a test for the inner EfficientNet model, doing a forward pass on an image of all zeros. We can check that the output shape is what we expect, and that the top class id is consistent over time (467). The latter is not a guarantee though, since we currently do not have control over when EfficientNet models may get updated. That's a topic saved for another post, but for now we can use this test as a proxy to detect that the internal model may have changed significantly from what we expect.
```python
def test_effnet_model_forward( model ):
    import torch

    effnet_model = model.effnet_model
    image_size = model.img_size
    batch_zero_tensor = torch.zeros(1, 3, image_size, image_size)
    with torch.no_grad():
        outputs = effnet_model( batch_zero_tensor )
    assert outputs.size() == (1, 1000)  # test output shape
    assert outputs.argmax() == 467     # top index is 467
```
When you run these tests you should see 3 periods, and a satisfying feeling that your model is now infinitely more tested than when we started!
======================== 3 passed, 1 warning in 0.87s =========================
Now that you have seen the process, I hope you agree that adding tests is not that difficult, and can even be satisfying. I do have to admit that personally, it took me a while before I came around to the importance of tests, having come from a slap-dash background of academic software development. Once you have one test (or three) in place, they become more and more natural to add as you continue to refine your model. Even if it stays at three tests, your ML project will be accepted as "good enough", instead of an unknown quantity that might blow up at any point.
I hope you can also see that the assertions we added to these tests are much easier for us (the ML side) to write, since we know the correct and expected behaviors of our own models better than anyone else. Had we handed the model to DevOps (or any other team really) to write tests, they'd have to spend a lot more time and energy trying to understand what this model does, and determining which aspects are important enough to write tests for (and which to skip). Adding these tests ourselves is a win for everyone.
So perhaps the next conversation Frank has with Jane would go something like this:
Frank (the ML person): Hey Jane, based on our last conversation, I looked into adding tests to my ML project and now it has unit tests. Does that help with getting it deployed?

Jane (the DevOps person): Yeah, most definitely! This means that we can build the package, pass the tests, and upload it via our continuous deployment pipeline. I'll have my team take a look and see if we can get the pipeline set up in the next sprint.

Frank: Sounds great! Looking forward to it.
Let’s find out what Jane comes back with in our next post.