Intro

In my previous post, we covered the steps for creating a package for an example ML model. It ended with a hypothetical conversation with Jane, the DevOps person, suggesting that the package may be just what's needed to get the project launched into production.

However, things in real life rarely go as smoothly as in stories, right? Most likely, Jane comes back with feedback after reviewing the state of the package that Frank prepared. It goes something like this:

Jane (the DevOps person):  Hey Frank, we reviewed the package you put together
                           and it's quite useful, but it's not ready to be
                           deployed yet.

   Frank (the ML person):  Oh, ok. What's missing?

                    Jane:  Well, for the package to be built automatically,
                           it needs to have tests and have them pass
                           before it can be deployed.  We didn't find the
                           tests so we need to add them.

                   Frank:  (surprised) But the model works! Did you try the
                           sample code I added to the Readme page?

                    Jane:  (sighs) Yeah, that was helpful, but we need to
                           automate the testing as part of the deployment
                           pipeline, not copy-and-paste code manually. So we
                           need to convert that sample code into tests
                           before we can release it into production.

                   Frank:  (not this again) How long would that take?

                    Jane:  (frowns) Let's take this offline. We have a lot on
                           our plates right now so we just need more time.

                   Frank:  (unhappy again)...

Is the DevOps side throwing down more gauntlets? Do ML folks now need to put on QA hats and write tests, in addition to DevOps hats?

The Lowdown

Having tests for our ML models is insurance that protects us, the ML side, from being unnecessarily pulled into firefighting mode when something goes wrong in production. To put it another way, the more tests a component has, the less likely it is to become a scapegoat when something breaks.

Besides, writing tests isn't that much work, especially once a project is set up with the basics, similar to the Python packaging we went over. Having tests not only helps with getting ML models deployed; it will also be very helpful, if not essential, for future model refinements, so we can be sure we didn't break something others depend on.

Another big benefit of tests is that they help guard against nasty surprises when dependencies change, usually components external to your ML model: a new PyTorch or Pandas version, a newer CUDA version, or maybe even security patches or OS upgrades. It'd save a great deal of time and sanity to convert our sample code into a test (it takes all of 10 minutes), so no one needs to manually verify that the ML model is still working correctly whenever someone updates a package somewhere in the company. Imagine if your company has tens or even hundreds of ML models, and the time it'd take to manually test every change! Best to capture the knowledge about our model's behavior in tests so we don't need to worry about it later.

Let’s get started.

Note: The resulting code for this ML project with tests can be found here.

Step 0: The Plan

Software testing is a huge topic on its own, if not an entire industry, so there is a lot of knowledge to build upon. Testing for ML models isn't as mature in comparison, but a little goes a long way. The best part is that we can add more tests as a model matures, so it's not one big lift but something built up incrementally over time.

For this initial phase, we will keep the tests as simple as possible, capturing only the most essential functionality of our ML model. The goal is for the tests to tell us that the ML model is working correctly, and also to verify that it fails in an expected way when it should fail. Think of these tests as a way to ensure that the interface we present to the external world is preserved every time it's deployed.

So the plan in pseudocode would be:

import franks_ml_model

add test_1:
 - instantiate the model
 - run prediction on a test image
 - check the prediction results are as expected

add test_2:
 - instantiate the model
 - run prediction with an invalid input
 - check that the expected error is returned or thrown

add test_3:
 - instantiate the model
 - run prediction on the inner PyTorch model
 - check the raw outputs are as expected
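
In pytest terms, each item in the plan becomes a plain function whose name starts with test_. A rough skeleton of the file we will build up (the function names match the tests we add in the steps below; the bodies are filled in as we go):

# tests/test_infer.py -- skeleton only; real bodies are added in the steps below

def test_infer_dog_image_happy_path():
    ...  # instantiate, infer on a known image, assert on the top prediction

def test_infer_non_existent_image_expected_error():
    ...  # infer on a missing file, expect a FileNotFoundError

def test_effnet_model_forward():
    ...  # run the inner PyTorch model directly, assert on the raw outputs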

Step 1: Setting up the Project for Tests

This is quite easy, since the Cookiecutter template we used last time already set up the basics. The first part is a requirements_dev.txt file, which lists any additional package dependencies needed for development and testing. For packages that are only used during testing and not in production, it's best to add them to the _dev file so the production environment stays as slim and minimal as possible.

We will continue to build up Frank’s ML project for this tutorial, by first reviewing the contents of the requirements_dev.txt file:

-r requirements.txt

bumpversion
pytest

The first line includes the packages needed by all environments, and the last line is pytest, the testing framework we'll be using (though you're certainly welcome to explore others, like unittest). We won't need bumpversion for now, so ignore it until we come back to it in a future post.

The only other thing you'd do is add a tests directory in the project's root, separate from the franks_ml_model directory (the "box" where we placed the contents of our project). This way the built package stays as compact as possible, without the tests and test files we'll be adding. Your project folder should look like the following:

├── README.md
├── franks_ml_model
//
│   ├── infer.py
│   ├── labels_map.txt
│   ├── model.py
│   └── utils.py
├── requirements.txt
├── requirements_dev.txt
//
└── tests
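
If the template didn't already create them, the tests directory and a folder for test assets (like the dog image we'll use shortly) can be added from the project root:

# create the tests directory plus a folder for test assets such as dog.jpg
mkdir -p tests/test_files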

We’re ready to add our first test!

Step 2: Adding the First Test, the Happy Path

The first test will capture the expected behavior of our model when things are working properly. One thing to note before we start: these types of tests should be relatively quick to run, on the order of a minute or less per test, since large projects may end up with hundreds of tests or more. The goal is to run all the tests before each deployment, so we don't want them to take hours.

Back to our first test. We will effectively turn the sample code on the README page into an automated test, so no one needs to manually run the code and read off the top prediction as "golden retriever" every time something changes. Even better, when something goes wrong, such as a missing or broken dependency getting installed, the deployment process stops and whoever broke the environment gets to fix it.

Let’s create a file named tests/test_infer.py, and import the necessary dependencies:

import pytest

from franks_ml_model import EfficientNetInfer

The body of our first test is pretty much the same as the sample code, with some minor changes. We define each test as a function prefixed with test_, which pytest picks up and runs as an individual test. Let's give our first test a descriptive function name so we can more easily tell which tests succeeded and which failed.

def test_infer_dog_image_happy_path():
    model = EfficientNetInfer()
    fn_image = 'tests/test_files/dog.jpg'
    top_predictions = model.infer_image( fn_image )

This looks very familiar for a good reason: we want our tests to exercise the model through the same interfaces clients would use. The only thing we changed here is which test image is used. We want a known image with known predictions to validate that the model is working as expected.

For this tutorial we are using the same image of a dog (source unknown, sorry), a golden retriever. It's added to the tests/test_files folder so the test can run locally.

As we saw from the previous packaging post, the expected outputs should look like the following list:

[
  [207, 'golden retriever', 0.5610832571983337],
  [213, 'Irish setter, red setter', 0.22866328060626984],
  ...
]

In the second half of our test, we can check the returned top_predictions against what we expect. We could check every element of the list, but since the probabilities of ML models can change, we'll start simple and just check the first element as a sanity check. We can do these checks with simple assert statements (although there are other options as well, like hamcrest).

For this example image, we check that the top prediction has class id 207, and that its probability is over 0.5. One could also check that the label is the string golden retriever, but we will leave that out for now. Lastly, we check that we got 5 predictions, since the default topk for infer_image() is 5.

    assert top_predictions[0][0] == 207 # class_id for golden retriever
    assert top_predictions[0][2] > 0.5
    assert len(top_predictions) == 5    # default for topk is 5

Pretty simple, right? The idea is that if any of the assert statements evaluates to False, this test will fail at that statement (and not go further). We will see how that works in a bit.
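
As a side note, if you later want to pin the probability more tightly without demanding exact floating-point equality, pytest ships with pytest.approx for tolerance-based comparisons. A minimal sketch, based on the probability we saw above (the tolerance is my own choice):

    # allow the probability to drift a little between library or weight versions
    assert top_predictions[0][2] == pytest.approx(0.56, abs=0.05)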

Step 3: Running the Test

Now that we have our first test, let's run it and see how it goes. Before doing so, first run

# in case you haven't done so already
source ~/.virtualenvs/test_franks_model/bin/activate
pip install -r requirements_dev.txt

or you will get a message like python: No module named pytest when trying to run the test.

To invoke the test, run this from the project root:

python -m pytest tests/test_infer.py

and if the test ran correctly, you should see something similar to the following:

============================= test session starts ==============================
platform darwin -- Python 3.7.6, pytest-5.3.5, py-1.8.1, pluggy-0.13.1
rootdir: /Users/gerald/projects/franks_ml_model
collected 1 item                                                               

tests/test_infer.py .                                                    [100%]

========================= 1 passed, 1 warning in 0.76s =========================

The period . is what you want to see next to test_infer.py (subtle, but it becomes more obvious as we add more tests), along with no failures at the end. You can also check that the exit code is zero, which is what automated build systems would check, via

echo $?
0
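
For illustration, a deployment script could gate the build on that exit code. A hypothetical sketch, assuming the packaging setup from the previous post:

# hypothetical gate: only build the package if the test suite passes
python -m pytest tests/ && python setup.py sdist bdist_wheel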

Step 4: Break the Test

This may sound strange, but we want to intentionally make tests fail to ensure they are actually testing the right parts of the code. For now we just want to see what it'd look like if a test we wrote is incorrect, by adding an assert that expects the second prediction to have class id 0, which should never happen:

    assert top_predictions[1][0] == 0   # incorrect expectation

Let’s run the test again and see what happens:

============================= test session starts ==============================
collected 1 item                                                               

tests/test_infer.py F                                                    [100%]

=================================== FAILURES ===================================
_______________________ test_infer_dog_image_happy_path ________________________

    def test_infer_dog_image_happy_path():
        model = EfficientNetInfer()
        fn_image = 'tests/test_files/dog.jpg'
        top_predictions = model.infer_image( fn_image )

        assert top_predictions[0][0] == 207 # class_id for golden retriever
        assert top_predictions[0][2] > 0.5
        assert len(top_predictions) == 5    # default for topk is 5
>       assert top_predictions[1][0] == 0   # incorrect expectation
E       assert 213 == 0

tests/test_infer.py:14: AssertionError
----------------------------- Captured stdout call -----------------------------
Loaded pretrained weights for efficientnet-b0
========================= 1 failed, 1 warning in 0.88s =========================

Failures are quite obvious, and that's good. As part of showing the context of which assert statement failed, pytest shows the value that was actually observed (213 in this case). Conveniently, pytest also captures stdout and stderr for failed tests to help with debugging, as well as the stack trace where applicable.
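
When debugging a larger suite, a couple of standard pytest flags help narrow things down, for example:

# -v lists each test by name, -x stops at the first failure
python -m pytest -v -x tests/test_infer.py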

Step 5: Expecting the Unexpected: Adding Tests for Error Cases

Just as it's important to know what to expect when the ML model is working correctly, it's equally important to capture the behavior when errors occur. For our simple interface, we want to let the client know what to expect when the image file passed in cannot be loaded. We could return no error and just a None from infer_image, but that would mask what the problem was. Instead, it's better to raise an exception, a FileNotFoundError in this case.

With pytest, you capture expected exceptions with the with pytest.raises(<expected exception>) construct, so the test passes only if the <expected exception> is actually raised. If the call does not raise the expected exception, pytest reports a failure instead, letting the tester know this error interface is broken. Here's the actual test:

def test_infer_non_existent_image_expected_error():
    model = EfficientNetInfer()
    fn_image = 'tests/test_files/not_here.jpg'

    with pytest.raises(FileNotFoundError):
        top_predictions = model.infer_image( fn_image )

and when you run the tests now, there will be two dots next to test_infer.py. You can experiment with changing fn_image to the dog image in the first test and see how pytest reports the error, along the lines of

        with pytest.raises(FileNotFoundError):
>           top_predictions = model.infer_image( fn_image )
E           Failed: DID NOT RAISE <class 'FileNotFoundError'>

With this test, we can ensure that our model's behavior under this error condition is guaranteed, since our clients will be expecting and checking for this exception so they can handle such situations properly. You can read more about handling exceptions within pytest here.
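
As an aside, if you also want to assert on the error details, pytest.raises can hand back the captured exception via as. A small variation on the test above (the message check assumes the underlying error includes the file name, which may not hold for every implementation):

    with pytest.raises(FileNotFoundError) as exc_info:
        model.infer_image( fn_image )
    # exc_info.value is the raised exception instance
    assert 'not_here.jpg' in str(exc_info.value)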

Step 6: Why Instantiate the Model per Test?

Some of you might have noticed that at the start of both tests, we instantiate the model per test, i.e., model = EfficientNetInfer(). That might be ok for a small model like this one, but what if it's a much larger model, with 200M or even billions of parameters? Running tests would get pretty slow.

With pytest, such shared initializations can be managed with "fixtures", so they can be easily shared between tests. In our case, we create a model fixture at the top of our test file and pass it into any tests that need it. A fixture is created via the pytest.fixture decorator:

@pytest.fixture(scope="session")
def model():
    effnet_infer_model = EfficientNetInfer()
    return effnet_infer_model

The decorator @pytest.fixture effectively creates a global variable with the name of the function, model in this example, and the option scope="session" says to create it once and reuse it, so our model is only loaded once. We’d then simplify our two tests by passing in this model variable.

Our first test now looks like this:

def test_infer_dog_image_happy_path( model ):
    # model is passed in
    fn_image = 'tests/test_files/dog.jpg'
    ...

and second would be:

def test_infer_non_existent_image_expected_error( model ):
    fn_image = 'tests/test_files/not_here.jpg'
    ...

You can re-run the tests and both should pass as before, but now the model is loaded only once. Faster tests are always better!
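
If you want to see where the time went and confirm the model is loaded just once, pytest can report the slowest test phases, including setup:

# show the 10 slowest test phases (setup, call, teardown)
python -m pytest --durations=10 tests/test_infer.py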

Step 7: Test the Inner Workings Also

We have now added tests for the external-facing interfaces of our model, but we can add unit tests for its internals as well. Why? They can help diagnose issues when something goes wrong, by showing whether only the external tests are failing or the internal ones too. If it's the former, you can focus your investigation on the outer layers of the model. If the internal tests are failing, best to start there, since something is seriously amiss!

There are lots of other benefits to adding these unit tests, such as aiding refactoring and improving code structure. For now, let's add a test that runs the internal EfficientNet model with a forward pass on an image of all zeros.

We can check that the output shape is what we expect, and that the top class id stays consistent over time (467). The latter isn't a guarantee though, since we currently have no control over when EfficientNet models might get updated. That's a topic saved for another post, but for now we can use this test as a proxy to detect that the internal model may have changed significantly from what we expect.

def test_effnet_model_forward( model ):
    import torch

    effnet_model = model.effnet_model
    image_size = model.img_size
    batch_zero_tensor = torch.zeros(1, 3, image_size, image_size)

    with torch.no_grad():
        outputs = effnet_model( batch_zero_tensor )

    assert outputs.size() == (1, 1000) # test output shape
    assert outputs.argmax() == 467 # top index is 467

When you run these tests you should see 3 periods, along with the satisfying feeling that your model is now infinitely more tested than when we started!

======================== 3 passed, 1 warning in 0.87s =========================

Conclusion

Now that you have seen the process, I hope you agree that adding tests is not that difficult, and can even be satisfying. I have to admit that, personally, it took me a while to come around to the importance of tests, having come from a slap-dash background of academic software development. Once you have one test (or three) in place, they become more and more natural to add as you continue to refine your model. Even if it stays at three tests, your ML project will be accepted as "good enough", instead of being an unknown quantity that might blow up at any point.

I hope you can also see that the assertions we added to these tests are much easier for us (the ML side) to write, since we know the correct and expected behaviors of our own models better than anyone else. Had we handed the model to DevOps (or any other team, really) to write tests, they'd have to spend a lot more time and energy trying to understand what the model does, and to determine which aspects are important enough to test (and which aren't worth the bother). Adding these tests ourselves is a win for everyone.

So perhaps the next conversation Frank has with Jane would go something like this:

   Frank (the ML person):  Hey Jane, based on our last conversation, I
                           looked into adding tests to my ML project
                           and now it has unit tests. Does that
                           help with getting it deployed?

Jane (the DevOps person):  Yeah, most definitely! This means that we can
                           build the package, pass the tests, and upload
                           it via our continuous deployment pipeline. I'll
                           have my team take a look and see if we can get
                           the pipeline set up in the next sprint.

                   Frank:  Sounds great! Looking forward to it.

Let’s find out what Jane comes back with in our next post.

References