Automate testing of TensorFlow Lite model implementation

Making sure that your ML model works correctly on a mobile app (part 2)

This is the 2nd article about testing machine learning models created for mobile. In the previous post – Testing TensorFlow Lite image classification model – we built a notebook that exports a TensorFlow model to TensorFlow Lite and compares them side by side. But because the conversion process is mostly automatic, there are not many places where something can break. We can find differences between quantized and non-quantized models or ensure that TensorFlow Lite works similarly to TensorFlow, but the real issues can come up somewhere else – in the client-side implementation.
In this article, I will suggest some solutions for testing a TensorFlow Lite model with Android instrumentation tests.

Testing automation

Usually, the very first implementations rely on manual testing. If we have a model like MobileNet, we can put it into the app, run it on a device, and see the results just by pointing the camera at objects around us. Issues like a rotated bitmap or bad cropping will result in obviously wrong classifications.

Later it won’t be that easy. We will add more complex image preprocessing (e.g. in the MNIST classifier blog post, colors were inverted and the contrast was improved). We will struggle with bitmap-to-ByteBuffer conversion (should it be a [0, 1] value range for each pixel, or [-1, 1], and is the input YUV or RGB?). And what if we update the model with a newer, improved version? How do we make sure we haven’t broken anything?
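To make the ambiguity concrete, here is a minimal sketch of the kind of bitmap-to-ByteBuffer conversion meant above (a hypothetical helper, not code from the app; the normalization choice is exactly the detail that tends to break silently):

```kotlin
import android.graphics.Bitmap
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Hypothetical helper: packs a 224x224 ARGB_8888 bitmap into the float buffer
// a MobileNet-style model expects. Whether pixels belong in [0, 1] or [-1, 1]
// depends on how the model was trained, and a mismatch produces no error at all.
fun convertBitmapToByteBuffer(bitmap: Bitmap, inputSize: Int = 224): ByteBuffer {
    val buffer = ByteBuffer.allocateDirect(4 * inputSize * inputSize * 3)
        .order(ByteOrder.nativeOrder())
    val pixels = IntArray(inputSize * inputSize)
    bitmap.getPixels(pixels, 0, inputSize, 0, 0, inputSize, inputSize)
    for (pixel in pixels) {
        // Extract R, G, B and normalize to [0, 1]; use (value / 127.5f) - 1f for [-1, 1]
        buffer.putFloat(((pixel shr 16) and 0xFF) / 255f)
        buffer.putFloat(((pixel shr 8) and 0xFF) / 255f)
        buffer.putFloat((pixel and 0xFF) / 255f)
    }
    return buffer
}
```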

Android instrumented tests for TF Lite model

Testing TensorFlow Lite models on Android, especially on the emulator, isn’t trivial. Image classification can be a multi-step process, similar to this:

  1. Capture image or frame from device’s camera.
  2. Preprocess bitmap (cropping, rotating, transformation, or color enhancements).
  3. Convert bitmap to the format supported by a machine learning model.
  4. Run inference.
  5. Interpret results and display them.

Simple unit tests could help in some places, but to make sure that the entire process works correctly, this flow requires something higher level. And the perfect candidate for it is a UI testing framework – Espresso.

Espresso is designed for instrumented tests that run on devices or emulators, so they can fully benefit from Android APIs. This is what we need here – access to assets (to load the tflite model) or access to the UI (to validate that results are displayed correctly).
Here we will build some of those tests to check whether the inference process in the mobile app gives the same results as the one in the Colab notebook.
But before that, let’s take a look at some limitations.

Instrumented tests limitations

At the time of writing, there were some limitations in how Espresso and Android instrumented tests work. Here are some of them:

It is not easy to mock the camera preview. Recently, thanks to augmented reality support, it has become possible to simulate a virtual scene on the Android emulator, but it can neither be done automatically nor is it easy to put images there to get fully predictable results.

The Virtual Scene in the Android emulator can be helpful when validating general models like MobileNet

It is not easy to add debug views or manage test resources without including them in the app’s source code. Here, we will add images from the validation batch and build a custom Activity to preview results. It’s a bit tricky to do this if we don’t want to add those resources and that code to the app itself, but keep them in the androidTest/ directory instead.
Sneak peek: we will need to create a separate Android module for the machine learning model that we want to test.

Data for client-side testing

Before we build an Android application, we will extend the Colab notebook code used for the TensorFlow and TensorFlow Lite comparison. It should export the validation data batch together with the expected results, so we can compare them later in the mobile app.

We would like to have an archive with 32 images (the default batch size) that are self-describing. We will use the filename format n{}_true{}_pred{}.jpg, where the first number is the image index, the second is the true label index, and the last is the label index predicted by the current tflite model (e.g. n0_true3_pred3.jpg).

Export validation data batch to archive

It is important to export the images in full quality, so we can be sure that the tflite model in the Colab notebook and the one in the mobile app use the same input data.

The generated images will be put into the assets directory of the testing code for the Android app.

Espresso tests for TensorFlow Lite model

Preview of Espresso test checking validation data batch.

In the blog post about testing a TFLite model, we built a notebook that creates a TensorFlow Lite model for flower classification. The code is available in the Github repository: TFLite-Tester.
Now we will add an Android project that implements it, so we can do classification with the device’s camera. After that we will add some Espresso tests to automate the QA process.

Example Android app will be composed of two modules:

  • app – application module with camera preview (using CameraView library),
  • ml_model – Android library module with the implementation of the machine learning model created in Colab notebook.
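For illustration, wiring these two modules together could look roughly like this (a sketch using the Gradle Kotlin DSL; the actual project may use Groovy build scripts):

```kotlin
// settings.gradle.kts - both modules belong to the same project
include(":app", ":ml_model")

// app/build.gradle.kts - the app module depends on the library that wraps the model
dependencies {
    implementation(project(":ml_model"))
}
```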

Application code

The application code is simple – there are two classes: MainActivity, which shows the camera preview and classification results:

MainActivity shows camera preview and classification results.

and ClassificationFrameProcessor, which is the interface between the camera preview and the classification logic in the ml_model library.

ClassificationFrameProcessor is just the interface between the camera preview screen and the machine learning classification logic in a separate module.
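As a rough, illustrative sketch (the real class adapts frames delivered by the CameraView library; here the frame is assumed to arrive already converted to a Bitmap):

```kotlin
import android.graphics.Bitmap

// Illustrative only: delegates every camera frame to the classifier and passes
// the results back to whoever renders them (e.g. MainActivity).
class ClassificationFrameProcessor(
    private val classificator: ModelClassificator,
    private val onResults: (List<Pair<String, Float>>) -> Unit
) {
    fun onFrame(bitmap: Bitmap) {
        onResults(classificator.process(bitmap))
    }
}
```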

Machine Learning model library

The code in the ml_model Android library is a bit more complex. It loads the *.tflite file and model labels, transforms the bitmap into the proper format, and runs the inference process on the TF Lite model.
Most of the logic resides in the ModelClassificator class.

ModelClassificator loads the model and labels, runs the inference process, and interprets its results.
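A minimal sketch of what such a classifier might look like with the TensorFlow Lite Interpreter API (class and method names follow the article; asset file names, input size and output shape are assumptions):

```kotlin
import android.content.Context
import android.graphics.Bitmap
import org.tensorflow.lite.Interpreter
import java.io.FileInputStream
import java.nio.channels.FileChannel

class ModelClassificator(context: Context) {

    private val interpreter: Interpreter
    private val labels: List<String>

    init {
        // Memory-map the .tflite file shipped in the module's assets
        val fd = context.assets.openFd("converted_model.tflite")
        val model = FileInputStream(fd.fileDescriptor).channel
            .map(FileChannel.MapMode.READ_ONLY, fd.startOffset, fd.declaredLength)
        interpreter = Interpreter(model)
        // One label per line, in the same order as the model's output vector
        labels = context.assets.open("labels.txt").bufferedReader().readLines()
    }

    // Preprocess the bitmap, run inference, and pair every label with its score
    fun process(bitmap: Bitmap): List<Pair<String, Float>> {
        val scaled = Bitmap.createScaledBitmap(bitmap, INPUT_SIZE, INPUT_SIZE, true)
        val input = convertBitmapToByteBuffer(scaled, INPUT_SIZE) // see the conversion sketch earlier
        val output = Array(1) { FloatArray(labels.size) }
        interpreter.run(input, output)
        return labels.zip(output[0].toList()).sortedByDescending { it.second }
    }

    companion object {
        private const val INPUT_SIZE = 224
    }
}
```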

We won’t go into details here. If you want to learn more, check those articles:

If we run our app, it will classify the camera preview in real time.

Instrumentation tests code

Important notice: as mentioned in the “limitations” paragraph, the testing code structure described below is possible only in an Android library project. The reason is the separate Activity used for debugging and the AndroidManifest.xml file.
In a standard Android app module, the manifest already exists in the main source set, so it cannot simply be replaced by one coming from the androidTest/ directory.

In the androidTest/ directory we will recreate the entire Android project structure, which includes:

  • assets – for validation data batch,
  • java – code for tests and debugging/preview,
  • AndroidManifest.xml – preview activity needs to be registered here,
  • res – all resources required by Activity and AndroidManifest.
Project structure for ml_model library test

Now we will build ModelTestActivity – an Activity for visualizing the UI test, but also for giving Espresso the possibility to assert some layout logic.

ModelTestActivity is not a part of the application code. It resides in the androidTest directory and won’t be included in the client app.

ModelTestActivity has a ModelClassificator instance, so it can run the inference process on its own (the classifyImage() method). Other methods like setImagePreview() or showClassificationResults() are here only to visualize the testing process.
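A simplified sketch of that Activity (layout, view ids and the ResultsUtils.resultsToStr() signature are hypothetical; only the shape of the API matters here):

```kotlin
import android.app.Activity
import android.graphics.Bitmap
import android.os.Bundle
import android.widget.ImageView
import android.widget.TextView

class ModelTestActivity : Activity() {

    lateinit var modelClassificator: ModelClassificator
        private set

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContentView(R.layout.activity_model_test) // layout lives under androidTest/res
        modelClassificator = ModelClassificator(this)
    }

    // Runs inference and shows both the input image and the readable results on screen
    fun classifyImage(bitmap: Bitmap) {
        setImagePreview(bitmap)
        val results = modelClassificator.process(bitmap)
        showClassificationResults(ResultsUtils.resultsToStr(results))
    }

    fun setImagePreview(bitmap: Bitmap) {
        findViewById<ImageView>(R.id.image_preview).setImageBitmap(bitmap)
    }

    fun showClassificationResults(results: String) {
        findViewById<TextView>(R.id.results_text).text = results
    }
}
```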

Tests

Let’s write some tests now. First, we would like to see if the inference process in our ModelTestActivity works and shows the desired results. To do this, we will try to classify the tulips image saved in androidTest/assets/tulip.jpg.

Tulips image taken from the TensorFlow flower_photos dataset.
MLModelTest contains instrumentation tests for the TensorFlow Lite model.

The code looks simple, but there are some things to notice. First, we use ModelTestActivity, which is part of the testing APK (not the one under test).
It is also worth understanding what exactly is tested here. It’s not the UI, but ModelClassificator.process() and the logic behind it (bitmap preprocessing, the inference process), as well as ResultsUtils.resultsToStr(), the method that makes inference results human-friendly.
While this test gives us pretty good insight into the classification flow, it doesn’t check model accuracy very well.
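For reference, such a test might be sketched like this (the view id, expected label and ActivityScenario usage are illustrative; the real test lives in MLModelTest in the repository):

```kotlin
import android.graphics.BitmapFactory
import androidx.test.core.app.ActivityScenario
import androidx.test.espresso.Espresso.onView
import androidx.test.espresso.assertion.ViewAssertions.matches
import androidx.test.espresso.matcher.ViewMatchers.withId
import androidx.test.espresso.matcher.ViewMatchers.withText
import androidx.test.ext.junit.runners.AndroidJUnit4
import androidx.test.platform.app.InstrumentationRegistry
import org.hamcrest.Matchers.containsString
import org.junit.Test
import org.junit.runner.RunWith

@RunWith(AndroidJUnit4::class)
class MLModelTest {

    @Test
    fun shouldClassifyTulipImage() {
        // The image sits in androidTest/assets, so it comes from the *test* APK's context
        val testContext = InstrumentationRegistry.getInstrumentation().context
        val tulip = testContext.assets.open("tulip.jpg").use { BitmapFactory.decodeStream(it) }

        // ModelTestActivity is also part of the test APK, registered in androidTest/AndroidManifest.xml
        ActivityScenario.launch(ModelTestActivity::class.java).onActivity { activity ->
            activity.classifyImage(tulip)
        }

        // Assert on the rendered results; the id and label here are illustrative
        onView(withId(R.id.results_text))
            .check(matches(withText(containsString("tulips"))))
    }
}
```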


Let’s create another one – a test that will check the model against the validation data batch. It is worth mentioning that the Activity isn’t needed here at all; we will use it just to preview the testing process.

MLModelTest checking validation data batch for our TFLite model

This test assumes that there are images in the androidTest/assets/val_batch/ directory.
Those are the images that we created in the Colab notebook:

val_batch contains 32 images from the validation data batch.

ModelClassificator runs the inference process on each image, and the results are put into the batchClassificationResults map. Based on the filenames we build another map – batchClassificationExpectedResults – that contains the expected results (the ones coming from the inference process run in the Colab notebook).
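As a sketch, recovering the expected label from such a filename could look like this (the regex and helper name are illustrative; the real implementation is in the repository):

```kotlin
// File names follow the n{}_true{}_pred{}.jpg convention exported from the notebook,
// e.g. "n3_true2_pred2.jpg" means the notebook predicted label index 2 for image 3.
private val FILE_NAME_PATTERN = Regex("""n(\d+)_true(\d+)_pred(\d+)\.jpg""")

fun expectedResultsFor(fileNames: List<String>): Map<String, Int> =
    fileNames.associateWith { name ->
        val (_, _, predicted) = FILE_NAME_PATTERN.find(name)!!.destructured
        predicted.toInt()
    }
```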
At the end we use Truth – a library for performing more complex assertions in Java and Android code (kudos to Dominik for showing me that!). When something goes wrong with the test, here is an example of its output:

Truth performs complex assertions, like side-by-side map comparison.
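In code, that side-by-side comparison can boil down to a single assertion inside the test body (map names follow the article; assuming Truth is on the androidTest classpath):

```kotlin
import com.google.common.truth.Truth.assertThat

// One assertion compares the whole batch: every file name must map to the same
// label in the on-device results as in the expected results exported from Colab.
assertThat(batchClassificationResults)
    .containsExactlyEntriesIn(batchClassificationExpectedResults)
```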

Side note – do you know how this output came up? When implementing the model, I used a different value range in the byte buffer ([-1, 1] instead of [0, 1]). Only two images out of 32 produced different classification results. What is the chance that you would have caught that in manual tests?

Know what is tested

The tests above cover a big part of the classification process, but not everything. Here are the operations checked by our tests (happy paths only):

  • ❌ Getting frame from camera preview,
  • ❌ Converting camera frame (specific for the library we use) to bitmap,
  • ✅ Bitmap preprocessing and transforming it to model’s input data,
  • ✅ Running inference,
  • ✅ Reading classification results,
  • ✅ Presenting them.

So as you can see, we have something, but it is definitely far from perfect. Is it enough? That’s usually up to us. But at least we should be aware of what exactly is and isn’t tested.

I hope this article brings some inspiration for making the QA process for TF Lite models more automated.

Source code, references

Source code for this blog post (the Colab notebook and the mobile application) is available on Github: https://github.com/frogermcs/TFLite-Tester

The notebook with the entire code presented in this post can be run by clicking the button below:

If you want to know how to compare your TensorFlow Lite model with its TensorFlow implementation, check this blog post:


Thanks for reading! 🙂
Please share your feedback below. 👇
