Zheng Hao Tan

Initial Impressions of Microsoft Olive

Date published: August 25, 2023

I spent a good chunk of my time last night installing and experimenting with Microsoft Olive - a new tool for easy, hardware-aware optimization of machine learning models. Here are some of my thoughts.

What is Olive and what does it try to solve?

Serving (inference) your machine learning models efficiently and performantly on heterogeneous systems remains a huge pain point. To squeeze every last ounce of performance from your hardware, you need to apply optimization techniques such as quantization and operator fusion. As of this writing, these techniques are very much state-of-the-art: the field moves fast, techniques come and go as it remains an active research area, and they're complex to set up. How you apply them also varies across hardware architectures!
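
To make the quantization part concrete, here's a minimal sketch (plain NumPy, nothing Olive-specific) of symmetric INT8 post-training quantization, the kind of transform these toolchains automate for you:

    import numpy as np

    # Symmetric INT8 quantization: a single scale maps the largest magnitude to 127.
    # Real toolchains layer calibration data, per-channel scales, operator fusion,
    # and hardware-specific kernel selection on top of this basic idea.
    def quantize_int8(w: np.ndarray):
        scale = np.abs(w).max() / 127.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
        return q.astype(np.float32) * scale

    w = np.random.randn(256, 256).astype(np.float32)
    q, scale = quantize_int8(w)
    print("max abs rounding error:", np.abs(w - dequantize(q, scale)).max())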

Enter Olive, which, caveat emptor, does all of this for you using its ONNXRuntime backend. Its entry point is a JSON configuration file in which you specify a PyTorch or ONNX model, its name, a set of input tensor shapes, the desired target hardware, and the optimization pass(es) to apply to your model.
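
For a rough sense of what that looks like, here's a sketch of such a configuration written as a Python dict you could dump to JSON. The field names are illustrative and paraphrased from memory, so don't treat this as Olive's exact schema; consult the official docs for that.

    import json

    # Illustrative only: roughly the shape of an Olive-style config, with a model,
    # its input tensor shapes, the passes to run, and the hardware to target.
    olive_config = {
        "input_model": {
            "type": "PyTorchModel",
            "config": {
                "model_path": "resnet50.pt",  # placeholder model file
                "io_config": {
                    "input_names": ["input"],
                    "input_shapes": [[1, 3, 224, 224]],
                    "output_names": ["logits"],
                },
            },
        },
        "passes": {
            "convert": {"type": "OnnxConversion"},
            "quantize": {"type": "OnnxQuantization"},
        },
        "engine": {"target": "gpu"},  # desired target hardware
    }

    with open("olive_config.json", "w") as f:
        json.dump(olive_config, f, indent=2)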

Here's a disclaimer: I've actually tried building something like this before. I was a co-founder / CTO of Lexer, a YC-backed startup that built collaborative tools to help teams of ML engineers benchmark and design their model graphs. We released a Python SDK for this, and our UX differed from Olive's: users annotated a PyTorch nn.Module with a @Lexer(...) Python decorator instead of writing a JSON configuration file. We then printed all the inference statistics for you, and even suggested low-hanging-fruit optimizations you could apply to your model. [1]
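
Since that SDK isn't public anymore, here is only a hypothetical reconstruction of what the decorator-style UX felt like; the decorator name matches what we shipped, but the arguments and the toy profiler body below are made up for illustration.

    import time
    import torch
    import torch.nn as nn

    def Lexer(target_hardware: str, input_shape: tuple):
        # Hypothetical stand-in: attach a profile() helper to the annotated module
        # so benchmarking lives next to the model code instead of in a config file.
        def wrap(module_cls):
            class Profiled(module_cls):
                def profile(self):
                    x = torch.randn(*input_shape)
                    start = time.perf_counter()
                    with torch.no_grad():
                        self(x)
                    elapsed_ms = (time.perf_counter() - start) * 1e3
                    print(f"[{target_hardware}] forward pass: {elapsed_ms:.2f} ms")
            return Profiled
        return wrap

    @Lexer(target_hardware="cpu", input_shape=(1, 128))
    class TinyMLP(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

        def forward(self, x):
            return self.net(x)

    TinyMLP().profile()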

ONNXRuntime as a hard dependency

Olive has a hard dependency on ONNXRuntime, and after sifting through the codebase, I think it's not wrong to view it as a wrapper around it. I have my own thoughts on ONNXRuntime, but they're too long for a deep dive here, so I'll summarize them as follows: I'd like to think of this product not just as the industry's "first attempt at a high-level, framework-agnostic DL compiler", but also as an on-ramp for Microsoft Azure to win cloud ML business. Customers want to deploy their ML models to run efficiently on a variety of hardware targets, and ONNXRuntime, even with default settings, lets them do so with minimal friction.
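
That friction-free story mostly comes down to execution providers. Here's a minimal example using the onnxruntime Python API (the model path and input shape are placeholders): the same ONNX file runs on different hardware just by changing the provider list, with CPU as the fallback.

    import numpy as np
    import onnxruntime as ort

    # Ask for CUDA if it's available; ONNXRuntime falls back to CPU otherwise.
    session = ort.InferenceSession(
        "model.onnx",  # placeholder path
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )

    input_name = session.get_inputs()[0].name
    x = np.random.randn(1, 3, 224, 224).astype(np.float32)
    outputs = session.run(None, {input_name: x})
    print(outputs[0].shape)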

Olive's Competition in the ML Tooling Ecosystem

Olive doesn't have a direct competitor, but market forces and emerging user patterns pose a risk to its relevance. Let's take both top-down and bottom-up views.

In the top-down scenario, many deep learning frameworks have already started adding first-class support for model optimization workflows. Although Olive has simple, easy-to-use APIs, there's a possibility it's made redundant by the ML model optimization APIs already provided by PyTorch, unless you require the ONNX intermediate representation (IR) as part of your ML workflow. Here's a recent example of how to do INT8 quantization with PyTorch 2.0 export and TorchInductor. This ML tooling space moves fast.
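
As a point of comparison, PyTorch's long-standing eager-mode API already makes the simplest cases a near one-liner (the example above covers the newer PyTorch 2.0 export / TorchInductor path; this sketch deliberately uses the older dynamic quantization API instead):

    import torch
    import torch.nn as nn

    # Dynamic quantization: Linear weights stored as INT8, dequantized on the fly.
    model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()
    quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

    x = torch.randn(1, 512)
    print(quantized(x).shape)  # torch.Size([1, 10])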

From a bottom-up point of view, many of these optimization techniques require understanding the hardware architecture you're running on, since it works hand in hand with the chosen hyperparameters / optimization passes to determine the outcome of your model optimization. That's what the "hardware-aware" in hardware-aware optimization means.

Defining an exhaustive list of these configurations in a high-level representation like a Python dict or a JSON configuration file is tedious. This is true even across the various Nvidia architectures [2], let alone the dozens of other hardware targets. Additionally, lineage tracking may be difficult when working off a single configuration file, especially after productionization.
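
Here's a deliberately made-up excerpt of what that exhaustive list starts to look like; every value below is illustrative, and this only covers three Nvidia generations out of the dozens of targets you'd actually need:

    # Illustrative only: per-target knobs multiply quickly across architectures.
    HARDWARE_CONFIGS = {
        "nvidia-t4":   {"precision": "int8", "passes": ["quantize"]},
        "nvidia-a100": {"precision": "int8", "passes": ["fuse_gemm", "quantize"]},
        "nvidia-h100": {"precision": "fp8",  "passes": ["fuse_gemm", "quantize"]},
        # ...and then AMD, Intel, ARM, TPUs, each with their own knobs.
    }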

Parting Thoughts

As a developer, I feel that Olive plays a more strategic role for Microsoft Azure than it is a be-all, end-all solution for the whole industry. But this is the genius of Microsoft: they're hyper-focused on ML cloud distribution and have tons of resources to build a community around it. For example, look at ONNX. Microsoft continues to invest heavily in ONNX, as it and ONNXRuntime are still the core on-ramps to their cloud (Azure) business. Here's ONNX Script - another tool they built that lets you do ONNX graph rewriting in Python!
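
For flavor, here's roughly what authoring an ONNX function in Python with ONNX Script looks like, based on the examples in its repository; I haven't verified this exact snippet against the latest release, so treat the opset import, the type annotations, and the to_model_proto() call as assumptions rather than documentation:

    import onnx
    from onnxscript import FLOAT, script
    from onnxscript import opset18 as op

    # An erf-based GELU written as plain Python and lowered to standard ONNX ops.
    @script()
    def gelu(X: FLOAT[None]) -> FLOAT[None]:
        half = op.CastLike(0.5, X)
        one = op.CastLike(1.0, X)
        sqrt2 = op.CastLike(1.4142135, X)
        return half * X * (one + op.Erf(X / sqrt2))

    onnx.save(gelu.to_model_proto(), "gelu.onnx")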

I applaud Microsoft for taking the first stabs at a problem I'm deeply passionate about: tools that are critical for enabling high-performance ML workloads, everywhere. It's a step in the right direction, albeit one that doesn't yet fully address the systems-level performance / engineering workflow bottleneck.

I think the industry needs to focus more on the toolchains that sit right between the deep learning frameworks (PyTorch, TensorFlow, etc.) and the hardware vendors (Nvidia, AMD, Intel, ARM, Google TPU, etc.). Deep learning compilers will ultimately dictate whether all AI workloads run only on Nvidia hardware or on several other chips as well. [3]

Footnotes

  • [1] - The rationale for the decorator approach was that these "potentially-made-to-be-production" workflows live alongside your model code. Just like your API documentation (think docstrings, etc.), they're version controlled as the model code evolves over time. Unlike separate asset files, they're meant to be "atomic" with respect to model code changes.
  • [2] - At the time of this writing, Hopper is the most advanced architecture in the Nvidia stack, and with it come different peak memory bandwidths, streaming multiprocessor (SM) counts, support for lower bit-width precisions across kernels, etc.
  • [3] - There are two leading deep learning compiler projects in the space, in my opinion: LLVM MLIR and Apache TVM, backed by Modular and OctoML respectively. Side note: MLIR stands for multi-level intermediate representation, not machine learning, but it has found incredible adoption in ML tooling.