Today, I want to wear my software archeology hat and share with you one story about the AI efforts at Microsoft and how Microsoft built its open-source high-performance AI runtime that is saving the company time and money.
A couple of years ago, I decided to create .NET bindings for TensorFlow, and later PyTorch, to join in all the fun that everyone is having with AI. In the past, creating bindings has been a soothing exercise, one that I have used both to learn new frameworks and to learn how people have solved different problems. In this case, I read my share of articles and tutorials, yet something was missing.
It turns out that when it comes to deep learning, binding APIs is just not enough. My coworker Jim Radigan described it better – I had a “cocktail party level of understanding”, which was enough to follow along with the news, but not enough to actually grasp it or solve interesting problems with it.
So last year, I took Jeremy Howard’s fantastic https://Fast.AI course and things finally clicked for me. The course gave me a much-needed perspective and I finally understood how all the pieces worked (I strongly recommend this course for anyone wanting to get into AI).
In the meantime, teams at Microsoft had embraced AI for all sorts of problems, and the work that Microsoft Research had done over the years was being put to use in production. Not only did Microsoft start using AI for services like Office and Bing, but it also began offering AI on-demand in the form of Azure services you can easily consume from your applications. The “Seeing AI” project reused work from Microsoft’s AI research teams to solve real-world problems for users.
An explosion of usage at Microsoft was taking place, and everyone used the best tools available at the time, either building their own engines or using off-the-shelf technologies like Fast.AI, Keras, TensorFlow, and PyTorch, and deploying these into production. We have deployed these on the Azure cloud on more computers than I can count.
The AI world is a bit like the JavaScript world in that there is a tremendous amount of excitement; it feels like a new breakthrough model, clever operator, framework, or hardware accelerator comes to life every week.
Doing More with Less
Many AI frameworks tend to be tightly coupled with particular technologies. For example, while PyTorch is great for running your AI models, it really is intended to be used on a PC with an Nvidia GPU. We ended up with an archipelago of solutions, and it was difficult to retarget the code to other systems.
Compiler folks have long known this as the “many-to-many problem.” In this scenario, we have deep learning frameworks on the left and targets on the right:
While the industry has standardized to a large extent on TensorFlow and PyTorch as the high-level frameworks, we are in the early days of AI, and there are many emerging frameworks that try to improve upon these models. New frameworks to solve problems are being written in Julia, Python, Rust, Swift, and .NET, to name a few.
The compiler folks figured out the similarities between many of these frameworks and advocated for an intermediate format that was suitable for representing many of these problems. Rather than maintaining this many-to-many world, we could decouple the front-end languages and frameworks from the backend execution. In 2017, Facebook and Microsoft launched the ONNX project, an intermediate representation suitable as an exchange format for many different runtimes.
The world of ONNX looks a little bit like this:
Today, many frameworks and runtimes support ONNX as an exporting format, or as an input for their accelerator. There are nice tools that have emerged in this ecosystem. One of my favorites is Lutz Roeder’s ONNX model visualizer: Netron.
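You do not even need a visualizer to poke at a model. The onnx Python package lets you load a model and walk its graph, which is just a list of operator nodes wired together by named inputs and outputs. Here is a minimal sketch (the "model.onnx" path is a placeholder for any model you have exported):

```python
import onnx

# Load an exported model, validate it, and walk the graph: each node is an
# operator with named inputs and outputs.
model = onnx.load("model.onnx")
onnx.checker.check_model(model)

for node in model.graph.node:
    print(node.op_type, list(node.input), "->", list(node.output))
```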
At Microsoft, we developed an accelerator to run these ONNX models, which the marketing team baptized as “Onnx Enterprise Runtime Professional for Workgroups” until cooler heads prevailed and we settled on the convenient GitHub “onnxruntime” name. Had I been in charge of the naming, it would have been called the “Overlord Runtime,” on the thesis that it would give people something to talk about. Anyways, I digress.
This runtime was a gift for our internal teams. They could continue to author their models with their existing frameworks, while a separate team could focus on optimizing the runtime for their hardware and the operations that they used the most. The optimizations made for one team benefited all the teams, regardless of the higher-level framework that they were using.
The AI team has made it simple to take your existing code authored in TensorFlow, Keras, or PyTorch and export it to run in the ONNX Runtime. For example, if you are a PyTorch user, you can take your existing PyTorch code, and follow the instructions to run your code with the onnxruntime.
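To give you a flavor of what that looks like, here is a minimal sketch of the PyTorch path, using torchvision's resnet18 as a stand-in for your own model; the file name and input shape are just illustrative:

```python
import torch
import torchvision
import onnxruntime

# Export an existing PyTorch model to the ONNX format.
model = torchvision.models.resnet18(pretrained=True).eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "resnet18.onnx",
                  input_names=["input"], output_names=["output"])

# Run the exported model with the ONNX Runtime.
session = onnxruntime.InferenceSession("resnet18.onnx")
outputs = session.run(None, {"input": dummy.numpy()})
print(outputs[0].shape)  # (1, 1000) class scores
```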
At this point you are wondering, why take this extra step if things are working for you as-is?
In many cases, the answer is that you will save time and money. In a recent post, Tianle Wu and Morgan Funtowicz shared some of the results that we see regularly, showing performance improvements in training from 20% all the way to 350%. You can imagine what the savings look like in terms of efficiency, time, and energy, or just plain dollars if you are renting the CPU/GPU time.
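Those numbers are for training, and your mileage will vary by model and hardware. A rough way to check the inference side on your own workload is to simply time both paths; this sketch reuses the resnet18.onnx file exported above and measures CPU latency:

```python
import time
import torch
import torchvision
import onnxruntime

def bench(fn, warmup=10, iters=100):
    """Average wall-clock seconds per call, after a short warmup."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

model = torchvision.models.resnet18(pretrained=True).eval()
x = torch.randn(1, 3, 224, 224)
session = onnxruntime.InferenceSession("resnet18.onnx")

with torch.no_grad():
    pytorch_ms = bench(lambda: model(x)) * 1000
ort_ms = bench(lambda: session.run(None, {"input": x.numpy()})) * 1000
print(f"PyTorch: {pytorch_ms:.2f} ms   ONNX Runtime: {ort_ms:.2f} ms")
```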
We certainly hope that the work that we have put into the ONNX Runtime will be generally useful for anyone looking to reduce their costs. Let us turn our attention to how this runtime was tuned for our models, and perhaps these optimizations will be useful for your own day to day work as well.
Going Faster
Everyone knows that AI runs faster on GPUs than on CPUs, but you do not really “know” until you try it out. For me, that moment came at a workshop I attended a couple of years ago. The instructions for attending the workshop included something like “Make sure you get yourself a GPU on the cloud before you come.” I did not do that and instead showed up with my MacBook Pro.
Everyone was happily exploring their Jupyter Notebooks and doing all the exercises with GPUs on the cloud, while I was doing the work on my laptop. While the speaker guided us through the magnificent world of GANs and people were following along, I was stuck on the first step that took about an hour to complete and drained my battery.
I learned my lesson. You really want to use a hardware accelerator to do your work.
What makes the ONNX Runtime (“Overlord Runtime” for those of you in my camp) run our models faster than plain PyTorch or TensorFlow?
Turns out, there is not one single reason for it, but rather a collection of reasons.
Let me share some of these, which fall in the following areas:
- Graph optimizations
- MLAS
- TVM Code Generator (for some models)
- Pluggable Execution Provider architecture
Graph Optimizations
One of the key differences in performance between PyTorch and the ONNX Runtime has to do with how your programs are executed.
Immediate execution, like the one provided by PyTorch (AI folks call this “eager”), keeps the Python runtime in the middle of the equation, where every mathematical operation is dispatched one by one to the GPU to perform its work.
This approach works and has an immediate quality to it, but it is not exactly speedy. The diagram below represents both my limited drawing skills and a number of operations executed sequentially on the CPU, and then executed on the GPU, one by one:
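A tiny sketch of what that looks like from the Python side; on a GPU, each of these lines becomes its own kernel launch, driven one at a time from the interpreter:

```python
import torch

# Eager execution: Python dispatches every operation separately.
x = torch.randn(1024, 1024)
w = torch.randn(1024, 1024)

y = x @ w            # one dispatch
y = torch.relu(y)    # another dispatch
y = y + 1.0          # and another, with Python in between each step
```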
The ONNX Runtime has a chance to take the entire graph and apply traditional compiler optimizations.
Some of the optimizations it applies are done at the graph level, like checking if two operations can be merged into one (this is called “kernel fusion”), or seeing if there are transformations that produce the same result if computed differently:
In the end, this helps to minimize the number of round-trips between the CPU and the GPU to compute the same problem, as well as minimizing the number of copies and data sharing.
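You can see this at work from the onnxruntime Python API: the runtime exposes its graph optimization levels, and it can save the transformed graph so you can compare it to the original in Netron. A small sketch, again reusing the resnet18.onnx file from earlier:

```python
import onnxruntime

# Enable all graph-level optimizations (including fusions) and write the
# optimized graph out so it can be inspected.
options = onnxruntime.SessionOptions()
options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
options.optimized_model_filepath = "resnet18.optimized.onnx"

session = onnxruntime.InferenceSession("resnet18.onnx", options)
```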
Microsoft Linear Algebra Subprogram Library
Training AI networks is typically done on GPUs, as it is a lengthy operation that can take from hours to weeks to run and is a task well suited to that hardware. Meanwhile, using a trained network (inferencing) can be done more efficiently on the CPU.
Given that AI problems are full of math and matrix multiplications done in bulk, the team set out to build optimized versions of the key math components used during inferencing on the CPU, and they developed a minimal version of BLAS called MLAS.
MLAS contains a hand-tuned set of linear algebra operations, often implemented in assembly language, that leverage the vector instructions available on the various processors it supports.
Pluggable Execution Provider Architecture
Recently, Andrey Volodin showed a tiny and beautiful runtime for ONNX models that is hardware accelerated using Apple’s Metal APIs called Smelter. It is so small that it takes about 10 minutes to read the whole source code and realize just how simple ONNX is and how you can hardware accelerate individual nodes.
The ONNX Runtime takes this a couple of steps further by providing a pluggable system where new optimizations and backends can be added. These are called “Execution Providers”; at startup, they register with the runtime how to move data in and out of their universe, and they respond to the runtime’s question of “How much of this graph would you like to run?”
At one end of the spectrum, a simple provider could work like the tiny and beautiful runtime above and accelerate just a handful of operations. At the other end of the spectrum, it can take entire programs and send them over to be executed on a hardware accelerator (GPUs or neural processors).
In this diagram, the graph runner gets to choose which parts of the graph should be executed by which providers:
Today the ONNX Runtime ships with twelve providers, including Intel’s OpenVINO and Nvidia’s TensorRT and CUDA backends, and others can be added in the future. New execution providers can be added for specific hardware accelerators, different sorts of GPUs, or JIT compilers. Personally, I am going to try to add a Metal backend for my own use.
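From the Python API, picking a provider is a one-liner: you ask the runtime which providers your build was compiled with, and then list them in priority order, with the CPU provider as the fallback that can always run everything. A sketch, assuming the GPU build of the onnxruntime package:

```python
import onnxruntime

# Which execution providers did this build ship with?
print(onnxruntime.get_available_providers())

# List providers in priority order; nodes the first provider cannot handle
# fall back to the next one in the list.
session = onnxruntime.InferenceSession(
    "resnet18.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
```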
Just-in-Time Compilation
In the pluggable architecture described above, some kinds of graphs can be sped up if they can be JIT compiled. Today for a limited number of cases, the runtime uses the Apache TVM Deep Learning Optimizing Compiler to produce custom code for those operations.
Last year, various teams at Microsoft started exploring the use of the Multi-Level IR compiler infrastructure (MLIR), originally developed by Google and now part of LLVM, both to optimize individual graph operations (“kernels”) and certain groups of kernels (“kernel fusion”, described above), and to generate code for CPUs, GPUs, and other hardware accelerators.
From Inferencing to Training
An interesting detail is that the Microsoft ONNX Runtime was originally designed to be a high-performance runtime for inferencing. This allowed the team to focus on building a compact and nimble runtime purely focused on performance.
This initial focus has paid off, because there are many more day-to-day users of inferencing than there are of training. It has helped us reduce the cost of rolling out models to more places, and to more scenarios that users can benefit from.
It was only later that training was added.
This training capability has been under development and in production use for a while, and just this past week we announced it as a preview at the Build Conference. Up to this point, this post has talked about how we made training on a single machine faster.
It turns out that training can also be a team sport. Computers do not need to work in isolation; they can work together to train these vast models. At Microsoft, earlier this year we talked about how we trained one of the largest published models, and last week at Build, Kevin Scott talked about Microsoft’s AI supercomputer, a vast distributed system for training.
The technology that powers both of those efforts has now been integrated into the ONNX Runtime that was just released, and you can now also use these capabilities for your own models.
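I will not cover the training APIs in detail here, but to give you an idea of the direction: the training packages let you wrap an existing PyTorch module so that its forward and backward passes run through the ONNX Runtime, while the rest of your loop stays plain PyTorch. A minimal sketch, assuming the torch-ort training package is installed (the model and data here are just placeholders):

```python
import torch
from torch_ort import ORTModule  # from the onnxruntime training packages

# Wrap an existing nn.Module; forward and backward now run through the
# ONNX Runtime training backend, the rest of the loop is ordinary PyTorch.
model = ORTModule(torch.nn.Sequential(
    torch.nn.Linear(784, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 10),
))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

x = torch.randn(64, 784)         # placeholder batch of inputs
y = torch.randint(0, 10, (64,))  # placeholder labels

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```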
The Future
I am in awe of the success that the AI team at Microsoft has achieved in such a short time. The ONNX Runtime is a pleasure to work with: it is cleanly architected, and we are looking to extend its capabilities, driven by our users, to add execution providers, and to bring it to new platforms, in particular mobile platforms, which now ship with assorted neural network accelerators.
Everything that I have discussed in this post is part of the open-source ONNX Runtime on GitHub.
About the Author: Miguel de Icaza
Miguel de Icaza is a Distinguished Engineer in the Developer Division at Microsoft working on our AI and .NET tooling. Previously he founded Xamarin and led the Mono, Xamarin, and Visual Studio for Mac projects.