In this post, we elaborate on how we measured, on commodity cloud hardware, the throughput and latency of five ResNet-50 v1 models optimized for CPU inference. By the end of the post, you should be able to reproduce these benchmarks using tools available in the Neural Magic GitHub repo.
[Chart: ResNet-50 v1 throughput | Batch = 64 | AWS c5.12xlarge CPU]
[Chart: ResNet-50 v1 latency | Batch = 1 | AWS c5.12xlarge CPU]
Last week we released support for ResNet-50, with YOLOv3 support coming in a few weeks, to be followed by BERT and other transformer models in the coming months. We urge you to try unsupported models and report back to us through the GitHub Issues queue as we work hard to broaden our offering of sparse and sparse-quantized models.
For more info on ResNet, how it’s typically used, current limitations, and details on how Neural Magic initially made running ResNet models more performant and cost effective, see our previous post.
Intro to Sparsification
Neural Magic’s Deep Sparse Platform provides a suite of software tools to select, build, and run sparse deep learning models on CPU resources. By taking advantage of “sparsification,” you can plug into the DeepSparse Engine in multiple ways to run sparse models like ResNet-50 at accelerated speeds on CPUs. So what is sparsification, and why should you care?
Sparsification is the process of taking a trained deep learning model and removing redundant information from the over-precise and over-parameterized network, resulting in a faster and smaller model. Sparsification techniques are all-encompassing, ranging from inducing sparsity through pruning and quantization to enabling naturally occurring activation sparsity. When implemented correctly, these techniques yield significantly faster and smaller models with limited to no effect on the baseline metrics. For example, as you will see shortly in our benchmarking exercise, pruning plus quantization can give over a 7x improvement in performance while recovering to nearly the same baseline accuracy.
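To make the pruning idea concrete, here is a minimal sketch of unstructured magnitude pruning on a single weight tensor, using plain NumPy; the tensor shape and the 90% sparsity target are illustrative assumptions, not SparseML’s implementation:

import numpy

def magnitude_prune(weights, sparsity):
    # Zero out the smallest-magnitude weights so that `sparsity` fraction are zero
    threshold = numpy.quantile(numpy.abs(weights), sparsity)
    return numpy.where(numpy.abs(weights) < threshold, 0.0, weights)

# Illustrative example: prune a random conv-like weight tensor to 90% sparsity
weights = numpy.random.randn(64, 64, 3, 3).astype(numpy.float32)
pruned = magnitude_prune(weights, sparsity=0.9)
print(f"achieved sparsity: {(pruned == 0).mean():.2%}")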
The Deep Sparse Platform builds on top of sparsification, enabling you to easily apply these techniques to your datasets and models using recipe-driven approaches. Recipes encode the directions for how to sparsify a model into a simple, easily editable format. Simply put, you would:
1. Download a sparsification recipe and sparsified model from the SparseZoo.
2. Alternatively, create a recipe for your model using Sparsify.
3. Apply your recipe with only a few lines of code using SparseML (a sketch follows this list).
4. Finally, for GPU-level performance on CPUs, you can deploy your sparse-quantized model with the DeepSparse Engine.
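For step 3, here is a minimal sketch of what applying a recipe during PyTorch training might look like, assuming SparseML’s ScheduledModifierManager API; the recipe path, learning rate, and steps_per_epoch value are placeholders for your own setup:

import torch
from torchvision.models import resnet50
from sparseml.pytorch.optim import ScheduledModifierManager

model = resnet50(pretrained=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Load the recipe and wrap the optimizer so the pruning/quantization
# modifiers fire on the schedule the recipe defines (placeholder values)
manager = ScheduledModifierManager.from_yaml("recipe.yaml")
optimizer = manager.modify(model, optimizer, steps_per_epoch=100)

# ... run your normal training loop here ...

manager.finalize(model)  # remove modifier hooks once training completes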
Here’s the full Deep Sparse product flow and the various paths to sparse acceleration. We will focus this discussion on the path of taking a SparseZoo model, namely the sparse-quantized ResNet-50, and benchmarking it with the DeepSparse Engine.
ResNet-50 v1: Benchmarking with the DeepSparse Engine
Approach
To populate the SparseZoo, we started from a pre-trained baseline ResNet-50 from the torchvision models subpackage. We then sparsified and quantized the models and fine-tuned them using our replicable recipes to recover close to the baseline accuracy.
We define three categories of recoverability to make it easy to understand the trade-offs made during the sparsification process:
1. Conservative: 100% of baseline accuracy maintained
2. Moderate: >= 99% of baseline accuracy maintained
3. Aggressive: >= 95% of baseline accuracy maintained
Each model in the SparseZoo has a specific stub that identifies its category of recoverability. Visit the SparseZoo docs on models to learn more about the stub structure.
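As an example, the Pruned Moderate stub used later in this post decomposes as follows (our informal reading; the docs give the authoritative breakdown):

zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned-moderate
zoo:<domain>/<sub-domain>/<architecture>/<framework>/<repo>/<dataset>/<sparsification-name>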
Hardware Setup and Environment
The DeepSparse Engine is completely infrastructure-agnostic, meant to plug in anywhere from edge deployments to model servers. As long as the machine has the “right” CPUs (80% of the entire Intel offering today) with a performant instruction set such as AVX-512, the engine can run on any cloud platform. The DeepSparse Engine will automatically utilize the most effective available instruction set for the task.
For this exercise, the benchmarks were run on an AWS c5.12xlarge instance, which has a modern Intel CPU supporting AVX-512 Vector Neural Network Instructions (AVX-512 VNNI). VNNI is designed to accelerate INT8 workloads, making speedups of up to 4x possible when going from FP32 to INT8 inference.
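If you want to confirm that your own Linux machine exposes these instructions before benchmarking, a quick heuristic is to inspect /proc/cpuinfo (the engine performs its own, more thorough detection at runtime):

# Check the CPU flag list for AVX-512 and VNNI support (Linux only)
with open("/proc/cpuinfo") as f:
    flags = f.read()

print("AVX-512F:    ", "avx512f" in flags)
print("AVX-512 VNNI:", "avx512_vnni" in flags)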
We used Ubuntu 20.04.1 LTS as the operating system with Python 3.8.5. All the benchmarking dependencies are contained in the DeepSparse Engine package, which can be installed with:
pip3 install deepsparse
More details about the DeepSparse Engine and compatible hardware are available in the Neural Magic docs.
You can find the Python script used to generate the DeepSparse numbers on the DeepSparse Engine GitHub repo.
Benchmark Measurements
Keeping this as simple as possible, the benchmark measures the full end-to-end time of giving an input batch to the engine and receiving the predicted output, at full FP32 precision.
We perform several warm-up iterations before timing each measured iteration, to minimize noise in the final results.
Here is the full timing section from deepsparse/engine.py:
start = time.time()  # record wall-clock time just before inference
out = self.run(batch)  # execute the engine on the input batch
end = time.time()  # record wall-clock time once outputs are returned
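Putting the warmup and timing together, here is a minimal sketch of the measurement loop in the same spirit; the warmup and iteration counts are illustrative, and SPARSEZOO_MODEL_STUB is a placeholder just as in the blocks below:

import time
import numpy
from deepsparse import compile_model

batch_size = 64
engine = compile_model("SPARSEZOO_MODEL_STUB", batch_size=batch_size)
batch = [numpy.random.randn(batch_size, 3, 224, 224).astype(numpy.float32)]

# Warm up so one-time costs (compilation, allocation, caching) don't skew results
for _ in range(10):
    engine.run(batch)

# Time each iteration end to end, then report the average
times = []
for _ in range(50):
    start = time.time()
    engine.run(batch)
    times.append(time.time() - start)

avg_s = sum(times) / len(times)
print(f"avg batch time: {avg_s * 1000:.2f} ms, throughput: {batch_size / avg_s:.1f} items/sec")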
ResNet-50 v1 Throughput Results
For the throughput scenario, we used a batch size of 64 with random input using all available cores.
This code block replicates the benchmark environment, where SPARSEZOO_MODEL_STUB is replaced with a model stub from the table above.
from deepsparse import benchmark_model
import numpy

batch_size = 64
sample_inputs = [numpy.random.randn(batch_size, 3, 224, 224).astype(numpy.float32)]

results = benchmark_model(
    "SPARSEZOO_MODEL_STUB",
    sample_inputs,
    batch_size=batch_size,
)
print(results)
As an example substitution, this is the benchmark command for the Pruned Moderate FP32 ResNet-50:
results = benchmark_model(
    "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned-moderate",
    sample_inputs,
    batch_size=batch_size,
)
print(results)
ResNet-50 v1 Latency Results
For the latency scenario, we used a batch size of 1 with random input using all available cores.
This code block replicates the benchmark environment, where SPARSEZOO_MODEL_STUB is replaced with a model stub from the table above.
from deepsparse import benchmark_model
import numpy

batch_size = 1
sample_inputs = [numpy.random.randn(batch_size, 3, 224, 224).astype(numpy.float32)]

results = benchmark_model(
    "SPARSEZOO_MODEL_STUB",
    sample_inputs,
    batch_size=batch_size,
)
print(results)
Try it Now: Benchmark ResNet-50
To replicate this experiment and its results, follow the instructions below. Once you have procured the infrastructure, it should take approximately 15–30 minutes to run through this exercise.
1. Reserve a c5.12xlarge instance on AWS; we used the Amazon Ubuntu 20.04 AMI
2. Install pip and venv if they aren’t already installed:
sudo apt update && sudo apt install python3-pip python3-venv
3. Create and activate a virtual environment for Python
python3 -m venv benchmark-env && source benchmark-env/bin/activate
4. Install the DeepSparse Engine by running
pip3 install deepsparse
5. Clone the DeepSparse Engine repository; it will include the benchmarking script for reproducing ResNet-50 numbers:
git clone https://github.com/neuralmagic/deepsparse.git
6. Replicate the throughput and latency scenarios by running the Python scripts:
python3 deepsparse/examples/benchmark/resnet50_benchmark.py --batch_size=64
python3 deepsparse/examples/benchmark/resnet50_benchmark.py --batch_size=1
Both scripts will download the various ResNet-50 sparse-quantized models from the SparseZoo, benchmark them at the given batch size, and print out the results of each iteration.
Conclusions
Sparse-quantized models like our ResNet-50 models deliver attractive performance for image classification and object detection use cases. As the results show, with tools readily available on GitHub, models that use techniques like pruning and quantization can achieve speedups upwards of 7x when run with the DeepSparse Engine on compatible hardware.
These noticeable wins do not stop with ResNet-50. Neural Magic is constantly pushing the boundaries of what’s possible with sparsification on new models and datasets. The results of these advancements are pushed into our open-source repos for all to benefit from: new, performant models are consistently added to the SparseZoo, and new techniques are added to Sparsify and SparseML so you can apply them to your own models.
Next Step: Transfer Learn
To transfer learn ResNet-50 to your data, visit our example in GitHub.
Resources and Learning More
– Software used in benchmarking: SparseZoo, DeepSparse Engine
– Pruning Primer
– Quantization Technical Paper
– Transfer Learn Your Data with SparseML
– Neural Magic Docs, GitHub
– Subscribe to Neural Magic Updates: Nerd out with us on ML Performance! (We keep the email manageable and do not share your details with anyone, ever.)
Originally posted here. Reposted with permission.