In this post, we explain how we used state-of-the-art pruning and quantization techniques to improve the performance of YOLOv3 on CPUs. We’ll show that by combining the robust YOLO training framework from Ultralytics with SparseML’s sparsification recipes, it is easy to create highly pruned, INT8-quantized YOLO models that deliver more than a 6x increase in performance over state-of-the-art PyTorch and ONNX Runtime CPU implementations. Lastly, we’ll show that model sparsification (pruning and quantization) doesn’t have to be a daunting task when using Neural Magic’s open-source tools and recipe-driven approaches.
Figure 1: Comparison of the real-time performance of YOLOv3 (batch size 1) for different CPU implementations to common GPU benchmarks.
* ONNX Runtime has a known performance regression for quantized YOLOv3 currently on larger core counts.
Figure 2: Comparison of throughput inference costs of YOLOv3 (batch size 64) for different CPU implementations to common GPU benchmarks
A few weeks ago we released support for ResNet-50, showing that pruning and quantization lead to 7x better performance on CPUs over ONNX Runtime. Today we are officially supporting YOLOv3, to be followed by BERT and other popular models in the coming weeks. By the end of the post, you should be able to reproduce these benchmarks using the aforementioned integrations and tools available in the Neural Magic GitHub repo.
Achieving GPU-Class Performance for YOLOv3 on CPUs
In 2018, the creators of YOLO released an insightful paper, YOLOv3: An Incremental Improvement. It was significant at the time because it delivered the best tradeoff between high accuracy and real-time latency. Since then, there have been notable performance improvements enabled by advancements in GPUs.
For real-time inference at batch size 1, the YOLOv3 model from Ultralytics is able to achieve 60.8 img/sec using a 640 x 640 image at half-precision (FP16) on a V100 GPU. This is a 3x improvement on the original paper’s number of 19.6 img/sec using a 608 x 608 image at full precision (FP32) on a Titan X GPU. Better training methods, updates to the CUDA libraries, and better support for half-precision have all contributed to the significant improvement in both mean average precision (mAP) and inference performance on GPUs.
CPU performance, however, has lagged behind GPU performance. Native PyTorch on CPU today achieves only 2.7 img/sec for YOLOv3 at batch size 1 with a 640 x 640 image on a 24-core server. ONNX Runtime performs better but still maxes out at 13.8 img/sec. This poor performance has historically made it impractical to deploy YOLOv3 on a CPU.
By using sparsification and proprietary advancements in the DeepSparse Engine, we are able to bridge the gap between CPU and GPU performance. Applying both to YOLOv3 allows us to significantly improve performance on CPUs — enabling real-time CPU inference with a state-of-the-art model. For example, a 24-core, single-socket server with the sparsified model achieves 46.5 img/sec while a more common 8-core instance achieves 27.7 img/sec. These results deliver the flexibility and cost benefits of CPUs at GPU-like speeds.
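The ratios behind these claims can be checked directly from the throughput figures quoted above. The short sketch below does only that arithmetic; the dictionary keys and helper names are made up for illustration, and the conversion to per-image latency assumes each figure is a batch size 1 rate.

```python
# Throughput figures (img/sec) quoted in this post. The key names are
# illustrative labels, not identifiers from any library.
quoted = {
    "titan_x_fp32_original_paper": 19.6,
    "v100_fp16_ultralytics": 60.8,
    "pytorch_cpu_24_core": 2.7,
    "onnxruntime_cpu_24_core": 13.8,
    "deepsparse_sparse_24_core": 46.5,
    "deepsparse_sparse_8_core": 27.7,
}


def speedup(new: float, old: float) -> float:
    """Return how many times faster `new` is than `old`."""
    return new / old


def latency_ms(img_per_sec: float) -> float:
    """Average per-image latency implied by a batch size 1 throughput."""
    return 1000.0 / img_per_sec


gpu_gain = speedup(quoted["v100_fp16_ultralytics"],
                   quoted["titan_x_fp32_original_paper"])
vs_ort = speedup(quoted["deepsparse_sparse_24_core"],
                 quoted["onnxruntime_cpu_24_core"])
vs_pytorch = speedup(quoted["deepsparse_sparse_24_core"],
                     quoted["pytorch_cpu_24_core"])

print(f"V100 FP16 vs. Titan X FP32:      {gpu_gain:.1f}x")
print(f"Sparse DeepSparse vs. ONNX RT:   {vs_ort:.1f}x")
print(f"Sparse DeepSparse vs. PyTorch:   {vs_pytorch:.1f}x")
print(f"Implied latency on 24 cores:     "
      f"{latency_ms(quoted['deepsparse_sparse_24_core']):.1f} ms")
```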
YOLOv3 on CPUs: Diving into the Numbers
The numbers presented were benchmarked on readily available servers in AWS, and the code is open-sourced as an integration within our SparseML repo. Each benchmark includes pre- and post-processing along with the model execution and was run for 25 warmups and 80 measurements. The mAP at an IoU of 0.5 is reported for all versions of YOLOv3 as an accuracy comparison, where a larger value is better. In addition, inference performance is reported in images per second (img/sec) and images per dollar (img/dollar), where larger values are again better. Generally, 20 images per second is considered real-time performance at batch size 1, but this can vary based on the setup.
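To make the measurement procedure concrete, here is a minimal sketch of a timing loop with 25 discarded warmups and 80 timed runs. It is not the integration's benchmarking script: `fake_pipeline` is a hypothetical stand-in for pre-processing, inference, and post-processing, and the 20 img/sec threshold is the rule of thumb mentioned above.

```python
import time
from statistics import mean


def benchmark(run_once, warmups=25, measurements=80):
    """Time `run_once` after discarding warmup iterations.

    Returns (images_per_second, mean_latency_ms), assuming each call
    processes exactly one image (batch size 1).
    """
    for _ in range(warmups):
        run_once()  # results discarded; lets caches and threads settle

    latencies = []
    for _ in range(measurements):
        start = time.perf_counter()
        run_once()
        latencies.append(time.perf_counter() - start)

    avg_seconds = mean(latencies)
    return 1.0 / avg_seconds, avg_seconds * 1000.0


def fake_pipeline():
    """Hypothetical stand-in for pre-processing + model + post-processing."""
    time.sleep(0.002)


REAL_TIME_IMG_PER_SEC = 20.0  # rule-of-thumb threshold from the text

img_per_sec, latency = benchmark(fake_pipeline)
print(f"{img_per_sec:.1f} img/sec, {latency:.2f} ms per image")
print("real-time" if img_per_sec >= REAL_TIME_IMG_PER_SEC else "not real-time")
```

Timing the whole pipeline rather than the model alone matches the end-to-end framing used for the numbers in this post.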
The CPU servers and core counts for each use case were chosen to balance reasonable deployment setups with cost-competitive numbers. The AWS C5 servers specifically were used because they are designed for computationally intensive workloads and include both the AVX512 and VNNI instruction sets. CPU servers in general are very flexible, whether through cloud deployments or VMs, so the number of cores can be varied to fit the exact deployment needs, letting the user balance performance and cost with ease. In addition, because CPU servers are more readily available, models can be deployed closer to the end user, cutting out costly network time.
Unfortunately, the common GPUs available do not support speedups from unstructured sparsity. This is due to a lack of both hardware and software support and is an active research area. The new A100s do have hardware support for semi-structured (2:4) sparsity, though not for unstructured sparsity, and are not readily available at the time of writing. We look forward to updating our benchmarks as both software and hardware improve on GPUs and CPUs over the coming months and years. We are also committed to making it easier for you to create accurate, cheaper, and more environmentally friendly neural networks through model sparsification as the deployment ecosystem evolves.
Latency
For latency measurements, we use batch size 1 to represent the fastest time in which an image can be detected and returned. A 24-core, single-socket AWS server is used to test the CPU implementations. The full table of measured values (and the source for Figure 1) is provided below. We can see that pruning and quantizing take the model from a reasonable CPU implementation all the way to within the GPU range with DeepSparse, a 3.4x increase in end-to-end performance.
Table 1: Latency benchmark numbers (batch size 1) for YOLOv3
Throughput Cost
For throughput measurements, we use batch size 64 to represent a normal, batched use case. Additionally, a batch size of 64 was enough to fully saturate GPU and CPU performance in our testing. To maximize cost savings, a 1-core, single-socket AWS server is used to test the CPU implementations. The full table of measured values (and the source for Figure 2) is provided below. We can see that pruning and quantizing take the model from an expensive deployment to beating out everything but the T4 FP16 GPU numbers, a 6x decrease in the deployment cost.
Table 2: Throughput cost benchmark numbers (batch size 64) for YOLOv3
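The img/dollar metric follows from throughput and the hourly price of the instance. The sketch below shows that conversion under the assumption that cost is billed per hour of instance time; the function name and the example inputs are hypothetical and are not measurements from this post.

```python
def images_per_dollar(img_per_sec: float, hourly_cost_usd: float) -> float:
    """Images processed per dollar of instance time.

    Assumes the instance is billed by the hour and kept fully busy.
    """
    if hourly_cost_usd <= 0:
        raise ValueError("hourly_cost_usd must be positive")
    return img_per_sec * 3600.0 / hourly_cost_usd


# Hypothetical inputs for illustration only, NOT figures from this post.
example = images_per_dollar(img_per_sec=10.0, hourly_cost_usd=0.50)
print(f"{example:,.0f} img/dollar")

# At a fixed instance price, a 6x throughput gain is a 6x gain in img/dollar,
# which is the same thing as a 6x drop in cost per image.
faster = images_per_dollar(img_per_sec=60.0, hourly_cost_usd=0.50)
print(f"{faster / example:.0f}x more images per dollar")
```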
YOLOv3 on CPUs Throughput Performance
Batch size 64 is again used to represent a normal, batched use case for the throughput performance benchmarking. A 24-core, single-socket AWS server is used to test the CPU implementations as well. The full table for measured values is provided below. We can see that the V100 numbers are tough to beat; however, pruning and quantizing combined with DeepSparse beat out the T4. The combination also beats out the next best CPU numbers by over 6.8x!
Table 3: Throughput performance benchmark numbers (batch size 64) for YOLOv3
YOLOv3 on CPUs: How did we do it?
At a high level, we achieved the above results using a combination of sparsification algorithms and a CPU-optimized inference engine:
1. We pruned the model to 83% sparsity using state-of-the-art pruning techniques with SparseML;
2. Using SparseML, we quantized the model to INT8 with quantization-aware training;
3. Finally, we ran the model in the DeepSparse Engine that’s optimized to accelerate sparse, quantized models on CPUs to GPU speeds (scroll to the bottom of this page to learn how the DeepSparse Engine works).
All of the tools we used are available in the Neural Magic GitHub repo.
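To illustrate what 83% sparsity means for a single layer, the following sketch zeroes the smallest-magnitude weights of a random list of values. It is a plain one-shot magnitude pruning example written for this explanation, not SparseML's implementation; the function names and the random "layer" are assumptions.

```python
import random


def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude fraction zeroed."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    num_to_prune = round(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_to_prune]:
        pruned[i] = 0.0
    return pruned


def sparsity_of(weights):
    """Fraction of entries that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


random.seed(0)
layer = [random.gauss(0.0, 1.0) for _ in range(1000)]  # made-up weights
pruned_layer = magnitude_prune(layer, sparsity=0.83)
print(f"sparsity after pruning: {sparsity_of(pruned_layer):.2f}")
```

Because the zeros can fall anywhere in the tensor, this is unstructured sparsity, the kind that needs an engine built to skip the zeroed weights in order to turn into a speedup.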
Sparsification
Sparsification is the process of removing redundant information from an over-precise and over-parameterized network. When implemented correctly, sparsification results in significantly faster and smaller models with limited to no effect on the baseline metrics. For example, as you saw above in our YOLOv3 benchmarking exercise, sparsification (pruning plus quantization) can give over 3.7x (batch size 1) and 4.1x (batch size 64) improvement in end-to-end performance while recovering to nearly the same baseline.
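The "over-precise" half of that statement is what quantization addresses. As a rough sketch, the code below maps floating-point values to the INT8 range with one symmetric scale per tensor and then maps them back. This shows only the rounding step; quantization-aware training simulates this rounding during training so the model can adapt to it, and the exact scheme used by the recipe may differ. All names and values here are illustrative.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to the INT8 range."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to [-128, 127].
    quantized = [max(-128, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map INT8 steps back to approximate floating-point values."""
    return [q * scale for q in quantized]


weights = [0.02, -0.51, 0.33, 1.27, -1.0]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print(f"scale: {scale:.4f}, max round-trip error: {max_error:.6f}")
```

The round-trip error is bounded by half a quantization step, which is why accuracy can usually be recovered with a short period of retraining.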
The Deep Sparse Platform
The Deep Sparse Platform builds on top of sparsification, enabling you to easily apply the techniques to your datasets and models using recipe-driven approaches. Recipes encode the directions for how to sparsify a model into a simple, easily editable format. At a high level, you would:
1. Download a sparsification recipe and sparsified model from the SparseZoo.
2. Alternatively, create a recipe for your model using Sparsify.
3. Apply your recipe with only a few lines of code to your training setup using SparseML.
4. Retrain your model for a few epochs to sparsify it and recover close to the baseline model.
5. Finally, for GPU-level performance on CPUs, you can deploy your sparse-quantized model with the DeepSparse Engine.
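One reason the retraining step spans several epochs is that sparsity is typically raised gradually rather than applied all at once. The sketch below shows a cubic ramp, a schedule commonly used for gradual magnitude pruning. Whether a given recipe uses exactly this curve is an assumption, and the function name, epoch range, and starting sparsity are hypothetical.

```python
def scheduled_sparsity(epoch, start_epoch, end_epoch,
                       init_sparsity, final_sparsity):
    """Target sparsity at `epoch` under a cubic ramp from init to final."""
    if epoch <= start_epoch:
        return init_sparsity
    if epoch >= end_epoch:
        return final_sparsity
    progress = (epoch - start_epoch) / (end_epoch - start_epoch)
    # Sparsity rises quickly at first, then levels off near the target.
    return final_sparsity + (init_sparsity - final_sparsity) * (1.0 - progress) ** 3


# Hypothetical schedule: ramp from 5% to 83% sparsity over epochs 0 to 50.
schedule = [
    scheduled_sparsity(e, start_epoch=0, end_epoch=50,
                       init_sparsity=0.05, final_sparsity=0.83)
    for e in range(0, 61, 10)
]
for epoch, target in zip(range(0, 61, 10), schedule):
    print(f"epoch {epoch:2d}: target sparsity {target:.3f}")
```

Pruning aggressively early, when the network has the most redundancy, and slowing down near the target leaves the remaining epochs for the model to recover accuracy.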
To learn more about pruning and quantizing models with SparseML, take a look at our notebooks and examples for walkthroughs of simple workflows with PyTorch, Keras, and TensorFlow.
How to Prune and Quantize YOLOv3
SparseML is a library that makes it easy to seamlessly apply pruning and quantization to existing training flows. For pruning and quantizing YOLOv3, SparseML provides integration into the PyTorch-based YOLO repository ultralytics/yolov5. By leveraging the robust YOLO training framework from Ultralytics with SparseML’s sparsification recipes it is easy to create highly pruned, INT8 quantized YOLO models.
Installing the Integration
To install the integration and its dependencies, follow the integration’s README for the latest steps, or the commands below:
# clone
git clone https://github.com/ultralytics/yolov5.git
git clone https://github.com/neuralmagic/sparseml.git
# copy script
cd yolov5
git checkout c9bda11 # latest tested integration commit hash
cp ../sparseml/integrations/ultralytics/*.py .
cp ../sparseml/integrations/ultralytics/deepsparse/*.py .
# install dependencies
pip install "sparseml[torchvision]" deepsparse
pip install -r requirements.txt
The integration includes a training script that provides the best features of SparseML and Ultralytics/yolov5 as well as a simple server/client example for DeepSparse Engine deployments and a benchmarking script.
Training a Pruned-Quantized Model
The integrated train.py script provides options for using custom SparseML recipes to sparsify models from pretrained checkpoints. The following example command was used to prune and quantize the YOLOv3-SPP model from its dense baseline.
It loads the recipe and starting weights from SparseZoo. Recipes and weights can also be loaded from local paths. Check out the downloaded recipe for more information on how the model was trained.
python train.py \
--sparseml-recipe "zoo:cv/detection/yolo_v3-spp/pytorch/ultralytics/coco/pruned_quant-aggressive_94?recipe_type=original" \
--weights zoo \
--cfg ./models/hub/yolov3-spp.yaml \
--data coco.yaml \
--hyp data/hyp.finetune.yaml \
--epochs 242 \
--batch-size 48 \
--name yolov3-spp-leaky_relu-pruned_quant
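The `zoo:` argument in the command above is a path-like identifier with an optional query string. As an illustration only, the sketch below splits such a stub into its path segments and parameters. The helper name is hypothetical, it does not call SparseZoo, and reading the segments as domain, task, architecture, framework, source repo, dataset, and sparsification variant is an assumption based on the stub's appearance.

```python
from urllib.parse import parse_qs


def parse_zoo_stub(stub):
    """Split a `zoo:` stub into (path_segments, query_params)."""
    prefix = "zoo:"
    if not stub.startswith(prefix):
        raise ValueError("expected a stub starting with 'zoo:'")
    path, _, query = stub[len(prefix):].partition("?")
    segments = path.split("/")
    # parse_qs returns lists of values; keep the first value for each key.
    params = {key: values[0] for key, values in parse_qs(query).items()}
    return segments, params


stub = ("zoo:cv/detection/yolo_v3-spp/pytorch/ultralytics/coco/"
        "pruned_quant-aggressive_94?recipe_type=original")
segments, params = parse_zoo_stub(stub)
print(segments)
print(params)
```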
YOLOv3 on CPUs: Run Our Benchmarks
To reproduce our benchmarks and check DeepSparse performance on new CPUs, the integration also provides the benchmark.py script. The benchmarking script supports benchmarking YOLOv3 models using DeepSparse, ONNX Runtime (CPU), and PyTorch GPU with 640 x 640 input images.
For a full list of options, run python benchmark.py --help.
As an example, to benchmark DeepSparse’s pruned-quantized YOLOv3 performance on your VNNI-enabled CPU, run:
python benchmark.py \
zoo:cv/detection/yolo_v3-spp/pytorch/ultralytics/coco/pruned_quant-aggressive_94 \
--engine deepsparse \
--batch-size 1 \
--quantized-inputs
YOLOv3 on CPUs: Conclusion
The wins do not stop with YOLOv3. We will continue to push what’s possible with sparsification and CPUs through higher sparsities, better high-performance algorithms, and cutting-edge multicore programming developments. The results of these advancements will be pushed into our open-source repos for all to benefit. The SparseZoo will continuously receive new models and updates to existing performant models; Sparsify and SparseML will be enhanced with new sparsification developments as we land them, ready to be applied to your models.
Stay current by starring our GitHub repository or subscribing to our weekly ML performance communications here. We will never share your data with anyone. We urge you to try unsupported models and report back to us through the GitHub Issue queue as we work hard to broaden our sparse and sparse-quantized model offerings.
Originally posted here. Reposted with permission.