*This post was originally published by Ziheng Wang at Towards Data Science*

A natural impulse is to make neural networks smaller by cheating. In Song Han’s seminal Deep Compression Paper back in 2016 (which has been cited around 4000 times according to Google), he proposed two ways to cut corners: quantization and pruning. Quantization aims to reduce the number of bits in the weights and activations. Pruning aims to preserve useful weights and set the rest to zero. Both of which have since became vibrant fields of research. Note quantization and pruning are pretty orthogonal methods. They can and are meant to be combined.

We are mostly going to talk about pruning here. In the 4 years since 2016, much of the pruning research has focused on structured pruning. Why structured pruning? After setting useless weights to zero in pruning, we are left with a sparse weight matrix. It didn’t take people long to realize that current commodity hardware/software stack does not like unstructured sparsity (aka random sparsity) patterns, where the nonzeros can be anywhere in the matrix. Instead, current hardware/software likes structured computation.

The quintessential structured computation is a dense matrix multiply, which has been optimized to hell and back by a bunch of really smart people. On Intel CPUs, one can expect around 90% of the hardware FLOPs with Intel MKL. On Nvidia GPUs, one can also expect close to roofline performance. Multiplying unstructured sparse matrices is kind of the polar opposite. Typically, existing state-of-the-art implementations can only achieve (much) less than 10% of the hardware FLOPs.

Structured pruning thus aims to strike some middle ground by introducing some structure in the sparsity pattern, most often in the form of blocks. Block sparsity turns out to be very efficient. OpenAI’s block sparse GPU kernels can achieve almost linear speedup with sparsity ratio and uses the hardware almost as efficiently as dense matrix multiplication.

Unfortunately, it is widely observed that structured pruning causes rather severe accuracy degradations, compared to unstructured pruning. This is rather intuitive, as one is imposing constraints on what weights can survive in the pruning process. Instead of acting as some form of regularization unfortunately, it almost always degrades performance. Recent studies have shown that the pruning method that simple magnitude-based unstructured pruning, similar to what Song Han used 4 years ago, still achieves state-of-the-art sparsity ratio/accuracy loss tradeoff consistently across different tasks.

**Wouldn’t it be great if we can just speedup unstructured sparse matrix multiplications?**

In my recent paper on arXiv: https://arxiv.org/abs/2008.11849, I present a code generator called SparseRT to tackle this challenge. It turns out that clever compilation strategies can in effect do a lot of the work of the sparse matrix multiplication at compile time, accelerating the runtime at execution. In particular, if one knows the sparsity pattern of the sparse matrix at compile time, then one could optimize specifically for this sparsity pattern, in effect tailoring the program for this particular sparse matrix.

I show that at 90% sparsity, many deep learning operations such as 1×1 and 3×3 convolutions can be accelerated by 2 to 3x on GPUs. This is acceptable for speeding up deep learning inference, where people typically do not care about how long it takes to compile a model. The speedups on 20 deep learning problems at 90% sparsity are presented in the following figure.

It’s important to note that SparseRT comes with a bunch of disclaimers:

Disclaimer 1: it only works for inference (right now). People have tried to use unstructured sparsity in training scenarios as well. However, in this case, the sparsity structure of the weight matrix tends to change on-the-fly. In addition, SparseRT currently doesn’t support a lot of the sparse gradient computations you’d need for deep learning (which typically take the form of sampled dense-dense matrix multiplication instead of sparse matrix multiplication.)

Disclaimer 2: it can’t use Tensor Cores. The paper presents results in fp32, which some would consider an obsolete data format for deep learning inference. SparseRT has support for fp16 on GPUs with vector instructions, and can typically provide a speedup over SparseRT running fp32. However, it’s very difficult to reconcile unstructured sparsity with dense matrix multiply accelerators like Tensor Cores. However, it’s not completely hopeless, and Tensor Core support should be available in the future.

I should also mention other great efforts being made in this direction, particularly the one made by a team at Google/Stanford: https://arxiv.org/abs/2006.10901.

**Cool. How much end-to-end speedup can SparseRT achieve?**

The paper itself focuses on single operations (sparse 1×1 and 3×3 convolutions). End to end results are understandably more interesting to most practitioners. Those are tricky because end-to-end inference depends on other things than fast 1×1 and 3×3 convolutions. For example, on MobileNets, once the 1×1 convolutions become three times as fast, the depthwise convolutions become the bottleneck. On BERT, once the matrix multiplications involving the weight matrices become fast, the (dense) self attention unit becomes the bottleneck. Of course, one also needs to fuse element-wise layers etc. These tricky details are probably the reason why Nvidia and Intel have whole teams of engineers working on TensorRT and OpenVINO.

Of course, all of those optimizations are compatible with SparseRT’s approach, and it can achieve pretty good end-to-end speedups. I will present the detailed results in more blog posts and follow-up papers. Just to give a sense of the speedups: SparseRT can achieve 2x end-to-end speedup over TensorRT on Jetson Nano in half precision on MobileNets at 90% sparsity, which typically means a 1.5 -2 % accuracy loss. This allows a Jetson Nano to have the same inference throughput as a Jetson TX2, at a third (!) of the price.

**Until next time…**

*This post was originally published by Ziheng Wang at Towards Data Science*