Microsoft speeds up PyTorch with DeepSpeed

0
101


Microsoft has released , a new deep learning optimization library for PyTorch, that is designed to reduce memory use and train models with better parallelism on existing hardware.

According to  announcing the new framework, DeepSpeed improves PyTorch model training through a memory optimization technology that increases the number of possible parameters a model can be trained with, makes better use of the memory local to the GPU, and requires only minimal changes to an existing PyTorch application to be useful.

It’s the minimal impact on existing PyTorch code that has the greatest potential impact. As machine learning libraries grow entrenched, and more applications become dependent on them, there is less room for new frameworks, and more incentive to make existing frameworks more performant and scalable.

PyTorch is already fast when it comes to both computational and development speed, but there’s always room for improvement. Applications written for PyTorch can make use of DeepSpeed with only minimal changes to the code; there’s no need to start from scratch with another framework.

One way DeepSpeed enhances PyTorch is by improving its native parallelism. In one example, provided by Microsoft in the DeepSpeed documentation, attempting to train a model using PyTorch’s Distributed Data Parallel system across Nvidia V100 GPUs with 32GB of device memory “[ran] out of memory with 1.5 billion parameter models,” while DeepSpeed was able to reach 6 billion parameters on the same hardware.

Another touted DeepSpeed improvement is more efficient use of GPU memory for training. By partitioning the model training across GPUs, DeepSpeed allows the needed data to be kept close at hand, reduces the memory requirements of each GPU, and reduces the communication overhead between GPUs.

, which refers to tuning the parameters or variables of the training process itself, can improve the accuracy of a model but typically at the cost of manual effort and expertise.

To eliminate the need for expertise and human effort, many machine learning frameworks now support some kind of automated hyperparameter optimization. With DeepSpeed, Microsoft claims that “deep learning models with 100 billion parameters” can be trained on “the current generation of GPU clusters at three to five times the throughput of the current best system.”

, but Azure is not required to use DeepSpeed.