August 31, 2023

Executive summary

TL;DR: LoRA makes it possible to dramatically accelerate language model fine-tuning, even without a dedicated GPU cluster.

Parameter-efficient fine-tuning methods such as LoRA are typically aimed at lower-resource settings where GPU memory is scarce. We show that LoRA also makes it possible to scale up data-parallel fine-tuning on nodes that have 100x less GPU-to-GPU bandwidth than a dedicated GPU cluster, making 50x higher training throughput possible on non-interconnected cloud GPU instances.

With 128 non-interconnected g5.xlarge instances, each having one A10G GPU, a LoRA model can be trained 63x faster than with a single A10G, and 17x faster than with a single 40GB A100. Such fast training could unlock fundamentally more interactive and more productive workflows for developing fine-tuned language models. Although GPU-intensive, distributed LoRA fine-tuning is reasonably efficient, preserving close to 50% of single-GPU efficiency even at our largest tested scale of 128 GPUs. With sufficient economies of scale, a commercial service could provide fast fine-tuning at a cost similar to that of single-GPU fine-tuning.

LoRA enables efficient data-parallel fine-tuning at scale, even with little bandwidth. AWS g5.xlarge instances are readily obtainable, but are completely incapable of standard data-parallel language model training; without dedicated interconnect, there is simply not enough GPU-to-GPU bandwidth to communicate the updates of all of the language model’s weights. LoRA drastically reduces the required inter-GPU bandwidth, making efficient data-parallel training possible.

Fine-tuning should be fast

How much faster could language model fine-tuning be?

Compared to other domains of software development that I have experience with, the language model fine-tuning workflow is slow. A developer working on a React application or an API endpoint would be very unhappy if testing each of their code changes took an hour. For developers of fine-tuned language models, though, such a wait is not rare. An AI developer might have an idea for how to better filter their dataset or how to generate a fresh set of training examples, but then be blocked for an hour waiting for their model to train. Why should they have to wait so long?

Why might LoRA enable fast training?

Model training is typically accelerated by distributing the computation across multiple GPUs. Synchronizing the weights of a large model requires very high bandwidth between GPUs, attainable only with specialized networking technologies such as NVLink and InfiniBand. The landmark Megatron-LM work, for example, required 100 GB/s of bandwidth to each server to keep the training-throughput overhead below 25% with 512 GPUs. Such a well-networked GPU cluster is quite difficult to obtain today, given the high demand for any cluster suitable for large-scale language model training.

On-demand cloud GPU instances enjoy far less bandwidth. For example, as measured with iperf3, two independent g5.xlarge EC2 instances in the same availability zone have only 600 to 700 MB/s of bandwidth between them. This is a >100x disadvantage relative to Megatron's bandwidth figure from several years ago. With such low bandwidth, conventional distributed training of language models is not feasible.
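
To see what that bandwidth figure implies, here is a back-of-envelope sketch in Python. The ~650 MB/s link speed, fp16 gradients, and the ~0.5M-parameter LoRA figure are assumptions carried over from this post; real all-reduce traffic and overlap with compute would shift the exact numbers.

```python
# Back-of-envelope: time to move one set of gradients over an EC2-style link,
# assuming ~650 MB/s of bandwidth and fp16 (2-byte) gradients.
# Illustrative figures only, not measurements.

BANDWIDTH_BYTES_PER_S = 650e6   # ~650 MB/s, per the iperf3 estimate above
BYTES_PER_PARAM = 2             # fp16 gradients

full_params = 7e9               # all weights of a 7B-parameter model
lora_params = 0.5e6             # roughly rank-1 LoRA on attention projections

for name, n in [("full fine-tuning", full_params), ("rank-1 LoRA", lora_params)]:
    megabytes = n * BYTES_PER_PARAM / 1e6
    seconds = n * BYTES_PER_PARAM / BANDWIDTH_BYTES_PER_S
    print(f"{name}: {megabytes:.1f} MB per sync, ~{seconds:.3f} s on the link")

# full fine-tuning: 14000.0 MB per sync, ~21.538 s on the link
# rank-1 LoRA:      1.0 MB per sync,     ~0.002 s on the link
```

Even before accounting for collective-communication overhead, spending ~20 seconds per step just moving full-model gradients is a non-starter, whereas a ~1 MB LoRA update is easily affordable.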

With LoRA, however, models can be trained with >1000x fewer trainable parameters; for example, when using a LoRA rank of 1, Llama-2-7B can be fine-tuned with less than 1 million trainable parameters. With such a small number of parameters to synchronize, might standard EC2 instance networking be sufficient for distributed training?
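
As an illustration, the following sketch (assuming the Hugging Face transformers and peft libraries, which this post does not necessarily use, and adapting only the attention query and value projections) configures rank-1 LoRA for Llama-2-7B and reports the trainable parameter count; the exact count depends on which modules are targeted.

```python
# Minimal sketch (not necessarily this post's setup): rank-1 LoRA on Llama-2-7B
# with Hugging Face transformers + peft, adapting only q_proj and v_proj.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)

lora_config = LoraConfig(
    r=1,                                  # LoRA rank
    lora_alpha=16,                        # scaling factor (assumed value)
    target_modules=["q_proj", "v_proj"],  # which linear layers get adapters
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
# With q_proj and v_proj targeted: 2 LoRA matrices x 4096 params each
# x 2 modules x 32 layers = 524,288 trainable parameters -- well under 1 million.
```

Only these adapter parameters receive gradients, so they are all that data-parallel workers would need to synchronize at each step.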

Experimental approach

The key thing to estimate is how fast training can be with LoRA. Training speed can be measured in several different ways. To make measurements relevant to the broadest possible range of fine-tuning use cases, I made a few choices about what to measure:

Experimental details