August 31, 2023

Executive summary

TL;DR: LoRA makes it possible to dramatically accelerate language model fine-tuning, even without a dedicated GPU cluster.

Parameter-efficient fine-tuning methods such as LoRA are typically aimed at lower-resource settings where GPU memory is scarce. We show that LoRA also makes it possible to scale up data-parallel fine-tuning on nodes that have 100x less GPU-to-GPU bandwidth than a dedicated GPU cluster, making 50x higher training throughput possible on non-interconnected cloud GPU instances.

With 128 non-interconnected g5.xlarge instances, each having one A10G GPU, a LoRA model can be trained 63x faster than with a single A10G, and 17x faster than with a single 40GB A100. Such fast training could unlock fundamentally more interactive and more productive workflows for developing fine-tuned language models. Although GPU-intensive, distributed LoRA fine-tuning is reasonably efficient, preserving close to 50% of single-GPU efficiency even at our largest tested scale of 128 GPUs. With sufficient economies of scale, a commercial service could provide fast fine-tuning at a cost similar to that of single-GPU fine-tuning.

LoRA enables efficient data-parallel fine-tuning at scale, even with little bandwidth. AWS g5.xlarge instances are readily obtainable, but are completely incapable of standard data-parallel language model training; without dedicated interconnect, there is simply not enough GPU-to-GPU bandwidth to communicate the updates of all of the language model’s weights. LoRA drastically reduces the required inter-GPU bandwidth, making efficient data-parallel training possible.

Fine-tuning should be fast

How much faster could language model fine-tuning be?

Compared to other domains of software development that I have experience with, the language model fine-tuning workflow is slow. A developer working on a React application or an API endpoint would be very unhappy if testing each of their code changes took an hour. For developers of fine-tuned language models, though, such a wait is not rare. An AI developer might have an idea for how to better filter their dataset or how to generate a fresh set of training examples, but then be blocked for an hour waiting for their model to train. Why should they have to wait so long?

Why might LoRA enable fast training?

Model training is typically accelerated by distributing the computation across multiple GPUs. Synchronizing the weights of a large model requires very high bandwidth between GPUs, attainable only with specialized networking technologies such as NVLink and InfiniBand. The landmark Megatron-LM work, for example, required 100 GB/s of bandwidth to each server to keep the training-throughput overhead below 25% with 512 GPUs. Such a well-networked GPU cluster is quite difficult to obtain today, given the high demand for any cluster suitable for large-scale language model training.

On-demand cloud GPU instances enjoy far less bandwidth. For example, as measured with iperf3, two independent g5.xlarge EC2 instances in the same availability zone have only 600 to 700 MB/s of bandwidth between them. This is a >100x disadvantage relative to Megatron's bandwidth figure from several years ago. With such low bandwidth, conventional distributed training of language models is not feasible.
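
To see what that bandwidth figure implies, here is a back-of-envelope sketch in Python. The ~650 MB/s link speed, fp16 gradients, and the ~0.5M-parameter LoRA figure are assumptions carried over from this post; real all-reduce traffic and overlap with compute would shift the exact numbers.

```python
# Back-of-envelope: time to move one set of gradients over an EC2-style link,
# assuming ~650 MB/s of bandwidth and fp16 (2-byte) gradients.
# Illustrative figures only, not measurements.

BANDWIDTH_BYTES_PER_S = 650e6   # ~650 MB/s, per the iperf3 estimate above
BYTES_PER_PARAM = 2             # fp16 gradients

full_params = 7e9               # all weights of a 7B-parameter model
lora_params = 0.5e6             # roughly rank-1 LoRA on attention projections

for name, n in [("full fine-tuning", full_params), ("rank-1 LoRA", lora_params)]:
    megabytes = n * BYTES_PER_PARAM / 1e6
    seconds = n * BYTES_PER_PARAM / BANDWIDTH_BYTES_PER_S
    print(f"{name}: {megabytes:.1f} MB per sync, ~{seconds:.3f} s on the link")

# full fine-tuning: 14000.0 MB per sync, ~21.538 s on the link
# rank-1 LoRA:      1.0 MB per sync,     ~0.002 s on the link
```

Even before accounting for collective-communication overhead, spending ~20 seconds per step just moving full-model gradients is a non-starter, whereas a ~1 MB LoRA update is easily affordable.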

With LoRA, however, models can be trained with >1000x fewer trainable parameters; for example, when using a LoRA rank of 1, Llama-2-7B can be fine-tuned with less than 1 million trainable parameters. With such a small number of parameters to synchronize, might standard EC2 instance networking be sufficient for distributed training?
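
As an illustration, the following sketch (assuming the Hugging Face transformers and peft libraries, which this post does not necessarily use, and adapting only the attention query and value projections) configures rank-1 LoRA for Llama-2-7B and reports the trainable parameter count; the exact count depends on which modules are targeted.

```python
# Minimal sketch (not necessarily this post's setup): rank-1 LoRA on Llama-2-7B
# with Hugging Face transformers + peft, adapting only q_proj and v_proj.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)

lora_config = LoraConfig(
    r=1,                                  # LoRA rank
    lora_alpha=16,                        # scaling factor (assumed value)
    target_modules=["q_proj", "v_proj"],  # which linear layers get adapters
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
# With q_proj and v_proj targeted: 2 LoRA matrices x 4096 params each
# x 2 modules x 32 layers = 524,288 trainable parameters -- well under 1 million.
```

Only these adapter parameters receive gradients, so they are all that data-parallel workers would need to synchronize at each step.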

Experimental approach

The key thing to estimate is how fast training can be with LoRA. Training speed can be measured in several different ways. To make measurements relevant to the broadest possible range of fine-tuning use cases, I made a few choices about what to measure:

Experimental details