Introduction

Low-Rank Adaptation (LoRA) is a method for fine-tuning very large deep learning models in a parameter-efficient way. As language models grow, with GPT-4 reportedly reaching around 1.8 trillion parameters, full fine-tuning becomes impractical due to the massive computational resources it requires.

Background

LoRA (Low-Rank Adaptation)

Instead of updating a full weight matrix, LoRA learns a low-rank update to it. For a 200x200 weight matrix, it is far cheaper to fit two small matrices, B (200x2) and A (2x200), whose product BA represents the weight update: together they hold 800 trainable parameters instead of 40,000, a 50x reduction. The original matrix stays frozen; only B and A are trained.

Implementation:
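The following is a minimal sketch of the idea above in PyTorch; the class name LoRALinear, the rank, and the alpha value are illustrative choices, not the original author's code or the PEFT library's API:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update B @ A."""

    def __init__(self, in_features: int, out_features: int, rank: int = 2, alpha: float = 4.0):
        super().__init__()
        # Frozen pre-trained weight (randomly initialized here for the sketch).
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # Low-rank factors: A projects down to `rank`, B projects back up.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank  # the alpha/rank scaling described in the aside below

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the scaled low-rank update.
        return x @ self.weight.T + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

# For the 200x200 example: 800 trainable parameters instead of 40,000.
layer = LoRALinear(200, 200, rank=2)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 800
```

Initializing B to zero means the adapted layer starts out identical to the frozen one, so training begins exactly from the pre-trained behavior.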

Results

Benefits of LoRA

- Far fewer trainable parameters, so fine-tuning needs much less GPU memory and storage.
- The low-rank update BA can be merged back into the original weights after training, so inference adds no extra latency (unlike adapter layers).
- Adapter checkpoints are small, and many task-specific adapters can share a single frozen base model.

QLoRA (Quantized LoRA)

Combining quantization techniques with LoRA further reduces hardware requirements. QLoRA quantizes the frozen pre-trained weights to 4 bits while training the LoRA adapters in higher precision, cutting memory usage and computational cost enough to fine-tune large models on a single GPU.
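As a hedged sketch of how this is commonly set up with the Hugging Face stack (transformers, bitsandbytes, and peft); the model id and hyperparameters below are placeholders, not values from the original text:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base model (QLoRA-style).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters are trained in higher precision on top of the 4-bit weights.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # a common choice; depends on the architecture
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```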

Conclusion

LoRA presents an efficient method for fine-tuning large language models by leveraging low-rank decomposition, balancing computational efficiency with performance. Hugging Face's PEFT library and techniques like QLoRA enhance its practical applicability, making it feasible to train large models on a single GPU.

<aside> 💡 FLOPS (Floating-Point Operations Per Second): FLOPS measure how fast hardware executes floating-point operations. Modern GPUs deliver far higher throughput at lower numerical precision, so training with smaller-precision formats speeds things up both by performing more operations per second and by moving less data through memory.

</aside>

<aside> 💡 Adapter Layers:

Introduced in a 2019 Google research paper, adapter layers insert small new modules between the layers of a pre-trained model. They allow fine-tuning of internal model representations while the base model stays frozen, but the extra modules must run sequentially at inference time, which increases latency.

Prefix Tuning:

A lightweight alternative that optimizes a continuous prompt: task-specific vectors are prepended to the model's input, adding context without changing any weights. This approach offers only limited control over the model, can be difficult to optimize, and the prefix consumes part of the usable sequence length.

</aside>

<aside> 💡 Scaling Factor (Alpha): The output of B and A is scaled by alpha divided by the rank before being added to the original weights, which controls how strongly the update shifts the pre-trained behavior. For example, with rank r = 8 and alpha = 16, the update BA is multiplied by 16/8 = 2. This balancing helps preserve the pre-trained model's knowledge while adapting to new tasks.

</aside>