Introduction

Low-Rank Adaptation (LoRA) is a method for fine-tuning very large deep learning models in a parameter-efficient way. As language models grow, with GPT-4 reportedly reaching around 1.8 trillion parameters, full fine-tuning becomes impractical due to the massive computational resources it requires.

Background

LoRA (Low-Rank Adaptation)

Instead of updating a full weight matrix, LoRA learns a low-rank update to it. For a 200x200 weight matrix, it is far cheaper to fit two small matrices, B (200x2) and A (2x200), whose product BA represents the weight update: together they hold 800 trainable parameters instead of 40,000, a 50x reduction. The original matrix stays frozen; only B and A are trained.

Implementation:
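The following is a minimal sketch of the idea above in PyTorch; the class name LoRALinear, the rank, and the alpha value are illustrative choices, not the original author's code or the PEFT library's API:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update B @ A."""

    def __init__(self, in_features: int, out_features: int, rank: int = 2, alpha: float = 4.0):
        super().__init__()
        # Frozen pre-trained weight (randomly initialized here for the sketch).
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # Low-rank factors: A projects down to `rank`, B projects back up.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank  # the alpha/rank scaling described in the aside below

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the scaled low-rank update.
        return x @ self.weight.T + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

# For the 200x200 example: 800 trainable parameters instead of 40,000.
layer = LoRALinear(200, 200, rank=2)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 800
```

Initializing B to zero means the adapted layer starts out identical to the frozen one, so training begins exactly from the pre-trained behavior.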

Results

Benefits of LoRA

- Far fewer trainable parameters, so fine-tuning needs much less GPU memory and storage.
- The low-rank update BA can be merged back into the original weights after training, so inference adds no extra latency (unlike adapter layers).
- Adapter checkpoints are small, and many task-specific adapters can share a single frozen base model.

QLoRA (Quantized LoRA)

Combining quantization techniques with LoRA further reduces hardware requirements. QLoRA quantizes the frozen pre-trained weights to 4 bits while training the LoRA adapters in higher precision, cutting memory usage and computational cost enough to fine-tune large models on a single GPU.
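As a hedged sketch of how this is commonly set up with the Hugging Face stack (transformers, bitsandbytes, and peft); the model id and hyperparameters below are placeholders, not values from the original text:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base model (QLoRA-style).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters are trained in higher precision on top of the 4-bit weights.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # a common choice; depends on the architecture
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```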

Conclusion

LoRA presents an efficient method for fine-tuning large language models by leveraging low-rank decomposition, balancing computational efficiency with performance. Hugging Face's PEFT library and techniques like QLoRA enhance its practical applicability, making it feasible to train large models on a single GPU.

<aside> 💡 FLOPS (Floating-Point Operations Per Second): FLOPS measure how fast hardware executes floating-point operations. Modern GPUs deliver far higher throughput at lower numerical precision, so training with smaller-precision formats speeds things up both by performing more operations per second and by moving less data through memory.

</aside>

<aside> 💡 Adapter Layers:

Introduced in a 2019 Google research paper, adapter layers insert small new modules between the layers of a pre-trained model. They allow fine-tuning of internal model representations while the base model stays frozen, but the extra modules must run sequentially at inference time, which increases latency.

Prefix Tuning:

A lightweight alternative that optimizes a continuous prompt: task-specific vectors are prepended to the model's input, adding context without changing any weights. This approach offers only limited control over the model, can be difficult to optimize, and the prefix consumes part of the usable sequence length.

</aside>

<aside> 💡 Scaling Factor (Alpha): The output of B and A is scaled by alpha divided by the rank before being added to the original weights, which controls how strongly the update shifts the pre-trained behavior. For example, with rank r = 8 and alpha = 16, the update BA is multiplied by 16/8 = 2. This balancing helps preserve the pre-trained model's knowledge while adapting to new tasks.

</aside>