Skip to content

What is LoRA?

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that allows you to adapt large language models (LLMs) without modifying all the model parameters. Instead of updating billions of parameters, LoRA introduces small "adapter" matrices that can be trained efficiently.

This one weird trick ...

LoRA's key trick is that a large matrix can be created by multiplying two smaller matricies. This reduces the number of trainable parameters from M x N down to about M + N. For example, in the animation below a 5 x 2 matrix (10 elements) is multiplied by a 2 x 8 matrix (16 elements). That's 10 + 16 = 26 numbers. But the result is a 5 x 8 matrix which is 40 numbers.

lora multiplcation

Key Benefits

  • Memory Efficient: Only ~1% of parameters need training
  • Fast Training: Significantly reduced training time (minutes to hours, not days to weeks)
  • Modular: Adapters can be swapped without retraining the base model
  • Consumer Hardware Friendly: Can run on GPUs with as little as 8GB RAM

Modular adapters

Modular adapters are essential for fast inference. The large, base LLM can be loaded into memory once. The smaller LoRA adapter weights can be quickly loaded and unloaded based on the inference task.

How LoRA Works

Instead of updating the full weight matrix \(W\), LoRA decomposes the weight update into two smaller matrices:

\[W' = W + \Delta W = W + \frac{\alpha}{r} \cdot A \cdot B\]

Where:

  • \(A \in \mathbb{R}^{d \times r}\) and \(B \in \mathbb{R}^{r \times k}\) are low-rank matrices
  • \(r\) is the rank (typically 8, 16, 32, or 64)
  • \(\alpha\) is a scaling parameter
  • Only \(A\) and \(B\) are trained, keeping the original weights \(W\) frozen

trick

Mathematical Intuition

The core insight is that weight updates during fine-tuning often have low "intrinsic rank". Instead of learning a full \(d \times k\) matrix update, we can approximate it as the product of two much smaller matrices:

  • Matrix \(A\): \(d \times r\) (where \(r \ll d\))
  • Matrix \(B\): \(r \times k\) (where \(r \ll k\))

This reduces parameters from \(d \times k\) to \(r \times (d + k)\).

Analogs to LoRA?

This is similar (though not mathematically equivalent) to using SVD or PCA to reduce a large matrix’s dimensionality by projecting it into a lower-dimensional subspace.

Similar tricks are used for Winograd convolution and depthwise convolution.

Example

For a typical attention layer with \(d = k = 4096\) and rank \(r = 16\):

  • Full fine-tuning: \(4096 \times 4096 = 16.7M\) parameters
  • LoRA: \(16 \times (4096 + 4096) = 131K\) parameters
  • Reduction: 99.2% fewer parameters!

LoRA Configuration Parameters

Rank (\(r\))

The intrinsic dimensionality of the weight updates:

  • Lower values (8-16): Fewer parameters, faster training, less expressive
  • Higher values (32-64): More parameters, slower training, more expressive
  • Sweet spot: Usually 16-32 for most tasks

Alpha (\(\alpha\))

Scaling factor controlling the magnitude of LoRA updates:

  • Typical values: 16, 32, 64
  • Common pattern: \(\alpha = 2 \times r\)
  • Higher values: Stronger adaptation, risk of overfitting

Target Modules

Which parts of the model to adapt:

  • Attention layers:
  • q_proj = Query projection
  • k_proj = Key projection
  • v_proj = Value projection
  • o_proj = Output projection

  • Feed-forward layers: gate_proj, up_proj, down_proj

  • Common choice: Only attention layers for efficiency

Comparison with Other Methods

Method Parameters Memory Speed Quality
Full Fine-tuning 100% High Slow Excellent
LoRA 0.1-1% Low Fast Very Good
Other Adapters 2-4% Medium Medium Good
Prefix Tuning 0.01% Very Low Very Fast Limited

When to Use LoRA

Ideal for:

  • Limited computational resources
  • Quick experimentation
  • Domain adaptation
  • Multiple task variants
  • Rapid prototyping

Consider alternatives when:

  • Maximum performance is critical
  • Unlimited computational resources
  • Completely new capabilities needed
  • Large-scale architectural changes required

Next: Learn about the prerequisites for getting started with LoRA fine-tuning.