LoRA and Efficient Fine-Tuning (PEFT): Adapting a Model Without Training the Whole Thing

When you want to adapt a large language model to your own task, the obvious path is to retrain it end to end. But updating billions of parameters is expensive and slow, and for most teams it is simply out of reach. This is where LoRA (Low-Rank Adaptation) and the broader PEFT (Parameter-Efficient Fine-Tuning) family come in: you leave the bulk of the model frozen, train small, cheap “adapter” parts, and reach comparable adaptation with far less memory and cost. In this post we explain the intuition behind LoRA, why it works, and how it is used in practice, with an everyday analogy.

The problem: why is training the whole model expensive?

A modern language model holds billions of parameters. Retraining all of them (full fine-tuning) brings two heavy burdens:

  • Memory: During training you must keep not only the parameters but also their gradients and optimizer states (for example the two extra values Adam tracks per parameter) in memory. Together these take several times as much GPU memory as the weights alone, before counting activations.
  • Storage and management: You store a full copy of the model for every task. If you have ten tasks, you have ten giant model files.

In full fine-tuning, the hard part is not “teaching” the model; it is finding the hardware to train it and managing the results.
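
To put a rough number on the memory burden, here is a back-of-the-envelope sketch. It assumes 32-bit values throughout, a hypothetical 7-billion-parameter model, and Adam as the optimizer; activations and framework overhead are ignored, so real usage would be higher.

```python
def full_finetune_memory_gb(n_params, bytes_per_value=4):
    """Rough memory for weights, gradients, and Adam's two moment estimates."""
    weights = n_params * bytes_per_value
    gradients = n_params * bytes_per_value
    adam_states = 2 * n_params * bytes_per_value  # first and second moments
    return (weights + gradients + adam_states) / 1e9


n_params = 7_000_000_000  # hypothetical 7B-parameter model
weights_only_gb = n_params * 4 / 1e9

print(f"weights only:     {weights_only_gb:.0f} GB")                    # 28 GB
print(f"full fine-tuning: {full_finetune_memory_gb(n_params):.0f} GB")  # 112 GB
```

Under these assumptions the training state is four times the size of the weights, which is why the hardware, not the learning, becomes the bottleneck.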

What is PEFT? In one sentence

PEFT (Parameter-Efficient Fine-Tuning) is the approach of freezing the vast majority of a model’s parameters and training only a small subset, or small new parts added on top. The goal is to reach quality close to full fine-tuning at a small fraction of its cost.

PEFT is an umbrella term covering several methods (adapter layers, prefix-tuning, prompt-tuning, and others). The most common and practical one today is LoRA.

The intuition behind LoRA: low-rank adapters

LoRA’s core observation is this: when you adapt a model to a new task, the change the weights need tends to be simple in structure and can be approximated well by a “low-rank” matrix. In other words, you can represent a large change matrix as the product of two much smaller matrices.

In practice, LoRA freezes an existing weight matrix W (of size d_out × d_in) and adds two small matrices next to it: an A (down-projection, r × d_in) and a B (up-projection, d_out × r). During training, only A and B are updated. The layer’s final output is the sum of the original output and this small add-on.

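The magic word here is rank (r). If the weight matrix is very large (say 4096×4096, about 16.8 million values), choosing A and B with a small inner dimension such as r=8 adds only 2 × 4096 × 8 = 65,536 trainable parameters for that matrix. Summed over all adapted layers, the number of trained parameters drops from billions to a few million.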

Tip: Think of rank as “how much freedom you grant.” Lower rank means fewer parameters and faster training; choose it too low and the model may struggle to learn the new task. r=8–16 is a solid starting point for most tasks.
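
The arithmetic behind the rank choice is easy to check. The sketch below counts the parameters LoRA adds to a single square matrix; the helper name `lora_params` is ours, chosen for illustration.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return r * (d_in + d_out)


d = 4096
full = d * d  # size of the frozen matrix
for r in (8, 16):
    added = lora_params(d, d, r)
    print(f"r={r}: {added:,} trainable vs {full:,} frozen ({added / full:.2%})")

# r=8: 65,536 trainable vs 16,777,216 frozen (0.39%)
# r=16: 131,072 trainable vs 16,777,216 frozen (0.78%)
```

Doubling the rank doubles the adapter size, but even at r=16 the adapter is under one percent of the matrix it sits beside.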

The transparency-sheet analogy

Imagine you have an old textbook. It is beautifully printed, but its content is generic. You want to adapt it to your own class. You have two options:

  • Full fine-tuning: Erase every page and rewrite the whole book. It takes enormous time and ink; worse, your version replaces everyone’s copy and you lose the original.
  • LoRA: Lay a transparent overlay sheet on top of the book and write your notes there. The book stays intact; the overlay is small, cheap, and easy to swap. You prepare a different overlay for each class, all working on top of the same book.

LoRA adapters are exactly those overlays: small, portable, and snap-on/snap-off over the same base model.

A short code example

The conceptual Python example below (using NumPy for the matrix operations) shows how LoRA wraps a layer:

import numpy as np

class LoRALinear:
    def __init__(self, W, r):
        self.W = W                                # frozen original weight (not trained)
        d_out, d_in = W.shape
        self.A = np.random.randn(r, d_in) * 0.01  # down-projection: small, trainable
        self.B = np.zeros((d_out, r))             # up-projection: small, trainable

    def forward(self, x):
        # Original path + low-rank correction; x has shape (batch, d_in)
        return x @ self.W.T + (x @ self.A.T) @ self.B.T

# In training only A and B receive gradients;
# per wrapped layer that is r * (d_in + d_out) parameters instead of d_in * d_out.

Note that B is initialized to zero, so at the very start of training the adapter’s effect is zero and the model begins learning without drifting from its original behavior.
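
The zero-initialization claim can be verified directly. This is a minimal sketch that uses plain Python lists so it runs without dependencies; the tiny dimensions and the `matvec` helper are illustrative assumptions, and vectors are treated as columns.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


random.seed(0)
d_in, d_out, r = 6, 4, 2
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, zero init

x = [random.gauss(0, 1) for _ in range(d_in)]
base = matvec(W, x)                   # original path
correction = matvec(B, matvec(A, x))  # low-rank path, all zeros at step 0
adapted = [b + c for b, c in zip(base, correction)]

print(adapted == base)  # True
```

Because B starts at zero, the product of B and A is zero whatever A holds, so the first training step begins from the unmodified model.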

The memory and cost advantage

LoRA’s payoff is not just “fewer parameters”; that shrinkage creates a chain of advantages:

  • Optimizer memory drops: You keep optimizer states only for the small trained adapters. For the frozen body you need to store neither gradients nor optimizer state.
  • Storage shrinks: For a task you store only the adapter weights, which are tiny files next to the full model. Instead of ten giant models for ten tasks, you keep one base model and ten small adapters.
  • Deployment flexibility: You can keep the same base model in memory and swap different adapters in and out on the fly. A single server can serve many specialized behaviors in turn.

In short: the frozen body is shared, and the real savings come from how small the trained part is and from no longer holding gradients and optimizer state for the frozen body.
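
The storage side of this can be estimated with a few lines of arithmetic. Every figure below is a hypothetical assumption chosen for illustration: a 7-billion-parameter base model stored in 16-bit, 32 layers, two adapted 4096×4096 projections per layer, and r=8.

```python
BYTES_FP16 = 2

base_params = 7_000_000_000         # hypothetical base model size
layers, matrices_per_layer = 32, 2  # hypothetical: two attention projections per layer
d, r = 4096, 8

adapter_params = layers * matrices_per_layer * r * (d + d)
adapter_mb = adapter_params * BYTES_FP16 / 1e6
base_gb = base_params * BYTES_FP16 / 1e9

tasks = 10
full_copies_gb = tasks * base_gb                 # one full model per task
shared_gb = base_gb + tasks * adapter_mb / 1e3   # one base plus one adapter per task

print(f"adapter: {adapter_params:,} params, {adapter_mb:.1f} MB")  # 4,194,304 params, 8.4 MB
print(f"{tasks} full copies: {full_copies_gb:.0f} GB")              # 140 GB
print(f"1 base + {tasks} adapters: {shared_gb:.2f} GB")             # 14.08 GB
```

With these numbers, ten adapters together add less than a tenth of a gigabyte on top of the shared base model.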

QLoRA: one step further

LoRA is powerful on its own, but the base model itself still has to fit in memory. This is where QLoRA comes in: it loads the base model in a low-precision (quantized, for example 4-bit) form, then trains ordinary LoRA adapters on top of it.

The result is that model sizes previously feasible only on large clusters can be adapted on a single GPU. The body stays quantized and frozen; only the small adapters are trained, and they are kept at higher precision (typically 16-bit). This was the step that opened the idea of “accessible fine-tuning” to a wide audience.
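
To give a feel for what “4-bit” means, here is a toy sketch of block-wise quantization. QLoRA itself uses a 4-bit NormalFloat (NF4) data type; this example swaps in a simpler uniform absmax scheme, and the weight values are made up.

```python
def quantize_4bit(block):
    """Map floats to integer codes in [-7, 7] using one absmax scale per block."""
    scale = max(abs(v) for v in block) / 7 or 1.0  # fall back to 1.0 for an all-zero block
    codes = [round(v / scale) for v in block]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats for use in the forward pass."""
    return [c * scale for c in codes]


weights = [0.12, -0.70, 0.33, 0.06, -0.41, 0.64, -0.08, 0.27]  # made-up values
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
max_err = max(abs(w - q) for w, q in zip(weights, restored))

print(codes)
print(f"max error {max_err:.3f}, bounded by scale / 2 = {scale / 2:.3f}")
```

Each weight is stored as a small integer code plus one shared scale per block. The frozen base tolerates this rounding error, while the adapters that actually learn stay at higher precision.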

Practical use and tips

When you apply LoRA in your own project, a few practical decisions await you:

  • Which layers should it target? A common approach is to target the projection matrices of the attention layers. Targeting more layers often improves quality; it adds trainable parameters, but the total stays small next to the base model.
  • Rank and alpha: Rank sets the capacity, while alpha sets the adapter’s scale (the low-rank update is multiplied by alpha / r). Low rank suffices for simple tasks; for complex, style-heavy tasks, try increasing the rank.
  • Data quality is everything: Like full fine-tuning, LoRA still depends on good, clean examples. A little high-quality data beats a lot of noisy data.
  • Merging: Once training is done you can permanently merge the adapter into the base model (no extra inference cost), or keep it separate to stay flexible.
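
The merging step can be sketched as follows: folding the scaled update into W gives the same output as running the adapter as a separate path. This is a dependency-free illustration with tiny random matrices; the dimensions, the alpha value, and the helper names are assumptions, and inputs are treated as column vectors.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d_in, d_out, r, alpha = 6, 4, 2, 4
scale = alpha / r  # LoRA scales the low-rank update by alpha / r

W, A, B = rand_matrix(d_out, d_in), rand_matrix(r, d_in), rand_matrix(d_out, r)
x = rand_matrix(d_in, 1)  # one input as a column vector

# Separate adapter: base path plus scaled low-rank path
Wx, BAx = matmul(W, x), matmul(B, matmul(A, x))
separate = [[w[0] + scale * c[0]] for w, c in zip(Wx, BAx)]

# Merged adapter: fold the update into a single weight matrix
BA = matmul(B, A)
W_merged = [[w + scale * u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W, BA)]
merged = matmul(W_merged, x)

max_diff = max(abs(s[0] - m[0]) for s, m in zip(separate, merged))
print(max_diff < 1e-9)  # True
```

After merging, inference is a single matrix multiply per layer, exactly as in the original model. The trade-off is that the merged weights can no longer be swapped out as a small file.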

LoRA made fine-tuning accessible to small teams. Training a small adapter instead of the whole model both speeds up experimentation and lowers cost, which means more experiments and faster learning.

Key takeaways

  • PEFT freezes the model’s body and trains only small parts, reaching near-full-fine-tuning quality at a fraction of the cost.
  • LoRA represents the weight change in low-rank form via two small matrices (A and B); trained parameters drop from billions to millions.
  • The real savings come not just from parameter count but from the vanished gradient and optimizer burden.
  • Adapters are small and portable: one base model + many adapters beats many giant model files.
  • QLoRA adds LoRA on top of a quantized base model, making large models adaptable on a single GPU.

Does LoRA perform as well as full fine-tuning?

On many practical tasks the results are quite close to full fine-tuning, and the gap is usually small next to the memory/cost advantage it brings. Still, it depends on the task, the data, and the chosen rank; in cases that require very broad retraining, full fine-tuning may come out ahead.

Does LoRA slow down inference?

If you keep the adapter separate, there is a small extra computation. But after training you can merge the adapter into the base model, in which case inference runs at the same speed as the original model.

How should I choose the rank value?

Starting with r=8 or r=16 is a solid baseline for most tasks. If the model cannot learn the new task well enough, raise the rank; choosing it higher than necessary can increase both parameter count and the risk of overfitting.


LoRA and PEFT have largely broken the idea that “powerful models are only for teams with big budgets.” With the right adapter strategy, even a small team can adapt a model to its own domain. If you are curious how we use efficient adaptation in Turkish AI products in practice, take a look at EcoFluxion; you can explore how these methods apply in the legal domain through İçtiHub.

İsmail Tarık Şenkal

EcoFluxion Teknoloji A.Ş. · Co-Founder

A developer and entrepreneur working on Turkish-focused AI products — the name behind EcoFluxion and İçtiHub.
