Why LoRA Isn't Always Best: Explore Better Fine-Tuning

When you’re looking to fine-tune a model in a memory-efficient way, it’s easy to default to LoRA. After all, it’s the undisputed champion of Parameter-Efficient Fine-Tuning (PEFT) techniques, widely adopted and deeply integrated into various frameworks. But what if there’s a better option for your specific use case?

This article dives deep into the world beyond LoRA, exploring whether this popular technique truly offers the best performance or if its dominance is simply a result of its early popularity. We’ll examine the tools available to make informed decisions and how broadening your perspective beyond LoRA could unlock significant advantages.

Understanding Parameter-Efficient Fine-Tuning (PEFT)

In today’s AI landscape, countless open models are readily available, yet they often fall short of specific application requirements. While prompt engineering can help, it rarely suffices for highly specialized tasks. Instead of building a new model from the ground up, fine-tuning an existing one is generally a more practical approach.

However, traditional fine-tuning is resource-intensive: besides the weights themselves, training has to hold gradients and optimizer states, which amounts to several copies of the entire model in memory. While quantization can reduce a model’s footprint, quantized models typically can’t be fine-tuned directly. This challenge led to the emergence of PEFT techniques, designed to drastically cut down the memory needed for fine-tuning.
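The memory cost of full fine-tuning can be made concrete with rough arithmetic. The sketch below is an estimate under stated assumptions, not a measurement: it assumes 32-bit values, an Adam-style optimizer that keeps two moment estimates per trainable parameter, and it ignores activations. The 7-billion-parameter model and the 20-million-parameter adapter are hypothetical sizes chosen for illustration.

```python
def training_memory_gb(total_params, trainable_params, bytes_per_value=4):
    """Rough memory for weights, gradients and Adam moments (activations ignored)."""
    weights = total_params * bytes_per_value             # every weight is stored
    gradients = trainable_params * bytes_per_value       # one gradient per trainable weight
    adam_states = 2 * trainable_params * bytes_per_value # two moment estimates each
    return (weights + gradients + adam_states) / 1e9


base_params = 7_000_000_000   # hypothetical base model size
adapter_params = 20_000_000   # hypothetical adapter size

# Full fine-tuning: every parameter is trainable.
full_gb = training_memory_gb(base_params, base_params)

# Parameter-efficient fine-tuning: base weights are frozen, only the adapter trains.
peft_gb = training_memory_gb(base_params + adapter_params, adapter_params)

print(f"full fine-tuning: {full_gb:.1f} GB")
print(f"adapter only:     {peft_gb:.1f} GB")
```

Under these assumptions, full fine-tuning needs about 112 GB, while training only the adapter needs about 28 GB, nearly all of it for the frozen base weights. Quantizing the base model would shrink that remaining term further.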

With PEFT, you can fine-tune models using just a fraction of the usual memory, even working with previously quantized models. These techniques offer several additional benefits, including smaller checkpoint sizes, enhanced resistance to catastrophic forgetting, and the ability to deploy multiple fine-tuned versions from a single base model.

Hugging Face develops the PEFT library, which provides a unified API for numerous PEFT techniques and integrates seamlessly with popular ecosystems like Transformers and Diffusers. It also supports various quantization methods, making parameter-efficient fine-tuning more accessible than ever. Whether you’re fine-tuning on custom data or researching new PEFT methods, the PEFT library is an excellent starting point.

LoRA: The Reigning Queen of Fine-Tuning

Among the many parameter-efficient fine-tuning techniques, “Low Rank Adaptation,” or LoRA, emerged early and quickly proved its effectiveness. It operates by adding a small number of parameters on top of the base model, freezing the original weights, and training only these newly introduced parameters.
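To make the mechanism concrete, here is a minimal pure-Python sketch of a single LoRA-adapted linear layer. It follows the commonly described formulation, in which the frozen weight matrix W is augmented by a low-rank product B·A scaled by alpha/r, with B initialized to zero. The layer sizes, rank, and alpha are hypothetical, and this is an illustration rather than the PEFT library's implementation.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x), where r is the adapter rank."""
    r = len(A)
    scale = alpha / r
    base = matvec(W, x)                # frozen path
    update = matvec(B, matvec(A, x))   # trainable low-rank path
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_count(d_out, d_in, r):
    """Number of trainable values in A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


random.seed(0)
d_out, d_in, r, alpha = 6, 8, 2, 4   # hypothetical sizes

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, zero init
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B at zero, the adapted layer reproduces the base layer exactly.
y_initial = lora_forward(W, A, B, x, alpha)
y_base = matvec(W, x)

# Parameter savings for a hypothetical 4096 x 4096 layer at rank 16.
full_count = 4096 * 4096
adapter_count = lora_param_count(4096, 4096, 16)
print(f"trainable fraction: {adapter_count / full_count:.2%}")
```

Because only A and B receive gradients, the optimizer state shrinks with them, and the small A and B matrices are all that needs to be saved per fine-tuned variant.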

LoRA has become the most widely adopted PEFT technique by a significant margin. While exact figures vary, estimates suggest that its usage far surpasses any other method in both public repositories and research papers.

This widespread adoption could indicate that LoRA is genuinely the superior choice for most applications. However, another possibility exists: LoRA’s early popularity may have created a self-reinforcing cycle. Its high visibility, abundance of tutorials, and excellent support in downstream packages could be driving its continued dominance, rather than purely its performance.

This raises a crucial question: Are we inadvertently sacrificing potential performance gains by overlooking other, potentially better techniques? Many researchers publish papers claiming their novel methods outperform LoRA. Is this not compelling enough reason to explore beyond the familiar?

The Challenges of Choosing PEFT Based on Research Papers

Dozens of papers propose fine-tuning techniques that aim to surpass LoRA. The PEFT library alone includes over 40 distinct techniques, not even counting their numerous variations. Almost all of these papers claim their method beats LoRA on specific benchmarks.

However, relying solely on these academic claims can be problematic. Researchers often face pressure to demonstrate results that outperform existing benchmarks, which can unintentionally introduce bias. For instance, a study found that LoRA could match supposedly superior techniques simply by careful learning rate tuning, suggesting that comparisons aren’t always conducted under optimal conditions for all methods.

Further complicating matters, each paper selects a unique set of PEFT techniques for comparison and employs different benchmarks. Even when comparing the same technique on the same benchmark, the code is frequently unavailable or difficult to execute, making independent reproduction of results challenging. This makes it incredibly hard to confidently determine the best PEFT technique for your specific needs based solely on published research.

Hugging Face’s Approach to Objective PEFT Benchmarking

At Hugging Face, we recognized the need to help users make more informed decisions about PEFT techniques. Our PEFT library already provides a unified API for many methods, and the next logical step was to offer objective benchmarks.

We’ve maintained a benchmark for fine-tuning Large Language Models (LLMs) on a math dataset for some time. This benchmark evaluates how well an LLM can learn chain-of-thought reasoning to solve mathematical problems and adapt its output format, starting from a base model that isn’t instruction-tuned.

To broaden our findings across modalities, we also introduced an image generation benchmark. This test assesses a model’s ability to learn a new concept, such as a “cat plushy,” and then generate it in various new contexts without forgetting its existing knowledge. All PEFT techniques are evaluated under identical conditions, using the same base model, dataset, training code, and hardware.

Beyond just test performance, we track vital metrics like VRAM usage, catastrophic forgetting/drift, runtime, and checkpoint size, catering to diverse user needs. Our benchmarks are designed to run on consumer-grade hardware, and adding a new experiment only requires a new PEFT configuration and a script execution. Because every technique runs under the same conditions, we believe these benchmarks offer an objective view of their effectiveness.

Our Findings: LoRA is Good, But Not Always the Best

After running our comprehensive benchmarks, we concluded that while LoRA performs admirably, other PEFT methods can indeed surpass it in one or more areas. This means exploring alternatives is definitely worthwhile.

When interpreting results, it’s helpful to consider tradeoffs, such as the balance between test performance and memory consumption. A PEFT technique lies on the Pareto frontier if no other technique is at least as good on both metrics and strictly better on one of them. Moving along the frontier, better accuracy costs more memory, and vice versa.

For the LLM Math dataset benchmark, LoRA does reside on the Pareto frontier when considering test accuracy versus memory. It achieves 53.2% test accuracy with a peak VRAM usage of 22.6 GB. However, other techniques also populate this frontier. For example, BEFT reaches 32.9% accuracy with only 20.2 GB of memory, while Lily achieves 54.9% accuracy but requires 25.6 GB. Depending on your priorities, LoRA might not offer the optimal tradeoff.

It’s also crucial to note that even LoRA’s strong performance on this task comes from enhanced versions, not plain vanilla LoRA. Rank-stabilized LoRA (rsLoRA), which scales LoRA’s contribution by alpha/sqrt(r) instead of alpha/r, provides excellent test accuracy (53.2%). LoRA-FA, which uses a specialized optimizer, is more memory efficient (20.2 GB). Standard LoRA achieves only 48.1% accuracy at 22.5 GB, about five points below the rank-stabilized variant at nearly the same memory, which suggests the optimized variants are preferable on this task.
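Assuming the rank-stabilized variant mentioned above is rsLoRA, the only change from standard LoRA is the factor that scales the adapter's contribution: alpha/sqrt(r) rather than alpha/r. The short sketch below compares the two; the alpha value and the ranks are hypothetical. In the PEFT library this behavior is selected with the `use_rslora` option of `LoraConfig`.

```python
import math


def lora_scale(alpha, r):
    """Standard LoRA scaling: shrinks linearly as the rank grows."""
    return alpha / r


def rslora_scale(alpha, r):
    """Rank-stabilized scaling: shrinks only with the square root of the rank."""
    return alpha / math.sqrt(r)


alpha = 16  # hypothetical value
for r in (4, 16, 64, 256):
    print(f"r={r:>3}  standard={lora_scale(alpha, r):.4f}  "
          f"rank-stabilized={rslora_scale(alpha, r):.4f}")
```

At small ranks the two factors are close, but at higher ranks the standard factor shrinks much faster, which damps the adapter's contribution more strongly.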

Turning to the image generation benchmark, which aims to teach the model a new concept like a “cat plushy” and generalize it to new prompts, we find a different picture. The primary metric here is “dino similarity,” which measures how closely generated images match a holdout test dataset; higher values indicate closer resemblance. As before, we weigh it against memory usage.

On this task, LoRA falls below the Pareto frontier. Specifically, LoRA achieves a similarity score of 0.697 while requiring 9.97 GB of memory. In contrast, OFT (Orthogonal Fine-Tuning) scores 0.708 similarity with just 9.01 GB of memory. This indicates that OFT strictly dominates LoRA on these metrics for this particular task.
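The Pareto comparison used for both benchmarks can be sketched in a few lines. The example below uses only the figures quoted in this article, so the frontier it computes covers just these techniques, not the full benchmark. The dictionary layout and function names are illustrative choices, not part of the benchmark code.

```python
def dominates(a, b):
    """True if a is at least as good as b on both metrics and strictly better on one."""
    no_worse = a["score"] >= b["score"] and a["mem_gb"] <= b["mem_gb"]
    strictly_better = a["score"] > b["score"] or a["mem_gb"] < b["mem_gb"]
    return no_worse and strictly_better


def pareto_frontier(results):
    """Keep every result that no other result dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


# Math benchmark: test accuracy (%) and peak VRAM (GB), as quoted above.
math_results = [
    {"name": "LoRA (rank-stabilized)", "score": 53.2, "mem_gb": 22.6},
    {"name": "BEFT", "score": 32.9, "mem_gb": 20.2},
    {"name": "Lily", "score": 54.9, "mem_gb": 25.6},
]

# Image benchmark: similarity score and memory (GB), as quoted above.
image_results = [
    {"name": "LoRA", "score": 0.697, "mem_gb": 9.97},
    {"name": "OFT", "score": 0.708, "mem_gb": 9.01},
]

math_frontier = [r["name"] for r in pareto_frontier(math_results)]
image_frontier = [r["name"] for r in pareto_frontier(image_results)]

print("math frontier: ", math_frontier)
print("image frontier:", image_frontier)
```

On the math figures, all three techniques remain on the frontier because each trades accuracy against memory. On the image figures, only OFT remains, since it scores higher than LoRA while using less memory.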

Remember to always check other PEFT methods near the Pareto frontier, as small variations can occur due to randomness. Furthermore, consider other metrics relevant to you, such as runtime performance or checkpoint size, as these can significantly alter which technique appears most favorable. For image generation, don’t forget to visually inspect generated samples to gauge the fine-tuned model’s capabilities.

Limitations and Workarounds

One valid criticism of any benchmark, including ours, is the choice of hyperparameters, which might inadvertently favor one technique. While exhaustive and fair hyperparameter sweeps across so many techniques are challenging, we’ve made it easy for anyone to contribute their own experiments to PEFT. If you believe a technique can be improved with different hyperparameters, we encourage you to submit a pull request.

Another limitation is that benchmarks cannot fully capture every capability of a PEFT technique. We offer comparisons across many dimensions, but some unique features, like Cartridges’ ability to compress long prompts, aren’t measured. Other factors, such as the model’s base architecture or the specific dataset’s characteristics, can also influence the optimal choice. While our benchmarks offer valuable pointers, they don’t eliminate the need for your own specific research and testing.

A common hurdle with non-LoRA PEFT techniques is their limited support in downstream packages like vLLM, which primarily load LoRA checkpoints. Thankfully, the PEFT library now offers a solution: the ability to convert other adapters into LoRA format. This allows you to fine-tune with a superior non-LoRA method and then convert the adapter to LoRA for deployment with tools that only support it.

We’ve tested this conversion process by transforming an image adapter trained with the GraLoRA technique into a standard LoRA checkpoint. The test score changed only slightly after conversion (similarity 0.702 → 0.694). This functionality lets you use the best PEFT technique for training without sacrificing compatibility for deployment.

Source: Hugging Face Blog

Kristine Vior
