How Ettin Rerankers Boost Your Embedder Performance

Today marks an exciting release: six brand-new Sentence Transformers CrossEncoder rerankers, setting new benchmarks for their respective sizes. Built upon the powerful Ettin ModernBERT encoders, these models come complete with the data and comprehensive training recipe used to create them.

Each reranker was trained with a distillation recipe: pointwise Mean Squared Error (MSE) against scores from the mixedbread-ai/mxbai-rerank-large-v2 teacher model, computed over the cross-encoder/ettin-reranker-v1-data dataset. This dataset is a blend of lightonai/embeddings-pre-training and a reranked subset of lightonai/embeddings-fine-tuning.
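To make the objective concrete, here is a minimal sketch of what a pointwise MSE distillation loss computes for a batch of (query, document) pairs. The function name and the scores are made up for illustration; the released training recipe remains the reference for the actual implementation.

```python
def pointwise_mse(student_scores, teacher_scores):
    """Mean squared error between student and teacher relevance scores."""
    if len(student_scores) != len(teacher_scores):
        raise ValueError("score lists must have the same length")
    if not student_scores:
        raise ValueError("need at least one score")
    squared_errors = [(s - t) ** 2 for s, t in zip(student_scores, teacher_scores)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical scores for three (query, document) pairs.
teacher = [10.0, 2.0, -4.0]
student = [9.0, 3.0, -2.0]
loss = pointwise_mse(student, teacher)
print(loss)  # 2.0
```

Because each pair is scored on its own, the student learns to reproduce the teacher's raw score for that pair rather than only the relative order of documents.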

We evaluated all six rerankers paired with a range of embedders, including google/embeddinggemma-300m, so you can judge how each combination performs in practice.

For those eager to dive into the “why” behind rerankers, we’ll explore their fundamental role and how they enhance embedder performance. If you’re ready to integrate these models, jump straight to the usage section. And for the ambitious, our training guide reveals how you can build your own bespoke rerankers.

What is a Reranker, and Why Pair One with an Embedder?

A reranker, also known as a pointwise cross-encoder, is a neural model that takes a (query, document) pair and outputs a single relevance score. Unlike embedding models, which encode queries and documents independently into separate vectors, a reranker lets both texts attend to each other across every transformer layer.

This joint encoding process significantly boosts accuracy. However, this increased precision comes with a trade-off: higher computational cost. The model must be executed once for every (query, document) pair, making it impractical for scanning an entire document corpus.

To strike a balance between accuracy and efficiency, the industry standard is a “retrieve-then-rerank” pipeline. First, a fast embedding model quickly identifies the top-K candidate documents. Then, a highly accurate cross-encoder efficiently reorders only those K candidates, providing a much-improved final ranking at a manageable total cost. Throughout this article, “reranker” and “cross-encoder” are used interchangeably to refer to this powerful tool.
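The cost argument comes down to simple counting, since a cross-encoder needs one forward pass per (query, document) pair. The sketch below uses a hypothetical helper and a hypothetical corpus of one million documents to compare scanning everything against reranking only the top 100 candidates.

```python
def cross_encoder_passes(num_queries, corpus_size, top_k=None):
    """Count cross-encoder forward passes (one per (query, document) pair)."""
    # Without a retriever, every document in the corpus must be scored.
    docs_per_query = corpus_size if top_k is None else min(top_k, corpus_size)
    return num_queries * docs_per_query


# Hypothetical corpus size, chosen only for illustration.
full_scan = cross_encoder_passes(num_queries=1, corpus_size=1_000_000)
reranked = cross_encoder_passes(num_queries=1, corpus_size=1_000_000, top_k=100)
print(full_scan, reranked, full_scan // reranked)  # 1000000 100 10000
```

The embedder still has to score the whole corpus, but document embeddings can be computed once and reused for every query, while cross-encoder scores cannot.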

Seamless Usage with Ettin Rerankers

Our newly released models are standard Sentence Transformers CrossEncoder models, making them incredibly easy to integrate into your projects. You can get started with just three lines of Python code, enabling quick predictions for relevance scores.

For a given query and a list of candidate documents, you can use the .rank method to effortlessly obtain sorted indices and scores. This functionality streamlines the process of identifying the most relevant documents from a larger pool.

from sentence_transformers import CrossEncoder
model = CrossEncoder("cross-encoder/ettin-reranker-32m-v1")
scores = model.predict([
    ("Where was Apple founded?", "Apple Inc. was founded in Cupertino, California in 1976 by Steve Jobs, Steve Wozniak, and Ronald Wayne."),
    ("Where was Apple founded?", "The Fuji apple is an apple cultivar developed in the late 1930s and brought to market in 1962."),
])
print(scores) # [11.393298 2.968891] <- larger means more relevant

ranked = model.rank(
    query="Which planet is known as the Red Planet?",
    documents=[
        "Venus is often called Earth's twin because of its similar size and proximity.",
        "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
        "Jupiter, the largest planet in our solar system, has a prominent red spot.",
        "Saturn, famous for its rings, is sometimes mistaken for the Red Planet.",
    ],
    top_k=4,
    return_documents=True,
)
for r in ranked:
    print(f"({r['score']:.2f}): {r['text']}")
# (10.82): Mars, known for its reddish appearance, is often referred to as the Red Planet.
# (9.86): Saturn, famous for its rings, is sometimes mistaken for the Red Planet.
# (8.55): Jupiter, the largest planet in our solar system, has a prominent red spot.
# (6.21): Venus is often called Earth's twin because of its similar size and proximity.

You can easily swap out cross-encoder/ettin-reranker-32m-v1 for any other size within the Ettin family to optimize for quality versus speed. All six models boast support for up to 8K tokens of context, a crucial feature for efficiently reranking long documents, thanks to ModernBERT's advanced long-context pre-training.

For peak throughput, we recommend installing the necessary kernels and setting model_kwargs={"dtype": "bfloat16", "attn_implementation": "flash_attention_2"}. This can deliver a speedup of 1.7x to 8.3x over default loading, depending on the model size and sequence length, as detailed in our Speed section.

from sentence_transformers import CrossEncoder
model = CrossEncoder(
    "cross-encoder/ettin-reranker-32m-v1",
    model_kwargs={"dtype": "bfloat16", "attn_implementation": "flash_attention_2"},
)

Building an End-to-End Retrieve-Then-Rerank Pipeline

To illustrate the full power of the retrieve-then-rerank paradigm, let's look at a complete example combining a fast embedder for initial retrieval with an Ettin reranker for precise final ordering. This pipeline mimics the architecture of most modern search systems, where retrieval casts a wide net and reranking refines the results.

from sentence_transformers import SentenceTransformer, CrossEncoder

# Step 1: Initialize a fast embedder for efficient retrieval
embedder = SentenceTransformer("sentence-transformers/static-retrieval-mrl-en-v1")
# Step 2: Initialize an Ettin reranker for precise reordering
reranker = CrossEncoder("cross-encoder/ettin-reranker-68m-v1")

corpus = [
    "Apple Inc. was founded in Cupertino, California in 1976 by Steve Jobs, Steve Wozniak, and Ronald Wayne.",
    "The Fuji apple is an apple cultivar developed in the late 1930s.",
    "Steve Jobs introduced the iPhone in 2007 at Macworld.",
    "Macintosh computers were sold by Apple from 1984 onward.",
    # ... thousands or millions more documents in a real production environment
]
query = "Where was Apple founded?"

# Encode the query and corpus for similarity search
query_emb = embedder.encode_query(query, convert_to_tensor=True)
corpus_emb = embedder.encode_document(corpus, convert_to_tensor=True)

# Retrieve the top-K candidates (e.g., top 100) based on embedding similarity
scores = embedder.similarity(query_emb, corpus_emb)[0]
top_k_idx = scores.topk(min(100, len(corpus))).indices.tolist()

# Extract the actual top-K documents
top_k_docs = [corpus[i] for i in top_k_idx]

# Rerank the top-K documents using the Ettin reranker
ranked = reranker.rank(query, top_k_docs, top_k=5, return_documents=True)

# Print the final reranked results
for r in ranked:
    print(f"({r['score']:.2f}): {r['text']}")
# (11.63): Apple Inc. was founded in Cupertino, California in 1976 by Steve Jobs, Steve Wozniak, and Ronald Wayne.
# (4.71): Steve Jobs introduced the iPhone in 2007 at Macworld.
# (1.96): The Fuji apple is an apple cultivar developed in the late 1930s.
# (1.49): Macintosh computers were sold by Apple from 1984 onward.

This retrieve-then-rerank pattern is foundational to modern search engines. The initial retriever efficiently filters a vast corpus, narrowing down the potential results, while the reranker precisely reorders these candidates. This two-stage approach ensures both speed and high relevance for the end-user.

Under the Hood: Architecture and Performance

All six Ettin rerankers share the same architecture and differ only in the size of their backbone encoder. These backbones come from Johns Hopkins University's Ettin suite of ModernBERT-style models. Key features include unpadded attention, RoPE positional encodings, and GeGLU activation functions. The encoders were pre-trained on 2 trillion tokens of open-license data and support up to 8192 tokens of context.

On top of each encoder sits a 4-module classification head (Pooling, Dense, LayerNorm, Dense), mirroring the design of ModernBertForSequenceClassification but built from Sentence Transformers' modular components. The underlying Transformer is a plain AutoModel rather than AutoModelForSequenceClassification. This design choice lets us use sequence unpadding for variable-length inputs, which is essential for Flash Attention 2. The result is a speedup of 1.7x to 8.3x over fp32+SDPA for medium-document sequence lengths.

Transformer (Flash Attention 2)
Pooling (CLS)
Dense (Hidden, Hidden, bias=False, GELU)
LayerNorm (Hidden)
Dense (Hidden, 1, scores)
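To show how these modules compose, here is a dependency-free sketch of the head's forward pass on a toy input. It is an illustration under assumptions, not the released implementation: the function names are hypothetical, the weights are made up, LayerNorm's learned scale is omitted, and whether the final Dense layer carries a bias is not stated here, so it is left as an optional argument.

```python
import math


def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def layer_norm(vec, eps=1e-5):
    # Normalize to zero mean and unit variance (learned scale omitted).
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def head_score(token_states, w_dense, w_out, b_out=0.0):
    # Pooling (CLS): keep only the first token's hidden state.
    cls = token_states[0]
    # Dense (Hidden, Hidden, bias=False) followed by GELU.
    hidden = [gelu(sum(w * x for w, x in zip(row, cls))) for row in w_dense]
    # LayerNorm (Hidden).
    normed = layer_norm(hidden)
    # Dense (Hidden, 1): project to a single relevance score.
    return sum(w * x for w, x in zip(w_out, normed)) + b_out


# Toy example with hidden size 3 and two tokens; all numbers are made up.
token_states = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
w_dense = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w_out = [1.0, 0.0, -1.0]
score = head_score(token_states, w_dense, w_out)
print(f"{score:.3f}")
```

Only the first token's state reaches the head, so any information from the rest of the sequence must already have been mixed into that token by the Transformer's attention layers.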

In ablation studies, CLS pooling consistently outperformed mean pooling. Although ModernBERT uses global attention only every third layer, the empirical evidence suggests that these few global layers provide enough signal to make CLS the better pooling choice.

All six Ettin reranker models are released under the Apache 2.0 license, matching the license of their base Ettin encoders.

Benchmarking Results: MTEB(eng, v2) Retrieval

We rigorously evaluated each Ettin reranker on the comprehensive MTEB(eng, v2) Retrieval benchmark, which includes 10 diverse tasks with top-100 reranking. Following MTEB's two-stage reranking flow, we paired each reranker with six different embedding models, covering a broad spectrum of speed and quality profiles.

The dashed "retriever-only" line in our charts represents the baseline performance without reranking; any reranker scoring below this line makes the pipeline worse on average than using the retriever alone.

The smallest model in our lineup, the 17M Ettin reranker, surpasses the 33M ms-marco-MiniLM-L12-v2 by +0.051 NDCG@10 on MTEB (0.5576 vs 0.5066) and +0.038 on NanoBEIR (0.6746 vs 0.6369), while using roughly half the parameters. Similarly, our 32M model outperforms the 568M BAAI/bge-reranker-v2-m3 by +0.025 NDCG@10 on MTEB (0.5779 vs 0.5526), despite the roughly 17x parameter difference. If you currently rely on older MiniLM rerankers, upgrading to our 17M or 32M models is a straightforward quality boost.

Moving up the scale, our 150M model is the strongest reranker we tested in the sub-600M category on MTEB. It slightly edges out the recent Qwen/Qwen3-Reranker-0.6B (596M) by +0.005 NDCG@10 (0.5994 vs 0.5940) and outperforms all BAAI bge-reranker variants by 0.03 to 0.05. The 68M model reaches 0.5915 NDCG@10, nearly matching Qwen3-Reranker-0.6B with roughly a ninth of its parameters.

At the top of the released range, our 1B model closely matches its teacher, the 1.54B mxbai-rerank-large-v2. It comes within 0.0001 NDCG@10 on MTEB (0.6114 vs 0.6115) and within 0.008 on NanoBEIR, even though the teacher is 54% larger. In other words, distillation closes almost the entire gap to the teacher, which validates our training approach.

While the Qwen/Qwen3-Reranker-4B currently holds the top spot at 0.6367 MTEB NDCG@10, our 1B model remains a highly practical choice for most retrieve-then-rerank applications, offering compelling performance at a quarter of the parameters. To close the remaining performance gap, future iterations might explore distilling from an even stronger teacher model.
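For readers less familiar with the metric behind these numbers, the following is a minimal sketch of NDCG@10 for a single query. It assumes the linear-gain variant and assumes the input list contains the relevance labels of all judged documents for that query, in ranked order; the benchmark's own evaluation code remains the reference.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k graded relevances."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical query with one relevant document among three candidates.
before = ndcg_at_k([0, 1, 0])  # retriever puts the relevant document second
after = ndcg_at_k([1, 0, 0])   # reranker moves it to the top
print(round(before, 4), after)  # 0.6309 1.0
```

The benchmark score is this per-query value averaged over all queries in a task, and then averaged across tasks.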

Optimizing for Speed

For a reranker, quality is only half the equation; the other half is speed. A reranker has to fit within the tight latency budget between initial retrieval and presenting results to users, so we measured the throughput of every model.

Our benchmarks were conducted on a single NVIDIA H100 80GB GPU, comparing all six Ettin rerankers against thirteen strong public baselines up to approximately 1B parameters. We used queries and documents from sentence-transformers/natural-questions, which reflects real-world document length distributions (mostly short, with some longer ones). Documents were truncated to max_length=512 to ensure a fair comparison with older models.

Each model was configured to use its optimal attention implementation: Flash Attention 2 for architectures that support it (BERT, XLM-RoBERTa, ModernBERT, Qwen2), SDPA for others, and eager attention for DeBERTa-v2 (which currently lacks FA2 or SDPA support in transformers). For every model, an auto-batch search started at batch size 8 and progressively doubled until GPU memory limits were reached. We recorded the median throughput from three timed passes at each batch size, mitigating the impact of any single outlier run.
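The measurement procedure described above can be sketched as follows. This is an illustration rather than the actual benchmark script: run_batch is a hypothetical callable that scores one batch, MemoryError stands in for a GPU out-of-memory error, and max_batch is an added safety cap that the description does not mention.

```python
import statistics
import time


def median_throughput(run_batch, batch_size, passes=3):
    """Median pairs/second over several timed passes at one batch size."""
    rates = []
    for _ in range(passes):
        start = time.perf_counter()
        run_batch(batch_size)
        elapsed = time.perf_counter() - start
        rates.append(batch_size / max(elapsed, 1e-9))
    return statistics.median(rates)


def auto_batch_search(run_batch, start=8, max_batch=4096):
    """Double the batch size from `start` until memory runs out."""
    results = {}
    batch_size = start
    while batch_size <= max_batch:
        try:
            results[batch_size] = median_throughput(run_batch, batch_size)
        except MemoryError:  # stand-in for a GPU out-of-memory error
            break
        batch_size *= 2
    return results


def fake_run_batch(batch_size):
    # Hypothetical workload: pretend anything above 64 pairs does not fit.
    if batch_size > 64:
        raise MemoryError("simulated out-of-memory")
    sum(i * i for i in range(batch_size * 1000))


results = auto_batch_search(fake_run_batch)
print(sorted(results))  # [8, 16, 32, 64]
```

Taking the median of three passes means a single slow or fast run cannot determine the reported throughput for a batch size.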

Source: Hugging Face Blog

Kristine Vior