Reachy Mini Goes Fully Local: Boosts Privacy & Control

Get ready to transform your interaction with Reachy Mini! We’re thrilled to announce a significant update that brings the entire conversation stack, from speech recognition to text-to-speech, directly onto your local machine. This means no more reliance on external servers or cloud services for your robot’s voice interactions.

Until now, engaging with your Reachy Mini involved sending your audio data to a remote server. This exciting shift allows for a fully contained, private, and customizable experience, putting you in complete control of your robot’s “brain” and “voice.” It’s all powered by our innovative speech-to-speech pipeline, designed for speed and flexibility.

Unleash Your Reachy Mini: Go Fully Local

Running Reachy Mini’s conversation app entirely locally unlocks a world of benefits, primarily revolving around privacy, control, and unparalleled flexibility. Your data never leaves your machine, ensuring complete confidentiality for all interactions. This setup is ideal for sensitive applications or simply for those who prefer to keep everything in-house.

At the heart of this local setup is our cascaded VAD → STT → LLM → TTS pipeline, exposed through a Realtime API-compatible `/v1/realtime` WebSocket. The `speech-to-speech` repository provides a single command-line interface (CLI) that boots this WebSocket server, which the Reachy Mini conversation app then connects to.
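To make "Realtime API-compatible" a little more concrete, here is a minimal sketch of the kind of JSON message a client might send over the `/v1/realtime` WebSocket. The event name `input_audio_buffer.append` and its base64 `audio` field come from the OpenAI Realtime API; that the local server accepts exactly this event is an assumption, and `audio_append_event` is a hypothetical helper, not part of `speech-to-speech`.

```python
import base64
import json


def audio_append_event(pcm16_chunk: bytes) -> str:
    """Wrap raw PCM16 audio bytes in a Realtime-style JSON event (illustrative)."""
    return json.dumps({
        # Event name from the OpenAI Realtime API; assumed to be accepted here.
        "type": "input_audio_buffer.append",
        # Binary audio travels as base64 text inside the JSON message.
        "audio": base64.b64encode(pcm16_chunk).decode("ascii"),
    })


# Two 16-bit samples of silence, just to show the round trip.
message = audio_append_event(b"\x00\x00\x00\x00")
decoded = json.loads(message)
print(decoded["type"])
```

The message is only built here, not sent; a real client would push it through a WebSocket connection to the server.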

Quick Start: Setting Up Your Local Conversation Stack

Getting started is straightforward. First, you’ll need to set up your Large Language Model (LLM) locally. We recommend llama.cpp for its ease of use and broad compatibility across different operating systems.

To serve the LLM, install `llama.cpp` (e.g., via `brew` or `winget`) and run the server. The first launch downloads the model, such as `gemma-4-E4B-it-GGUF`; subsequent launches skip the download and start significantly faster.

Next, install the `speech-to-speech` library using `uv pip install speech-to-speech`. With the LLM server running in one terminal, you can then launch `speech-to-speech` in local mode to start conversing directly through your terminal. This initial setup also involves a quick download of Parakeet and Qwen3TTS for speech processing.
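Because the voice engine depends on the LLM server already being up, it can help to wait for the server's port before launching the second terminal. The following is a small sketch using only the Python standard library; `wait_for_port` is a hypothetical helper, and port `8080` in the usage note is assumed to be the `llama-server` default on your machine.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 30.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)  # server not up yet; retry shortly
    return False
```

For example, `wait_for_port("127.0.0.1", 8080)` returns `True` as soon as the LLM server accepts connections. An open port does not guarantee that the model has finished loading, so treat this as a coarse readiness check.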

Once both `llama.cpp` and `speech-to-speech` are active, connecting your Reachy Mini is the final step. Start the robot’s desktop app and launch the conversation app. Within the UI, simply select the local mode by clicking “edit connection” in the HF backend, and you’re ready to engage with your fully local robot!

The Power of Cascades: Customizing Your Voice Pipeline

The beauty of a cascaded voice pipeline lies in its modularity. It breaks down the voice interaction into four distinct stages: Voice Activity Detection (VAD), Speech-to-Text (STT), Large Language Model (LLM), and Text-to-Speech (TTS). This architecture offers unparalleled flexibility, allowing you to swap out components as new, more performant models emerge weekly.

While you have complete freedom to customize, we provide robust defaults for three of these stages to get you started quickly. Our choices are carefully balanced for performance and quality, optimized for multilingual capabilities, but can be tailored to your specific needs.

VAD (Voice Activity Detection): We use silero-vad, a robust and efficient model for accurately detecting when speech begins and ends, crucial for real-time interactions.
STT (Speech-to-Text): Our default is Parakeet-TDT, chosen for its excellent balance of speed and accuracy across multiple languages.
TTS (Text-to-Speech): We integrate Qwen3TTS, which delivers high-quality, natural-sounding speech synthesis for a lifelike conversational experience.
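The modularity described above can be sketched as four interchangeable callables. This is an illustration of the cascade idea only, not the actual `speech-to-speech` internals: the `Cascade` class and its stage signatures are hypothetical, and the stages are toy stand-ins rather than silero-vad, Parakeet-TDT, or Qwen3TTS.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Cascade:
    """Four swappable stages; each is a plain callable (hypothetical interface)."""
    vad: Callable[[bytes], Optional[bytes]]  # speech segment, or None for silence
    stt: Callable[[bytes], str]              # audio -> transcript
    llm: Callable[[str], str]                # transcript -> reply text
    tts: Callable[[str], bytes]              # reply text -> audio

    def step(self, audio_chunk: bytes) -> Optional[bytes]:
        segment = self.vad(audio_chunk)
        if segment is None:
            return None  # no speech detected, nothing to answer
        return self.tts(self.llm(self.stt(segment)))


# Toy stand-ins so the sketch runs without any model downloads.
pipeline = Cascade(
    vad=lambda chunk: chunk if any(chunk) else None,
    stt=lambda audio: f"{len(audio)} bytes heard",
    llm=lambda text: f"You said: {text}",
    tts=lambda text: text.encode("utf-8"),
)

print(pipeline.step(b"\x01\x02\x03"))
print(pipeline.step(b"\x00\x00"))
```

Swapping a stage means replacing one field, for example `dataclasses.replace(pipeline, llm=other_llm)`, while the other three stay untouched.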

The LLM layer is where you’ll find the most significant impact on system latency and overall performance. You have two primary approaches for deploying your LLM: running it inside the `speech-to-speech` process itself (in-process) or connecting to a separate inference engine, local or remote, over the Responses API.

Flexible LLM Deployment: Responses API vs. In-Process

To tackle the bottleneck of LLM inference latency, our `speech-to-speech` engine supports a decoupled architecture using the Responses API protocol. This means your LLM can live in a separate process or even on a remote server, communicating with the voice loop over HTTP.

Responses API Examples

This approach allows for incredible flexibility, enabling you to leverage various LLM serving solutions. Here are a few popular configurations:

Llama.cpp with `speech-to-speech`: Run `llama-server` in one terminal and `speech-to-speech` in another, pointing to the `llama.cpp` endpoint. This is a great starting point for local LLM inference.
vLLM with `speech-to-speech`: For high-throughput and optimized serving, use vLLM (version 0.21.0 or newer for full tool-call streaming support). Launch your vLLM server with specific flags like `--enable-auto-tool-choice` and then configure `speech-to-speech` to connect to its API.
Hugging Face Inference Endpoints: Seamlessly integrate with any chat model deployed as a managed GPU endpoint on Hugging Face. Simply point your `speech-to-speech` configuration to your endpoint URL and provide your HF token.
Hugging Face Inference Providers: For even simpler access to powerful models, utilize Hugging Face’s Inference Providers. This routes your request to a third-party backend (e.g., Together, Fireworks) via a single, convenient URL.
OpenAI (or Compatible Providers): Need access to frontier models with zero infrastructure overhead? Point your `speech-to-speech` engine to OpenAI or any OpenAI-compatible provider like OpenRouter or Fireworks by simply swapping the base URL and API key.
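What "swapping the base URL and API key" amounts to can be sketched with the standard library. The request body (`model`, `input`) follows the shape of the OpenAI Responses API. Several things here are assumptions for illustration: the `build_responses_request` helper, the model names, the placeholder key, and the local URL, which assumes a `llama-server` on its usual port `8080` that exposes a Responses route. The request is built but never sent.

```python
import json
import urllib.request
from typing import Optional


def build_responses_request(base_url: str, model: str, user_text: str,
                            api_key: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a POST to <base_url>/responses."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Hosted providers expect a bearer token; a local server may not need one.
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps({"model": model, "input": user_text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/responses",
        data=body,
        headers=headers,
        method="POST",
    )


# Same code path, two backends: only the base URL and key differ.
local = build_responses_request("http://127.0.0.1:8080/v1", "local-model", "Hello, Reachy!")
hosted = build_responses_request("https://api.openai.com/v1", "some-hosted-model",
                                 "Hello, Reachy!", api_key="sk-placeholder")
print(local.full_url)
print(hosted.full_url)
```

Keeping the backend behind a single base URL is what lets the voice loop stay unchanged when the LLM moves from a local process to a managed endpoint.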

Running the LLM In-Process

For tightly integrated and often lower-latency solutions, you can run the LLM directly within the `speech-to-speech` process. This eliminates network overhead between the voice loop and the LLM.

Local LLM on MLX (Apple Silicon): If you’re on a Mac, MLX offers an incredibly low-friction way to run models like `Qwen3-4B-Instruct-2507` with impressive speed, making conversations feel instant.
Local LLM on Transformers (CUDA / CPU / MPS): For CUDA-enabled machines, Linux environments, or those who prefer swapping models freely without reconverting weights, the vanilla Transformers backend is an excellent choice. You can point it at any supported Hugging Face model, such as Gemma, Qwen, or Mistral variants.

Seamless Integration: Laptop to Robot

If you’re running the voice engine on your laptop and the conversation app on a Reachy Mini Wireless, the only adjustment needed is the connection URL. Make sure the engine binds to an address reachable on your LAN (not just `127.0.0.1`), then enter your laptop’s IP address in the robot’s UI when selecting the local backend.

Finding your local IP address is usually straightforward. On macOS, use `ipconfig getifaddr en0` (or `en1`, depending on which interface is active). On Linux, `hostname -I` lists your IP addresses, and on Windows, `ipconfig` in the command prompt displays the “IPv4 Address” under your active adapter. Look for a private address such as `192.168.x.x` or `10.x.x.x`.
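The same lookup can be done programmatically. The sketch below guesses the laptop's LAN address, checks that it is a private, non-loopback address, and formats the WebSocket URL for the `/v1/realtime` path mentioned earlier. All three helpers are hypothetical, and the port in the usage note is a placeholder, since the port depends on how you launched the engine.

```python
import ipaddress
import socket


def guess_lan_ip() -> str:
    """Best-effort LAN address of this machine; falls back to loopback."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            # Connecting a UDP socket sends no packets; it only picks the outbound interface.
            s.connect(("192.0.2.1", 80))  # documentation-only address, never contacted
            return s.getsockname()[0]
    except OSError:
        return "127.0.0.1"


def is_private_lan(ip: str) -> bool:
    """True for private addresses (e.g. 192.168.x.x, 10.x.x.x) that are not loopback."""
    addr = ipaddress.ip_address(ip)
    return addr.is_private and not addr.is_loopback


def realtime_url(host: str, port: int) -> str:
    """Format the WebSocket URL the robot's UI would point at."""
    return f"ws://{host}:{port}/v1/realtime"
```

For example, `realtime_url(guess_lan_ip(), 8765)` yields something like `ws://192.168.1.20:8765/v1/realtime`, where `8765` stands in for whatever port your engine actually listens on. If `is_private_lan` returns `False` for the guessed address, the robot is unlikely to reach the laptop over the local network.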

You now have a complete, fully local voice loop for your Reachy Mini, offering unparalleled privacy, control, and customization. This setup empowers you to experiment with cutting-edge open-source models and tailor your robot’s intelligence to your exact specifications.

We encourage you to star the huggingface/speech-to-speech and pollen-robotics/reachy_mini_conversation_app repositories. Share your experiences and let us know which open-source cascade you’ve built for your robot in the community discussions!

Source: Hugging Face Blog

Kristine Vior