How to Run a Private LLM Server on HF Jobs (1 Command)

How to Run a Private LLM Server on HF Jobs (1 Command)

Ever wished you could spin up a powerful, private Large Language Model (LLM) server with just a single command? On Hugging Face infrastructure, this is not just possible but incredibly straightforward. Using HF Jobs, you can instantly deploy an OpenAI-compatible LLM endpoint, paying only for the compute you use, moment by moment.

This approach is perfect for quick tests, evaluations, or efficient batch generation, providing a dedicated environment without the hassle of server provisioning or Kubernetes. Once live, your private endpoint is accessible from your laptop, a Jupyter notebook, or any authorized location. If you’re eyeing a production-grade, managed solution, Inference Endpoints are typically the better choice, but for rapid prototyping, HF Jobs is a game-changer.

Spin Up Your LLM Server in a Flash

Getting started requires just a few simple prerequisites to ensure smooth deployment. First, make sure you have a valid payment method linked to your Hugging Face account or a positive prepaid credit balance, as Jobs are billed per-minute based on hardware usage. You’ll also need huggingface_hub version 1.20.0 or newer installed, which you can update via `pip install -U “huggingface_hub>=1.20.0″`.

Finally, ensure you’re authenticated locally by running `hf auth login` in your terminal. With these steps completed, you’re ready to launch your server. The core command for launching a job on Hugging Face infrastructure is `hf jobs run`, similar to how `docker run` operates for containers.

To deploy our vLLM server, we’ll use the official vllm/vllm-openai:latest Docker image, request a suitable GPU flavor, and expose vLLM’s default port 8000. Here’s how you can launch a Qwen/Qwen3-4B model on an A10G-Large GPU with a 2-hour timeout: `hf jobs run –flavor a10g-large –expose 8000 –timeout 2h vllm/vllm-openai:latest vllm serve Qwen/Qwen3-4B –host 0.0.0.0 –port 8000`.

After executing the command, the system will provide a unique job ID and a public URL through which your server is reachable. This URL routes the container’s exposed port via Hugging Face’s secure public jobs proxy. Keep a close watch on the job logs; your server is fully live once you see “Application startup complete,” typically after a few minutes for weights to download and the service to initialize.

Interact with Your Private LLM Endpoint

Your newly launched vLLM server is designed to speak the OpenAI API protocol, making integration seamless. Every request to your endpoint requires your Hugging Face token as a bearer token for authentication. The quickest way to test connectivity is with a simple `curl` command:

You can send a chat completion request by specifying your job ID in the URL and including your authorization token. Alternatively, for Python-based applications, the official OpenAI client can be easily configured to point to your custom endpoint. Simply set the `base_url` to your job’s exposed URL and pass your Hugging Face token as the `api_key`.

It’s crucial to understand that your endpoint is gated and not public; every request must carry an HF token with read access to the job’s namespace. This means access is strictly scoped to you or your organization, making it ideal for private internal use. If public access or finer-grained permissions are needed, a dedicated API gateway or Hugging Face Inference Endpoints would be more appropriate.

Before sending heavy workloads, a quick health check can confirm your model is ready: `curl https://–8000.hf.jobs/v1/models -H “Authorization: Bearer $(hf auth token)”` should list your deployed model. Remember, Jobs are billed per second, so always cancel your server using `hf jobs cancel ` when you’re finished. While the `–timeout` acts as a safety net, explicit cancellation is the most cost-effective approach, as an a10g-large instance runs at approximately $1.50/hour.

Scale Up and Expand Your Capabilities

The flexibility of HF Jobs extends to running much larger models, simply by selecting a beefier `–flavor` and utilizing vLLM’s `tensor-parallel-size` argument to shard the model across multiple GPUs. For instance, deploying a colossal Qwen3.5-122B-A10B mixture-of-experts model on two H200 GPUs would involve specifying `–flavor h200x2` and `–tensor-parallel-size 2`.

For such large models, you might also need to adjust `max-model-len` and `max-num-seqs` to manage memory usage effectively. These parameters ensure that even models with immense context lengths, like Qwen3.5-122B, fit within your allocated GPU memory. If you encounter out-of-memory errors during startup, these are the first settings to tweak downwards.

Debugging a persistent startup failure or simply wanting to monitor GPU memory in real-time? HF Jobs offers SSH access directly into your running server. Launch your job with the `–ssh` flag, ensuring your public key is registered in your Hugging Face settings. Then, connect directly using `hf jobs ssh ` to gain a shell inside the container for advanced diagnostics.

Beyond direct querying, your endpoint can power interactive interfaces or advanced agent backends. For example, a few lines of Gradio code can create a simple chat UI, streaming responses from your vLLM server directly to a web browser. Similarly, the powerful Pi coding agent can leverage your self-hosted model, provided you launch vLLM with tool-calling capabilities enabled via `–enable-auto-tool-choice` and a matching `–tool-call-parser`.

Hugging Face Jobs vs. Inference Endpoints: Which to Choose?

While both Hugging Face Jobs and Inference Endpoints facilitate model serving, they cater to different use cases. HF Jobs offers unparalleled flexibility and control; it’s essentially a `docker run` command on Hugging Face’s robust infrastructure. This makes it ideal for experiments, one-off evaluations, or batch generation, where you have full command over the Docker image, vLLM flags, and hardware, all billed per second.

In contrast, Inference Endpoints are designed for production-readiness, providing managed services with operational niceties crucial for long-lived services. These include finer-grained access control (allowing public, protected, or private endpoints) and crucial scale-to-zero functionality, ensuring you’re not billed during periods of inactivity. For durable, always-on endpoints that require robust management and cost optimization for idle periods, Inference Endpoints are the go-to solution.

Source: Hugging Face Blog

Kristine Vior

Kristine Vior

With a deep passion for the intersection of technology and digital media, Kristine leads the editorial vision of HubNextera News. Her expertise lies in deciphering technical roadmaps and translating them into comprehensive news reports for a global audience. Every article is reviewed by Kristine to ensure it meets our standards for original perspective and technical depth.

More Posts - Website

Scroll to Top