Hugging Face adds one-command vLLM deployment on serverless Jobs

Hugging Face streamlines private vLLM hosting

Hugging Face has introduced a faster way to launch vLLM servers on its serverless Jobs platform, allowing developers to start a private, OpenAI-compatible model endpoint with a single command. The setup removes the need to provision machines or manage Kubernetes, and it bills by the second based on hardware usage.

The company says the workflow is aimed at quick experiments, tests, evaluation runs, and batch generation. For teams that need a managed production service, Hugging Face points users to its separate Inference Endpoints offering.

The new setup uses the hf jobs run command together with the official vllm/vllm-openai container image. Users select a GPU class with the --flavor flag, expose the server port with --expose, and set a timeout for the job. In the example provided by Hugging Face, a Qwen model is started on an A10G-based instance and exposed through the platform’s jobs proxy.

Once the server is running, the endpoint can be queried from a laptop, notebook, or other client using the OpenAI-compatible API that vLLM provides. Requests require a Hugging Face access token, which the company says also serves as the access gate for the job. The endpoint is not public by default, and browser access without a valid token is rejected.

Hugging Face also published examples showing how to connect to the server with curl or the OpenAI Python client. The workflow is the same as working with a standard OpenAI-style endpoint, except that the base URL points to the jobs proxy address generated for the running container.

The company says the endpoint is intended to stay private to the user or organization that launched it. It advises users not to treat the jobs URL as a public link and to avoid sharing tokens in untrusted environments. Those who need more granular access control or a public API should place another gateway in front of the service.

Larger models, UI access and agent use

The announcement also describes how the same process can be used for larger models by choosing a bigger GPU configuration and adjusting vLLM’s tensor parallelism. Hugging Face highlights an example using a much larger Qwen model on two H200 GPUs, along with tuning parameters that reduce memory pressure when a model’s default settings would otherwise exceed available resources.

For users who prefer a graphical interface, Hugging Face showed how a Gradio app can connect to the same running endpoint and stream reasoning output separately from the final response. The company also outlined how developers can open an SSH session into a live job for debugging, memory checks, or interactive inspection, provided they have registered an SSH key and use a recent version of the Hugging Face Hub client.

The post further suggests the server can serve as the backend for coding agents. In that setup, the model needs tool calling enabled and a matching tool-call parser. Hugging Face provided an example of configuring the endpoint for use with Pi, a provider-agnostic agent harness, by adding the running job as a custom provider.

The pricing model is another part of the pitch. Jobs are billed in short increments, and Hugging Face says users should stop the server when they are finished to avoid unnecessary cost. The company recommends choosing the smallest GPU flavor that can support the model and checking its hardware pricing list before launching larger workloads.

Overall, the release positions serverless Jobs as a quick path to standing up private model infrastructure without the overhead of a full deployment stack, while still leaving production-grade, managed serving to Hugging Face’s dedicated endpoint product.