
GPU-accelerated ML inference in Vespa Cloud


In machine learning, computing model inference is a good candidate for being accelerated by special-purpose hardware, such as GPUs. Vespa supports evaluating multiple types of machine-learned models in stateless containers, for example TensorFlow, ONNX, XGBoost and LightGBM models. For many use cases, using a GPU makes it possible to perform model inference with higher performance, and at a lower price point, compared to using a general-purpose CPU.

Today we’re introducing support for GPU-accelerated ONNX model inference in
Vespa, together with support for GPU instances in Vespa Cloud!

Vespa Cloud

If you’re using Vespa Cloud, you can get started with GPU instances in AWS zones by updating the <nodes> configuration in your services.xml file. Our cloud platform will then provision and configure GPU instances automatically, just like regular instances. See the services.xml reference documentation for syntax details and examples.
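
To illustrate, here is a minimal sketch of a container cluster requesting GPU nodes. The resource attributes and the nested gpu element are our assumptions; verify the exact syntax against the services.xml reference documentation linked above.

    <container id="default" version="1.0">
        <model-evaluation/>
        <nodes count="2">
            <!-- Sketch: per node, 4 vCPUs, 16GB memory, 125GB disk and one GPU with 16GB memory.
                 Check attribute and element names in the services.xml reference. -->
            <resources vcpu="4" memory="16Gb" disk="125Gb">
                <gpu count="1" memory="16Gb"/>
            </resources>
        </nodes>
    </container>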

You can then configure which models to evaluate on the GPU in the <model-evaluation> element in services.xml. The GPU device number is specified as part of the ONNX inference options for your model.
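
As an illustration, a per-model GPU option could look something like the sketch below. The element names (onnx, models, model, gpu-device) follow our reading of the stateless model evaluation configuration and should be treated as assumptions; the ONNX inference options reference linked above is authoritative.

    <container id="default" version="1.0">
        <model-evaluation>
            <onnx>
                <models>
                    <model name="my-transformer">
                        <!-- Sketch: evaluate this ONNX model on GPU device 0. -->
                        <gpu-device>0</gpu-device>
                    </model>
                </models>
            </onnx>
        </model-evaluation>
        <!-- nodes configuration as shown above -->
    </container>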

See our pricing page for details on GPU pricing.

Open source Vespa

GPUs are also supported when using open source Vespa. However, when running
Vespa inside a container, special configuration is required to pass GPU devices
to the container engine (e.g. Podman or Docker).

See the Vespa documentation for a tutorial on how to configure GPUs in a Vespa container.
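
As a rough sketch of the kind of configuration involved (assuming the NVIDIA driver and NVIDIA Container Toolkit are already installed on the host; the linked tutorial has the exact, supported steps), Docker can expose the host GPUs to the Vespa container like this:

    # Sketch: start Vespa with all host GPUs exposed to the container.
    docker run --detach --name vespa --hostname vespa-container \
      --publish 8080:8080 --publish 19071:19071 \
      --gpus all \
      vespaengine/vespa

With Podman, GPU devices are typically passed through CDI device specifications instead (for example --device nvidia.com/gpu=all).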

CORD-19 application benchmark

While implementing support for GPUs in Vespa, we wanted to see if we could find a real-world use case demonstrating that a GPU instance can be a better fit than a CPU instance. We decided to run a benchmark of our CORD-19 application – a Vespa application serving the COVID-19 Open Research Dataset. Its source code is available on GitHub.

Our benchmark consisted of a query where the top 30 hits are re-ranked using a 22M parameter Transformer model with batch inference. The measured latency is end-to-end, and includes both retrieval and inference.

See our recent blog post for more information about using a Transformer language model to re-rank results.

We compared the following node configurations:

  • GPU: 4 vCPUs, 16GB memory, 125GB disk, 1 GPU with 16GB memory (Vespa Cloud cost: $1.87/hour)
  • CPU: 16 vCPUs, 32GB memory, 125GB disk (Vespa Cloud cost: $2.16/hour)

Results

Instance | Clients | Re-rank (batch) | Avg. latency (ms) | 95 pct. latency (ms) | QPS  | GPU util (%) | CPU util (%)
GPU      | 1       | 30              | 94                | 102                  | 10.2 | 41           | 15
GPU      | 2       | 30              | 160               | 174                  | 12.5 | 60           | 19
GPU      | 4       | 30              | 212               | 312                  | 18.8 | 99           | 30
CPU      | 1       | 30              | 454               | 473.6                | 2.2  | n/a          | 27
CPU      | 2       | 30              | 708               | 744                  | 2.84 | n/a          | 33
CPU      | 4       | 30              | 1011              | 1070                 | 3.95 | n/a          | 47
CPU      | 8       | 30              | 1695              | 1975                 | 4.73 | n/a          | 72

Conclusion

The GPU on the GPU instance was saturated at 4 clients, with an average end-to-end request latency of 212 ms and a throughput of 18.8 QPS. The CPU instance had a higher average latency of 1011 ms with 4 clients, and a comparatively low throughput of 3.95 QPS.

So, in this example, the average latency is reduced by 79% when using a GPU,
while costing 13% less.
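
(Computed from the 4-client rows in the table and the listed hourly prices: (1011 - 212) / 1011 ≈ 0.79, and (2.16 - 1.87) / 2.16 ≈ 0.13.)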
