Oct 14, 2025
5 min read
apertus Team

Local LLMs on the NVIDIA DGX Spark: Performance Test and Alternatives

NVIDIA introduced the DGX Spark with a touch of gold and the marketing slogan “Supercomputer for your desk.” We have reviewed and summarized the community's findings.

The idea is revolutionary: developers should get the same architecture as the giant DGX data centers directly on their desks. With 128 GB Unified Memory and a powerful GB10 chip, this seemed to be the end of memory limits for local Large Language Models (LLMs).

However, as initial detailed benchmarks and critical community reactions show, the golden facade has a decisive catch. We examine the architecture, analyze the performance, and ask: who is this “supercomputer” really built for?

The bottleneck is not the compute core, but the data highway. The DGX Spark suffers from a memory bandwidth that is too low compared to its GPU performance, severely limiting the speed of token generation (decode performance) on large models.

The Architecture Trap: Compute vs. Bandwidth

The DGX Spark is based on the Blackwell chip GB10. It delivers an impressive performance of up to 1 Petaflop (Sparse FP4) of compute. The central selling point is the 128 GB coherent LPDDR5X Unified Memory, which allows even massive models (such as Llama 3.1 70B) to be fully loaded into memory.

The Sweet Spot and the Hard Stop

Benchmarks confirm that the Spark excels at Prefill (processing the initial prompt) and at handling small to medium-sized models (up to roughly 20B parameters). Scaling efficiency across batch sizes is also very good, and software optimizations such as speculative decoding help further.

  • Prefill Performance: Impressive, thanks to strong compute.
  • Memory Size: 128 GB Unified Memory eliminates the VRAM limit for almost all open-source LLMs.

However, the hard stop occurs with Token Generation (Decode Performance) for models from 70B upwards. Here, the limited bandwidth of 273 GB/s comes into play—a value that is simply insufficient for moving the gigantic model weights during continuous inference. The community cynically compares this speed to the P40 era of older data center cards.

An example: Llama 3.1 70B generates at approx. 2.7 tokens per second (tps) on the Spark, while a well-optimized RTX 6000 (with higher bandwidth) can achieve over 240 tps even on a 120B model.
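A quick back-of-envelope calculation makes this bandwidth ceiling tangible: during decode, essentially the full set of model weights has to be streamed from memory for every generated token, so bandwidth divided by weight size gives an upper bound on tokens per second. The weight sizes below are rough assumptions for illustration, not measured values.

```python
# Rough decode ceiling: memory bandwidth / bytes of weights read per token.
# All figures are approximations for illustration only.
bandwidth_gb_s = 273                   # DGX Spark LPDDR5X bandwidth

weight_sizes_gb = {
    "Llama 3.1 70B @ ~8-bit": 70,      # ~1 byte per parameter
    "Llama 3.1 70B @ ~4-bit": 35,      # ~0.5 bytes per parameter
}

for name, size_gb in weight_sizes_gb.items():
    print(f"{name}: theoretical ceiling ~{bandwidth_gb_s / size_gb:.1f} tokens/s")

# ~8-bit: ~3.9 tokens/s, ~4-bit: ~7.8 tokens/s. Measured numbers (such as the
# ~2.7 tps reported above) land below the ceiling because the KV cache,
# activations and framework overhead also consume bandwidth.
```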

💡
Key Insight

The difference between Prefill and Decode: Prefill performance is important when processing very long prompts. Decode performance (Tokens/Second) is crucial for how quickly you receive a response from the model—the most important metric for chat interaction.
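If you want to check both metrics on your own hardware, a small script against a locally running, OpenAI-compatible server (for example llama.cpp, vLLM, or Ollama) is enough. This is only a minimal sketch: the base_url and model name are placeholders for your own setup, and counting one token per streamed chunk is an approximation.

```python
# Measure prefill (time to first token) and decode speed (tokens/s)
# against a local OpenAI-compatible endpoint.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
first_token_at = None
n_tokens = 0

stream = client.chat.completions.create(
    model="llama-3.1-70b-instruct",   # placeholder model name
    messages=[{"role": "user", "content": "Explain memory bandwidth in one paragraph."}],
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # prefill ends at the first token
        n_tokens += 1                             # ~1 token per streamed chunk

end = time.perf_counter()
print(f"Prefill (time to first token): {first_token_at - start:.2f} s")
print(f"Decode: {n_tokens / (end - first_token_at):.1f} tokens/s")
```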


The Direct Showdown: Spark vs. Alternatives

The local AI community agrees: measured by price-to-performance for pure inference, the DGX Spark is far behind established alternatives.

Why Multi-GPU Setups Win

For a similar price to what NVIDIA charges for the DGX Spark, buyers on a tighter budget can build an AI-Box setup with three to four RTX 3090 cards (72–96 GB VRAM in total) or a single RTX 5090 (32 GB GDDR7).

  • Performance Advantage: These setups achieve a decode performance of 90 to 120 tps on models like GPT-OSS 120B—almost ten times the Spark’s performance.
  • Flexibility: Competing setups run on standard x86-64 platforms and offer full compatibility for gaming, other productivity apps, and the broader CUDA ecosystem.
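For the multi-GPU route described above, frameworks such as vLLM shard the weights across cards via tensor parallelism, so the aggregate memory bandwidth of all GPUs is used during decode, exactly where the Spark falls short. A minimal launch sketch, assuming a quantized checkpoint that fits into the combined VRAM (the model ID and GPU count are placeholders):

```python
# Tensor-parallel inference across several consumer GPUs with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",    # placeholder model ID, quantized as needed
    tensor_parallel_size=4,         # shard the weights across 4 GPUs
    gpu_memory_utilization=0.90,
)

params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Why does memory bandwidth limit decode speed?"], params)
print(outputs[0].outputs[0].text)
```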
💡
Key Insight

Before buying hardware, test the performance via inference: on the apertus AI-Platform, many different open-source AI models are available for testing, so you can get an impression of how the performance feels. We have already discussed the differences between local models and cloud inference.

Apple M-Chips as the Secret Winner

Perhaps the sharpest competitor in desktop AI is Apple Silicon.

  • Superior Bandwidth: M-chips have significantly higher memory bandwidth than the 273 GB/s of the Spark, which directly translates into faster token generation. A Mac Studio M4 Max already delivers approx. 60 tps on the GPT-OSS 120B model (depending on the source).
  • Efficiency: Apple offers this in an extremely energy-efficient and quiet environment. The verdict is clear: the “supercomputer” marketing of the Spark came too late, as Apple has already taken the lead in desktop AI inference speed.
Comparison (GPT-OSS 120B):
1. AI-Box Multi-GPUs 3090 → ~100 tps
2. Mac Studio M4 Max → ~60 tps
3. NVIDIA DGX Spark → ~11.66 tps

The Spark’s performance for GPT-OSS 120B is 11.66 tps at Batch 1. This shows that the optimization for FP4/Sparse Compute cannot overcome the fundamental bandwidth limitation.
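To get a feel for Apple Silicon inference before spending money, the mlx-lm package runs quantized open models directly in a Mac's unified memory. A minimal sketch, assuming the placeholder model ID below fits your RAM:

```python
# Run a quantized open model on Apple Silicon with mlx-lm (pip install mlx-lm).
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")  # placeholder ID

prompt = "Summarize why memory bandwidth limits decode speed."
# verbose=True prints prompt (prefill) and generation (decode) tokens/s.
text = generate(model, tokenizer, prompt=prompt, max_tokens=128, verbose=True)
print(text)
```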

Conclusion: The DGX Spark is a Developer Kit

The DGX Spark is, as some commentators rightly noted, primarily a developer kit. It was not built to compete with an RTX 5090 or a multi-GPU setup. Its purpose is to replicate the NVIDIA data center architecture and software stack (SGLang, DGX OS) in mini format.

For companies looking to seamlessly and losslessly scale their locally developed AI models to the large NVIDIA DGX servers in the data center, it is the perfect golden key. However, for anyone seeking maximum inference performance per Euro for local LLM applications, it is overpriced and too slow.

Our Tip: Before investing in high-end AI hardware, analyze your primary use case: Do you need hardware to test your software stack for later use in the data center, or to efficiently and securely run AI models locally? The answer determines whether you should focus on efficiency (AI-Box, Apple) or the ecosystem (DGX Spark).

📚 Sources

NVIDIA DGX Press Release: NVIDIA DGX Spark Arrives for World’s AI Developers, NVIDIA 2025

NVIDIA DGX Spark In-Depth Review (Video): Official review from LMSYS Org, LMSYS Org 2025

DGX Spark Reddit discussion: DGX Spark review with benchmark : r/LocalLLaMA, Reddit

NVIDIA DGX Spark In-Depth Review: In-Depth Benchmarking and Test, Jerry Zhou and Richard Chen 2025

YOUR AI TRANSFORMATION STARTS HERE

Ready for secure AI in your company?

Let us develop an AI solution together that protects your data and increases your productivity. Schedule a free consultation.
