Undervolting Your GPU for Local Inference: Lower Heat, Same Tokens/sec

TL;DR

Thorsten Meyer AI published a guide arguing that GPU power limits and undervolting should be an early tuning step for high-power local AI workstations. The site cites RTX 4090 data showing large heat reductions with smaller throughput losses, while warning that results vary by card, model, quantization, and workload.

Thorsten Meyer AI has published a GPU tuning guide for local inference that says users can cut heat, power draw, and fan noise by applying power limits or undervolting, with measured RTX 4090 examples showing smaller losses in tokens per second than in watts.

The guide presents power limiting as the first step for owners of high-power AI workstations, ahead of buying a new cooler, changing a case, or rearranging fans. It says the simplest method is to lower a GPU’s power limit, such as moving a card from stock behavior to about 70% power, using MSI Afterburner on Windows or tools such as nvidia-smi or LACT on Linux.
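
On Linux, the power-limit step can be sketched with nvidia-smi. The snippet below is a minimal sketch under stated assumptions, not the guide's own script: the helper names are hypothetical, the 70% default mirrors the guide's example, and clamping to the card's minimum limit is an added safety assumption. It builds the command rather than running it, since setting a limit normally needs root.

```python
import subprocess

def target_watts(default_limit_w, min_limit_w, fraction=0.70):
    """Wattage for a fractional cap, never below the card's minimum limit."""
    return max(round(default_limit_w * fraction), round(min_limit_w))

def power_limit_command(gpu_index, watts):
    """Build the nvidia-smi command that sets a power limit in watts."""
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]

def read_limits(gpu_index=0):
    """Query the default and minimum power limits (watts) via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=power.default_limit,power.min_limit",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    default_w, min_w = (float(x) for x in out.strip().split(","))
    return default_w, min_w
```

On a machine with an NVIDIA driver installed, `power_limit_command(0, target_watts(*read_limits(0)))` returns the argument list to run with elevated privileges.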

According to the source material, a sustained RTX 4090 workload kept 93.4% of tokens-per-second performance at a 70% power limit while cutting draw to 300 watts and lowering GPU temperature to 67 degrees Celsius. The same table lists stock operation at 390 watts, 72 degrees Celsius, and 100% speed, while a 60% cap is shown at 260 watts, 62 degrees Celsius, and 91.5% speed.
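
The trade-off in those figures is easier to see as arithmetic. The sketch below uses only the three rows quoted above; the function name and the "speed per watt relative to stock" metric are illustrative choices, not necessarily the efficiency column the guide's table uses.

```python
# Figures as reported in the guide's RTX 4090 table: (watts, relative speed in %).
REPORTED = {
    "stock": (390, 100.0),
    "70% cap": (300, 93.4),
    "60% cap": (260, 91.5),
}

def compare_to_stock(rows, baseline="stock"):
    """Return power saved, speed lost, and relative speed-per-watt for each row."""
    base_w, base_speed = rows[baseline]
    result = {}
    for name, (watts, speed) in rows.items():
        result[name] = {
            "power_saved_pct": round(100 * (1 - watts / base_w), 1),
            "speed_lost_pct": round(base_speed - speed, 1),
            "efficiency_vs_stock": round((speed / watts) / (base_speed / base_w), 2),
        }
    return result

for name, stats in compare_to_stock(REPORTED).items():
    print(name, stats)
```

On the reported numbers, the 70% cap trades about 23% less power for 6.6% less throughput, and the 60% cap trades about 33% less power for 8.5% less throughput.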

The guide distinguishes between power limiting and undervolting. It describes power limiting as a one-slider change that restricts the card rather than pushing it harder. Undervolting, by contrast, changes the voltage-frequency curve directly and is described as requiring more care, with the site suggesting a starting target around 0.9 to 0.95 volts and testing under the user’s real workload.

Why It Matters

The guide is relevant to readers running local large language models because heat and noise are practical limits in home and office AI workstations. A GPU that draws hundreds of watts can raise room temperature, increase fan noise, and reduce comfort during long inference sessions.

The source’s central claim is that local inference often has more to gain from efficiency tuning than gaming workloads because many LLM runs are constrained by VRAM bandwidth rather than raw core clock speed. If that claim holds for a user’s setup, reducing power can lower heat faster than it lowers throughput.
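
A back-of-envelope estimate shows why a bandwidth-bound workload is less sensitive to core clocks. The sketch below assumes single-stream decoding in which each generated token requires reading all model weights from VRAM once; the function name and both example values are hypothetical and are not taken from the guide.

```python
def decode_tokens_per_sec_ceiling(bandwidth_gb_s, weights_gb):
    """Rough ceiling on single-stream decode speed when every token
    requires streaming all model weights from VRAM once."""
    return bandwidth_gb_s / weights_gb

# Assumed example values, not from the guide:
# roughly 1000 GB/s of memory bandwidth and 20 GB of quantized weights.
ceiling = decode_tokens_per_sec_ceiling(bandwidth_gb_s=1000.0, weights_gb=20.0)
print(f"Bandwidth-bound ceiling: about {ceiling:.0f} tokens/sec")
```

Core clock speed does not appear in this estimate, which is the intuition behind the claim. Real throughput sits below the ceiling and depends on batch size, context length, and the inference runtime.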

The cost angle is also direct: the guide says this is a no-cost setting change before hardware spending. For readers considering new cooling hardware or a larger case, the reported RTX 4090 figures suggest a power cap may solve part of the problem without buying parts.

Background

The guide is presented as the first lever in a broader Thorsten Meyer AI series on reducing heat and noise in high-power AI workstations. It includes an interactive infographic and a power-limit table showing how speed, watts, temperature, and efficiency change as the cap is reduced.

The source says modern GPUs ship with conservative voltage curves because manufacturers need stable operation across many chips, including weaker samples. It argues that the final slice of voltage can create a large heat penalty for limited added performance.

The guide also cites broader 2025-2026 power-cap testing for cards including RTX 4090 and RTX 5090, but it says the numbers are illustrative and vary by card, workload, model, and quantization. It also carries an affiliate disclosure and tells readers to confirm current specs and availability before buying hardware.

“This is the first thing you should do to a high-power AI workstation, and it costs nothing.”

— Thorsten Meyer AI guide

“Local inference is memory-bound — the GPU core spends much of its time waiting on VRAM, not maxing out compute.”

— Thorsten Meyer AI guide

“Power limiting moves one slider and can’t damage anything.”

— Thorsten Meyer AI guide

“Data: published RTX 4090 fine-tuning power-scaling measurements; RTX 5090/4090 power-cap tests, 2025–2026.”

— Thorsten Meyer AI guide

What Remains Unclear

It is not clear from the source material how broadly the reported RTX 4090 numbers apply across all local inference setups. Actual tokens-per-second results can change with model size, quantization, batch size, context length, driver version, cooling, case airflow, and the individual GPU chip.

The guide says power limiting needs little testing, but manual undervolting can still create instability under long workloads. A voltage curve that appears stable during a short test may fail after hours of inference, according to the source.

The source also includes affiliate links, which it discloses. Based on the material provided, the performance figures should be read as data reported by the guide, not as an independent lab review.

What’s Next

The next step for readers is to test their own workload rather than rely only on synthetic benchmarks. The guide recommends setting a power limit, running a sustained real inference job, measuring temperature, held clock, power draw, and actual tokens per second, then saving the setting so it persists after reboot.
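
The measurement loop described above can be sketched as follows. This is a minimal sketch under stated assumptions: it samples temperature, SM clock, and power draw through nvidia-smi's CSV query output, and it expects the token count and elapsed time to come from the user's own inference run. The function names are hypothetical.

```python
import statistics
import subprocess

# nvidia-smi query fields: GPU temperature (C), SM clock (MHz), power draw (W).
QUERY = "temperature.gpu,clocks.sm,power.draw"

def parse_sample(csv_line):
    """Parse one 'temp, clock, power' line from nvidia-smi CSV output."""
    temp_c, clock_mhz, power_w = (float(x) for x in csv_line.strip().split(","))
    return {"temp_c": temp_c, "clock_mhz": clock_mhz, "power_w": power_w}

def read_sample(gpu_index=0):
    """Take one live reading; requires an NVIDIA driver with nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sample(out)

def summarize(samples, tokens_generated, elapsed_s):
    """Average the sampled readings and compute actual tokens per second."""
    return {
        "avg_temp_c": statistics.mean(s["temp_c"] for s in samples),
        "avg_clock_mhz": statistics.mean(s["clock_mhz"] for s in samples),
        "avg_power_w": statistics.mean(s["power_w"] for s in samples),
        "tokens_per_sec": tokens_generated / elapsed_s,
    }
```

Calling `read_sample()` every few seconds during a sustained job and passing the collected samples to `summarize()` gives one row per power limit, which can then be compared across settings.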

For users who want more tuning after the basic cap, the source points to manual undervolting with workload-specific stability testing. It also says power caps in the 60% to 80% range are often the high-value zone, while pushing too far can cause throughput to fall sharply.
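
One way to pick a cap from a user's own sweep is to choose the lowest setting that still meets a throughput target. The sketch below is an illustrative selection rule, not the guide's method; the function name and the thresholds are assumptions, and the example rows reuse the three RTX 4090 speeds reported in the guide.

```python
def lowest_cap_meeting_target(sweep, min_speed_pct):
    """sweep: list of (cap_pct, speed_pct) pairs from measured runs.
    Return the lowest cap whose speed stays at or above min_speed_pct,
    or None if no measured cap qualifies."""
    qualifying = [cap for cap, speed in sweep if speed >= min_speed_pct]
    return min(qualifying) if qualifying else None

# (power cap %, relative speed %) using the guide's reported RTX 4090 rows.
sweep = [(100, 100.0), (70, 93.4), (60, 91.5)]

print(lowest_cap_meeting_target(sweep, min_speed_pct=93.0))
print(lowest_cap_meeting_target(sweep, min_speed_pct=90.0))
```

A stricter target keeps the cap higher, and a sweep with more measured points below 60% would show where throughput starts to drop sharply on a given card.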

Key Questions

What is the actual development?

Thorsten Meyer AI published a guide and interactive infographic advising local AI workstation users to reduce GPU heat and noise through power limiting or undervolting.

Does undervolting always keep the same tokens per second?

No. The source reports cases where throughput stays high, but it also says results vary by card, model, quantization, and workload. Users need to measure their own inference jobs.

What setting does the guide recommend trying first?

The guide recommends starting with a power limit around 70%. In its RTX 4090 example, that setting drew 300 watts, ran at 67 degrees Celsius, and kept 93.4% of speed.

Is power limiting the same as undervolting?

No. Power limiting caps how much power the card may use and lets the GPU adjust clocks and voltage automatically. Undervolting directly edits the voltage-frequency curve and needs more testing.

What remains unconfirmed?

The source does not prove that every GPU or inference workload will see the same gains. Long-run stability and exact throughput effects remain setup-specific.

Source: Thorsten Meyer AI

Author

The Genius Factory Team
