The AI energy demand paradox: data center build-out vs. efficiency gains

Two trajectories are running in parallel across the AI build-out, and most coverage addresses only one at a time. The first is a build-out story: hyperscalers signing decade-spanning power purchase agreements at scales not seen since the manufacturing boom of the early 2000s, and utilities responding with capacity plans that explicitly assume AI workloads grow well above prior demand baselines. The second is an efficiency story: per-token inference cost has fallen by roughly two orders of magnitude in under three years, with no fundamental physics suggesting the trend has exhausted itself.

Both are real. Both have implications. And the net effect on global electricity demand depends almost entirely on which dominates, and when. The discipline this requires — holding two large trajectories in mind simultaneously rather than picking the one that confirms your prior — is exactly the analytical move that's missing from most published forecasts.

This piece works through what we actually know, where the genuine uncertainty sits, and what serious buyers of AI-exposure forecasts should track.

What we know about training compute

Frontier training runs are documented, batched, and concentrated. GPT-3's 2020 training consumed roughly 3×10²³ floating-point operations (FLOP). GPT-4 has been estimated at around 2×10²⁵ FLOP, a roughly 70-fold increase in three years. Frontier models trained in 2025–2026 are widely estimated to be approaching the 10²⁶ FLOP scale, though specific figures remain undisclosed by labs.
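The growth multiple implied by those two estimates can be checked directly. The following is a minimal sketch that uses only the two FLOP figures quoted above and assumes a three-year gap between the runs; the function name is illustrative, not taken from any published methodology.

```python
def growth_summary(start_flop: float, end_flop: float, years: float) -> tuple[float, float]:
    """Return (total multiple, implied average annual multiple) between two runs."""
    multiple = end_flop / start_flop
    annual = multiple ** (1.0 / years)
    return multiple, annual


GPT3_FLOP = 3e23  # estimate quoted in the text
GPT4_FLOP = 2e25  # estimate quoted in the text

multiple, annual = growth_summary(GPT3_FLOP, GPT4_FLOP, years=3.0)
print(f"total: {multiple:.0f}x, average per year: {annual:.1f}x")
```

On these inputs the total multiple comes out near 67x, which corresponds to training compute roughly quadrupling each year over the period.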

Two structural features matter more than the absolute numbers.

First, training is concentrated. Fewer than ten organisations globally do the overwhelming majority of frontier-scale training. That concentration means total training energy is a function of a small number of capacity decisions, not a diffused demand curve.

Second, training is discrete. A frontier training run lasts weeks to months, then ends. The data center capacity provisioned for that training run does not disappear, but the compute load shifts to inference once the model deploys. Most reporting treats “AI energy demand” as if training were the ongoing workload. It isn't.

Inference compute is where the operational energy lives

Once a model is deployed, inference becomes the recurring load, and for widely used frontier models, inference now consumes substantially more energy annually than the original training run that produced them.

The asymmetry runs deep. A single training event represents a fixed, predictable energy budget. Inference, by contrast, is highly distributed: millions of individual queries, each consuming a small amount of energy, aggregating to a workload that compounds with user adoption. ChatGPT alone reportedly handles in excess of a billion queries per week. Multiply that across the major chatbot platforms, code assistants, search integrations, enterprise inference, and the API-driven ecosystem of derivative applications, and inference becomes the dominant share of operational AI energy by a wide margin.
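One way to frame the asymmetry is as a break-even count: how long a deployed model must serve queries before cumulative inference energy matches the one-off training budget. The sketch below is illustrative only. The training budget and the energy per query are hypothetical placeholders, not measurements; the only input taken from the text is the reported figure of a billion queries per week.

```python
def breakeven_weeks(training_mwh: float, wh_per_query: float, queries_per_week: float) -> float:
    """Weeks of serving after which cumulative inference energy equals training energy."""
    training_wh = training_mwh * 1_000_000  # 1 MWh = 1,000,000 Wh
    weekly_inference_wh = wh_per_query * queries_per_week
    return training_wh / weekly_inference_wh


# HYPOTHETICAL inputs, chosen only to show the arithmetic.
TRAINING_MWH = 50_000.0   # assumed one-off training energy budget
WH_PER_QUERY = 1.0        # assumed average energy per query
QUERIES_PER_WEEK = 1e9    # "in excess of a billion queries per week" (from the text)

weeks = breakeven_weeks(TRAINING_MWH, WH_PER_QUERY, QUERIES_PER_WEEK)
print(f"break-even after {weeks:.0f} weeks of serving")
```

With these placeholder values the break-even lands at 50 weeks. The point is the structure of the calculation rather than the specific result: training is a fixed numerator, while the denominator scales with adoption.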

This is where the efficiency story matters most, because efficiency gains affect inference in two ways simultaneously: they reduce energy per query, and they make additional applications economically viable. Both effects compound.

The efficiency gains are genuine and unusual

Three improvement vectors are running at once.

Hardware. NVIDIA's H100 to B100 transition delivered roughly 2–3x performance-per-watt improvements on representative inference workloads, with comparable jumps queued for successor generations. Custom inference silicon — Google TPUs, AWS Inferentia, Cerebras, Groq — extends this further for specific workload shapes. The semiconductor industry has not seen sustained per-watt gains at this pace since the early 2000s.

Model architecture. Mixture-of-experts routing reduces active parameters per token. Attention optimisations — flash attention, grouped-query attention — cut memory bandwidth costs. Specialised smaller models trained for specific tasks often match general-purpose model quality at a fraction of the inference cost. Combined, these architectural improvements have reduced compute per token of equivalent output quality by an order of magnitude over the past 18 months.

Quantisation. Running inference at lower precision (FP16 to FP8, increasingly to INT4) cuts both memory and compute requirements substantially, with minimal quality degradation on most workloads. This is now standard practice for production deployments, whereas it was largely confined to research as recently as 2023.
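The memory side of the precision change is simple arithmetic: bytes per parameter multiplied by parameter count. The sketch below counts weights only (no KV cache, activations, or per-block scaling metadata), and the 70-billion-parameter model size is a hypothetical example rather than a reference to any specific model.

```python
# Storage width of one weight at each precision, in bytes.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Memory needed to hold the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


NUM_PARAMS = 70e9  # HYPOTHETICAL model size, for illustration only

for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_memory_gb(NUM_PARAMS, precision):.0f} GB")
```

Each halving of precision halves the weight footprint (140 GB, 70 GB, 35 GB in this example), which in turn reduces the memory bandwidth consumed per generated token.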

The aggregate effect: cost-per-token on standard capability benchmarks has fallen by roughly two orders of magnitude since 2023. That rate of improvement is unusual in any mature technology. The closest analogue is the early-stage semiconductor cost curve of the late 1960s and early 1970s. It will not continue forever, but the runway for further gains appears longer than most casual observers assume.
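To put that rate in annual terms, a cumulative decline can be converted into an average yearly factor and a halving time. This is a rough sketch that assumes a smooth exponential decline and a three-year window; the window length is an assumption, since the text gives only "since 2023" and "under three years".

```python
import math


def annual_decline_factor(total_factor: float, years: float) -> float:
    """Average factor by which cost falls each year, given the cumulative factor."""
    return total_factor ** (1.0 / years)


def halving_time_months(total_factor: float, years: float) -> float:
    """Months for cost to halve, assuming a constant exponential rate of decline."""
    return 12.0 * years * math.log(2.0) / math.log(total_factor)


TOTAL_FACTOR = 100.0  # "two orders of magnitude" (from the text)
YEARS = 3.0           # ASSUMED window length

yearly = annual_decline_factor(TOTAL_FACTOR, YEARS)
halving = halving_time_months(TOTAL_FACTOR, YEARS)
print(f"about {yearly:.1f}x cheaper per year, halving every {halving:.1f} months")
```

Under those assumptions cost per token falls by a factor of about 4.6 each year, halving roughly every five and a half months. A shorter window would imply an even steeper rate.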

Where the paradox bites

The paradox is straightforward once the dynamics are named: this is the Jevons effect. As compute becomes cheaper, applications that were uneconomic at higher prices become viable, and when demand is sufficiently responsive to price, aggregate demand expands faster than per-unit consumption shrinks.
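Whether cheaper compute raises or lowers total energy depends on how strongly demand responds to price. The sketch below uses a constant-elasticity demand model, which is a simplifying assumption made here and not something the published forecasts necessarily use. It also assumes that energy per token falls in proportion to cost per token. The function name and parameter values are illustrative.

```python
def relative_total_energy(cost_ratio: float, elasticity: float) -> float:
    """Total energy relative to baseline after per-token cost changes by cost_ratio.

    ASSUMPTIONS: token demand scales as cost ** (-elasticity), and energy per
    token scales in proportion to cost per token.
    """
    demand_ratio = cost_ratio ** (-elasticity)
    energy_per_token_ratio = cost_ratio
    return demand_ratio * energy_per_token_ratio


COST_RATIO = 0.1  # per-token cost falls tenfold (illustrative)

for elasticity in (0.5, 1.0, 1.5):
    ratio = relative_total_energy(COST_RATIO, elasticity)
    print(f"elasticity {elasticity}: total energy x{ratio:.2f}")
```

In this model total energy scales as the cost ratio raised to the power of one minus the elasticity. With elasticity below 1, efficiency gains win and total energy falls. At exactly 1 the two effects cancel. Above 1, demand growth dominates and total energy rises even as each token gets cheaper.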

So far, the expansion has outpaced the efficiency gains. Hyperscaler capacity additions reflect the demand side winning; published utility plans support the same conclusion. The IEA's most recent global forecast has data center electricity demand roughly doubling by 2030 from a 2024 baseline, with AI workloads driving most of that growth.

But the gap between expansion and efficiency is narrowing, and two second-order shifts deserve attention:

The geographic distribution of new builds is changing. Power-constrained markets — Northern Virginia, Dublin, parts of Singapore — are seeing deferrals and capacity reallocations. New builds increasingly target markets with available renewable capacity and grid headroom: Iowa, Texas, the Nordics, the Pacific Northwest. This dispersion shifts the local grid impact even when global totals continue to grow.

The shape of deployment is also shifting. Specialist smaller models running on edge hardware or constrained cloud inference budgets are taking share from frontier-scale model calls for routine tasks. This shift, if it continues, would slow the inference compute growth curve without reducing application diversity.

The honest range on 2030 AI data center electricity consumption, after reading across credible analyses, lies somewhere between 3% and 12% of global electricity demand. Anyone offering a point estimate within that band is presenting a forecast as a fact. The width of the band reflects what's actually known.

What serious buyers should track

For investors, corporates, and policymakers attempting to anchor decisions to credible signals rather than press cycles, five indicators carry more weight than the headline forecasts:

Hyperscaler PPA structure. Watch whether new agreements commit to firm renewable supply (24/7 carbon-free energy matching, the harder commitment) or to annualised renewable matching (the easier commitment). The shift toward firm matching is a stronger demand signal than the volume of new contracts.
Inference-to-training compute ratio. As models deploy at scale, the ratio shifts further toward inference. The slope of that shift is a leading indicator of operational energy growth.
Cost-per-token trajectory on fixed-capability benchmarks. This is the cleanest measurable proxy for whether efficiency gains are continuing at the recent rate or beginning to plateau.
Geographic concentration of new builds. Permits and substation upgrade announcements in emerging hubs reveal where the next inflection in regional demand will land.
Specialist vs. general-purpose model deployment. API revenue breakdowns and enterprise inference contract terms reveal whether the industry is consolidating on a few frontier models or fragmenting across specialised smaller ones. The latter outcome reduces aggregate compute demand meaningfully.

The framing that survives scrutiny

Both narratives currently in circulation are individually true and jointly misleading. AI energy demand is real, large, and growing. AI efficiency is improving at rates that are themselves historically unusual. The net effect — and therefore the right basis for capacity planning, grid investment, and disclosure scrutiny — sits in the interaction between the two, not in either one taken alone.

For now, the demand side is winning. That position is not stable. Anyone forecasting 2030 outcomes who hasn't worked through the efficiency side at comparable depth is presenting half the picture as the whole.