ESG data extraction with LLMs: the quality ceiling and where it breaks

The promise of large language models for ESG data extraction is straightforward: corporate sustainability reports run to hundreds of pages of unstructured prose, and finding the relevant numbers buried inside is precisely the kind of pattern-recognition problem LLMs handle well. The reality is more complicated. Extraction quality varies enormously across data categories, from near-human accuracy on simple disclosed metrics to systematic failure on the disclosures that matter most for credibility scrutiny.

This piece works through what LLM extraction does well, where it breaks, and what data buyers should understand about the difference.

Where extraction works well

For well-defined, single-value metrics in standardized reporting formats, LLM extraction now achieves near-human accuracy on most documents. Scope 1 emissions totals, scope 2 emissions totals, total energy consumption, water withdrawal, employee headcount, gender breakdown of leadership, and similar discrete numerical disclosures are extracted reliably from CDP submissions, annual reports, and standalone sustainability reports.

The reason this category works is that the data is genuinely there to be extracted. It's disclosed prominently, with clear units, in a small number of conventional locations within the document structure. LLMs find it because finding it doesn't require interpretation.

This is real value. Manual extraction of these data points across a large corporate universe is slow and expensive. LLM-driven pipelines have made it routine, fast, and inexpensive.

Where extraction starts to break: scope 3

Scope 3 emissions disclosure is where the quality gap becomes substantial.

The issue isn't that LLMs fail to find scope 3 totals when reported. Total scope 3 numbers are extracted fine. The problem is what's beneath the total. Scope 3 comprises 15 categories under the GHG Protocol, each with its own methodological choices. Companies disclose inconsistently at the category level: some categories come with detailed methodology, others appear as single numbers with no methodology, and others are omitted entirely.

LLM extraction tends to flatten this complexity. A pipeline extracting scope 3 totals across a corporate universe will often produce numbers that are technically correct but materially incomparable. Company A's scope 3 total might cover 8 categories with rigorous supplier engagement; Company B's might cover 4 categories with spend-based estimation; Company C's might consist of Category 1 (purchased goods and services) alone, with most other categories reported as “not yet measured”.

The data buyer who treats these three totals as comparable is making a substantial methodological error. The LLM extracting them doesn't flag the methodological differences because those differences are often not stated in the passage where the total is reported.
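One way to keep this from being flattened is to store the category coverage and estimation method alongside every extracted total, and to check coverage before comparing two companies. The sketch below is a minimal illustration under stated assumptions: the record layout, the field names, and the method labels are all hypothetical, not taken from any particular data product.

```python
from dataclasses import dataclass, field


@dataclass
class Scope3Record:
    """Hypothetical extraction record: a total plus the coverage behind it."""
    company: str
    total_tco2e: float
    # Disclosed category number (1-15) -> estimation method as stated in the
    # report. "undisclosed" marks a bare number with no methodology given.
    category_methods: dict = field(default_factory=dict)

    @property
    def covered(self) -> frozenset:
        return frozenset(self.category_methods)


def comparability_flags(a: Scope3Record, b: Scope3Record) -> list:
    """List reasons why two scope 3 totals should not be compared directly."""
    flags = []
    only_a = sorted(a.covered - b.covered)
    only_b = sorted(b.covered - a.covered)
    if only_a:
        flags.append(f"{a.company} includes categories {only_a} that {b.company} omits")
    if only_b:
        flags.append(f"{b.company} includes categories {only_b} that {a.company} omits")
    # Where both companies report a category, compare the stated methods.
    for cat in sorted(a.covered & b.covered):
        method_a = a.category_methods[cat]
        method_b = b.category_methods[cat]
        if method_a != method_b:
            flags.append(
                f"category {cat}: {a.company} uses {method_a}, {b.company} uses {method_b}"
            )
    return flags


# Illustrative, made-up figures.
company_a = Scope3Record("A", 1000.0, {1: "supplier-specific", 2: "spend-based"})
company_b = Scope3Record("B", 400.0, {1: "spend-based"})
for flag in comparability_flags(company_a, company_b):
    print(flag)
```

An empty list from comparability_flags does not prove two totals are comparable; it only means the disclosed coverage and the stated methods match.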

Asset-level data: the second gap

For sectors where asset-level emissions matter — oil and gas, power generation, cement, steel, real estate — the relevant data isn't always in the headline sustainability report. It sits in operational reports, regulatory filings, engineering submissions, or simply never gets publicly disclosed.

LLM extraction pipelines reading sustainability reports systematically miss facility-level emissions detail. The corporate report may say “our European operations achieved a 12% intensity reduction” without naming individual facilities, capacity factors, or operational characteristics. The data needed for facility-level analysis is elsewhere: sometimes in the European Pollutant Release and Transfer Register (E-PRTR), sometimes in the US EPA's Toxics Release Inventory (TRI), sometimes in EU ETS verified emissions data, sometimes nowhere public.

This isn't an LLM failure. The data isn't where the LLM is reading. But it means that data products built primarily on LLM extraction from sustainability reports cannot provide the asset-level granularity that some downstream use cases require.

Forward-looking commitments: the third gap

Net-zero targets, science-based targets, interim milestones, scenario analysis — forward-looking disclosures are where extraction quality is most variable.

The structural problem: forward-looking commitments are often disclosed in qualitative prose that conveys real meaning to a human reader but doesn't reduce cleanly to a structured field. “We aim to reduce scope 1 and 2 emissions by 42% from a 2019 baseline by 2030, subject to technology availability and regulatory environment” contains a number that's extractable, but also a qualifier that materially changes the commitment's strength. LLM extraction will usually capture the number; it may or may not capture the qualifier.

Worse, similar commitments are stated with very different levels of accountability across companies. “Target” vs. “aspiration” vs. “ambition” vs. “goal” vs. “commitment” convey different binding force to a careful reader. Most LLM extraction pipelines collapse these distinctions into a single “target year + percentage” field.
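A structured record can keep the wording and the qualifier next to the number instead of discarding them. The sketch below uses simple regular expressions only to make the record shape concrete; it is not a description of how any LLM pipeline works, and the term list, the qualifier phrases, and the field names are illustrative assumptions that would need review before real use.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed vocabulary for how a commitment is framed (illustrative, not exhaustive).
FRAMING = re.compile(
    r"\b(commit(?:ment|s|ted)?|targets?|goals?|ambitions?|aspirations?"
    r"|aims?|aspires?|intends?)\b",
    re.IGNORECASE,
)
# Assumed phrases that introduce a condition; the clause runs to the next '.' or ';'.
QUALIFIER = re.compile(
    r"\b(?:subject to|conditional on|dependent on|contingent on)\b[^.;]*",
    re.IGNORECASE,
)


@dataclass
class TargetRecord:
    reduction_pct: Optional[float]
    baseline_year: Optional[int]
    target_year: Optional[int]
    framing: List[str]      # framing words found, lowercased
    qualifiers: List[str]   # conditional clauses, verbatim
    source_sentence: str    # kept for provenance


def parse_target(sentence: str) -> TargetRecord:
    pct = re.search(r"(\d+(?:\.\d+)?)\s*%", sentence)
    baseline = re.search(r"\b((?:19|20)\d{2})\s+baseline\b", sentence)
    by_year = re.search(r"\bby\s+((?:19|20)\d{2})\b", sentence)
    return TargetRecord(
        reduction_pct=float(pct.group(1)) if pct else None,
        baseline_year=int(baseline.group(1)) if baseline else None,
        target_year=int(by_year.group(1)) if by_year else None,
        framing=[m.group(0).lower() for m in FRAMING.finditer(sentence)],
        qualifiers=[m.group(0).strip() for m in QUALIFIER.finditer(sentence)],
        source_sentence=sentence,
    )


record = parse_target(
    "We aim to reduce scope 1 and 2 emissions by 42% from a 2019 baseline "
    "by 2030, subject to technology availability and regulatory environment"
)
print(record.reduction_pct, record.target_year, record.framing, record.qualifiers)
```

The point of the design is that a downstream user can filter on framing and qualifiers, rather than receiving only a target year and a percentage.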

Controversies and incidents: the fourth gap

For ESG users tracking controversies — safety incidents, environmental violations, labor disputes, governance failures — sustainability reports are systematically incomplete sources. Companies disclose what regulation requires and what reputation management permits; they do not disclose comprehensively.

LLM extraction from sustainability reports therefore captures the company's narrative about controversies. Capturing what actually happened requires augmentation with regulatory filings, news monitoring, NGO reports, court records, and similar external sources.

This is the gap most often missed by buyers of LLM-extracted ESG data products. The extraction quality may be high; the source completeness is low; the user experience is “clean data product with systematic blind spots.”
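The distinction between extraction quality and source completeness can be made concrete by measuring them separately. The sketch below assumes incidents from the company's report and from external sources have already been matched to shared identifiers, which is the hard part in practice; the function name and the identifiers are hypothetical.

```python
def completeness_report(disclosed: set, external: set) -> dict:
    """Compare incidents in the company's own report with externally documented ones.

    Both arguments are sets of incident identifiers that have already been
    reconciled across sources (an assumption of this sketch).
    """
    if not external:
        raise ValueError("no external incidents to compare against")
    overlap = disclosed & external
    return {
        # Share of externally documented incidents the company also reported.
        "source_completeness": len(overlap) / len(external),
        # Incidents a report-only pipeline could never have extracted.
        "undisclosed": sorted(external - disclosed),
    }


# Made-up identifiers for illustration.
report = completeness_report(
    disclosed={"2022-spill"},
    external={"2022-spill", "2023-permit-fine", "2023-strike", "2021-lawsuit"},
)
print(report)
```

A pipeline could extract every incident in the report with perfect accuracy and still score 0.25 here, because three of the four externally documented incidents were never in the source it read.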

What extraction quality buyers should require

Three properties separate credible LLM-based ESG extraction from extraction that creates analytical hazard:

Methodology transparency at field level. A scope 3 total reported in a data product should reveal which categories the source company actually disclosed. A reported target should reveal whether the source company called it a “commitment” or an “ambition”, or attached qualifying language. Field-level provenance is the minimum standard for downstream use.

Source diversity. Pipelines reading only sustainability reports cannot match pipelines that also integrate regulatory filings, news monitoring, and NGO sources. Single-source extraction systematically reproduces the disclosure gaps in the source. Buyers should ask which sources are integrated and at what depth.

Verification sampling. Any LLM extraction pipeline should be subject to ongoing human-verified sampling, with disclosed error rates by field type. Pipelines that don't disclose error rates are, more often than not, pipelines whose error rates haven't been measured.
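Error rates by field type, as described under verification sampling, can be computed from a human-checked sample in a few lines. The sketch below is one possible approach, not a prescribed method: it assumes each sampled value has been marked correct or incorrect by a reviewer, uses a Wilson score interval to show how uncertain the estimate is at small sample sizes, and uses made-up field names and counts.

```python
import math
from collections import defaultdict


def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for an error proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = errors / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def error_rates_by_field(samples) -> dict:
    """samples: iterable of (field_type, is_correct) pairs from human review."""
    counts = defaultdict(lambda: [0, 0])  # field_type -> [errors, total]
    for field_type, is_correct in samples:
        counts[field_type][1] += 1
        if not is_correct:
            counts[field_type][0] += 1
    return {
        field_type: {
            "n": total,
            "error_rate": errors / total,
            "ci95": wilson_interval(errors, total),
        }
        for field_type, (errors, total) in counts.items()
    }


# Made-up review results: 100 checked scope 1 totals, 40 checked target qualifiers.
samples = (
    [("scope1_total", True)] * 98 + [("scope1_total", False)] * 2
    + [("target_qualifier", True)] * 30 + [("target_qualifier", False)] * 10
)
for field_type, stats in error_rates_by_field(samples).items():
    print(field_type, stats)
```

Reporting the interval alongside the rate matters because a field type checked on only a few dozen records can carry a wide band of uncertainty even when its point estimate looks acceptable.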

The honest framing

LLM extraction has genuinely transformed the economics of ESG data infrastructure. Tasks that required dozens of analysts now run as pipelines. Coverage that was previously concentrated on the largest corporates can now extend to mid-cap and below.

But the quality varies more by field type than most data products acknowledge. Headline emissions totals: high. Scope 3 detail: variable. Asset-level data: limited. Forward-looking commitments: methodology-sensitive. Controversies: structurally incomplete from corporate sources alone.

Buyers who use these data products without understanding the quality gradient by field type are making decisions on data that looks more comparable than it is. The pipelines aren't lying; the underlying disclosures are uneven, and the extraction inherits that unevenness. Recognizing this is the first step toward using the data products well.