The Observation Post

The End of the Billion-Dollar Survey? How LLMs Are Becoming Synthetic Consumers

Tech · AI · Cyber · Defence

12 Jun 2026 · 8 min read

"The problem was never the model — it was how we asked the question."

Consumer research costs companies billions of dollars annually. Every year, corporations spend staggering sums recruiting panels, designing surveys, fielding studies, and analysing results — all to answer a single question: Would people buy this?

The process is slow, expensive, and riddled with biases. Panel fatigue, satisficing, and acquiescence bias distort the answers, and by the time you have them, your competitor may have already shipped.

Enter large language models as synthetic consumers. The idea is elegant: instead of paying 300 humans to evaluate a product concept, you prompt an LLM with demographic personas and ask it what it thinks. Faster. Cheaper. Scalable.

There was just one problem: it didn't work.

Ask an LLM for a rating on a scale of 1 to 5, and it freezes. It plays it safe. It gives you a "3" — the conversational equivalent of a shoulder shrug. The distributions are unrealistically narrow, systematically skewed, and utterly useless for ranking product concepts.

Most researchers concluded LLMs were simply incapable of simulating human survey responses.

They were wrong.

The SSR Breakthrough

A team of researchers — Benjamin Maier, Ulf Aslak, and colleagues — recently published a paper that flips this narrative on its head. Their insight is deceptively simple: the problem was never the LLM. It was the elicitation method.

Their solution, called Semantic Similarity Rating (SSR), changes how we ask the question:

  1. Prime the LLM with demographic attributes — age, gender, income level
  2. Show it the product concept (image + description)
  3. Ask for a free-text response: "How likely are you to purchase this product?"
  4. Embed that text using a model like OpenAI's text-embedding-3-small
  5. Compare the embedding against reference anchor statements for each point on a 1-5 Likert scale
  6. Convert the similarity scores into a probability distribution
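
The last three steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the three-dimensional vectors stand in for real embeddings, the function names are hypothetical, and the normalisation (subtract the lowest similarity, then rescale so the values sum to one) is an assumption about how step 6 might be done.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ssr_pmf(response_vec, anchor_vecs, eps=1e-9):
    """Map one response embedding to a distribution over the scale points.

    anchor_vecs[i] is the embedding of the reference statement for rating i + 1.
    Assumed normalisation: shift so the weakest anchor is near zero, then rescale.
    """
    sims = [cosine(response_vec, a) for a in anchor_vecs]
    lowest = min(sims)
    shifted = [s - lowest + eps for s in sims]  # eps avoids dividing by zero
    total = sum(shifted)
    return [x / total for x in shifted]

def expected_rating(pmf):
    """Mean rating implied by a distribution over the points 1..len(pmf)."""
    return sum((i + 1) * p for i, p in enumerate(pmf))

# Toy "embeddings", invented for illustration only.
# A real pipeline would get these vectors from an embedding model.
anchors = [
    [1.0, 0.0, 0.0],  # anchor for 1 (least likely to buy)
    [0.8, 0.3, 0.0],
    [0.5, 0.5, 0.0],
    [0.3, 0.8, 0.0],
    [0.0, 1.0, 0.0],  # anchor for 5 (most likely to buy)
]
response = [0.1, 0.9, 0.1]  # stands in for an enthusiastic free-text answer

pmf = ssr_pmf(response, anchors)
print([round(p, 3) for p in pmf], round(expected_rating(pmf), 2))
```

Each synthetic respondent yields a full distribution rather than a single integer, so a response that sits between two anchors is not forced onto either one.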

The results are striking. Tested on 57 real consumer surveys from a leading personal care corporation, spanning 9,300 human responses, SSR achieves 90% of human test-retest reliability: its results correlate with the human results about 90% as well as a repeat human survey would. The distributional similarity to real human data exceeds 0.85 on a Kolmogorov–Smirnov similarity score (where 1 is a perfect match), compared to a baseline of 0.26 for direct rating.
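
To make the Kolmogorov–Smirnov comparison concrete, here is a small sketch. It assumes the similarity is defined as one minus the largest gap between the two cumulative distributions, which is a common convention but may not match the paper's exact definition. All three distributions are invented for illustration.

```python
from itertools import accumulate

def ks_similarity(p, q):
    """1 minus the largest gap between two cumulative distributions.

    p and q are probability lists over the same ordered scale points.
    The '1 minus distance' form is an assumption, not taken from the paper.
    """
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    return 1.0 - max(abs(a - b) for a, b in zip(cdf_p, cdf_q))

# Invented example distributions over ratings 1..5.
human = [0.05, 0.10, 0.25, 0.35, 0.25]
centred = [0.00, 0.05, 0.90, 0.05, 0.00]  # almost everything lands on "3"
spread = [0.04, 0.12, 0.27, 0.33, 0.24]   # close to the human shape

print(round(ks_similarity(human, centred), 2))
print(round(ks_similarity(human, spread), 2))
```

A distribution piled onto the middle of the scale scores poorly even when its mean is plausible, which is why the shape is reported separately from the ranking.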

No training. No fine-tuning. Fully plug-and-play.

Why It Works

LLMs have consumed vast amounts of human discourse about products — Reddit threads, Amazon reviews, forum discussions, YouTube comments. What they're good at is generating language that sounds like a person. What they're bad at is assigning a number to an abstract feeling.

SSR bridges this gap by doing what LLMs do best (language generation) and delegating the quantification to an embedding model — a task embeddings are specifically designed for.
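
One way to picture that division of labour is as two pluggable stages. The sketch below is hypothetical throughout: `build_prompt`, `synthetic_rating`, and the two stub functions are invented names, the prompt wording is illustrative rather than the paper's, and the stubs return canned values so the example runs without any API.

```python
from typing import Callable, Dict, List, Tuple

def build_prompt(persona: Dict[str, str], concept: str) -> str:
    """Compose a persona-conditioned question that asks for words, not a number."""
    traits = ", ".join(f"{key}: {value}" for key, value in persona.items())
    return (
        f"You are a consumer with these attributes: {traits}.\n"
        f"Product concept: {concept}\n"
        "How likely are you to purchase this product? "
        "Answer in one or two sentences, in your own words."
    )

def synthetic_rating(
    persona: Dict[str, str],
    concept: str,
    generate: Callable[[str], str],
    quantify: Callable[[str], List[float]],
) -> Tuple[str, List[float]]:
    """Stage 1: the language model writes text. Stage 2: a separate step scores it."""
    text = generate(build_prompt(persona, concept))
    return text, quantify(text)

# Stubs so the sketch runs offline. Replace with a real LLM call and a real
# embedding-based scorer; the canned outputs below are invented.
def fake_generate(prompt: str) -> str:
    return "It sounds nice, but I'd wait for a discount before buying."

def fake_quantify(text: str) -> List[float]:
    return [0.10, 0.30, 0.40, 0.15, 0.05]

persona = {"age": "34", "gender": "female", "income": "middle"}
concept = "A fluoride-free toothpaste with a mint and green tea flavour."
text, pmf = synthetic_rating(persona, concept, fake_generate, fake_quantify)
print(text, pmf)
```

Keeping the two stages behind separate callables means either one can be swapped (a different LLM, a different embedding model) without touching the other.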

This is a textbook harness engineering insight: the wrapper around the model (how you prompt it, what you ask, how you process the output) can matter more than which model you use.

But Ask for a Number and You'll Get Lies

This brings us to the critical counter-argument, and the methodological trap that most teams fall into.

If you ask an LLM to "rate this from 1 to 5", you're going to get garbage.

The paper demonstrates this conclusively. Under direct Likert rating (DLR), models regress to the centre, answering "3" almost every time. They almost never use the extremes (1 or 5). What correlation with human data does appear comes almost entirely from occasional "2" and "4" responses, which provide just enough variance to avoid a flat line.

Even more telling: when researchers tried to nudge models toward the upper end of the scale via system prompt modifications, they got slightly better distributions but lost the signal in product ranking. The models over-corrected, producing distributions that looked human but no longer ranked concepts correctly.
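
The trade-off described here involves two separate properties: whether the ratings are at a human-like level, and whether they put concepts in the right order. A short sketch of the second check follows, using a hand-rolled Pearson correlation over per-concept mean ratings. Every number is invented to illustrate the point and none comes from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Mean purchase intent for five hypothetical concepts (invented values).
human_means = [3.2, 3.9, 4.1, 3.5, 3.7]
# Hugging the centre: unrealistic level and spread, but the order survives.
centred_means = [3.00, 3.05, 3.06, 3.02, 3.03]
# Nudged upward: the level looks human, but the order is scrambled.
nudged_means = [4.0, 3.8, 3.9, 4.1, 3.85]

print(round(pearson(human_means, centred_means), 2))
print(round(pearson(human_means, nudged_means), 2))
```

A method can do well on one check and fail the other, so a useful evaluation reports both.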

The lesson extends far beyond market research. Any task that asks an LLM to produce a numerical rating — satisfaction scores, content moderation severity, quality assessments — is vulnerable to the same failure mode. The model isn't thinking in numbers; it's thinking in language. If you force it into a numerical straitjacket, it defaults to the safest possible answer.

SSR avoids this by preserving what LLMs actually produce — language — and applying a separate, purpose-built quantification layer.

What This Means for Companies — and Who Wins Most

The implications are profound. But the impact isn't uniform — different types of companies will benefit in very different ways.

Consumer Packaged Goods (CPG) & FMCG — This is the sweet spot. The paper itself was tested on personal care products, and the principle extends to any mass-market consumer good. Companies like P&G, Unilever, L'Oréal, Nestlé, and Colgate-Palmolive run thousands of concept tests per year at significant cost. SSR lets them screen 10x more concepts for the same budget, killing weak ideas earlier and investing human panels only on the strongest candidates. For a company testing 500 concepts annually, even a 20% reduction in human panel costs represents millions in savings.

E-commerce & DTC Brands — Amazon, Flipkart, Meesho, and direct-to-consumer brands constantly A/B test product descriptions, packaging designs, and pricing tiers. SSR can simulate consumer response to copy variants, feature descriptions, and visual treatments without running expensive user studies. A DTC brand launching 50 SKUs a year could test purchase intent on all 50 synthetically, then validate only the top 10 with human panels.

Pharmaceutical & Healthcare (OTC) — Over-the-counter drug and supplement companies face unique challenges: their concept tests must navigate regulatory constraints, and panel recruitment is expensive because studies often need respondents with specific conditions or demographics. SSR can pre-screen concepts for purchase intent before committing to regulated human studies. Companies like Haleon (Advil, Centrum), which was spun out of GSK, and Kenvue, the former Johnson & Johnson consumer health business, stand to gain significantly.

Automotive — Car companies spend heavily on concept clinics to gauge consumer reaction to new designs, features, and pricing. With vehicle development cycles of 3-5 years and concept testing happening at multiple gates, SSR could compress the early-stage screening process dramatically — testing interior layouts, infotainment UI preferences, or even colour options synthetically before building physical mockups. For every F-150 that Ford tests, there are dozens of variants that never make it to clinics; SSR could give those variants a voice.

Gaming & Entertainment — Game studios test concepts — art styles, gameplay hooks, monetisation models — via focus groups and surveys. SSR could screen dozens of game concepts for purchase/download intent before committing development resources. Indie studios with limited budgets benefit most: SSR gives them consumer insight that was previously only available to AAA publishers.

Quick-Service Restaurant (QSR) & Food & Beverage — McDonald's, Starbucks, and PepsiCo test menu items and flavour concepts constantly. SSR lets them rapidly iterate on product concepts before committing to supply chain changes, test kitchen runs, or limited-time trials. A chain testing 100 potential menu items across 20 markets could narrow to the top 10 synthetically before spending a rupee on actual production.

How to actually use it operationally:

  1. Concept screening pipeline — Run all new concepts through SSR first. Set a purchase-intent (PI) threshold (e.g., mean PI > 3.8 on the 1-5 scale). Concepts that pass go to human panels. Concepts that don't are iterated or killed.
  2. Iterative refinement — SSR takes minutes per concept. Use it to test variations — "what if we change the price point?" / "what if we emphasise this benefit instead?" — and converge on the strongest formulation before spending human panel budget.
  3. Demographic filters — SSR replicates age and income patterns well. Use it to test whether a concept appeals differently to Gen Z vs Boomers, or to high-income vs budget-conscious segments. This lets companies pre-emptively identify positioning problems.
  4. Qualitative signal — Synthetic consumers explain why. Mine those rationales for themes — common objections, unexpected attractions, specific feature mentions. These feed directly into product development and marketing messaging.
  5. Global scaling — Running human panels in 20 countries can cost roughly 20x what it costs in one. SSR scales at near-zero marginal cost, so companies can test concepts across markets without the logistical and financial overhead of international panel recruitment. Regional patterns replicate less reliably than age and income, so treat cross-market results as a first pass.
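
The screening step in item 1 might look something like the sketch below. It assumes each concept already has one rating distribution per synthetic respondent. The concept names, the distributions, and the helper names `mean_pi` and `screen` are all hypothetical, and the 3.8 cut-off is simply the example threshold from the list.

```python
def mean_pi(pmfs):
    """Average expected rating across a panel of synthetic respondents."""
    per_respondent = [
        sum((i + 1) * p for i, p in enumerate(pmf)) for pmf in pmfs
    ]
    return sum(per_respondent) / len(per_respondent)

def screen(concepts, threshold=3.8):
    """Split concepts into those that go to human panels and those to rework."""
    scored = sorted(
        ((mean_pi(pmfs), name) for name, pmfs in concepts.items()),
        reverse=True,  # strongest concept first
    )
    passed = [name for score, name in scored if score > threshold]
    rework = [name for score, name in scored if score <= threshold]
    return passed, rework

# Invented data: two concepts, two synthetic respondents each.
concepts = {
    "mint_toothpaste": [
        [0.0, 0.0, 0.1, 0.4, 0.5],
        [0.0, 0.1, 0.2, 0.4, 0.3],
    ],
    "charcoal_floss": [
        [0.1, 0.3, 0.4, 0.2, 0.0],
        [0.2, 0.3, 0.3, 0.2, 0.0],
    ],
}

passed, rework = screen(concepts)
print("to human panel:", passed)
print("iterate or kill:", rework)
```

The same `mean_pi` call can be run per demographic segment to compare, say, younger and older personas on one concept.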

Who benefits least? Companies in niche B2B domains (industrial equipment, enterprise software, specialised medical devices) where the LLM has limited training data. Also, any product where purchase decisions involve complex contractual, regulatory, or multi-stakeholder dynamics that can't be captured in a single purchase-intent question.

And for solopreneurs and startups? The same pipeline works at pocket-change cost. A solo founder testing a product idea can run SSR in an afternoon for the price of API credits, instead of spending $10K on a survey panel they can't afford. This is the real democratisation: research that was previously out of reach for small teams becomes routine.

The Caveats

SSR isn't magic. It has real limitations:

  • Training data dependency — it works for personal care products because LLMs have seen millions of consumer discussions. For niche or novel domains, there's no such corpus to draw from
  • Demographic blind spots — age and income patterns replicate well, but gender, region, and ethnicity don't. Subgroup analysis from synthetic panels needs caution
  • Hallucination risk — LLMs fabricate confidently about things they don't understand. SSR doesn't eliminate this
  • Reference statement design — different anchor sets produce slightly different mappings. The method needs standardisation
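
On the last point, one plausible mitigation is to score each response against several anchor sets, average the resulting distributions, and track how much they disagree. The sketch below assumes that approach. The function names are hypothetical and the three distributions are invented.

```python
def average_pmfs(pmfs):
    """Point-by-point mean of several distributions over the same scale."""
    count = len(pmfs)
    return [sum(column) / count for column in zip(*pmfs)]

def anchor_sensitivity(pmfs):
    """Largest disagreement on any single scale point across anchor sets."""
    return max(max(column) - min(column) for column in zip(*pmfs))

# One response scored against three hypothetical anchor sets (invented values).
by_anchor_set = [
    [0.05, 0.15, 0.30, 0.30, 0.20],
    [0.05, 0.10, 0.30, 0.35, 0.20],
    [0.10, 0.15, 0.25, 0.30, 0.20],
]

combined = average_pmfs(by_anchor_set)
print([round(p, 3) for p in combined])
print(round(anchor_sensitivity(by_anchor_set), 3))
```

A large sensitivity value signals that the rating depends more on the anchor wording than on the response itself.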

As the researchers put it: "It is important not to view synthetic surveys as universally reliable, but rather as tools whose validity depends on the alignment between training data and the survey domain."

The Bottom Line

SSR does something rare in AI research: it takes a well-known failure mode — LLMs can't do ratings — and shows it was a methodological artefact, not a fundamental limitation. The fix is elegant, inexpensive, and immediately applicable.

For solopreneurs, startup founders, and product builders, the takeaway is straightforward: you can now get consumer-quality signal without the consumer price tag. The billion-dollar survey industry just got a very credible challenger.

—

Tags: AI · Tech

—

Sources: Maier et al., "LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings"

The Observation Post — daily posts on tech, AI, and what matters.
