The Scalar Trap

The Case for Pluralistic Alignment

Essay · December 24, 2025

Maximilian Ruess

The prevailing orthodoxy in the development of artificial general intelligence (AGI) rests on a mathematically elegant but sociologically brittle assumption: that the messy richness of human values, ethics, and preferences can be compressed into a single scalar quantity. Building on Sutton's "Reward Hypothesis," this assumption holds that an agent's goals can be fully captured by a scalar reward r(s, a) received when it takes action a in state s, and that the agent's sole objective is to maximise the expected cumulative sum of these rewards.

For the better part of a decade, this reductionist framework—implemented via RLHF-style preference optimisation—has underpinned the alignment of frontier models. Supervised fine-tuning plus scalar reward learning turned raw next-token predictors into helpful assistants, capable of following natural-language instructions and obeying safety constraints at scale.

As models approach AGI‑level capabilities, however, the scalar assumption is revealing itself as a structural limitation. A single reward model trained on pooled human feedback must collapse diverse and often conflicting values into a kind of "mean" preference that over‑represents majorities and systematically under‑serves minority preferences. Recent theoretical work shows that, under standard RLHF assumptions, no single scalar reward can simultaneously align to heterogeneous groups without leaving a provable alignment gap, and that KL‑regularised optimisation tends toward preference collapse, where minority preferences are effectively ignored. The result is systems that are sycophantic, homogenised, and mathematically ill‑suited to represent the pluralistic nature of human society.

The friction between monolithic scalar rewards and genuinely plural human values has catalysed a new paradigm: Pluralistic Alignment. This essay traces the mathematical and technical roots of the scalar trap, analysing the impossibility theorems that clarify why single-reward systems fail for diverse populations. It surveys emerging solutions, from multi-objective optimisation to group-robust methods and model ensembles. And it argues that the shift from scalar to pluralistic alignment is not merely a technical fix but a fundamental reframe: away from a God Model that claims to have solved ethics, toward infrastructure capable of serving the full diversity of human values.

Historical Analysis: How the Scalar Assumption Became Default

The scalar-reward worldview that dominates contemporary alignment did not arrive as a deliberate philosophical choice. It was inherited from mid-20th century decision theory, classical reinforcement learning, and the practical constraints of training very large models. This section traces that lineage: from von Neumann and Morgenstern's single-agent utility theorem, through Arrow's warning that social preferences resist scalarisation, to Sutton's Reward Hypothesis and the Bradley-Terry reward models at the heart of RLHF for LLMs.

Von Neumann–Morgenstern and the Scalar Foundation (1944)

John von Neumann and Oskar Morgenstern's expected utility theorem is the bedrock. They showed that if an individual's preferences over options with uncertain outcomes satisfy four reasonable axioms—completeness, transitivity, continuity, and independence—then there exists a real-valued utility function u such that the agent's preferences are equivalent to maximising expected utility 𝔼[u].

This was profound for decision theory: it means rational preferences can always be represented as a scalar. But the theorem applies to one agent, not many. Over decades, this became the default intellectual model for rational choice:

Preference → Scalar Utility → Maximise 𝔼[u]

It was elegant, mathematically tractable, and led to tools and mental habits that treated scalar objectives as the natural language of goals.

Arrow's Impossibility and the Social Choice Warning (1951)

Kenneth Arrow asked the next question: What happens when you try to aggregate many individuals' preferences into a single social ordering? To answer it, he imposed a few minimal fairness conditions on any aggregation rule:

  • Unrestricted domain (the rule works for any preference profile)

  • Pareto efficiency (unanimous preferences become social preferences)

  • Independence of irrelevant alternatives (social choices between A and B depend only on how people rank A vs. B)

  • Non-dictatorship (no single person decides)

Arrow's impossibility theorem proved that no aggregation rule can satisfy all four conditions simultaneously when there are three or more options and arbitrary preference profiles.

This is where the vNM picture breaks. Scalar utility worked beautifully for individuals but it fails for societies. Yet this lesson remained siloed in social choice theory and largely absent from engineering and RL culture. Social choice theorists explored workarounds like restricting preference domains (single-peaked preferences), using cardinal information (median grades, utilitarian sums, max-min fairness), or embracing randomisation (probabilistic aggregation). The upshot: aggregation is possible, but only if you are explicit and principled about which axioms you sacrifice.

Sutton's Reward Hypothesis and Classical RL

As reinforcement learning matured, it crystallised around a simple design principle. Richard Sutton has long articulated it as the "Reward Hypothesis": that all goals can be understood as the maximisation of expected cumulative scalar reward. Silver et al.'s "Reward is Enough" (2021) elevated this from a computational convenience to a claimed sufficient condition for general intelligence, arguing that reward maximisation in sufficiently rich environments could produce perception, language, and reasoning as emergent byproducts.

For many classical RL domains, such as Atari, Go, and robotic control, scalar reward was natural. Game scores, task completion rates, and trajectory costs fit the formalism cleanly, and the results were spectacular. The scalar paradigm also offered genuine engineering advantages: with a single objective, gradient descent has a clear direction. With multiple objectives, the optimisation landscape becomes a Pareto frontier and improvements become ambiguous. Which direction is "down" when objectives conflict?

What went unexamined was the implicit generalisation: that all preferences, across all agents and contexts, could fit the same scalar mold. RL infrastructure, algorithms, and intuitions all solidified around this assumption.

Critiques emerged but gained little traction against the momentum of empirical success. Vamplew et al. (2022) issued a direct rebuttal to Silver's hypothesis, arguing that scalar reward is mathematically insufficient for safety-critical systems requiring multi-objective trade-offs.

One of their key observations was that scalarisation destroys information. When multiple objectives are collapsed into a single number before the agent sees them, distinctions vanish. An agent optimising a weighted sum of Speed + Safety cannot distinguish "moderately fast and moderately safe" from "very fast and very unsafe" if they sum to the same total value. The trade-off is invisible to the agent—resolved by annotators or designers before training, and never recoverable after.
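
The point can be made with two lines of arithmetic. Under the illustrative 50/50 weighting below, a balanced outcome and a reckless one receive identical scalar scores, so the agent literally cannot tell them apart:

```python
# Two very different trade-offs, scored with an illustrative 50/50 weighting.
moderate = {"speed": 5, "safety": 5}   # moderately fast, moderately safe
reckless = {"speed": 9, "safety": 1}   # very fast, very unsafe

def scalarise(outcome, w_speed=0.5, w_safety=0.5):
    return w_speed * outcome["speed"] + w_safety * outcome["safety"]

assert scalarise(moderate) == scalarise(reckless) == 5.0
# After scalarisation the trade-off is invisible to the agent.
```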

This creates an explainability deficit with direct implications for alignment. A scalar agent can only justify its actions by saying "this had the highest value." A multi-objective agent that tracks separate value functions can report: "I chose this route because it reduced collision risk by 50%, even though it increased travel time by 10%." In high-stakes domains like medicine, law, or autonomous systems, the inability to explain trade-offs is a fundamental barrier to trust and oversight.

Skalse and Abate (2023) went further, proving formally that the Reward Hypothesis is false for broad classes of tasks. They showed that multi-objective problems, risk-sensitive problems, and safety constraints cannot be represented by any scalar Markovian reward, even in principle. The formalism itself is structurally inadequate.

Preference Learning and the Bradley-Terry Turn

As RL researchers confronted the difficulty of hand-specifying good reward functions, a solution emerged: learn rewards from human preferences. Christiano et al. (2017) showed how to train a neural network reward model on pairwise comparisons using a Bradley-Terry loss, then use that learned scalar reward to train an RL agent via PPO. This was the pipeline that would become RLHF.

The approach used the Bradley-Terry-Luce model: human annotators see two outputs and pick the better one, and a reward model learns to predict a scalar score r(x, y) such that:

P(y_w \succ y_l \mid x) = \sigma\left(r(x, y_w) - r(x, y_l)\right)

where σ is the logistic function and r is the scalar reward.
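
A minimal training-loss sketch for this model, assuming a generic reward_model callable that maps a prompt and response to a scalar tensor (the interface is illustrative, not any particular library's API):

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_model, prompts, chosen, rejected):
    """Negative log-likelihood of the Bradley-Terry preference model.

    reward_model(prompt, response) -> scalar tensor (hypothetical interface).
    """
    r_chosen = torch.stack([reward_model(x, y) for x, y in zip(prompts, chosen)])
    r_rejected = torch.stack([reward_model(x, y) for x, y in zip(prompts, rejected)])
    # P(y_w > y_l | x) = sigmoid(r(x, y_w) - r(x, y_l)); maximise its log-likelihood.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```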

The model treats annotator disagreement as noise around a shared ground truth, not signal about genuinely different values. This framing was inherited without much scrutiny.

The result was a pipeline in which all friction—safety vs. helpfulness, verbosity vs. conciseness, Western vs. Eastern cultural norms—is resolved during annotation, before the model ever sees the data. The model receives only the final scalar verdict.

RLHF and Institutional Lock-In

OpenAI's InstructGPT (Ouyang et al., 2022) scaled this approach to large language models, establishing a three-step pipeline that would define the industry:

  • Supervised fine-tuning (SFT): on instruction-response pairs, creating a base policy.

  • Reward modelling: train a scalar reward model rᵩ(x,y) on pairwise human preference labels using Bradley-Terry loss.

  • RL fine-tuning: optimise the policy π to maximise 𝔼[rᵩ] subject to a KL penalty that keeps outputs close to the distribution of the SFT model (a minimal sketch of this objective follows the list).
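
The step-3 objective can be sketched per sample as the learned reward minus a KL penalty toward the frozen SFT model. The function below is a simplified illustration using a sequence-level KL estimate and an arbitrary β, not the exact PPO implementation used in practice:

```python
import torch

def rlhf_objective(reward, logp_policy, logp_ref, beta=0.1):
    """Per-sample RLHF objective: r_phi(x, y) minus a KL penalty.

    reward:      scalar reward r_phi(x, y) from the reward model, shape (batch,)
    logp_policy: log pi_theta(y | x), summed over response tokens
    logp_ref:    log pi_ref(y | x) under the frozen SFT/reference model
    beta:        KL penalty strength (illustrative value)
    """
    kl_estimate = logp_policy - logp_ref  # sequence-level estimate of KL(pi || pi_ref)
    return reward - beta * kl_estimate    # the quantity the RL step maximises
```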

The pipeline became the industry standard with remarkable speed. Within eighteen months, GPT, Claude, Gemini, and most open-source chat models had adopted variants of the same recipe. Direct Preference Optimisation (DPO) and its successors removed the explicit RL step but retained the scalar conceptual core—optimising the exact same implicit scalar reward objective, simply by solving for the optimal policy directly.

By 2025, the scalar assumption was part of the institutional infrastructure. Tooling (Transformers, TRL, OLMo), benchmarks (RewardBench), and shared mental models all assumed a scalar reward head and scalar loss. "Alignment" had become synonymous with "learning a scalar reward and optimising it".

The Mathematics of Impossibility

So far, the argument has been conceptual: vNM was designed for individuals, Arrow proved aggregation is fraught, and RLHF inherited both problems without examination. But how bad is it exactly? Can clever engineering route around this issue?

A wave of recent theoretical results has formalised exactly why single-reward systems fail to represent diverse populations.

MaxMin-RLHF: The Impossibility of Equitable Alignment (2024)

The most damning formalisation of the scalar trap comes from the work on "MaxMin-RLHF" by Chakraborty et al., which introduces an Impossibility Theorem of Alignment with Single Reward RLHF. This theorem effectively proves that you cannot align a single model to a diverse population without systematically failing minority groups.

The theorem considers a population composed of diverse subgroups (e.g., different cultures, political affiliations, or expertise levels). It proves that for any single scalar reward function rϕ used to align a policy π, there exists an Alignment Gap between the model's performance and the optimal policy for a specific subgroup H.

Formally, the theorem states that the Alignment Gap satisfies the following lower bound:

\mathbf{Align\text{-}Gap} \geq \frac{\lambda_{\psi}}{64\beta^2 L_{\pi}} \cdot \frac{\epsilon(1-\eta(u))}{D^2}

Let us break down the components of this inequality to understand its implications:

  • ε (Diversity/Distinctiveness): This term represents how different the subgroup's (H) preferences are from the rest of the population. If a subgroup has unique values (e.g. distinct cultural or ethical commitments), ε is large.
  • η(u) (Representation): This denotes the proportion of the subgroup in the total population. If the group is a small minority, η(u) is close to 0, making the term (1−η(u)) close to 1.
  • Align-Gap: This measures the difference in utility between what the subgroup wants and what the scalar model delivers.

The Insight: The theorem proves that the Alignment Gap is proportional to ε(1−η(u)). This means that as a subgroup becomes more unique (ε increases) or less represented (η(u) decreases), the alignment gap must increase. It is impossible to align a single scalar model to a diverse population without systematically disenfranchising the minority groups whose preference vectors are orthogonal to the majority's.
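
To make the scaling concrete, the right-hand side of the bound can be evaluated with all problem-dependent constants (λ_ψ, β, L_π, D) collapsed into a single factor; the numbers below are purely illustrative:

```python
def alignment_gap_lower_bound(epsilon, eta, constant=1.0):
    """Lower bound on the alignment gap, up to the problem-dependent constant
    lambda_psi / (64 * beta^2 * L_pi * D^2), collapsed here into `constant`."""
    return constant * epsilon * (1.0 - eta)

# A distinctive group (epsilon = 0.8) fares worse the smaller its population share eta:
for eta in (0.05, 0.25, 0.50, 0.90):
    print(f"eta = {eta:.2f}  guaranteed gap >= {alignment_gap_lower_bound(0.8, eta):.3f}")
```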

This is not a data quality problem; it is a structural impossibility of projecting high-dimensional preference manifolds onto a 1D line. The scalar reward function acts as a low-pass filter, smoothing out the distinct "high-frequency" values of specific communities.

Preference Collapse: The KL Bias (2024)

Even if we could learn a perfect reward model, the optimisation process itself introduces bias. Xiao et al. (2024) analysed the standard RLHF objective, which maximises reward subject to a KL-divergence constraint.

They discovered a phenomenon they termed Preference Collapse. Because KL regularisation penalises the model for deviating from the base distribution, the optimal policy tends to concentrate probability mass on the single "safest" or most majority-preferred response. It does not match the distribution of human preferences. It collapses to the mode.

If 60% of humans prefer A and 40% prefer B, a KL-regularised model will often output A nearly 100% of the time, effectively erasing the minority view. RLHF amplifies majority dominance. It turns a lean in the training data into a monopoly in the deployed model.
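
The mechanism can be reproduced numerically. The KL-regularised objective has the closed-form optimum π*(y|x) ∝ π_ref(y|x)·exp(r(y)/β); the sketch below fits a Bradley-Terry reward to a 60/40 preference split and shows how shrinking β drives the output toward the majority answer (all numbers illustrative):

```python
import numpy as np

def kl_optimal_policy(ref_probs, rewards, beta):
    """Closed-form maximiser of E_pi[r] - beta * KL(pi || pi_ref) over a finite answer set."""
    logits = np.log(ref_probs) + rewards / beta
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Bradley-Terry reward consistent with "60% prefer A over B": r(A) - r(B) = log(0.6 / 0.4)
rewards = np.array([np.log(0.6 / 0.4), 0.0])
ref = np.array([0.5, 0.5])  # reference policy is indifferent between A and B

for beta in (1.0, 0.3, 0.1, 0.03):
    p_a, p_b = kl_optimal_policy(ref, rewards, beta)
    print(f"beta = {beta:<4}  P(A) = {p_a:.3f}  P(B) = {p_b:.3f}")
# As beta shrinks, P(A) -> 1: the 40% minority preference collapses away.
```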

[Interactive demo: adjust the initial preference split (e.g. A: 60%, B: 40%) and run RLHF to see how KL regularisation collapses the minority preference in the model's output.]

The Alignment Trilemma (2025)

These individual results were recently unified into a broader complexity-theoretic framework: the Alignment Trilemma (2025). This work formalises the tension between three distinct goals we want from any alignment system:

  • Representativeness (ε-Representative): The system accurately reflects the diverse values of the entire user population.
  • Tractability (Polynomial Complexity): The system can be trained and run with reasonable compute and data.
  • Robustness (δ-Robust): The system is stable against adversarial attacks and distribution shifts.

The theorem proves it is impossible to satisfy all three simultaneously for global-scale populations. If you want representativeness and robustness, the complexity becomes super-polynomial. You need exponentially more data to robustly distinguish diverse preferences. Current RLHF systems solve this by sacrificing representativeness. They achieve tractability and partial robustness by collapsing the population into a homogeneous proxy: the average rater.

The scalar trap is a mechanism for navigating this trilemma by implicitly discarding diversity. Acknowledging this allows us to stop pretending we can have it all, and start designing systems that make principled, transparent trade-offs.

The Rise of Pluralistic Alignment

Recognising the mathematical dead‑end of the scalar assumption, the field is pivoting toward Pluralistic Alignment. The goal is shifting from "aligning with human values" (singular) to "aligning with human value systems" (plural): not one God-model for everyone, but systems that can represent and navigate multiple, sometimes conflicting, normative frames.

Sorensen et al. (2024) provide a useful taxonomy in "A Roadmap to Pluralistic Alignment", distinguishing three operational modes: Overton pluralism, Steerable pluralism, and Distributional pluralism. These are not mutually exclusive; they are different ways of operationalising pluralism in deployed systems.

Overton pluralism

Named after the Overton window of political discourse, Overton pluralism requires a model to present a range of reasonable positions rather than selecting a single "correct" answer.

  • Mechanism: Given a controversial prompt ("Is taxation theft?"), an Overton-pluralistic model does not return a single verdict. It describes major perspectives: "Libertarians often argue that taxation is theft based on strong property rights; social democrats view it as a legitimate instrument for funding public goods; socialists may frame it as partial redistribution of surplus value."
  • Goal: Comprehensiveness and neutrality. The model behaves like a librarian or explainer of viewpoints, not an arbiter of truth.
  • Limitation: It pushes the cognitive and moral synthesis back onto the user. For pure information access this is ideal, but for action-critical systems (autonomous vehicles, trading agents, medical tools), "here are five valid options" is not enough. A self-driving car cannot say "some prefer stopping, others prefer speeding up." It must select a policy and act.

Overton pluralism is best seen as a presentation layer: crucial for transparency and civic discourse, but incomplete without mechanisms for choosing and steering.

Steerable pluralism

Steerable pluralism is currently the most commercially active area. A steerably pluralistic model can adopt a specific persona, value framework, or demographic perspective on demand.

  • Mechanism: The model is conditioned on a control signal: a system prompt, control tokens, or a latent variable z representing a value vector. Examples include "As a utilitarian ethicist, analyse this dilemma," "From a deontological perspective, analyse this dilemma," or "Answer as a Japanese doctor speaking to a patient's family." Architecturally, this can be implemented by conditioning on a learned embedding z, or by adding a small preference module (as in Group Preference Optimisation) that modulates the base model's logits given a group or persona.
  • Goal: User agency over scalarisation. Instead of one hidden scalar combining all objectives, the user or downstream system chooses which scalarisation to apply in a given interaction.
  • Technical challenge: The model's parameters must encode multiple, potentially conflicting policy manifolds without catastrophic interference. It must be capable of operating as "fully utilitarian" in one context and "fully deontological" in another, without each mode dragging the other toward a bland average. This motivates methods like GPO and adaptive reward features that explicitly model group or persona conditioning, rather than relying on prompts alone.

Steerable pluralism is where most near-term product work sits: enterprise assistants with "tone" sliders, models that adapt to jurisdictions, and role-conditioned systems for professions.
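
As a toy illustration of the conditioning mechanism described above, a small module could add a learned, persona-specific bias to a frozen base model's output logits. This is a hypothetical sketch of the general idea, not the GPO implementation:

```python
import torch
import torch.nn as nn

class PersonaLogitModulator(nn.Module):
    """Hypothetical steering module: adds a persona-conditioned bias to base logits."""

    def __init__(self, num_personas: int, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.persona_embedding = nn.Embedding(num_personas, hidden_dim)
        self.to_logit_bias = nn.Linear(hidden_dim, vocab_size)

    def forward(self, base_logits: torch.Tensor, persona_id: torch.Tensor) -> torch.Tensor:
        # base_logits: (batch, seq, vocab) from the frozen base model; persona_id: (batch,)
        z = self.persona_embedding(persona_id)      # (batch, hidden) value vector z
        bias = self.to_logit_bias(z).unsqueeze(1)   # (batch, 1, vocab), broadcast over seq
        return base_logits + bias                   # steered next-token distribution
```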

Distributional pluralism

Distributional pluralism is the most ambitious and mathematically demanding form. Rather than selecting a persona per query, it aims to have the model's overall output distribution match a target distribution over human beliefs or preferences.

  • Mechanism: Suppose the target population distribution over answers to some question is 60% in favour of A and 40% in favour of B. The distributionally pluralistic system aims for its responses, over repeated draws or via stochastic decoding, to approximate that same 60/40 split. Formally, for a query x, if the human population induces a distribution P_human(y|x), we want:
P_{\text{model}}(y \mid x) \approx P_{\text{human}}(y \mid x)

This target can be conditioned on jurisdiction, platform, or stakeholder weights.

  • Goal: Representation and calibration. Human variation is treated as signal. The model becomes a representative sampler from a pluralistic society rather than a homogenising filter.

  • Contrast with scalar RLHF: Standard RLHF, especially under KL regularisation, tends toward preference collapse. If 60% prefer A and 40% prefer B, training drives the model toward outputting A nearly 100% of the time. Distributional pluralism tries to preserve that 40%, for example via calibrated reward ensembles, MaxMin-RLHF, GRPO, or representative social choice frameworks.

Distributional pluralism is the natural endpoint for a mathematical theory of plural alignment. The model approximates a distribution over values rather than collapsing to a single social utility.
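
One simple way to evaluate this property, assuming repeated stochastic sampling from the model, is to compare its empirical answer distribution against the target split. The sketch below uses a hypothetical sample_answer function standing in for a stochastic decoder:

```python
from collections import Counter

def distributional_gap(sample_answer, target, n_samples=1000):
    """Total-variation distance between the model's empirical answer distribution
    and the target population distribution, e.g. target = {"A": 0.6, "B": 0.4}.

    sample_answer() -> one of target's keys (hypothetical stochastic decoder).
    """
    counts = Counter(sample_answer() for _ in range(n_samples))
    model_probs = {k: counts.get(k, 0) / n_samples for k in target}
    return 0.5 * sum(abs(model_probs[k] - p) for k, p in target.items())
```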

Technical Solutions: Engineering Pluralism

If the scalar trap is a structural failure, the solution has to be structural engineering. New methods are moving away from "one PPO run on one reward head" toward multi-objective, group-robust, and ensemble-based optimisation.

Below are three mathematically grounded examples that connect directly to the pluralism modes above.

Rewarded Soups: Cheap Pareto Steering

Rewarded Soups (Ramé et al., 2023) offer a simple but powerful approach to Steerable Pluralism without retraining for every preference setting.

The key observation is linear mode connectivity: fine-tuned models for related tasks often lie in a connected low-loss region of weight space. This allows interpolation:

  • Train separate expert models. θ_safe is fine-tuned to maximise a safety reward (avoiding harmful content). θ_helpful is fine-tuned to maximise a helpfulness reward (detailed assistance, even if edgy).
  • Form a soup at inference time by linearly interpolating weights:
\theta_{soup} = \lambda \theta_{safe} + (1 - \lambda) \theta_{helpful}

where λ ∈ [0, 1].

As λ varies, θ_soup traces a path in weight space. Empirically, many such paths approximate the Pareto frontier between the two objectives. Intermediate λ values produce models that score well on both safety and helpfulness. A soup with λ=0.5 can outperform a single model trained on a 50/50 mixture of safety and helpfulness data, because each expert was allowed to specialise before being blended.
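
In code, the soup is nothing more than a per-parameter weighted average of the two experts, which can be sketched over PyTorch-style state dicts (assuming the experts share an identical architecture):

```python
def rewarded_soup(state_safe, state_helpful, lam):
    """theta_soup = lam * theta_safe + (1 - lam) * theta_helpful, per parameter tensor."""
    assert state_safe.keys() == state_helpful.keys(), "experts must share an architecture"
    return {
        name: lam * state_safe[name] + (1.0 - lam) * state_helpful[name]
        for name in state_safe
    }

# Usage sketch (PyTorch): model.load_state_dict(rewarded_soup(safe_sd, helpful_sd, lam=0.5))
```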

This approach assumes linear mode connectivity: that the two optima are connected by a low-loss corridor. When objectives are too orthogonal, interpolating can cross a high-loss barrier and degrade performance. The method is currently most reliable for closely related objectives. For deeply conflicting norms, more structured multi-objective methods like MODPO may be needed.

Rewarded Soups are mathematically simple, but they embody an important idea: maintaining multiple optima and interpolating between them at deployment time. This is Steerable Pluralism in weight space.

[Interactive demo: drag a slider to set λ between the helpfulness-optimised expert (λ = 0) and the safety-optimised expert (λ = 1); the resulting soup traces an approximate Pareto frontier of safety versus helpfulness scores.]

Group Robust Preference Optimisation (GRPO): Worst-Case Groups First

While Rewarded Soups address steering along objectives, Group Robust Preference Optimisation (GRPO) targets the fairness problem raised by the MaxMin-RLHF impossibility result: single-objective training tends to privilege the majority.

Standard Direct Preference Optimisation (DPO) minimises expected loss over all training data:

L_{\text{DPO}}(\pi_{\theta}; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]

GRPO instead partitions data into groups g ∈ G (demographic slices, language communities, or topical clusters) and defines a group-robust objective:

L_{\text{GRPO}}(\theta) = \max_{g \in G} \; \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}_g}\left[\ell(\pi_{\theta}; x, y_w, y_l)\right]

Intuitively, an adversary chooses the worst-performing group at each step, and the optimiser updates parameters to reduce loss on that group. This guarantees that no group is systematically neglected.

During training, GRPO adaptively reweights data from different groups. Underperforming groups get upweighted; well-served groups get downweighted. Experiments show GRPO substantially improves worst-group performance and reduces disparities in win-rates across groups compared to vanilla DPO.
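
A minimal sketch of that reweighting, in the style of a GroupDRO/exponentiated-gradient update (the step size and grouping are illustrative; the published GRPO algorithm differs in detail):

```python
import torch

def group_robust_step(per_group_losses, group_weights, step_size=0.1):
    """One group-robust update: upweight the worst-served groups, then form a
    weighted loss. per_group_losses maps group id -> mean preference loss (tensor)."""
    # Adversarial reweighting: groups with higher current loss get exponentially more weight.
    new_weights = {
        g: group_weights[g] * torch.exp(step_size * per_group_losses[g].detach())
        for g in group_weights
    }
    total = sum(new_weights.values())
    new_weights = {g: w / total for g, w in new_weights.items()}
    robust_loss = sum(new_weights[g] * per_group_losses[g] for g in per_group_losses)
    return robust_loss, new_weights  # backpropagate robust_loss; carry new_weights forward
```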

The connection to the MaxMin theorem is direct. In the MaxMin-RLHF inequality, the alignment gap grows like ε(1−η). The more distinct and less represented a group, the worse it fares. GRPO effectively boosts η for disadvantaged groups by giving them more influence in the gradient signal. It does not remove the impossibility; you still cannot satisfy everyone perfectly with one policy. But it shifts optimisation toward egalitarian robustness rather than average performance.

Multi-Objective Preference Optimisation (MOPO): Vector Rewards, Constrained Objectives

Multi-Objective Preference Optimisation (MOPO, MODPO, AMoPO) moves plurality into the reward itself. Rather than a scalar R, the model optimises a vector reward:

\vec{R}(x, y) = \left[r_{\text{help}}(x, y), \; r_{\text{safe}}(x, y), \; r_{\text{truth}}(x, y), \; \dots\right]

This opens up two mathematically rich regimes.

The first is scalarisation with constraints:

\text{Maximise } r_{\text{help}} \quad \text{subject to} \quad r_{\text{safe}} \geq \tau

Here, safety is treated as a hard constraint, while helpfulness is optimised within that feasible set. This is classic constrained RL and MORL applied to preference models.
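
One standard way to realise such a constraint is a Lagrangian relaxation: penalise any safety shortfall below τ and adapt the multiplier online. This is a generic constrained-optimisation sketch, not the specific MOPO/MODPO formulation:

```python
def constrained_scalar_reward(r_help, r_safe, tau, lam):
    """Lagrangian relaxation of: maximise r_help subject to r_safe >= tau."""
    shortfall = max(0.0, tau - r_safe)  # how far the sample falls below the safety floor
    return r_help - lam * shortfall

def update_multiplier(lam, r_safe, tau, lr=0.01):
    """Dual ascent: increase lambda while the safety constraint is being violated."""
    return max(0.0, lam + lr * (tau - r_safe))
```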

The second is explicit Pareto optimisation. The training objective aims to approximate the Pareto frontier of R, yielding a family of policies {π_λ}, each corresponding to a different weight vector λ over objectives.

Recent work demonstrates both approaches. MODPO can reproduce MORLHF solutions (which use multiple scalar reward models and PPO) more efficiently and stably by treating the language model itself as the parameterisation of the multi-objective policy. AMoPO introduces adaptive weighting over dimensions based on current performance, improving multi-objective alignment scores by large margins on multi-criteria benchmarks compared to single-objective baselines.

Related innovations like Critique-out-Loud reward models add an interpretability layer. The reward model generates an explicit textual critique ("Helpful but factually ambiguous and somewhat unsafe") and then a score. This is a natural match to vector rewards: different aspects of the critique can be mapped to different reward components, allowing the policy to reason about trade-offs explicitly.

Vector rewards matter for pluralism because they preserve information. The agent does not just know "this is 7.3"; it knows "this is (helpful = 9, safe = 3, truthful = 8)." Trade-offs become explicit. Policy selection and steering can be based on context or user choice. The earlier objection that scalarisation destroys information before training dissolves: the information is preserved in the vector.

The God Model and Its Discontents

The previous section described technical solutions to specific failure modes of scalar alignment. But behind the mathematics lies a deeper question: What kind of AGI does the scalar assumption imply, and do we actually want it?

The scalar paradigm, taken seriously, points toward a singular destination: one model, trained on aggregated global preferences, optimised against one reward function, deployed to serve all of humanity. Call it the God Model. The impossibility theorems force us to ask what this entity would actually represent, and for what purposes it would be adequate.

What Scalar AGI Implies

The God Model is trained on a reward function that compresses humanity's diverse preferences into a single axis. Its "values" are those of the average human, a statistical construct that corresponds to no actual person. It would be confidently moderate on contested questions, systematically aligned to majority views, and structurally incapable of articulating why it chose one trade-off over another.

This should sound familiar: it is Arrow's impossibility theorem applied to alignment. Arrow proved that no aggregation rule can convert diverse individual preferences into a single coherent social ordering while satisfying non-dictatorship, Pareto efficiency, and independence of irrelevant alternatives. Yet that is precisely what we attempt when we train a single reward model on pooled global feedback and optimise against it.

The God Model is not aligned to humanity. It is aligned to the output of an impossible aggregation, a mathematical compromise that violates at least one of Arrow's axioms, usually without acknowledgment. In practice, it sacrifices Pareto efficiency and non-dictatorship: majority preferences dominate, and minority values are erased.

Where Scalar AGI Works

The scalar-average model is perfectly appropriate for a large class of problems.

  • Factual and computational tasks. "What is the capital of France?" or "Factor this polynomial." There is no genuine preference divergence. The answer is right or wrong.
  • Coordination problems. "Which side of the road should cars drive on?" The actual choice matters less than universal agreement. Averaging is sensible because coordination value is symmetric.
  • Uncontroversial domain expertise. "How do I reset a router?" or "What are the symptoms of pneumonia?" Most practitioners converge on the answer. Averages capture legitimate consensus.
  • Logistics and infrastructure. Route planning, scheduling, resource allocation. These are optimisation problems where scalar objectives (minimise cost, maximise throughput) directly apply.

In these domains, the God Model makes sense. One model, one answer, deployed everywhere. It functions as infrastructure, like electricity grids or roads. It works because the problem genuinely has a single right answer, or because coordination is valuable enough to override preference variation.

Where Scalar AGI Fails

The scalar model breaks down when preferences reflect genuine value divergence: different frameworks for meaning, ethics, aesthetics, and how to live.

  • Ethics and meaning. "Is it permissible to pursue ambition at the cost of family time?" or "What makes a life worth living?" These are sites of legitimate disagreement between ethical frameworks, cultures, and individuals.
  • Aesthetics and culture. "What is beautiful?" or "What makes a good story?" Cultural and individual variation is the substance of human creativity and meaning-making. A scalar-averaged answer is necessarily bland, the mean preference that satisfies no one.
  • High-stakes personal decisions. "Should I take this job?" or "How do I parent my child?" These are deeply contextual. "What the global average prefers" is irrelevant and often offensive. The person needs a system that understands their values, not a weighted average of humanity's.
  • Explainability crisis. Here the impossibility comes full circle. A scalar reward model that has collapsed trade-offs into a single number cannot articulate why. You ask, "Why did you recommend this?" The answer is "Because it scored 7.8." You ask, "What about safety?" "That's already in the score." You ask, "What about cultural sensitivity?" Silence. The trade-offs were made by annotators, baked into the reward, and are no longer recoverable.
  • Monoculture risk. Deploying one value system globally risks cultural homogenisation. If scalar optimisation trains the most powerful AI systems, and those systems deploy globally, they optimise for the averaged values of the training population. Over time, this erodes local meaning-making, institutional diversity, and legitimate cultural variation.

The Reframe: Pluralistic AGI as Appropriate Infrastructure

This is where pluralistic alignment offers a different answer to what AGI should be.

The God Model answers the question: "What is the one right way to think?" Pluralistic alignment answers: "How can powerful reasoning adapt to many legitimate ways of thinking?"

From Oracle to Adaptive Infrastructure

The scalar model treats AGI as an oracle. You ask it a question, it returns the answer. There is one answer because there is one value system, one goal, one definition of good. This works for factual questions. It fails for value-laden ones.

Pluralistic alignment treats AGI as adaptive infrastructure: like a road network, a communication protocol, or a court system. It is powerful and general, but it does not impose a single vision of the good. Instead, it enables many different actors (individuals, communities, institutions) to pursue their own aims while maintaining interoperability and shared standards.

Steerability and Multi-Objective Awareness as Features

In the scalar frame, steerability and multi-objective trade-offs look like limitations. If a system is steerable, it is not "truly" aligned; it is just doing what users tell it. If a system has to balance competing objectives, it is not optimising for the "right" thing.

This is backwards. Steerability is alignment for a plural world. If values are contested and contextual, then a system that adapts to different contexts, stakeholders, and user needs is more aligned to reality than one that cannot.

Multi-objective awareness (the ability to reason about trade-offs explicitly) is a feature that makes AGI appropriate to the human condition. Humans live with trade-offs: safety vs. autonomy, equality vs. liberty, efficiency vs. meaning. A system that can articulate these trade-offs, respect hard constraints, and allow stakeholders to choose weightings is more honest and more usable than one that pretends ethics can be solved with a single scalar.

The Normative Shift

What changes is the normative claim:

  • Scalar AGI claims: "I am aligned to human values (singular). Trust me; I have solved ethics."
  • Pluralistic AGI claims: "I operate across multiple legitimate value frameworks. You can inspect and steer my reasoning. My outputs reflect trade-offs you can interrogate and modify. I work better with you than for you."

The first is a promise that cannot be kept (Arrow proved it). The second is honest and scales with human diversity instead of against it.

Concrete Implications

The scalar trap has revealed a fundamental tension between the needs of a plural world and the constraints of a scalar system. The scalar-average model is a mathematical impossibility, and the God Model is a normative dead end.

Pluralistic alignment offers a way forward: a framework that embraces the plurality of human values while preserving the power and generality of AGI.

Conclusion: Pluralism as the Engine of AGI Utility

The scalar trap was inherited from decision theory, operationalised in reinforcement learning, and scaled to billions of users before anyone asked whether the assumptions held. Von Neumann and Morgenstern designed expected utility for individual rationality. Arrow proved that aggregating preferences across individuals is structurally fraught. RLHF ignored both lessons and built infrastructure that treats humanity as a single agent with a single reward function.

The impossibility theorems reviewed here are not endpoints. They are foundational specifications. Arrow tells us not to expect fair aggregation from ordinal rankings. The MaxMin bound tells us exactly how alignment gaps scale with diversity. Preference collapse tells us that KL regularisation amplifies majorities. The alignment trilemma tells us we cannot have representativeness, tractability, and robustness simultaneously.

The Optimistic Case: Why Pluralism Wins

The shift to Pluralistic Alignment may be the key to unlocking the true potential of AGI.

From oracles to infrastructure. A scalar-aligned AGI is an oracle: it gives "the" answer, optimised for an average user who does not exist. A pluralistic AGI is adaptive infrastructure: it responds to the context, values, and needs of the specific human using it. A doctor gets a medical assistant aligned to medical ethics. A creative writer gets a collaborator aligned to aesthetic risk-taking. The utility of the system scales with its ability to adapt.

Complexity as a feature. Human values are complex because reality is complex. A system that can model and navigate multiple conflicting objectives (safety vs. speed, fairness vs. efficiency) is not misaligned. It is high-fidelity. By embracing vector rewards and Pareto frontiers, we build systems that reason about the world as it actually is. This makes them more capable, more nuanced, and ultimately more trustworthy as collaborators.

Democratic intelligence. The God Model creates a monoculture. Pluralistic alignment creates something closer to a democracy of intelligence. By building mechanisms for contestation, steerability, and representation directly into the optimisation loop, we ensure that AGI serves human flourishing in its diversity. This is the difference between an AI that imposes and an AI that enables.

Open Questions: The Frontier Ahead

The field has made remarkable progress in a short time. The most exciting questions now concern how capable these systems can become.

Can pluralistic systems discover better trade-offs than humans? Humans struggle to balance conflicting values: equity against efficiency, safety against capability, short-term against long-term. Multi-objective systems that maintain Pareto frontiers can, in principle, identify solutions that dominate any single human proposal: policies that are simultaneously safer and more helpful. Whether this works in practice at scale is an empirical question worth pursuing.

How do we measure pluralistic alignment? Current benchmarks assume scalar scores. A model either wins or loses on a leaderboard. We lack robust metrics for distributional faithfulness, steering precision, or worst-group performance across diverse populations. Without evaluation infrastructure, progress is hard to verify. Building benchmarks that capture pluralistic success is prerequisite to the field maturing.

Does pluralism scale? Can a single foundation model support thousands or millions of distinct value configurations without interference or collapse? If so, we approach a world where every community and institution has access to infrastructure aligned with its own values. That possibility is worth working toward.

The Thesis in One Sentence

The assumption that human values compress into a single scalar was a necessary simplification. Recognising its limits and building pluralistic systems is what transforms AGI from a brittle oracle into a genuine partner for the full range of human purposes.

The scalar assumption offered false simplicity. Pluralistic alignment offers complex reality. And reality is where we live.
