Overview
- Formalizes history-aware paraphrase robustness as a discrepancy diameter across prompt-conditioned output laws.
- Models epistemic uncertainty through random model indices and reports credible robustness certificates.
- Accounts for dependent interaction logs using chains with complete connections and dependence-aware limit theory.
Abstract
Updated June 22, 2026
We study uncertainty assessment for history-aware paraphrase robustness of
large language models (LLMs). We view an LLM as a history-dependent stochastic kernel
K_θ(· ∣ H, q) and compare, across paraphrases q ∈ U(H), the induced
laws of an observable Z = g(Y) under a discrepancy d. Robustness is quantified as the
d-diameter of the set of laws {F_{θ;H,q} : q ∈ U(H)}, and epistemic
uncertainty is modeled by a random model index Θ following
Π_epi, yielding credible robustness certificates.
To address dependence in evaluation logs, we model the token or turn stream as a
g-measure, also known as a chain with complete connections. Under explicit mixing and
complexity conditions, we prove a functional empirical-process central limit theorem
for bounded evaluation function classes, extend it to increasing history length through
a CLT-stable truncation approximation, and establish moderate-deviation guarantees for
high-confidence reporting. Numerical experiments validate Gaussian and moderate-deviation
calibration under dependent sampling and illustrate the effect of growing context.
Methods and contribution
The project treats paraphrase robustness as a distributional property rather than a single
prompt-level score. For each dialogue history, a set of semantically equivalent prompts
induces output laws over task-relevant observables, and their discrepancy diameter becomes
the robustness target.
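As a rough illustration of the robustness target described above, the sketch below computes a plug-in discrepancy diameter from sampled outputs. It is not the paper's estimator: it assumes a discrete observable, uses total variation as a stand-in for the discrepancy d, and the function names (`empirical_law`, `discrepancy_diameter`, `credible_upper_bound`) are hypothetical. The last helper assumes one diameter has already been computed per sampled model index and simply reports an empirical quantile of those values.

```python
import math
from collections import Counter
from itertools import combinations


def empirical_law(samples):
    """Empirical distribution of a discrete observable Z = g(Y)."""
    n = len(samples)
    return {z: c / n for z, c in Counter(samples).items()}


def total_variation(p, q):
    """Total variation distance between two discrete laws given as dicts."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(z, 0.0) - q.get(z, 0.0)) for z in support)


def discrepancy_diameter(samples_by_paraphrase, d=total_variation):
    """Largest pairwise discrepancy among the paraphrase-conditioned laws."""
    laws = [empirical_law(s) for s in samples_by_paraphrase.values()]
    return max((d(p, q) for p, q in combinations(laws, 2)), default=0.0)


def credible_upper_bound(diameters, level=0.9):
    """Empirical `level`-quantile of diameters, one per sampled model index.

    Assumption: `diameters` holds one value per draw of the model index.
    """
    ordered = sorted(diameters)
    k = max(0, math.ceil(level * len(ordered)) - 1)
    return ordered[k]


# Toy data (made up): observable values sampled under three paraphrases
# of the same query, for one fixed history and one fixed model index.
samples = {
    "q1": ["yes", "yes", "no", "yes"],
    "q2": ["yes", "no", "no", "yes"],
    "q3": ["yes", "yes", "yes", "yes"],
}
diameter = discrepancy_diameter(samples)
```

On this toy data the largest gap is between the second and third paraphrases. Any other discrepancy on laws can be passed through the `d` argument.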
The statistical contribution combines epistemic model uncertainty with dependence-aware
inference for evaluation logs. The framework develops empirical-process central limit
theory, an increasing-history approximation for longer contexts, and moderate-deviation
tools for high-confidence robustness reporting.
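To see why dependence in evaluation logs matters for uncertainty reporting, the following sketch compares a naive i.i.d. standard error with a batch-means standard error on a correlated score stream. Batch means is a standard textbook device used here as a stand-in; it is not the empirical-process theory developed in the paper. The AR(1) generator is a toy substitute for a dependent log, and all names are hypothetical.

```python
import math
import random


def naive_se(values):
    """Standard error of the mean that ignores dependence (i.i.d. formula)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return math.sqrt(var / n)


def batch_means_se(values, n_batches):
    """Standard error of the mean from non-overlapping batch means."""
    size = len(values) // n_batches
    if n_batches < 2 or size < 1:
        raise ValueError("need at least 2 batches with at least 1 value each")
    used = values[: size * n_batches]  # drop the incomplete trailing batch
    means = [sum(used[i * size:(i + 1) * size]) / size for i in range(n_batches)]
    grand = sum(means) / n_batches
    var_of_means = sum((m - grand) ** 2 for m in means) / (n_batches - 1)
    return math.sqrt(var_of_means / n_batches)


def ar1_scores(n, rho, seed=0):
    """Toy dependent score stream: x_t = rho * x_{t-1} + Gaussian noise."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = rho * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


scores = ar1_scores(20000, rho=0.9)
se_naive = naive_se(scores)
se_batch = batch_means_se(scores, n_batches=40)
```

With strongly positive correlation, the batch-means value should come out noticeably larger than the naive one, which is the practical reason an i.i.d. interval would be overconfident on such logs.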
Materials
Paper
Code not public