Center for Social Science Scholarship · UMBC
Dr. Michael Overton
Associate Professor of Public Administration
Associate Director, Institute for Interdisciplinary Data Science
University of Idaho
To download this presentation and follow along, go to michaeloverton.net or mro0001.github.io/evaluation-presentation.
What this talk covers
Theory, then framework architecture, then criteria, then how to measure each, with examples woven in.
§1 · Where we are
3.02× more papers
4.84× more citations
21% of ICLR 2026 peer reviews were fully AI-generated
AI reviewers accept fabricated papers >80% of the time
§1 · Illustration
New session in Claude Code (Opus 4.7, maximum effort). Prompt: "I want to play a game of 20 questions: ready?" The model said "I've picked something. Question 1?" — then I used /branch to fork the session into two parallel runs from the same starting point.
Asked one question, then quit and asked the model what its word had been.
"lighthouse"
Asked the model to write its word to a file before any questions were asked.
"kangaroo"
The model claimed it had picked a word. It produced two different words. Both plausible games of 20 Questions — only one was verifiable.
§1 · Where we are
Input → LLM (prompt · model · iteration) → Output
The capability is real. The methods to evaluate it are not.
§1 · Where we are
Click any card to expand.
§2 · First principles
What does scientific, rigorous use of an LLM require? Start from what the thing actually is.
Functional correctness. Pass/fail tests. Throughput. Accuracy on a fixed benchmark. Treat the model like a deterministic function.
Misses: LLMs are non-deterministic by design. “Same input, same output” fails even at temperature zero across deployments.
Psychometric measurement. Inter-rater reliability with human coders. Validated coding schemes. Treat the LLM as another annotator.
Misses: an exponentially large output space, semantic equivalence, two confidence channels that can disagree, and RLHF-induced overconfidence, none of which a human coder exhibits.
Both defaults import assumptions the model violates.
§2 · The metrology pivot
Classical measurement assumes a true score. Metrology asks how wide the plausible range is when no true score exists.
observed = true score + error
Is this measurement right or wrong relative to a known true value?
measurement result = best estimate ± uncertainty
How much confidence can we place in this measurement?
Treat the LLM as a measurement instrument.
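The metrology framing above can be sketched in a few lines: report repeated-run outputs as a best estimate plus a standard uncertainty. `measurement_result` is a hypothetical helper, assuming the measurand is numeric (e.g. a score averaged over repeated runs).

```python
import statistics

def measurement_result(runs):
    """Metrology framing: best estimate ± uncertainty from repeated runs.
    runs: numeric outputs from repeated applications to the same input."""
    est = statistics.mean(runs)
    # standard error of the mean as the standard uncertainty
    u = statistics.stdev(runs) / len(runs) ** 0.5
    return est, u
```

The point is the reporting convention, not the arithmetic: the instrument's output is an interval, not a true score.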
From foundation to practice
Protocol: the framework, the decisions a researcher makes, and the workflow for assessing them.
TaMPER · Three decisions, two actions · Two response formats · Sample-Benchmark-Population
Criteria and Metrics: the specific things we measure, and the numbers we use to measure them.
Compliance · Accuracy · Precision · Quality
§3 · The protocol
Five components for documenting and evaluating any LLM-assisted analysis (Overton et al. 2025).
Task
What you want the LLM to do.
Model
Which LLM, configured how.
Prompt
How you instruct it.
Evaluation
How you validate the outputs.
Reporting
How you document everything.
§3 · The protocol
Decision points · what the researcher chooses
Task
What is being measured?
Model
Which instrument?
Prompt
How is the instrument configured?
Actions on those decisions
Evaluation
Validate the chosen task, model, and prompt against compliance, accuracy, and precision.
Reporting
Document each decision and its validation in a way a reviewer can audit.
§3 · The protocol
Prompt
Is this tweet toxic? Return a label and a brief explanation.
Tweet
@hammyp703 @Reuters Fake news try again loser
True
Evaluate with: compliance, accuracy, precision.
"The comment directly insults the addressed users, which is rude and disrespectful..."
Evaluate with: LLM-as-judge, embeddings.
§3 · The protocol · Concept
Sample: a small representative subset, hand-coded by experts.
Benchmark: establish performance benchmarks for compliance, accuracy, and precision on that sample.
Population: apply to the full dataset, compare to the benchmarks, and identify where the population behaves unlike the sample.
A structured answer to: how much do I need to hand-code?
§3 · The protocol · Practice
(Angelopoulos et al. 2023, Science)
| Method | Point estimate |
|---|---|
| PPI++ | 2.055 |
| Naive (200 labels only) | 2.160 |
| True population mean | 2.053 |
| Method | 95% CI | Width |
|---|---|---|
| PPI++ | [1.954, 2.156] | 0.202 |
| Naive (200 labels only) | [1.975, 2.345] | 0.369 |
Worked example: SST-5 sentiment, mean human rating on a 0–4 scale. Total n = 2,210, predicted by qwen3.5-122b.
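The basic PPI mean estimator behind this example can be sketched in a few lines. This is the plain PPI form, not the power-tuned PPI++ variant reported in the table; `ppi_mean_ci` is a hypothetical helper, and a normal-approximation 95% interval is assumed.

```python
import numpy as np

def ppi_mean_ci(y_labeled, yhat_labeled, yhat_unlabeled, z=1.96):
    """Prediction-powered 95% CI for a population mean.
    y_labeled: human labels on the small calibration set.
    yhat_labeled / yhat_unlabeled: LLM predictions on the labeled and unlabeled sets."""
    rectifier = y_labeled - yhat_labeled              # measures the model's bias
    theta = yhat_unlabeled.mean() + rectifier.mean()  # debiased point estimate
    se = np.sqrt(yhat_unlabeled.var(ddof=1) / len(yhat_unlabeled)
                 + rectifier.var(ddof=1) / len(y_labeled))
    return theta, (theta - z * se, theta + z * se)
```

The interval narrows as the predictions improve; PPI++ additionally tunes a weight so that poor predictions cannot make the interval wider than the naive labeled-only CI.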
§4 · Criteria overview
Accuracy · Precision
Did the LLM read the data correctly, and how much of what we observe is real signal versus run-to-run noise?
Compliance · Quality
Is the measurement instrument actually working — producing parseable output and reasoning coherently about the inputs it sees?
§5 · Compliance
Schema match. Are the fields, types, and labels what was requested?
HQ2 case · Compliance (stance detection)
| Model | Established prompt | Revised prompt |
|---|---|---|
| Command A | 99.98% | 100.00% |
| GPT OSS | 100.00% | 100.00% |
| Phi 4 | 100.00% | 100.00% |
| Qwen 3 | 100.00% | 100.00% |
Structured outputs delivered ~100% compliance across every cell.
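A compliance check for the structured stance output (the JSON schema in the appendix prompts) can be sketched as schema validation; `check_compliance` is a hypothetical helper.

```python
import json

ALLOWED = {"Supportive", "Neutral", "Unsupportive", "Not about Amazon"}

def check_compliance(raw):
    """True iff the output parses as JSON with a valid Stance label
    and a string Explanation, matching the requested schema."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return obj.get("Stance") in ALLOWED and isinstance(obj.get("Explanation"), str)
```

The compliance rate in the table is then just the mean of this check over all outputs in a cell.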
§5 · Accuracy
Closeness of measurements to a reference, whether objective ground truth or expert human judgment. Captures systematic deviation.
Corresponds to validity in classical terms.
HQ2 case · Accuracy (stance detection)
| Model | Established prompt | Revised prompt |
|---|---|---|
| Command A | 55.19% | 72.78% |
| GPT OSS | 70.61% | 70.04% |
| Phi 4 | 45.59% | 47.58% |
| Qwen 3 | 68.58% | 70.02% |
Are these differences real, or noise? The next slide answers that.
§5 · Accuracy · Signal vs. noise
HQ2 case · Accuracy with McNemar (stance detection)
| Model | Established | Revised | McNemar χ² |
|---|---|---|---|
| Command A | 55.19% | 72.78% | 272.0*** |
| GPT OSS | 70.61% | 70.04% | 153.0 |
| Phi 4 | 45.59% | 47.58% | 379.0*** |
| Qwen 3 | 68.58% | 70.02% | 280.0** |
GPT OSS difference is noise (p > 0.05); the framework directs you to the next criterion. * p<.05 · ** p<.01 · *** p<.001
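McNemar's test uses only the discordant pairs: items one prompt classified correctly and the other did not. A dependency-free sketch of the χ² form follows (real analyses often use the exact binomial variant, e.g. via `statsmodels`); `mcnemar_chi2` is a hypothetical helper.

```python
from math import erf, sqrt

def mcnemar_chi2(b, c):
    """McNemar chi-square from discordant counts:
    b = correct under prompt A only, c = correct under prompt B only."""
    chi2 = (b - c) ** 2 / (b + c)
    # chi-square(1 df) survival function: P(Z**2 > x) = 2 * (1 - Phi(sqrt(x)))
    p = 2 * (1 - 0.5 * (1 + erf(sqrt(chi2) / sqrt(2))))
    return chi2, p
```

When the discordant counts are balanced (b ≈ c), the statistic collapses toward zero: the prompts disagree, but neither is systematically better.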
§5 · Precision
Output consistency under repeated application to the same input. 5+ iterations.
Not the classification "precision" (PPV). Corresponds to reliability in classical terms.
HQ2 case · Precision (stance detection, 5 iterations)
| Model | α · Established | α · Revised | Agreement · Established | Agreement · Revised |
|---|---|---|---|---|
| Command A | 0.71 | 0.95 | 86.97% | 98.01% |
| GPT OSS | 0.89 | 0.90 | 96.06% | 96.44% |
| Phi 4 | 0.72 | 0.81 | 88.03% | 91.89% |
| Qwen 3 | 0.74 | 0.83 | 89.77% | 93.52% |
Precision broke the tie for Command A: same accuracy region, dramatically tighter agreement.
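The agreement-rate column can be sketched as mean pairwise agreement across iterations (Krippendorff's α additionally corrects for chance agreement; libraries such as `krippendorff` compute it); `pairwise_agreement` is a hypothetical helper.

```python
from itertools import combinations

def pairwise_agreement(runs):
    """Mean pairwise agreement across repeated runs.
    runs: equal-length label lists, one per iteration over the same items."""
    pairs = list(combinations(runs, 2))
    agree = sum(
        sum(a == b for a, b in zip(r1, r2)) / len(r1) for r1, r2 in pairs
    )
    return agree / len(pairs)
```

With 5 iterations this averages over all 10 iteration pairs, which is what makes precision visible at all: a single run cannot reveal it.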
§5 · When the output is text
| “I love my wife.” | “My wife is great.” | High cosine |
| “I love my wife.” | “Ice cream is cold.” | Low cosine |
Embeddings capture meaning, not shared words. That makes them useful when the output is free text rather than a label.
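The contrast can be sketched with two tiny scoring functions. Real embeddings would come from a sentence-embedding model; the vectors passed to `cosine` here are placeholders, and both helpers are hypothetical.

```python
import string
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def lexical_overlap(a, b):
    """Jaccard overlap of word sets: the 'shared words' baseline."""
    strip = str.maketrans("", "", string.punctuation)
    wa = set(a.lower().translate(strip).split())
    wb = set(b.lower().translate(strip).split())
    return len(wa & wb) / len(wa | wb)
```

"I love my wife." and "My wife is great." share only two content tokens, so lexical overlap scores them low, while their embeddings would sit close together.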
§5 · Embedding metrics · Try it
Click any two sentences. Toggle the matrix between cosine similarity and lexical overlap to see how the two methods see different structure in the same data.
All pairs. Click any cell to compare.
§5 · Embedding metrics · Sentiment
Sentiment analysis explanations · faceted by model · KDE of average pairwise cosine similarity, established vs. revised prompt overlaid.
§5 · Embedding metrics · Try it
Pairwise cosine similarity across iterations (revised prompt). Match = iterations agreed on the answer; mismatch = they disagreed.
§5 · LLM-as-judge · Worked example
9 quality questions on a 5-point Likert scale, organized into 4 buckets. Same judge (GPT OSS 120B), same data (400 sentiment outputs × 4 models). Only the prompt changes — three lines added asking the judge to be "extremely harsh, but fair."
Mean rating across 4 models · Not Harsh → Harsh
Buckets: Linguistic (Fluency, Coherence) · Logic (Entailment, Plausibility) · Information Fidelity (Factuality, Faithfulness, Relevance) · Informativeness (Information, Restatement)

| | Fluency | Coherence | Entailment | Plausibility | Factuality | Faithfulness | Relevance | Information | Restatement |
|---|---|---|---|---|---|---|---|---|---|
| Not Harsh | 4.58 | 4.87 | 4.94 | 4.07 | 4.75 | 4.45 | 5.00 | 3.83 | 4.36 |
| Harsh | 4.36 | 4.21 | 4.87 | 3.81 | 4.60 | 4.35 | 5.00 | 3.73 | 4.27 |
| Δ | −0.22 | −0.65 | −0.07 | −0.26 | −0.16 | −0.10 | 0.00 | −0.10 | −0.09 |
§5 · LLM-as-judge · Distributions
Mean judge score per output, n = 400 per model. Harsh and not-harsh prompts overlaid as KDE curves.
Same outputs, same judge model — only the prompt differs. The harsh prompt shifts the entire distribution down and widens it.
§5 · LLM-as-judge · Distributions
Harsh prompt only. Mean judge score split by whether the underlying sentiment classification was correctly classified.
Three of four models show separation between accurate and inaccurate outputs. Lenient scoring compresses this signal away.
§5 · LLM-as-judge · Worked example
Meaning of tests
| Test | Interpretation |
|---|---|
| t-test (Welch) | Difference in means |
| Mann–Whitney U | Difference in medians / ranks |
| Kolmogorov–Smirnov | Difference in distributions (CDF) |
| Wasserstein distance | Difference in distributions (distance metric) |
Significance testing · harsh vs. not-harsh prompts
| Model | Kolmogorov–Smirnov | Mann–Whitney U | Wasserstein | t-test (Welch) |
|---|---|---|---|---|
| Command A | 0.343 | 41,756 | 0.147 | −12.633 |
| Gemma 3 | 0.470 | 35,362 | 0.173 | −15.178 |
| GPT OSS | 0.525 | 24,569 | 0.248 | −20.128 |
| Phi 4 | 0.375 | 40,518 | 0.161 | −12.779 |
Bold = p < 0.1, statistically significant. Every test rejects the null in every model.
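In practice these tests come from `scipy.stats` (`ks_2samp`, `mannwhitneyu`, `wasserstein_distance`, and `ttest_ind` with `equal_var=False` for Welch). A dependency-free sketch of the two distributional statistics; both helpers are hypothetical.

```python
import numpy as np

def ks_stat(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    xs, ys = np.sort(x), np.sort(y)
    grid = np.concatenate([xs, ys])
    cdf_x = np.searchsorted(xs, grid, side="right") / len(xs)
    cdf_y = np.searchsorted(ys, grid, side="right") / len(ys)
    return float(np.max(np.abs(cdf_x - cdf_y)))

def wasserstein_1d(x, y):
    """1-D Wasserstein distance for equal-size samples: mean gap between sorted values."""
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))
```

KS is sensitive to any difference in shape, while Wasserstein reports how far probability mass must move, which is why the table pairs them.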
§5 · Recap
Not vibes.
In a way a reviewer can audit.
Pivot to secondary criteria.
The framework does not choose for you. It lets you defend the choice.
§6 · Takeaway
1. Treat LLMs as measurement instruments. Report the instrument's specs. Interpret outputs as carrying uncertainty, not ground truth.
2. Evaluate compliance, then accuracy, then precision, in that order. Use statistical tests to separate signal from noise. Run repeated iterations to make precision visible.
3. You do not have to choose between scale and rigor. Hand-code a calibration set. Run the LLM on the rest. Emerging tools like Prediction-Powered Inference can combine them with valid confidence intervals, even when the LLM is imperfect.
The infrastructure for rigorous, scalable LLM-assisted social science is being built right now.
§6 · Thank you
Questions, pushback, collaborations — all welcome.
Appendix · A1
I want you to perform a data annotation task. Your task is to carefully read the text and identify the stance of the tweet regarding Amazon HQ2, HQ3, or HQ4 that is conveyed. Your response must belong to one of the four classes, depending on whether the tweet is Supportive, Neutral, Unsupportive, or Not about Amazon. In your output, only respond with the name of the class: Supportive, Neutral, Unsupportive, or Not about Amazon, depending on the stance of the tweet toward Amazon HQ2, HQ3, or HQ4 that is conveyed in the tweet. In your output, I also want you to provide an explanation for the output. Provide your response in the first line and provide the explanation for your response in the second line. Text: <{tweet}>
Appendix · A2
Task Overview: You are going to conduct a Stance analysis on a given tweet about Amazon HQ2, HQ3, or HQ4 and provide a brief explanation for your decision.
Definition: Stance analysis: you will identify the attitudes, feelings, judgments, or commitment that dictate the position of the Tweet author, specifically whether a tweet is Supportive, Neutral, Unsupportive, or Not about Amazon HQ2, HQ3, or HQ4.
Task: Carefully read the text below, identify the stance of the tweet conveyed regarding whether it is Supportive, Neutral, Unsupportive, or Not about Amazon HQ2, HQ3, or HQ4. Provide a brief explanation for your decision. Response classes: 1) Supportive, 2) Neutral, 3) Unsupportive, 4) Not about Amazon.
Tweet: {tweet}
Instructions: 1) Read all sections before responding. 2) Return valid JSON: {Stance: <...>, Explanation: <...>}. 3) Determine if the tweet is about Amazon HQ. 3a) If not, respond "Not about Amazon". 3b) If yes, determine Supportive, Neutral, or Unsupportive. 4) Verify your decision against the tweet. 5) Provide the explanation. 6) Output the final JSON.
Appendix · A3
I want you to perform a data annotation task. Your task is to carefully read the text and identify the category that best matches the Twitter Bio Description. Your response must belong to one of the four categories: Business Stakeholder, Politician or Government Account, Media, or General Public. In your output, only respond with the name of the polarity: Business Stakeholder, Politician or Government Account, Media, or General Public, depending on the information in the Twitter Bio provided. In your output, I also want you to provide an explanation for the output. Provide your response in the first line and provide the explanation for your response in the second line. Twitter Bio: <Twitter Bio>
Appendix · A4
Task Overview: Classify the author of a Twitter bio into one of 4 categories: General Public, Business Stakeholder, Media, Politicians and Government Accounts. Provide a brief explanation.
Definitions: Business Stakeholder — private-sector actors that gain from the deal (corporate reps, developers, lobbyists). Politician or Government Account — elected officials and civil servants with formal authority. Media — news outlets and reporters who frame the HQ2 competition. General Public — remaining individual users (residents, voters, organizers).
Task: Read the Twitter bio, identify the author's category, and provide a brief explanation. Categories: 1) Business Stakeholder, 2) Politician or Government Account, 3) Media, 4) General Public.
Twitter Bio: {Twitter Bio}
Instructions: 1) Read all sections before responding. 2) Return valid JSON: {Category: <...>, Explanation: <...>}. 3) Determine the best category for the bio. 4) Verify your choice against the definitions. 5) Justify why this category over the others. 6) Output the final JSON.