Center for Social Science Scholarship · UMBC
Dr. Michael Overton
Associate Professor of Public Administration
Associate Director, Institute for Interdisciplinary Data Science
University of Idaho
To download this presentation and follow along, go to michaeloverton.net or mro0001.github.io/evaluation-presentation.
What this talk covers
Theory, then framework architecture, then criteria, then how to measure each, with examples woven in.
§1 · Where we are
3.02× more papers
4.84× more citations
21% of ICLR 2026 peer reviews were fully AI-generated
AI reviewers accept fabricated papers >80% of the time
§1 · Illustration
New session in Claude Code (Opus 4.7, maximum effort). Prompt: "I want to play a game of 20 questions: ready?" The model said "I've picked something. Question 1?" — then I used /branch to fork the session into two parallel runs from the same starting point.
Asked one question, then quit and asked the model what its word had been.
"lighthouse"
Asked the model to write its word to a file before any questions were asked.
"kangaroo"
The model claimed it had picked a word. It produced two different words. Both plausible games of 20 Questions — only one was verifiable.
§1 · Where we are
Input → LLM (prompt · model · iteration) → Output
The capability is real. The methods to evaluate it are not.
§1 · Where we are
Click any card to expand.
§2 · First principles
What does scientific, rigorous use of an LLM require? Start from what the thing actually is.
Functional correctness. Pass/fail tests. Throughput. Accuracy on a fixed benchmark. Treat the model like a deterministic function.
Misses: LLMs are non-deterministic by design. “Same input, same output” fails even at temperature zero across deployments.
Psychometric measurement. Inter-rater reliability with human coders. Validated coding schemes. Treat the LLM as another annotator.
Misses: an exponentially large output space, semantic equivalence, two confidence channels that can disagree, and RLHF-induced overconfidence, none of which a human coder exhibits.
Both defaults import assumptions the model violates.
§2 · The metrology pivot
Classical measurement assumes a true score. Metrology asks how wide the plausible range is when no true score exists.
observed = true score + error
Is this measurement right or wrong relative to a known true value?
measurement result = best estimate ± uncertainty
How much confidence can we place in this measurement?
Treat the LLM as a measurement instrument.
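The metrology framing above can be sketched in a few lines: report repeated-run outputs as a best estimate plus a standard uncertainty. `measurement_result` is a hypothetical helper, assuming the measurand is numeric (e.g. a score averaged over repeated runs).

```python
import statistics

def measurement_result(runs):
    """Metrology framing: best estimate ± uncertainty from repeated runs.
    runs: numeric outputs from repeated applications to the same input."""
    est = statistics.mean(runs)
    # standard error of the mean as the standard uncertainty
    u = statistics.stdev(runs) / len(runs) ** 0.5
    return est, u
```

The point is the reporting convention, not the arithmetic: the instrument's output is an interval, not a true score.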
From foundation to practice
Protocol: the framework, the decisions a researcher makes, and the workflow for assessing them.
TaMPER · Three decisions, two actions · Two response formats · Sample-Benchmark-Population
Criteria and Metrics: the specific things we measure, and the numbers we use to measure them.
Compliance · Accuracy · Precision · Quality
§3 · The protocol
Five components for documenting and evaluating any LLM-assisted analysis (Overton et al. 2025).
Task
What you want the LLM to do.
Model
Which LLM, configured how.
Prompt
How you instruct it.
Evaluation
How you validate the outputs.
Reporting
How you document everything.
§3 · The protocol
Decision points · what the researcher chooses
Task
What is being measured?
Model
Which instrument?
Prompt
How is the instrument configured?
Actions on those decisions
Evaluation
Validate the chosen task, model, and prompt against compliance, accuracy, and precision.
Reporting
Document each decision and its validation in a way a reviewer can audit.
§3 · The protocol
Prompt
Is this tweet toxic? Return a label and a brief explanation.
Tweet
@hammyp703 @Reuters Fake news try again loser
True
Evaluate with: compliance, accuracy, precision.
"The comment directly insults the addressed users, which is rude and disrespectful..."
Evaluate with: LLM-as-judge, embeddings.
§3 · The protocol · Concept
Sample: a small representative subset, hand-coded by experts.
Benchmark: establish performance benchmarks for compliance, accuracy, and precision on that sample.
Population: apply to the full dataset, compare to the benchmarks, and identify where the population behaves unlike the sample.
A structured answer to: how much do I need to hand-code?
§3 · The protocol · Practice
(Angelopoulos et al. 2023, Science)
| Method | Point estimate |
|---|---|
| PPI++ | 2.055 |
| Naive (200 labels only) | 2.160 |
| True population mean | 2.053 |
| Method | 95% CI | Width |
|---|---|---|
| PPI++ | [1.954, 2.156] | 0.202 |
| Naive (200 labels only) | [1.975, 2.345] | 0.369 |
Worked example: SST-5 sentiment, mean human rating on a 0–4 scale. Total n = 2,210, predicted by qwen3.5-122b.
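The basic PPI mean estimator behind this example can be sketched in a few lines. This is the plain PPI form, not the power-tuned PPI++ variant reported in the table; `ppi_mean_ci` is a hypothetical helper, and a normal-approximation 95% interval is assumed.

```python
import numpy as np

def ppi_mean_ci(y_labeled, yhat_labeled, yhat_unlabeled, z=1.96):
    """Prediction-powered 95% CI for a population mean.
    y_labeled: human labels on the small calibration set.
    yhat_labeled / yhat_unlabeled: LLM predictions on the labeled and unlabeled sets."""
    rectifier = y_labeled - yhat_labeled              # measures the model's bias
    theta = yhat_unlabeled.mean() + rectifier.mean()  # debiased point estimate
    se = np.sqrt(yhat_unlabeled.var(ddof=1) / len(yhat_unlabeled)
                 + rectifier.var(ddof=1) / len(y_labeled))
    return theta, (theta - z * se, theta + z * se)
```

The interval narrows as the predictions improve; PPI++ additionally tunes a weight so that poor predictions cannot make the interval wider than the naive labeled-only CI.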
§4 · Criteria overview
Accuracy · Precision
Did the LLM read the data correctly, and how much of what we observe is real signal versus run-to-run noise?
Compliance · Quality
Is the measurement instrument actually working — producing parseable output and reasoning coherently about the inputs it sees?
§5 · Compliance
Schema match. Are the fields, types, and labels what was requested?
HQ2 case · Compliance (stance detection)
| Model | Established prompt | Revised prompt |
|---|---|---|
| Command A | 99.98% | 100.00% |
| GPT OSS | 100.00% | 100.00% |
| Phi 4 | 100.00% | 100.00% |
| Qwen 3 | 100.00% | 100.00% |
Structured outputs delivered ~100% compliance across every cell.
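A compliance check for the structured stance output (the JSON schema in the appendix prompts) can be sketched as schema validation; `check_compliance` is a hypothetical helper.

```python
import json

ALLOWED = {"Supportive", "Neutral", "Unsupportive", "Not about Amazon"}

def check_compliance(raw):
    """True iff the output parses as JSON with a valid Stance label
    and a string Explanation, matching the requested schema."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return obj.get("Stance") in ALLOWED and isinstance(obj.get("Explanation"), str)
```

The compliance rate in the table is then just the mean of this check over all outputs in a cell.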
§5 · Accuracy
Closeness of measurements to a reference, whether objective ground truth or expert human judgment. Captures systematic deviation.
Corresponds to validity in classical terms.
HQ2 case · Accuracy (stance detection)
| Model | Established prompt | Revised prompt |
|---|---|---|
| Command A | 55.19% | 72.78% |
| GPT OSS | 70.61% | 70.04% |
| Phi 4 | 45.59% | 47.58% |
| Qwen 3 | 68.58% | 70.02% |
Are these differences real, or noise? The next slide answers that.
§5 · Accuracy · Signal vs. noise
HQ2 case · Accuracy with McNemar (stance detection)
| Model | Established | Revised | McNemar χ² |
|---|---|---|---|
| Command A | 55.19% | 72.78% | 272.0*** |
| GPT OSS | 70.61% | 70.04% | 153.0 |
| Phi 4 | 45.59% | 47.58% | 379.0*** |
| Qwen 3 | 68.58% | 70.02% | 280.0** |
GPT OSS difference is noise (p > 0.05); the framework directs you to the next criterion. * p<.05 · ** p<.01 · *** p<.001
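McNemar's test uses only the discordant pairs: items one prompt classified correctly and the other did not. A dependency-free sketch of the χ² form follows (real analyses often use the exact binomial variant, e.g. via `statsmodels`); `mcnemar_chi2` is a hypothetical helper.

```python
from math import erf, sqrt

def mcnemar_chi2(b, c):
    """McNemar chi-square from discordant counts:
    b = correct under prompt A only, c = correct under prompt B only."""
    chi2 = (b - c) ** 2 / (b + c)
    # chi-square(1 df) survival function: P(Z**2 > x) = 2 * (1 - Phi(sqrt(x)))
    p = 2 * (1 - 0.5 * (1 + erf(sqrt(chi2) / sqrt(2))))
    return chi2, p
```

When the discordant counts are balanced (b ≈ c), the statistic collapses toward zero: the prompts disagree, but neither is systematically better.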
§5 · Precision
Output consistency under repeated application to the same input. 5+ iterations.
Not the classification "precision" (PPV). Corresponds to reliability in classical terms.
HQ2 case · Precision (stance detection, 5 iterations)
| Model | α · Established | α · Revised | Agreement · Established | Agreement · Revised |
|---|---|---|---|---|
| Command A | 0.71 | 0.95 | 86.97% | 98.01% |
| GPT OSS | 0.89 | 0.90 | 96.06% | 96.44% |
| Phi 4 | 0.72 | 0.81 | 88.03% | 91.89% |
| Qwen 3 | 0.74 | 0.83 | 89.77% | 93.52% |
Precision broke the tie for Command A: same accuracy region, dramatically tighter agreement.
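The agreement-rate column can be sketched as mean pairwise agreement across iterations (Krippendorff's α additionally corrects for chance agreement; libraries such as `krippendorff` compute it); `pairwise_agreement` is a hypothetical helper.

```python
from itertools import combinations

def pairwise_agreement(runs):
    """Mean pairwise agreement across repeated runs.
    runs: equal-length label lists, one per iteration over the same items."""
    pairs = list(combinations(runs, 2))
    agree = sum(
        sum(a == b for a, b in zip(r1, r2)) / len(r1) for r1, r2 in pairs
    )
    return agree / len(pairs)
```

With 5 iterations this averages over all 10 iteration pairs, which is what makes precision visible at all: a single run cannot reveal it.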
§5 · When the output is text
| “I love my wife.” | “My wife is great.” | High cosine |
| “I love my wife.” | “Ice cream is cold.” | Low cosine |
Embeddings capture meaning, not shared words. That makes them useful when the output is free text rather than a label.
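The contrast can be sketched with two tiny scoring functions. Real embeddings would come from a sentence-embedding model; the vectors passed to `cosine` here are placeholders, and both helpers are hypothetical.

```python
import string
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def lexical_overlap(a, b):
    """Jaccard overlap of word sets: the 'shared words' baseline."""
    strip = str.maketrans("", "", string.punctuation)
    wa = set(a.lower().translate(strip).split())
    wb = set(b.lower().translate(strip).split())
    return len(wa & wb) / len(wa | wb)
```

"I love my wife." and "My wife is great." share only two content tokens, so lexical overlap scores them low, while their embeddings would sit close together.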
§5 · Embedding metrics · Try it
Click any two sentences. Toggle the matrix between cosine similarity and lexical overlap to see how the two methods see different structure in the same data.
All pairs. Click any cell to compare.
§5 · Embedding metrics · Sentiment
Sentiment analysis explanations · faceted by model · KDE of average pairwise cosine similarity, established vs. revised prompt overlaid.
§5 · Embedding metrics · Try it
Pairwise cosine similarity across iterations (revised prompt). Match = iterations agreed on the answer; mismatch = they disagreed.
§5 · LLM-as-judge · Worked example
9 quality questions on a 5-point Likert scale, organized into 4 buckets. Same judge (GPT OSS 120B), same data (400 sentiment outputs × 4 models). Only the prompt changes — three lines added asking the judge to be "extremely harsh, but fair."
Mean rating across 4 models · Not Harsh → Harsh
Buckets: Linguistic (Fluency, Coherence) · Logic (Entailment, Plausibility) · Information Fidelity (Factuality, Faithfulness, Relevance) · Informativeness (Information, Restatement)

| | Fluency | Coherence | Entailment | Plausibility | Factuality | Faithfulness | Relevance | Information | Restatement |
|---|---|---|---|---|---|---|---|---|---|
| Not Harsh | 4.58 | 4.87 | 4.94 | 4.07 | 4.75 | 4.45 | 5.00 | 3.83 | 4.36 |
| Harsh | 4.36 | 4.21 | 4.87 | 3.81 | 4.60 | 4.35 | 5.00 | 3.73 | 4.27 |
| Δ | −0.22 | −0.65 | −0.07 | −0.26 | −0.16 | −0.10 | 0.00 | −0.10 | −0.09 |
§5 · LLM-as-judge · Distributions
Mean judge score per output, n = 400 per model. Harsh and not-harsh prompts overlaid as KDE curves.
Same outputs, same judge model — only the prompt differs. The harsh prompt shifts the entire distribution down and widens it.
§5 · LLM-as-judge · Distributions
Harsh prompt only. Mean judge score split by whether the underlying sentiment classification was correctly classified.
Three of four models show separation between accurate and inaccurate outputs. Lenient scoring compresses this signal away.
§5 · LLM-as-judge · Worked example
Meaning of tests
| Test | Interpretation |
|---|---|
| t-test (Welch) | Difference in means |
| Mann–Whitney U | Difference in medians / ranks |
| Kolmogorov–Smirnov | Difference in distributions (CDF) |
| Wasserstein distance | Difference in distributions (distance metric) |
Significance testing · harsh vs. not-harsh prompts
| Model | Kolmogorov–Smirnov | Mann–Whitney U | Wasserstein | t-test (Welch) |
|---|---|---|---|---|
| Command A | 0.343 | 41,756 | 0.147 | −12.633 |
| Gemma 3 | 0.470 | 35,362 | 0.173 | −15.178 |
| GPT OSS | 0.525 | 24,569 | 0.248 | −20.128 |
| Phi 4 | 0.375 | 40,518 | 0.161 | −12.779 |
Bold = p < 0.1, statistically significant. Every test rejects the null in every model.
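In practice these tests come from `scipy.stats` (`ks_2samp`, `mannwhitneyu`, `wasserstein_distance`, and `ttest_ind` with `equal_var=False` for Welch). A dependency-free sketch of the two distributional statistics; both helpers are hypothetical.

```python
import numpy as np

def ks_stat(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    xs, ys = np.sort(x), np.sort(y)
    grid = np.concatenate([xs, ys])
    cdf_x = np.searchsorted(xs, grid, side="right") / len(xs)
    cdf_y = np.searchsorted(ys, grid, side="right") / len(ys)
    return float(np.max(np.abs(cdf_x - cdf_y)))

def wasserstein_1d(x, y):
    """1-D Wasserstein distance for equal-size samples: mean gap between sorted values."""
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))
```

KS is sensitive to any difference in shape, while Wasserstein reports how far probability mass must move, which is why the table pairs them.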
§5 · Recap
Not vibes.
In a way a reviewer can audit.
Pivot to secondary criteria.
The framework does not choose for you. It lets you defend the choice.
§6 · Takeaway
1. Treat LLMs as measurement instruments. Report the instrument's specs. Interpret outputs as carrying uncertainty, not ground truth.
2. Evaluate compliance, then accuracy, then precision, in that order. Use statistical tests to separate signal from noise. Run repeated iterations to make precision visible.
3. You do not have to choose between scale and rigor. Hand-code a calibration set. Run the LLM on the rest. Emerging tools like Prediction-Powered Inference can combine them with valid confidence intervals, even when the LLM is imperfect.
The infrastructure for rigorous, scalable LLM-assisted social science is being built right now.
§6 · Thank you
Questions, pushback, collaborations — all welcome.
Appendix · A1
I want you to perform a data annotation task. Your task is to carefully read the text and identify the stance of the tweet regarding Amazon HQ2, HQ3, or HQ4 that is conveyed. Your response must belong to one of the four classes, depending on whether the tweet is Supportive, Neutral, Unsupportive, or Not about Amazon. In your output, only respond with the name of the class: Supportive, Neutral, Unsupportive, or Not about Amazon, depending on the stance of the tweet toward Amazon HQ2, HQ3, or HQ4 that is conveyed in the tweet. In your output, I also want you to provide an explanation for the output. Provide your response in the first line and provide the explanation for your response in the second line. Text: <{tweet}>
Appendix · A2
Task Overview: You are going to conduct a Stance analysis on a given tweet about Amazon HQ2, HQ3, or HQ4 and provide a brief explanation for your decision.
Definition: Stance analysis: you will identify the attitudes, feelings, judgments, or commitment that dictate the position of the Tweet author, specifically whether a tweet is Supportive, Neutral, Unsupportive, or Not about Amazon HQ2, HQ3, or HQ4.
Task: Carefully read the text below, identify the stance of the tweet conveyed regarding whether it is Supportive, Neutral, Unsupportive, or Not about Amazon HQ2, HQ3, or HQ4. Provide a brief explanation for your decision. Response classes: 1) Supportive, 2) Neutral, 3) Unsupportive, 4) Not about Amazon.
Tweet: {tweet}
Instructions: 1) Read all sections before responding. 2) Return valid JSON: {Stance: <...>, Explanation: <...>}. 3) Determine if the tweet is about Amazon HQ. 3a) If not, respond "Not about Amazon". 3b) If yes, determine Supportive, Neutral, or Unsupportive. 4) Verify your decision against the tweet. 5) Provide the explanation. 6) Output the final JSON.
Appendix · A3
I want you to perform a data annotation task. Your task is to carefully read the text and identify the category that best matches the Twitter Bio Description. Your response must belong to one of the four categories: Business Stakeholder, Politician or Government Account, Media, or General Public. In your output, only respond with the name of the polarity: Business Stakeholder, Politician or Government Account, Media, or General Public, depending on the information in the Twitter Bio provided. In your output, I also want you to provide an explanation for the output. Provide your response in the first line and provide the explanation for your response in the second line. Twitter Bio: <Twitter Bio>
Appendix · A4
Task Overview: Classify the author of a Twitter bio into one of 4 categories: General Public, Business Stakeholder, Media, Politicians and Government Accounts. Provide a brief explanation.
Definitions: Business Stakeholder — private-sector actors that gain from the deal (corporate reps, developers, lobbyists). Politician or Government Account — elected officials and civil servants with formal authority. Media — news outlets and reporters who frame the HQ2 competition. General Public — remaining individual users (residents, voters, organizers).
Task: Read the Twitter bio, identify the author's category, and provide a brief explanation. Categories: 1) Business Stakeholder, 2) Politician or Government Account, 3) Media, 4) General Public.
Twitter Bio: {Twitter Bio}
Instructions: 1) Read all sections before responding. 2) Return valid JSON: {Category: <...>, Explanation: <...>}. 3) Determine the best category for the bio. 4) Verify your choice against the definitions. 5) Justify why this category over the others. 6) Output the final JSON.