Center for Social Science Scholarship · UMBC

Evaluating LLMs for Credible and Rigorous Social Science Research

Dr. Michael Overton

Associate Professor of Public Administration
Associate Director, Institute for Interdisciplinary Data Science
University of Idaho

To download this presentation and follow along, go to michaeloverton.net or mro0001.github.io/evaluation-presentation.

What this talk covers

Six beats

1 · Current state & issues
2 · Theory: classical → metrology
3 · Evaluation Protocol
4 · Criteria overview
5 · Criteria & metrics
6 · Conclusion

Theory, then framework architecture, then criteria, then how to measure each, with examples woven in.

§1 · Where we are

Scientists are already using LLMs at scale, and the evaluation has not caught up

AI tool use

57% in 2024 → 84% in 2025

(Wiley 2025)

The gate is breaking

21% of ICLR 2026 peer reviews were fully AI-generated

AI reviewers accept fabricated papers >80% of the time

(Naddaf 2025) · (Jiang et al. 2025)

§1 · Illustration

An illustrative game of 20 Questions

New session in Claude Code (Opus 4.7, maximum effort). Prompt: "I want to play a game of 20 questions: ready?" The model said "I've picked something. Question 1?" — then I used /branch to fork the session into two parallel runs from the same starting point.

Branch A

Asked one question, then quit and asked the model what its word had been.

"lighthouse"

Branch B

Asked the model to write its word to a file before any questions were asked.

"kangaroo"

The model claimed it had picked a word. It produced two different words. Both plausible games of 20 Questions — only one was verifiable.

§1 · Where we are

What scientists are doing with LLMs

Workflow: Task → Input → LLM (prompt · model · iteration) → Output

The capability is real. The methods to evaluate it are not.

§1 · Where we are

The Current LLM Evaluation Landscape


§2 · First principles

Neither “tool” nor “person” captures what an LLM does

What does scientific, rigorous use of an LLM require? Start from what the thing actually is.

Default 1 · LLM as computer tool

Functional correctness. Pass/fail tests. Throughput. Accuracy on a fixed benchmark. Treat the model like a deterministic function.

Misses: LLMs are non-deterministic by design. “Same input, same output” fails even at temperature zero across deployments.

Default 2 · LLM as person

Psychometric measurement. Inter-rater reliability with human coders. Validated coding schemes. Treat the LLM as another annotator.

Misses: an exponentially large output space, semantic equivalence, two confidence channels that can disagree, and RLHF-induced overconfidence, none of which a human coder has.

Both defaults import assumptions the model violates.

§2 · The metrology pivot

From classical measurement to metrology

Classical measurement assumes a true score. Metrology asks how wide the plausible range is when no true score exists.

Metrological frame

measurement result = best estimate ± uncertainty

How much confidence can we place in this measurement?

(JCGM 100:2008, GUM §6.2.1)

Treat the LLM as a measurement instrument.

Vocabulary shifts: validity → accuracy · reliability → precision · error → uncertainty

From foundation to practice

What gets evaluated, and how

What gets evaluated

Protocol: the framework, the decisions a researcher makes, and the workflow for assessing them.

TaMPER · Three decisions, two actions · Two response formats · Sample-Benchmark-Population

How it gets evaluated

Criteria and Metrics: the specific things we measure, and the numbers we use to measure them.

Compliance · Accuracy · Precision · Quality

§3 · The protocol

The TaMPER framework

Five components for documenting and evaluating any LLM-assisted analysis (Overton et al. 2025).

T

Task

What you want the LLM to do.

M

Model

Which LLM, configured how.

P

Prompt

How you instruct it.

E

Evaluation

How you validate the outputs.

R

Reporting

How you document everything.

§3 · The protocol

Three decisions, two actions

Decision points · what the researcher chooses

T

Task

What is being measured?

M

Model

Which instrument?

P

Prompt

How is the instrument configured?

Actions on those decisions

E

Evaluation

Validate the chosen task, model, and prompt against compliance, accuracy, and precision.

R

Reporting

Document each decision and its validation in a way a reviewer can audit.

§3 · The protocol

Two response formats, two evaluation paths

Prompt

Is this tweet toxic? Return a label and a brief explanation.

Tweet

@hammyp703 @Reuters Fake news try again loser

Direct response

True

Evaluate with: compliance, accuracy, precision.

Explanation

"The comment directly insults the addressed users, which is rude and disrespectful..."

Evaluate with: LLM-as-judge, embeddings.

§3 · The protocol · Concept

Sample → Benchmark → Population

Sample

Small representative subset, hand-coded by experts.

Benchmark

Establish performance benchmarks for compliance, accuracy, precision.

Population

Apply to full dataset. Compare to benchmarks. Identify where the population behaves unlike the sample.

A structured answer to: how much do I need to hand-code?

§3 · The protocol · Practice

Prediction-Powered Inference: combining hand-coded and LLM-coded data

(Angelopoulos et al. 2023, Science)

Setup
Hand-code a calibration set; let the LLM code the full corpus.
Calibration set: 200 hand-coded · Unlabeled corpus: 2,010 LLM-coded

Estimator
ML estimate on everything, plus a correction for ML error estimated on the calibration set.

Method                     Point estimate
PPI++                      2.055
Naive (200 labels only)    2.160
True population mean       2.053

Guarantee
The resulting confidence interval is valid regardless of how good the LLM is.

Method                     95% CI            Width
PPI++                      [1.954, 2.156]    0.202
Naive (200 labels only)    [1.975, 2.345]    0.369

Worked example: SST-5 sentiment, mean human rating on a 0–4 scale. Total n = 2,210, predicted by qwen3.5-122b.
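The core PPI mean estimator is short enough to sketch. A minimal version, assuming NumPy arrays of hand-coded labels, LLM predictions on the calibration set, and LLM predictions on the unlabeled corpus; the PPI++ power tuning reported above, and the authors' ppi_py package, add refinements this sketch omits.

```python
import numpy as np
from scipy import stats

def ppi_mean_ci(y_labeled, yhat_labeled, yhat_unlabeled, alpha=0.05):
    """Prediction-powered estimate of a population mean, with a CI."""
    n, N = len(y_labeled), len(yhat_unlabeled)
    rectifier = y_labeled - yhat_labeled              # LLM error on the calibration set
    theta = yhat_unlabeled.mean() + rectifier.mean()  # corrected point estimate
    # The two samples are independent, so their variance contributions add
    se = np.sqrt(yhat_unlabeled.var(ddof=1) / N + rectifier.var(ddof=1) / n)
    z = stats.norm.ppf(1 - alpha / 2)
    return theta, (theta - z * se, theta + z * se)
```

The rectifier term is the whole trick: the calibration set estimates the LLM's systematic error, and subtracting it out keeps the interval valid no matter how biased the predictions are.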

§4 · Criteria overview

The minimum floor: Compliance → Accuracy → Precision

Floor: Compliance → Accuracy → Precision

Quality presupposes the floor:

Linguistic: Fluency · Coherence
Logic: Entailment · Plausibility
Information Fidelity: Factuality · Faithfulness
Informativeness: Relevance · Information depth · Restatement

§4 · Criteria overview

Two questions: signal in the data, health of the instrument

Signal in the data

Accuracy · Precision

Did the LLM read the data correctly, and how much of what we observe is real signal versus run-to-run noise?

Health of the instrument

Compliance · Quality

Is the measurement instrument actually working — producing parseable output and reasoning coherently about the inputs it sees?

§5 · Compliance

Compliance: does the output match the required structure?

What it measures

Schema match. Are the fields, types, and labels what was requested?

Metrics

  • % correct data type
  • % missing fields
  • Schema validation rate

HQ2 case · Compliance (stance detection)

Model       Established prompt   Revised prompt
Command A   99.98%               100.00%
GPT OSS     100.00%              100.00%
Phi 4       100.00%              100.00%
Qwen 3      100.00%              100.00%

Structured outputs delivered ~100% compliance across every cell.
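A compliance check is mechanical enough to automate. A minimal sketch against the JSON schema the revised stance prompt requests (Appendix A2); is_compliant and compliance_rate are hypothetical helper names, not part of any published tooling.

```python
import json

REQUIRED = {"Stance": str, "Explanation": str}
LABELS = {"Supportive", "Neutral", "Unsupportive", "Not about Amazon"}

def is_compliant(raw: str) -> bool:
    """Check one raw LLM response against the requested schema."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False  # not parseable JSON at all
    for field, expected_type in REQUIRED.items():
        if field not in obj or not isinstance(obj[field], expected_type):
            return False  # missing field or wrong data type
    return obj["Stance"] in LABELS  # label must come from the fixed class set

def compliance_rate(responses: list[str]) -> float:
    return sum(map(is_compliant, responses)) / len(responses)
```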

§5 · Accuracy

Accuracy: how close to a reference value?

What it measures

Closeness of measurements to a reference, whether objective ground truth or expert human judgment. Captures systematic deviation.

Corresponds to validity in classical terms.

Per-output metrics

  • Exact match
  • F1 score
  • Classification precision (PPV), recall
  • Correlation with human coding
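These per-output metrics are standard scikit-learn calls once the benchmark sample is hand-coded. A sketch with hypothetical human and LLM labels:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical benchmark sample: hand-coded labels vs. one model's outputs
human = ["Supportive", "Neutral", "Unsupportive", "Supportive", "Neutral"]
llm   = ["Supportive", "Unsupportive", "Unsupportive", "Supportive", "Neutral"]

print(accuracy_score(human, llm))             # exact-match rate
print(f1_score(human, llm, average="macro"))  # macro F1 across classes
print(precision_score(human, llm, average="macro", zero_division=0))  # PPV
print(recall_score(human, llm, average="macro", zero_division=0))
```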

HQ2 case · Accuracy (stance detection)

Model       Established prompt   Revised prompt
Command A   55.19%               72.78%
GPT OSS     70.61%               70.04%
Phi 4       45.59%               47.58%
Qwen 3      68.58%               70.02%

Are these differences real, or noise? The next slide answers that.

§5 · Accuracy · Signal vs. noise

Statistical-difference tests separate signal from noise

Tests

  • McNemar: accuracy differences (paired)
  • Kolmogorov-Smirnov: distribution differences
  • Mann-Whitney U: median rank differences
  • Wasserstein: overall distribution distance
  • Welch's t: mean differences, unequal variance

HQ2 case · Accuracy with McNemar (stance detection)

Model       Established   Revised   McNemar χ²
Command A   55.19%        72.78%    272.0***
GPT OSS     70.61%        70.04%    153.0
Phi 4       45.59%        47.58%    379.0***
Qwen 3      68.58%        70.02%    280.0**

The GPT OSS difference is noise (p > 0.05); the framework directs you to the next criterion.

* p<.05 · ** p<.01 · *** p<.001
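McNemar's test works on paired correctness: the same items coded under both prompts. A minimal sketch using statsmodels, with hypothetical correctness vectors standing in for the real per-item results:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical paired correctness: True = LLM matched the hand-coded label
est = np.array([True, True, False, True, False, False, True, False])
rev = np.array([True, True, True, True, False, True, True, True])

# 2x2 contingency table over the same items under the two prompts
table = [[np.sum(est & rev), np.sum(est & ~rev)],
         [np.sum(~est & rev), np.sum(~est & ~rev)]]

# Chi-square form; only the discordant cells drive the statistic,
# so use exact=True instead when discordant counts are small
result = mcnemar(table, exact=False, correction=True)
print(result.statistic, result.pvalue)
```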

§5 · Precision

Precision: how consistent across repeated runs?

What it measures

Output consistency under repeated application to the same input; run 5+ iterations to estimate it.

Not the classification "precision" (PPV). Corresponds to reliability in classical terms.

Metrics

  • Agreement rate (categorical)
  • Standard deviation (numeric)
  • ICC (ordinal)
  • Krippendorff's α (mixed types)

HQ2 case · Precision (stance detection, 5 iterations)

            Krippendorff's α          Agreement rate
Model       Established   Revised    Established   Revised
Command A   0.71          0.95       86.97%        98.01%
GPT OSS     0.89          0.90       96.06%        96.44%
Phi 4       0.72          0.81       88.03%        91.89%
Qwen 3      0.74          0.83       89.77%        93.52%

Precision broke the tie for Command A: same accuracy region, dramatically tighter agreement.
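Both precision metrics take minutes to compute once the repeated runs exist. A sketch using the krippendorff package, with five hypothetical iterations over six items; stance labels are encoded as integers for illustration:

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Hypothetical data: five iterations (rows) coding the same six items (columns),
# labels encoded as 0 = Supportive, 1 = Neutral, 2 = Unsupportive
runs = np.array([[0, 1, 2, 0, 1, 2],
                 [0, 1, 2, 0, 1, 2],
                 [0, 1, 1, 0, 1, 2],
                 [0, 1, 2, 0, 2, 2],
                 [0, 1, 2, 0, 1, 2]])

alpha = krippendorff.alpha(reliability_data=runs, level_of_measurement="nominal")

# Agreement rate: share of items on which all iterations returned the same label
agreement = np.mean([len(set(col)) == 1 for col in runs.T])
print(alpha, agreement)
```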

§5 · When the output is text

Explanations and Semantic Similarity

“I love my wife.” vs. “My wife is great.” → high cosine similarity
“I love my wife.” vs. “Ice cream is cold.” → low cosine similarity

Embeddings capture meaning, not shared words. That makes them useful when the output is free text rather than a label.
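Computing these similarities takes a few lines with any sentence-embedding model. A sketch using sentence-transformers; the model name here is one common open choice, not a recommendation from the study:

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # one common open embedder
sentences = ["I love my wife.", "My wife is great.", "Ice cream is cold."]
embeddings = model.encode(sentences)

sims = cosine_similarity(embeddings)
print(sims[0, 1])  # high: same meaning, few shared words
print(sims[0, 2])  # low: unrelated meaning
```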

§5 · Embedding metrics · Try it

Semantic similarity, hands on

Cosine similarity and lexical overlap see different structure in the same data: pairs can share meaning with few words in common, or share words with little meaning in common.

§5 · Embedding metrics · Sentiment

Average cosine similarity of explanations, by prompt

Sentiment analysis explanations · faceted by model · KDE of average pairwise cosine similarity, established vs. revised prompt overlaid.

§5 · Embedding metrics · Try it

Match vs. mismatch: raw cosine similarity, revised prompt

Pairwise cosine similarity across iterations (revised prompt). Match = iterations agreed on the answer; mismatch = they disagreed.

§5 · LLM-as-judge · Worked example

Quality criteria example: Harsh and lenient judges measure different things

9 quality questions on a 5-point Likert scale, organized into 4 buckets. Same judge (GPT OSS 120B), same data (400 sentiment outputs × 4 models). Only the prompt changes — three lines added asking the judge to be "extremely harsh, but fair."

Mean rating across 4 models · Not Harsh → Harsh

Linguistic Logic Information Fidelity Informativeness
Fluency Coherence Entailment Plausibility Factuality Faithfulness Relevance Information Restatement
4.584.87 4.944.07 4.754.455.00 3.834.36
4.364.21 4.873.81 4.604.355.00 3.734.27
−0.22−0.65 −0.07−0.26 −0.16−0.100.00 −0.10−0.09

Not Harsh  ·  Harsh  ·  Δ
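A judge run like this reduces to a rubric prompt plus strict parsing. A minimal sketch; call_llm is a hypothetical stand-in for whatever client you use, and the rubric wording is paraphrased, not the study's exact prompt:

```python
import json

CRITERIA = ["Fluency", "Coherence", "Entailment", "Plausibility", "Factuality",
            "Faithfulness", "Relevance", "Information", "Restatement"]

RUBRIC = (
    "Rate the explanation on each criterion from 1 (worst) to 5 (best): "
    + ", ".join(CRITERIA)
    + ". Be extremely harsh, but fair. "
    + 'Return JSON with all nine keys, e.g. {"Fluency": 5, ...}.'
)

def judge(call_llm, source_text, explanation):
    """Score one output on the 9-criterion rubric; fails loudly on non-compliance."""
    raw = call_llm(f"{RUBRIC}\n\nSource: {source_text}\nExplanation: {explanation}")
    scores = json.loads(raw)
    return {c: int(scores[c]) for c in CRITERIA}
```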

§5 · LLM-as-judge · Distributions

Harsh and not-harsh produce different score distributions

Mean judge score per output, n = 400 per model. Harsh and not-harsh prompts overlaid as KDE curves.

Same outputs, same judge model — only the prompt differs. The harsh prompt shifts the entire distribution down and widens it.

§5 · LLM-as-judge · Distributions

Under the harsh prompt, judge scores track accuracy

Harsh prompt only. Mean judge score split by whether the underlying sentiment classification was correctly classified.

Three of four models show separation between accurate and inaccurate outputs. Lenient scoring compresses this signal away.

§5 · LLM-as-judge · Worked example

Confirming the harsh effect with distribution tests

Meaning of tests

Test                   Interpretation
t-test (Welch)         Difference in means
Mann–Whitney U         Difference in medians / ranks
Kolmogorov–Smirnov     Difference in distributions (CDF)
Wasserstein distance   Difference in distributions (distance metric)

Significance testing · harsh vs. not-harsh prompts

Model       Kolmogorov–Smirnov   Mann–Whitney U   Wasserstein   t-test (Welch)
Command A   0.343                41,756           0.147         −12.633
Gemma 3     0.470                35,362           0.173         −15.178
GPT OSS     0.525                24,569           0.248         −20.128
Phi 4       0.375                40,518           0.161         −12.779

All values are statistically significant (p < 0.1): every test rejects the null in every model.
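All four tests are single SciPy calls. A sketch on synthetic score distributions standing in for the harsh and not-harsh judge scores; the numbers are illustrative, not the study's data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Illustrative stand-ins for per-output mean judge scores, n = 400 each
harsh = rng.normal(4.2, 0.5, 400).clip(1, 5)
not_harsh = rng.normal(4.5, 0.4, 400).clip(1, 5)

print(stats.ttest_ind(harsh, not_harsh, equal_var=False))  # Welch's t: means
print(stats.mannwhitneyu(harsh, not_harsh))                # ranks / medians
print(stats.ks_2samp(harsh, not_harsh))                    # max CDF gap
print(stats.wasserstein_distance(harsh, not_harsh))        # earth-mover distance
```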

§5 · Recap

From gut feel to defensible selection

1. Choose with statistical evidence

Not vibes.

2. Report selection rationale

In a way a reviewer can audit.

3. Detect when differences are noise

Pivot to secondary criteria.

The framework does not choose for you. It lets you defend the choice.

§6 · Takeaway

Three things to take into your next paper

1. Treat LLMs as measurement instruments. Report the instrument's specs. Interpret outputs as carrying uncertainty, not ground truth.

2. Evaluate compliance, then accuracy, then precision, in that order. Use statistical tests to separate signal from noise. Run repeated iterations to make precision visible.

3. You do not have to choose between scale and rigor. Hand-code a calibration set. Run the LLM on the rest. Emerging tools like Prediction-Powered Inference can combine them with valid confidence intervals, even when the LLM is imperfect.

The infrastructure for rigorous, scalable LLM-assisted social science is being built right now.

§6 · Thank you

Thank you.

Questions, pushback, collaborations — all welcome.

Appendix · A1

Established prompt · Stance detection

I want you to perform a data annotation task. Your task is to carefully read the text and identify the stance of the tweet regarding Amazon HQ2, HQ3, or HQ4 that is conveyed. Your response must belong to one of the four classes, depending on whether the tweet is Supportive, Neutral, Unsupportive, or Not about Amazon.

In your output, only respond with the name of the class: Supportive, Neutral, Unsupportive, or Not about Amazon, depending on the stance of the tweet toward Amazon HQ2, HQ3, or HQ4 that is conveyed in the tweet. In your output, I also want you to provide an explanation for the output.

Provide your response in the first line and provide the explanation for your response in the second line.

Text: <{tweet}>

Appendix · A2

Revised prompt · Stance detection

Task Overview: You are going to conduct a Stance analysis on a given tweet about Amazon HQ2, HQ3, or HQ4 and provide a brief explanation for your decision.

Definition: Stance analysis: you will identify the attitudes, feelings, judgments, or commitment that dictate the position of the Tweet author, specifically whether a tweet is Supportive, Neutral, Unsupportive, or Not about Amazon HQ2, HQ3, or HQ4.

Task: Carefully read the text below, identify the stance of the tweet conveyed regarding whether it is Supportive, Neutral, Unsupportive, or Not about Amazon HQ2, HQ3, or HQ4. Provide a brief explanation for your decision. Response classes: 1) Supportive, 2) Neutral, 3) Unsupportive, 4) Not about Amazon.

Tweet: {tweet}

Instructions: 1) Read all sections before responding. 2) Return valid JSON: {Stance: <...>, Explanation: <...>}. 3) Determine if the tweet is about Amazon HQ. 3a) If not, respond "Not about Amazon". 3b) If yes, determine Supportive, Neutral, or Unsupportive. 4) Verify your decision against the tweet. 5) Provide the explanation. 6) Output the final JSON.

Appendix · A3

Established prompt · Multi-category classification

I want you to perform a data annotation task. Your task is to carefully read the text and identify the category that best matches the Twitter Bio Description. Your response must belong to one of the four categories: Business Stakeholder, Politician or Government Account, Media, or General Public.

In your output, only respond with the name of the polarity: Business Stakeholder, Politician or Government Account, Media, or General Public, depending on the information in the Twitter Bio provided. In your output, I also want you to provide an explanation for the output.

Provide your response in the first line and provide the explanation for your response in the second line.

Twitter Bio: <Twitter Bio>

Appendix · A4

Revised prompt · Multi-category classification

Task Overview: Classify the author of a Twitter bio into one of 4 categories: General Public, Business Stakeholder, Media, Politicians and Government Accounts. Provide a brief explanation.

Definitions: Business Stakeholder — private-sector actors that gain from the deal (corporate reps, developers, lobbyists). Politician or Government Account — elected officials and civil servants with formal authority. Media — news outlets and reporters who frame the HQ2 competition. General Public — remaining individual users (residents, voters, organizers).

Task: Read the Twitter bio, identify the author's category, and provide a brief explanation. Categories: 1) Business Stakeholder, 2) Politician or Government Account, 3) Media, 4) General Public.

Twitter Bio: {Twitter Bio}

Instructions: 1) Read all sections before responding. 2) Return valid JSON: {Category: <...>, Explanation: <...>}. 3) Determine the best category for the bio. 4) Verify your choice against the definitions. 5) Justify why this category over the others. 6) Output the final JSON.