Beyond the Hype: Using Generative AI in Public Administration Research

TaMPERing Generative AI

A framework for the systematic and responsible use of Large Language Models

Dr. Michael Overton

Associate Professor of Public Administration
Associate Director, Institute for Interdisciplinary Data Science
University of Idaho

Why this matters

Researchers are adopting generative AI faster than they are validating it

AI tool use among researchers

57% in 2024 → 84% in 2025

(Wiley 2025)

Scholars have limited guidance on how to use it rigorously.

A fair objection

You might be saying to yourself,
“That number seems ridiculously high.
I can always spot AI, and I don’t see it THAT often!”

Why a framework

From gut feel to defensible

Without the framework

Pick a model that felt good, prompt until the output looked right, report almost none of it.

With TaMPER

Choose the task, agents, model, and prompt deliberately, evaluate purposefully, and report all transparently.

The framework

TaMPER

Task

agents

Model

Prompt

Evaluation

Reporting

Six decision points for using generative AI with rigor.

T · Decision one

Task

What are you asking the LLM to do?

Task

What is a “task”?

A task is the set of actions that transform input data into the desired output. Defining one means making three decisions.

Input and output

Consider your corpus of inputs (e.g., documents, tweets, public comments).
Decide whether the output form is fixed in advance or discovered by the model.
Predefined vs. undefined outputs.

Complexity

Decide whether to run the task in one pass or decompose it.
Intricate, multilayered requests turn brittle.
Simpler sequential steps are more reliable and reproducible.

The model's function

Decide whether the model simulates a participant or acts as an analytical tool.
Simulating a human = synthetic data generation.
Analytical tool = text analysis.

LLM as a human	Predefined example	Undefined example	Type of task
Participant	Survey respondent	Open-ended interviewee	Synthetic Data Generation
Coder	Annotate text	Category creation	Text Analysis
Human extractor	Identify and extract text	Summarize document	Text Analysis

Task · In social science

What scholars are doing with LLMs

Use cases: classification at scale, extraction from records, qualitative coding, and synthetic data.

a · the silent letter

Agents

There is a silent a in TaMPER. Today it stands for Agents.

Agents

What is an agent?

Agent = Model + Harness

Harness. The software wrapped around the model that supplies its context, runs the tools it asks for, remembers across turns, and loops until the task is done.

Anthropic calls the base unit the “augmented LLM,” a model plus retrieval, tools, and memory. (Anthropic 2024)

Agents

Agents are LLMs, just wired up

Every node is an LLM.

An agent is one or more LLM calls, linked together and given tools.

Agents

Agentic workflow or agentic AI

Agentic workflow

A fixed pattern of LLM calls. You design the path in code, and every run follows the same steps.

Agentic AI

The model decides the path at runtime. It picks the next step, calls tools, loops, and stops when it judges the task done.
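
The contrast between the two can be sketched in a few lines. This is a minimal illustration, not a real harness: `fake_llm`, the `search` tool, and the prompt prefixes (`EXTRACT`, `CLASSIFY`, `DECIDE`) are all hypothetical stand-ins that return canned text so the example runs offline.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a real model call. It returns canned text so the
# sketch runs offline; a real harness would call an LLM API here.
def fake_llm(prompt: str) -> str:
    if prompt.startswith("EXTRACT"):
        return "Northwind delayed layoff notices"
    if prompt.startswith("CLASSIFY"):
        return "yes"
    if prompt.startswith("DECIDE"):
        # Pretend the model asks for one tool call, then declares it is done.
        return "done" if "observation" in prompt else "search"
    return ""

def agentic_workflow(article: str, llm: Callable[[str], str]) -> str:
    """Fixed path designed in code: extract, then classify, on every run."""
    claim = llm(f"EXTRACT the key claim: {article}")
    return llm(f"CLASSIFY misconduct (yes/no): {claim}")

def agentic_ai(
    goal: str,
    llm: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> List[str]:
    """Model-chosen path: the model picks each next step until it says done."""
    trace: List[str] = []
    context = goal
    for _ in range(max_steps):  # hard cap so the loop always terminates
        action = llm(f"DECIDE next step for: {context}")
        trace.append(action)
        if action == "done":
            break
        result = tools[action](goal)  # the harness runs the requested tool
        context = f"{context} | observation: {result}"
    return trace

if __name__ == "__main__":
    print(agentic_workflow("Northwind delayed telling employees...", fake_llm))
    tools = {"search": lambda query: "2 related articles"}
    print(agentic_ai("Check the Northwind story", fake_llm, tools))
```

In the workflow, the sequence of calls is visible in the code before anything runs. In the agentic version, the sequence exists only in the returned trace, which is why the trace is worth logging.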

Agents

Complexity and Uncertainty

Complexity

How complicated the task is. Many steps and deviations from routine, but with clear rules and a checkable answer.

Uncertainty

How analyzable the task is. The right answer requires judgment, context, or values, and cannot be cheaply verified.

[Figure: accuracy over five iterations vs. number of variables, in three panels: qwen2.5 72b, qwen3 32b thinking, gpt-oss 120b thinking]

M · Decision two

Model

Which LLM should you actually use?

Model · Three decisions

P · Decision three

Prompt

How do you tell the model what you want?

Prompt

Tell it what you want

Prompts are how you interact with LLMs.
Outputs are highly sensitive to the prompt.
Is it art or is it science? A bit of both.
Think in components or sections.

TASK OVERVIEW: Determine whether the news article suggests that {FIRM} engaged in organizational misconduct.

NEWS ARTICLE: {NEWS_ARTICLE}

DEFINITIONS:
Organizational Misconduct: "an illegal, unethical, or socially irresponsible behavior performed by an organization that directly harms its stakeholders through fraud, product safety issues, employee mistreatment, and environmental violations."

QUESTION: Did {FIRM} engage in organizational misconduct in the news article? Briefly explain your decision. Respond using only one of the following options:
a) yes
b) no

INSTRUCTIONS: Before responding, carefully review the TASK OVERVIEW, NEWS ARTICLE, DEFINITIONS, QUESTION, INSTRUCTIONS, and OUTPUT FORMAT. Respond in valid JSON using the key-value pair {"misconduct": <"yes"|"no">, "rationale": "<one or two sentences>"}. Before finalizing, review the DEFINITIONS so the rationale does not deviate from them, and review the NEWS ARTICLE so your decision and rationale do not deviate from it.

OUTPUT FORMAT:
{"misconduct": <"yes"|"no">, "rationale": "<one or two sentences>"}
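
A template like this one is typically filled per document in code. The sketch below is one possible way to do that, using a condensed copy of the template; the function name `build_prompt` is an assumption for illustration. It uses `str.replace` instead of `str.format` because the literal JSON braces in the OUTPUT FORMAT section would be misread as format fields.

```python
# Condensed copy of the slide's template, kept short for illustration.
# {FIRM} and {NEWS_ARTICLE} are the two placeholders to fill.
PROMPT_TEMPLATE = """TASK OVERVIEW: Determine whether the news article suggests that {FIRM} engaged in organizational misconduct.

NEWS ARTICLE: {NEWS_ARTICLE}

QUESTION: Did {FIRM} engage in organizational misconduct in the news article?
a) yes
b) no

OUTPUT FORMAT:
{"misconduct": <"yes"|"no">, "rationale": "<one or two sentences>"}
"""

def build_prompt(firm: str, article: str) -> str:
    """Fill the placeholders without touching the literal JSON braces."""
    # The firm is substituted first, so placeholder-like text inside the
    # article itself is never rewritten.
    return (
        PROMPT_TEMPLATE
        .replace("{FIRM}", firm)
        .replace("{NEWS_ARTICLE}", article)
    )

if __name__ == "__main__":
    print(build_prompt("Northwind", "Northwind delayed telling employees about layoffs."))
```

Keeping the template as a single constant also makes it easy to report the full prompt text verbatim later.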

Prompt · Components

Components, and order matters

Prompt · Developing a prompt

Developing a prompt is a loop

Start with a broad, generic question.
Review the output for what you dislike.
- Misusing terms? Add definitions.
- Strangely formatted? Specify the output, and limit the options.
- Reasoning off track? Guide its thinking with step-by-step instructions.
Change the prompt and repeat.

Prompt · Before and after

Each refinement changes the output

Prompt

Did the firm do anything wrong in this news article?

News article

Northwind delayed telling employees about layoffs so it would not owe the retention bonuses promised to those who stayed.

Output

“Northwind appears to have engaged in a practice that could be considered ethically questionable…”

About 300 words, and it never commits to a yes or no.

Model: qwen2.5:72b via MindRouter.

Prompt · Structured output

Structured output, a contract for the answer

Structured output forces the model to return a fixed, machine-readable shape, a JSON schema you define, instead of free prose. The model fills the slots you specify and cannot wander outside them.

You require a schema

{
  "organizational_misconduct": "yes | no",
  "explanation": "string"
}

The model must return

{
  "organizational_misconduct": "yes",
  "explanation": "The report details concealed safety violations."
}

Why it matters. Compliance jumps to about 100 percent, every answer parses the same way, and you can separate what the schema imposed from what the model actually decided.
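
Even with a schema enforced on the model side, it is cheap to check each response on the way in. The following is a minimal sketch using only the standard library; the function name `parse_response` is hypothetical, and the check mirrors the two-field schema shown above.

```python
import json
from typing import Optional

ALLOWED_LABELS = {"yes", "no"}
REQUIRED_KEYS = {"organizational_misconduct", "explanation"}

def parse_response(raw: str) -> Optional[dict]:
    """Return the parsed answer if it matches the schema, otherwise None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # free prose or truncated JSON
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        return None  # wrong shape, missing slot, or extra slot
    label = data["organizational_misconduct"]
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        return None  # label outside the allowed options
    if not isinstance(data["explanation"], str):
        return None
    return data

if __name__ == "__main__":
    good = '{"organizational_misconduct": "yes", "explanation": "Concealed safety violations."}'
    print(parse_response(good))
    print(parse_response("The firm appears to have acted questionably."))
```

Returning `None` instead of raising keeps non-compliant responses countable, which is what a compliance metric needs.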

E · Decision four

Evaluation

How do you know the output is any good?

Evaluation · Measurement frame

From classical measurement to metrology

Classical measurement assumes a true score. Metrology asks how wide the plausible range is when no true score exists.

Classical frame

observed = true score + error

Is this measurement right or wrong relative to a known true value?

(Lord & Novick 1968, classical test theory)

Metrological frame

measurement result = best estimate ± uncertainty

How much confidence can we place in this measurement?

(JCGM 100:2008, GUM §6.2.1)

Treat the LLM as a measurement instrument.

Vocabulary shifts: validity → accuracy · reliability → precision · error → uncertainty
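
For a numeric output, such as a confidence score returned on repeated runs of the same input, the metrological frame can be sketched as follows. This is an illustration under stated assumptions, not the deck's procedure: it takes the mean as the best estimate, the standard error of the mean as the standard uncertainty, and a coverage factor `k = 2` for the expanded uncertainty. The function name and the sample readings are hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple

def measurement_result(readings: Sequence[float], k: float = 2.0) -> Tuple[float, float]:
    """Return (best estimate, expanded uncertainty) for repeated readings.

    Assumptions for this sketch: the best estimate is the mean, the standard
    uncertainty is the sample standard deviation divided by sqrt(n), and the
    expanded uncertainty multiplies that by a coverage factor k.
    """
    if len(readings) < 2:
        raise ValueError("need at least two readings to estimate uncertainty")
    best = statistics.mean(readings)
    standard_uncertainty = statistics.stdev(readings) / math.sqrt(len(readings))
    return best, k * standard_uncertainty

if __name__ == "__main__":
    # Hypothetical confidence scores from five runs on the same tweet.
    runs = [0.82, 0.79, 0.85, 0.80, 0.84]
    best, expanded = measurement_result(runs)
    print(f"{best:.3f} ± {expanded:.3f}")
```

The point of the format is that no "true" score is required: the width of the interval is itself the finding.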

Evaluation · What and how

What gets evaluated, and how

What gets evaluated

Protocol: the framework, the decisions a researcher makes, and the workflow for assessing them.

Prompts · Models · Dual Response Format

How it gets evaluated

Criteria and Metrics: the specific things we measure, and the numbers we use to measure them.

Compliance · Accuracy · Precision · Quality

Evaluation · The protocol

The TaMPER framework

Five components for documenting and evaluating any LLM-assisted analysis (Overton et al. 2025).

T

Task

What you want the LLM to do.

M

Model

Which LLM, configured how.

P

Prompt

How you instruct it.

E

Evaluation

How you validate the outputs.

R

Reporting

How you document everything.

Evaluation · The protocol

Three decisions, two actions

Decision points · what the researcher chooses

T

Task

What is being measured?

M

Model

Which instrument?

P

Prompt

How is the instrument configured?

↓

Actions on those decisions

E

Evaluation

Validate the chosen task, model, and prompt against compliance, accuracy, and precision.

R

Reporting

Document each decision and its validation in a way a reviewer can audit.

Evaluation · Response formats

Dual Response Format

Prompt

Is this tweet toxic? Return a label and a brief explanation.

@hammyp703 @Reuters Fake news try again loser

Direct response

Example: True

Evaluate with: compliance, accuracy, precision.

Explanation

Example: "The comment directly insults the addressed users, which is rude and disrespectful..."

Evaluate with: compliance, quality.

Evaluation · Criteria

Signal, Noise, and criteria

Signal in the data

Accuracy · Precision

Did the LLM read the data correctly, and how much of what we observe is real signal versus run-to-run noise?

Health of the instrument

Compliance · Quality

Is the measurement instrument actually working, that is, producing parseable output and reasoning coherently about the inputs it sees?

At a minimum, scholars should evaluate LLMs on compliance, accuracy, and precision.

Evaluation · Worked example

Now an example

A single case carried through every criterion: stance detection on the Amazon HQ relocation debate.

Task

Stance detection on 944 tweets regarding the Amazon HQ relocation.

Models

Command A
GPT OSS
Phi 4
Qwen 3

Prompt

Two prompts compared:

Established prompt
Revised prompt

Structured output schema

{
  "stance": "Supportive | Neutral | Unsupportive | Not about Amazon",
  "confidence": 0.0,
  "explanation": "string"
}

Evaluation · Compliance in action

Compliance, does the output match the required structure?

What it measures

Schema match. Are the fields, types, and labels the ones that were requested?

Metrics

Percent correct data type
Percent missing fields
Schema validation rate

HQ2 case · Compliance (stance detection)

Model	Established prompt	Revised prompt
Command A	99.98%	100.00%
GPT OSS	100.00%	100.00%
Phi 4	100.00%	100.00%
Qwen 3	100.00%	100.00%

Structured outputs delivered about 100 percent compliance in every cell, so the floor is cleared and we move on.
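
A schema validation rate of this kind can be computed with a few lines of standard-library code. The sketch below checks raw outputs against the stance schema from the worked example; the function names and the four sample outputs are hypothetical, and the check only requires `confidence` to be a number because the slides do not state its range.

```python
import json
from typing import Iterable

STANCES = {"Supportive", "Neutral", "Unsupportive", "Not about Amazon"}

def is_compliant(raw: str) -> bool:
    """True if the raw output parses and matches the stance schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    stance = data.get("stance")
    confidence = data.get("confidence")
    return (
        isinstance(stance, str)
        and stance in STANCES
        # bool is a subclass of int in Python, so exclude it explicitly
        and isinstance(confidence, (int, float))
        and not isinstance(confidence, bool)
        and isinstance(data.get("explanation"), str)
    )

def schema_validation_rate(raw_outputs: Iterable[str]) -> float:
    """Percent of outputs that pass the schema check."""
    outputs = list(raw_outputs)
    return 100.0 * sum(is_compliant(r) for r in outputs) / len(outputs)

if __name__ == "__main__":
    outputs = [
        '{"stance": "Supportive", "confidence": 0.9, "explanation": "ok"}',
        '{"stance": "Neutral", "confidence": 0.5, "explanation": "ok"}',
        '{"stance": "Positive", "confidence": 0.5, "explanation": "bad label"}',
        "not json",
    ]
    print(schema_validation_rate(outputs))
```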

Evaluation · Accuracy in action

Accuracy, how close to a reference value?

What it measures

Closeness to a reference, whether objective ground truth or expert human judgment. This is validity in classical terms.

Metrics

Exact match and F1
Classification precision and recall
Correlation with human coding
Difference tests, McNemar, Kolmogorov-Smirnov, Mann-Whitney U, Wasserstein

HQ2 case · Accuracy with McNemar (stance detection)

Model	Established	Revised	McNemar χ²
Command A	55.19%	72.78%	272.0***
GPT OSS	70.61%	70.04%	153.0
Phi 4	45.59%	47.58%	379.0***
Qwen 3	68.58%	70.02%	280.0**

The GPT OSS gap is noise, so the framework sends you to the next criterion. * p<.05 · ** p<.01 · *** p<.001
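
McNemar's test compares two prompts on the same items by looking only at the items where they disagree on correctness. The sketch below uses the standard chi-square form without continuity correction, χ² = (b − c)² / (b + c). It is not meant to reproduce the statistics in the table, whose exact computation the slides do not specify, and the ten-item data set is made up for illustration.

```python
from typing import Sequence

CHI2_CRITICAL_05 = 3.841  # chi-square critical value, 1 degree of freedom, p = .05

def mcnemar_chi2(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """McNemar chi-square (no continuity correction) for paired correctness."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both prompts must be scored on the same items")
    # b: items only prompt A got right; c: items only prompt B got right.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    if b + c == 0:
        return 0.0  # the prompts never disagree, so there is nothing to test
    return (b - c) ** 2 / (b + c)

if __name__ == "__main__":
    # Hypothetical correctness flags for ten tweets under each prompt.
    established = [True, True, False, False, False, False, False, False, True, True]
    revised = [False, False, True, True, True, True, True, True, True, True]
    chi2 = mcnemar_chi2(established, revised)
    print(chi2, chi2 > CHI2_CRITICAL_05)
```

Because the test ignores items both prompts get right or both get wrong, two prompts with nearly identical accuracy can still differ significantly, and vice versa.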

Evaluation · Precision in action

Precision, how consistent across repeated runs?

What it measures

Output consistency when the same input is run five or more times. This is reliability, not the classification precision above.

Metrics

Agreement rate (categorical)
Standard deviation (numeric)
ICC (ordinal)
Krippendorff's α (mixed types)

HQ2 case · Precision (stance detection, 5 iterations)

Model	Krippendorff's α (Established)	Krippendorff's α (Revised)	Agreement rate (Established)	Agreement rate (Revised)
Command A	0.71	0.95	86.97%	98.01%
GPT OSS	0.89	0.90	96.06%	96.44%
Phi 4	0.72	0.81	88.03%	91.89%
Qwen 3	0.74	0.83	89.77%	93.52%

Precision broke the tie for Command A. Same accuracy region, far tighter agreement, alpha 0.71 up to 0.95.
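
Both precision metrics in the table can be sketched for categorical labels. Two assumptions are made here because the slides do not define them: the agreement rate is taken as the share of items on which all runs return the same label, and α is computed in its nominal form from the coincidence matrix. The function names and sample runs are hypothetical.

```python
from collections import Counter
from itertools import permutations
from typing import List, Sequence

def agreement_rate(runs_per_item: Sequence[Sequence[str]]) -> float:
    """Percent of items where every run returned the same label (assumed definition)."""
    unanimous = sum(1 for labels in runs_per_item if len(set(labels)) == 1)
    return 100.0 * unanimous / len(runs_per_item)

def krippendorff_alpha_nominal(runs_per_item: Sequence[Sequence[str]]) -> float:
    """Krippendorff's alpha for nominal labels, treating each run as a coder."""
    coincidences: Counter = Counter()
    for labels in runs_per_item:
        m = len(labels)
        if m < 2:
            continue  # an item with a single label carries no agreement information
        for a, b in permutations(labels, 2):  # ordered pairs of distinct runs
            coincidences[(a, b)] += 1.0 / (m - 1)
    totals: Counter = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        return 1.0  # convention for this sketch: only one label ever appears
    return 1.0 - observed / expected

if __name__ == "__main__":
    runs: List[List[str]] = [
        ["Supportive"] * 5,
        ["Neutral"] * 5,
        ["Supportive", "Supportive", "Neutral", "Supportive", "Supportive"],
        ["Unsupportive"] * 5,
    ]
    print(agreement_rate(runs))
    print(round(krippendorff_alpha_nominal(runs), 3))
```

The two numbers answer different questions: the agreement rate is easy to read, while α corrects for the agreement expected by chance given how often each label occurs.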

Evaluation · Criteria

Quality sits above the floor

Compliance, accuracy, and precision are the floor. Quality is the second-order criterion, and it only means something once the floor is cleared.

Linguistic

Fluency and coherence.

Logic

Entailment and plausibility.

Information fidelity

Factuality, faithfulness, relevance.

Informativeness

Information depth and restatement.

These are the buckets the judges score later. They presuppose a compliant, accurate, precise output.

Evaluation · Worked example

Harsh and lenient judges measure different things

9 quality questions on a 5-point scale, in 4 buckets. Same judge (GPT OSS 120B), same data (400 sentiment outputs by 4 models). Only the prompt changes, three lines added asking the judge to be extremely harsh but fair.

Mean rating across 4 models, Not Harsh to Harsh

Bucket	Question	Not Harsh	Harsh	Δ
Linguistic	Fluency	4.58	4.36	−0.22
Linguistic	Coherence	4.87	4.21	−0.65
Logic	Entailment	4.94	4.87	−0.07
Logic	Plausibility	4.07	3.81	−0.26
Information Fidelity	Factuality	4.75	4.60	−0.16
Information Fidelity	Faithfulness	4.45	4.35	−0.10
Information Fidelity	Relevance	5.00	5.00	0.00
Informativeness	Information	3.83	3.73	−0.10
Informativeness	Restatement	4.36	4.27	−0.09

R · Decision five

Reporting

Make the whole thing reproducible.

Reporting · Reproducibility

Report it back along the same five steps

Task

Input data, preprocessing such as OCR, tokenized descriptive statistics, desired output.

Model

Model selected, models tested, hyperparameters, date accessed.

Prompt

Full prompt template, and the number of calls per item.

Evaluation

Protocol, criteria, benchmarks, and any bias or privacy notes.
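
One way to keep these items together is a small machine-readable record saved next to the results. The sketch below is an assumption about how such a record could look, not a format the framework prescribes: the class name `TamperReport` and the field names are hypothetical, and every `TODO` marks a value the researcher must supply.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class TamperReport:
    """Hypothetical container mirroring the reporting steps on the slide."""
    task: dict
    model: dict
    prompt: dict
    evaluation: dict

report = TamperReport(
    task={
        "input_data": "944 tweets regarding the Amazon HQ relocation",
        "preprocessing": "TODO",
        "descriptive_statistics": "TODO",
        "desired_output": "stance label, confidence, explanation",
    },
    model={
        "selected": "TODO",
        "tested": ["Command A", "GPT OSS", "Phi 4", "Qwen 3"],
        "hyperparameters": "TODO",
        "date_accessed": "TODO",
    },
    prompt={
        "template": "TODO: full prompt text",
        "calls_per_item": 5,
    },
    evaluation={
        "criteria": ["compliance", "accuracy", "precision"],
        "benchmarks": "TODO",
        "bias_or_privacy_notes": "TODO",
    },
)

# Serialize so the record can be archived with the study materials.
report_json = json.dumps(asdict(report), indent=2)

if __name__ == "__main__":
    print(report_json)
```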

Reporting · The checklist

What to report, point by point

Each checklist line is paired with a reported example from the Amazon HQ2 stance-detection study.

Thank you

Thanks for coming

Dr. Michael Overton

Email · moverton@uidaho.edu

Website · www.michaeloverton.net

GitHub · github.com/mro0001
