Beyond the Hype: Using Generative AI in Public Administration Research

TaMPERing Generative AI

A framework for the systematic and responsible use of Large Language Models

Dr. Michael Overton

Associate Professor of Public Administration
Associate Director, Institute for Interdisciplinary Data Science
University of Idaho

Why this matters

Researchers are adopting generative AI faster than they are validating it

AI tool use among researchers

57% in 2024 · 84% in 2025

(Wiley 2025)

Scholars have limited guidance on how to use it rigorously.

A fair objection

You might be saying to yourself,
“That number seems ridiculously high.
I can always spot AI, and I don’t see it THAT often!”

Survivorship bias diagram, red dots marking damage on planes that returned

Why a framework

From gut feel to defensible

Without the framework

Pick a model that felt good, prompt until the output looked right, report almost none of it.

With TaMPER

Choose the task, agents, model, and prompt deliberately, evaluate purposefully, and report all transparently.

The framework

TaMPER

Task

agents

Model

Prompt

Evaluation

Reporting

Six decision points for using generative AI with rigor.

T · Decision one

Task


What are you asking the LLM to do?

Task

What is a “task”?

A task is the set of actions that transform input data into the desired output. Defining one means making three decisions.

Input and output

  • Consider your corpus of inputs (e.g., documents, tweets, public comments).
  • Decide whether the output form is fixed in advance or discovered by the model.
  • Predefined vs Undefined outputs.

Complexity

  • Decide whether to run the task in one pass or decompose it.
  • Intricate, multilayered requests turn brittle.
  • Simpler sequential steps are more reliable and reproducible.

The model's function

  • Decide whether the model simulates a participant or acts as an analytical tool.
  • Simulating human = Synthetic data.
  • Analytical tool = Text Analysis.

LLM as a human    Predefined example          Undefined example        Type of task
Participant       Survey respondent           Open-ended interviewee   Synthetic Data Generation
Coder             Annotate text               Category creation        Text Analysis
Human extractor   Identify and extract text   Summarize document       Text Analysis

Task · In social science

What scholars are doing with LLMs

Diagram of a task: Input → LLM (prompt · model · iteration) → Output

a · the silent letter

Agents


There is a silent a in TaMPER. Today it stands for Agents.

Agents

What is an agent?

Agent = Model + Harness

Harness. The software wrapped around the model that supplies its context, runs the tools it asks for, remembers across turns, and loops until the task is done.

Anthropic calls the base unit the “augmented LLM,” a model plus retrieval, tools, and memory. (Anthropic 2024)

Agents

Agents are LLMs, just wired up

Every node is an LLM. An agent is one or more LLM calls, linked together and given tools.

Tools: web search · database · code execution

    Agents

    Agentic workflow or agentic AI

    Agentic workflow

    A fixed pattern of LLM calls. You design the path in code, and every run follows the same steps.

    Input → LLM A → LLM B → LLM C → Output (🔒 one fixed path)

    Agentic AI

    The model decides the path at runtime. It picks the next step, calls tools, loops, and stops when it judges the task done.

    Input → LLM (decides the next step, calls tools and steps, loops until done) → Output (result when done)
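
The contrast between the two patterns can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not any vendor's agent API: `agentic_workflow`, `agentic_ai`, `toy_model`, and the `FINAL:` stop convention are hypothetical names invented here, and the model is stubbed out as a plain function from text to text.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins: a "model" and a "tool" are both text-to-text functions.
LLM = Callable[[str], str]
Tool = Callable[[str], str]


def agentic_workflow(text: str, steps: List[LLM]) -> str:
    """Fixed path: every run passes through the same steps in the same order."""
    for step in steps:
        text = step(text)
    return text


def agentic_ai(task: str, model: LLM, tools: Dict[str, Tool], max_turns: int = 5) -> str:
    """Model-chosen path: the harness loops and runs whichever tool the model asks for."""
    memory = [f"TASK: {task}"]  # the harness remembers across turns
    for _turn in range(max_turns):
        reply = model("\n".join(memory))  # the harness supplies the context
        if reply.startswith("FINAL:"):  # the model judges the task done
            return reply[len("FINAL:"):].strip()
        tool_name, _sep, argument = reply.partition(" ")
        if tool_name in tools:
            result = tools[tool_name](argument)  # the harness runs the tool
        else:
            result = f"unknown tool: {tool_name}"
        memory.append(f"{reply} -> {result}")
    return "stopped: turn limit reached"


def toy_model(context: str) -> str:
    """Toy stand-in for an LLM: request one search, then answer."""
    if "->" not in context:
        return "search HQ2"
    return "FINAL: answered after one search"
```

Calling `agentic_workflow(text, [step_a, step_b, step_c])` always follows the same route, while `agentic_ai(task, toy_model, {"search": some_search_function})` lets the model pick each step. The turn limit is a design choice added here so that a model that never declares itself done cannot loop forever.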

    Agents

    Complexity and Uncertainty

    Complexity

    How complicated the task is. Many steps and deviations from routine, but with clear rules and a checkable answer.

    Uncertainty

    How analyzable the task is. The right answer requires judgment, context, or values, and cannot be cheaply verified.

    Figure: Capability over time (2019 to 2025), with curves for complexity and for judgment and uncertainty. Annotation: the gap widens.
    Figures: Accuracy over five iterations vs number of variables, for qwen2.5 72b, qwen3 32b thinking, and gpt-oss 120b thinking.

    M · Decision two

    Model


    Which LLM should you actually use?

    Model · Three decisions

    Three decisions

      P · Decision three

      Prompt


      How do you tell the model what you want?

      Prompt

      Tell it what you want

      • Prompts are how you interact with LLMs.
      • Outputs are highly sensitive to the prompt.
      • Is it art or is it science? A bit of both.
      • Think in components or sections.
      TASK OVERVIEW: Determine whether the news article suggests that {FIRM} engaged in organizational misconduct.

      NEWS ARTICLE: {NEWS_ARTICLE}

      DEFINITIONS:
      Organizational Misconduct: "an illegal, unethical, or socially irresponsible behavior performed by an organization that directly harms its stakeholders through fraud, product safety issues, employee mistreatment, and environmental violations."

      QUESTION: Did {FIRM} engage in organizational misconduct in the news article? Briefly explain your decision. Respond using only one of the following options:
      a) yes
      b) no

      INSTRUCTIONS: Before responding, carefully review the TASK OVERVIEW, NEWS ARTICLE, DEFINITIONS, QUESTION, INSTRUCTIONS, and OUTPUT FORMAT. Respond in valid JSON using the key-value pair {"misconduct": <"yes"|"no">, "rationale": "<one or two sentences>"}. Before finalizing, review the DEFINITIONS so the rationale does not deviate from them, and review the NEWS ARTICLE so your decision and rationale do not deviate from it.

      OUTPUT FORMAT:
      {"misconduct": <"yes"|"no">, "rationale": "<one or two sentences>"}
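
Thinking in components also makes the prompt easy to assemble in code. The sketch below is one possible way to do it, not the author's tooling: `COMPONENTS` and `build_prompt` are hypothetical names, and the layout is simplified so that each component sits on one line after its label. It uses `str.replace` rather than `str.format` because the template contains literal JSON braces.

```python
# Components in the order used on the slide; the order is part of the design.
COMPONENTS = [
    ("TASK OVERVIEW",
     "Determine whether the news article suggests that {FIRM} engaged in "
     "organizational misconduct."),
    ("NEWS ARTICLE", "{NEWS_ARTICLE}"),
    ("DEFINITIONS",
     'Organizational Misconduct: "an illegal, unethical, or socially irresponsible '
     "behavior performed by an organization that directly harms its stakeholders "
     "through fraud, product safety issues, employee mistreatment, and "
     'environmental violations."'),
    ("QUESTION",
     "Did {FIRM} engage in organizational misconduct in the news article? "
     "Briefly explain your decision. Respond using only one of the following "
     "options:\na) yes\nb) no"),
    ("INSTRUCTIONS",
     "Before responding, carefully review the TASK OVERVIEW, NEWS ARTICLE, "
     "DEFINITIONS, QUESTION, INSTRUCTIONS, and OUTPUT FORMAT. Respond in valid "
     'JSON using the key-value pair {"misconduct": <"yes"|"no">, "rationale": '
     '"<one or two sentences>"}. Before finalizing, review the DEFINITIONS so the '
     "rationale does not deviate from them, and review the NEWS ARTICLE so your "
     "decision and rationale do not deviate from it."),
    ("OUTPUT FORMAT",
     '{"misconduct": <"yes"|"no">, "rationale": "<one or two sentences>"}'),
]


def build_prompt(firm: str, news_article: str) -> str:
    """Join the components in order, then fill the two placeholders."""
    template = "\n\n".join(f"{name}: {text}" for name, text in COMPONENTS)
    # Plain replacement leaves the JSON braces in the template untouched.
    return template.replace("{FIRM}", firm).replace("{NEWS_ARTICLE}", news_article)
```

Keeping the components in a list means that reordering or dropping one is a one-line change, which is convenient when comparing prompt variants.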

      Prompt · Components

      Components, and order matters

      Prompt · Developing a prompt

      Developing a prompt is a loop

      • Start with a broad, generic question.
      • Review the output for what you dislike.
        • Misusing terms? Add definitions.
        • Strangely formatted? Specify the output, and limit the options.
        • Guide its thinking with step-by-step instructions.
      • Change the prompt and repeat.
      Flowchart: Generic question → Review output → Satisfied? Yes: stop and enjoy your life. No: change prompt, then review again.

      Prompt · Before and after

      Each refinement changes the output

      Prompt

      Did the firm do anything wrong in this news article?

      News article

      Northwind delayed telling employees about layoffs so it would not owe the retention bonuses promised to those who stayed.

      Output

      “Northwind appears to have engaged in a practice that could be considered ethically questionable…”

      About 300 words, and it never commits to a yes or no.

      Model qwen2.5:72b via MindRouter.

      Prompt · Structured output

      Structured output, a contract for the answer

      Structured output forces the model to return a fixed, machine-readable shape, a JSON schema you define, instead of free prose. The model fills the slots you specify and cannot wander outside them.

      You require a schema

      {
        "organizational_misconduct": "yes | no",
        "explanation": "string"
      }

      The model must return

      {
        "organizational_misconduct": "yes",
        "explanation": "The report details concealed safety violations."
      }

      Why it matters. Compliance jumps to about 100 percent, every answer parses the same way, and you can separate what the schema imposed from what the model actually decided.
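
A minimal sketch of checking a reply against this contract after the fact, using only the standard library. `parse_misconduct` is a hypothetical helper written for illustration; providers that enforce a schema during generation do so on their side, and this check only confirms the result.

```python
import json

REQUIRED_KEYS = {"organizational_misconduct", "explanation"}
ALLOWED_ANSWERS = ("yes", "no")


def parse_misconduct(raw: str):
    """Return the parsed reply if it honors the schema, otherwise None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # free prose or malformed JSON
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        return None  # missing or extra fields
    if data["organizational_misconduct"] not in ALLOWED_ANSWERS:
        return None  # label outside the permitted options
    if not isinstance(data["explanation"], str):
        return None  # explanation must be text
    return data
```

Because every accepted reply has the same two fields, the yes/no decision and the free-text explanation can be evaluated separately downstream.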

      E · Decision four

      Evaluation


      How do you know the output is any good?

      Evaluation · Measurement frame

      From classical measurement to metrology

      Classical measurement assumes a true score. Metrology asks how wide the plausible range is when no true score exists.

      Metrological frame

      measurement result = best estimate ± uncertainty

      How much confidence can we place in this measurement?

      (JCGM 100:2008, GUM §6.2.1)

      Treat the LLM as a measurement instrument.

      Vocabulary shifts: validity → accuracy · reliability → precision · error → uncertainty
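
To make "best estimate ± uncertainty" concrete, here is a small sketch that treats repeated runs on one input as repeated readings from an instrument. It assumes a numeric output, takes the mean as the best estimate, uses the standard error of the mean as the standard uncertainty, and multiplies by a coverage factor k = 2. The readings in the usage note and the function name `measurement_result` are hypothetical.

```python
import math
import statistics


def measurement_result(readings, k=2.0):
    """Return (best estimate, expanded uncertainty) for repeated readings."""
    if len(readings) < 2:
        raise ValueError("need at least two readings to estimate uncertainty")
    best_estimate = statistics.mean(readings)
    # Standard uncertainty of the mean: sample standard deviation / sqrt(n).
    standard_uncertainty = statistics.stdev(readings) / math.sqrt(len(readings))
    # Expanded uncertainty: coverage factor k times the standard uncertainty.
    return best_estimate, k * standard_uncertainty
```

For example, `measurement_result([0.70, 0.72, 0.68, 0.71, 0.69])` reports a best estimate of 0.70 with an expanded uncertainty of roughly 0.014, so the result would be written as 0.70 ± 0.014.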

      Evaluation · What and how

      What gets evaluated, and how

      What gets evaluated

      Protocol: the framework, the decisions a researcher makes, and the workflow for assessing them.

      Prompts · Models · Dual Response Format

      How it gets evaluated

      Criteria and Metrics: the specific things we measure, and the numbers we use to measure them.

      Compliance · Accuracy · Precision · Quality

      Evaluation · The protocol

      The TaMPER framework

      Five components for documenting and evaluating any LLM-assisted analysis (Overton et al. 2025).

      T

      Task

      What you want the LLM to do.

      M

      Model

      Which LLM, configured how.

      P

      Prompt

      How you instruct it.

      E

      Evaluation

      How you validate the outputs.

      R

      Reporting

      How you document everything.

      Evaluation · The protocol

      Three decisions, two actions

      Decision points · what the researcher chooses

      T

      Task

      What is being measured?

      M

      Model

      Which instrument?

      P

      Prompt

      How is the instrument configured?

      Actions on those decisions

      E

      Evaluation

      Validate the chosen task, model, and prompt against compliance, accuracy, and precision.

      R

      Reporting

      Document each decision and its validation in a way a reviewer can audit.

      Evaluation · Response formats

      Dual Response Format

      Prompt

      Is this tweet toxic? Return a label and a brief explanation.

      Tweet

      @hammyp703 @Reuters Fake news try again loser

      Direct response

      Example: True

      Evaluate with: compliance, accuracy, precision.

      Explanation

      Example: "The comment directly insults the addressed users, which is rude and disrespectful..."

      Evaluate with: compliance, quality.

      Evaluation · Criteria

      Signal, Noise, and criteria

      Signal in the data

      Accuracy · Precision

      Did the LLM read the data correctly, and how much of what we observe is real signal versus run-to-run noise?

      Health of the instrument

      Compliance · Quality

      Is the measurement instrument actually working, producing parseable output and reasoning coherently about the inputs it sees?

      At a minimum, scholars should evaluate LLMs based on Compliance, Accuracy, and Precision criteria.

      Evaluation · Worked example

      Now an example

      A single case carried through every criterion: stance detection on the Amazon HQ relocation debate.

      Task

      Stance detection on 944 tweets regarding the Amazon HQ relocation.

      Models

      • Command A
      • GPT OSS
      • Phi 4
      • Qwen 3

      Prompt

      Two prompts compared:

      • Established prompt
      • Revised prompt

      Structured output schema

      {
        "stance": "Supportive | Neutral | Unsupportive | Not about Amazon",
        "confidence": 0.0,
        "explanation": "string"
      }

      Evaluation · Compliance in action

      Compliance, does the output match the required structure?

      What it measures

      Schema match. Are the fields, types, and labels the ones that were requested?

      Metrics

      • Percent correct data type
      • Percent missing fields
      • Schema validation rate

      HQ2 case · Compliance (stance detection)

      Model       Established prompt   Revised prompt
      Command A   99.98%               100.00%
      GPT OSS     100.00%              100.00%
      Phi 4       100.00%              100.00%
      Qwen 3      100.00%              100.00%

      Structured outputs delivered about 100 percent compliance in every cell, so the floor is cleared and we move on.
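
The three compliance metrics can be computed directly from the raw model replies. The sketch below is an assumption-laden illustration rather than the study's code: `compliance_report` is a hypothetical helper, an unparseable reply is counted as missing its fields, and "correct data type" is taken to mean that all three fields are present with the expected type.

```python
import json

STANCE_LABELS = {"Supportive", "Neutral", "Unsupportive", "Not about Amazon"}
REQUIRED_TYPES = {"stance": str, "confidence": (int, float), "explanation": str}


def _type_ok(value, expected) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, expected) and not isinstance(value, bool)


def compliance_report(raw_outputs):
    """Return the three compliance metrics as percentages of all outputs."""
    if not raw_outputs:
        raise ValueError("no outputs to evaluate")
    missing = typed = valid = 0
    for raw in raw_outputs:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            data = None
        if not isinstance(data, dict) or any(f not in data for f in REQUIRED_TYPES):
            missing += 1  # unparseable output counts as missing its fields
            continue
        if all(_type_ok(data[f], t) for f, t in REQUIRED_TYPES.items()):
            typed += 1
            if data["stance"] in STANCE_LABELS:
                valid += 1  # fields, types, and label all match the schema
    n = len(raw_outputs)
    return {
        "percent_missing_fields": 100 * missing / n,
        "percent_correct_data_type": 100 * typed / n,
        "schema_validation_rate": 100 * valid / n,
    }
```

Reporting the three numbers separately shows why an output failed, which a single pass/fail rate would hide.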

      Evaluation · Accuracy in action

      Accuracy, how close to a reference value?

      What it measures

      Closeness to a reference, whether objective ground truth or expert human judgment. This is validity in classical terms.

      Metrics

      • Exact match and F1
      • Classification precision and recall
      • Correlation with human coding
      • Difference tests, McNemar, Kolmogorov-Smirnov, Mann-Whitney U, Wasserstein

      HQ2 case · Accuracy with McNemar (stance detection)

      Model       Established   Revised   McNemar χ²
      Command A   55.19%        72.78%    272.0***
      GPT OSS     70.61%        70.04%    153.0
      Phi 4       45.59%        47.58%    379.0***
      Qwen 3      68.58%        70.02%    280.0**

      * p<.05 · ** p<.01 · *** p<.001

      The GPT OSS gap is noise, so the framework sends you to the next criterion.
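
McNemar's test compares two prompts on the same items by looking only at the items where exactly one prompt was correct. The sketch below implements the uncorrected chi-square form with one degree of freedom; whether the study used this form or a continuity-corrected or exact variant is not stated in the slides, so treat that choice as an assumption. The function name `mcnemar` and the toy data in the usage note are hypothetical.

```python
import math


def mcnemar(first_correct, second_correct):
    """Return (chi-square, p-value) for paired right/wrong outcomes."""
    if len(first_correct) != len(second_correct):
        raise ValueError("both prompts must be scored on the same items")
    pairs = list(zip(first_correct, second_correct))
    b = sum(1 for f, s in pairs if f and not s)  # only the first prompt is right
    c = sum(1 for f, s in pairs if s and not f)  # only the second prompt is right
    if b + c == 0:
        return 0.0, 1.0  # the prompts never disagree, so there is nothing to test
    chi2 = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

With 2 items where only the first prompt is right and 10 where only the second is, the statistic is 64/12 ≈ 5.33 and the p-value is about 0.02. Items on which both prompts agree do not enter the statistic at all, which is why two prompts with nearly equal accuracy can still differ significantly, or not.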

      Evaluation · Precision in action

      Precision, how consistent across repeated runs?

      What it measures

      Output consistency when the same input is run five or more times. This is reliability, not the classification precision above.

      Metrics

      • Agreement rate (categorical)
      • Standard deviation (numeric)
      • ICC (ordinal)
      • Krippendorff's α (mixed types)

      HQ2 case · Precision (stance detection, 5 iterations)

                  Krippendorff's α          Agreement rate
      Model       Established   Revised     Established   Revised
      Command A   0.71          0.95        86.97%        98.01%
      GPT OSS     0.89          0.90        96.06%        96.44%
      Phi 4       0.72          0.81        88.03%        91.89%
      Qwen 3      0.74          0.83        89.77%        93.52%

      Precision broke the tie for Command A: same accuracy region, far tighter agreement, with α rising from 0.71 to 0.95.
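
Both precision metrics for categorical labels can be computed from the repeated runs alone, with no reference labels. The sketch below makes two assumptions that the slides do not confirm: agreement rate is taken to be the share of items on which every run returned the same label, and Krippendorff's α is computed for nominal data with no missing runs. Function names are hypothetical.

```python
import math
from collections import Counter


def agreement_rate(runs_per_item):
    """Share of items on which every repeated run returned the same label."""
    unanimous = sum(1 for runs in runs_per_item if len(set(runs)) == 1)
    return unanimous / len(runs_per_item)


def krippendorff_alpha_nominal(runs_per_item):
    """Krippendorff's alpha for nominal labels, assuming no missing values."""
    totals = Counter()  # n_c: how often each label occurs overall
    observed = 0.0      # disagreeing ordered pairs, weighted per item
    for runs in runs_per_item:
        m = len(runs)
        if m < 2:
            continue    # a single run has nothing to be compared with
        counts = Counter(runs)
        totals.update(counts)
        observed += (m * m - sum(v * v for v in counts.values())) / (m - 1)
    n = sum(totals.values())
    expected = n * n - sum(v * v for v in totals.values())
    if expected == 0:
        return math.nan  # only one label ever used: alpha is undefined
    # alpha = 1 - D_o / D_e, with D_o = observed / n and D_e = expected / (n (n - 1))
    return 1.0 - (observed / n) / (expected / (n * (n - 1)))
```

For three items run three times each, with labels `["a", "a", "a"]`, `["a", "a", "b"]`, and `["b", "b", "b"]`, the agreement rate is 2/3 and α is 0.6. The two numbers differ because α discounts the agreement that would be expected by chance given how often each label occurs.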

      Evaluation · Criteria

      Quality sits above the floor

      Compliance, accuracy, and precision are the floor. Quality is the second-order criterion, and it only means something once the floor is cleared.

      Linguistic

      Fluency and coherence.

      Logic

      Entailment and plausibility.

      Information fidelity

      Factuality, faithfulness, relevance.

      Informativeness

      Information depth and restatement.

      These are the buckets the judges score later. They presuppose a compliant, accurate, precise output.

      Evaluation · Worked example

      Harsh and lenient judges measure different things

      9 quality questions on a 5-point scale, in 4 buckets. Same judge (GPT OSS 120B), same data (400 sentiment outputs by 4 models). Only the prompt changes: three added lines ask the judge to be extremely harsh but fair.

      Mean rating across 4 models, Not Harsh to Harsh

      Bucket                 Criterion      Not Harsh   Harsh   Δ
      Linguistic             Fluency        4.58        4.36    −0.22
      Linguistic             Coherence      4.87        4.21    −0.65
      Logic                  Entailment     4.94        4.87    −0.07
      Logic                  Plausibility   4.07        3.81    −0.26
      Information Fidelity   Factuality     4.75        4.60    −0.16
      Information Fidelity   Faithfulness   4.45        4.35    −0.10
      Information Fidelity   Relevance      5.00        5.00    0.00
      Informativeness        Information    3.83        3.73    −0.10
      Informativeness        Restatement    4.36        4.27    −0.09
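
A short sketch of summarizing the shift per bucket from the means reported above. The bucket averages it produces are derived here for illustration and do not appear in the slides. Because the inputs are already rounded to two decimals, a recomputed difference can deviate from the reported Δ by 0.01 (Coherence and Factuality do). `RATINGS` and `bucket_shift` are hypothetical names.

```python
# (Not Harsh mean, Harsh mean) per criterion, copied from the reported ratings.
RATINGS = {
    "Linguistic": {"Fluency": (4.58, 4.36), "Coherence": (4.87, 4.21)},
    "Logic": {"Entailment": (4.94, 4.87), "Plausibility": (4.07, 3.81)},
    "Information Fidelity": {
        "Factuality": (4.75, 4.60),
        "Faithfulness": (4.45, 4.35),
        "Relevance": (5.00, 5.00),
    },
    "Informativeness": {"Information": (3.83, 3.73), "Restatement": (4.36, 4.27)},
}


def bucket_shift(ratings):
    """Average Harsh minus Not Harsh difference within each bucket."""
    return {
        bucket: sum(harsh - lenient for lenient, harsh in criteria.values()) / len(criteria)
        for bucket, criteria in ratings.items()
    }


shifts = bucket_shift(RATINGS)
most_affected = min(shifts, key=shifts.get)  # bucket with the largest drop
```

On these numbers the largest drop falls in the Linguistic bucket, driven mostly by Coherence.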

      R · Decision five

      Reporting


      Make the whole thing reproducible.

      Reporting · Reproducibility

      Report it back along the same five steps

      Task

      Input data, preprocessing such as OCR, tokenized descriptive statistics, desired output.

      Model

      Model selected, models tested, hyperparameters, date accessed.

      Prompt

      Full prompt template, and the number of calls per item.

      Evaluation

      Protocol, criteria, benchmarks, and any bias or privacy notes.
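
One way to keep such a report auditable is to store it as a structured record next to the results. The sketch below is a hypothetical layout, not a published TaMPER schema: the class name, the field names, and every placeholder value are assumptions, and the fields cover only part of the list above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class TamperReport:
    # Task
    input_data: str
    preprocessing: str
    desired_output: str
    # Model
    model_selected: str
    models_tested: List[str]
    hyperparameters: Dict[str, float]
    date_accessed: str
    # Prompt
    prompt_template: str
    calls_per_item: int
    # Evaluation
    criteria: List[str]
    bias_or_privacy_notes: str = ""

    def to_json(self) -> str:
        """Serialize the record so it can be archived with the analysis."""
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)


example = TamperReport(
    input_data="944 tweets regarding the Amazon HQ relocation",
    preprocessing="<describe preprocessing>",    # placeholder
    desired_output="stance label, confidence, explanation",
    model_selected="<selected model>",           # placeholder
    models_tested=["Command A", "GPT OSS", "Phi 4", "Qwen 3"],
    hyperparameters={"temperature": 0.0},        # hypothetical value
    date_accessed="YYYY-MM-DD",                  # placeholder
    prompt_template="<full prompt template>",    # placeholder
    calls_per_item=5,
    criteria=["compliance", "accuracy", "precision", "quality"],
)
```

Calling `example.to_json()` yields a file-ready record, and a reviewer can compare two such records to see exactly which decision changed between analyses.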

      Reporting · The checklist

      What to report, point by point

        Thank you

        Thanks for coming

        Dr. Michael Overton

        Email  ·  moverton@uidaho.edu

        Website  ·  www.michaeloverton.net

        GitHub  ·  github.com/mro0001