Workshop · Use Case 3

Multi-Category Classification

A TaMPER example focusing on evaluation.

Compliance · Accuracy · Precision


Overview

Goals

  1. Set up and run multi-category classification of Twitter bios.
  2. Use Compliance, Accuracy, and Precision to evaluate the output.
  3. Use those metrics to compare across
    • 3 different models
    • 2 different prompts
    • 2 different temperatures
    • Structured outputs vs no structured outputs

Task

Who is the author?

  • Input: 39 synthetic Twitter bios about a city's bid to land a high-tech innovation center.
  • Goal: label each author as one of four user types.
  • Reference: every bio carries a human gold standard label (usertype_label) so we can score the model.

General Public · Media · Other Stakeholders · Politicians / Government

The bio, not the tweet, is the unit classified.

Setup

Three open models, two temperatures

Models
  • Qwen 3.6 (35B) · Alibaba
  • GPT-OSS (120B) · OpenAI
  • Gemma 4 (31B) · Google
Hyperparameters
  • temperature 0: near-deterministic
  • temperature 1: creative / variable
Structured output
JSON schema · Category + Explanation
  {
    "type": "object",
    "properties": {
      "Category": {
        "type": "string",
        "enum": [
          "Business Stakeholder",
          "Politician or Government Account",
          "Media",
          "General Public"
        ]
      },
      "Explanation": { "type": "string" }
    },
    "required": ["Category", "Explanation"]
  }
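A reply that is supposed to follow this schema can be checked before it is scored. The sketch below uses only the standard library and is not the notebook's own validation; the helper name `parse_structured` and the choice to return `None` on any violation are assumptions made for illustration.

```python
import json

# The four labels from the schema's enum.
CATEGORIES = (
    "Business Stakeholder",
    "Politician or Government Account",
    "Media",
    "General Public",
)


def parse_structured(raw):
    """Return (category, explanation) if raw satisfies the schema, else None.

    Hypothetical helper: it checks only what the schema above requires.
    """
    try:
        obj = json.loads(raw)
    except (TypeError, ValueError):
        return None  # not parseable as JSON
    if not isinstance(obj, dict):
        return None  # the schema requires an object
    category = obj.get("Category")
    explanation = obj.get("Explanation")
    if category not in CATEGORIES or not isinstance(explanation, str):
        return None  # missing key, wrong type, or label outside the enum
    return category, explanation
```

A `None` result here would count as a non-compliant reply in the Compliance metric.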

Prompt · full

Full prompt

Classification_prompt_reform (verbatim from Workflow.ipynb)

Task Overview: You are going to classify the author of a given Twitter bio into one of 4 distinct categories, General Public, Business Stakeholder, Media, Politicians and Government Accounts and provide a brief explanation for your decision.
Definitions:
Business Stakeholder- Private-sector or quasi-private actors that stand to gain directly from the deal and actively promote it. This group comprises corporate representatives (e.g., Amazon executives or spokespersons), commercial real-estate developers, site-selection consultants, lobbyists, and trade-association accounts. Their messages tend to highlight job creation, tax-base growth, and the competitive advantages of their locality, often using language such as "investment," "opportunity," or "bring Amazon home."
Politician or Government Account-Actors who hold formal authority in the policy-making process. They include elected officials (mayors, governors, city council members) who can publicly claim credit or deflect blame, and career civil servants (city planners, economic-development staff, agency heads). On Twitter they usually tweet from verified or official-government handles, reference "our city," "we're offering," or cite legislative actions.
Media-Information-dissemination entities that report on, interpret, and frame the HQ2 competition for the public. This includes local-news stations, regional newspapers, national business press, and individual reporters' personal accounts.
General Public-All remaining individual users who do not belong to the three organized groups above. They are ordinary residents, voters, community-organizers, or "concerned citizens."
Task: carefully read the author biography from a Twitter bio, identify which of the four categories the author belongs to, and provide a brief explanation.
Your response must belong to one of: 1) Business Stakeholder, 2) Politician or Government Account, 3) Media, 4) General Public
Twitter Bio: {text}
Instructions:
1) Carefully review the Task Overview, Definitions, Task, Instructions, and Output Format before responding.
2) Return the output as valid JSON: {Category: <"Business Stakeholder"|"Politician or Government Account"|"Media"|"General Public">, Explanation: <Insert Explanation Here>}.
4) Review your choice against its definition: {text}
5) Provide a brief explanation that justifies your choice over the other options.
Output: {Category: <...>, Explanation: <Insert Explanation Here>}
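The full prompt mixes a `{text}` placeholder with literal braces in its output template, so Python's `str.format` would try to treat `{Category: ...}` as a replacement field. A minimal sketch of one way to fill it, assuming plain substring replacement is acceptable; `fill_prompt` and the shortened template are illustrative stand-ins, not code from the notebook.

```python
def fill_prompt(template, bio):
    """Substitute the bio for every {text} placeholder (hypothetical helper)."""
    # str.replace leaves the other braces in the prompt untouched.
    return template.replace("{text}", bio)


# Shortened stand-in for the full prompt that keeps its literal braces.
template = (
    "Twitter Bio: {text} "
    "Output: {Category: <...>, Explanation: <Insert Explanation Here>}"
)

filled = fill_prompt(template, "local news reporter")
print(filled)
```

Because the full prompt contains `{text}` twice, the bio is inserted at both positions.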

Prompt · simple

Simple prompt

Classification_prompt_org (verbatim from Workflow.ipynb)

I want you to perform a data annotation task. Your task is to carefully read the text and identify the category that best matches the Twitter Bio Description. Your response must belong to one of the four categories: Business Stakeholder, Politician or Government Account, Media, or General Public. In your output, only respond with the name of the polarity: Business Stakeholder, Politician or Government Account, Media, or General Public, depending on the information in the Twitter Bio provided. In your output, I also want you to provide an explanation for the output. Provide your response in the first line and provide the explanation for your response in the second line.
Twitter Bio: <{text}>

Same four categories. The simple prompt drops the definitions and step-by-step instructions, and it asks for plain text rather than JSON. We measure what that costs.
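Since the simple prompt asks for the label on the first line and the explanation on the second, a plain-text reply has to be split before it can be scored. The sketch below is one possible parser, not the notebook's; `parse_plain`, the trailing-period handling, and the `None` label for anything outside the four categories are assumptions.

```python
CATEGORIES = (
    "Business Stakeholder",
    "Politician or Government Account",
    "Media",
    "General Public",
)


def parse_plain(raw):
    """Split a two-line reply into (category, explanation).

    Hypothetical helper: category is None when the first non-empty line
    is not exactly one of the four labels.
    """
    lines = [ln.strip() for ln in (raw or "").splitlines() if ln.strip()]
    if not lines:
        return None, ""
    first = lines[0].rstrip(".")  # tolerate a trailing period
    category = first if first in CATEGORIES else None
    explanation = " ".join(lines[1:])
    return category, explanation
```

A stricter or looser matcher would change the Compliance figure for unstructured output, so whichever rule is used should be reported with the results.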

Run it

The experimental design

  • Every unique combination of model, prompt, temperature, and output format is run 5 times.
  • Default settings for each comparison: Model: Qwen 3.6, Prompt: Full prompt, Temperature: 0, Output: Structured.
Comparison | What changes
Models | Qwen 3.6 · GPT-OSS · Gemma 4
Prompt | Full · Simple
Temperature | 0 · 1
Output | Structured · Unstructured

Each comparison changes one factor and holds the rest at the defaults above. Every condition is run 5×.
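Read as a one-factor-at-a-time design, the conditions can be enumerated from the defaults. This is a sketch under that reading only; the dictionary keys and the `conditions` helper are hypothetical names, and the notebook may organize its runs differently.

```python
# Defaults and factor levels as listed above.
DEFAULTS = {
    "model": "Qwen 3.6",
    "prompt": "Full",
    "temperature": 0,
    "output": "Structured",
}
LEVELS = {
    "model": ["Qwen 3.6", "GPT-OSS", "Gemma 4"],
    "prompt": ["Full", "Simple"],
    "temperature": [0, 1],
    "output": ["Structured", "Unstructured"],
}
N_RUNS = 5   # repetitions per condition
N_BIOS = 39  # bios classified in each run


def conditions():
    """Vary each factor alone while holding the others at their defaults."""
    seen = [dict(DEFAULTS)]
    for factor, levels in LEVELS.items():
        for level in levels:
            cond = dict(DEFAULTS, **{factor: level})
            if cond not in seen:  # the default condition is shared
                seen.append(cond)
    return seen


conds = conditions()
print(len(conds))                    # unique conditions: 6
print(len(conds) * N_RUNS * N_BIOS)  # total classifications: 1170
```

The default condition (Qwen 3.6, full prompt, temperature 0, structured) is the shared baseline, which is why its row repeats in every comparison table.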

The evaluation suite

Three Model Comparison

Compliance · Accuracy · Precision

Three Model Comparison

Output examples

Bio: “honored to serve as @bostonmayor & member of @nationalmayors. dedicated to enhancing the well-being and prosperity of all Bostonians. proud to call this vibrant city home.”  ·  Gold standard: Politicians/Government Accounts

Model | Predicted | Explanation
Qwen 3.6 (35B) | Politician or Government Account | The Twitter bio explicitly identifies the author as the Mayor of Boston (@bostonmayor) and a member of the National Mayors organization. This directly matches the definition of a Politician or Government Account, which includes elected officials like mayors who hold formal authority in policy-making and represent their city. The language focuses on public service and civic prosperity rather than corporate investment, news reporting, or personal citizen commentary, clearly distinguishing it from the other three categories.
GPT-OSS (120B) | Politician or Government Account | The bio explicitly states the user serves as @bostonmayor, indicating they hold the elected office of Mayor of Boston. This is a formal authority position in policy-making and city governance, matching the definition of a Politician or Government Account.
Gemma 4 (31B) | Politician or Government Account | The author explicitly states they serve as the '@bostonmayor' and are a member of the '@nationalmayors'. According to the definitions, elected officials such as mayors who hold formal authority in the policy-making process fall directly into the 'Politician or Government Account' category.

1 · Compliance — the code

Python
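# % missing, % valid label
# CATEGORIES is assumed to be the four labels in the schema's enum
CATEGORIES = ("Business Stakeholder",
              "Politician or Government Account",
              "Media", "General Public")

def compliance(recs):
    n = len(recs)  # assumes recs is non-empty
    # missing: no label came back (None or empty string)
    miss  = sum(r["category"] in (None, "")
                for r in recs)
    # valid: label is exactly one of the four categories
    valid = sum(r["category"] in CATEGORIES
                for r in recs)
    return {"pct_missing": 100*miss/n,
            "pct_valid":   100*valid/n}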
R
# tidyverse
compliance_table <- function(runs) runs |>
  group_by(model, prompt) |>
  summarise(
    pct_missing = 100*mean(is.na(pred)),
    pct_valid   = 100*mean(pred %in% CATEGORIES),
    .groups = "drop")

1 · Compliance — results

Did the models emit valid labels?

Model | Valid label | Missing
Qwen 3.6 (35B) | 100% | 0%
GPT-OSS (120B) | 100% | 0%
Gemma 4 (31B) | 95.9% | 4.1%

reform prompt · temp 0 · structured · 39 bios × 5 runs. Near-perfect across the board; Gemma's small dip is a few transient API timeouts, not format failures.

2 · Accuracy — the code

Python
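# exact match vs gold standard
# canon() is not shown here; it is assumed to normalize label spelling
from statistics import mean  # assumed source of mean()

def accuracy(recs):
    pairs = [(canon(r["category"]),
              canon(r["gold"])) for r in recs]
    # drop records where either label is missing
    pairs = [(p,g) for p,g in pairs if p and g]
    return 100*mean(p==g for p,g in pairs)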
R
# tidyverse
accuracy_table <- function(runs) runs |>
  group_by(model, prompt) |>
  summarise(
    accuracy = 100*mean(pred == gold,
                        na.rm = TRUE),
    .groups = "drop")
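The gold labels shown on these slides are spelled differently from the prompt's labels (for example "Politicians/Government Accounts" versus "Politician or Government Account"), so an exact-match comparison needs a normalizing step such as the `canon` call in the Python snippet. The notebook's `canon` is not shown; the sketch below is a hypothetical stand-in, and pairing "Other Stakeholders" with "Business Stakeholder" in particular is an assumption.

```python
# Hypothetical alias table: lower-cased spellings -> prompt labels.
# ASSUMPTION: "Other Stakeholders" (gold) corresponds to
# "Business Stakeholder" (prompt); the source does not state this.
ALIASES = {
    "business stakeholder": "Business Stakeholder",
    "other stakeholders": "Business Stakeholder",
    "politician or government account": "Politician or Government Account",
    "politicians/government accounts": "Politician or Government Account",
    "politicians / government": "Politician or Government Account",
    "media": "Media",
    "general public": "General Public",
}


def canon_sketch(label):
    """Map a label spelling to its canonical form, or None if unknown."""
    if label is None:
        return None
    key = " ".join(str(label).lower().split())  # case + whitespace folding
    return ALIASES.get(key)
```

Returning `None` for unknown spellings means such records are dropped by the `if p and g` filter instead of being counted as wrong.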

2 · Accuracy

Non-matching examples

Model | Bio | Gold standard | Predicted | Explanation
Qwen 3.6 (35B) | former web journalist and conference coordinator. | General Public | Media | The Twitter bio explicitly identifies the author as a 'former web journalist,' which directly aligns with the Media category's definition of information-dissemination entities and individual reporters who report on, interpret, and frame topics for the public. While the author also mentions being a conference coordinator, the primary professional identifier (journalist) clearly places them within the Media category rather than Business Stakeholder, Politician/Government, or General Public.
GPT-OSS (120B) | currently - https://t.co/bcd4efghij. former city council member and alum of the mayor's office under Mayor Jane Doe and Mayor John Smith. | Politicians/Government Accounts | General Public | The bio indicates the user is a former city council member and a past employee of the mayor's office, but does not hold a current elected or government position, nor does it mention any business or media affiliation. As a private individual reflecting on past public service, the author fits the General Public category.
Gemma 4 (31B) | sharing insights on local business growth, tips & trends. follows, retweets≠endorsement. #accountant #businessdegree #eaglescout — #smallbiz #fintech #marketingtech #cloudservices #machinelearning | General Public | Business Stakeholder | The author identifies as an accountant with a business degree and focuses on 'local business growth,' 'smallbiz,' and various commercial sectors like 'fintech' and 'cloudservices.' This aligns with the Business Stakeholder definition of private-sector actors who promote economic growth and investment, whereas they lack the official authority of a government account, the reporting focus of the media, or the typical profile of a general resident.

2 · Accuracy — results

How close to the human gold standard?

Model | Accuracy
Qwen 3.6 (35B) | 83.1%
Gemma 4 (31B) | 82.4%
GPT-OSS (120B) | 80.0%

Exact match vs human label · reform prompt · temp 0 · structured · 39 bios pooled over 5 runs. Tight band — all three within ~3 points.

2 · Accuracy — is the gap real?

McNemar's test

  • Goal: decide whether the accuracy gaps (Qwen 83.1, GPT-OSS 80.0, Gemma 82.4) are a real difference or run-to-run noise.
  • Step 1. Score every prediction right/wrong against the human gold standard.
  • Step 2. Pair two models on the same bio (and iteration).
  • Step 3. Keep only the disagreements — bios one model gets right and the other wrong.
  • Step 4. Exact binomial test on those discordant counts → a p-value.

Paired and per-item — the right test when both models see the exact same data.

2 · McNemar — the code

Python
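# statsmodels, exact binomial
# df: one row per (model, bio id, iteration); m1, m2: the two model names
from statsmodels.stats.contingency_tables \
     import mcnemar

df["correct"] = (df.pred == df.gold).astype(int)
# index by bio id + iteration so the two models pair up row by row
a = df[df.model==m1].set_index(["id","it"]).correct
b = df[df.model==m2].set_index(["id","it"]).correct
# 2x2 table: rows = m1 right/wrong, columns = m2 right/wrong
tab = [[((a==1)&(b==1)).sum(), ((a==1)&(b==0)).sum()],
       [((a==0)&(b==1)).sum(), ((a==0)&(b==0)).sum()]]
mcnemar(tab, exact=True)        # .statistic .pvalue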
R
# exact McNemar = binom test on discordants
a <- df$correct[df$model == m1]   # aligned by
b <- df$correct[df$model == m2]   # id + iteration
tab <- table(a, b)
binom.test(tab["1","0"],
           tab["1","0"] + tab["0","1"], 0.5)

2 · Accuracy — model comparison

Are the model gaps significant?

Comparison | Accuracy | Discordant | McNemar p
Qwen 3.6 vs GPT-OSS | 83.1 vs 80.0 | 17 / 11 | 0.345
Qwen 3.6 vs Gemma 4 | 83.1 vs 82.4 | 13 / 13 | 1.000
GPT-OSS vs Gemma 4 | 80.0 vs 82.4 | 12 / 16 | 0.572

None are significant (all p > 0.05). Accuracy gaps of up to ~3 points cannot be distinguished from noise in this sample, so on this task model choice doesn't measurably move accuracy. (The prompt does; see the prompt comparison.)
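The p-values in the table can be recomputed from the discordant counts alone, because the exact McNemar test is a two-sided binomial test with p = 0.5 on those counts. The sketch below uses only the standard library and reads the Discordant column as the two discordant counts for each pair; `mcnemar_exact` is an illustrative name, not the notebook's function.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, nothing to test
    k = min(b, c)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled and capped at 1
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


pairs = {
    "Qwen 3.6 vs GPT-OSS": (17, 11),
    "Qwen 3.6 vs Gemma 4": (13, 13),
    "GPT-OSS vs Gemma 4": (12, 16),
}
for label, (b, c) in pairs.items():
    print(label, round(mcnemar_exact(b, c), 3))  # 0.345, 1.0, 0.572
```

Only the discordant pairs enter the test; bios that both models get right, or both get wrong, carry no information about which model is better.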

3 · Precision — the code

Python
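# agreement + Krippendorff alpha
# canon() and kripp_alpha_nominal() are helpers not shown here
from collections import defaultdict
from statistics import mean  # assumed source of mean()

def precision(recs):
    # group the labels each bio received across the 5 runs
    items = defaultdict(list)
    for r in recs:
        items[r["id"]].append(canon(r["category"]))
    # a bio "agrees" only if every run gave the same label
    agree = [len(set(v))==1 for v in items.values()]
    return {"agreement_rate": 100*mean(agree),
            "kripp_alpha": kripp_alpha_nominal(items)}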
R
# tidyverse + base-R alpha
precision_table <- function(runs) runs |>
  group_by(model, prompt) |>
  summarise(
    agreement_rate = 100*mean(all_agree(pred)),
    kripp_alpha = kripp_alpha_nominal(
                    pred, iteration),
    .groups = "drop")
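Both snippets call a `kripp_alpha_nominal` helper that is not shown. As a sketch of what such a helper might compute, the function below implements Krippendorff's alpha for nominal data from the coincidence matrix, treating the repeated runs as coders. The name `kripp_alpha_nominal_sketch`, the handling of missing labels, and the `nan` return for the degenerate case are assumptions, not the notebook's implementation.

```python
from collections import Counter
from itertools import permutations


def kripp_alpha_nominal_sketch(items):
    """Krippendorff's alpha for nominal labels.

    items maps each bio id to the list of labels it received across runs.
    Missing labels (None) are dropped, and bios with fewer than two labels
    contribute nothing. Returns nan when only one label ever occurs,
    because expected disagreement is then zero.
    """
    coincidence = Counter()
    for labels in items.values():
        labels = [lab for lab in labels if lab is not None]
        m = len(labels)
        if m < 2:
            continue
        # every ordered pair of runs within a bio, weighted by 1 / (m - 1)
        for a, b in permutations(labels, 2):
            coincidence[(a, b)] += 1 / (m - 1)

    # marginal totals n_c and grand total n of the coincidence matrix
    totals = Counter()
    for (a, _), weight in coincidence.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidence.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b)
    if n <= 1 or expected == 0:
        return float("nan")
    # alpha = 1 - D_o / D_e with D_o = observed / n
    # and D_e = expected / (n * (n - 1))
    return 1 - observed * (n - 1) / expected
```

Unlike the raw agreement rate, alpha corrects for the agreement expected by chance given how often each label occurs, which is why the two columns in the results can rank conditions slightly differently.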

3 · Precision

Imprecise examples

Identical settings, run 5× — two runs that disagreed:

Model | Bio | Gold standard | Predicted | Explanation
GPT-OSS (120B) | currently - https://t.co/bcd4efghij. former city council member and alum of the mayor's office under Mayor Jane Doe and Mayor John Smith. | Politicians/Government Accounts | General Public | The bio indicates the user is a former city council member and a past employee of the mayor's office, but does not hold a current elected or government position, nor does it mention any business or media affiliation. As a private individual reflecting on past public service, the author fits the General Public category.
GPT-OSS (120B) | (same bio) | Politicians/Government Accounts | Politician or Government Account | The bio highlights a past elected position ('former city council member') and experience working in the mayor's office, indicating a clear affiliation with government and policy‑making roles. This aligns with the definition of Politician or Government Account, which includes elected officials and career civil servants. The user is not presented as a business representative, media outlet, or a typical private citizen, so the political/government classification is the best fit.
Qwen 3.6 (35B) | educator, innovator, and strategic leader bridging advanced technology with diverse sectors to develop cutting-edge solutions. | Other Stakeholders | General Public | The Twitter bio describes an individual professional focused on education, innovation, and technology, without any explicit references to corporate deal promotion, government authority, or media reporting. According to the provided definitions, the 'General Public' category serves as a catch-all for individual users who do not fit into the organized groups of Business Stakeholders, Politicians/Government, or Media. Since the bio lacks indicators of economic development advocacy, official government roles, or journalistic activity, it aligns best with the General Public classification.
Qwen 3.6 (35B) | (same bio) | Other Stakeholders | Business Stakeholder | The bio employs professional, corporate-focused language ('strategic leader,' 'bridging advanced technology with diverse sectors,' 'cutting-edge solutions') that aligns with private-sector professionals, tech executives, or consultants. It lacks the official titles, policy references, or civic language typical of government accounts, contains no journalistic elements or news links characteristic of media, and is too professionally oriented to be considered a general public user. Thus, it best fits the Business Stakeholder category.

3 · Precision — results

Same answer across 5 runs?

Model | Agreement | Krippendorff α
Gemma 4 (31B) | 100% | 1.00
Qwen 3.6 (35B) | 97.4% | 0.97
GPT-OSS (120B) | 87.2% | 0.90

39 bios × 5 iterations · reform prompt · temp 0 · structured. Gemma perfectly reproducible; GPT-OSS the most variable.

Comparison · Prompt

Full vs simple prompt example

Prompt | Bio | Gold standard | Predicted | Explanation
Full (reform) | public relations strategist & author of 'The Spin Room: A Guide to Crisis Management' | General Public | General Public ✓ | The Twitter bio identifies the user as a public relations strategist and author, indicating a professional background but lacking explicit markers of corporate representation, government authority, or journalistic reporting. According to the provided definitions, the General Public category encompasses 'all remaining individual users who do not belong to the three organized groups above.' Since the bio does not demonstrate direct promotion of a specific economic deal, official government ties, or news dissemination, the author best fits the General Public category.
Simple (org) | (same bio) | General Public | Business Stakeholder ✗ | The bio identifies the user as a public relations strategist and author of a business/crisis management guide, indicating a professional role focused on corporate communications and consulting rather than politics, journalism, or personal use.

Comparison · Prompt

Full vs simple prompt

Prompt | Compliance | Accuracy | Agreement | α
Full | 100% | 83.1% | 97.4% | 0.97
Simple | 100% | 72.3% | 87.2% | 0.92

McNemar (accuracy): p = 0.002 — significant. The full prompt is meaningfully more accurate (+11 pts) and more consistent. Qwen 3.6 · temp 0 · structured · 39 bios × 5.

Comparison · Temperature

Temperature 0 vs 1 examples

Temperature | Bio | Gold standard | Predicted | Explanation
0 | media consultant / communications expert. craft and execute press strategies for impactful media coverage. fmr tech news reporter & PR lead for a leading financial analytics firm. | Media | Media ✓ | The author's bio explicitly identifies them as a media consultant, communications expert, and former tech news reporter who crafts press strategies. This professional background in journalism, media relations, and press strategy aligns directly with the Media category, which encompasses individuals and entities focused on information dissemination, reporting, and communications. Unlike Business Stakeholders who actively promote specific economic deals for direct gain, or Politicians/Government Accounts who hold official policy-making roles, this profile centers on media strategy and journalistic experience, making Media the most accurate classification.
0 | (same bio) | Media | Media ✓ | The author's bio explicitly identifies them as a 'media consultant,' 'communications expert,' and 'fmr tech news reporter,' which directly aligns with the Media category's focus on information dissemination, press coverage, and reporting. Their professional background centers on crafting press strategies and journalism, distinguishing them from direct business promoters, government officials, or general public commentators.
1 | (same bio) | Media | Business Stakeholder ✗ | The author identifies as a media consultant, communications expert, and PR lead for a financial analytics firm, with a focus on crafting press strategies for impactful coverage. This aligns with the Business Stakeholder category, which encompasses corporate representatives, spokespersons, and private-sector communications professionals who manage external relations and actively promote their organization's interests. Although the bio mentions being a former tech news reporter, their current primary role is corporate PR/communications rather than independent journalism, which distinguishes them from the Media category and places them firmly within the private-sector business stakeholder group.
1 | (same bio) | Media | Media ✓ | The Twitter bio identifies the author as a 'media consultant,' 'communications expert,' and 'fmr tech news reporter' who specializes in crafting press strategies and securing media coverage. This background in journalism, public relations, and information dissemination aligns directly with the Media category, which explicitly includes reporters and information-dissemination entities. Unlike Business Stakeholders who actively promote specific economic deals or Politicians/Government accounts who engage in policy-making, this user's professional identity and stated expertise are centered on media relations and reporting, making Media the most accurate classification.

Comparison · Temperature

Temperature 0 vs 1

Temperature | Compliance | Accuracy | Agreement | α
0 | 100% | 83.1% | 97.4% | 0.97
1 | 100% | 81.5% | 84.6% | 0.86

McNemar (accuracy): p = 0.453 — not significant. Temperature barely touches accuracy, but it lowers precision (α 0.97 → 0.86). Qwen 3.6 · reform · structured · 39 bios × 5.

Comparison · Structured output

Schema-enforced vs free text

Output | Compliance | Accuracy | Agreement | α
Structured | 100% | 83.1% | 97.4% | 0.97
Unstructured | 100% | 82.6% | 89.7% | 0.90

McNemar (accuracy): p = 1.000 — not significant. With the strong reform prompt Qwen is already reliable, so structure adds little here — its payoff is larger with weak prompts. Qwen 3.6 · reform · temp 0 · 39 bios × 5.