Workshop · Use Case 3
A TaMPER example focusing on evaluation.
Compliance · Accuracy · Precision
Overview
Task
usertype_label) so we can score the model.
Categories: General Public · Media · Other Stakeholders · Politicians / Government
The bio, not the tweet, is the unit classified.
Setup
Prompt · full
Prompt · simple
Same four categories — the only change is dropping the definitions and step-by-step instructions (and it asks for plain text, not JSON). We measure what that costs.
Run it
| Comparison | What changes |
|---|---|
| Models | Qwen 3.6 · GPT-OSS · Gemma 4 |
| Prompt | Full · Simple |
| Temperature | 0 · 1 |
| Output | Structured · Unstructured |
Each comparison changes one factor and holds the rest at the defaults (the first value in each row: Qwen 3.6, full prompt, temperature 0, structured output). Every condition is run 5×.
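The one-factor-at-a-time design can be sketched as a small condition grid. This is an illustrative sketch, not the workshop's runner: the names `FACTORS`, `N_RUNS`, and `one_factor_conditions` are hypothetical, and it assumes the first value listed for each factor is the default.

```python
# Hypothetical sketch of a one-factor-at-a-time design.
# Assumption: the first level listed for each factor is the default.
FACTORS = {
    "model": ["Qwen 3.6", "GPT-OSS", "Gemma 4"],
    "prompt": ["full", "simple"],
    "temperature": [0, 1],
    "output": ["structured", "unstructured"],
}
N_RUNS = 5  # every condition is repeated 5 times


def one_factor_conditions(factors):
    """Return the default condition plus one variant per non-default level."""
    defaults = {name: levels[0] for name, levels in factors.items()}
    conditions = [dict(defaults)]
    for name, levels in factors.items():
        for level in levels[1:]:
            # Change exactly one factor, hold the rest at their defaults.
            conditions.append({**defaults, name: level})
    return conditions


conditions = one_factor_conditions(FACTORS)
runs = [(cond, it) for cond in conditions for it in range(N_RUNS)]
```

Under these assumptions the grid has six conditions (the default plus five single-factor variants) rather than the 24 a full factorial design would need, which is why interactions between factors are not measured here.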
The evaluation suite
Compliance · Accuracy · Precision
Three Model Comparison
Bio: “honored to serve as @bostonmayor & member of @nationalmayors. dedicated to enhancing the well-being and prosperity of all Bostonians. proud to call this vibrant city home.” · Gold standard: Politicians/Government Accounts
| Model | Predicted | Explanation |
|---|---|---|
| Qwen 3.6 (35B) | Politician or Government Account | The Twitter bio explicitly identifies the author as the Mayor of Boston (@bostonmayor) and a member of the National Mayors organization. This directly matches the definition of a Politician or Government Account, which includes elected officials like mayors who hold formal authority in policy-making and represent their city. The language focuses on public service and civic prosperity rather than corporate investment, news reporting, or personal citizen commentary, clearly distinguishing it from the other three categories. |
| GPT-OSS (120B) | Politician or Government Account | The bio explicitly states the user serves as @bostonmayor, indicating they hold the elected office of Mayor of Boston. This is a formal authority position in policy-making and city governance, matching the definition of a Politician or Government Account. |
| Gemma 4 (31B) | Politician or Government Account | The author explicitly states they serve as the '@bostonmayor' and are a member of the '@nationalmayors'. According to the definitions, elected officials such as mayors who hold formal authority in the policy-making process fall directly into the 'Politician or Government Account' category. |
1 · Compliance — the code
# Python: % missing, % valid label
def compliance(recs):
    n = len(recs)
    miss = sum(r["category"] in (None, "") for r in recs)
    valid = sum(r["category"] in CATEGORIES for r in recs)
    return {"pct_missing": 100 * miss / n,
            "pct_valid": 100 * valid / n}

# R: tidyverse
compliance_table <- function(runs) runs |>
  group_by(model, prompt) |>
  summarise(
    pct_missing = 100 * mean(is.na(pred)),
    pct_valid = 100 * mean(pred %in% CATEGORIES),
    .groups = "drop")
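A toy run of the Python `compliance` function may make the two percentages concrete. The `CATEGORIES` values and the records below are made up for illustration (the label spellings are assumed from the predictions shown in this deck), so treat this as a sketch rather than the workshop's data.

```python
# Assumed label set for illustration; the real CATEGORIES is not shown in the deck.
CATEGORIES = {
    "General Public",
    "Media",
    "Business Stakeholder",
    "Politician or Government Account",
}


def compliance(recs):
    """Percentage of missing outputs and of outputs that are a valid label."""
    n = len(recs)
    miss = sum(r["category"] in (None, "") for r in recs)
    valid = sum(r["category"] in CATEGORIES for r in recs)
    return {"pct_missing": 100 * miss / n, "pct_valid": 100 * valid / n}


# Four made-up records: two valid, one missing, one off-list label.
recs = [
    {"category": "Media"},
    {"category": None},
    {"category": "Journalist"},
    {"category": "General Public"},
]
result = compliance(recs)
```

Note that an off-list label such as "Journalist" is neither missing nor valid, so the two percentages need not sum to 100.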
1 · Compliance — results
| Model | Valid label | Missing |
|---|---|---|
| Qwen 3.6 (35B) | 100% | 0% |
| GPT-OSS (120B) | 100% | 0% |
| Gemma 4 (31B) | 95.9% | 4.1% |
Full (reform) prompt · temp 0 · structured · 39 bios × 5 runs. Near-perfect across the board; Gemma's small dip reflects a few transient API timeouts, not format failures.
2 · Accuracy — the code
# Python: exact match vs gold standard
from statistics import mean

def accuracy(recs):
    # canon(): label-normalising helper, not shown here
    pairs = [(canon(r["category"]), canon(r["gold"])) for r in recs]
    pairs = [(p, g) for p, g in pairs if p and g]  # drop missing labels
    return 100 * mean(p == g for p, g in pairs)

# R: tidyverse
accuracy_table <- function(runs) runs |>
  group_by(model, prompt) |>
  summarise(
    accuracy = 100 * mean(pred == gold,
                          na.rm = TRUE),
    .groups = "drop")
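The Python version relies on a `canon` helper that is not shown. One plausible sketch follows, assuming its job is to reconcile spelling variants such as "Politician or Government Account" (model output) and "Politicians/Government Accounts" (gold standard). The `ALIASES` table is hypothetical and would need to cover every variant in the real data.

```python
import re
from statistics import mean

# Hypothetical alias table: the spellings come from examples in this deck,
# but the mapping itself is an assumption, not the workshop's actual helper.
ALIASES = {
    "politician or government account": "politicians/government",
    "politicians/government accounts": "politicians/government",
}


def canon(label):
    """Return a canonical label string, or None for a missing label."""
    if label is None:
        return None
    key = " ".join(str(label).lower().split())  # lowercase, collapse whitespace
    key = re.sub(r"\s*/\s*", "/", key)           # "a / b" -> "a/b"
    if not key:
        return None
    return ALIASES.get(key, key)


def accuracy(recs):
    """Exact-match accuracy (%) over records with both labels present."""
    pairs = [(canon(r["category"]), canon(r["gold"])) for r in recs]
    pairs = [(p, g) for p, g in pairs if p and g]
    return 100 * mean(p == g for p, g in pairs)


recs = [
    {"category": "Politician or Government Account",
     "gold": "Politicians/Government Accounts"},   # match after canon()
    {"category": "Media", "gold": "General Public"},  # mismatch
    {"category": None, "gold": "Media"},              # missing: dropped
]
score = accuracy(recs)
```

As written, a missing prediction is removed from the denominator rather than counted as wrong, so accuracy should be read alongside the compliance figures.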
2 · Accuracy
| Model | Bio | Gold standard | Predicted | Explanation |
|---|---|---|---|---|
| Qwen 3.6 (35B) | former web journalist and conference coordinator. | General Public | Media | The Twitter bio explicitly identifies the author as a 'former web journalist,' which directly aligns with the Media category's definition of information-dissemination entities and individual reporters who report on, interpret, and frame topics for the public. While the author also mentions being a conference coordinator, the primary professional identifier (journalist) clearly places them within the Media category rather than Business Stakeholder, Politician/Government, or General Public. |
| GPT-OSS (120B) | currently - https://t.co/bcd4efghij. former city council member and alum of the mayor's office under Mayor Jane Doe and Mayor John Smith. | Politicians/Government Accounts | General Public | The bio indicates the user is a former city council member and a past employee of the mayor's office, but does not hold a current elected or government position, nor does it mention any business or media affiliation. As a private individual reflecting on past public service, the author fits the General Public category. |
| Gemma 4 (31B) | sharing insights on local business growth, tips & trends. follows, retweets≠endorsement. #accountant #businessdegree #eaglescout — #smallbiz #fintech #marketingtech #cloudservices #machinelearning | General Public | Business Stakeholder | The author identifies as an accountant with a business degree and focuses on 'local business growth,' 'smallbiz,' and various commercial sectors like 'fintech' and 'cloudservices.' This aligns with the Business Stakeholder definition of private-sector actors who promote economic growth and investment, whereas they lack the official authority of a government account, the reporting focus of the media, or the typical profile of a general resident. |
2 · Accuracy — results
| Model | Accuracy |
|---|---|
| Qwen 3.6 (35B) | 83.1% |
| Gemma 4 (31B) | 82.4% |
| GPT-OSS (120B) | 80.0% |
Exact match vs human label · reform prompt · temp 0 · structured · 39 bios pooled over 5 runs. Tight band — all three within ~3 points.
2 · Accuracy — is the gap real?
Paired and per-item — the right test when both models see the exact same data.
2 · McNemar — the code
# Python: statsmodels, exact binomial
from statsmodels.stats.contingency_tables import mcnemar

df["correct"] = (df.pred == df.gold).astype(int)
# index both models by (bio id, iteration) so the pairs line up
a = df[df.model == m1].set_index(["id", "it"]).correct
b = df[df.model == m2].set_index(["id", "it"]).correct
tab = [[((a == 1) & (b == 1)).sum(), ((a == 1) & (b == 0)).sum()],
       [((a == 0) & (b == 1)).sum(), ((a == 0) & (b == 0)).sum()]]
mcnemar(tab, exact=True)  # .statistic .pvalue

# R: exact McNemar = binomial test on the discordant pairs
a <- df$correct[df$model == m1]  # aligned by
b <- df$correct[df$model == m2]  # id + iteration
tab <- table(a, b)
binom.test(tab["1", "0"],
           tab["1", "0"] + tab["0", "1"], 0.5)
2 · Accuracy — model comparison
| Comparison | Accuracy | Discordant | McNemar p |
|---|---|---|---|
| Qwen 3.6 vs GPT-OSS | 83.1 vs 80.0 | 17 / 11 | 0.345 |
| Qwen 3.6 vs Gemma 4 | 83.1 vs 82.4 | 13 / 13 | 1.000 |
| GPT-OSS vs Gemma 4 | 80.0 vs 82.4 | 12 / 16 | 0.572 |
None is significant (all p > 0.05). The ~3-point accuracy gaps are indistinguishable from noise at this sample size; on this task, model choice doesn't measurably move accuracy. (The prompt does: see the comparison matrix.)
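Because the exact McNemar test is a two-sided binomial test on the discordant pairs, the p-values in the table above can be checked from the discordant counts alone. A minimal standard-library sketch (the function name `exact_mcnemar_p` is illustrative, not part of statsmodels):

```python
from math import comb


def exact_mcnemar_p(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    b = items only model 1 got right, c = items only model 2 got right.
    Under H0 each discordant pair is a fair coin flip, so the smaller
    count follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Discordant counts taken from the comparison table.
p_qwen_gpt = exact_mcnemar_p(17, 11)
p_qwen_gemma = exact_mcnemar_p(13, 13)
p_gpt_gemma = exact_mcnemar_p(12, 16)
```

Rounded to three decimals these give 0.345, 1.000, and 0.572, matching the table. The concordant pairs never enter the calculation, which is why the test depends only on how often the two models disagree about correctness.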
3 · Precision — the code
# Python: agreement + Krippendorff alpha
from collections import defaultdict
from statistics import mean

def precision(recs):
    items = defaultdict(list)  # bio id -> labels across the 5 runs
    for r in recs:
        items[r["id"]].append(canon(r["category"]))
    agree = [len(set(v)) == 1 for v in items.values()]
    return {"agreement_rate": 100 * mean(agree),
            "kripp_alpha": kripp_alpha_nominal(items)}

# R: tidyverse + base-R alpha (all_agree and kripp_alpha_nominal not shown)
precision_table <- function(runs) runs |>
  group_by(model, prompt) |>
  summarise(
    agreement_rate = 100 * mean(all_agree(pred)),
    kripp_alpha = kripp_alpha_nominal(
      pred, iteration),
    .groups = "drop")
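`kripp_alpha_nominal` is called above but not shown. The following is a minimal sketch of Krippendorff's α for nominal labels using the standard coincidence-matrix formulation. It assumes `items` maps each bio id to the list of labels from its repeated runs (as built in `precision`), and it is not necessarily the workshop's implementation.

```python
from collections import Counter
from itertools import permutations


def kripp_alpha_nominal(items):
    """Krippendorff's alpha (nominal) for {unit id: [label per run]}."""
    coincidences = Counter()
    for labels in items.values():
        labels = [l for l in labels if l not in (None, "")]  # skip missing
        m = len(labels)
        if m < 2:
            continue  # a unit with fewer than two labels carries no pair information
        # Every ordered pair of labels within a unit, weighted by 1 / (m - 1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _), w in coincidences.items():
        totals[a] += w
    n = sum(totals.values())
    if n <= 1:
        return float("nan")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        return float("nan")  # only one category used: alpha is undefined
    return 1 - observed / expected


# Toy data: three units, three runs each, one unit with a single disagreement.
toy = {"u1": ["A", "A", "A"], "u2": ["A", "A", "B"], "u3": ["B", "B", "B"]}
alpha_toy = kripp_alpha_nominal(toy)
```

On the toy data two of three units agree fully (agreement rate 66.7%) and α works out to 0.6. The two measures differ because α corrects for the agreement expected by chance given how often each label is used.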
3 · Precision
Identical settings, run 5× — two runs that disagreed:
| Model | Bio | Gold standard | Predicted | Explanation |
|---|---|---|---|---|
| GPT-OSS (120B) | currently - https://t.co/bcd4efghij. former city council member and alum of the mayor's office under Mayor Jane Doe and Mayor John Smith. | Politicians/Government Accounts | General Public | The bio indicates the user is a former city council member and a past employee of the mayor's office, but does not hold a current elected or government position, nor does it mention any business or media affiliation. As a private individual reflecting on past public service, the author fits the General Public category. |
| | | | Politician or Government Account | The bio highlights a past elected position ('former city council member') and experience working in the mayor's office, indicating a clear affiliation with government and policy-making roles. This aligns with the definition of Politician or Government Account, which includes elected officials and career civil servants. The user is not presented as a business representative, media outlet, or a typical private citizen, so the political/government classification is the best fit. |
| Qwen 3.6 (35B) | educator, innovator, and strategic leader bridging advanced technology with diverse sectors to develop cutting-edge solutions. | Other Stakeholders | General Public | The Twitter bio describes an individual professional focused on education, innovation, and technology, without any explicit references to corporate deal promotion, government authority, or media reporting. According to the provided definitions, the 'General Public' category serves as a catch-all for individual users who do not fit into the organized groups of Business Stakeholders, Politicians/Government, or Media. Since the bio lacks indicators of economic development advocacy, official government roles, or journalistic activity, it aligns best with the General Public classification. |
| | | | Business Stakeholder | The bio employs professional, corporate-focused language ('strategic leader,' 'bridging advanced technology with diverse sectors,' 'cutting-edge solutions') that aligns with private-sector professionals, tech executives, or consultants. It lacks the official titles, policy references, or civic language typical of government accounts, contains no journalistic elements or news links characteristic of media, and is too professionally oriented to be considered a general public user. Thus, it best fits the Business Stakeholder category. |
3 · Precision — results
| Model | Agreement | Krippendorff α |
|---|---|---|
| Gemma 4 (31B) | 100% | 1.00 |
| Qwen 3.6 (35B) | 97.4% | 0.97 |
| GPT-OSS (120B) | 87.2% | 0.90 |
39 bios × 5 iterations · reform prompt · temp 0 · structured. Gemma perfectly reproducible; GPT-OSS the most variable.
Comparison · Prompt
| Prompt | Bio | Gold standard | Predicted | Explanation |
|---|---|---|---|---|
| Full (reform) | public relations strategist & author of 'The Spin Room: A Guide to Crisis Management' | General Public | General Public ✓ | The Twitter bio identifies the user as a public relations strategist and author, indicating a professional background but lacking explicit markers of corporate representation, government authority, or journalistic reporting. According to the provided definitions, the General Public category encompasses 'all remaining individual users who do not belong to the three organized groups above.' Since the bio does not demonstrate direct promotion of a specific economic deal, official government ties, or news dissemination, the author best fits the General Public category. |
| Simple (org) | | | Business Stakeholder ✗ | The bio identifies the user as a public relations strategist and author of a business/crisis management guide, indicating a professional role focused on corporate communications and consulting rather than politics, journalism, or personal use. |
Comparison · Prompt
| Prompt | Compliance | Accuracy | Agreement | α |
|---|---|---|---|---|
| Full | 100% | 83.1% | 97.4% | 0.97 |
| Simple | 100% | 72.3% | 87.2% | 0.92 |
McNemar (accuracy): p = 0.002 — significant. The full prompt is meaningfully more accurate (+11 pts) and more consistent. Qwen 3.6 · temp 0 · structured · 39 bios × 5.
Comparison · Temperature
| Temperature | Bio | Gold standard | Predicted | Explanation |
|---|---|---|---|---|
| 0 | media consultant / communications expert. craft and execute press strategies for impactful media coverage. fmr tech news reporter & PR lead for a leading financial analytics firm. | Media | Media ✓ | The author's bio explicitly identifies them as a media consultant, communications expert, and former tech news reporter who crafts press strategies. This professional background in journalism, media relations, and press strategy aligns directly with the Media category, which encompasses individuals and entities focused on information dissemination, reporting, and communications. Unlike Business Stakeholders who actively promote specific economic deals for direct gain, or Politicians/Government Accounts who hold official policy-making roles, this profile centers on media strategy and journalistic experience, making Media the most accurate classification. |
| | | | Media ✓ | The author's bio explicitly identifies them as a 'media consultant,' 'communications expert,' and 'fmr tech news reporter,' which directly aligns with the Media category's focus on information dissemination, press coverage, and reporting. Their professional background centers on crafting press strategies and journalism, distinguishing them from direct business promoters, government officials, or general public commentators. |
| 1 | | | Business Stakeholder ✗ | The author identifies as a media consultant, communications expert, and PR lead for a financial analytics firm, with a focus on crafting press strategies for impactful coverage. This aligns with the Business Stakeholder category, which encompasses corporate representatives, spokespersons, and private-sector communications professionals who manage external relations and actively promote their organization's interests. Although the bio mentions being a former tech news reporter, their current primary role is corporate PR/communications rather than independent journalism, which distinguishes them from the Media category and places them firmly within the private-sector business stakeholder group. |
| | | | Media ✓ | The Twitter bio identifies the author as a 'media consultant,' 'communications expert,' and 'fmr tech news reporter' who specializes in crafting press strategies and securing media coverage. This background in journalism, public relations, and information dissemination aligns directly with the Media category, which explicitly includes reporters and information-dissemination entities. Unlike Business Stakeholders who actively promote specific economic deals or Politicians/Government accounts who engage in policy-making, this user's professional identity and stated expertise are centered on media relations and reporting, making Media the most accurate classification. |
Comparison · Temperature
| Temperature | Compliance | Accuracy | Agreement | α |
|---|---|---|---|---|
| 0 | 100% | 83.1% | 97.4% | 0.97 |
| 1 | 100% | 81.5% | 84.6% | 0.86 |
McNemar (accuracy): p = 0.453 — not significant. Temperature barely touches accuracy, but it lowers precision (α 0.97 → 0.86). Qwen 3.6 · reform · structured · 39 bios × 5.
Comparison · Structured output
| Output | Compliance | Accuracy | Agreement | α |
|---|---|---|---|---|
| Structured | 100% | 83.1% | 97.4% | 0.97 |
| Unstructured | 100% | 82.6% | 89.7% | 0.90 |
McNemar (accuracy): p = 1.000 — not significant. With the strong reform prompt Qwen is already reliable, so structure adds little here — its payoff is larger with weak prompts. Qwen 3.6 · reform · temp 0 · 39 bios × 5.
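Comparing structured and unstructured output requires turning both kinds of reply into a label before scoring. The deck does not show how this was done, so the following is only a hypothetical sketch: it assumes structured replies are JSON objects with a `category` key (the key used in the scoring code) and that plain-text replies name exactly one category. The `CATEGORIES` spellings and the `parse_label` helper are assumptions.

```python
import json

# Assumed label spellings, taken from the predictions shown in this deck.
CATEGORIES = [
    "General Public",
    "Media",
    "Business Stakeholder",
    "Politician or Government Account",
]


def parse_label(raw):
    """Extract a category from a JSON or plain-text reply, else None."""
    text = (raw or "").strip()
    try:
        obj = json.loads(text)
        if isinstance(obj, dict):
            label = obj.get("category")
            return label if label in CATEGORIES else None
    except json.JSONDecodeError:
        pass  # not JSON: fall through to plain-text matching
    # Plain text: accept only an unambiguous mention of a single category.
    hits = [c for c in CATEGORIES if c.lower() in text.lower()]
    return hits[0] if len(hits) == 1 else None


structured = parse_label('{"category": "Media", "explanation": "reporter"}')
unstructured = parse_label("Category: General Public")
ambiguous = parse_label("Either Media or General Public")
```

Substring matching is deliberately strict here: a reply that mentions two categories, or uses a word like "media" inside a longer explanation alongside another label, is treated as non-compliant rather than guessed at.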