Seeing History Unseen: Evaluating Vision-Language Models for WCAG-Compliant Alt-Text in Digital Heritage Collections

Mähr, Moritz; Twente, Moritz

doi:10.63744/njQVYcLndSPE

Seeing History Unseen

Evaluating Vision-Language Models for WCAG-Compliant Alt-Text in Digital Heritage Collections

Open-access full text and machine-readable version of the ACH 2025 article.

Published

21 November 2025

This article is published open access under CC BY 4.0. This page provides the canonical DOI, a machine-readable full-text version for better discoverability.

Citation

Authors: Moritz Mähr, Moritz Twente
Title: Seeing History Unseen: Evaluating Vision-Language Models for WCAG-Compliant Alt-Text in Digital Heritage Collections
Journal: Anthology of Computers and the Humanities, Volume 3, 2025, pp. 1147-1167
DOI: https://doi.org/10.63744/njQVYcLndSPE
License: CC BY 4.0

Downloads

Publisher page: https://anthology.ach.org/volumes/vol0003/seeing-history-unseen-evaluating-vision-language/
Machine-readable full text (TXT): seeing-history-unseen.txt

Full Text (machine-extracted)

The following text was extracted from the open-access PDF to improve indexing in search engines and AI-assisted systems. For citation and page references, use the publisher version at the DOI.

Anthology of Computers and the Humanities, Vol. 3

Seeing History Unseen: Evaluating Vision-Language
Models for WCAG-Compliant Alt-Text in Digital
Heritage Collections
Moritz Mähr1,2 , and Moritz Twente1
1
Stadt.Geschichte.Basel, University of Basel, Basel, Switzerland
2
Digital Humanities, University of Bern, Bern, Switzerland

Abstract
Digitized heritage collections remain partially inaccessible because images often lack de-
scriptive alternative text (alt-text). We evaluate whether contemporary Vision-Language
Models (VLMs) can assist in producing WCAG-compliant alt-text for heterogeneous his-
torical materials. Using a 100-item dataset curated from the Stadt.Geschichte.Basel Open
Research Data Platform—covering photographs, maps, drawings, objects, diagrams, and
print ephemera across multiple eras—we generate candidate descriptions with four VLMs
(Google Gemini 2.5 Flash Lite, Meta Llama 4 Maverick, OpenAI GPT-4o mini, Qwen 3 VL
8B Instruct). Our pipeline fixes WCAG and output constraints in the system prompt and
injects concise, collection-specific metadata at the user turn to mitigate “lost-in-the-middle”
effects. Feasibility benchmarks on a 20-item subset show 100% coverage, latencies of ~2–4 s
per item, and sub-cent costs per description. A rater study with 21 humanities scholars ranks
per-image model outputs; Friedman and Wilcoxon tests reveal no statistically significant
performance differences, while qualitative audits identify recurring errors: factual misrecog-
nition, selective omission, and uncritical reproduction of harmful historical terminology.
We argue that VLMs are operationally viable but epistemically fragile in heritage contexts.
Effective adoption requires editorial policies, sensitivity filtering, and targeted human-in-the-
loop review, especially for sensitive content and complex figures. The study contributes a
transparent, reproducible workflow, a small but representative evaluation set, and an initial
cost–quality baseline to inform GLAM institutions considering AI-assisted accessibility at
scale.

Keywords: alt-text, vision-language models, accessibility, WCAG 2.2, digital heritage
collections, historical accuracy, human-in-the-loop, ethical implications, metadata, disability
justice

1 Introduction
Digital archives promised to democratize access to cultural heritage, yet a significant portion of
visual historical content remains inaccessible to people who are blind or have low vision. Many
digitized photographs, maps, manuscripts, and other images lack descriptive alternative text (alt-
text), creating an epistemic barrier to the past. This perpetuates an asymmetry in sensory access
to history, where sighted people hold privileged insight into visual sources while non-sighted au-
diences encounter barriers to engagement. Making images legible through text is more than a
technical fix—it is a matter of historical justice and inclusivity in digital humanities. Even beyond
blind and low-vision users, rich image descriptions can aid others, such as neurodivergent readers
who benefit from explicit detail that sighted users might glean implicitly [2].
Moritz Mähr, and Moritz Twente. “Seeing History Unseen: Evaluating Vision-Language Models for WCAG-Compliant
Alt-Text in Digital Heritage Collections.” In: Computational Humanities Research 2025, ed. by Taylor Arnold,
Margherita Fantoli, and Ruben Ros. Vol. 3. Anthology of Computers and the Humanities. 2025, 1147–1167.
https://doi.org/10.63744/njQVYcLndSPE.
© 2025 by the authors. Licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).

1147
Alt-text itself is not new: the HTML alt attribute dates back to the 1990s to support acces-
sibility. However, providing high-quality image descriptions has often been a secondary priority
in scholarly communication [3]. Crafting alt-text is labor-intensive and typically left to authors
or curators as a final step, if done at all. The burden often falls on sighted domain experts (not
accessibility experts) to determine what information is or is not included in an image’s descrip-
tion. Human-generated descriptions are valued for capturing contextual meaning and can greatly
enhance the accessibility, searchability, and archivability of digital scholarship. Yet in practice,
many projects—especially smaller public history initiatives—lack the resources to implement ac-
cessibility from the start. The result is that visual evidence remains “unseen” by those who rely on
assistive technologies.
Recent advances in multimodal AI offer a potential remedy. Vision-Language Models (VLMs)
such as OpenAI’s GPT-4o mini, Google’s Gemini 2.5 Flash Lite, and open-weight systems like
Meta’s Llama 4 Maverick or Qwen’s Qwen3 VL 8B Instruct now claim near-human performance
in image description tasks. These models can ingest an image and generate a caption or descrip-
tion, essentially simulating the interpretive act of a human describer. If these models could produce
alt-text that is both high-quality and historically informed as well as aligned with the Web Con-
tent Accessibility Guidelines (WCAG 2.2) [14], this would dramatically reduce the human effort
required to remediate large collections. Heritage institutions could then scale up accessibility by
generating alt-text for thousands of images, because the costs of machine captioning are negligible
in comparison to a human expert. Consequently, the “readership” of digital archives would expand
to include those who were previously excluded.
However, adopting automated captioning in a heritage context raises critical questions about
truth, evidence, and authenticity. Delegating descriptive labor to machines is not a neutral technical
fix; it is an act imbued with values and biases [1]. Deciding what details to include in an image’s
description is technically difficult and ethically fraught, especially for historical images depicting
people or sensitive cultural content. Vision models trained on general web images may uncritically
adopt source terminology, inject anachronistic biases (e.g., misidentifying a 1920s street scene as
“Victorian”), reinforce curatorial blind spots, or omit crucial context that a human historian would
provide. There is also the danger of techno-ableism [12], where the needs of people who are
blind are superficially addressed by technology without truly empowering them or respecting their
perspectives. Uncritical use of AI could inadvertently recenter the sighted, algorithmic point of
view rather than the lived experience of those using the alt-text.
In this work, we argue that AI-generated alt-text for historical collections is a pivotal test case
for the entanglement of AI innovation, archival practice, and disability justice. But can a machine
“see” history as we do? If a model can convincingly describe a photograph from 100 years ago,
how does that change the way we verify and trust such descriptions? Embracing this kind of ma-
chine vision in historical scholarship may require new protocols akin to earlier paradigm shifts (for
example, the move from handwritten catalog cards to MARC records, or from microfilm to digital
scans). Just as those changes demanded critical awareness of how tools shape historical discovery,
the use of AI-generated descriptions demands a new hermeneutic of suspicion. We must learn to
critically read machine-generated metadata, much as we read any human-produced finding aid or
annotation [5]. The central purpose of our study is to assess whether and how current AI mod-
els can serve as accessibility assistants in a digital history workflow, and to critically examine
the conditions and implications of their responsible use. Our approach is interdisciplinary, blend-
ing computational experimentation with qualitative, historiographically informed analysis. The
research design comprises the following steps:

1. Data compilation: We compile a small yet balanced dataset consisting of historical sources
and research data.

1148
2. Model selection and prompt development: We conduct WCAG-aligned prompt engineer-
ing and model selection in an iterative and exploratory manner.

3. Generation and data collection: Once an optimal configuration of prompts and models
has been identified, we generate candidate alternative texts (alt-text) and collect quantitative
data on coverage, throughput, and unit cost.

4. Expert evaluation: A group of 21 domain experts—humanities scholars with relevant dis-
ciplinary expertise—evaluate and rank the AI-generated alt-text.

5. Expert review: The authors qualitatively assess a selection of the highest-ranked alt-text
for factual accuracy, contextual adequacy, and bias reproductions.

6. Analysis: We perform both statistical and qualitative analyses of the data obtained in steps
3–5.

By doing so, we aim to illuminate both the opportunities and the pitfalls of integrating AI into
inclusive humanities scholarship.

1.1 Research questions
To guide this inquiry, we pose the following research questions:

• RQ1 Feasibility: What coverage, throughput, and unit cost can current VLMs achieve
for WCAG-aligned alt-text on a heterogeneous heritage corpus, and where do they fail?

• RQ2 Relative quality: How do experts rank model outputs? What error patterns recur?

By answering these questions, our work helps to establish an empirical baseline for AI-assisted
accessibility in the humanities. It also offers a reflective critique, examining AI outputs as objects
of study in their own right. In the following sections, we outline our data and methodology (Sec-
tion 2), present initial observations from our experiments (Section 3), and discuss implications for
digital humanities practice (4), before concluding with planned next steps (Section 5).

2 Data & Methodology
To ground our evaluation in a real-world scenario, we use data from Stadt.Geschichte.Basel, a
large-scale historical research project tracing the history of Basel from 50’000 BCE to the present
day. Research data is FAIRly available on the project’s Open Research Data Platform with meta-
data in a Dublin Core schema created by our Team for Research Data Management Team in a
comprehensive annotation workflow, following guidelines set out in our handbook for the creation
of non-discriminatory metadata [9].
Crucially, alt-text has been missing in our data model until now, rendering this collection an
ideal testing ground for our study. The diversity of the corpus poses a significant challenge to au-
tomated captioning: many figures are visually and historically complex, requiring domain knowl-
edge to describe properly. This data thus allows us to investigate whether AI captioners can handle
the ’long tail’ of content found in historical archives, beyond the everyday photographs on which
many models are trained [4].

1149
2.1 Dataset for Alt Text Generation and Evaluation
For our survey, we compiled a dataset designed to represent both the heterogeneity of media types
and the timeframe covered by the Stadt.Geschichte.Basel project. The project collection, published
on our Open Research Data Platform, features more than 1700 media objects including metadata as
of October 2025. From this corpus, we created a dataset to use for alt-text generation trials. This
dataset comprises a hundred items and is released with this paper to be used for benchmarking
purposes. Additionally, we created a subset of 20 items to make it more feasible to evaluate alt-
text in an expert survey. For both sets, items were selected to maintain representativeness across
the same dimensions while being manageable for expert reviewers to assess within a reasonable
time frame. (See Appendix A.1 for a more detailed description of the dataset).
All items were categorized into ten distinct media types (e.g. paintings, maps, scans of news-
papers etc., see Appendix A.1), allowing us to ensure a balanced distribution of content. Data types
primarily comprise images and figures, maps and geodata, tables and statistics, and bibliographic
references [8]: Heterogeneous digitized items including historical photographs, reproductions of
artifacts, city maps and architectural plans, handwritten letters and manuscripts, statistical charts,
and printed ephemera (e.g., newspaper clippings, posters). We made sure to include items with
complex visual structures (items that need additional information to convey their meaning, e.g., a
legend for maps or diagrams), items with visible text in different languages (e.g., scans of news-
papers or posters) as well as items with potentially sensitive content (e.g., content with derogatory
and/or racist terminology).
To prompt the models as described below, we used JPG files at a standardized size of 800×800
pixels – the same resolution employed for human viewers on our online platform – and their cor-
responding metadata in JSON format.

2.2 Dataset Limitations
The number of eligible items is reduced by excluding items that are only available with placeholder
images on our platform due to copyright restrictions. Additionally, due to the typesetting workflow
during the production of the printed volumes, some collection items had to be split up into different
files – maps and charts where the legend is provided in a second image file, separate from the main
figure. This pertains to 19 out of 100 items in our data set, respectively four out of 20 items in
the survey. Connections between these segmented files are made explicit in our metadata, but the
models only receive one image file as input at a time, leading to some loss of information that
would be visually available to a human reader. This could result in a lower description quality.

2.3 Model Selection
We selected four multimodal vision-language models (VLMs) representing a balance of open-
weight and proprietary systems with comparable cost and capability:1
Selection criteria:

• Openness & diversity – two proprietary (OpenAI, Google) and two open-weight (Meta,
Qwen) models.

• Cost-capability parity – all models priced between $0.08–$0.15/M input and $0.40–
$0.60/M output tokens, with ≥100K context windows.

• Multilingual & visual competence – explicit support for German and image understanding.
1
Cheaper models such as Mistral Pixtral 12B, AllenAI Molmo 7B-D, and OpenAI GPT-4.1 Nano were tested but
excluded due to consistently empty or nonsensical outputs.

1150
Attribute Gemini 2.5 Flash Llama 4 GPT-4o mini Qwen3 VL 8B
Lite Maverick Instruct
Developer Google Meta OpenAI Alibaba
Openness Proprietary Open weights Proprietary Open-weight
Context 1.05M 1.05M 128K 131K
Latency (s) 0.42 0.56 0.58 1.29
Input $/M 0.10 0.15 0.15 0.08
Output $/M 0.40 0.60 0.60 0.50
Notes Fast, low-cost; Multilingual, Compact GPT-4o Robust open
optimized for multimodal variant; strong baseline with
captioning. reasoning. factual grounding. OCR features.
Table 2: Models used for evaluation. Data reported by OpenRouter (27.10.2025).

The aim was to cover diverse architectures and governance regimes while maintaining fairness
in performance evaluation.

2.4 Prompt engineering
We systematically varied prompt roles and placement, comparing instruction blocks in the system
prompt versus the user prompt, and front-loading versus trailing constraints. We used the same
user and system prompts for all models in zero-shot mode. Following evidence that models privi-
lege information in short and well structured prompts, we fixed normative requirements (WCAG
2.2 aligned, de-CH style, length limits, handling of decorative/functional/complex images) in the
system prompt and kept the user prompt minimal and image-bound to reduce “lost-in-the-middle”
effects [7]. The user prompt injected collection-specific metadata—title, description, EDTF date,
era, creator/publisher/source—and a concise description of the purpose of the alt text, then the im-
age URL. Adding this structured context markedly improved specificity, reduced refusals, and low-
ered hallucinations, consistent with retrieval-style findings that supplying external, task-relevant
evidence boosts generation quality and faithfulness. Recent work confirms that vision–language
models can serve such accessibility roles when embedded in context-rich pipelines. In particular,
user studies with blind and low-vision participants demonstrate that context-aware image descrip-
tions—those combining visual and webpage metadata—are preferred and rated higher for quality,
imaginability, and plausibility than context-free baselines [10]. This supports our design choice to
inject structured collection metadata into the prompt.. These results are consistent with findings
that prompt structure and multimodal fusion can systematically shift which visual cues VLMs
rely on [6]. By anchoring metadata before the image input, we effectively steer the model toward
shape- and context-based reasoning rather than shallow texture correlations—an effect analogous
to prompt-based cue steering observed in vision-language bias studies.
1 def build_prompt (media: MediaObject ) -> str:
2 return f""" Titel : { media . title or "Kein Titel "}
3 Beschreibung : {media. description or "Keine Beschreibung "}
4 Ersteller : {media . creator or "Kein Ersteller "}
5 Herausgeber : {media. publisher or "Kein Herausgeber "}
6 Quelle : { media . source or " Keine Quelle "}
7 Datum: { media.date or "Kein Datum "}
8 Epoche : { media .era or " Keine Epoche "}""".strip ()
9
10 def build_messages (

1151
11 prompt : str , image_url : str
12 ) -> tuple[list[dict[str , Any ]], str , str ]:
13 system = """ZIEL
14
15 Alt -Texte für historische und archäologische Sammlungsbilder .
16 Kurz , sachlich , zugänglich . Erfassung der visuellen Essenz für Screenreader .
17
18 REGELN
19
20 1. Essenz statt Detail . Keine Redundanz zum Seitentext , kein „Bild “von.
21 2. Zentralen Text im Bild wiedergeben oder kurz paraphrasieren .
22 3. Kontext (Epoche , Ort , Gattung , Material , Datierung ) nur bei Relevanz für
,→ Verständnis .
23 4. Prägnante visuelle Merkmale nennen : Farbe , Haltung , Zustand , Attribute .
24 5. Karten / Diagramme : zentrale Aussage oder Variablen .
25 6. Sprache : neutral , präzise , faktenbasiert ; keine Wertung , keine Spekulation .
26 7. Umfang :
27 * Standard : –90180 Zeichen
28 * Komplexe Karten / Tabellen : max. 400 Zeichen
29
30 VERBOTE
31
32 * Kein alt=, Anführungszeichen , Preambeln oder Füllwörter „(“zeigt ,
,→ „“darstellt ).
33 * Keine offensichtlichen Metadaten (z. B. Jahreszahlen aus Beschriftung ).
34 * Keine Bewertungen , Hypothesen oder Stilkommentare .
35 * Keine Emojis oder emotionalen Begriffe .
36
37 HEURISTIKEN
38
39 Porträt : Person (Name , falls bekannt ), Epoche , Pose oder Attribut , ggf.
,→ Funktion .
40 Objekt : Gattung , Material , Datierung , auffällige Besonderheit .
41 Dokument : Typ , Sprache /Schrift , Datierung , Kernaussage .
42 Karte: Gebiet , Zeitraum , Zweck , Hauptvariablen .
43 Ereignisfoto : Wer , was , wo , situativer Kontext .
44 Plakat / Cover: Titel , Zweck , zentrale Schlagzeile .
45
46 FALLBACK
47
48 Unklarer Inhalt : generische , aber sinnvolle Essenz aus Metadaten .
49
50 QUELLEN
51
52 Nur visuelle Analyse ( Bildinhalt ) und übergebene Metadaten . Keine externen
,→ Kontexte .""".strip ()
53 return (
54 [
55 {"role": " system ", " content ": system },
56 {
57 "role": "user",
58 " content ": [
59 {"type": "text", "text": prompt },
60 {"type": " image_url ", " image_url ": {"url": image_url }},
61 ],
62 },
63 ],
64 system ,
65 prompt ,
66 )

1152
2.5 Alt Text Generation and Post-processing
Using the carefully engineered system and user prompts, we ran each image through each of the
four models, yielding four candidate descriptions per image. The generation process was auto-
mated via a Python script using OpenRouter as an API wrapper. We produced 80 candidate alt-texts
(4 per image for n=20 images in our survey). After generation, no post-processing was applied.
All results were stored along with metadata and model identifiers for evaluation.
No model refused to describe an image due to some built-in safety filter (labelling a historical
photograph as sensitive content). Otherwise we would have handled those on a case-by-case basis
by leaving that image for human description. Overall, this pipeline is designed to be simple, and
maximize coverage (getting at least one description for every image) while maintaining quality
through careful prompting.

2.6 Survey
Twenty-one humanities scholars ranked, per image, four model-generated descriptions from best
(1) to worst (4) under WCAG-intended criteria for alt text. Raters were asked to consider: (a)
concise rendering of the core visual content; (b) avoidance of redundant phrases (e.g., “image
of”); (c) prioritisation of salient visual features (persons, objects, actions, visible text); and (d)
context inclusion only when it improves comprehension. While factual accuracy, complete-
ness, and absence of bias were not primary ranking dimensions, they may have been factored in
implicitly.

2.7 Close reading
To check for these dimensions, the authors conducted a qualitative close reading of a selection
of the generated alt-text. This analysis specifically targeted outputs that had received the highest
rankings from the expert panel.

• Factual accuracy (Did the generated description contain any incorrect identifications of
people, objects, or actions?)

• Contextual adequacy (Did the generated description include any incorrect or misleading
historical context?)

• Bias reproduction (Did the model reproduce sensitive, derogatory, or racist terminology
from the source material?)

This allowed us to investigate whether an alt-text could be ranked highly for WCAG alignment
while simultaneously being factually incorrect or ethically problematic.

3 Results and Analysis
3.1 RQ1 Feasibility: Coverage, Throughput, and Unit Cost
To address the feasibility of automatic alt-text generation at corpus scale, we compared four state-
of-the-art vision–language models (VLMs): Google Gemini 2.5 Flash Lite, Meta Llama 4 Mav-
erick, OpenAI GPT-4o mini, and Qwen 3 VL 8B Instruct. Each model generated alt-text de-
scriptions for 20 representative heritage images selected for diversity of content, medium, and
metadata completeness.

1153
3.1.1 Coverage and reliability
All models returned non-empty outputs for all 20 prompts, yielding 100 % coverage and no failed
responses. This demonstrates that current VLMs can reliably produce textual descriptions even
for heterogeneous heritage data without the need for fallback mechanisms.
3.1.2 Throughput and latency
Processing speed ranged between 0.24–0.43 items/s, corresponding to median latencies of 2–4 s
per item. Models were accessed via OpenRouter.ai with the following providers: Google – gemini-
2.5-flash-lite, OpenAI – gpt-4o-mini, Together – llama-4-maverick, and Alibaba – Qwen 3 VL 8B.
Qwen 3 VL 8B achieved the fastest throughput and lowest latency, while OpenAI GPT-4o mini was
slower but consistent. All models showed moderate response-time variability (≈ 1.7–10 s).
3.1.3 Cost efficiency
Unit generation costs differed by two orders of magnitude, reflecting API pricing rather than archi-
tectural complexity. According to OpenRouter.ai cost reports, costs per item ranged from $1.8 ×
10−4 (Qwen) to $3.6 × 10−3 (OpenAI).

Coverage (%)
Throughput Median Mean Cost

(items/s) Latency (s) (USD/item)
Model
Google Gemini 2.5 Flash Lite 100 0.31 2.72 0.000215
Meta Llama 4 Maverick 100 0.41 2.38 0.000395
OpenAI GPT-4o Mini 100 0.24 4.00 0.003625
Qwen 3 VL 8B Instruct 100 0.43 2.27 0.000182

Table 3: Feasibility metrics for alt text generation (n = 20).

3.1.4 Summary of RQ1
All models achieved complete coverage, acceptable latency, and minimal cost, confirming the
technical and economic feasibility of automated alt text generation for large, heterogeneous cultural
collections. Failures were not due to empty outputs but to qualitative weaknesses, which are
examined under RQ2.

3.2 RQ2 Relative Quality: Expert Ranking and Qualitative Assessment
3.2.1 Quantitative ranking analysis
Within each task, all models were directly compared; task-level median ranks were analyzed across
the 20 tasks using the Friedman test for repeated measures, followed by pairwise Wilcoxon
signed-rank tests with Holm–Bonferroni correction. Agreement across tasks was quantified
with Kendall’s W.

χ2 (3, N = 20) = 6.02, p = 0.11, W = 0.0085.
The results indicate no statistically significant difference among models (p > 0.05) and
very low inter-task agreement (W ≈ 0.01), implying that relative rankings varied substantially
by task. Pairwise Wilcoxon comparisons (Appendix A.4) showed no significant differences after

1154
correction (pHolm > 0.5); unadjusted p-values suggested weak, non-significant trends favoring
OpenAI GPT-4o Mini and Qwen 3 VL 8B over Google Gemini and Meta Llama.
3.2.2 Descriptive patterns
Twenty-one human experts each rated four alternative texts for 20 images, yielding a total of 420
individual ratings:

Model Rank 1 Rank 2 Rank 3 Rank 4
Google Gemini 2.5 Flash Lite 86 84 113 137
Meta Llama 4 Maverick 88 114 110 108
OpenAI GPT-4o Mini 132 101 84 103
Qwen 3 VL 8B Instruct 114 121 113 72

OpenAI and Qwen outputs received more first-place and fewer last-place rankings, but over-
lapping rank distributions (Appendix A.3) indicate that these tendencies remain descriptive rather
than inferentially significant.
3.2.3 Qualitative evaluation of top-rated outputs
A manual close reading inspection of hand-picked alt texts with highest mean rank scores revealed
that even those outputs deemed to be the ‘best’ were not free from substantive and ethical short-
comings. In fact, all models produced at least one error. Some are easy to catch in a manual
review (factually wrong descriptions), others require expert domain knowledge (reproduction of
stereotypes).

Figure 1: Faltblatt der Gruppe ‹Freiräume für Frauen› (FFF) (m92410)

Example for Factually Wrong Text (Figure 1)

• Best-ranked alt-text: “Faltblatt der Gruppe ‹Freiräume für Frauen› (FFF) von 1992. Zwei
Personen gehen eine Treppe hinunter, Frau mit Zigarette, Mann mit Sonnenbrille und Hän-
den in den Hosentaschen” (Google Gemini 2.5 Flash Lite)

1155
• Translation: “Leaflet of the ’Freiräume für Frauen’ (FFF) group from 1992. Two people
walk down a staircase, a woman with a cigarette, a man with sunglasses and hands in his
pockets”
• Shortcoming: This description is factually incorrect. The two people in the photograph are
clearly walking up the staircase, not down.

Figure 2: ‹Gegen den Eisenbahnhandel›, anonyme antisemitische Karikatur in der Extraausgabe
der Basler ‹Allgemeinen Schweizer Zeitung› vom 14. Februar 1898 (m91960)

Example for Reproduction of Stereotypes (Figure 2)
• Best-ranked alt-text: “Antisemitische Karikatur zur Eisenbahnverstaatlichung 1898.
Links: Wehrhafte Helvetia zwingt knienden Geldverleiher nieder. Rechts: Kniende Hel-
vetia bittet reichen Juden um Geld. Kontrastdarstellung zum Volksentscheid über die
Eisenbahnverstaatlichung” (Meta Llama 4 Maverick)
• Translation: “Antisemitic cartoon on the nationalization of the railways 1898. Left: Defen-
sive Helvetia forces a kneeling moneylender down. Right: Kneeling Helvetia asks a rich Jew
for money. Contrasting depiction of the referendum on the nationalization of the railways”
• Shortcoming: The text uncritically reproduces the term “reichen Juden” (rich Jew). This is
an antisemitic stereotype likely drawn from the item’s metadata or biased training data, not
a neutral description of the visual content.

Example for Selective Description (Figure 3)
• Best ranked alt-text: “Balkendiagramm: Zuschauerzahlen von FC Basel (blau) und Konz-
erten (gelb) im St. Jakob-Stadion, 1960–2022. Höchstwerte bei Fußballspielen ab 2001 im
neuen St. Jakob-Park, Einbruch 2020 durch Corona-Pandemie” (Meta Llama 4 Maverick)
• Translation: “Bar chart: Audience numbers for FC Basel (blue) and concerts (yellow) at
St. Jakob-Stadion, 1960–2022. Peak values for football matches from 2001 in the new St.
Jakob-Park, slump in 2020 due to the Corona pandemic”
• Shortcoming: The description is selective and unbalanced. It provides a detailed interpre-
tation of the trends for the football matches (blue bars) but does not give information or
interpretation about neither the concert attendance (yellow bars), nor the number of football
matches (green circles), omitting a lot of the chart’s comparative data.

1156
Figure 3: Zuschauerzahlen von FC Basel und Konzerten im St. Jakob-Stadion, 1960–2022
(m88415_1)

3.2.4 Interpretation
These examples for erroneous alt texts that still were ranked best in our survey illustrate that high
quantitative rankings do not imply factual accuracy or ethical adequacy as illustrated by a close
reading. Even when linguistically fluent and stylistically polished, VLM-generated alt texts may
introduce epistemic distortions or perpetuate historical bias.
Quantitatively, no model achieved a statistically distinct performance profile; qualitatively,
all exhibited systematic error patterns—misrecognition, omission, and uncritical reproduction
of harmful source language. This combination highlights the limits of rank-based evaluation
alone: expert preference captures relative quality but not factual or ethical soundness.

3.3 Synthesis
• RQ1 Feasibility: All four VLMs achieved full coverage, low latency, and negligible cost,
confirming the operational viability of automated WCAG-aligned alt text generation for her-
itage corpora.

• RQ2 Relative quality: Expert rankings showed no statistically significant hierarchy among
models (p = 0.11, W ≈ 0.01), and qualitative inspection exposed factual inaccuracies,
biased reproduction, and selective omissions even in top-rated outputs.

Overall, current VLMs can populate heritage databases at scale but require expert review and
critical post-editing to ensure factual precision, ethical compliance, and contextual adequacy.
Automated alt text workflows should therefore combine model ensembles with targeted human
oversight to meet both accessibility and historiographical standards.

4 Discussion
Our findings confirm the central tension in using contemporary VLMs for heritage accessibility:
they are operationally feasible but epistemically fragile. The 100% coverage, low latency, and
negligible cost (RQ1) demonstrate that the technical and economic barriers to generating descrip-
tions at a corpus-wide scale are virtually gone. However, the results from our expert evaluation
(RQ2) reveal a gap between this operational success and the production of high-quality, trustworthy
alt text.

1157
The lack of a statistically significant winner among the models, combined with the low inter-
task agreement (W ≈ 0.01), is a key finding. It suggests that no single model is a reliable
one-shot solution. A model that performs well on a photograph might fail on a diagram, and vice-
versa. This variability reinforces the findings of mechanistic analyses like [6], which show that
VLM outputs are highly sensitive to how their fusion layers mediate visual cues. Our metadata-
rich prompts likely steered models toward more context-aware descriptions, but this “cue steering”
was not a panacea against factual or ethical errors.
Critically, our mixed-methods approach exposed the limits of rank-based evaluation alone.
The qualitative close reading (Section 3.2.3) revealed that outputs ranked highly by experts for
WCAG alignment (i.e., conciseness and style) could still be factually wrong, ethically problematic,
or epistemically shallow. The FFF flyer (m92410) example, which confidently misidentifies the
walking direction, and the antisemitic cartoon (m91960) example, which uncritically reproduces
the term “reicher Jude” (rich Jew) from the source’s metadata, are stark illustrations. Fluency, in
essence, is not a proxy for accuracy or ethical adequacy.
This study validates the framing of AI as an accessibility assistant (Section 1) rather than an
autonomous author. The VLM output should be treated as a first draft for human review, not a final
product. This reframes the labor of digital humanists: from authoring descriptions from scratch
to critically editing machine-generated drafts. This aligns with the hybrid assessment frameworks
proposed in educational research [11] and necessitates the “new hermeneutic of suspicion” [5]
advocated in our introduction. Curators and historians must be trained in AI literacy [13] to spot
subtle biases and misinterpretations that a fluent-sounding description might otherwise obscure.
Finally, our study has limitations. The expert ranking (n=21) was based on a relatively small
(n=20) subset of images, which, while diverse in content, limits the statistical power of our quan-
titative analysis. Furthermore, the survey criteria explicitly prioritized WCAG stylistic guidelines
over factual accuracy, a dimension we could only capture post-hoc via our qualitative close read-
ing. A crucial missing component, which we intentionally bracketed to first establish a baseline, is
the perspective of blind and low-vision users themselves. Without their input, any AI-driven ac-
cessibility solution risks falling into the trap of “techno-ableism” [12], designing for a community
without designing with them.

5 Future Work
Building on this study’s findings, we identify several critical paths for future research to bridge the
gap between the operational promise and epistemic fragility of AI-generated alt text for heritage.

• Usability and User-Experience (UX) Studies: The most urgent next step is to move beyond
expert-as-proxy and engage directly with blind and low-vision (BLV) users. Future work
should conduct qualitative usability studies to assess how BLV readers experience these AI-
generated descriptions. Do they find them helpful, confusing, or biased? Does the inclusion
of metadata (as prompted) improve or hinder “imaginability”? This addresses the techno-
ableism critique and centers the lived experience of those the technology claims to serve.

• Domain-Specific Fine-Tuning: This study relied on general-purpose VLMs with prompt
engineering. A promising avenue is the fine-tuning of open-weight models (such as Meta’s
Llama 4 or Qwen’s Qwen3 VL) on a high-quality, domain-specific dataset. By training a
model on thousands of expert-vetted alt texts from GLAM (Galleries, Libraries, Archives,
and Museums) collections, it may be possible to create a model that is more factually ac-
curate, context-aware, and ethically sensitive to heritage content than its general-purpose
counterparts.

1158
• Developing Human-in-the-Loop (HITL) Workflows: Our results confirm the necessity of
expert review. Future research should move from evaluation to implementation by designing
and testing HITL editorial interfaces. What is the most effective workflow for a historian to
review, correct, and approve AI-generated alt text? How can we best integrate AI-generated
“drafts” into existing collections management systems (CMS) and research data platforms,
complete with policies for handling sensitive content?

• Scaling the Benchmark: This study established a 100-item benchmark dataset. The next
phase should involve using this larger dataset to conduct a more robust quantitative analysis.
This would allow for a more granular breakdown of model performance by media type (e.g.,
maps vs. manuscripts vs. photographs) and help establish more reliable cost-quality trade-
offs to guide GLAM institutions in adopting these technologies.

Acknowledgements
We thank Cristina Münch and Noëlle Schnegg for curating metadata and providing image assets
from the Stadt.Geschichte.Basel collection. We are grateful to the 21 expert participants for their
careful rankings and comments. For insightful feedback on an early draft, we thank Dr. Mehrdad
Almasi (Luxembourg Centre for Contemporary and Digital History, C²DH). Any remaining errors
are our own.

References
[1] Bowker, Geoffrey C. and Star, Susan Leigh. Sorting Things out: Classification and Its Con-
sequences. Inside Technology. Cambridge, Massachusetts: MIT Press, 1999. 377 pp. ISBN:
978-0-262-02461-7.
[2] Cecilia, Rafie, Moussouri, Theano, and Fraser, John. “AltText: An Institutional Tool for
Change”. In: Curator 66, no. 2 (2023), pp. 225–231. DOI: 10.1111/cura.12551.
[3] Cecilia, Rafie, Moussouri, Theano, and Fraser, John. “Creating Accessible Digital Images
for Vision Impaired Audiences and Researchers”. In: Curator 66, no. 1 (2023), pp. 5–8.
DOI: 10.1111/cura.12536.
[4] Cetinic, Eva. “Towards Generating and Evaluating Iconographic Image Captions of Art-
works”. In: Journal of Imaging 7, no. 8 (July 23, 2021), p. 123. ISSN: 2313-433X. DOI:
10.3390/jimaging7080123. PMID: 34460759. URL: https://pmc.ncbi.nlm.nih.
gov/articles/PMC8404909/ (visited on 10/28/2025).
[5] Fickers, Andreas. “Digital Hermeneutics: The Reflexive Turn in Digital Public History?”
In: Handbook of Digital Public History, ed. by Serge Noiret, Valérie Schafer, and Gerben
Zaagsma. De Gruyter, 2022, pp. 139–148. DOI: 10.1515/9783110430295-012.
[6] Gavrikov, Paul, Lukasik, Jovita, Jung, Steffen, Geirhos, Robert, Mirza, M. Jehanzeb, Keu-
per, Margret, and Keuper, Janis. “Can We Talk Models Into Seeing the World Differently?”
Version 2. Mar. 5, 2025. DOI: 10.48550/arXiv.2403.09193. arXiv: 2403.09193 [cs].
URL: http://arxiv.org/abs/2403.09193 (visited on 10/27/2025). Pre-published.
[7] Liu, Nelson F., Lin, Kevin, Hewitt, John, Paranjape, Ashwin, Bevilacqua, Michele, Petroni,
Fabio, and Liang, Percy. “Lost in the Middle: How Language Models Use Long Contexts”.
Nov. 20, 2023. DOI: 10.48550/arXiv.2307.03172. arXiv: 2307.03172 [cs]. URL:
http://arxiv.org/abs/2307.03172 (visited on 10/28/2025). Pre-published.
[8] Mähr, Moritz. “Research Data Management in (Public) History”. Keynote. Istituto Svizzero
di Roma, June 2022. URL: https://doi.org/10.5281/zenodo.6637118.

1159
[9] Mähr, Moritz and Schnegg, Noëlle. “Handbuch zur Erstellung diskriminierungsfreier Meta-
daten für historische Quellen und Forschungsdaten: Erfahrungen aus dem geschichtswis-
senschaftlichen Forschungsprojekt Stadt.Geschichte.Basel”. Basel: Zenodo, June 2024.
DOI: 10.5281/ZENODO.11124720.
[10] Mohanbabu, Ananya Gubbi and Pavel, Amy. “Context-Aware Image Descriptions for Web
Accessibility”. In: The 26th International ACM SIGACCESS Conference on Computers and
Accessibility. Oct. 27, 2024, pp. 1–17. DOI: 10.1145/3663548.3675658. arXiv: 2409.
03054 [cs]. URL: http://arxiv.org/abs/2409.03054 (visited on 10/27/2025).
[11] Reihanian, Iman, Hou, Yunfei, Chen, Yu, and Zheng, Yifei. “A Review of Generative AI
in Computer Science Education: Challenges and Opportunities in Accuracy, Authenticity,
and Assessment”. Version 1. June 17, 2025. DOI: 10.48550/arXiv.2507.11543. arXiv:
2507.11543 [cs]. URL: http://arxiv.org/abs/2507.11543 (visited on 10/27/2025).
Pre-published.
[12] Shew, Ashley. Against Technoableism: Rethinking Who Needs Improvement. New York: W.
W. Norton, 2023.
[13] Strien, Daniel van, Bell, Mark, McGregor, Nora Rose, and Trizna, Michael. “An Introduc-
tion to AI for GLAM”. In: Proceedings of the Second Teaching Machine Learning and
Artificial Intelligence Workshop. The Second Teaching Machine Learning and Artificial In-
telligence Workshop. PMLR, Mar. 14, 2022, pp. 20–24. URL: https://proceedings.
mlr.press/v170/strien22a.html (visited on 10/28/2025).
[14] World Wide Web Consortium. “Web Content Accessibility Guidelines (WCAG) 2.2”. 2023.
URL: https://www.w3.org/TR/WCAG22/.

A Appendix
A.1 Dataset description
In both our selection of 100-item and the 20-item survey subset, we tried to find an overall balance
between all data types and eras that make up the Stadt.Geschichte.Basel collection.
A.1.1 Distribution by Type
Due to the collection’s historical nature, not all data types appear in all eras, and the smaller size
of the survey subset accentuates these constraints. We dropped Painting items and the Antiquity
era from the survey subset due to their low prevalence in our corpus.
A.1.2 Distribution by Era
With regards to the historical eras represented in the subset, we aimed to cover the full chronologi-
cal span of the Stadt.Geschichte.Basel project, from 50’000 BCE until the 21st century. Since each
item is tagged with an era in the metadata, we could systematically select items across periods in
a way that resembles that distribution in the whole research data set (at the time of writing). Items
from some eras, e.g. Antiquity and 21st Century, are less frequent in the overall collection which
is reflected in a lower representation in our dataset.
A.1.3 Distribution across Era and Type (survey count in parentheses)
A.1.4 Language Distribution
Our research data collection primarily contains items in German, with a small number of items
in Latin, French and Dutch. We aimed to reflect this language distribution in our selection. In
a similar vein, we wanted to take into account different typographic styles. However, writing is

1160
Type Dataset Survey
Painting 12 0
Object 13 2
Photograph (Archaeological Site) 10 2
Photograph (Historical Scenes) 10 2
Scan of Newspapers, Posters, Lists, etc. 10 3
Drawing (Archaeological Reconstruction) 10 3
Drawing (Historical Drawing) 10 2
Map 10 2
Diagram (Statistics) 10 2
Diagram (Flowchart, Schema etc.) 5 2
Total 100 20

Table 4: Distribution of media types in the dataset and survey subset.

Era Dataset Survey
Protohistory 11 3
Antiquity 3 0
Middle Ages 16 2
Early Modern period 21 3
19th century 19 5
20th century 25 5
21st century 5 2
Total 100 20

Table 5: Distribution of historical eras in the dataset and survey subset.

Early Modern

Protohistory Antiquity Middle Ages 19th century 20th century 21st century
Period
Type
Scan of Newspapers, Lists, etc. 0 (0) 0 (0) 1 (0) 2 (0) 3 (2) 4 (1) 0 (0)
Photograph (Historical Scenes) 0 (0) 0 (0) 0 (0) 0 (0) 3 (1) 7 (1) 0 (0)
Photograph (Archaeological Site) 0 (0) 0 (0) 0 (0) 0 (0) 0 (0) 5 (1) 5 (1)
Object 3 (0) 0 (0) 3 (0) 6 (1) 1 (1) 0 (0) 0 (0)
Map 1 (0) 3 (0) 1 (0) 1 (0) 1 (1) 3 (1) 0 (0)
Drawing (Historical Drawing) 0 (0) 0 (0) 1 (0) 5 (2) 4 (0) 0 (0) 0 (0)
Drawing (Archaeological Reconstruction) 5 (3) 0 (0) 5 (0) 0 (0) 0 (0) 0 (0) 0 (0)
Diagram (Statistics) 0 (0) 0 (0) 0 (0) 1 (0) 2 (0) 6 (2) 1 (1)
Diagram (Flowchart, Schema etc.) 0 (0) 0 (0) 3 (2) 0 (0) 1 (0) 1 (0) 0 (0)
Painting 0 (0) 0 (0) 2 (0) 6 (0) 4 (0) 0 (0) 0 (0)

Table 6: Distribution of media types across historical eras in the survey subset (counts in paren-
theses).

1161
not fully legible in many cases anyway – since we are working with 800×800 pixel JPG image
thumbnails – and thus played only a minor factor in the selection process.

Language Dataset Survey
German 33 9
Latin 8 2
French 2 0
Without written text 57 9

Table 7: Distribution of languages in the dataset and survey subset.

A.1.5 Spatial Context
While geospatial context is not a part of our data model, most items in the Stadt.Geschichte.Basel
collection can be associated with specific locations in Basel or elsewhere. The geographical dis-
tribution of the collection items did not influence our selection process directly, but a rough cate-
gorization was done afterwards to see whether differences in geographical scope are represented
in our dataset.

Spatial Context Dataset Survey
City of Basel 60 14
Basel Region/Northwestern Switzerland/Upper Rhine 16 2
Switzerland 6 0
Switzerland and Neighbouring Countries 5 1
Europe 3 1
Worldwide 5 2
NA 5 0

Table 8: Distribution of spatial contexts in the dataset and survey subset.

A.1.6 Media Complexity
For technical reasons, some objects in our collection consist of several media items. These are
legends for maps and diagrams, visually supplying information that is crucial to fully grasp the
meaning of the media item.

Media Complexity Dataset Survey
Single-Item Object 81 19
Multiple-Item Object (Figure and separate Legend) 16 4

Table 9: Distribution of media complexity in the dataset and survey subset.

A.2 System Performance Analysis
A.3 Rank Distributions and Aggregate Performance

1162
Figure 4: Throughput, Latency, and Cost by Model

Figure 5: Counts of Ranks per Model (All Ratings)

Figure 6: Rank Distributions per Model (Task-Level Medians; Lower = Better)

1163
Table 10: Rank counts per object and model.

objectid model count_rank_1 count_rank_2 count_rank_3 count_rank_4
m12965 google/gemini-2.5-flash-lite 0 9 8 4
m12965 meta-llama/llama-4-maverick 11 0 4 6
m12965 openai/gpt-4o-mini 7 5 4 5
m12965 qwen/qwen3-vl-8b-instruct 3 7 5 6
m13176 google/gemini-2.5-flash-lite 2 4 5 10
m13176 meta-llama/llama-4-maverick 4 4 10 3
m13176 openai/gpt-4o-mini 6 6 2 7
m13176 qwen/qwen3-vl-8b-instruct 9 7 4 1
m15298_1 google/gemini-2.5-flash-lite 6 1 6 8
m15298_1 meta-llama/llama-4-maverick 2 11 7 1
m15298_1 openai/gpt-4o-mini 6 2 4 9
m15298_1 qwen/qwen3-vl-8b-instruct 7 7 4 3
m20435 google/gemini-2.5-flash-lite 5 5 8 3
m20435 meta-llama/llama-4-maverick 2 6 4 9
m20435 openai/gpt-4o-mini 3 4 6 8
m20435 qwen/qwen3-vl-8b-instruct 11 6 3 1
m22924 google/gemini-2.5-flash-lite 5 3 4 9
m22924 meta-llama/llama-4-maverick 7 3 5 6
m22924 openai/gpt-4o-mini 3 9 6 3
m22924 qwen/qwen3-vl-8b-instruct 6 6 6 3
m28635 google/gemini-2.5-flash-lite 2 6 7 6
m28635 meta-llama/llama-4-maverick 6 9 2 4
m28635 openai/gpt-4o-mini 13 3 4 1
m28635 qwen/qwen3-vl-8b-instruct 0 3 8 10
m29084 google/gemini-2.5-flash-lite 2 9 10 0
m29084 meta-llama/llama-4-maverick 2 2 2 15
m29084 openai/gpt-4o-mini 7 5 3 6
m29084 qwen/qwen3-vl-8b-instruct 10 5 6 0
m34620 google/gemini-2.5-flash-lite 3 8 9 1
m34620 meta-llama/llama-4-maverick 5 5 4 7
m34620 openai/gpt-4o-mini 4 5 3 9
m34620 qwen/qwen3-vl-8b-instruct 9 3 5 4
m37030_1 google/gemini-2.5-flash-lite 7 6 5 3
m37030_1 meta-llama/llama-4-maverick 3 7 4 7
m37030_1 openai/gpt-4o-mini 9 7 3 2
m37030_1 qwen/qwen3-vl-8b-instruct 2 1 9 9
m37716 google/gemini-2.5-flash-lite 3 2 7 9
m37716 meta-llama/llama-4-maverick 3 4 8 6
m37716 openai/gpt-4o-mini 8 8 2 3
m37716 qwen/qwen3-vl-8b-instruct 7 7 4 3
m39198_1 google/gemini-2.5-flash-lite 3 2 1 15
m39198_1 meta-llama/llama-4-maverick 4 5 9 3
m39198_1 openai/gpt-4o-mini 10 4 4 3
m39198_1 qwen/qwen3-vl-8b-instruct 4 10 7 0
m82972 google/gemini-2.5-flash-lite 11 5 3 2
m82972 meta-llama/llama-4-maverick 2 9 7 3
m82972 openai/gpt-4o-mini 5 5 4 7
m82972 qwen/qwen3-vl-8b-instruct 3 2 7 9
m88415_1 google/gemini-2.5-flash-lite 1 3 1 16
m88415_1 meta-llama/llama-4-maverick 5 9 7 0
m88415_1 openai/gpt-4o-mini 9 3 5 4
m88415_1 qwen/qwen3-vl-8b-instruct 6 6 8 1
m91000_1 google/gemini-2.5-flash-lite 1 3 6 11
m91000_1 meta-llama/llama-4-maverick 7 7 6 1
m91000_1 openai/gpt-4o-mini 5 6 2 8
m91000_1 qwen/qwen3-vl-8b-instruct 8 5 7 1
m91960 google/gemini-2.5-flash-lite 2 4 8 7
continued on next page

1164
Table 10: Rank counts per object and model (continued)

objectid model count_rank_1 count_rank_2 count_rank_3 count_rank_4
m91960 meta-llama/llama-4-maverick 10 6 4 1
m91960 openai/gpt-4o-mini 4 1 6 10
m91960 qwen/qwen3-vl-8b-instruct 5 10 3 3
m92357 google/gemini-2.5-flash-lite 8 3 5 5
m92357 meta-llama/llama-4-maverick 3 11 3 4
m92357 openai/gpt-4o-mini 5 4 7 5
m92357 qwen/qwen3-vl-8b-instruct 5 3 6 7
m92410 google/gemini-2.5-flash-lite 14 0 3 4
m92410 meta-llama/llama-4-maverick 3 6 6 6
m92410 openai/gpt-4o-mini 3 11 4 3
m92410 qwen/qwen3-vl-8b-instruct 1 4 8 8
m94271 google/gemini-2.5-flash-lite 9 3 5 4
m94271 meta-llama/llama-4-maverick 2 3 4 12
m94271 openai/gpt-4o-mini 9 4 6 2
m94271 qwen/qwen3-vl-8b-instruct 1 11 6 3
m94775 google/gemini-2.5-flash-lite 0 2 2 17
m94775 meta-llama/llama-4-maverick 4 5 9 3
m94775 openai/gpt-4o-mini 7 5 8 1
m94775 qwen/qwen3-vl-8b-instruct 10 9 2 0
m95804 google/gemini-2.5-flash-lite 2 6 10 3
m95804 meta-llama/llama-4-maverick 3 2 5 11
m95804 openai/gpt-4o-mini 9 4 1 7
m95804 qwen/qwen3-vl-8b-instruct 7 9 5 0

Statistic Value
Friedman χ2 6.0191
p-value 0.1107
Kendall’s W (observed) 0.0085
Kendall’s W (from Friedman) 0.1003
Number of tasks 20
Number of unique raters 21
Total submissions 420

Table 11: Ranked Friedman and Kendall’s W Test Summary

A.4 Pairwise Comparison of Models

A.5 Reproducibility and Data Availability
All code, datasets, and analysis artefacts supporting this study are openly available under open
licenses at:
Repository: https://github.com/maehr/chr2025-seeing-history-unseen
Persistent record: FIXME Zenodo DOI
The repository provides a complete, executable research pipeline for the CHR 2025 paper
“Seeing History Unseen: Evaluating Vision-Language Models for WCAG-Compliant Alt-Text in
Digital Heritage Collections.”
Key components:

• src/ — source code for data generation, cleaning, and statistical analysis

1165
Figure 7: Pairwise Adjusted p-values (Holm) — Task-Level Inference

model_a model_b statistic pvalue p_adjusted_holm
google/gemini-2.5-flash-lite meta- 100.0 0.864524 1.0
llama/llama-
4-maverick
google/gemini-2.5-flash-lite openai/gpt- 59.5 0.088574 0.513834
4o-mini
google/gemini-2.5-flash-lite qwen/qwen3- 60.5 0.095709 0.513834
vl-8b-instruct
meta-llama/llama-4-maverick openai/gpt- 60.0 0.085639 0.513834
4o-mini
meta-llama/llama-4-maverick qwen/qwen3- 61.5 0.103028 0.513834
vl-8b-instruct
openai/gpt-4o-mini qwen/qwen3- 105.0 1.0 1.0
vl-8b-instruct
Table 13: Pairwise Wilcoxon Signed-Rank Tests between Models with Holm Adjustment

1166
• runs/ — timestamped outputs of alt-text generation runs, including raw API responses

• data/processed/ — anonymised survey and ranking data used for evaluation

• analysis/ — statistical summaries, CSVs, and figures referenced in this appendix

• paper/images/ — figure assets for the manuscript

Reference run: runs/20251021_233530/ — canonical example with subsample configura-
tion (20 media objects × 4 models). All tables and plots in this appendix derive from this run and
subsequent survey analyses.
A pre-configured GitHub Codespace enables fully containerised reproduction without local
setup. All scripts print output paths and runtime logs to ensure transparent traceability.

A.6 FAIR and CARE Compliance
The project adheres to the FAIR (Findable, Accessible, Interoperable, Reusable) and CARE (Col-
lective Benefit, Authority to Control, Responsibility, Ethics) principles for open humanities data.

• Findable: Repository indexed on GitHub and Zenodo with persistent DOI; structured meta-
data and semantic filenames.

• Accessible: Publicly accessible under AGPL-3.0 (code) and CC BY 4.0 (data, documenta-
tion). No authentication barriers.

• Interoperable: Machine-readable CSV, JSONL, and Parquet formats; consistent column
schemas; human- and machine-readable metadata.

• Reusable: Version-controlled pipeline, deterministic random seeds, explicit dependencies,
and complete provenance logs.

• Collective Benefit: Focus on accessibility and inclusion in digital heritage; results aim to
improve equitable access to cultural data.

• Authority to Control: No personal or culturally sensitive material; contributors retain au-
thorship and citation credit.

• Responsibility: Transparent methodological reporting and ethical safeguards for AI-assisted
heritage interpretation.

• Ethics: Evaluation limited to non-personal, publicly available heritage materials; compli-
ance with institutional research ethics guidelines.

Together, these practices ensure that the entire workflow - from model evaluation to figure
generation – is transparent, reproducible, and reusable across digital humanities and accessibility
research contexts.

1167

Back to top