Anthology of Computers and the Humanities, Vol. 3


              Seeing History Unseen: Evaluating Vision-Language
               Models for WCAG-Compliant Alt-Text in Digital
                              Heritage Collections
                                        Moritz Mähr¹,² and Moritz Twente¹
                              ¹ Stadt.Geschichte.Basel, University of Basel, Basel, Switzerland
                              ² Digital Humanities, University of Bern, Bern, Switzerland

                                                           Abstract
                Digitized heritage collections remain partially inaccessible because images often lack de-
                scriptive alternative text (alt-text). We evaluate whether contemporary Vision-Language
                Models (VLMs) can assist in producing WCAG-compliant alt-text for heterogeneous his-
                torical materials. Using a 100-item dataset curated from the Stadt.Geschichte.Basel Open
                Research Data Platform—covering photographs, maps, drawings, objects, diagrams, and
                print ephemera across multiple eras—we generate candidate descriptions with four VLMs
                (Google Gemini 2.5 Flash Lite, Meta Llama 4 Maverick, OpenAI GPT-4o mini, Qwen 3 VL
                8B Instruct). Our pipeline fixes WCAG and output constraints in the system prompt and
                injects concise, collection-specific metadata at the user turn to mitigate “lost-in-the-middle”
                effects. Feasibility benchmarks on a 20-item subset show 100% coverage, latencies of ~2–4 s
                per item, and sub-cent costs per description. A rater study with 21 humanities scholars ranks
                per-image model outputs; Friedman and Wilcoxon tests reveal no statistically significant
                performance differences, while qualitative audits identify recurring errors: factual misrecog-
                nition, selective omission, and uncritical reproduction of harmful historical terminology.
                We argue that VLMs are operationally viable but epistemically fragile in heritage contexts.
                Effective adoption requires editorial policies, sensitivity filtering, and targeted human-in-the-
                loop review, especially for sensitive content and complex figures. The study contributes a
                transparent, reproducible workflow, a small but representative evaluation set, and an initial
                cost–quality baseline to inform GLAM institutions considering AI-assisted accessibility at
                scale.

                Keywords: alt-text, vision-language models, accessibility, WCAG 2.2, digital heritage
                collections, historical accuracy, human-in-the-loop, ethical implications, metadata, disability
                justice


         1   Introduction
         Digital archives promised to democratize access to cultural heritage, yet a significant portion of
         visual historical content remains inaccessible to people who are blind or have low vision. Many
         digitized photographs, maps, manuscripts, and other images lack descriptive alternative text (alt-
         text), creating an epistemic barrier to the past. This perpetuates an asymmetry in sensory access
         to history, where sighted people hold privileged insight into visual sources while non-sighted au-
         diences encounter barriers to engagement. Making images legible through text is more than a
         technical fix—it is a matter of historical justice and inclusivity in digital humanities. Even beyond
         blind and low-vision users, rich image descriptions can aid others, such as neurodivergent readers
         who benefit from explicit detail that sighted users might glean implicitly [2].
         Moritz Mähr, and Moritz Twente. “Seeing History Unseen: Evaluating Vision-Language Models for WCAG-Compliant
         Alt-Text in Digital Heritage Collections.” In: Computational Humanities Research 2025, ed. by Taylor Arnold,
         Margherita Fantoli, and Ruben Ros. Vol. 3. Anthology of Computers and the Humanities. 2025, 1147–1167.
         https://doi.org/10.63744/njQVYcLndSPE.
         © 2025 by the authors. Licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).


     Alt-text itself is not new: the HTML alt attribute dates back to the 1990s to support acces-
sibility. However, providing high-quality image descriptions has often been a secondary priority
in scholarly communication [3]. Crafting alt-text is labor-intensive and typically left to authors
or curators as a final step, if done at all. The burden often falls on sighted domain experts (not
accessibility experts) to determine what information is or is not included in an image’s descrip-
tion. Human-generated descriptions are valued for capturing contextual meaning and can greatly
enhance the accessibility, searchability, and archivability of digital scholarship. Yet in practice,
many projects—especially smaller public history initiatives—lack the resources to implement ac-
cessibility from the start. The result is that visual evidence remains “unseen” by those who rely on
assistive technologies.
     Recent advances in multimodal AI offer a potential remedy. Vision-Language Models (VLMs)
such as OpenAI’s GPT-4o mini, Google’s Gemini 2.5 Flash Lite, and open-weight systems like
Meta’s Llama 4 Maverick or Qwen’s Qwen3 VL 8B Instruct now claim near-human performance
in image description tasks. These models can ingest an image and generate a caption or description,
essentially simulating the interpretive act of a human describer. If these models could produce
alt-text that is high-quality, historically informed, and aligned with the Web Content
Accessibility Guidelines (WCAG 2.2) [14], this would dramatically reduce the human effort
required to remediate large collections. Heritage institutions could then scale up accessibility by
generating alt-text for thousands of images, because machine captioning costs a small fraction of
what a human expert's time does. Consequently, the “readership” of digital archives would expand
to include those who were previously excluded.
     However, adopting automated captioning in a heritage context raises critical questions about
truth, evidence, and authenticity. Delegating descriptive labor to machines is not a neutral technical
fix; it is an act imbued with values and biases [1]. Deciding what details to include in an image’s
description is technically difficult and ethically fraught, especially for historical images depicting
people or sensitive cultural content. Vision models trained on general web images may uncritically
adopt source terminology, inject anachronistic biases (e.g., misidentifying a 1920s street scene as
“Victorian”), reinforce curatorial blind spots, or omit crucial context that a human historian would
provide. There is also the danger of techno-ableism [12], where the needs of people who are
blind are superficially addressed by technology without truly empowering them or respecting their
perspectives. Uncritical use of AI could inadvertently recenter the sighted, algorithmic point of
view rather than the lived experience of those using the alt-text.
     In this work, we argue that AI-generated alt-text for historical collections is a pivotal test case
for the entanglement of AI innovation, archival practice, and disability justice. But can a machine
“see” history as we do? If a model can convincingly describe a photograph from 100 years ago,
how does that change the way we verify and trust such descriptions? Embracing this kind of ma-
chine vision in historical scholarship may require new protocols akin to earlier paradigm shifts (for
example, the move from handwritten catalog cards to MARC records, or from microfilm to digital
scans). Just as those changes demanded critical awareness of how tools shape historical discovery,
the use of AI-generated descriptions demands a new hermeneutic of suspicion. We must learn to
critically read machine-generated metadata, much as we read any human-produced finding aid or
annotation [5]. The central purpose of our study is to assess whether and how current AI mod-
els can serve as accessibility assistants in a digital history workflow, and to critically examine
the conditions and implications of their responsible use. Our approach is interdisciplinary, blend-
ing computational experimentation with qualitative, historiographically informed analysis. The
research design comprises the following steps:

   1. Data compilation: We compile a small yet balanced dataset consisting of historical sources
      and research data.


      2. Model selection and prompt development: We conduct WCAG-aligned prompt engineer-
         ing and model selection in an iterative and exploratory manner.

      3. Generation and data collection: Once an optimal configuration of prompts and models
         has been identified, we generate candidate alternative texts (alt-text) and collect quantitative
         data on coverage, throughput, and unit cost.

      4. Expert evaluation: A group of 21 domain experts—humanities scholars with relevant dis-
         ciplinary expertise—evaluate and rank the AI-generated alt-text.

      5. Expert review: The authors qualitatively assess a selection of the highest-ranked alt-text
         for factual accuracy, contextual adequacy, and reproduction of bias.

      6. Analysis: We perform both statistical and qualitative analyses of the data obtained in steps
         3–5.

    By doing so, we aim to illuminate both the opportunities and the pitfalls of integrating AI into
inclusive humanities scholarship.

1.1     Research questions
To guide this inquiry, we pose the following research questions:

       • RQ1 Feasibility: What coverage, throughput, and unit cost can current VLMs achieve
         for WCAG-aligned alt-text on a heterogeneous heritage corpus, and where do they fail?

       • RQ2 Relative quality: How do experts rank model outputs? What error patterns recur?

    By answering these questions, our work helps to establish an empirical baseline for AI-assisted
accessibility in the humanities. It also offers a reflective critique, examining AI outputs as objects
of study in their own right. In the following sections, we outline our data and methodology
(Section 2), present initial observations from our experiments (Section 3), and discuss implications
for digital humanities practice (Section 4), before concluding with planned next steps (Section 5).

2      Data & Methodology
To ground our evaluation in a real-world scenario, we use data from Stadt.Geschichte.Basel, a
large-scale historical research project tracing the history of Basel from 50,000 BCE to the present
day. Research data is available in line with the FAIR principles on the project’s Open Research
Data Platform, with metadata in a Dublin Core schema created by our Research Data Management
Team in a comprehensive annotation workflow, following the guidelines set out in our handbook for
the creation of non-discriminatory metadata [9].
    Crucially, alt-text has been missing in our data model until now, rendering this collection an
ideal testing ground for our study. The diversity of the corpus poses a significant challenge to
automated captioning: many figures are visually and historically complex, requiring domain
knowledge to describe properly. This data thus allows us to investigate whether AI captioners can
handle the “long tail” of content found in historical archives, beyond the everyday photographs on
which many models are trained [4].


2.1    Dataset for Alt Text Generation and Evaluation
For our survey, we compiled a dataset designed to represent both the heterogeneity of media types
and the timeframe covered by the Stadt.Geschichte.Basel project. The project collection, published
on our Open Research Data Platform, features more than 1700 media objects including metadata as
of October 2025. From this corpus, we created a dataset for alt-text generation trials. This
dataset comprises 100 items and is released with this paper for benchmarking purposes.
Additionally, we created a subset of 20 items to keep the expert survey feasible. For both sets,
items were selected to maintain representativeness across the same dimensions while remaining
manageable for expert reviewers to assess within a reasonable time frame (see Appendix A.1 for a
more detailed description of the dataset).
     All items were categorized into ten distinct media types (e.g., paintings, maps, scans of
newspapers; see Appendix A.1), allowing us to ensure a balanced distribution of content. The
platform's data types primarily comprise images and figures, maps and geodata, tables and
statistics, and bibliographic references [8]. The selected items are correspondingly heterogeneous:
historical photographs, reproductions of artifacts, city maps and architectural plans, handwritten
letters and manuscripts, statistical charts, and printed ephemera (e.g., newspaper clippings,
posters). We made sure to include items with complex visual structures (items that need additional
information to convey their meaning, e.g., a legend for maps or diagrams), items with visible text
in different languages (e.g., scans of newspapers or posters), and items with potentially sensitive
content (e.g., content with derogatory and/or racist terminology).
     To prompt the models as described below, we used JPG files at a standardized size of 800×800
pixels – the same resolution employed for human viewers on our online platform – and their cor-
responding metadata in JSON format.

2.2    Dataset Limitations
The number of eligible items is reduced because items that are available only as placeholder
images on our platform, due to copyright restrictions, were excluded. Additionally, due to the
typesetting workflow during the production of the printed volumes, some collection items had to be
split into separate files: for some maps and charts, the legend is provided in a second image file,
separate from the main figure. This pertains to 19 of the 100 items in our dataset and to four of
the 20 items in the survey. Connections between these segmented files are made explicit in our
metadata, but the models receive only one image file as input at a time, so some information that
would be visually available to a human reader is lost. This could result in lower description
quality.

2.3    Model Selection
We selected four multimodal vision-language models (VLMs) representing a balance of open-weight
and proprietary systems with comparable cost and capability.¹
   Selection criteria:

      • Openness & diversity – two proprietary (OpenAI, Google) and two open-weight (Meta,
        Qwen) models.

      • Cost-capability parity – all models priced between $0.08–$0.15/M input and $0.40–
        $0.60/M output tokens, with ≥100K context windows.

      • Multilingual & visual competence – explicit support for German and image understanding.
¹ Cheaper models such as Mistral Pixtral 12B, AllenAI Molmo 7B-D, and OpenAI GPT-4.1 Nano were
tested but excluded due to consistently empty or nonsensical outputs.


      Attribute     Gemini 2.5           Llama 4              GPT-4o mini          Qwen3 VL 8B
                    Flash Lite           Maverick                                  Instruct
      Developer     Google               Meta                 OpenAI               Alibaba
      Openness      Proprietary          Open-weight          Proprietary          Open-weight
      Context       1.05M                1.05M                128K                 131K
      Latency (s)   0.42                 0.56                 0.58                 1.29
      Input $/M     0.10                 0.15                 0.15                 0.08
      Output $/M    0.40                 0.60                 0.60                 0.50
      Notes         Fast, low-cost;      Multilingual,        Compact GPT-4o       Robust open
                    optimized for        multimodal           variant; strong      baseline with
                    captioning.          reasoning.           factual grounding.   OCR features.

      Table 2: Models used for evaluation. Data reported by OpenRouter (27 October 2025).


         The aim was to cover diverse architectures and governance regimes while maintaining fairness
     in performance evaluation.

     2.4   Prompt engineering
     We systematically varied prompt roles and placement, comparing instruction blocks in the system
     prompt versus the user prompt, and front-loaded versus trailing constraints. We used the same
     system and user prompts for all models in zero-shot mode. Following evidence that models
     privilege information in short, well-structured prompts, we fixed normative requirements (WCAG
     2.2 alignment, de-CH style, length limits, handling of decorative/functional/complex images) in
     the system prompt and kept the user prompt minimal and image-bound to reduce
     “lost-in-the-middle” effects [7]. The user prompt injected collection-specific metadata (title,
     description, EDTF date, era, creator/publisher/source) and a concise description of the purpose
     of the alt text, followed by the image URL. Adding this structured context markedly improved
     specificity, reduced refusals, and lowered hallucinations, consistent with retrieval-style
     findings that supplying external, task-relevant evidence boosts generation quality and
     faithfulness. Recent work confirms that vision–language models can serve such accessibility
     roles when embedded in context-rich pipelines. In particular, user studies with blind and
     low-vision participants demonstrate that context-aware image descriptions (those combining
     visual and webpage metadata) are preferred and rated higher for quality, imaginability, and
     plausibility than context-free baselines [10]. This supports our design choice to inject
     structured collection metadata into the prompt. It is also consistent with findings that prompt
     structure and multimodal fusion can systematically shift which visual cues VLMs rely on [6]. By
     anchoring metadata before the image input, we aim to steer the model toward shape- and
     context-based reasoning rather than shallow texture correlations, an effect analogous to
     prompt-based cue steering observed in vision-language bias studies.
    from __future__ import annotations  # keeps type hints unevaluated at definition time

    from typing import Any

    # MediaObject is the project's metadata record (its definition is not shown here);
    # any object exposing the attributes accessed below will work.


    def build_prompt(media: MediaObject) -> str:
        """User prompt: collection metadata, with German placeholders for missing fields."""
        return f"""Titel: {media.title or "Kein Titel"}
    Beschreibung: {media.description or "Keine Beschreibung"}
    Ersteller: {media.creator or "Kein Ersteller"}
    Herausgeber: {media.publisher or "Kein Herausgeber"}
    Quelle: {media.source or "Keine Quelle"}
    Datum: {media.date or "Kein Datum"}
    Epoche: {media.era or "Keine Epoche"}""".strip()


    def build_messages(
        prompt: str, image_url: str
    ) -> tuple[list[dict[str, Any]], str, str]:
        """Return the chat messages plus the system and user prompts (kept for logging)."""
        # German system prompt: goal, rules, prohibitions, per-genre heuristics,
        # fallback behaviour, and permitted sources.
        system = """ZIEL

    Alt-Texte für historische und archäologische Sammlungsbilder.
    Kurz, sachlich, zugänglich. Erfassung der visuellen Essenz für Screenreader.

    REGELN

    1. Essenz statt Detail. Keine Redundanz zum Seitentext, kein „Bild von“.
    2. Zentralen Text im Bild wiedergeben oder kurz paraphrasieren.
    3. Kontext (Epoche, Ort, Gattung, Material, Datierung) nur bei Relevanz für Verständnis.
    4. Prägnante visuelle Merkmale nennen: Farbe, Haltung, Zustand, Attribute.
    5. Karten/Diagramme: zentrale Aussage oder Variablen.
    6. Sprache: neutral, präzise, faktenbasiert; keine Wertung, keine Spekulation.
    7. Umfang:
    * Standard: 90–180 Zeichen
    * Komplexe Karten/Tabellen: max. 400 Zeichen

    VERBOTE

    * Kein alt=, Anführungszeichen, Preambeln oder Füllwörter („zeigt“, „darstellt“).
    * Keine offensichtlichen Metadaten (z. B. Jahreszahlen aus Beschriftung).
    * Keine Bewertungen, Hypothesen oder Stilkommentare.
    * Keine Emojis oder emotionalen Begriffe.

    HEURISTIKEN

    Porträt: Person (Name, falls bekannt), Epoche, Pose oder Attribut, ggf. Funktion.
    Objekt: Gattung, Material, Datierung, auffällige Besonderheit.
    Dokument: Typ, Sprache/Schrift, Datierung, Kernaussage.
    Karte: Gebiet, Zeitraum, Zweck, Hauptvariablen.
    Ereignisfoto: Wer, was, wo, situativer Kontext.
    Plakat/Cover: Titel, Zweck, zentrale Schlagzeile.

    FALLBACK

    Unklarer Inhalt: generische, aber sinnvolle Essenz aus Metadaten.

    QUELLEN

    Nur visuelle Analyse (Bildinhalt) und übergebene Metadaten. Keine externen Kontexte.""".strip()
        return (
            [
                {"role": "system", "content": system},
                {
                    "role": "user",
                    # Metadata text first, then the image, as described above.
                    "content": [
                        {"type": "text", "text": prompt},
                        {"type": "image_url", "image_url": {"url": image_url}},
                    ],
                },
            ],
            system,
            prompt,
        )
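The length limits and wording bans in the system prompt lend themselves to a mechanical check. The following is a minimal sketch and not part of the pipeline described here, which applies no post-processing. The function name `check_alt_text`, the `complex_figure` flag, and the list of banned phrases are illustrative assumptions; the substring matching is deliberately crude and would need refinement before real use.

```python
# Hypothetical helper: flags candidate alt texts that violate the length and
# wording rules stated in the system prompt. Not part of the authors' pipeline.

# Length limits taken from rule 7 of the system prompt.
STANDARD_RANGE = (90, 180)  # characters, standard images
COMPLEX_MAX = 400           # characters, complex maps/tables

# Assumed examples of banned openers and filler words; extend as needed.
BANNED_PHRASES = ("alt=", "bild von", "zeigt", "darstellt")


def check_alt_text(text: str, complex_figure: bool = False) -> list[str]:
    """Return a list of rule violations (empty if none were found)."""
    issues: list[str] = []
    length = len(text.strip())
    if complex_figure:
        if length > COMPLEX_MAX:
            issues.append(f"too long: {length} > {COMPLEX_MAX}")
    else:
        low, high = STANDARD_RANGE
        if length < low:
            issues.append(f"too short: {length} < {low}")
        elif length > high:
            issues.append(f"too long: {length} > {high}")
    # Case-insensitive substring search for banned phrases.
    lowered = text.lower()
    for phrase in BANNED_PHRASES:
        if phrase in lowered:
            issues.append(f"banned phrase: {phrase!r}")
    return issues
```

A check of this kind only covers form. It cannot detect the factual and ethical problems discussed in Section 3.2.3, which still require human review.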


2.5    Alt Text Generation and Post-processing
Using the system and user prompts described above, we ran each image through each of the four
models, yielding four candidate descriptions per image. The generation process was automated via a
Python script using OpenRouter as an API wrapper. For the survey subset (n = 20 images), this
produced 80 candidate alt-texts (four per image). No post-processing was applied after generation.
All results were stored along with metadata and model identifiers for evaluation.
    No model refused to describe an image because of a built-in safety filter (for example, by
labelling a historical photograph as sensitive content). Had refusals occurred, we would have
handled them case by case, leaving the affected images for human description. Overall, the pipeline
is designed to be simple and to maximize coverage (at least one description for every image) while
maintaining quality through careful prompting.
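A single generation call can be sketched as follows. OpenRouter exposes an OpenAI-compatible chat-completions endpoint; the model slugs and helper names below are assumptions for illustration and should be checked against the current OpenRouter catalogue. This is not the authors' script.

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

# Assumed OpenRouter model slugs for the four models in this study.
MODELS = [
    "google/gemini-2.5-flash-lite",
    "meta-llama/llama-4-maverick",
    "openai/gpt-4o-mini",
    "qwen/qwen3-vl-8b-instruct",
]


def build_request(model: str, messages: list[dict], api_key: str) -> urllib.request.Request:
    """Build (but do not send) a chat-completions request for one model."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        OPENROUTER_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def generate_alt_text(model: str, messages: list[dict], api_key: str) -> str:
    """Send the request and return the first choice's text (requires network access)."""
    request = build_request(model, messages, api_key)
    with urllib.request.urlopen(request, timeout=60) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]
```

Looping `generate_alt_text` over `MODELS` for each image, with the messages produced by `build_messages`, would yield the four candidates per image described above. Separating request construction from sending keeps the payload testable without network access.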

2.6    Survey
Twenty-one humanities scholars ranked, per image, four model-generated descriptions from best
(1) to worst (4) under WCAG-intended criteria for alt text. Raters were asked to consider: (a)
concise rendering of the core visual content; (b) avoidance of redundant phrases (e.g., “image
of”); (c) prioritisation of salient visual features (persons, objects, actions, visible text); and (d)
context inclusion only when it improves comprehension. While factual accuracy, complete-
ness, and absence of bias were not primary ranking dimensions, they may have been factored in
implicitly.

2.7    Close reading
To check for these dimensions, the authors conducted a qualitative close reading of a selection
of the generated alt-text. This analysis specifically targeted outputs that had received the highest
rankings from the expert panel and asked three questions:

      • Factual accuracy (Did the generated description contain any incorrect identifications of
        people, objects, or actions?)

      • Contextual adequacy (Did the generated description include any incorrect or misleading
        historical context?)

      • Bias reproduction (Did the model reproduce sensitive, derogatory, or racist terminology
        from the source material?)

   This allowed us to investigate whether an alt-text could be ranked highly for WCAG alignment
while simultaneously being factually incorrect or ethically problematic.

3     Results and Analysis
3.1    RQ1 Feasibility: Coverage, Throughput, and Unit Cost
To address the feasibility of automatic alt-text generation at corpus scale, we compared four state-
of-the-art vision–language models (VLMs): Google Gemini 2.5 Flash Lite, Meta Llama 4 Mav-
erick, OpenAI GPT-4o mini, and Qwen 3 VL 8B Instruct. Each model generated alt-text de-
scriptions for 20 representative heritage images selected for diversity of content, medium, and
metadata completeness.


3.1.1    Coverage and reliability
All models returned non-empty outputs for all 20 prompts, yielding 100 % coverage and no failed
responses. This demonstrates that current VLMs can reliably produce textual descriptions even
for heterogeneous heritage data without the need for fallback mechanisms.
3.1.2    Throughput and latency
Processing speed ranged from 0.24 to 0.43 items/s, corresponding to median latencies of 2–4 s
per item. Models were accessed via OpenRouter.ai with the following providers: Google
(gemini-2.5-flash-lite), OpenAI (gpt-4o-mini), Together (llama-4-maverick), and Alibaba (Qwen 3 VL
8B). Qwen 3 VL 8B achieved the highest throughput and lowest latency, while OpenAI GPT-4o mini was
slower but consistent. All models showed moderate response-time variability (≈ 1.7–10 s).
3.1.3    Cost efficiency
Unit generation costs differed by a factor of about 20 (roughly one order of magnitude), reflecting
API pricing rather than architectural complexity. According to OpenRouter.ai cost reports, mean
costs per item ranged from $1.8 × 10⁻⁴ (Qwen) to $3.6 × 10⁻³ (OpenAI).


                   Model                           Coverage   Throughput   Median        Mean cost
                                                   (%)        (items/s)    latency (s)   (USD/item)
                   Google Gemini 2.5 Flash Lite    100        0.31         2.72          0.000215
                   Meta Llama 4 Maverick           100        0.41         2.38          0.000395
                   OpenAI GPT-4o Mini              100        0.24         4.00          0.003625
                   Qwen 3 VL 8B Instruct           100        0.43         2.27          0.000182

                    Table 3: Feasibility metrics for alt text generation (n = 20).
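
To put the per-item costs in Table 3 into perspective, the sketch below extrapolates them linearly to the full collection. The figure of 1700 items is an approximation taken from "more than 1700 media objects", and the assumption that per-item cost stays constant at scale is ours, not a measured result.

```python
# Mean cost per item (USD) as reported in Table 3.
COST_PER_ITEM = {
    "Google Gemini 2.5 Flash Lite": 0.000215,
    "Meta Llama 4 Maverick": 0.000395,
    "OpenAI GPT-4o Mini": 0.003625,
    "Qwen 3 VL 8B Instruct": 0.000182,
}

# Approximate collection size; an assumption for illustration only.
N_ITEMS = 1700


def projected_cost(cost_per_item: float, n_items: int = N_ITEMS) -> float:
    """Linear extrapolation: assumes the per-item cost stays constant."""
    return cost_per_item * n_items


for model, cost in COST_PER_ITEM.items():
    print(f"{model}: ${projected_cost(cost):.2f} for {N_ITEMS} items")

# Ratio between the most and the least expensive model.
ratio = max(COST_PER_ITEM.values()) / min(COST_PER_ITEM.values())
print(f"cost ratio: {ratio:.1f}x")
```

Under these assumptions, even the most expensive of the four models stays below USD 7 for one pass over the whole collection.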


3.1.4    Summary of RQ1
All models achieved complete coverage, acceptable latency, and minimal cost, confirming the
technical and economic feasibility of automated alt text generation for large, heterogeneous cultural
collections. Failures were not due to empty outputs but to qualitative weaknesses, which are
examined under RQ2.

3.2     RQ2 Relative Quality: Expert Ranking and Qualitative Assessment
3.2.1    Quantitative ranking analysis
Within each task, all models were directly compared; task-level median ranks were analyzed across
the 20 tasks using the Friedman test for repeated measures, followed by pairwise Wilcoxon
signed-rank tests with Holm–Bonferroni correction. Agreement across tasks was quantified
with Kendall’s W.

                           χ²(3, N = 20) = 6.02,     p = 0.11,     W = 0.0085.
    The results indicate no statistically significant difference among models (p > 0.05) and
very low inter-task agreement (W ≈ 0.01), implying that relative rankings varied substantially
by task. Pairwise Wilcoxon comparisons (Appendix A.4) showed no significant differences after


correction (p_Holm > 0.5); unadjusted p-values suggested weak, non-significant trends favoring
OpenAI GPT-4o Mini and Qwen 3 VL 8B over Google Gemini and Meta Llama.
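
The reported p-value can be reproduced from the Friedman statistic with the standard library alone, and the Holm–Bonferroni step-down adjustment is short enough to spell out. This is a sketch rather than the authors' analysis code: the closed-form survival function holds only for 3 degrees of freedom (four models), and the function names are assumptions.

```python
import math


def chi2_sf_df3(x: float) -> float:
    """Survival function P(X > x) of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def holm_adjust(p_values: list[float]) -> list[float]:
    """Holm-Bonferroni adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Multiply the (rank + 1)-th smallest p-value by the number of
        # hypotheses still under test, cap at 1, and keep the sequence monotone.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Friedman statistic reported above: chi-square = 6.02 with k - 1 = 3 df.
print(f"p = {chi2_sf_df3(6.02):.2f}")
```

With four models there are six pairwise Wilcoxon comparisons, so `holm_adjust` would receive six unadjusted p-values; the smallest is multiplied by six, the next by five, and so on.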
3.2.2   Descriptive patterns
Twenty-one human experts each ranked four alternative texts for 20 images, yielding 420 rankings
(21 raters × 20 images) and thus 420 rank assignments per model:

              Model                           Rank 1 Rank 2 Rank 3 Rank 4
              Google Gemini 2.5 Flash Lite         86        84       113       137
              Meta Llama 4 Maverick                88       114       110       108
              OpenAI GPT-4o Mini                  132       101        84       103
              Qwen 3 VL 8B Instruct               114       121       113        72

    OpenAI and Qwen outputs received more first-place and fewer last-place rankings, but over-
lapping rank distributions (Appendix A.3) indicate that these tendencies remain descriptive rather
than inferentially significant.
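
The rank counts above can be condensed into a mean rank per model. The sketch below derives these values from the table; they are a descriptive summary computed here, not figures reported in the study, and the function name is an assumption.

```python
# Rank counts from the table above: how often each model was ranked 1st to 4th.
RANK_COUNTS = {
    "Google Gemini 2.5 Flash Lite": [86, 84, 113, 137],
    "Meta Llama 4 Maverick": [88, 114, 110, 108],
    "OpenAI GPT-4o Mini": [132, 101, 84, 103],
    "Qwen 3 VL 8B Instruct": [114, 121, 113, 72],
}


def mean_rank(counts: list[int]) -> float:
    """Mean of ranks 1..len(counts), weighted by how often each rank occurred."""
    total = sum(counts)
    return sum(rank * n for rank, n in enumerate(counts, start=1)) / total


for model, counts in RANK_COUNTS.items():
    assert sum(counts) == 420  # 21 raters x 20 images
    print(f"{model}: mean rank {mean_rank(counts):.2f}")
```

On these counts the mean ranks span roughly 2.3 to 2.7 on a scale from 1 (best) to 4 (worst), a narrow band that is consistent with the non-significant Friedman result.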
3.2.3   Qualitative evaluation of top-rated outputs
A close reading of selected alt texts with the best mean ranks revealed that even the outputs
deemed ‘best’ were not free from substantive and ethical shortcomings. In fact, all models produced
at least one error. Some errors are easy to catch in a manual review (factually wrong
descriptions); others require expert domain knowledge to detect (reproduction of stereotypes).


             Figure 1: Faltblatt der Gruppe ‹Freiräume für Frauen› (FFF) (m92410)

Example of Factually Wrong Text (Figure 1)

    • Best-ranked alt-text: “Faltblatt der Gruppe ‹Freiräume für Frauen› (FFF) von 1992. Zwei
      Personen gehen eine Treppe hinunter, Frau mit Zigarette, Mann mit Sonnenbrille und Hän-
      den in den Hosentaschen” (Google Gemini 2.5 Flash Lite)


   • Translation: “Leaflet of the ’Freiräume für Frauen’ (FFF) group from 1992. Two people
     walk down a staircase, a woman with a cigarette, a man with sunglasses and hands in his
     pockets”
   • Shortcoming: This description is factually incorrect. The two people in the photograph are
     clearly walking up the staircase, not down.


Figure 2: ‹Gegen den Eisenbahnhandel›, anonyme antisemitische Karikatur in der Extraausgabe
der Basler ‹Allgemeinen Schweizer Zeitung› vom 14. Februar 1898 (m91960)

Example of Reproduction of Stereotypes (Figure 2)
   • Best-ranked alt-text: “Antisemitische Karikatur zur Eisenbahnverstaatlichung 1898.
     Links: Wehrhafte Helvetia zwingt knienden Geldverleiher nieder. Rechts: Kniende Hel-
     vetia bittet reichen Juden um Geld. Kontrastdarstellung zum Volksentscheid über die
     Eisenbahnverstaatlichung” (Meta Llama 4 Maverick)
   • Translation: “Antisemitic cartoon on the nationalization of the railways 1898. Left: Defen-
     sive Helvetia forces a kneeling moneylender down. Right: Kneeling Helvetia asks a rich Jew
     for money. Contrasting depiction of the referendum on the nationalization of the railways”
   • Shortcoming: The text uncritically reproduces the term “reichen Juden” (rich Jew). This is
     an antisemitic stereotype likely drawn from the item’s metadata or biased training data, not
     a neutral description of the visual content.

Example of Selective Description (Figure 3)
   • Best ranked alt-text: “Balkendiagramm: Zuschauerzahlen von FC Basel (blau) und Konz-
     erten (gelb) im St. Jakob-Stadion, 1960–2022. Höchstwerte bei Fußballspielen ab 2001 im
     neuen St. Jakob-Park, Einbruch 2020 durch Corona-Pandemie” (Meta Llama 4 Maverick)
   • Translation: “Bar chart: Audience numbers for FC Basel (blue) and concerts (yellow) at
     St. Jakob-Stadion, 1960–2022. Peak values for football matches from 2001 in the new St.
     Jakob-Park, slump in 2020 due to the Corona pandemic”
   • Shortcoming: The description is selective and unbalanced. It interprets the trends for the
     football matches (blue bars) in detail but offers no information on or interpretation of
     either the concert attendance (yellow bars) or the number of football matches (green circles),
     thus omitting much of the chart’s comparative data.


Figure 3: Zuschauerzahlen von FC Basel und Konzerten im St. Jakob-Stadion, 1960–2022
(m88415_1)


3.2.4    Interpretation
These examples of erroneous alt texts that were nevertheless ranked best in our survey show that
high rankings do not imply factual accuracy or ethical adequacy. Even when linguistically fluent
and stylistically polished, VLM-generated alt texts may introduce epistemic distortions or
perpetuate historical bias.
     Quantitatively, no model achieved a statistically distinct performance profile; qualitatively,
all exhibited systematic error patterns—misrecognition, omission, and uncritical reproduction
of harmful source language. This combination highlights the limits of rank-based evaluation
alone: expert preference captures relative quality but not factual or ethical soundness.

3.3     Synthesis
      • RQ1 Feasibility: All four VLMs achieved full coverage, low latency, and negligible cost,
        confirming the operational viability of automated WCAG-aligned alt text generation for her-
        itage corpora.

      • RQ2 Relative quality: Expert rankings showed no statistically significant hierarchy among
        models (p = 0.11, W ≈ 0.01), and qualitative inspection exposed factual inaccuracies,
        biased reproduction, and selective omissions even in top-rated outputs.

    Overall, current VLMs can populate heritage databases at scale but require expert review and
critical post-editing to ensure factual precision, ethical compliance, and contextual adequacy.
Automated alt text workflows should therefore combine model ensembles with targeted human
oversight to meet both accessibility and historiographical standards.

4     Discussion
Our findings confirm the central tension in using contemporary VLMs for heritage accessibility:
they are operationally feasible but epistemically fragile. The 100% coverage, low latency, and
negligible cost (RQ1) demonstrate that the technical and economic barriers to generating descrip-
tions at a corpus-wide scale are virtually gone. However, the results from our expert evaluation
(RQ2) reveal a gap between this operational success and the production of high-quality, trustworthy
alt text.


     The lack of a statistically significant winner among the models, combined with the low inter-
task agreement (W ≈ 0.01), is a key finding. It suggests that no single model is a reliable
one-shot solution. A model that performs well on a photograph might fail on a diagram, and vice
versa. This variability is consistent with mechanistic analyses such as [6], which show that
VLM outputs are highly sensitive to how their fusion layers mediate visual cues. Our metadata-
rich prompts likely steered models toward more context-aware descriptions, but this “cue steering”
was not a panacea against factual or ethical errors.
     Critically, our mixed-methods approach exposed the limits of rank-based evaluation alone.
The qualitative close reading (Section 3.2.3) revealed that outputs ranked highly by experts for
WCAG alignment (i.e., conciseness and style) could still be factually wrong, ethically problematic,
or epistemically shallow. The FFF flyer (m92410) example, which confidently misidentifies the
walking direction, and the antisemitic cartoon (m91960) example, which uncritically reproduces
the term “reicher Jude” (rich Jew), likely drawn from the item’s metadata or biased training
data, are stark illustrations. Fluency, in essence, is not a proxy for accuracy or ethical adequacy.
     This study validates the framing of AI as an accessibility assistant (Section 1) rather than an
autonomous author. The VLM output should be treated as a first draft for human review, not a final
product. This reframes the labor of digital humanists: from authoring descriptions from scratch
to critically editing machine-generated drafts. This aligns with the hybrid assessment frameworks
proposed in educational research [11] and necessitates the “new hermeneutic of suspicion” [5]
advocated in our introduction. Curators and historians must be trained in AI literacy [13] to spot
subtle biases and misinterpretations that a fluent-sounding description might otherwise obscure.
     Finally, our study has limitations. The expert ranking (n=21) was based on a relatively small
(n=20) subset of images, which, while diverse in content, limits the statistical power of our quan-
titative analysis. Furthermore, the survey criteria explicitly prioritized WCAG stylistic guidelines
over factual accuracy, a dimension we could only capture post-hoc via our qualitative close read-
ing. A crucial missing component, which we intentionally bracketed to first establish a baseline, is
the perspective of blind and low-vision users themselves. Without their input, any AI-driven ac-
cessibility solution risks falling into the trap of “techno-ableism” [12], designing for a community
without designing with them.

5   Future Work
Building on this study’s findings, we identify several critical paths for future research to bridge the
gap between the operational promise and epistemic fragility of AI-generated alt text for heritage.

    • Usability and User-Experience (UX) Studies: The most urgent next step is to move beyond
      expert-as-proxy and engage directly with blind and low-vision (BLV) users. Future work
      should conduct qualitative usability studies to assess how BLV readers experience these AI-
      generated descriptions. Do they find them helpful, confusing, or biased? Does the inclusion
      of metadata (as prompted) improve or hinder “imaginability”? This addresses the techno-
      ableism critique and centers the lived experience of those the technology claims to serve.

    • Domain-Specific Fine-Tuning: This study relied on general-purpose VLMs with prompt
      engineering. A promising avenue is the fine-tuning of open-weight models (such as Meta’s
      Llama 4 or Qwen’s Qwen3 VL) on a high-quality, domain-specific dataset. By training a
      model on thousands of expert-vetted alt texts from GLAM (Galleries, Libraries, Archives,
      and Museums) collections, it may be possible to create a model that is more factually ac-
      curate, context-aware, and ethically sensitive to heritage content than its general-purpose
      counterparts.


    • Developing Human-in-the-Loop (HITL) Workflows: Our results confirm the necessity of
      expert review. Future research should move from evaluation to implementation by designing
      and testing HITL editorial interfaces. What is the most effective workflow for a historian to
      review, correct, and approve AI-generated alt text? How can we best integrate AI-generated
      “drafts” into existing collections management systems (CMS) and research data platforms,
      complete with policies for handling sensitive content?

    • Scaling the Benchmark: This study established a 100-item benchmark dataset, but the expert
      ranking covered only a 20-item subset. The next phase should extend the rater study to the
      full dataset to support a more robust quantitative analysis. This would allow for a more
      granular breakdown of model performance by media type (e.g., maps vs. manuscripts vs.
      photographs) and help establish more reliable cost-quality trade-offs to guide GLAM
      institutions in adopting these technologies.

Acknowledgements
We thank Cristina Münch and Noëlle Schnegg for curating metadata and providing image assets
from the Stadt.Geschichte.Basel collection. We are grateful to the 21 expert participants for their
careful rankings and comments. For insightful feedback on an early draft, we thank Dr. Mehrdad
Almasi (Luxembourg Centre for Contemporary and Digital History, C²DH). Any remaining errors
are our own.

References
 [1] Bowker, Geoffrey C. and Star, Susan Leigh. Sorting Things out: Classification and Its Con-
     sequences. Inside Technology. Cambridge, Massachusetts: MIT Press, 1999. 377 pp. ISBN:
     978-0-262-02461-7.
 [2] Cecilia, Rafie, Moussouri, Theano, and Fraser, John. “AltText: An Institutional Tool for
     Change”. In: Curator 66, no. 2 (2023), pp. 225–231. DOI: 10.1111/cura.12551.
 [3] Cecilia, Rafie, Moussouri, Theano, and Fraser, John. “Creating Accessible Digital Images
     for Vision Impaired Audiences and Researchers”. In: Curator 66, no. 1 (2023), pp. 5–8.
     DOI: 10.1111/cura.12536.
 [4] Cetinic, Eva. “Towards Generating and Evaluating Iconographic Image Captions of Art-
     works”. In: Journal of Imaging 7, no. 8 (July 23, 2021), p. 123. ISSN: 2313-433X. DOI:
     10.3390/jimaging7080123. PMID: 34460759. URL: https://pmc.ncbi.nlm.nih.
     gov/articles/PMC8404909/ (visited on 10/28/2025).
 [5] Fickers, Andreas. “Digital Hermeneutics: The Reflexive Turn in Digital Public History?”
     In: Handbook of Digital Public History, ed. by Serge Noiret, Valérie Schafer, and Gerben
     Zaagsma. De Gruyter, 2022, pp. 139–148. DOI: 10.1515/9783110430295-012.
 [6] Gavrikov, Paul, Lukasik, Jovita, Jung, Steffen, Geirhos, Robert, Mirza, M. Jehanzeb, Keu-
     per, Margret, and Keuper, Janis. “Can We Talk Models Into Seeing the World Differently?”
     Version 2. Mar. 5, 2025. DOI: 10.48550/arXiv.2403.09193. arXiv: 2403.09193 [cs].
     URL: http://arxiv.org/abs/2403.09193 (visited on 10/27/2025). Pre-published.
 [7] Liu, Nelson F., Lin, Kevin, Hewitt, John, Paranjape, Ashwin, Bevilacqua, Michele, Petroni,
     Fabio, and Liang, Percy. “Lost in the Middle: How Language Models Use Long Contexts”.
     Nov. 20, 2023. DOI: 10.48550/arXiv.2307.03172. arXiv: 2307.03172 [cs]. URL:
     http://arxiv.org/abs/2307.03172 (visited on 10/28/2025). Pre-published.
 [8] Mähr, Moritz. “Research Data Management in (Public) History”. Keynote. Istituto Svizzero
     di Roma, June 2022. URL: https://doi.org/10.5281/zenodo.6637118.


 [9] Mähr, Moritz and Schnegg, Noëlle. “Handbuch zur Erstellung diskriminierungsfreier Meta-
     daten für historische Quellen und Forschungsdaten: Erfahrungen aus dem geschichtswis-
     senschaftlichen Forschungsprojekt Stadt.Geschichte.Basel”. Basel: Zenodo, June 2024.
     DOI: 10.5281/ZENODO.11124720.
[10] Mohanbabu, Ananya Gubbi and Pavel, Amy. “Context-Aware Image Descriptions for Web
     Accessibility”. In: The 26th International ACM SIGACCESS Conference on Computers and
     Accessibility. Oct. 27, 2024, pp. 1–17. DOI: 10.1145/3663548.3675658. arXiv:
     2409.03054 [cs]. URL: http://arxiv.org/abs/2409.03054 (visited on 10/27/2025).
[11] Reihanian, Iman, Hou, Yunfei, Chen, Yu, and Zheng, Yifei. “A Review of Generative AI
     in Computer Science Education: Challenges and Opportunities in Accuracy, Authenticity,
     and Assessment”. Version 1. June 17, 2025. DOI: 10.48550/arXiv.2507.11543. arXiv:
     2507.11543 [cs]. URL: http://arxiv.org/abs/2507.11543 (visited on 10/27/2025).
     Pre-published.
[12] Shew, Ashley. Against Technoableism: Rethinking Who Needs Improvement. New York: W.
     W. Norton, 2023.
[13] Strien, Daniel van, Bell, Mark, McGregor, Nora Rose, and Trizna, Michael. “An Introduction
     to AI for GLAM”. In: Proceedings of the Second Teaching Machine Learning and
     Artificial Intelligence Workshop. The Second Teaching Machine Learning and Artificial
     Intelligence Workshop. PMLR, Mar. 14, 2022, pp. 20–24. URL:
     https://proceedings.mlr.press/v170/strien22a.html (visited on 10/28/2025).
[14] World Wide Web Consortium. “Web Content Accessibility Guidelines (WCAG) 2.2”. 2023.
     URL: https://www.w3.org/TR/WCAG22/.

A      Appendix
A.1     Dataset description
For both our 100-item selection and the 20-item survey subset, we sought an overall balance
across the data types and eras that make up the Stadt.Geschichte.Basel collection.
A.1.1    Distribution by Type
Due to the collection’s historical nature, not all data types appear in all eras, and the smaller size
of the survey subset accentuates these constraints. We dropped Painting items and the Antiquity
era from the survey subset due to their low prevalence in our corpus.

                 Type                                           Dataset    Survey
                 Painting                                            12         0
                 Object                                              13         2
                 Photograph (Archaeological Site)                    10         2
                 Photograph (Historical Scenes)                      10         2
                 Scan of Newspapers, Posters, Lists, etc.            10         3
                 Drawing (Archaeological Reconstruction)             10         3
                 Drawing (Historical Drawing)                        10         2
                 Map                                                 10         2
                 Diagram (Statistics)                                10         2
                 Diagram (Flowchart, Schema etc.)                     5         2
                 Total                                              100        20

             Table 4: Distribution of media types in the dataset and survey subset.

A.1.2    Distribution by Era
With regard to the historical eras represented in the subset, we aimed to cover the full
chronological span of the Stadt.Geschichte.Basel project, from 50,000 BCE until the 21st century.
Since each item is tagged with an era in the metadata, we could systematically select items across
periods in a way that resembles their distribution in the whole research dataset (at the time of
writing). Items from some eras, e.g. Antiquity and 21st Century, are less frequent in the overall
collection, which is reflected in a lower representation in our dataset.

                            Era                       Dataset    Survey
                            Protohistory                   11         3
                            Antiquity                       3         0
                            Middle Ages                    16         2
                            Early Modern period            21         3
                            19th century                   19         5
                            20th century                   25         5
                            21st century                    5         2
                            Total                         100        20

            Table 5: Distribution of historical eras in the dataset and survey subset.

A.1.3    Distribution across Era and Type (survey count in parentheses)

  Type                                      Protohistory   Antiquity   Middle Ages   Early Modern period   19th century   20th century   21st century
  Scan of Newspapers, Lists, etc.           0 (0)          0 (0)       1 (0)         2 (0)                 3 (2)          4 (1)          0 (0)
  Photograph (Historical Scenes)            0 (0)          0 (0)       0 (0)         0 (0)                 3 (1)          7 (1)          0 (0)
  Photograph (Archaeological Site)          0 (0)          0 (0)       0 (0)         0 (0)                 0 (0)          5 (1)          5 (1)
  Object                                    3 (0)          0 (0)       3 (0)         6 (1)                 1 (1)          0 (0)          0 (0)
  Map                                       1 (0)          3 (0)       1 (0)         1 (0)                 1 (1)          3 (1)          0 (0)
  Drawing (Historical Drawing)              0 (0)          0 (0)       1 (0)         5 (2)                 4 (0)          0 (0)          0 (0)
  Drawing (Archaeological Reconstruction)   5 (3)          0 (0)       5 (0)         0 (0)                 0 (0)          0 (0)          0 (0)
  Diagram (Statistics)                      0 (0)          0 (0)       0 (0)         1 (0)                 2 (0)          6 (2)          1 (1)
  Diagram (Flowchart, Schema etc.)          0 (0)          0 (0)       3 (2)         0 (0)                 1 (0)          1 (0)          0 (0)
  Painting                                  0 (0)          0 (0)       2 (0)         6 (0)                 4 (0)          0 (0)          0 (0)

Table 6: Distribution of media types across historical eras in the dataset (survey subset counts
in parentheses).

A.1.4    Language Distribution
Our research data collection primarily contains items in German, with a small number of items
in Latin, French and Dutch. We aimed to reflect this language distribution in our selection. In
a similar vein, we wanted to take into account different typographic styles. However, since we are
working with 800×800 pixel JPG image thumbnails, writing is not fully legible in many cases
anyway and was thus only a minor factor in the selection process.

                              Language               Dataset Survey
                              German                      33          9
                              Latin                        8          2
                              French                       2          0
                              Without written text        57          9

               Table 7: Distribution of languages in the dataset and survey subset.


A.1.5    Spatial Context
While geospatial context is not a part of our data model, most items in the Stadt.Geschichte.Basel
collection can be associated with specific locations in Basel or elsewhere. The geographical dis-
tribution of the collection items did not influence our selection process directly, but a rough cate-
gorization was done afterwards to see whether differences in geographical scope are represented
in our dataset.

            Spatial Context                                           Dataset Survey
            City of Basel                                                   60        14
            Basel Region/Northwestern Switzerland/Upper Rhine               16         2
            Switzerland                                                      6         0
            Switzerland and Neighbouring Countries                           5         1
            Europe                                                           3         1
            Worldwide                                                        5         2
            NA                                                               5         0

             Table 8: Distribution of spatial contexts in the dataset and survey subset.


A.1.6    Media Complexity
For technical reasons, some objects in our collection consist of several media items. The additional
items are legends for maps and diagrams, which visually supply information that is crucial to fully
grasp the meaning of the main media item.

              Media Complexity                                       Dataset Survey
              Single-Item Object                                          81        19
              Multiple-Item Object (Figure and separate Legend)           16         4

            Table 9: Distribution of media complexity in the dataset and survey subset.


A.2     System Performance Analysis

             Figure 4: Throughput, Latency, and Cost by Model

A.3     Rank Distributions and Aggregate Performance

             Figure 5: Counts of Ranks per Model (All Ratings)

Figure 6: Rank Distributions per Model (Task-Level Medians; Lower = Better)

                     Table 10: Rank counts per object and model.

objectid   model                          count_rank_1 count_rank_2 count_rank_3 count_rank_4
m12965     google/gemini-2.5-flash-lite             0            9            8               4
m12965     meta-llama/llama-4-maverick             11            0            4               6
m12965     openai/gpt-4o-mini                       7            5            4               5
m12965     qwen/qwen3-vl-8b-instruct                3            7            5               6
m13176     google/gemini-2.5-flash-lite             2            4            5              10
m13176     meta-llama/llama-4-maverick              4            4           10               3
m13176     openai/gpt-4o-mini                       6            6            2               7
m13176     qwen/qwen3-vl-8b-instruct                9            7            4               1
m15298_1   google/gemini-2.5-flash-lite             6            1            6               8
m15298_1   meta-llama/llama-4-maverick              2           11            7               1
m15298_1   openai/gpt-4o-mini                       6            2            4               9
m15298_1   qwen/qwen3-vl-8b-instruct                7            7            4               3
m20435     google/gemini-2.5-flash-lite             5            5            8               3
m20435     meta-llama/llama-4-maverick              2            6            4               9
m20435     openai/gpt-4o-mini                       3            4            6               8
m20435     qwen/qwen3-vl-8b-instruct               11            6            3               1
m22924     google/gemini-2.5-flash-lite             5            3            4               9
m22924     meta-llama/llama-4-maverick              7            3            5               6
m22924     openai/gpt-4o-mini                       3            9            6               3
m22924     qwen/qwen3-vl-8b-instruct                6            6            6               3
m28635     google/gemini-2.5-flash-lite             2            6            7               6
m28635     meta-llama/llama-4-maverick              6            9            2               4
m28635     openai/gpt-4o-mini                      13            3            4               1
m28635     qwen/qwen3-vl-8b-instruct                0            3            8              10
m29084     google/gemini-2.5-flash-lite             2            9           10               0
m29084     meta-llama/llama-4-maverick              2            2            2              15
m29084     openai/gpt-4o-mini                       7            5            3               6
m29084     qwen/qwen3-vl-8b-instruct               10            5            6               0
m34620     google/gemini-2.5-flash-lite             3            8            9               1
m34620     meta-llama/llama-4-maverick              5            5            4               7
m34620     openai/gpt-4o-mini                       4            5            3               9
m34620     qwen/qwen3-vl-8b-instruct                9            3            5               4
m37030_1   google/gemini-2.5-flash-lite             7            6            5               3
m37030_1   meta-llama/llama-4-maverick              3            7            4               7
m37030_1   openai/gpt-4o-mini                       9            7            3               2
m37030_1   qwen/qwen3-vl-8b-instruct                2            1            9               9
m37716     google/gemini-2.5-flash-lite             3            2            7               9
m37716     meta-llama/llama-4-maverick              3            4            8               6
m37716     openai/gpt-4o-mini                       8            8            2               3
m37716     qwen/qwen3-vl-8b-instruct                7            7            4               3
m39198_1   google/gemini-2.5-flash-lite             3            2            1              15
m39198_1   meta-llama/llama-4-maverick              4            5            9               3
m39198_1   openai/gpt-4o-mini                      10            4            4               3
m39198_1   qwen/qwen3-vl-8b-instruct                4           10            7               0
m82972     google/gemini-2.5-flash-lite            11            5            3               2
m82972     meta-llama/llama-4-maverick              2            9            7               3
m82972     openai/gpt-4o-mini                       5            5            4               7
m82972     qwen/qwen3-vl-8b-instruct                3            2            7               9
m88415_1   google/gemini-2.5-flash-lite             1            3            1              16
m88415_1   meta-llama/llama-4-maverick              5            9            7               0
m88415_1   openai/gpt-4o-mini                       9            3            5               4
m88415_1   qwen/qwen3-vl-8b-instruct                6            6            8               1
m91000_1   google/gemini-2.5-flash-lite             1            3            6              11
m91000_1   meta-llama/llama-4-maverick              7            7            6               1
m91000_1   openai/gpt-4o-mini                       5            6            2               8
m91000_1   qwen/qwen3-vl-8b-instruct                8            5            7               1
m91960     google/gemini-2.5-flash-lite             2            4            8               7
m91960     meta-llama/llama-4-maverick             10            6            4               1
m91960     openai/gpt-4o-mini                       4            1            6              10
m91960     qwen/qwen3-vl-8b-instruct                5           10            3               3
m92357     google/gemini-2.5-flash-lite             8            3            5               5
m92357     meta-llama/llama-4-maverick              3           11            3               4
m92357     openai/gpt-4o-mini                       5            4            7               5
m92357     qwen/qwen3-vl-8b-instruct                5            3            6               7
m92410     google/gemini-2.5-flash-lite            14            0            3               4
m92410     meta-llama/llama-4-maverick              3            6            6               6
m92410     openai/gpt-4o-mini                       3           11            4               3
m92410     qwen/qwen3-vl-8b-instruct                1            4            8               8
m94271     google/gemini-2.5-flash-lite             9            3            5               4
m94271     meta-llama/llama-4-maverick              2            3            4              12
m94271     openai/gpt-4o-mini                       9            4            6               2
m94271     qwen/qwen3-vl-8b-instruct                1           11            6               3
m94775     google/gemini-2.5-flash-lite             0            2            2              17
m94775     meta-llama/llama-4-maverick              4            5            9               3
m94775     openai/gpt-4o-mini                       7            5            8               1
m94775     qwen/qwen3-vl-8b-instruct               10            9            2               0
m95804     google/gemini-2.5-flash-lite             2            6           10               3
m95804     meta-llama/llama-4-maverick              3            2            5              11
m95804     openai/gpt-4o-mini                       9            4            1               7
m95804     qwen/qwen3-vl-8b-instruct                7            9            5               0
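
The rank counts in Table 10 can be condensed into a per-object mean rank. The following minimal sketch does this for object m88415_1 (Figure 3), using the counts from the table. It assumes that "best-ranked" refers to the lowest mean rank; the aggregation rule is not stated in this appendix, and the function name `mean_rank` is ours.

```python
# Rank counts for object m88415_1, copied from Table 10 (columns: rank 1..4).
rank_counts = {
    "google/gemini-2.5-flash-lite": [1, 3, 1, 16],
    "meta-llama/llama-4-maverick": [5, 9, 7, 0],
    "openai/gpt-4o-mini": [9, 3, 5, 4],
    "qwen/qwen3-vl-8b-instruct": [6, 6, 8, 1],
}


def mean_rank(counts):
    """Mean of ranks 1..k, weighted by how many raters assigned each rank."""
    total = sum(counts)
    return sum(rank * n for rank, n in enumerate(counts, start=1)) / total


mean_ranks = {model: mean_rank(counts) for model, counts in rank_counts.items()}

# Two possible notions of "best": lowest mean rank vs. most first places.
best_by_mean = min(mean_ranks, key=mean_ranks.get)
most_first_places = max(rank_counts, key=lambda m: rank_counts[m][0])

for model, value in sorted(mean_ranks.items(), key=lambda kv: kv[1]):
    print(f"{model:32s} mean rank = {value:.3f}")
print("lowest mean rank: ", best_by_mean)
print("most first places:", most_first_places)
```

For this object the two criteria disagree: Llama 4 Maverick has the lowest mean rank, while GPT-4o mini collects the most first places. The aggregation rule therefore matters when labelling an output as "best-ranked".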


                                  Statistic                                Value
                                  Friedman χ2                             6.0191
                                  p-value                                 0.1107
                                  Kendall’s W (observed)                  0.0085
                                  Kendall’s W (from Friedman)             0.1003
                                  Number of tasks                             20
                                  Number of unique raters                     21
                                  Total submissions                          420

                     Table 11: Ranked Friedman and Kendall’s W Test Summary
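
Two of the values in Table 11 can be checked by hand. For n tasks and k models without ties, the Friedman statistic and Kendall's W are related by W = χ² / (n(k − 1)). The sketch below applies this relation and, assuming the p-value comes from the usual chi-square approximation with k − 1 = 3 degrees of freedom, recomputes the p-value with a closed-form tail probability. The "observed" W in Table 11 is derived differently and is not reproduced here.

```python
import math

chi2 = 6.0191   # Friedman chi-square from Table 11
n_tasks = 20    # blocks (images) in the survey subset
k_models = 4    # treatments (models)


def kendalls_w_from_friedman(chi2_stat, n, k):
    """Kendall's W implied by a Friedman statistic: W = chi2 / (n * (k - 1))."""
    return chi2_stat / (n * (k - 1))


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom.

    Closed form for df = 3: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2).
    """
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


w = kendalls_w_from_friedman(chi2, n_tasks, k_models)
p = chi2_sf_df3(chi2)
print(f"W (from Friedman) = {w:.4f}")
print(f"p-value (df = 3)  = {p:.4f}")
```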


A.4     Pairwise Comparison of Models
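
The Holm-adjusted p-values reported in Table 13 can be reproduced from the raw Wilcoxon p-values with the standard step-down procedure. The following is a minimal sketch and not the project's own code; the function name `holm_adjust` is ours, and the raw p-values are copied from Table 13 in row order.

```python
def holm_adjust(pvalues):
    """Holm step-down adjustment.

    Sort p-values ascending, multiply the i-th smallest (0-based) by (m - i),
    cap at 1.0, and enforce monotonicity with a running maximum.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for position, idx in enumerate(order):
        candidate = min(1.0, (m - position) * pvalues[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Raw p-values from Table 13, in the order the model pairs are listed.
raw = [0.864524, 0.088574, 0.095709, 0.085639, 0.103028, 1.0]
adjusted = holm_adjust(raw)
print([round(a, 6) for a in adjusted])
```

With six comparisons, the smallest raw p-value (0.085639) is multiplied by 6, which gives 0.513834. The running maximum then carries that value to the next three pairs, which is why four rows of Table 13 share the same adjusted p-value.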


            Figure 7: Pairwise Adjusted p-values (Holm), Task-Level Inference

model_a                        model_b                        statistic   pvalue      p_adjusted_holm
google/gemini-2.5-flash-lite   meta-llama/llama-4-maverick    100.0       0.864524    1.0
google/gemini-2.5-flash-lite   openai/gpt-4o-mini             59.5        0.088574    0.513834
google/gemini-2.5-flash-lite   qwen/qwen3-vl-8b-instruct      60.5        0.095709    0.513834
meta-llama/llama-4-maverick    openai/gpt-4o-mini             60.0        0.085639    0.513834
meta-llama/llama-4-maverick    qwen/qwen3-vl-8b-instruct      61.5        0.103028    0.513834
openai/gpt-4o-mini             qwen/qwen3-vl-8b-instruct      105.0       1.0         1.0

  Table 13: Pairwise Wilcoxon Signed-Rank Tests between Models with Holm Adjustment

A.5     Reproducibility and Data Availability
All code, datasets, and analysis artefacts supporting this study are openly available under open
licenses at:
    Repository: https://github.com/maehr/chr2025-seeing-history-unseen
    Persistent record: FIXME Zenodo DOI
    The repository provides a complete, executable research pipeline for the CHR 2025 paper
“Seeing History Unseen: Evaluating Vision-Language Models for WCAG-Compliant Alt-Text in
Digital Heritage Collections.”
    Key components:

      • src/: source code for data generation, cleaning, and statistical analysis

      • runs/: timestamped outputs of alt-text generation runs, including raw API responses

      • data/processed/: anonymised survey and ranking data used for evaluation

      • analysis/: statistical summaries, CSVs, and figures referenced in this appendix

      • paper/images/: figure assets for the manuscript

    Reference run: runs/20251021_233530/ — canonical example with subsample configura-
tion (20 media objects × 4 models). All tables and plots in this appendix derive from this run and
subsequent survey analyses.
    A pre-configured GitHub Codespace enables fully containerised reproduction without local
setup. All scripts print output paths and runtime logs to ensure transparent traceability.

A.6     FAIR and CARE Compliance
The project adheres to the FAIR (Findable, Accessible, Interoperable, Reusable) and CARE (Col-
lective Benefit, Authority to Control, Responsibility, Ethics) principles for open humanities data.

      • Findable: Repository indexed on GitHub and Zenodo with persistent DOI; structured meta-
        data and semantic filenames.

      • Accessible: Publicly accessible under AGPL-3.0 (code) and CC BY 4.0 (data, documenta-
        tion). No authentication barriers.

      • Interoperable: Machine-readable CSV, JSONL, and Parquet formats; consistent column
        schemas; human- and machine-readable metadata.

      • Reusable: Version-controlled pipeline, deterministic random seeds, explicit dependencies,
        and complete provenance logs.

      • Collective Benefit: Focus on accessibility and inclusion in digital heritage; results aim to
        improve equitable access to cultural data.

      • Authority to Control: No personal or culturally sensitive material; contributors retain au-
        thorship and citation credit.

      • Responsibility: Transparent methodological reporting and ethical safeguards for AI-assisted
        heritage interpretation.

      • Ethics: Evaluation limited to non-personal, publicly available heritage materials; compli-
        ance with institutional research ethics guidelines.

    Together, these practices ensure that the entire workflow, from model evaluation to figure
generation, is transparent, reproducible, and reusable across digital humanities and accessibility
research contexts.

