Epigraphy is witnessing a growing integration of artificial intelligence, notably through its subfield of machine learning (ML), in tasks such as extracting insights from ancient inscriptions. However, the scarcity of labeled data for training ML algorithms severely limits current techniques, especially for ancient scripts like Old Aramaic. Our research pioneers an innovative methodology for generating synthetic training data tailored to Old Aramaic letters. Our pipeline synthesizes photo-realistic Aramaic letter datasets, incorporating textural features, lighting, damage, and augmentations to mimic real-world inscription diversity. Despite minimal real examples, we engineer a dataset of 250,000 training and 25,000 validation images covering the 22 letter classes of the Aramaic alphabet. This comprehensive corpus provides sufficient data to train a residual neural network (ResNet) to classify highly degraded Aramaic letters. The ResNet model achieves 95% accuracy in classifying real images from the 8th-century BCE Hadad statue inscription. Additional experiments validate performance on varying materials and styles, demonstrating effective generalization. Our results confirm the model's capability to handle diverse real-world scenarios, proving the viability of our synthetic data approach and overcoming the dependence on scarce training data that has constrained epigraphic analysis. Our framework thus elevates interpretation accuracy on damaged inscriptions, enhancing knowledge extraction from these historical resources.
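The augmentation stage of such a pipeline can be illustrated with a minimal sketch. The function below is not the authors' implementation; it is a hypothetical example, assuming letter glyphs are grayscale arrays in [0, 1], of the three kinds of variation the abstract names: uneven lighting, surface texture, and localized damage.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize_variant(glyph, rng):
    """Produce one training variant of a clean letter glyph by
    mimicking real inscriptions: uneven lighting, surface texture
    (noise), and a patch of localized damage. Illustrative only."""
    img = glyph.copy()
    h, w = img.shape
    # Uneven lighting: multiply by a smooth horizontal gradient.
    img *= np.linspace(0.7, 1.0, w)
    # Surface texture: additive Gaussian noise.
    img += rng.normal(0.0, 0.05, img.shape)
    # Localized damage: zero out a random rectangular patch.
    y, x = rng.integers(0, h // 2), rng.integers(0, w // 2)
    img[y:y + h // 4, x:x + w // 4] = 0.0
    return np.clip(img, 0.0, 1.0)

# Build many variants per letter class from one clean exemplar.
clean_glyph = np.ones((32, 32))  # placeholder "letter"
variants = [synthesize_variant(clean_glyph, rng) for _ in range(100)]
```

Repeating this over the 22 letter classes, with many exemplars and parameter ranges, is how a corpus of hundreds of thousands of images can be grown from minimal real data.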
The ERC project DASI aims to digitize the entire epigraphic heritage of the ancient Arabian Peninsula, in order to enhance knowledge of pre-Islamic Arabian languages and cultures. This paper describes the challenges faced and the solutions proposed in the construction of a digital lexicon tool for under-resourced languages such as those attested in the epigraphic documentation of pre-Islamic Arabia.
The paper describes the main challenges faced, and the solutions adopted, within the framework of the DASI project (Digital Archive for the Study of pre-Islamic Arabian Inscriptions). In particular, it discusses the methodological and technological issues that emerged in the conversion from a domain-specific, text-based project for the digital edition of an epigraphic corpus to an objective-driven archive for the study and dissemination of inscriptions in different languages and scripts. With a view to keeping pace with, and possibly fostering, reflection on best practices in the community of digital epigraphers beyond each specific cultural/linguistic domain, special attention is devoted to: the modelling of data and encoding (XML annotation vs. database approach; the conceptual model for the valorization of the material aspects of the epigraph; textual encoding for critical editions); interoperability (pros and cons of compliance with standards; harmonization of metadata; openness; semantic interoperability); and lexicography (tools for under-resourced languages; translations).
VÉgA, the Vocabulaire de l'Égyptien Ancien, is an online dictionary that aims to become an indispensable and up-to-date resource for Egyptology, as well as a platform for scholarly collaboration for decades to come. VÉgA makes it possible to model and represent evolving knowledge of Ancient Egyptian by gathering and cross-referencing words, their attestations, their references, and their exact hieroglyphic spellings. The tool is the result of a public/private collaboration within the framework of the LabEx Archimede at Université Paul-Valéry Montpellier 3. VÉgA stems from a novel approach in the humanities and social sciences, in this case Egyptology, integrating design methods and tools into scientific research. The LabEx Archimede enlisted the expertise of an agency experienced in research through design, which matched the Egyptologists' needs perfectly.
The Tesserae Project offers a free online intertextual search tool for ancient Greek, Latin, and English. In the past, Tesserae has allowed pairwise searching of literary texts in these languages for exact word or lemma similarities. This paper describes two new types of search now offered by Tesserae: by meaning (semantic search) and by sound.
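The intuition behind sound-based matching can be sketched very simply. The following toy example, which is not Tesserae's actual algorithm, scores two phrases by the overlap of their character n-grams, a rough stand-in for shared sound units:

```python
def sound_ngrams(text, n=2):
    """Character n-grams of the letters only, as a crude proxy
    for the sound units of a phrase."""
    s = "".join(ch for ch in text.lower() if ch.isalpha())
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def sound_similarity(a, b, n=2):
    """Jaccard overlap of the two phrases' sound n-grams."""
    ga, gb = sound_ngrams(a, n), sound_ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

# Two nearly identical Latin phrases score high but below 1.0.
score = sound_similarity("arma virumque cano", "arma virumque canit")
```

A real system would rank every pair of text segments by such a score and return the highest-scoring parallels.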
This article reports on the progress and current status of the World Historical Gazetteer (whgazetteer.org) (WHG), focusing on its value for organizing and recording digital and paleographic information. It summarizes the development and functionality of the WHG as a software platform for connecting specialist collections of historical place names, reviews the idea of places as entities (rather than simple objects with single labels), explains the utility of gazetteers in digital library infrastructure, and describes potential future developments.
In this paper we describe the process of including etymological information in a knowledge base of interoperable Latin linguistic resources developed in the context of the LiLa: Linking Latin project. Interoperability is obtained by applying the Linked Open Data principles. In particular, an extensive collection of Latin lemmas is used to link the (distributed) resources. For the etymology, we rely on the Ontolex-lemon ontology and its lemonEty extension to model the information, while the source data are taken from a recent etymological dictionary of Latin. As a result, the collection of lemmas around which LiLa is built now includes 1,465 Proto-Italic and 1,393 Proto-Indo-European reconstructed forms, which are used to explain the history of 1,400 Latin words. We discuss the motivation, methodology, and modeling strategies of the work, as well as its possible applications and potential future developments.
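The shape of such Linked Data modeling can be sketched without an RDF library: each statement is a subject-predicate-object triple. The URIs and property names below are simplified illustrations in the spirit of lemonEty, not actual LiLa identifiers or the exact ontology terms:

```python
# Illustrative namespaces; real LiLa data uses full Ontolex-lemon
# and lemonEty URIs.
ONTOLEX = "http://www.w3.org/ns/lemon/ontolex#"
LEMONETY = "lemonEty#"   # placeholder prefix

triples = set()

def add_etymology(lemma_uri, etymon_uri, stage):
    """Link a Latin lemma to a reconstructed form (e.g. a
    Proto-Italic or Proto-Indo-European etymon) through an
    intermediate etymology node, as lemonEty does."""
    etymology_node = lemma_uri + "/etymology"
    triples.add((lemma_uri, LEMONETY + "etymology", etymology_node))
    triples.add((etymology_node, LEMONETY + "etymon", etymon_uri))
    triples.add((etymon_uri, "rdfs:label", stage))

# Hypothetical example: Latin "pater" back to a PIE reconstruction.
add_etymology("lemma/pater", "etymon/ph2ter", "Proto-Indo-European")
```

Because the lemma URI is the same one used by every other LiLa resource, the etymological layer becomes queryable alongside corpora and lexica.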
Recent years have seen exponential growth (+98% in 2022 compared with the previous year) in the number of research articles on few-shot learning, which aims to train machine learning models with extremely limited data. Research interest in few-shot learning systems for Named Entity Recognition (NER) is thus likewise increasing. NER consists of identifying mentions of pre-defined entities in unstructured text, and serves as a fundamental step in many downstream tasks, such as the construction of Knowledge Graphs or Question Answering. The need for a NER system that can be trained with only a few annotated examples is most urgent in domains where the annotation process requires time, knowledge, and expertise (e.g., healthcare, finance, law), and in low-resource languages. In this survey, starting from a clear definition and description of the few-shot NER (FS-NER) problem, we take stock of the current state of the art and propose a taxonomy that divides algorithms into two macro-categories according to their underlying mechanisms: model-centric and data-centric. For each category, we present the works as a narrative showing how the field is moving toward new research directions. Finally, techniques, limitations, and key aspects are analyzed in depth to facilitate future studies.
This contribution presents a novel approach to the development and evaluation of transformer-based models for Named Entity Recognition and Classification in Ancient Greek texts. We trained two models on annotated datasets, consolidating potentially ambiguous entity types under a harmonized set of classes. We then tested their performance on out-of-domain texts, reproducing a real-world use case. Both models performed very well under these conditions, with the multilingual model being slightly superior to the monolingual one. In the conclusion, we emphasize current limitations due to the scarcity of high-quality annotated corpora and the lack of cohesive annotation strategies for ancient languages.
This paper presents the structure of the LiLa Knowledge Base, i.e. a collection of multifarious linguistic resources for Latin described with a shared vocabulary of knowledge description and interlinked according to the principles of the so-called Linked Data paradigm. Reflecting its highly lexically based nature, the core of the LiLa Knowledge Base is a large collection of Latin lemmas, which serves as the backbone for achieving interoperability between the resources by linking all those entries in lexical resources and tokens in corpora that point to the same lemma. After detailing the architecture supporting LiLa, the paper focuses in particular on how we approach the challenges raised by harmonizing the different lemmatization strategies found in linguistic resources for Latin. As an example of the process of connecting a linguistic resource to LiLa, the inclusion of a dependency treebank in the Knowledge Base is described and evaluated.
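The lemma-as-backbone idea can be made concrete with a small sketch. All identifiers and word forms below are illustrative, not real LiLa URIs: a lexicon entry and corpus tokens become interoperable simply because both point at the same lemma ID.

```python
# The lemma bank is the pivot shared by all resources.
lemma_bank = {"lila:lemma/amor": "amor", "lila:lemma/amo": "amo"}

# A lexical resource maps its entries to lemma IDs...
lexicon = {"amor (love, n.)": "lila:lemma/amor"}

# ...and a treebank maps its tokens to the same lemma IDs.
treebank_tokens = [("amorem", "lila:lemma/amor"),
                   ("amat", "lila:lemma/amo")]

def attestations(lemma_id):
    """All corpus tokens linked to the same lemma as a lexical
    entry, i.e. interoperability through the shared backbone."""
    return [tok for tok, lid in treebank_tokens if lid == lemma_id]

hits = attestations(lexicon["amor (love, n.)"])  # → ["amorem"]
```

Harmonizing lemmatization strategies, as the paper discusses, amounts to deciding which lemma ID each resource's entries and tokens should point to.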