Epigraphy is witnessing a growing integration of artificial intelligence, notably through its subfield of machine learning (ML), in tasks such as extracting insights from ancient inscriptions. However, the scarcity of labeled data for training ML algorithms severely limits current techniques, especially for ancient scripts like Old Aramaic. Our research pioneers an innovative methodology for generating synthetic training data tailored to Old Aramaic letters. Our pipeline synthesizes photo-realistic Aramaic letter datasets, incorporating textural features, lighting, damage, and augmentations to mimic real-world inscription diversity. Despite minimal real examples, we engineer a dataset of 250,000 training and 25,000 validation images covering the 22 letter classes of the Aramaic alphabet. This comprehensive corpus provides sufficient data to train a residual neural network (ResNet) to classify highly degraded Aramaic letters. The ResNet model achieves 95% accuracy in classifying real images from the 8th-century BCE Hadad statue inscription. Additional experiments validate performance on varying materials and styles, demonstrating effective generalization. Our results confirm the model's capability to handle diverse real-world scenarios, proving the viability of our synthetic data approach and overcoming the dependence on scarce training data that has constrained epigraphic analysis. Our framework thus elevates interpretation accuracy on damaged inscriptions, enhancing knowledge extraction from these historical resources.
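The augmentation stage of such a pipeline can be illustrated with a minimal sketch. The function below is not the authors' implementation; it is a hypothetical example, assuming letter glyphs are grayscale arrays in [0, 1], of the three kinds of variation the abstract names: uneven lighting, surface texture, and localized damage.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize_variant(glyph, rng):
    """Produce one training variant of a clean letter glyph by
    mimicking real inscriptions: uneven lighting, surface texture
    (noise), and a patch of localized damage. Illustrative only."""
    img = glyph.copy()
    h, w = img.shape
    # Uneven lighting: multiply by a smooth horizontal gradient.
    img *= np.linspace(0.7, 1.0, w)
    # Surface texture: additive Gaussian noise.
    img += rng.normal(0.0, 0.05, img.shape)
    # Localized damage: zero out a random rectangular patch.
    y, x = rng.integers(0, h // 2), rng.integers(0, w // 2)
    img[y:y + h // 4, x:x + w // 4] = 0.0
    return np.clip(img, 0.0, 1.0)

# Build many variants per letter class from one clean exemplar.
clean_glyph = np.ones((32, 32))  # placeholder "letter"
variants = [synthesize_variant(clean_glyph, rng) for _ in range(100)]
```

Repeating this over the 22 letter classes, with many exemplars and parameter ranges, is how a corpus of hundreds of thousands of images can be grown from minimal real data.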
The ERC project DASI aims to digitize the entire epigraphic heritage of the ancient Arabian Peninsula, in order to enhance knowledge of pre-Islamic Arabian languages and cultures. This paper describes the challenges faced and the solutions proposed in the construction of a digital lexicon tool for under-resourced languages such as those attested in the epigraphic documentation of pre-Islamic Arabia.
The paper describes the main challenges faced, and the solutions adopted, within the framework of the DASI project (Digital Archive for the Study of pre-Islamic Arabian Inscriptions). In particular, it discusses the methodological and technological issues that emerged in the conversion from a domain-specific, text-based project for the digital edition of an epigraphic corpus to an objective-driven archive for the study and dissemination of inscriptions in different languages and scripts. With a view to keeping pace with, and possibly fostering, reflection on best practices in the community of digital epigraphers beyond each specific cultural/linguistic domain, special attention is devoted to: the modelling of data and encoding (XML annotation vs. database approach; the conceptual model for the valorization of the material aspects of the epigraph; textual encoding for critical editions); interoperability (pros and cons of compliance with standards; harmonization of metadata; openness; semantic interoperability); and lexicography (tools for under-resourced languages; translations).
VÉgA, the Vocabulaire de l'Égyptien Ancien, is an online dictionary that aims to become an indispensable and up-to-date resource for Egyptology, as well as a platform for scholarly collaboration for decades to come. VÉgA makes it possible to model and represent evolving knowledge of Ancient Egyptian by gathering and cross-referencing words, their attestations, their references, and their exact hieroglyphic spellings. The tool is the result of a public/private collaboration within the framework of the LabEx Archimede at Université Paul-Valéry Montpellier 3. VÉgA stems from a novel approach in the humanities and social sciences, in this case Egyptology, integrating design methods and tools into scientific research. The LabEx Archimede enlisted the expertise of an agency experienced in research through design, which matched the Egyptologists' needs perfectly.
The Tesserae Project offers a free online intertextual search tool for ancient Greek, Latin, and English. In the past, Tesserae has allowed pairwise searching of literary texts in these languages for exact word or lemma similarities. This paper describes two new types of search now offered by Tesserae: by meaning (semantic search) and by sound.
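The intuition behind sound-based matching can be sketched very simply. The following toy example, which is not Tesserae's actual algorithm, scores two phrases by the overlap of their character n-grams, a rough stand-in for shared sound units:

```python
def sound_ngrams(text, n=2):
    """Character n-grams of the letters only, as a crude proxy
    for the sound units of a phrase."""
    s = "".join(ch for ch in text.lower() if ch.isalpha())
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def sound_similarity(a, b, n=2):
    """Jaccard overlap of the two phrases' sound n-grams."""
    ga, gb = sound_ngrams(a, n), sound_ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

# Two nearly identical Latin phrases score high but below 1.0.
score = sound_similarity("arma virumque cano", "arma virumque canit")
```

A real system would rank every pair of text segments by such a score and return the highest-scoring parallels.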
This article reports on the progress and current status of the World Historical Gazetteer (whgazetteer.org) (WHG), focusing on its value for organizing and recording digital and paleographic information. It summarizes the development and functionality of the WHG as a software platform for connecting specialist collections of historical place names, reviews the idea of places as entities (rather than simple objects with single labels), explains the utility of gazetteers in digital library infrastructure, and describes potential future developments.
In this paper we describe the process of including etymological information in a knowledge base of interoperable Latin linguistic resources developed in the context of the LiLa: Linking Latin project. Interoperability is obtained by applying the Linked Open Data principles. In particular, an extensive collection of Latin lemmas is used to link the (distributed) resources. For the etymology, we rely on the Ontolex-lemon ontology and its lemonEty extension to model the information, while the source data are taken from a recent etymological dictionary of Latin. As a result, the collection of lemmas around which LiLa is built now includes 1,465 Proto-Italic and 1,393 Proto-Indo-European reconstructed forms, which are used to explain the history of 1,400 Latin words. We discuss the motivation, methodology, and modeling strategies of the work, as well as its possible applications and potential future developments.
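The shape of such Linked Data modeling can be sketched without an RDF library: each statement is a subject-predicate-object triple. The URIs and property names below are simplified illustrations in the spirit of lemonEty, not actual LiLa identifiers or the exact ontology terms:

```python
# Illustrative namespaces; real LiLa data uses full Ontolex-lemon
# and lemonEty URIs.
ONTOLEX = "http://www.w3.org/ns/lemon/ontolex#"
LEMONETY = "lemonEty#"   # placeholder prefix

triples = set()

def add_etymology(lemma_uri, etymon_uri, stage):
    """Link a Latin lemma to a reconstructed form (e.g. a
    Proto-Italic or Proto-Indo-European etymon) through an
    intermediate etymology node, as lemonEty does."""
    etymology_node = lemma_uri + "/etymology"
    triples.add((lemma_uri, LEMONETY + "etymology", etymology_node))
    triples.add((etymology_node, LEMONETY + "etymon", etymon_uri))
    triples.add((etymon_uri, "rdfs:label", stage))

# Hypothetical example: Latin "pater" back to a PIE reconstruction.
add_etymology("lemma/pater", "etymon/ph2ter", "Proto-Indo-European")
```

Because the lemma URI is the same one used by every other LiLa resource, the etymological layer becomes queryable alongside corpora and lexica.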
Recent years have seen exponential growth (+98% in 2022 compared with the previous year) in the number of research articles on few-shot learning, which aims to train machine learning models with extremely limited data. Research interest in few-shot learning systems for Named Entity Recognition (NER) is thus likewise increasing. NER consists of identifying mentions of pre-defined entities in unstructured text, and serves as a fundamental step in many downstream tasks, such as the construction of Knowledge Graphs or Question Answering. The need for a NER system that can be trained with only a few annotated examples is most urgent in domains where the annotation process requires time, knowledge, and expertise (e.g., healthcare, finance, law), and in low-resource languages. In this survey, starting from a clear definition and description of the few-shot NER (FS-NER) problem, we take stock of the current state of the art and propose a taxonomy that divides algorithms into two macro-categories according to their underlying mechanisms: model-centric and data-centric. For each category, we present the works as a narrative showing how the field is moving toward new research directions. Finally, techniques, limitations, and key aspects are analyzed in depth to facilitate future studies.
This contribution presents a novel approach to the development and evaluation of transformer-based models for Named Entity Recognition and Classification in Ancient Greek texts. We trained two models on annotated datasets, consolidating potentially ambiguous entity types under a harmonized set of classes. We then tested their performance on out-of-domain texts, reproducing a real-world use case. Both models performed very well under these conditions, with the multilingual model being slightly superior to the monolingual one. In the conclusion, we emphasize current limitations due to the scarcity of high-quality annotated corpora and the lack of cohesive annotation strategies for ancient languages.
This paper presents the structure of the LiLa Knowledge Base, i.e. a collection of multifarious linguistic resources for Latin described with a shared vocabulary of knowledge description and interlinked according to the principles of the so-called Linked Data paradigm. Reflecting its highly lexically based nature, the core of the LiLa Knowledge Base is a large collection of Latin lemmas, which serves as the backbone for achieving interoperability between the resources by linking all those entries in lexical resources and tokens in corpora that point to the same lemma. After detailing the architecture supporting LiLa, the paper focuses in particular on how we approach the challenges raised by harmonizing the different lemmatization strategies found in linguistic resources for Latin. As an example of the process of connecting a linguistic resource to LiLa, the inclusion of a dependency treebank in the Knowledge Base is described and evaluated.
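The lemma-as-backbone idea can be made concrete with a small sketch. All identifiers and word forms below are illustrative, not real LiLa URIs: a lexicon entry and corpus tokens become interoperable simply because both point at the same lemma ID.

```python
# The lemma bank is the pivot shared by all resources.
lemma_bank = {"lila:lemma/amor": "amor", "lila:lemma/amo": "amo"}

# A lexical resource maps its entries to lemma IDs...
lexicon = {"amor (love, n.)": "lila:lemma/amor"}

# ...and a treebank maps its tokens to the same lemma IDs.
treebank_tokens = [("amorem", "lila:lemma/amor"),
                   ("amat", "lila:lemma/amo")]

def attestations(lemma_id):
    """All corpus tokens linked to the same lemma as a lexical
    entry, i.e. interoperability through the shared backbone."""
    return [tok for tok, lid in treebank_tokens if lid == lemma_id]

hits = attestations(lexicon["amor (love, n.)"])  # → ["amorem"]
```

Harmonizing lemmatization strategies, as the paper discusses, amounts to deciding which lemma ID each resource's entries and tokens should point to.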