This paper describes the main challenges faced and the solutions adopted within the DASI project (Digital Archive for the study of pre-Islamic Arabian inscriptions). In particular, it discusses the methodological and technological issues that emerged in the conversion from a domain-specific, text-based digital edition of an epigraphic corpus to an objective-driven archive for the study and dissemination of inscriptions in different languages and scripts. With a view to keeping pace with, and possibly fostering reflection on, best practices in the community of digital epigraphers beyond each specific cultural or linguistic domain, special attention is devoted to three areas: data modelling and encoding (XML annotation vs. database approach; the conceptual model for highlighting the material aspect of the epigraph; the textual encoding for critical editions); interoperability (pros and cons of compliance with standards; harmonization of metadata; openness; semantic interoperability); and lexicography (tools for under-resourced languages; translations).
This article discusses the challenges addressed in the digital scholarly encoding of the fragmentary texts of the languages of Ancient Italy in XML, following the TEI/EpiDoc Guidelines. It describes the solutions and customisations adopted to deal with the peculiarities of our epigraphical documentation and to formalise the epigraphical information deemed relevant for data retrieval from a historical-linguistic perspective. The digital corpus, consisting of new critical editions of selected inscriptions, is being created in the context of the project “Languages and Cultures of Ancient Italy. Historical Linguistics and Digital Models”, which aims to investigate the languages of Ancient Italy by combining the traditional methods of historical linguistics with the methods and technologies of the digital humanities and computational lexicography. More specifically, the purpose of the project is to create a set of interrelated digital language resources comprising: (1) a digital corpus of text editions; (2) a computational lexicon compliant with Semantic Web requirements; (3) a dataset of relevant bibliographic references encoded according to the FRBRoo/LRMoo specifications. Additionally, selected textual data and scientific interpretations will be encoded using CIDOC CRM and its extensions, namely CRMtex and CRMinf. The present contribution thus tackles one of the main aspects of the project and proposes significant innovations in the encoding of critical editions of epigraphic texts in fragmentary languages. These innovations are intended to foster future interoperability and integration with external datasets, a paramount concern of the project.
This paper provides an overview of diverse applications of parallel corpora to ancient languages, particularly Ancient Greek. In the first part, we present the fundamental principles of parallel corpora and give a short overview of their applications in the study of ancient texts. In the second part, we illustrate how to leverage parallel corpora to perform various NLP tasks, including automatic translation alignment, dynamic lexicon induction, and Named Entity Recognition. In the conclusions, we discuss current limitations and directions for future work.