Digital Corpus for Graeco-Arabic Studies

About the Digital Corpus

The composition of the corpus


The Digital Corpus currently contains ca. 1.2M Arabic and 3.3M Greek words in about 230 texts, three fifths of which are Greek and the rest Arabic. The texts range in length from a couple of pages to several hundred pages, and they represent more than 180 works by 28 authors. In addition to Greek and Arabic primary sources, the corpus contains a number of Arabic secondary sources, mainly commentaries on ancient Greek writings, other important secondary works and major bio-bibliographical sources.

The choice of texts was mainly determined by three factors: the importance of their authors for the Greek-Arabic translation movement; the availability of printed editions; and the copyright status of these editions. The corpus therefore consists for the most part of editions of major authors that were easily accessible and not subject to copyright restrictions. The most important sources for Greek texts were the complete editions of Galen by Karl Gottlob Kühn (1821–1833), of Hippocrates by Émile Littré (1839–1861) and of Aristotle by Immanuel Bekker (1831).

The texts assembled in the corpus cover a wide range of subjects, but as a result of availability and copyright considerations, philosophical and medical works, especially by Aristotle, Galen and Hippocrates, are particularly prominent. The corpus also contains a sizable sample of mathematical texts. Other fields represented by one or more texts are astronomy, biology, zoology and psychology as well as doxography.

Modern editors of most of the Arabic translations included in the corpus have attempted to ascribe or at least date these texts based on internal evidence (style, terminology etc.) and external evidence (information from bio-bibliographical sources etc.). According to these ascriptions, the translators who worked at the height of the translation movement during the second half of the ninth and the beginning of the tenth century are particularly well represented in the corpus, most importantly the members of the circle of translators working with Ḥunayn ibn Isḥāq (d. 873). They created almost all of the medical texts in the corpus, a substantial part of the Aristotelian Organon and a number of texts by Alexander of Aphrodisias. Earlier and later phases of the translation movement, however, are also well represented, e.g. through translations of Alexander of Aphrodisias, parts of Aristotle’s Organon and his zoological writings and many of the pseudonymous texts.

The creation of the corpus


The Digital Corpus for Graeco-Arabic Studies is the result of a collaborative project at Harvard University and Tufts University, funded by the Andrew W. Mellon Foundation. The main aim of the project was the creation of a public-domain corpus of Greek and Arabic philosophical and scientific works. It was initiated and supervised by Mark J. Schiefsky at the Department of the Classics, Harvard University, and Gregory R. Crane, then at the Department of Classics, Tufts University. Uwe Vagelpohl, Department of Classics, University of Warwick, was responsible for assembling the Arabic corpus, vetting and tagging the raw texts and importing the corpus into the Digital Corpus database.

The initial step consisted of identifying a sample of printed editions of Greek and Arabic works that were both representative of the output of the Greek-Arabic translation movement and easily available and out of copyright. These editions were then digitised by a data entry company, Digital Divide Data (DDD). The raw texts delivered by DDD were checked for errors, cleaned up and XML-tagged. The tagging includes structural information about the source edition (e.g. page numbers), any internal divisions (books, chapters) of the source text and requisite metadata.
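The kind of tagging described above can be illustrated with a minimal sketch. It is not the project's actual schema: the element and attribute names (`text`, `metadata`, `div`, `pb`, `p`), the function name `tag_text` and the sample values are all assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def tag_text(title, author, chapters):
    """Wrap plain-text chapters in a minimal XML structure.

    chapters: list of (chapter_number, page_number, content) tuples.
    All element and attribute names are hypothetical, not the
    corpus's real markup.
    """
    root = ET.Element("text")

    # Metadata about the work.
    meta = ET.SubElement(root, "metadata")
    ET.SubElement(meta, "title").text = title
    ET.SubElement(meta, "author").text = author

    # Internal divisions of the source text.
    body = ET.SubElement(root, "body")
    for number, page, content in chapters:
        div = ET.SubElement(body, "div", {"type": "chapter", "n": str(number)})
        # Page of the printed edition on which the chapter begins.
        ET.SubElement(div, "pb", {"n": str(page)})
        ET.SubElement(div, "p").text = content
    return root


# Illustrative input only; title, author and page numbers are placeholders.
root = tag_text(
    "Sample title",
    "Sample author",
    [(1, 10, "First chapter."), (2, 14, "Second chapter.")],
)
xml_string = ET.tostring(root, encoding="unicode")
```

Keeping edition page numbers as separate marker elements, rather than as containers, lets page breaks fall anywhere inside a chapter without disturbing the chapter hierarchy.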

Parallel texts, which include pairs of Greek texts and their corresponding translations as well as different editions of the same Greek or Arabic text, were then aligned at the level of chapters. A small number of texts, e.g. Hippocrates’ Aphorisms and the pseudo-Menandrian Sentences, were further aligned at the sentence level.
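As a rough illustration of chapter-level alignment, the sketch below pairs two versions of a text by shared chapter label. It assumes each version is available as a mapping from chapter label to chapter text; the function name `align_by_chapter` and the sample data are hypothetical and do not describe the project's actual alignment procedure.

```python
def align_by_chapter(greek, arabic):
    """Pair chapters of two versions of a text by shared chapter label.

    greek, arabic: dicts mapping a chapter label to the chapter text.
    A chapter missing from one version is paired with None.
    """
    # Keep the order of the first version, then append labels
    # that occur only in the second.
    labels = list(greek)
    labels += [label for label in arabic if label not in greek]
    return [(label, greek.get(label), arabic.get(label)) for label in labels]


# Placeholder chapter contents, for illustration only.
greek = {"1": "Greek chapter 1", "2": "Greek chapter 2"}
arabic = {"1": "Arabic chapter 1", "3": "Arabic chapter 3"}
aligned = align_by_chapter(greek, arabic)
```

Sentence-level alignment would apply the same idea one level down, pairing sentence units within each aligned chapter.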

The final step was the creation of the database and web page infrastructure by net4media. This includes the front end for viewing and downloading texts from the corpus and the back end for adding or modifying texts and adding metadata.

Contribute to the corpus


The Digital Corpus is a work in progress. We welcome submissions of Greek and Arabic digital texts in any format. Copyrighted texts can also be included in the database and can be viewed or searched, but they will not be made available for download. Please contact us if you want to find out more.

Project members


Mark Schiefsky

Harvard University


Gregory R. Crane

Universität Leipzig


Uwe Vagelpohl

University of Warwick


Design and programming