Reading Traditional Sources with Nontraditional Methods

    Back to al-Dhahabī’s Ta’rīkh al-islām. The present dynamic cartogram shows how the prominence of major urban centers changed over time. The focus is again on “descriptive names” (nisba), and the “size” of each urban center on the cartogram reflects the number of individuals whose “descriptive names” refer to that urban center. A “prominent center” in the current dataset is a place with which at least 10 individuals from Ta’rīkh al-islām are associated [1] (the overall number of individuals in the current dataset is slightly over 29,000 for the period of 661–1300 CE). Each frame features the names of the top 15 urban centers (the largest among them gradually change their hue from green to red).
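
    The selection rule described above (at least 10 individuals per place, top 15 places per frame) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name, the (person id, place) record layout and the toy records are hypothetical and are not taken from the actual dataset or code behind the cartogram.

```python
from collections import Counter

def prominent_centers(records, min_individuals=10, top_n=15):
    """Return up to top_n (place, count) pairs with at least min_individuals people.

    records: iterable of (person_id, place) pairs, one per toponymic nisba.
    """
    people_by_place = {}
    for person_id, place in records:
        # A set ensures that each individual is counted once per place.
        people_by_place.setdefault(place, set()).add(person_id)
    counts = Counter({place: len(ids) for place, ids in people_by_place.items()})
    qualifying = [(place, n) for place, n in counts.most_common() if n >= min_individuals]
    return qualifying[:top_n]

# Toy records for a single time frame (invented, not real data).
records = (
    [(i, "Baghdād") for i in range(12)]
    + [(100 + i, "al-Kūfa") for i in range(10)]
    + [(200 + i, "Wāsiṭ") for i in range(3)]
)
records.append((0, "Baghdād"))  # a repeated mention of the same person

print(prominent_centers(records))
```

    In this toy frame only Baghdād and al-Kūfa pass the threshold of 10 individuals; Wāsiṭ, with 3, is left out.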

    The cradle of Islam, central Arabia is the most prominent region in the early period. Major urban centers of this region are Mecca/Makka (269) and Medina/al-Madīna (691), but their cultural prominence soon shifts to the main garrison cities of lower Iraq: Kufa/al-Kūfa and Basra/al-Baṣra. The decline of central Arabia starts around 100/719 CE and by 250/865 CE this region is diminished to a marginal province. (The south Arabian cluster displays a similar trend.)

    Major Early Bloomers: Medina/al-Madīna (691); Kufa/al-Kūfa (1,432), and Basra/al-Baṣra (1,595).

    Iraq very quickly becomes the central region and maintains this status for most of the period covered in Ta’rīkh al-islām. During the early period its prominent urban centers are Basra/al-Baṣra (1,595) and Kufa/al-Kūfa (1,432), but the prominence of these garrison towns is soon dwarfed by Baghdad, the new capital city, and they practically disappear from the social map of the Islamic world by around 300/913 CE. Baghdad remains the dominant urban center not only for Iraq, but for the entire Islamic world until the beginning of the 13th century CE. Other major urban centers of this region are Wāsiṭ (401) and al-Anbār (83).

    The rapid growth of Iraq comes to a halt around 200/816 CE. At this time the Caliphate is being torn apart by the civil war between al-Amīn and al-Maʾmūn, the sons of the great Hārūn al-Rashīd (r. 786-809 CE), who decided to divide the Empire between them. The province falls into clearly visible decline. In the course of the 9th century power slips from the ʿAbbāsid caliphs: first into the hands of the military commanders of their slave armies, then into those of the Būyids (932-1055 CE) and the Saljūqs (1038-1194 CE).

    480/1088 CE marks the beginning of a century-long recovery for Iraq: the ʿAbbāsid caliphs gradually manage to shake off the “protectorship” of the military (at this point, the Saljūq sulṭāns) and temporarily regain their independence. Caliphs, sulṭāns, and viziers (wazīr) vie for influence with each other, seeking the support of religious scholars and relying on various mechanisms of promoting different legal schools (respectively, the Ḥanbalīs, the Ḥanafīs, and the Shāfiʿīs). The data from Ta’rīkh al-islām shows that it is during this period that these groups start growing quite noticeably.

    The number of Baghdādīs drops quite noticeably before the Mongol sack of the capital city in 656/1258 CE. Numbers of deaths reported for the 20-lunar-year periods after 600/1204 CE: 244 for 600-620 AH/1204-1224 CE; 256 for 621-640 AH/1225-1243 CE; 98 for 641-660 AH/1244-1262 CE; 27 for 661-680 AH/1263-1282 CE; 51 for 681-700 AH/1283-1301 CE.
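
    For readers who want to reproduce this kind of period labelling, here is a minimal sketch. It relies on a common linear rule of thumb for converting AH years to CE years (CE ≈ AH − AH/33 + 622); rounding up is an assumption that happens to reproduce the year pairs quoted in the list above, and it is not necessarily how those pairs were computed. Exact dates, such as 656/1258 for the sack of Baghdad, need a proper calendar conversion. The function names and the toy death years are hypothetical.

```python
import math
from collections import Counter

def ah_to_ce(ah_year):
    """Approximate CE year for an AH year (linear rule of thumb, rounded up)."""
    return math.ceil(ah_year - ah_year / 33 + 622)

def period_label(start_ah, end_ah):
    """Format a period as 'start-end AH/start-end CE'."""
    return f"{start_ah}-{end_ah} AH/{ah_to_ce(start_ah)}-{ah_to_ce(end_ah)} CE"

def count_by_period(death_years_ah, periods):
    """Count death years falling into each inclusive (start, end) AH period."""
    counts = Counter()
    for year in death_years_ah:
        for start, end in periods:
            if start <= year <= end:
                counts[(start, end)] += 1
                break
    return counts

periods = [(621, 640), (641, 660), (661, 680), (681, 700)]
toy_death_years = [625, 630, 641, 699, 700]  # invented values, not from the dataset

for period, n in sorted(count_by_period(toy_death_years, periods).items()):
    print(period_label(*period), n)
```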

    By the end of the period covered in Ta’rīkh al-islām, Iraqi élites drastically decrease in numbers, practically disappearing from the social map of the Islamic world. Although the Mongol invasion is often considered the main cause, the data from Ta’rīkh al-islām shows that the ranks of Iraqi élites start thinning well before the coming of the Mongols. Despite these vicissitudes, the number of notable men in Iraq remains quite significant over most of our period, and the prominence of Iraq is rivaled only by Iran, with all its clusters combined.

    Major “middle bloomers,” Iranian provinces gain prominence between 100/719 CE and 200/816 CE. The curve of northeastern Iran (Khurāsān) reaches its highest point quite quickly, around 200/816 CE, and remains there, fluctuating slightly, for over three centuries before going into a rapid decline after 520/1127 CE. It takes longer for northwestern Iran to reach its peak (around 350/962 CE), and then it slowly goes down. Unlike northeastern Iran, it is still visible on the maps of the Islamic world by the end of our period. The curve of southwestern Iran reaches its highest point around 280/894 CE, then goes into a temporary decline during the 4th/10th century, recovers by 400/1010 CE and begins to go down slowly, increasing its pace of decline around 520/1127 CE. The major urban centers are: Nishapur/Naysābūr (1,038), Merv/Marw [al-shāhijān] (385), Herat/Harāt (392), Balkh (171) and Ṭūs (136) in northeastern Iran (Khurāsān); Rey/al-Rayy (280), Hamadhān (254) and Qazwīn (118) in northwestern Iran; and Isfahan/Iṣbahān (1,124) and Shīrāz (100) in southwestern Iran.

    Major Middle Bloomers: Baghdād (3,086); Isfahan/Iṣbahān (1,100); Nishapur/Naysābūr (1,038); Cordova/Qurṭuba (634); Andalusia/al-Andalus (582).

    The curves of Iranian clusters correspond to what scholars of Islam often refer to as the “Iranian intermezzo,” [2] a period of Iranian independent dynasties (roughly 750-1150 CE): the Ṭāhirids (821-873), Ṣaffārids (867-903) and Sāmānids (875-999) in the east and the Būyids (932-1055) in the north and west. All Iranian clusters practically come to naught by the end of the period covered in Ta’rīkh al-islām.

    The two-peaked curve of the last “middle bloomer,” al-Andalus, seems to correspond to the zenith of the Umayyad caliphate in Spain (756-1031 CE) around 380/991 CE, followed by its disintegration and the recovery under the Almoravids/al-Murābiṭūn (1056-1147 CE) and the Almohads/al-Muwaḥḥidūn (1130-1269 CE)—beginning around 470/1078 CE and peaking around 590/1195 CE; after that Andalusia is erased from the map of the Islamic world by the Christian Reconquista. The major Andalusian urban centers are Cordova/Qurṭuba (633), Seville/Ishbīliya (248), Valencia/Balansiyya (141) and Toledo/Ṭulayṭila (89).

    Major Late Bloomers: Damascus/Dimashq (1,573); Egypt/Miṣr (1,501); Alexandria/al-Iskandarīya (212).

    Regional clusters that can be characterized as “late bloomers” often have earlier peaks of prominence: around 100/719 CE for Syria, when the first great Islamic dynasty, the Umayyads (661-750 CE), rules from there; around 200/816 CE for the Jazīra and Jordan; and around 300/913 CE for Egypt. These peaks are followed by an equally noticeable decline until around 500/1107 CE. However, their main peaks of prominence fall at the end of the period, by which time the “late bloomers” form what can be considered one continuous crescent-shaped macro-region stretching from Egypt/Miṣr in the south, through Jordan/al-Urdunn, Syria/al-Shām, the Jazīra/Upper Mesopotamia and the northern part of Iraq, to the very south of the Caucasian cluster in the north, and even touching northwestern Iran (Zanjān). The prominence of these regions rises noticeably after 500/1107 CE, right at the onset of the rule of dynasties that unify the region: the Zangids (1127-1222 CE), the Ayyūbids (1169-1250 CE), and the Mamlūks (1250-1517 CE). The major urban centers are: Mosul/al-Mawṣil (313) and Ḥarrān (224) in the Jazīra; Damascus/Dimashq (1,769), Homs/Ḥimṣ (268), Aleppo/Ḥalab (231) and Hamah/Ḥamā (103) in Syria; Jerusalem/al-Quds (315) in Jordan; and Alexandria/al-Iskandariyya (211) in Egypt/Miṣr. [3] By the end of the main period covered in the Ta’rīkh al-islām, Syria becomes the new center of the Islamic world, with Egypt being next in line.

    The Eastern Urban Crescent of the 7th/13th Century. A similar shift toward the Mediterranean shore happens with the western urban centers a century earlier (most clearly visible in Andalusia). This return to the Mediterranean can be interpreted as a sign of the formation of a new Mediterranean commonwealth, with the Italian “Maritime Republics” (Genoa, Pisa, Venice, Amalfi and others) actively trading in the region.

    Cited Works

    al-Dhahabī (1990):
    al-Dhahabī. Ta’rīkh al-islām wa-wafayāt al-mashāhīr wa-al-a‘lām. Dār al-Kitāb al-‘Arabī, Bayrūt, 2nd edition, 1990.

    Minorsky (1953):
    Vladimir Minorsky. Studies in Caucasian History. CUP Archive, January 1953.

    al-Samʿānī (1998):
    al-Sam‘ānī. al-Ansāb. 5 vols. Bayrūt: Dār al-fikr, 1998.

    Romanov (2013):
    Maxim G. Romanov. Computational Reading of Arabic Biographical Collections with Special Reference to Preaching (661-1300 CE). Ph.D., University of Michigan, Ann Arbor, MI, 2013.

    Current Dataset

    Source: Ta’rīkh al-islām of al-Dhahabī (d. 748/1347)
    Period:  41-700 AH / 661-1300 CE (Volumes 4-52)
    Biographies: ~29,000
    Unique Nisbas: ~700
    Total number of Nisbas: over 70,000


    1. The nature of nisbas is not unproblematic, and anyone who has worked with biographical collections is likely to object that, for example, not every individual identified as “al-Madanī” was actually a Medinan; besides, there definitely are Medinans who are not identified as such with this specific toponymic nisba, not to mention that the “descriptive name” al-Madanī (and its variation al-Madīnī) may refer to urban centers other than Medina. (See, for example, al-Samʿānī (1998), 5:235–239.) While such objections are not invalid, at this point of our knowledge and understanding of the overabundant biographical data from Arabic sources we simply do not know to what extent the presence of false positives (i.e., Madanīs who have nothing to do with Medina) and of false negatives (i.e., Medinans who are not identified as Madanīs) actually affects the overall picture. Working with big data requires some clearly identified methodological assumptions regarding the types of data used in modeling. My computational analysis of data from the Ta’rīkh al-islām yields about 700 unique nisbas (with over 300 toponymic ones) that identify at least 10 different individuals, while the overall number of instances of these nisbas runs to over 70,000, considering that individuals are often described with more than one nisba. While 70,000 data points can hardly be called “big data” by any scientific standards, this dataset is too big to make exact identification of each and every nisba possible. Thus, under these circumstances, treating nisbas at face value is simply the most logical way to begin large-scale analysis of biographical data from Arabic sources; as our knowledge about the “behavior” of nisbas in biographical collections improves (and this can be achieved only through large-scale exploratory analysis), these methodological assumptions can and will be adjusted. For a detailed discussion of methodological assumptions, see Romanov (2013), 28–40.
    2. The term was introduced by Vladimir Minorsky (Minorsky (1953), 110-116).
    3. Cairo/al-Qāhira is not yet identifiable through onomastic data; most individuals from Egypt have the nisba al-Miṣrī (1,501), which associates them with the entire province. Although this nisba may also refer to Cairo, at the moment it does not appear possible to differentiate efficiently.


    A Screenshot of al-Thurayyā. Click on the screenshot to open the Gazetteer in full screen.

    This is our first usable demo of al-Thurayyā Gazetteer. Currently it includes over 2,000 toponyms and almost as many route sections georeferenced from Georgette Cornu’s Atlas du monde arabo-islamique à l'époque classique: IXe-Xe siècles (Leiden: Brill, 1983). The gazetteer is searchable (upper left corner), although English equivalents are not yet included; in other words, look for Dimashq/دمشق, not Damascus.

    You can browse the Gazetteer by clicking on any toponym marker. The popup will show the toponym both in Arabic script and transliterated. We are using a slightly modified transliteration system that facilitates conversion between fully transliterated, transliterated, and Arabic forms of toponyms. It should be easily understandable. Because of the way the data was generated there may be typos, so please let us know if something should be corrected. The popup also offers a selection of possible sources on the toponym in question. You can check Arabic sources: currently, al-Samʿānī’s Kitāb al-ansāb and Yāqūt’s Muʿjam al-buldān. For now, the Gazetteer only checks for exact matches, which means that in some cases there will not be any entry at all, while in other cases there may be more than one, and they may refer to other places with the same name. Improving the precision of this lookup is on our to-do list. You can also check whether there is information on the toponym in Brill's Encyclopaedia of Islam, Pleiades, and Wikipedia.
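
    The consequences of exact matching can be illustrated with a small sketch. It is not the Gazetteer's actual lookup code: the function names and the placeholder entries are hypothetical, and the only point is that an exact-match lookup can return no entry, one entry, or several entries for homonymous places.

```python
from collections import defaultdict

def build_index(entries):
    """Group source entries under their exact headword."""
    index = defaultdict(list)
    for headword, entry in entries:
        index[headword].append(entry)
    return index

def exact_lookup(index, toponym):
    """Return every entry whose headword equals the toponym exactly."""
    return index.get(toponym, [])

# Invented placeholder entries, not taken from any of the sources named above.
entries = [
    ("Dimashq", "entry A"),
    ("al-Madīnaŧ", "entry B"),
    ("al-Madīnaŧ", "entry C (a different place with the same name)"),
]
index = build_index(entries)

for query in ("Dimashq", "al-Madīnaŧ", "Damascus"):
    print(query, len(exact_lookup(index, query)))
```

    The English form "Damascus" finds nothing because only the exact headword is compared, which is also why the search box expects Dimashq rather than Damascus.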

    Credits & Acknowledgments

    Many thanks to Adam Tavares (programmer @ Perseus Project, Tufts) and, particularly, Cameron Jackson (senior, double-majoring in Arabic and Computer Science, Tufts) for the technical development; to Vickie Sullivan (Chair, Classics Department), Gregory Crane and the entire Perseus team on both sides of the Atlantic for support and inspiration.

  • 02/06/15--16:00: BetaCode for Arabic
  • Arabic betaCode

    Although both Windows and Mac OS now support Arabic, it is still quite difficult to type and edit Arabic texts. It is particularly frustrating to edit and manipulate fully vocalized texts, since most fonts either render “short vowels” (ḥarakāt) invisible, or do not render them properly. Because of the “stacking,” i.e. “short vowels” being placed on top of letters and on top of each other, it becomes impossible to edit texts and one is often forced to go into delete-and-retype mode (and there is still no guarantee, because of visual issues, that all the letters and “short vowels” will actually be in the right order). betaCode can make it easy to type fully-vocalized Arabic texts on any machine through the use of simple character combinations and automatic rendering into various transliteration schemes and the Arabic script (scroll below for examples).

    betaCode is first converted into a one-to-one transliteration scheme, which combines conventions from various academic transliteration schemes. Such a scheme is necessary, since none of the existing academic schemes (American/Library of Congress, British, French, German, etc.) allows representing Arabic text unambiguously for computational purposes. The one-to-one transliteration can then be converted into any transliteration convention. At the moment the following schemes are implemented:

    • Library of Congress Romanization of Arabic
    • Simplified transliteration (LOC without diacritics)
    • Arabic script (the rules of hamzaŧ orthography are implemented, but may require some additional testing)

    NB: The idea of betaCode is borrowed from the Classicists who developed a method of representing, using only ASCII characters, characters and formatting found in ancient Greek texts. The current betaCode is inspired by, and is therefore quite similar to, the arabTex scheme. Linguists working with Arabic are commonly using Buckwalter transliteration, which is very similar to the current betaCode, but less readable.

    betaCode and One-To-One Transliteration

    betacode    translit  Arabic letter
    _a          ā         alif
    b           b         bāʾ
    t           t         tāʾ
    _t          ṯ         thāʾ
    ^g, j       ǧ         jīm
    *h, .h      ḥ         ḥāʾ
    _h          ḫ         khāʾ
    d           d         dāl
    _d          ḏ         dhāl
    r           r         rāʾ
    z           z         zayn
    s           s         sīn
    ^s          š         shīn
    *s, .s      ṣ         ṣād
    *d, .d      ḍ         ḍād
    *t, .t      ṭ         ṭāʾ
    *z, .z      ẓ         ẓāʾ
    `           ʿ         ʿayn
    *g, .g      ġ         ghayn
    f           f         fāʾ
    *k, .k, q   ḳ         qāf
    k           k         kāf
    l           l         lām
    m           m         mīm
    n           n         nūn
    h           h         hāʾ
    w           w         wāw
    _u          ū         wāw
    y           y         yāʾ
    _i          ī         yāʾ

    Non-alphabetic letters

    betacode    translit  Arabic
    '           ʾ         hamzaŧ
    /a          á         alif maqṣūraŧ
    :t          ŧ         tāʾ marbūṭaŧ


    betacode    translit  Arabic
    ~a          ã         dagger alif
    u           u         ḍammaŧ
    i           i         kasraŧ
    a           a         fatḥaŧ
    .n          ȵ         n of tanwīn
    .a          å         silent alif
    .w          ů         silent wāw
    ?u          ủ         final ḍammaŧ *
    ?i          ỉ         final kasraŧ *
    ?a          ả         final fatḥaŧ *

    * “finals” are those final vowels that are usually dropped in transliteration and pronunciation (i.e., al-kitāb, instead of al-kitābủ, al-kitābỉ, al-kitābả), vs those that are not (huwa, hiya, ḏãlika, tilka).

    Basic principles:

    Every Arabic letter is betaCoded with its one-letter equivalent, preceded (if necessary) with a technical character that is similar to a diacritical mark in the transliterated version. Thus, most common symbols are as follows:


    • _ (underscore), if a letter can be transliterated with macron/breve below or above (ā, ṯ, ḫ, ḏ, ū, ī)
    • . (period), or * (asterisk), if a letter can be transliterated with dot below or above (ḥ, ṣ, ḍ, ṭ, ẓ, ġ, ḳ)
    • ^ (caret), if a letter can be transliterated with caron (ǧ, š)


    • attached prepositions/conjunctions and pronominal suffixes must be separated with “-” (mostly relevant for text alignment, treebanking, and general readability):
      • bi-Llah?i
      • fa-_dahaba
    • add “?” before “optional” final vowels that are usually dropped in transliteration and pronunciation (mostly relevant for transliteration):
      • bi-Llah?i, but not:
      • fa-_dahaba
    • tāʾ marbūṭaŧ: add “+” after tāʾ marbūṭaŧ, if it ends the first word of an iḍāfaŧ (mostly relevant for transliteration):
      • `_amma:t+u Ba.gd_ada, but:
      • al-`_amma:tu f_i Ba.gd_ada
    • transliterating tanwīn:
      • .n
        • ?u.n
        • ?i.n
        • ?a.n
    • silent wāw and alif:
      • .w (`Amr?u.n.w, for عَمْرٌو)
      • .a (wa-fa`al_u.a, for وَفَعَلُوا)
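
    The first conversion step, from betaCode to the one-to-one transliteration, can be sketched in a few lines. This is a minimal illustration built only from the tables and rules above, not the converter from the repository: the names BETACODE, TOKEN and betacode_to_translit are hypothetical, the “+” rule for tāʾ marbūṭaŧ is not handled, and in mixed English text a plain j, q or apostrophe would be converted as well.

```python
import re

# betaCode sequence -> one-to-one transliteration (taken from the tables above).
BETACODE = {
    "_a": "ā", "_u": "ū", "_i": "ī", "_t": "ṯ", "_h": "ḫ", "_d": "ḏ",
    "^g": "ǧ", "j": "ǧ", "^s": "š",
    ".h": "ḥ", ".s": "ṣ", ".d": "ḍ", ".t": "ṭ", ".z": "ẓ", ".g": "ġ",
    ".k": "ḳ", "q": "ḳ",
    "`": "ʿ", "'": "ʾ", "/a": "á", ":t": "ŧ", "~a": "ã",
    ".n": "ȵ", ".a": "å", ".w": "ů",
    "?u": "ủ", "?i": "ỉ", "?a": "ả",
}

# A technical character followed by a letter, or one of the single-character codes.
TOKEN = re.compile(r"[_^.*/:~?][A-Za-z]|[`'jqJQ]")

def betacode_to_translit(text):
    """Replace every betaCode sequence in text with its one-to-one equivalent."""
    def replace(match):
        token = match.group(0)
        key = token.lower().replace("*", ".")  # "*h" and ".h" are equivalent
        if key not in BETACODE:
            return token  # not a betaCode sequence: leave it untouched
        result = BETACODE[key]
        # An upper-case letter in betaCode yields an upper-case result (^S -> Š).
        return result.upper() if token[-1].isupper() else result
    return TOKEN.sub(replace, text)

print(betacode_to_translit("q_ala 'ab_u Mas`_ud?i.n"))
print(betacode_to_translit("`Amr?u.n.w wa-fa`al_u.a"))
```

    Because every technical character is matched together with the letter that follows it, sequences such as ?u.n are read as two codes (final ḍammaŧ, then the n of tanwīn) without any lookahead logic.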

    Running the converter

    • (Python 3.xx must be installed on the machine)
    • clone git repository
    • save texts that must be transliterated (i.e., the text is in English, but has some Arabic terms that must be transliterated) into ./to_translit/ (follow the format given in the example file).
    • save texts that must be fully transliterated and/or converted into Arabic script (i.e., the entire text is in Arabic) into ./to_arabic/ (follow the format given in the example file).
    • run the script (in Mac terminal: python3; on Windows: double-click on the script should work).
    • converted texts (in all available modes of conversion) will be appended to the file.
    • if you need to make any changes, edit your initial betaCode text and run the script again, converted results will be replaced with relevant updated versions.


    betaCode Example

    NB: These are examples of converting betaCode to full transliteration and Arabic script. The very last paragraph showcases conversion of hamzaŧ in different positions.

    q_ala 'ab_u Mas`_ud?i.n :: 'an_a qad sami`tu h~a_d_a min ras_ul?i All~ah?i ( .sl`m )

    .hadda_ta-n_a `Amr?u.w bn?u R_afi`?i.n , .hadda_ta-n_a `Abd?u All~ah?i bn?u al-Mub_arak?i , `an Mu.hammad?i bn?i 'Is.h_aq?a , `an Mu.hammad?i bn?i ^Ga`far?i.n , `an `Ubayd?i All~ah?i bn?i `Abd?i All~ah?i bn?i `Umar?a , `an 'Ab_i-hi , `an?i al-Nabiyy?i ( .sl`m ) na.hwa-hu

    'a_hbara-n_a Qutayba:t?u q_ala , .hadda_ta-n_a Sufy_an?u , `an Ya.hy/a bn?i Sa`_id?i.n , `an 'Ab_i Bakr?i bn?i Mu.hammad?i.n , `an `Umar?a bn?i `Abd?i al-`Az_iz?i , `an 'Ab_i Bakr?i bn?i `Abd?i al-Ra.hm~an?i bn?i al-.H_ari_t?i bn?i Hi^s_am?i.n , `an 'Ab_i Hurayra:t?a mi_tla-hu

    Ta.hw_il?u al-hamza:t?i ( kalim_at?u.n mufrada:t?u.n )

    'amr?u.n 'uns?u.n 'ins?u.n '_im_an?u.n '_aya:t?u.n '_amana mas'ala:t?u.n sa'ala ra's?u.n qur'_an?u.n ta'_amara _di'b?u.n as'ila:t?u.n q_ari'i-hi su'l?u.n mas'_ul?u.n tak_afu'u-hu su'ila q_ari'i-hi _di'_ab?u.n ra'_is?u.n bu'isa ru'_uf?u.n ra'_uf?u.n su'_al?u.n mu'arri_h?u.n abn_a'a-hu abn_a'u-hu abn_a'i-hi ^say'?a.n _ha.t_i'a:t?u.n .daw'u-hu .d_u'u-hu .daw'a-hu .daw'i-hi mur_u'a:t?u.n 'abn_a'i-hi bar_i'u-hu s_u'ila f_il?u.n f_ann?u.n f_unn?u.n s_a'ala fu'_ad?u.n ^surak_a'u-hu ri'_asa:t?u.n tahni'a:t?u.n daf_a'a:t?u.n .taff_a'a:t?u.n ta'r_i_h?u.n fa'r?u.n ^say'?u.n ^say'?i.n ^say'?a.n .daw'?u.n .daw'?i.n .daw'?a.n juz'?u.n juz'?i.n juz'?a.n mabda'?u.n mabda'?i.n mabda'?a.n naba'a q_ari'?u.n tak_afu'?u.n tak_afu'?i.n tak_afu'?a.n abn_a'u abn_a'i abn_a'a jar_i'?u.n maqr_u'?u.n .daw'?u.n ^say'?u.n juz'?u.n `ulam_a'u al-`ulam_a'i al-`ulam_a'a `Amr?u.n.w wa-fa`al_u.a

    betaCode converted into one-to-one translit

    ḳāla ʾabū Masʿūdỉȵ :: ʾanā ḳad samiʿtu hãḏā min rasūlỉ Allãhỉ ( ṣlʿm )

    ḥaddaṯa-nā ʿAmrủů bnủ Rāfiʿỉȵ , ḥaddaṯa-nā ʿAbdủ Allãhỉ bnủ al-Mubārakỉ , ʿan Muḥammadỉ bnỉ ʾIsḥāḳả , ʿan Muḥammadỉ bnỉ Ǧaʿfarỉȵ , ʿan ʿUbaydỉ Allãhỉ bnỉ ʿAbdỉ Allãhỉ bnỉ ʿUmarả , ʿan ʾAbī-hi , ʿanỉ al-Nabiyyỉ ( ṣlʿm ) naḥwa-hu

    ʾaḫbara-nā Ḳutaybaŧủ ḳāla , ḥaddaṯa-nā Sufyānủ , ʿan Yaḥyá bnỉ Saʿīdỉȵ , ʿan ʾAbī Bakrỉ bnỉ Muḥammadỉȵ , ʿan ʿUmarả bnỉ ʿAbdỉ al-ʿAzīzỉ , ʿan ʾAbī Bakrỉ bnỉ ʿAbdỉ al-Raḥmãnỉ bnỉ al-Ḥāriṯỉ bnỉ Hišāmỉȵ , ʿan ʾAbī Hurayraŧả miṯla-hu

    Taḥwīlủ al-hamzaŧỉ ( kalimātủȵ mufradaŧủȵ )

    ʾamrủȵ ʾunsủȵ ʾinsủȵ ʾīmānủȵ ʾāyaŧủȵ ʾāmana masʾalaŧủȵ saʾala raʾsủȵ ḳurʾānủȵ taʾāmara ḏiʾbủȵ asʾilaŧủȵ ḳāriʾi-hi suʾlủȵ masʾūlủȵ takāfuʾu-hu suʾila ḳāriʾi-hi ḏiʾābủȵ raʾīsủȵ buʾisa ruʾūfủȵ raʾūfủȵ suʾālủȵ muʾarriḫủȵ abnāʾa-hu abnāʾu-hu abnāʾi-hi šayʾảȵ ḫaṭīʾaŧủȵ ḍawʾu-hu ḍūʾu-hu ḍawʾa-hu ḍawʾi-hi murūʾaŧủȵ ʾabnāʾi-hi barīʾu-hu sūʾila fīlủȵ fānnủȵ fūnnủȵ sāʾala fuʾādủȵ šurakāʾu-hu riʾāsaŧủȵ tahniʾaŧủȵ dafāʾaŧủȵ ṭaffāʾaŧủȵ taʾrīḫủȵ faʾrủȵ šayʾủȵ šayʾỉȵ šayʾảȵ ḍawʾủȵ ḍawʾỉȵ ḍawʾảȵ ǧuzʾủȵ ǧuzʾỉȵ ǧuzʾảȵ mabdaʾủȵ mabdaʾỉȵ mabdaʾảȵ nabaʾa ḳāriʾủȵ takāfuʾủȵ takāfuʾỉȵ takāfuʾảȵ abnāʾu abnāʾi abnāʾa ǧarīʾủȵ maḳrūʾủȵ ḍawʾủȵ šayʾủȵ ǧuzʾủȵ ʿulamāʾu al-ʿulamāʾi al-ʿulamāʾa ʿAmrủȵů wa-faʿalūå

    betaCode converted into Arabic script

    قَالَ أَبُو مَسْعُودٍ :: أَنَا قَدْ سَمِعْتُ هٰذَا مِنْ رَسُولِ الـلّٰـهِ ( صْلْعْمْ )

    حَدَّثَنَا عَمْرُو بْنُ رَافِعٍ ، حَدَّثَنَا عَبْدُ الـلّٰـهِ بْنُ الْمُبَارَكِ ، عَنْ مُحَمَّدِ بْنِ إِسْحَاقَ ، عَنْ مُحَمَّدِ بْنِ جَعْفَرٍ ، عَنْ عُبَيْدِ الـلّٰـهِ بْنِ عَبْدِ الـلّٰـهِ بْنِ عُمَرَ ، عَنْ أَبِيهِ ، عَنِ النَّبِيِّ ( صْلْعْمْ ) نَحْوَهُ

    أَخْبَرَنَا قُتَيْبَةُ قَالَ ، حَدَّثَنَا سُفْيَانُ ، عَنْ يَحْيٰى بْنِ سَعِيدٍ ، عَنْ أَبِي بَكْرِ بْنِ مُحَمَّدٍ ، عَنْ عُمَرَ بْنِ عَبْدِ الْعَزِيزِ ، عَنْ أَبِي بَكْرِ بْنِ عَبْدِ الرَّحْمٰنِ بْنِ الْحَارِثِ بْنِ هِشَامٍ ، عَنْ أَبِي هُرَيْرَةَ مِثْلَهُ

    تَحْوِيلُ الْهَمْزَةِ ( كَلِمَاتٌ مُفْرَدَةٌ )

    أَمْرٌ أُنْسٌ إِنْسٌ إِيمَانٌ آيَةٌ آمَنَ مَسْأَلَةٌ سَأَلَ رَأْسٌ قُرْآنٌ تَآمَرَ ذِئْبٌ أَسْئِلَةٌ قَارِئِهِ سُؤْلٌ مَسْؤُولٌ تَكَافُؤُهُ سُئِلَ قَارِئِهِ ذِئَابٌ رَئِيسٌ بُئِسَ رُؤُوفٌ رَؤُوفٌ سُؤَالٌ مُؤَرِّخٌ أَبْنَاءَهُ أَبْناؤُهُ أَبْنائِهِ شَيْئًا خَطِيئَةٌ ضَوْءُهُ ضُوؤُهُ ضَوْءَهُ ضَوْئِهِ مُرُوءَةٌ أَبْنائِهِ بَرِيؤُهُ سُوئِلَ فِيلٌ فَانٌّ فُونٌّ سَاءَلَ فُؤَادٌ شُرَكاؤُهُ رِئَاسَةٌ تَهْنِئَةٌ دَفَاءَةٌ طَفّاءَةٌ تَأْرِيخٌ فَأْرٌ شَيْءٌ شَيْءٍ شَيْئًا ضَوْءٌ ضَوْءٍ ضَوْءًا جُزْءٌ جُزْءٍ جُزْءًا مَبْدَأٌ مَبْدَأٍ مَبْدَأً نَبَأَ قَارِئٌ تَكَافُؤٌ تَكَافُؤٍ تَكَافُؤًا أَبْناءُ أَبْناءِ أَبْناءَ جَريءٌ مَقْروءٌ ضَوْءٌ شَيْءٌ جُزْءٌ عُلَماءُ الْعُلَماءِ الْعُلَماءَ عَمْرٌو وَفَعَلُوا

    betaCode into Translit

    betaCode in English text

    NB: This is an example of an English text with terms, names and toponyms given in betaCode and automatically converted into different transliteration flavors (excerpts are from Brill’s Encyclopaedia of Islam).

    Dima^s.k, Dima^s.k al-^S_am or simply al-^S_am , (Lat. Damascus, Fr. Damas) is the largest city of Syria. It is situated ... very much at the same latitude as Ba.gd_ad and F_as, at an altitude of nearly 700 metres, on the edge of the desert at the foot of ^Gabal .K_asiy_un.

    al-_Dahab_i, ^Sams al-D_in Ab_u `Abd All~ah Mu.hammad b. `U_tm_an b. .K_aym_a.z b. `Abd All~ah al-Turkum_an_i al-F_ari.k_i al-Dima^s.k_i al-^S_afi`_i, an Arab historian and theologian, was born at Damascus or at Mayy_afari.k_in on 1 or 3 Rab_i` II (according to al-Kutub_i, in Rab_i` I) 673/5 or 7 October 1274, and died at Damascus, according to al-Subk_i and al-Suy_u.t_i, in the night of Sunday-Monday on 3 _D_u al-.Ka`da:t 748/4 February 1348, or, according to A.hmad b. `Iy_as, in 753/1352-3. He was buried at the B_ab al-.Sa.g_ir.

    betaCode converted into one-to-one translit

    Dimašḳ, Dimašḳ al-Šām or simply al-Šām , (Lat. Damascus, Fr. Damas) is the largest city of Syria. It is situated ... very much at the same latitude as Baġdād and Fās, at an altitude of nearly 700 metres, on the edge of the desert at the foot of Ǧabal Ḳāsiyūn.

    al-Ḏahabī, Šams al-Dīn Abū ʿAbd Allãh Muḥammad b. ʿUṯmān b. Ḳāymāẓ b. ʿAbd Allãh al-Turkumānī al-Fāriḳī al-Dimašḳī al-Šāfiʿī, an Arab historian and theologian, was born at Damascus or at Mayyāfariḳīn on 1 or 3 Rabīʿ II (according to al-Kutubī, in Rabīʿ I) 673/5 or 7 October 1274, and died at Damascus, according to al-Subkī and al-Suyūṭī, in the night of Sunday-Monday on 3 Ḏū al-Ḳaʿdaŧ 748/4 February 1348, or, according to Aḥmad b. ʿIyās, in 753/1352-3. He was buried at the Bāb al-Ṣaġīr.

    betaCode converted into the Library of Congress scheme

    Dimashq, Dimashq al-Shām or simply al-Shām , (Lat. Damascus, Fr. Damas) is the largest city of Syria. It is situated ... very much at the same latitude as Baghdād and Fās, at an altitude of nearly 700 metres, on the edge of the desert at the foot of Jabal Qāsiyūn.

    al-Dhahabī, Shams al-Dīn Abū ʿAbd Allāh Muḥammad b. ʿUthmān b. Qāymāẓ b. ʿAbd Allāh al-Turkumānī al-Fāriqī al-Dimashqī al-Shāfiʿī, an Arab historian and theologian, was born at Damascus or at Mayyāfariqīn on 1 or 3 Rabīʿ II (according to al-Kutubī, in Rabīʿ I) 673/5 or 7 October 1274, and died at Damascus, according to al-Subkī and al-Suyūṭī, in the night of Sunday-Monday on 3 Dhū al-Qaʿda 748/4 February 1348, or, according to Aḥmad b. ʿIyās, in 753/1352-3. He was buried at the Bāb al-Ṣaghīr.

    betaCode converted into a searchable string (diacritics removed)

    Dimashq, Dimashq al-Sham or simply al-Sham , (Lat. Damascus, Fr. Damas) is the largest city of Syria. It is situated ... very much at the same latitude as Baghdad and Fas, at an altitude of nearly 700 metres, on the edge of the desert at the foot of Jabal Qasiyun.

    al-Dhahabi, Shams al-Din Abu Abd Allah Muhammad b. Uthman b. Qaymaz b. Abd Allah al-Turkumani al-Fariqi al-Dimashqi al-Shafii, an Arab historian and theologian, was born at Damascus or at Mayyafariqin on 1 or 3 Rabi II (according to al-Kutubi, in Rabi I) 673/5 or 7 October 1274, and died at Damascus, according to al-Subki and al-Suyuti, in the night of Sunday-Monday on 3 Dhu al-Qada 748/4 February 1348, or, according to Ahmad b. Iyas, in 753/1352-3. He was buried at the Bab al-Saghir.
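
    The last two conversions shown above can be approximated as follows. This sketch covers only the correspondences that are visible in the examples (for instance, Dimašḳ becoming Dimashq and then Dimashq without diacritics); it is not the converter's actual rule set, and the names LOC, translit_to_loc and simplify are hypothetical.

```python
import unicodedata

# One-to-one transliteration -> Library of Congress style; only the
# correspondences visible in the examples above are covered.
LOC = {"ḳ": "q", "š": "sh", "ǧ": "j", "ṯ": "th", "ḫ": "kh", "ḏ": "dh",
       "ġ": "gh", "ã": "ā", "ŧ": ""}

def translit_to_loc(text):
    """Rewrite one-to-one characters as their Library of Congress equivalents."""
    out = []
    for ch in text:
        lower = ch.lower()
        if lower in LOC:
            repl = LOC[lower]
            # Keep capitalization: Š -> Sh, Ḏ -> Dh.
            out.append(repl.capitalize() if ch.isupper() else repl)
        else:
            out.append(ch)
    return "".join(out)

def simplify(text):
    """Drop diacritics and the ʿayn/hamza signs to get a searchable string."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        # U+02BF is the ʿayn sign, U+02BE the hamza sign.
        if not unicodedata.combining(ch) and ch not in "\u02bf\u02be"
    )

loc = translit_to_loc("al-Ḏahabī, Šams al-Dīn Abū")
print(loc)
print(simplify(loc))
```

    Splitting the work into two functions mirrors the two outputs above: the Library of Congress form keeps its diacritics, and the searchable string is derived from it by Unicode decomposition.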


    A DH Exercise: Mapping the Greco-Roman World

    “Envy is not a very good thing. Yet envy is precisely what an early Islamicist feels when he reads Roger Bagnall and Bruce Frier’s The Demography of Roman Egypt.” [1] These words have stuck in my head since the moment I read them, and over the past two years of working among and with classicists my classics envy has only grown: beyond the 300 original census declarations that were at the disposal of the above-mentioned scholars, there are far too many things to envy, especially when it comes to all things digital.

    The Pleiades Gazetteer is a particularly interesting case: with almost 35,000 places, it offers several well-populated categories of geographical objects. The categories include settlements, forts, temples, villas, stations, [amphi]theaters, churches, bridges, baths, cemeteries, plazas, and arches. What makes it even more interesting is that most of these objects have chronological markers, i.e. they belong to one or more of the following periods: archaic (750–550BC), classical (550–330BC), hellenistic-republican (330–30BC), roman (30BC–300CE), late-antique (300–640CE).

    This data offers an opportunity for an interesting digital exercise with historical data. I assigned it to my students as a part of an introduction to R (within my “Introduction to Text Mining for the Students of Humanities”, Tufts University, Spring 2015). The task was to explore the Pleiades data set, find out what is what and what can be done with it. The goal was to discover that 1) geographical objects are categorized, and that 2) they also have chronological markers, which can be used 3) to map the geography of the Greco-Roman world over time.

    The map of forts turned out to be particularly interesting.

    Below is the code and some of the resulting visualizations.

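    # R
    library(ggplot2) # the original also loaded maps, mapdata, rgeos, maptools, mapproj, PBSmapping, data.table
    xlim=c(-12,55); ylim=c(20,60)
    dataFolder="" # ideally, full path to the folder
    csvName=paste0(dataFolder,"pleiades-places.csv") # assumed file name: csvName is not defined in the surviving code
    locsRaw=read.csv(csvName,stringsAsFactors=F,header=T,sep=',') # url: ---: download the latest csv, unzip
    land="grey"; water="grey80"; bgColor="grey80"
    features=data.frame(pattern=c("fort","settlement"),label=c("Forts","Settlements"),stringsAsFactors=F) # assumed stand-in: not defined in the surviving code
    periods=data.frame(name=c("archaic","classical","hellenistic-republican","roman","late-antique"),dates=c("750-550BC","550-330BC","330-30BC","30BC-300CE","300-640CE"),stringsAsFactors=F) # assumed stand-in
    locPleiades=geom_point(data=locsRaw,color="grey70",alpha=.75,size=1,aes(y=reprLat,x=reprLong))
    for(i in 1:nrow(features)){
      locs=locsRaw[with(locsRaw,grepl(features[i,1],featureTypes)),]
      for(ii in 1:nrow(periods)){
        dataLabel="Data: Pleiades Project"
        header=paste0(features[i,2]," in the ",periods[ii,1]," period (",periods[ii,2],")")
        # locPer is not defined in the surviving code; assumed here: places of this type whose timePeriodsKeys mention the period
        locPer=geom_point(data=locs[grepl(periods[ii,1],locs$timePeriodsKeys),],color="red",size=1,aes(y=reprLat,x=reprLong))
        ggsave(paste0("Pleiades_",features[i,2],"_",ii,".png"),ggplot()+locPleiades+locPer+labs(y="",x="",title=header,caption=dataLabel)+theme_grey())
      }
    }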

    Using ImageMagick to animate maps

    The fastest and easiest way to animate the results is to use ImageMagick, a free command-line utility. The following command will take all .png files whose names begin with Pleiades_Settle and convert them into an animated GIF file Pleiades_Settlements.gif, which will play continuously (-loop 0), with each frame downsized (-resize 1200x900) and paused for .75 of a second (-delay 75).

    convert -resize 1200x900 -delay 75 -loop 0 Pleiades_Settle*.png Pleiades_Settlements.gif

    Chronological Cartograms

    All Locations



    All categories

    Amphitheaters, arches, baths, bridges, cemeteries, churches, forts, locations, plazas, settlements, stations, temples, theaters, villas.


    1. al-Qādī, Wadād. “Population Census and Land Surveys under the Umayyads (41-132/661-750).” Der Islam 83, no. 2 (2006), p. 341 

  • 11/07/15--16:00: Introducing mARkdown
    TEI XML has long been the standard for tagging humanistic texts for research purposes. It is the standard in most digital libraries (including the Perseus Digital Library). Having texts in a TEI XML format that conforms to the standards of a long-standing library allows one to take advantage of the libraries’ infrastructure and of the analytical tools that have been developed since the appearance of TEI XML. Converting texts into XML, however, is a rather long and complicated process.

    Texts in Arabic make things even more complicated. Having right-to-left (RTL) and left-to-right (LTR) text in one file is one of the major challenges. Since the cursor changes the direction of its movement when crossing the boundary between RTL and LTR text, it is difficult to place the cursor properly, and one often ends up changing the wrong part of the text. The direction of paired characters is visually confusing, and it is often next to impossible to say whether a given angle bracket (perhaps the most important XML character) is an opening character or a closing one. Moreover, the shapes of Arabic letters in a text file change dynamically as one types or edits Arabic text, and many text editors do not handle this properly (particularly on Mac). In addition to these technical challenges, there are too many Arabic texts to convert, most of them multivolume titles, and too few people who have both the training and the willingness to do that.

    At the beginning of my digital research I considered TEI XML as a working format, but I had to give up on this option, since converting a 50-volume book (~3.4 million words) would have taken forever. After reviewing existing approaches, I came up with a rather simple tagging system that allowed me to create a structured, machine-readable text without sacrificing years of my life. In many ways, this system was inspired by markdown, “a text-to-HTML conversion tool ... that allows [one] to write using an easy-to-read, easy-to-write plain text format, then convert it to structurally valid XHTML (or HTML).”

    The main goal of mARkdown is to provide a simple system for tagging structural elements in Arabic texts that would facilitate algorithmic analysis in the same way as more complex TEI XML does. In principle, mARkdown does not require any special editor, but my current workflow relies on EditPad Pro, which supports right-to-left languages, Unicode, and large files. However, it is the support of custom highlighting and navigation schemes that makes this text editor particularly convenient for mARkdown.

    Since I have been using my mARkdown for my own research purposes, it has not yet been developed into an easily reusable system. This is my first attempt to provide a detailed description and explain how it can be used. I expect that mARkdown will undergo some minor changes in the upcoming months. The most recent description can be accessed from the main menu above.

    mARkdown in EditPad Pro activated with the “magic value” in test_textFile

    A Screenshot of al-Thurayyā. Click on the screenshot to open the Gazetteer in full screen.

    With Teams Pelagios and Pleiades (in alphabetical order: Elton Barker, Tom Elliot, Leif Isaksen, Rainer Simon) visiting Tufts University within the framework of the Perseids Named Entity Hackathon (organized and led by Bridget Almas) at the Perseus Project, we had a chance to test how their systems work with Arabic texts.1 Pelagios offers a convenient workflow for “geographical” reading of texts, which consists of two main steps: first, one tags places that occur in the text; then one “geo-resolves” the tagged places into geographical locations that are displayed on an interactive map (for more details, see the Pelagios website). The first step is smooth and easy and works well for texts in any language as long as the text is provided in Unicode. The second step depends on the availability of relevant gazetteers to which Pelagios is or can be connected. Thus, Pelagios does a great job when it comes to the “geo-resolution” of toponyms included in Pleiades, which now has almost 35,000 places from the Ancient world. Since there is no gazetteer for the classical Islamic world, “geo-resolution” of classical Arabic sources is problematic at the moment. A gazetteer for the Islamic world is badly needed in general.

    As with the creation of any database, creating a gazetteer is an extremely time-consuming task. The key seems to be in generating a snowball effect: creating enough database entries to encourage a community of potentially interested individuals to start contributing to an already substantial databank by offering new data, references, corrections and additions. Pleiades has successfully used this model. Having incorporated content from such extensive editions as “Digital Atlas of Roman and Medieval Civilizations” (DARMC) and “Barrington Atlas of the Greek and Roman World” (BAGRW), Pleiades offered a significant foundation for potential users to contribute to. It seems only logical to follow in the footsteps of such a successful project as Pleiades, and to use their infrastructure for developing an Islamic gazetteer, which will feature in Pleiades as al-Thurayya: a Supplement for the Islamic World. (In this light, the name al-Thurayya, Arabic for Pleiades, seems quite appropriate; Tom Elliot, one of the managing editors of Pleiades, will be providing support for the integration of al-Thurayya into Pleiades.)

    In the case of the classical Islamic world, there are, unfortunately, very few publications that offer geographical data of a magnitude comparable to that of DARMC and BAGRW. In fact, there is only one edition that can provide a solid backbone of geographical data for the initial stage of the creation of an Islamic gazetteer: Georgette Cornu’s Atlas du monde arabo-islamique à l’époque classique: IXe-Xe siècles (Brill, 1985; maps by Olivier Chareire). Largely based on M.J. de Goeje’s Bibliotheca Geographorum Arabicorum, this Atlas represents early geographical and travelogue literature in Arabic and, to some extent, in Persian (9 geographical treatises from BGA, plus 18 other works).

    The Atlas consists of 20 maps, which cover the extent of the Islamic world in the 9th–10th centuries, and an extensive gazetteer that briefly characterizes every place, providing a succinct verbal description of its geographical location, its place in the geographical hierarchy, and coded references to primary and secondary sources. The maps vary in scale, but, in general, they are very detailed, dense in places, and show trade routes.2

    *Geographical Coverage of Cornu’s Atlas*

    Cornu’s Atlas represents most of early Islamic geographical sources in general, but none of them in particular—peculiarities of each geographer are preserved in the gazetteer, but not reflected on maps. Although somewhat a “Frankenstein” of the early Islamic geography, Cornu’s Atlas is an incredible piece of scholarly work that does offer the best starting point for studying Islamic geography as well as various topics in Islamic history with digital methods.

    Unlike DARMC and BAGRW, Cornu’s Atlas was published only once3 in the 1980s and has never made it into a digital form (at least to my knowledge). Nor does the gazetteer offer coordinates for places. So, creating a digital gazetteer is a bit of a methodological challenge. The most effective way is to “georeference” Cornu’s maps in a GIS program (for example, QGIS) and then to collect the necessary geographical features from these georeferenced maps. “Georeferencing” can be described as a process of deforming the image of a map in such a way that its coordinate grid corresponds to the coordinates within the GIS software. In other words, if one georeferences specific points (for example, intersections of parallels of latitude and meridians of longitude), a GIS program will deform the image of the map in such a way that all geographical features (cities, villages, and trade routes) will correspond to their geographical locations. In most cases…
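
    The idea of tying grid intersections on a scanned map to real coordinates can be illustrated with a small Python sketch. It is not what QGIS does internally (a GIS program offers several transformation types and uses many control points); it only fits the simplest possible affine transform through exactly three control points. The function names and the pixel and coordinate values are hypothetical and chosen purely for illustration.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(m, v):
    """Solve the 3x3 linear system m * x = v with Cramer's rule."""
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("control points must not lie on one line")
    solution = []
    for k in range(3):
        mk = [row[:] for row in m]      # copy, then swap in column k
        for r in range(3):
            mk[r][k] = v[r]
        solution.append(det3(mk) / d)
    return solution


def fit_affine(pixels, coords):
    """Return a function mapping image pixels (x, y) to (lon, lat).

    pixels: three (x, y) positions of grid intersections on the scan.
    coords: the three (lon, lat) values printed on the map for them.
    """
    m = [[x, y, 1.0] for x, y in pixels]
    a, b, c = solve3(m, [lon for lon, _ in coords])
    d, e, f = solve3(m, [lat for _, lat in coords])

    def to_geo(x, y):
        return (a * x + b * y + c, d * x + e * y + f)

    return to_geo


# Hypothetical control points: three grid intersections on a scanned map.
pixels = [(100, 100), (500, 100), (100, 500)]
coords = [(42.0, 37.0), (44.0, 37.0), (42.0, 35.0)]
to_geo = fit_affine(pixels, coords)
lon, lat = to_geo(300, 300)   # a feature halfway between the grid lines
```

    With more than three control points one would fit the same six parameters by least squares, which also gives a residual error for judging how well the scan matches the grid.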

    *Georeferenced Cornu’s Atlas*

    As a method, georeferencing is precise, but its results depend on the quality of original maps, and some particular factors often complicate things. Ideally, for georeferencing one needs to know the projection of the map—something which all Cornu’s maps lack (as is the case with most historical academic maps). Fortunately, Cornu’s maps have rather detailed coordinate grids, in most cases covering every or every other degree of latitude and longitude.4 By georeferencing coordinate grids one can still produce quite reliable overlays. An example below shows a section of one of Cornu’s maps overlaid on top of Google physical map: medieval al-Mawsil corresponds to modern Mosul, and medieval Tall A‘far—to modern Tal Afar, while some other features—in this case, the Tigris river—are slightly off.

    *A section of a georeferenced Cornu’s map overlaid on top of Google physical map*

    Converted into a digital dataset, the contents of Cornu’s Atlas will become the backbone of geographical data that can be improved, expanded, and corrected. An example of a searchable digital map based on one of the maps from Cornu’s Atlas can be found below: the map on the left shows dynamic clustering of toponyms; the map in the middle shows places; the map on the right shows trade routes (click on the image to view the dynamic searchable map; layers can be switched on/off in the upper right corner of the map). NB: “Place Filter” supports search using Arabic and simplified transliteration (omitting hamzas and ‘ayns, and disregarding macrons of long vowels and dots of emphatic consonants). Make sure to switch on the Places layer. There may be typos in the transliteration (and in the Arabic, since Arabic names are automatically converted from transliteration); I would appreciate it if you emailed corrections or suggestions.

    *Toponymic data from the map of Greater Syria (Province du Šām).
    Special thanks to Rainer Simon @ Pelagios and Adam Tavares @ Perseus for their help with building this interactive map.*
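
    The simplified transliteration accepted by the “Place Filter” can be imitated with a few lines of Python. This is a sketch of the rules as described above, not the code behind the map: the function names are made up, and the exact set of characters treated as hamza or ‘ayn is an assumption.

```python
import unicodedata

# Combining marks dropped by the simplified scheme: macron and dot below.
DROPPED_MARKS = {"\u0304", "\u0323"}
# Characters used for hamza and 'ayn in transliteration, plus common
# keyboard stand-ins (assumed set, extend as needed).
DROPPED_LETTERS = {"\u02be", "\u02bf", "\u2018", "\u2019", "'", "`"}


def simplify(name):
    """Lower-case a transliterated name and drop the marks listed above."""
    decomposed = unicodedata.normalize("NFD", name)
    kept = [ch for ch in decomposed
            if ch not in DROPPED_MARKS and ch not in DROPPED_LETTERS]
    return unicodedata.normalize("NFC", "".join(kept)).lower()


def matches(query, name):
    """True if the simplified query occurs inside the simplified name."""
    return simplify(query) in simplify(name)
```

    With this, a query typed without diacritics, such as `thurayya`, still finds a name stored as al-Thurayyā. Other diacritics (for example, the caron of š) are deliberately left alone, since the description above mentions only macrons and dots.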


    1. For more details, see Marie-Claire Beaulieu’s post on Perseids Website.
    2. It is not clear what the lines of the trade routes are based on. Unlike maps/cartograms of trade and postal routes created by Aloys Sprenger (Die Post- und Reiserouten des Orients, Leipzig 1864) and Guy Le Strange (The Lands of the Eastern Caliphate, Cambridge 1905), who connected locations with straight lines, Cornu’s maps offer realistic routes.
    3. The gazetteer was published in three gradually updated versions.
    4. In this regard, maps from Brill’s An Historical Atlas of Islam (1981, 2002) are not suitable for this task, since they lack information on projection, and do not provide values for the coordinate grid, which significantly affects the precision of georeferencing. See this example: A georeferenced map of Iran in the 4th-5th / 10th-11th Centuries. NB: Routes are straight lines between two points; georeferenced in QGIS.

    While looking for a way to identify all biographical collections and chronicles (and, by extension, all other texts that offer data for time-series analysis) in a collection of over 10,000 texts, it occurred to me that all these texts share a common feature: they are teeming with dates. So, what if we try to identify such texts computationally?! Not only will this help us find all relevant titles in the sea of text without overlooking anything; we can, arguably, also gain insight into the chronological coverage of each of those titles, the chronological focus of individual historians, and the chronological coverage of the entire collection of historical texts, and identify texts that focus on particular periods. The blogpost begins with an overview of several digital collections and then explains the methodology of the experiment. Appendices allow one to explore the chronological coverage of about 1,000 individual texts as well as the coverage of particular periods (here, hijri centuries; i.e., which texts focus on particular periods).


    Digital collections of classical Arabic texts have mushroomed over the past decade and a half. The three major libraries, al-Ǧāmiʿ al-kabīr (HDD), Shamela.ws, and ShiaOnlineLibrary.com, include over 10,000 titles. There are probably another dozen collections that offer texts in the hundreds and thousands.

    Venn diagram: overlap among ShiaOnlineLibrary.com, Shamela.ws, and al-Ǧāmiʿ al-kabīr (al-Ǧāmiʿ al-kabīr: 2,364 titles; unique titles: 7,895, ~1.1 billion words).
    Overlap among collections. There is significant overlap among the available digital collections. Thus, while their cumulative volume may run into tens of thousands, the count of unique titles (excluding exact copies and texts based on different editions) is significantly lower. Additionally, it is very difficult to identify duplicates among the collections. The Venn diagram above shows the overlap, over 2,000 titles, among the three major collections (the count is still a work in progress). NB: The diagram was generated with Ben Frederickson’s code.

    The number of these collections appears to be growing and their content expanding. This new research environment offers scholars an opportunity to check whether a particular text is included in a certain collection, to browse and read it (often in a page-by-page manner), and to search for particular bits of information. These collections work well for looking for something that we know or expect to find: a book, a person, an event, a term. What we cannot do is look into how books are related, how they overlap and complement each other; how each individual fits among his contemporaries as well as his predecessors and successors; how different historical events are intertwined; how terms, notions and concepts are related to each other and evolve across time and space. Yet, having full texts of our sources at our disposal, we can definitely go beyond simplistic linear searches. By asking a series of interconnected questions, and relying on digital methods of text analysis, we can move toward a new understanding of the entire Arabic written tradition (starting, of course, with what is digitally available in one form or another).

    The question of chronology is one such foundational question. What I propose in this experiment is to explore the content of three such collections in order to understand better the chronological coverage of each collection, each author, and each book. To get insights into these issues we can turn to different kinds of data. To get a perspective on the scope of each collection we shall start by looking into descriptions of books and their authors, and more specifically into when the authors died.


    While metadata in most collections is not complete, it can still be quite useful. The major digital collections, al-Ǧāmiʿ al-kabīr (HDD), Shamela.ws, and ShiaOnlineLibrary.com, display the same clear trend: strong emphasis on the period from the 3rd to the 6th centuries AH (912–1203 CE), with an extra peak in the 8th century (1300–1397 CE), a steady decline during the 9th–12th centuries AH (1494–1785 CE), a slow recovery during the 13th century AH (1785–1882 CE), and a sharp rise in the 14th century AH (1882–1979 CE).

    Note on graphs. Data points of each graphed line show frequencies for periods of time that end at that point. For example, on the graph below that shows distribution of data by 100 lunar years (titles in al-Ǧāmiʿ al-kabīr), the value for 300/912 CE is 280, which means that there are 280 titles written by authors who died during 200–300 AH / 815–912 CE. A “step-before” type of graph displays such data most appropriately, but it is not suitable for comparative graphs, since there is too much overlap among the lines which makes the entire graph unreadable. Data on the most recent authors (after 1400/1979 CE) is excluded from the graphs, since it tends to overshadow earlier periods.
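
    The binning convention from the note above (each data point counts what falls into the period ending at that point) can be sketched in Python as follows. The AH to CE conversion uses a common rule-of-thumb formula, which is an assumption on my part; it happens to reproduce the date pairs quoted in this post (for example, 300/912 CE), but it ignores the exact month and day. The function names and sample years are invented for illustration, and a year exactly on a boundary (such as 300) is counted in the period that ends with it.

```python
from collections import Counter


def ah_to_ce(ah_year):
    """Approximate CE year for a hijri year (rule-of-thumb formula)."""
    return int(ah_year * 0.970224 + 621.5774)


def period_end(ah_year, width=100):
    """End of the period containing ah_year, e.g. 201..300 -> 300."""
    return -(-ah_year // width) * width   # integer ceiling


def bin_by_period(death_years, width=100):
    """Count death years per period, keyed by a label like '300/912 CE'."""
    counts = Counter(period_end(y, width) for y in death_years)
    return {f"{end}/{ah_to_ce(end)} CE": n for end, n in sorted(counts.items())}


# Hypothetical death dates (AH) of six authors.
sample = [204, 256, 261, 300, 310, 597]
binned = bin_by_period(sample)
```

    Here the first four authors are all counted at the data point for 300 AH, which is how the value 280 for 300/912 CE in the example above should be read.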

    al-Ǧāmiʿ al-kabīr (HDD) has the most complete chronological metadata on its authors. (online). Almost half of its records lack chronological metadata. (online). The collection has rather complete chronological metadata. Almost 1/3 of all titles are books by modern Šīʿīte scholars (excluded from the graph so that they do not overshadow earlier periods). (online) has the most incomplete metadata, but it still suggests the same trend.

    The developers of these collections were most interested in the early Islamic period (roughly the first half of the first Islamic millennium). According to the data of such sources as the Hadiyyaŧ al-ʿārifīn by Ismāʿīl Bāšā al-Baġdādī (d. 1338/1919 CE), a bibliographical collection that builds upon the famous Kašf al-ẓunūn of Ḥāǧī Ḫalīfaŧ (d. 1067/1656 CE), and Ḫizānaŧ al-turāṯ, a Saudi catalog of manuscripts (al-Riyāḍ: Šarikaŧ al-ʿArīs lil-Kumbiyūtir, 2007), the number of contributors to the Islamic written treasury is continuously growing at least up until the beginning of the 13th century AH.

    The “growth” of authors, according to the data from the Hadiyyaŧ al-ʿārifīn and the Ḫizānaŧ al-turāṯ.

    Ḫizānaŧ al-turāṯ is a Saudi catalog of manuscripts that was first published on a CD (al-Riyāḍ: Šarikaŧ al-ʿArīs lil-Kumbiyūtir, 2007); currently its full text is included into The catalog includes over 160,000 records, but unfortunately suffers from a number of problems, such as inconsistency of typing conventions, duplicate records, selective coverage of different manuscript collections (for example, only about 1,000 Arabic manuscripts from St.Petersburg, Russia are covered, while St.Petersburg academic institutions house at least 11,000 Arabic manuscripts).

    Even though existing digital collections often awe us with their volume, the comparative graphs below show that they cover only a fraction of the Arabic written tradition, even by comparison with an early 20th-century bibliography, which itself is hardly complete in its coverage. Additionally, the graphs clearly highlight the fact that the chronological coverage of these collections is skewed heavily in favor of the earlier period of Islamic history.

    Chronological distribution of book titles in the Hadiyyaŧ al-ʿārifīn, Shamela.ws, al-Ǧāmiʿ al-kabīr (HDD), and ShiaOnlineLibrary.com

    A note on the Hadiyyaŧ al-ʿārifīn. The decline of both graphs after 1200/1785 CE indicates unavailability of bibliographical information to the author more than anything else. The geographical coverage of the collection starts shrinking roughly at the same period. It should be noted that most chronological datasets exhibit a similar trend. For example, the trend can be observed in al-Ḏahabī’s own Ḏayl to his Taʾrīḫ al-islām, where the number of biographies drops dramatically; one can equally see the same trend in Brill’s Index Islamicus and Harvard Open Metadata (on 12 million books). The only difference is that the lag gets shorter as we get closer to our time—for premodern Arabic sources this lag is 100 to 150 years; in modern datasets—10 to 20 years.

    Another way to evaluate chronological coverage is to explore the actual texts. Ideally, the number of discrete units of information (such as biographies and events) by period should show the distribution of chronological emphasis of a particular source. Furthermore, the summary of such data from all [available] titles written by a specific author should indicate this author’s interest in specific periods. (The interpretation of such “interest” is a different subject altogether. For example, the fact that the Hadiyyaŧ al-ʿārifīn has more information on the 11th and the 12th centuries AH (1591–1785 CE) may indicate either Ismāʿīl Bāšā al-Baġdādī’s interest in this particular period, or the availability of information for this period, or the genuine growth in numbers of people contributing to the Islamic written treasury.)

    Date Statements

    Almost none of the texts, however, are tagged in a manner that would allow such a detailed evaluation. Yet, it is possible to analyze date statements in each text and offer an evaluation of its chronological coverage based on the frequencies of references to different periods. The consistency of date statements in Arabic texts (essentially, a word for “year”, ʿām or sanaŧ, followed by either digits or spelled-out numbers) makes it possible to represent this pattern with a regular expression, a special text string for describing a search pattern (see Figure below). This regular expression can be worked into a script, with which one can check available texts. It should be noted, of course, that this approach is tuned to analyze hiǧrī dates, since other dating systems are used only infrequently.

    Words sanaŧ and ʿām in the histories of Islam. Overall, the word sanaŧ is used most frequently in date statements: of about 1,362,000 date statements from across 10,000 texts only 2.9% start with the word ʿām (~40,000), while 97.1% begin with the word sanaŧ (~1,322,000). A closer look also reveals that the word ʿām is favored in texts written in the 20th century; with regard to premodern texts, authors from the western part of the Islamic world (al-Andalus and al-Maġrib) tend to use it more frequently than their eastern counterparts.

    Note: Adding fī, “in,” into the mix changes the picture: of about 1,670,000 statements, 79.2% start with sanaŧ (~1,322,000), 18.5% with fī (~308,000), and 2.4% with ʿām (~40,000). The problem is that even a quick look at the ngrams of fī-statements (the words that immediately follow each fī-statement) shows that more than half of these statements are quantitative phrases of different kinds (for example, fī arbaʿ mujalladāt). For this reason, fī-statements are excluded from the analysis.

    [Top] A regular expression for capturing year statements in premodern Arabic sources. You can copy it and test it on some text. [Bottom] The image demonstrates this regular expression highlighting year statements (bright green) in the Taʾrīḫ al-islām of al-Ḏahabī (d. 748/1347 CE). Program used: EditPad Pro.
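
    The pattern described above (a word for “year” followed by digits or spelled-out numbers) can be approximated in Python. This is only a rough stand-in, not the regular expression shown in the figure: the list of number words is small and incomplete, compound forms such as the teens are not handled, and all names are invented for illustration.

```python
import re

# Spelled-out number words (a small, incomplete list for illustration).
NUMBER_WORDS = {
    "ثلاث": 3, "أربع": 4, "خمس": 5, "ست": 6, "سبع": 7, "ثمان": 8, "تسع": 9,
    "عشرين": 20, "ثلاثين": 30, "أربعين": 40, "خمسين": 50,
    "ستين": 60, "سبعين": 70, "ثمانين": 80, "تسعين": 90,
    "مائة": 100, "مائتين": 200, "ثلاثمائة": 300, "أربعمائة": 400,
    "خمسمائة": 500, "ستمائة": 600, "سبعمائة": 700,
    "ثمانمائة": 800, "تسعمائة": 900,
}

# The word for "year" (sanat or 'am) as a whole word, then whitespace.
YEAR_WORD = re.compile(r"\b(?:سنة|عام)\s+")


def parse_year(tokens):
    """Turn the tokens after the year word into a number, or None."""
    if not tokens:
        return None
    if tokens[0].isdecimal():          # digits, Western or Arabic-Indic
        return int(tokens[0])
    total = 0
    for token in tokens:
        word = token
        # Number words are joined with the conjunction wa- ("and").
        if word not in NUMBER_WORDS and word.startswith("و"):
            word = word[1:]
        if word not in NUMBER_WORDS:
            break
        total += NUMBER_WORDS[word]
    return total or None


def find_years(text, max_tokens=6):
    """Collect the years of all date statements found in text."""
    years = []
    for match in YEAR_WORD.finditer(text):
        tail = text[match.end():match.end() + 80]
        tokens = [t.strip("،؛.:()") for t in tail.split()[:max_tokens]]
        year = parse_year(tokens)
        if year is not None:
            years.append(year)
    return years
```

    Summing the number words works because Arabic date statements list ones, tens, and hundreds as separate words; a statement with the hundreds dropped simply yields a number below 100.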

    Such an approach is not without its problems, of course, but it may serve well as an exploratory technique. The results of the experiment are intriguing in a number of ways, even though not entirely consistent. The most important outcome is the discovery that the collection of 10,000 texts contains only about 785 texts with more than 100 date statements per text (and since the included collections overlap, the number of unique titles is even smaller). Needless to say, working with 785 texts is significantly easier than working with 10,000 titles. Additionally, frequencies of date statements for each text offer an opportunity to focus one’s efforts on the texts that contain the most data suitable for time-series analysis.

    Chronological coverage. The graphs show the chronological coverage for the same text generated with two different approaches: the orange dotted line represents the ideal situation (data collected through the manual tagging of the entire source), while the blue solid line represents the only realistic situation (data extracted computationally). While the absolute results differ, the relative distribution is very similar and emphasizes the same periods. On the problem of the 1st century AH (622–718 CE) see below.

    The graph above shows two different representations of the chronological coverage of the Hadiyyaŧ al-ʿārifīn by Ismāʿīl Bāšā al-Baġdādī (d. 1338/1919 CE), a bibliographical collection that builds upon the famous Kašf al-ẓunūn of Ḥāǧī Ḫalīfaŧ (d. 1067/1656 CE). The blue line shows the frequencies of date statements by period (binned into 50-year periods), strongly suggesting more emphasis on the 11th and 12th centuries AH (1591–1785 CE). The orange dotted line shows the distribution of biobibliographical records on about 8,800 authors; this actual distribution of discrete information units in the source emphasizes the same period of the 11th and 12th centuries. The similarity in the patterns of distribution shows that reliance on computationally extracted date statements is a viable alternative.
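
    Comparing two such series when their absolute counts differ comes down to comparing relative shares per period. The Python sketch below bins years into 50-year periods, converts counts to shares, and reports the most prominent periods. The function names and both sample series are invented for illustration and are not data from the Hadiyyaŧ al-ʿārifīn.

```python
from collections import Counter


def shares(years, width=50):
    """Share of all years falling into each period (keyed by period end)."""
    counts = Counter(-(-y // width) * width for y in years)
    total = sum(counts.values())
    return {end: counts[end] / total for end in sorted(counts)}


def top_periods(years, n=2, width=50):
    """The n periods with the largest shares, in chronological order."""
    dist = shares(years, width)
    return sorted(sorted(dist, key=dist.get, reverse=True)[:n])


# Hypothetical data: years from extracted date statements versus
# years attached to manually tagged records of the same source.
extracted = [640, 1010, 1020, 1040, 1110, 1120, 1130, 1140]
tagged = [630, 1030, 1045, 1105, 1125, 1149]
same_emphasis = top_periods(extracted) == top_periods(tagged)
```

    The two series differ in size, yet both point to the periods ending in 1050 and 1150 AH, which is the kind of agreement the paragraph above describes.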

    The 1st Century Problem

    Unfortunately, many texts suffer from what can be characterized as “the 1st century problem”: authors often drop hundreds from date statements (authors from the second millennium also tend to drop thousands), which leads to a very high number of date statements referring, at face value, to the 1st century AH (622–718 CE). As a result, the 1st century often gets inflated, overshadowing other periods. The graph below illustrates this issue.

    Since authors often drop hundreds from their date statements, the 1st century AH gets overinflated. As the title suggests, al-Saḫāwī’s (d. 902/1496 CE) al-Ḍawʾ al-lāmiʿ li-ahl al-ḳarn al-tāsiʿ focuses on the 9th century AH (1397–1494 CE), but—as the graph above shows—the number of date statements referring to the 8th (1300–1397 CE) and 9th (1397–1494 CE) centuries is significantly smaller than of those referring to the 1st century (notice the gap in between!). It is clear that al-Saḫāwī is dropping hundreds from his date statements. The problem is that some of those statements may refer to the 8th century, while some others to the 9th, so moving them all to the 9th century is hardly a solution.

    The problem may be resolved through the sequential analysis of date statements in texts. Authors are not likely to drop hundreds from their statements without letting their readers know what century they are talking about. In other words, an incomplete date statement must be preceded by a complete one. Thus, one can check whether there are other date statements nearby, and if there are, the incomplete date can be fit into the period of the preceding statement.

    The implemented algorithm grabs a 100-word chunk before a 1st-century date statement and checks if there are other date statements in that chunk. The procedure is repeated up to five times, that is, checking up to 500 words (an equivalent of 1 to 3 printed pages) before the date statement in question, until either the text limit is reached or a date statement is found. If a date statement is found, its century is applied to the starting date statement that we treated as incomplete. In other words, if we start with “the year 65”, and we find “the year 530” preceding it, we change the first date into “the year 565” (1169 CE). If the preceding date is also from the 1st century, the starting date remains unchanged; the date also remains unchanged if no other date statements have been found. Additionally, the algorithm runs in two different ways: in the first case, it does not build on updated date statements (Line B), while in the second it does, extrapolating from corrected date statements (Line C). The graph below shows the results.
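
    The procedure can be sketched in Python as follows. This is a simplified reconstruction from the description above, not the original script. It assumes that date statements have already been extracted as (word offset, year) pairs, that the nearest preceding statement within 500 words is the one consulted (the description does not say which statement is used when a chunk holds several), and that a year such as 600 belongs to the century 501–600. The function and parameter names are hypothetical.

```python
def resolve_dropped_hundreds(statements, window=100, max_steps=5,
                             extrapolate=False):
    """Fill in dropped hundreds for 1st-century date statements.

    statements: list of (word_offset, year) pairs in reading order.
    extrapolate=False compares against the original preceding year
    (Line B); extrapolate=True compares against the already corrected
    preceding year (Line C).
    """
    reach = window * max_steps          # 5 chunks of 100 words = 500 words
    resolved = []
    for i, (offset, year) in enumerate(statements):
        new_year = year
        if year < 100 and i > 0:        # candidate with dropped hundreds
            prev_offset = statements[i - 1][0]
            if extrapolate:
                prev_year = resolved[i - 1][1]
            else:
                prev_year = statements[i - 1][1]
            # Use the preceding date only if it is close enough and is
            # itself not a 1st-century date.
            if offset - prev_offset <= reach and prev_year > 100:
                new_year = (prev_year - 1) // 100 * 100 + year
        resolved.append((offset, new_year))
    return resolved


# Hypothetical statements: (position in words, year as written).
found = [(10, 530), (120, 65), (300, 70), (2000, 12)]
line_b = resolve_dropped_hundreds(found)
line_c = resolve_dropped_hundreds(found, extrapolate=True)
```

    In this example both runs turn 65 into 565. Only the extrapolating run also turns 70 into 570, because it treats the corrected 565 as a complete date, which mirrors why Line C redistributes far more statements than Line B. The last statement stays unchanged in both runs since nothing precedes it within 500 words.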

    The graph shows new results for al-Saḫāwī’s (d. 902/1496 CE) al-Ḍawʾ al-lāmiʿ li-ahl al-ḳarn al-tāsiʿ: A (solid blue line) shows unmodified date statements (as in the previous graph); B (dotted orange line) shows the results of the first run of the algorithm, in which over 2,800 statements were updated, but there are still many dates for the 1st century; C (dashed green line) shows the results of the second run of the algorithm, which builds on the updated dates: almost 12,000 date statements were redistributed, now clearly showing that the book is about the 9th century.
    Note: a6675 is the identifier of a particular version of the text—title #6675 from al-Maktabaŧ al-Šāmilaŧ; the same title from a different collection will have a different identifier.

    The question is, of course, how reliable such projections are. In order to check this we need to compare algorithmically produced results with manually disambiguated data. The graphs below show such comparisons for four different sources: A (orange dotted) shows the initial results of computational date statements collection; B (green dashed)—modified dates without extrapolation; C (red dashed)—modified results with extrapolation; and, finally, D (blue solid)—shows manually disambiguated 1st-century date statements.

    al-Wafayāt al-aʿyān of Ibn Ḫallikān (d. 681/1282 CE)

    Results for Ibn Ḫallikān’s al-Wafayāt al-aʿyān are very good—algorithmically modified dates are very close to manually disambiguated. Results of Algorithm B—modified results without extrapolation—are slightly closer to the benchmark (line D) than the results of Algorithm C. Yet, both are somewhat “overfitting” 1st-century dates. Good news: algorithmic lines B and C lead to the same conclusion as the benchmark Line D—Ibn Ḫallikān covers the period of 450–650 AH / 1058–1252 CE most thoroughly.

    al-Kāmil fī-l-taʾrīḫ of Ibn Aṯīr (d. 630/1232 CE)

    Results for Ibn Aṯīr’s al-Kāmil fī-l-taʾrīḫ are less precise: both algorithms overfitted 1st-century dates, inflating other centuries, if compared to manually disambiguated data (D). The peaks of distribution—the shape of the curve—are much closer to the benchmark than the preprocessed results (A), but computational analysis suggests that Ibn Aṯīr focuses more on the later period, while (according to manually disambiguated data) his attention is spread more evenly.

    Ṭabaḳāt al-šāfiʿiyyaŧ of Ibn Ḳāḍī Šuhbaŧ (d. 851/1447 CE)

    Results for the Ṭabaḳāt al-šāfiʿiyyaŧ of Ibn Ḳāḍī Šuhbaŧ are not ideal, but still much better than the initial results. Extending the check range from 500 words to 1,000 gets the graph—line C in particular—much closer to the benchmark (click on the image to see the graph based on the extended range of 1,000 words). The problem, however, is that for other sources 1,000-word range does not generate better results.

    Some general observations

    We are clearly not getting a 100% match with the benchmark, but that is not to be expected anyway; none of the exploratory computational methods work that way. Our model does not take into account the stylistic differences among authors. While the bulk of date statements do fall into the proposed pattern, there are occasionally slight variations that are peculiar to particular authors. Some of these peculiarities may be helpful. For example, Ibn Ḫallikān often uses the phrases li-l-hiǧraŧ or min al-hiǧraŧ with the true 1st-century date statements (which is still 75-80%), and such markers can be worked into the algorithm; other authors (about half a dozen that I checked thoroughly) use such additional phrases only occasionally. Other peculiarities are too complicated and cannot be resolved with simple algorithms. For example, Ibn Ḳāḍī Šuhbaŧ occasionally “spells” out ones in his date statements to ensure that his readers get it right (sanaŧ sabʿ bi-taḳdīm al-sīn wa-ʿišrīn …, “the year seven, with sīn in the beginning…”), which, again, breaks the general pattern for date statements. The most complicated issue, however, is that even for a scholar it may occasionally be difficult to figure out what century a certain date refers to (for example, when a biographee was born close to the middle of one century and died close to the middle of the next one). Natural languages will always pose such difficulties, yet the results produced with the offered approach are quite suitable for the goal: even when we do not get the exact results, we are still getting close enough to the benchmark for a useful distant reading of a large corpus.

    The precision of results also varies because of differences in book structure. We get more precise projections for books organized alphabetically, since in this case authors cannot afford to use too many incomplete dates (see graphs for the Hadiyyaŧ al-ʿārifīn and Wafayāt al-aʿyān above), and less precise ones for books organized chronologically. It would make sense to develop different subroutines for processing texts based on their organization. Having robust metadata on each text would help trigger analytical routines adjusted to various peculiarities, although the structure of a book can be inferred computationally (on this see below). Additionally, a more precise logic can be implemented if our texts are properly divided into logical units. Thus, in a book organized alphabetically, the analysis of dates would be limited to a single logical unit, while in a book organized chronologically the precision of analysis can be reinforced by looking into date statements in the neighboring units. At this point, results are provocatively suggestive, but in most cases some familiarity with a specific book will help make sense of its graphs.

    Complementary coverage of “continuations”

    Date statements may also offer other useful insights into Arabic historical sources. Comparing the chronological coverage of different texts may illustrate how texts relate to each other. The graphs below show a few examples of how certain texts overlap chronologically with their “continuations” (ḏayl, takmilaŧ, ṣilaŧ) and are complemented by them.

    Complementary coverage of “continuations”. [Top left] al-Ḏahabī’s Taḏkiraŧ al-ḥuffaẓ and its three ḏayls. [Top right] Ibn Abī Yaʿlá’s Ṭabaḳāt al-ḥanābilaŧ continued by Ibn Raǧab’s Ḏayl ʿalá Ṭabaḳāt al-ḥanābilaŧ. [Bottom left] Ḥaǧǧī Ḫalīfaŧ’s Kašf al-ẓunūn continued by Ismāʿīl Bāšā al-Baġdādī’s Iḍāḥ al-maknūn fī ḏayl ʿalá Kašf al-ẓunūn. [Bottom right] al-Ḫaṭīb’s Taʾrīḫ Baġdād continued by Ibn Naǧǧār’s Ḏayl (excerpted by Ibn al-Dimyāṭī in his al-Mustafād min Ḏayl Taʾrīḫ Baġdād).
    Complementary coverage of “continuations.” Taʾrīḫ mawlid al-ʿulamāʾ wa-wafayati-him of Ibn ʿAbd Allãh al-Rabaʿī (d. 397/1006 CE) is another interesting example, since we have its “continuation”, Ḏayl taʾrīḫ mawlid al-ʿulamāʾ wa-wafayati-him of ʿAbd al-ʿAzīz al-Kattānī (d. 466/1073 CE), and “the continuation of the continuation”, Ḏayl ḏayl taʾrīḫ mawlid al-ʿulamāʾ wa-wafayati-him of Hibaŧ Allãh al-Akfānī (d. 524/1130 CE). The graph vividly demonstrates how these collections complement each other chronologically.

    Date statements and the structure of books

    The distribution of date statements across a text (in other words, the pattern we get if we graph dates in the order they occur in the text) can also tell us a lot about the structural organization of a book. As the illustrations below show, alphabetical and chronological structures have distinct visual patterns. Such patterns can be helpful in assessing new corpora and identifying texts relevant for specific research purposes. Different routines can be developed for the identification and analysis of texts of other forms and genres.

    Note on graphs below: Each line represents a date statement, where the length of the line corresponds to the year that the date statement refers to. The left side of each graph is the beginning of the book; the right side is its end. Regression analysis (here visualized with the red line for linear regression and the blue one for LOWESS regression) can be used to identify the patterns of distribution without graphing. (1st-century dates were removed to make the patterns clearer.)

    Distribution of dates across historical texts: Dates in the Taʾrīḫ Dimašḳ (top) are randomly distributed across the entire length of the text, which corresponds to its alphabetical organization; the same pattern can be seen in the al-Wāfī bi-l-wafayāt (bottom), which is also organized alphabetically.
    Distribution of dates across historical texts: Dates in the Taʾrīḫ al-islām, which covers the period of Islamic history up to 700/1300 CE, display a clear rising pattern, which reflects its chronological organization.
    Distribution of dates across historical texts: Dates in the Hadiyyaŧ al-ʿārifīn display a zig-zag pattern, which reflects its alphabetical organization, where biobibliographical records within each letter are organized chronologically (This last thing was quite a discovery—even though I have spent quite a lot of time working with this text, I did not realize that biographies within each letter are organized chronologically until I saw this graph).
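
    The idea of identifying structure without graphing can be illustrated with a minimal sketch. It assumes the date statements of a text are already available as a list of years in order of occurrence, and it uses a plain Pearson correlation between position and year as a stand-in for the linear regression mentioned in the note. The function names and the threshold of 0.8 are hypothetical choices for illustration, not part of the original analysis. A strong positive correlation suggests a chronological arrangement; a weak one suggests something else (alphabetical or zig-zag patterns are not distinguished here).

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # no variation, so no measurable trend
    return cov / sqrt(var_x * var_y)


def guess_structure(years, threshold=0.8):
    """Guess book structure from years listed in order of occurrence.

    The threshold is an illustrative assumption, not a calibrated value.
    """
    positions = list(range(len(years)))
    r = pearson_r(positions, years)
    return "chronological" if r >= threshold else "not chronological"


# Toy data (invented): years rising through the text vs. scattered years.
rising = [100, 120, 110, 150, 170, 200, 210, 260, 250, 300]
scattered = [300, 100, 250, 120, 210, 110, 260, 150, 200, 170]
```

    On real data the threshold would need tuning, and a text with a zig-zag pattern like that of the Hadiyyaŧ al-ʿārifīn would require a check within each alphabetical section rather than over the whole text.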

    Concluding remarks

    One thing that must be voiced is that if we had a corpus properly prepared by scholars and for scholars, with robust metadata and texts tagged into logical units, the results of such an experiment would have been significantly more precise and reliable, not to mention that such a corpus would also allow us to run a number of other exploratory experiments. To put it differently, we, the scholars who study the premodern Islamic world and who actively use collections developed in Arab countries and Iran for non-academic purposes (and let’s be honest, most of us do), must invest time and effort into the development of a digital library that would allow all of us to engage in methodologically novel research. Such a library would also allow us to build on each other’s research more consistently, which would help to forge a new collaborative culture that will be beneficial to the entire field.

    Appendix I: Exploring coverage of historical sources

    You can explore the chronological coverage of historical texts using Chronoplot (it may take a moment to load). Current data includes about 3,000 texts (including versions of the same text from different libraries). Keep in mind the following:

    1. Each text has a unique identifier: letter + number, where the letter refers to a collection and the number to the position of the text within that collection.
    2. Each text has three variations of date statement distribution. (Consider comparing variations for the text with the same identifier.) Texts of the same title from different collections occasionally give different distributions (especially when electronic texts are based on different printed editions).
      • A: unmodified dates (“1st century problem”);
      • B: updated dates (“single pass”);
      • C: updated dates (“double pass”).
    3. Selector (right) can be used to select titles for graphing their chronological coverage. Choosing multiple titles lets you compare their coverage.
    4. Filter (right top) can be used to find specific titles: type a part of an author’s name or a book’s title, and the list will be filtered to show only items that have your keywords.
    5. Linetype (right bottom) is a drop-down menu that offers several ways of graphing the results. The most appropriate linetype for displaying chronological coverage is “step-before,” since it shows the frequencies of date statements per 50-year period most clearly. However, this works well only for single texts. For comparative purposes “monotone” seems to be a better option.
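
    The frequencies per 50-year period that Chronoplot displays can be approximated with a short sketch. It assumes the years extracted from a text are available as a list of integers; the function name and the choice of period boundaries (periods starting at 0, 50, 100, and so on) are assumptions made for illustration and may differ from the boundaries used in the tool.

```python
from collections import Counter


def bin_years(years, width=50):
    """Count date statements per period of `width` years.

    Each period is labelled by its first year; boundaries at multiples
    of `width` are an illustrative assumption.
    """
    counts = Counter((year // width) * width for year in years)
    return dict(sorted(counts.items()))


# Toy input (invented years), e.g. hijrī years found in one text.
example = bin_years([12, 49, 50, 99, 100, 310])
```

    Running the same function over variations A, B, and C of one text would give three distributions that can be compared period by period.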

    Appendix II: Exploring coverage of historical periods

    The table below lists sources by frequencies of date statements. Like Chronoplot, this table also has three variations of each text (A, B, C). Since variations A, B, and C differ only in how dates are distributed across periods, the initial table shows only variation A. Selecting a particular century will show only texts (with variations) that have dates for those periods.

    Metadata on texts is not always complete. The missing information may be available online—where applicable, links to the online manifestations of texts are provided.

    By centuries:

    Click on the image to download the Reader.

    Bringing DH methods into a language classroom

    Learning classical Arabic is a long process. Most of us took great pleasure in advanced reading classes with our professors, but, often struggling with an overwhelming volume of new vocabulary, we also—at least occasionally—had a feeling that a traditional method is not necessarily the most effective one. While advanced students usually overcome this difficulty by their sheer passion for the subject, the introduction of excessive vocabulary creates a serious obstacle to less advanced yet capable students.

    Pervasive availability of electronic texts and computational methods of text analysis allows us to rethink how we teach difficult languages. We can identify the most frequent features within a corpus and focus our attention on them. For example, the 100 most frequent lexical items constitute about 56% of the entire vocabulary of over 34,000 Prophetic sayings (ḥadīṯ) from the Six [Sunnī] Collections (al-kutub al-sittaŧ, approximately 2.8 million words). Relying on such data, one can generate a frequency-based reader that will introduce students to the shortest texts with the most frequent vocabulary and grammatical structures. With a paced increase in difficulty of texts and incremental expansion of vocabulary, students are capable of digesting much larger volumes of text both in class and at home, and such an extended exposure enables students to internalize the authentic language more efficiently. For example, in the course of one semester, we managed to cover about 400 ḥadīṯs, while at the same time reviewing the grammar of classical Arabic and having regular discussions of thematic readings that helped students to understand the cultural importance of the Ḥadīṯ across almost 14 centuries of Islamic history.1
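
    A figure such as the 56% quoted above is the kind of number one gets by asking what share of all word occurrences in a corpus is covered by its N most frequent word forms. The sketch below shows one way to compute it, assuming the corpus has already been split into a flat list of tokens; the function name and the toy tokens are hypothetical.

```python
from collections import Counter


def top_n_coverage(tokens, n=100):
    """Share of all token occurrences covered by the n most frequent forms."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(n))
    return covered / total


# Toy corpus (ASCII stand-ins for Arabic word forms).
toy_tokens = "qala qala qala an an hadith".split()
```

    With real data, plotting this value for increasing N shows how quickly a small core vocabulary accounts for most of what a student will actually read.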

    While developed primarily with classical Arabic in mind, the approach is actually universal and can be used for any language. It works best with serialized texts, that is, a large corpus of relatively short texts of the same type (in the case of Arabic that would be ḥadīṯ collections, chronicles, biographical collections, poetic anthologies, contemporary newspapers, etc.). Considering that in terms of vocabulary various forms and genres may differ from each other quite significantly (Figure 1 shows that such difference may go up to 80%!), this method can be used to introduce students to the language of particular genres in the most efficient manner. Courses based on such readers can be a valuable addition to any language program and will be particularly welcomed by graduate students, who often face the need to develop their reading skills as quickly and efficiently as possible.

    Figure 1. The matrix shows lexical overlap across the frequency lists (top 3,000 items) that represent large thematic specimens of the Arabic language. The specimens are arranged chronologically, from the earliest (top-right corner, 9th century) to the latest (20th century). The most dramatic lexical difference is between al-Kutub al-Sittaŧ, the Six [Sunnī] Collections of ḥadīṯs, and al-Šarḳ al-awsaṭ, the modern newspaper: the frequency lists of these two sources (again, top 3,000 items) share only 20% of word forms (tokens). Even among the “classical” works the lexical distance is quite significant, with the percentage of shared vocabulary fluctuating mainly between 38% and 58% (for the interquartile range).

    Texts compared: al-Kutub al-Sittaŧ (2.8 mln. words), the six Sunnī collections of Ḥadīṯ (~9th century CE); Tafsīr al-Ṭabarī (or Ǧāmiʿ al-bayān, 3 mln. words), a commentary on the Qurʾān by al-Ṭabarī (d. 310/922 CE); Kitāb al-Aġānī (1.5 mln. words), a poetic anthology of Abū-l-Faraǧ al-Iṣbahānī (d. 356/967 CE); al-Futūḥāt al-Makkiyyaŧ (1.7 mln. words), an extensive Ṣūfī text of Ibn al-ʿArabī (d. 638/1240 CE); Fatāwá Ibn Taymiyyaŧ (2.9 mln. words), a collection of legal decisions and epistles of Ibn Taymiyyaŧ (d. 728/1327 CE); Taʾrīḫ al-Islām (3.2 mln. words), a biographical collection and chronicle of al-Ḏahabī (d. 748/1347 CE); Maǧallaŧ al-Risālaŧ (16 mln. words), an early 20th-century Egyptian literary journal; Tafsīr al-Mīzān (2.3 mln. words), a modern Šīʿī commentary on the Qurʾān by al-Sayyid al-Ṭabāṭabāʾī (d. 1981 CE); and al-Šarḳ al-Awsaṭ (2.5 mln. words), a modern Arabic newspaper (collected by Tariq Yousef from
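
    The pairwise comparison behind a matrix like Figure 1 can be sketched as follows. Each corpus is reduced to the set of its top-N word forms, and the overlap is the share of forms that two such sets have in common. How exactly the percentages in the figure were normalized is not stated, so dividing by the size of the smaller list is an assumption, as are the function names and the toy data.

```python
from collections import Counter


def top_forms(tokens, n=3000):
    """Set of the n most frequent word forms in a token list."""
    return {form for form, _ in Counter(tokens).most_common(n)}


def shared_share(tokens_a, tokens_b, n=3000):
    """Share of top-n forms that two corpora have in common.

    Normalizing by the smaller list is an illustrative assumption.
    """
    forms_a = top_forms(tokens_a, n)
    forms_b = top_forms(tokens_b, n)
    smaller = min(len(forms_a), len(forms_b))
    if smaller == 0:
        return 0.0
    return len(forms_a & forms_b) / smaller


# Toy corpora (ASCII stand-ins for word forms), compared on their top 2 forms.
corpus_a = "a a b b c d".split()
corpus_b = "a a a e e b".split()
```

    Applying `shared_share` to every pair of corpora yields a symmetric matrix of the kind shown in the figure.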

    Description of the method

    The overall procedure is rather simple and runs as described below.

    Step I. Ḥadīṯ collections were downloaded from Then, initial texts were reformatted and normalized.2 (There are multiple ways in which specimens of other genres can be obtained and then processed for a similar reader.)

    Step II. All vocabulary from the corpus was collected and converted into a frequency list. This list was then converted into a ranking list, where the most frequent item receives rank 1, the second rank 2, the third rank 3, and so on; items with the same frequency are assigned the same rank. It should be noted that vocabulary items have not been parsed with a morphological analyzer, so different forms of the same word are treated separately (e.g., ḳāla, ḳīla, ḳālat, fa-ḳāla, etc. have their own frequencies and are ranked separately). The main reason for not using the results of automatic morphological analysis is largely technical, since existing morphological analyzers are meant to work with modern standard Arabic and do not perform well on classical Arabic.3 At the same time, using frequencies of word forms (tokens) rather than dictionary forms (lexemes) has its advantages, since more frequent forms will be given more frequently in the reading materials (such as, for example, very frequent ḳāla [sing. masc.] vs. rather rare ḳālā [dual masc.]).4
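
    The conversion of a frequency list into a ranking list can be illustrated with a minimal sketch. It assumes the corpus is available as a flat list of word forms and uses dense ranking (ties share a rank and the next distinct frequency gets the next integer); whether the original ranking skipped numbers after ties is not stated, so this is an assumption, and the function name is hypothetical.

```python
from collections import Counter


def rank_forms(tokens):
    """Map each word form to a frequency rank (1 = most frequent).

    Forms with the same frequency share a rank (dense ranking, assumed).
    """
    counts = Counter(tokens)
    distinct_freqs = sorted(set(counts.values()), reverse=True)
    rank_of_freq = {freq: rank for rank, freq in enumerate(distinct_freqs, start=1)}
    return {form: rank_of_freq[count] for form, count in counts.items()}


# Toy corpus: ASCII stand-ins for separately counted word forms.
toy_ranks = rank_forms("qala qala qala an an fa-qala fa-qala qila".split())
```

    Note that `qala` and `fa-qala` are ranked independently here, mirroring the decision to work with word forms rather than dictionary forms.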

    Step III. The arithmetic mean of the ranking values was calculated for each ḥadīṯ. The resulting values then served as difficulty indices: texts with the most frequent vocabulary have the lowest means, and vice versa. These indices were then used as sorting values that allowed rearranging all 34,000 ḥadīṯs by the difficulty of their vocabulary. The advantage of the mean here is that even a single low-frequency lexical item increases the difficulty index of a text and pushes it down the list. This approach had a couple of unforeseen positive effects. First, as the length of a text increases, so does the probability that it contains rarer lexical items; as a result, the “easiest” texts are also the shortest ones. This convenient outcome allows students to begin with the shortest texts and move gradually to the longer ones. The second effect is that the most frequent vocabulary also tends to appear in the most frequent grammatical and syntactic structures.
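
    The difficulty index and the resulting ordering can be sketched as follows. The sketch assumes each text is a list of word forms and that ranks are derived from the same collection with dense ranking (an assumption, as above); all function names and the toy texts are hypothetical.

```python
from collections import Counter
from statistics import mean


def dense_ranks(texts):
    """Rank every word form in the collection by frequency (1 = most frequent)."""
    counts = Counter(token for text in texts for token in text)
    distinct_freqs = sorted(set(counts.values()), reverse=True)
    rank_of_freq = {freq: rank for rank, freq in enumerate(distinct_freqs, start=1)}
    return {form: rank_of_freq[count] for form, count in counts.items()}


def difficulty(text, ranks):
    """Difficulty index of one text: the mean rank of its word forms."""
    return mean(ranks[token] for token in text)


def order_by_difficulty(texts):
    """Return the non-empty texts sorted from easiest to hardest."""
    texts = [text for text in texts if text]  # empty texts have no index
    ranks = dense_ranks(texts)
    return sorted(texts, key=lambda text: difficulty(text, ranks))


# Toy collection (ASCII stand-ins): one long text with rare forms, two short ones.
toy_texts = [
    ["qala", "allah", "nadir", "gharib"],
    ["qala"],
    ["qala", "allah"],
]
ordered = order_by_difficulty(toy_texts)
```

    Even in this toy collection the shortest text comes first and the text with two rare forms comes last, which is the effect described above.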

    Step IV. The rearranged collection of ranked ḥadīṯs was not quite usable, since this method also groups together items that are almost the same. Here manual input was required to exclude ḥadīṯs that are too similar.
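
    The exclusion itself was done by hand, but candidates for exclusion could be flagged automatically. The sketch below is one possible aid, not the procedure used for the reader: it walks through the sorted texts and drops any text whose token sequence is very similar to one already kept, using the standard library's `difflib.SequenceMatcher`. The function name and the threshold value are assumptions, and comparing every text against all kept texts would be slow on a collection of 34,000 items without further optimization.

```python
from difflib import SequenceMatcher


def drop_near_duplicates(texts, threshold=0.9):
    """Keep a text only if it is not too similar to any text kept so far.

    Similarity is SequenceMatcher.ratio() over token lists; the threshold
    is an illustrative assumption.
    """
    kept = []
    for tokens in texts:
        too_similar = any(
            SequenceMatcher(None, tokens, other, autojunk=False).ratio() >= threshold
            for other in kept
        )
        if not too_similar:
            kept.append(tokens)
    return kept


# Toy data: the second text differs from the first in a single token.
sample = [["a", "b", "c", "d"], ["a", "b", "c", "e"], ["x", "y", "z"]]
filtered = drop_near_duplicates(sample, threshold=0.7)
```

    A manual review of the flagged items would still be advisable, since two ḥadīṯs with nearly identical wording may differ in exactly the feature a teacher wants to show.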

    Step V. At last, the selection of ḥadīṯs was converted into format and typeset into the reader in front of you. As you will see, quite a few ḥadīṯs in the beginning of the reader feature only isnāds, “the chains of transmitters”, and do not have matns, the actual texts of ḥadīṯs. I used these matn-less ḥadīṯs to introduce students to the concept of transmission of knowledge in Islamic culture, which most were not familiar with; next time around I will modify the reader to avoid having very similar texts next to each other, which can be done by the retagging of the selection of ḥadīṯs and regenerating the entire reader anew.

    In the classroom

    In my teaching, I used this reader in combination with ‘micropublications’, which provided each student with thorough practice of the foundational skills necessary for mastering the language: for each ḥadīṯ students provided full vocalization, morphological stemming, and a translation aligned with its Arabic original. Such ‘micropublications’ help monitor students’ progress and, later, can be used to automatically grade such assignments, thus freeing up time for in-class discussions. Last but not least, by producing these micropublications, students make a valuable contribution as they generate training data that can be used for various teaching and research purposes.


    1. “Classical Arabic through the Words of the Prophet” (Tufts University, Winter/Spring 2015), with the following two additional readings: W. M. Thackston, An Introduction to Koranic and Classical Arabic: An Elementary Grammar of the Language (Bethesda, Md.: Ibex Publishers, 2000), Jonathan Brown, Hadith: Muhammad’s Legacy in the Medieval and Modern World (Oxford: Oneworld, 2009).

    2. On normalization, see: Nizar Y. Habash, Introduction to Arabic Natural Language Processing ([San Rafael, Calif.]: Morgan & Claypool Publishers, 2010), 21–23.

    3. For example, Buckwalter Morphological Analyser, which has been tested with this corpus (using Perseus morphological services), returned no results for about 25% of tokens, single results for another 25%, and more than one result for the remaining 50%. Needless to say, such results are hardly usable for our purposes.

    4. An ability to recognize rare forms is important, of course, but it can be practiced through grammatical and morphological exercises (examples can be found at the end of the reader).

    By: Maxim Romanov, Matthew Thomas Miller,
    Sarah Bowen Savant, and Benjamin Kiessling

    The OpenITI team, building on the foundational open-source OCR work of Leipzig University’s (LU) Alexander von Humboldt Chair for Digital Humanities, has achieved Optical Character Recognition (OCR) accuracy rates for classical Arabic-script texts in the high nineties. These numbers are based on our tests of seven different Arabic-script texts of varying quality and typefaces, totaling over 7,000 lines (~400 pages, 87,000 words). These accuracy rates not only represent a distinct improvement over the actual accuracy rates of the various proprietary OCR options for classical Arabic-script texts, but, equally important, they are produced using open-source OCR software called Kraken (developed by Benjamin Kiessling, LU), thus enabling us to make this Arabic-script OCR technology freely available to the broader Islamic, Persian, and Arabic Studies communities in the near future. Unlike more traditional OCR approaches, Kraken relies on a neural network, which mimics the way we learn, to recognize letters in the images of entire lines of text without first trying to segment lines into words and then words into letters. This segmentation step, a mainstream OCR approach that persistently fails on connected scripts, is thus completely removed from the process, making Kraken uniquely powerful for dealing with the diverse variety of ligatures in connected Arabic script. In the process we also generated over 7,000 lines of “gold standard” (double-checked) data that can be used by others for Arabic-script OCR training and testing purposes.

    Our working paper can be found on

    Kraken ibn Ocropus. Based on a depiction of an octopus from a manuscript of Kitāb al-ḥašāʾiš fī hāyūlā al-ʿilāj al-ṭibbī (Leiden, UB : Or. 289); special thanks to Emily Selove for help with finding an octopus in the depths of the Islamic MS tradition.