metadata {

authority_id: bgnpcgn
id: 2007
language: iso-639-3:prs
# prs stands for Dari (https://iso639-3.sil.org/code/prs&_ga=GA1.2.2054538372.1574092823)
source_script: Arab
destination_script: Latn
name: National Romanization System for Afghanistan (2007)
url: https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/693661/ROMANIZATION_FOR_AFGHANISTAN.pdf
creation_date: 2007
confirmation_date: 2017-11
description: |
  This romanization system, agreed by BGN and PCGN in November 2007,
  accommodates the linguistic complexity of Afghanistan as manifest in
  its geographical names.

  The following tabulation shows the original Perso-Arabic script with
  accompanying Unicode value (columns 1a and b), the Yaghoubi
  romanization (column 2), the BGN/PCGN romanization with accompanying
  Unicode value (columns 3a and b), an English phonetic example (column
  4), and an example toponym (columns 5b and c).

  [The Yaghoubi romanization system was developed in 1959 by
  Muzaffarud Din Yaqubi (commonly seen as Yaghoubi). It is a native
  official system designed to reflect Afghan names, both Dari and Pashto,
  and both pronunciation and genuine linguistic truth.]

  The tables function as both a romanization system for Afghanistan (i.e.
  with access to the original script, these tables can be applied to get
  a standardized Roman result - moving from columns 1 to 3) and as a
  means of converting the available Yaghoubi Roman-script spellings, as
  appear on the Fairchild Aerial Surveys map series, to standard BGN/PCGN
  spellings (moving from columns 2 to 3).

  The points used in Arabic to mark short vowels and certain other
  diacritical marks are infrequently written in Afghanistan.
  Consequently, a reference source may sometimes be required to aid
  correct identification of the standard spellings and proper vowels and
  elimination of dialectal and idiosyncratic variations. In the interests
  of clarity, the example columns show script with vowel pointing from
  Arabic to indicate the short vowels that are included alongside the
  unpointed form that will usually be encountered. However, it should be
  noted that the pronunciation of short vowels will vary.
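
  As a rough illustration of the pointed and unpointed forms mentioned
  above, the following Python sketch removes every Unicode nonspacing
  mark from a pointed spelling. It is not part of the BGN/PCGN system or
  of this map, and the helper name strip_points is hypothetical. It is
  also a simplification, since it removes marks such as hamzah above
  along with the short-vowel points.

  ```python
  import unicodedata

  def strip_points(text):
      """Return text without nonspacing marks (Unicode category Mn)."""
      return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

  # A pointed spelling of Baghlān: beh, fatha, ghain, sukun, lam, alif, noon.
  pointed = "\u0628\u064E\u063A\u0652\u0644\u0627\u0646"
  unpointed = strip_points(pointed)
  print(unpointed)  # only the five base letters remain
  ```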

  Note: it is recommended that a font such as Scheherazade, available
  from www.sil.org, which includes the Unicode extended Arabic sub-range,
  be used to view this system. [Please note that the identification of a
  particular font does not represent an endorsement of any specific
  product or manufacturer.]

notes:
  - |
    Alif (ا) should be romanized as follows:

    a. Initially, it indicates that the word begins with a vowel or
      diphthong; the alif itself is not romanized, but rather the short vowel
      it “carries” is romanized; e.g., ميړ أَسَلم ژرَندَه → Mīr Aslam Zhrandah
    b. When it carries a maddah (آ) (see vowel table, row 6), it represents ā; e.g., آب بَند → Āb Band.
    c. Medially and finally it represents ā (see vowel table, row 5); e.g., ماڼۍ → Māṉêy
    d. Medially and finally in words of Arabic origin, alif may serve as the bearer of hamzah, e.g. رأس → ra’s.

  - Occasionally the letter sequences که, زه, سه, and گه occur without
    intervening vowels. They may be romanized k·h, z·h, s·h, and g·h,
    respectively, in order to differentiate these romanizations from the
    digraphs kh, zh, sh, and gh, which are used to represent the letters
    خ, ژ, ش, and غ.
    Additionally, the Pashto letters څ and ځ, routinely romanized ts and
    dz, may be alternatively romanized s and z when for special reasons
    it is desired that confusion be avoided with the character sequences
    تس (ts) and دز (dz), respectively.

  - "The vagaries of written Afghan languages, as pertains to spacing
    and word division, are addressed as follows:
    Spaces may be added to or subtracted from Afghan words written in
    Arabic script, for the purposes of standardization. This is
    particularly relevant when the words are hand-written, are rendered
    “artistically”, or express other such non-standard flourishes, as long
    as the sense of the toponym, word, or phrase is not compromised.
    Romanized toponyms are typically divided into constituent words
    (spaces and other grammatical rules applied) when those words can stand
    independently, for purposes of standardization and minimization of
    confusion, particularly in situations where Afghan writers are
    inconsistent in their application of spacing and word breaks. When the
    Afghan word or suffix is only used in combination with other nouns or
    adjectives, then it should be appended to the preceding word in its
    romanization. This includes (but is not limited to) -ābād, -zaī,
    -zādah, - ū, -wand, -gaī, -kaī, -pūr, - ēsh, -lar, -lī, -lū and ullāh, as,
    for example, seen in Raḩmatābād (رحمت آباد) and Raḩmatullāh (رحمت الله),
    but Raḩmat Khēl (رحمتخيل) and Raḩmat Shahr (رحمتشهر)."

  - The one-letter words د (Pashto) and و (Dari) are romanized dê and
    wa, respectively.

  - The word الله, meaning God, should always be romanized Allāh,
    except as specified in note 3. Note that the Unicode value FDF2 spells
    Allāh, but omits the alif in some common fonts, including Times New
    Roman. If in doubt, try in Arial Unicode MS to verify. Also note that
    the “dagger alif” (ٰ) above the second ل (lām) in the word الله is not
    written but should be romanized ā, like a full-size alif.

  - In names of Arabic origin, the l of the definite article al is
    assimilated before the ‘sun letters’ t, s̄, d, z̄, r, z, s, sh, ş, ẕ, ţ,
    z̧, l and n. In its romanization, the article should be separated from
    the name it precedes and should not be capitalized except at the
    beginning of a name, e.g. جبل السراج → Jabal as Sarāj.

  - In Arabic names, a shaddah (ّ) is used to denote the doubling of a
    particular consonant character, e.g. مُحَمَّد → Muḩammad. However, in
    Pashto this ‘doubling’ is frequently omitted in both Perso-Arabic script
    and the resulting romanization. Guidance on doubling may be taken from
    an authoritative names source, such as an Afghan government source or
    Pashto dictionary; for example, it is usual to see Ḩājī without and
    ‘Abbās with the doubled consonant. The doubled y consonant is almost
    always retained, as in Sayyid or Qayyūm.

  - In Afghan names which contain an iẕāfah, it should be romanized as
    -e or -ye according to common pronunciation, but generally -e is used
    if the preceding word ends with a consonant other than silent heh, and
    -ye if the preceding word ends with a vowel sound, e.g.
    غر حِصار → Ghar-e Ḩişār; قَلعَهٔ نَو → Qal‘ah-ye Now. Scholarly sources
    indicate that heh is silent in darah and qal‘ah (thus darah-ye,
    qal‘ah-ye), but lightly spoken in kōh and chāh (thus kōh-e, chāh-e).

  - The character sequence خو, where followed by ا or ی should be
    romanized khwā or khwī, although the w is either not pronounced, or
    only weakly so, as in خواجه → khwājah.

  - Plural nouns ending in -hā or -ān should always be romanized as a
    single word, regardless of whether a space appears in a Perso-Arabic
    script source.

  - Unicode values listed in the tables above are required to ensure
    standardization and to minimize confusion from competing
    representations of a given character. It should be noted that the
    Persian ک (Unicode value 06A9) is recommended rather than the Arabic ك
    (Unicode value 0643 or FEDA or FED9), the Persian گ (Unicode
    value 06AF) is recommended rather than ګ (Unicode value 06AB) or ڰ
    (Unicode value 06B0) or ك (Unicode value 0643 or FEDA or FED9), and the
    Pashto character ځ (Unicode value 0681) is recommended rather than the
    heh with a dot above and a dot below (no Unicode value). For the letter ی
    in its many variations, care must be exercised to follow this romanization
    guide's recommendations to eliminate confusion for search engines
    and software. BGN/PCGN does not use the Unicode encoding FEEF for the
    character ی in any Afghan word.

  - |
    An inventory of letter-diacritic combinations in addition to the
    unmodified letters of the basic Roman script is:

    ‘ (U+2018)
    Ā (U+0100)
    Á (U+00C1)
    Ḏ (U+0044+0331)
    Ē (U+0112)
    Ê (U+00CA)
    Ḩ (U+1E28)
    Ī (U+012A)
    N-bar-top (U+004E+0304)
    Ō (U+014C)
    R-bar-bottom (U+0052+0331)
    Ş (U+015E)
    S-bar-top (U+0053+0304)
    Ṯ (U+0054+0331)
    Ţ (U+0162)
    Ū (U+016A)
    Z-comma-bottom (U+005A+0327)
    Z-bar-top (U+005A+0304)
    Ẕ (U+005A+0331)
    ẔH (U+005A+0048+035F)

    ’ (U+2019)
    ā (U+0101)
    á (U+00E1)
    ḏ (U+0064+0331)
    ē (U+0113)
    ê (U+00EA)
    ḩ (U+1E29)
    ī (U+012B)
    n-bar-top (U+006E+0304)
    ō (U+014D)
    r-bar-bottom (U+0072+0331)
    ş (U+015F)
    s-bar-top (U+0073+0304)
    ṯ (U+0074+0331)
    ţ (U+0163)
    ū (U+016B)
    z-comma-bottom (U+007A+0327)
    z-bar-top (U+007A+0304)
    ẕ (U+007A+0331)
    zh-under-bar (U+007A+0068+035F)
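
    Entries given as a base letter plus a combining mark can be checked
    mechanically. The Python sketch below is an illustration only and is
    not part of the BGN/PCGN specification; the helper names nfc and
    code_points are hypothetical. It assumes Unicode NFC normalization,
    which merges a base letter and a combining mark only where a
    precomposed character exists.

    ```python
    import unicodedata

    def nfc(text):
        """Normalize text to Unicode Normalization Form C."""
        return unicodedata.normalize("NFC", text)

    def code_points(text):
        """Format code points in the style of the inventory, e.g. U+005A+0331."""
        return "U+" + "+".join(f"{ord(ch):04X}" for ch in text)

    # Z + COMBINING MACRON BELOW has a precomposed form, so NFC merges it.
    print(code_points(nfc("Z\u0331")))  # U+1E94
    # S + COMBINING MACRON has none, so it stays as two code points.
    print(code_points(nfc("S\u0304")))  # U+0053+0304
    ```

    This is why some inventory entries can only be written as a sequence
    of two code points, while others also have a single-code-point form.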

  - The Romanization columns show only lowercase forms but, when
    romanizing, uppercase and lowercase Roman letters as appropriate should
    be used.

}

tests {

test "بَغْلان", "Baghlān"
test "پُوټَكَى", "Pōṯakay"
test "شِيرِين تَگَاب", "Shīrīn Tagāb"
test "کُوْټ", "Kōṯ"
test "ثَابِر", "S̄ābir"
test "جَلال آبَاد", "Jalālābād"
test "چَارِيكَار", "Chārīkār"
test "ځَدْرَاڼ", "Dzadrāṉ"
test "څَوکۍ", "Tsowkêy"
test "حَضْرَتِ إِمَام", "Ḩaẕrat-e Imām"
test "خُوْسْت", "Khōst"
test "سْپِين بُوْلْدَک", "Spīn Bōldak"
test "ډَنْډ وَ پَتَان", "Ḏanḏ Wa Patān"
# - source: گُذَرْگَاهٔ نور
#   expected: Guz̄argāh-e nūr
test "كَنْدَهَار", "Kandahār"
test "أَنْدَړ", "Andaṟ"
test "كُنْدُز", "Kunduz"
test "مِير أَسْلَم ژْرَنْدَه", "Mīr Aslam Zhrandah"
test "ږِيرَه", "Z͟hīrah"
test "سَمَنْگَان", "Samangān"
# - source: مَزَارِ شَريف
#   expected: Mazār-e sharīf
test "كښٙتَه كَلا", "Ks͟hêtah Kalā"
test "قَيْصَار", "Qayşār"
test "فَيض آبَاد", "Faīẕābād"
test "حَضْرَتِ سُلْطَان", "Ḩaẕrat-e Sulţān"
test "ظَاهِر كَلا", "Z̧āhir Kalā"
test "پُلِ عَلَم", "Pul-e ‘Alam"
test "غَزْنِي", "Ghaznī"
test "مَزَارِ شَرِيف", "Mazār-e Sharīf"
test "قَيْصَار", "Qayşār"
test "كَنْدَهَار", "Kandahār"
test "گَرْدېز", "Gardēz"
test "کَابُل", "Kābul"
test "مَيمَنَه", "Maīmanah"
test "خَان آبَاد", "Khānābād"
test "مَاڼۍ", "Māṉêy"
test "وَاخَان", "Wākhān"
# - source: هِرَات
#   expected: Herāt
test "يَنْگِي قَلعَه", "Yangī Qal‘ah"
test "جَلال آبَاد", "Jalālābād"
# - source: هِرات پُلِ حِصَار
#   expected: Herāt Pul-e Ḩişār
test "مُرْغَاب کَابُل", "Murghāb Kābul"
test "گٙردُون", "Gêrdōn"
test "آب بَنْد", "Āb Band"
test "سْپِين بُوْلْدَک", "Spīn Bōldak"
# - source: بَالا بُلُوک
#   expected: Bālā Bulūk
test "جَوزجَان", "Jowzjān"
# - source:  غَزْنِى سْپِين
#   expected: Ghaznī spīn
# - source: ريگ مَيوَنْد
#   expected: Maywand, Rēg
test "گَرْدېز", "Gardēz"
test "مَیدان شَهْر", "Maīdān Shahr"
test "ډَنْډِ سُفْلىٰ", "Ḏanḏ-e Suflá"
# - source: څَوْکۍ
#   expected: Tsowkêy
# - source: هَوائِي ډَگَر
#   expected: Hawā’ī ḏagar
# - source: مَزارِ شَريف
#   expected: Mazār-e sharīf
# - source: دايکندی
#   expected: Dāykundī
# - source: زيارت
#   expected: Zīārat
# - source: غوريان
#   expected: Ghōriyān
# - source: ميا
#   expected: Myā
test "جَبَل السَرَاج", "Jabal as Sarāj"

}

stage {

# CHARACTERS
parallel {

  # word-medial or word-final form where so appearing in a word.
  # '\u0627': '-'

  # # Vowel, Diphthong and Diacritical Characters

  # '\u064E': 'a'

  # # Both e and i are available to romanize this short vowel,
  # # depending on local usage and/or root language. In cases where the sound
  # # is uncertain, i is the default romanization in BGN/PCGN standardization
  # # procedures.
  # '\u0650':
  #   - 'e'
  #   - 'i'

  # # Both o and u are available to romanize this short vowel,
  # # depending on local usage and/or root language. In cases where the sound
  # # is uncertain, u is the default romanization in BGN/PCGN standardization
  # # procedures.
  # '\u064F':
  #   - 'o'
  #   - 'u'
  # '\u0659': 'ê'

  # # An alif with mad ( آ ) is written only in the initial position by
  # # BGN/PCGN standardization procedures, in keeping with Persian language
  # # family standards of use of the Arabic alphabet. The same letter written
  # # in a medial or final position is written . . .
  # '\u0622': 'ā'

  # pending issue #442
  # '\u0648': 'ō'
  # '\u0648': 'ū'
  # '\u0648': 'ow'
  # '\u06CC': 'ī'

  # # Or 'ē'. The character ی should be romanized ay or ē according to
  # # its root language or local pronunciation. In case of uncertainty a
  # # reference source (such as the Fairchild Aerial Surveys map series, or a
  # # BGN/PCGN approved policy document/list of recommended spellings) should
  # # be consulted.
  # '\u06CC': 'ay'
  # '\u06D0': 'ē'

  # # Or 'aī'. Both the combination ay and aī are available to romanize
  # # this character according to its root language or local pronunciation.
  # # In cases where the sound is uncertain ay is the default romanization in
  # # BGN/PCGN standardization procedures
  # '\u06CC':
  #   - 'ay'
  #   - 'á'
  # '\u06CD': 'êy'
  # '\u0621': '’'
  # '\u0674':
  #   - '-e'
  #   - '-ye'

  # # Other Diacritical Marks and Language Conventions

  # '\u0627': 'āy'

  # '\u0648': 'w'
  # '\u0626': '’'
  # '\u06C0': ''
  # '\u0651': ''

  # special rules

  sub space, "", after: "\u0622\u0628\u064E\u0627\u062F" # space followed by abad is removed
  sub "\ufdf2", "Allāh" # See note 5

  # pointing
  sub "\u064e", "a" # َ fatha
  sub "\u064e", "", after: "\u0629" # َ fatha followed by ta' marboota
  sub "\u064e", "", after: "a" + any("ht") # َ fatha followed by ta' marboota, handling different order of conversion

  # Both e and i are available to romanize this short vowel,
  # depending on local usage and/or root language. In cases where the sound
  # is uncertain, i is the default romanization in BGN/PCGN standardization
  # procedures.
  sub "\u0650", any("ie")
  sub "\u0650" + boundary, "-e" # ِ kasra

  # Both o and u are available to romanize this short vowel,
  # depending on local usage and/or root language. In cases where the sound
  # is uncertain, u is the default romanization in BGN/PCGN standardization
  # procedures.
  sub "\u064f", any("uo") # ُ damma

  sub "\u0652", "" # ْ sokoon
  sub "\u0659", "ê"

  # special pointed letters
  sub "\u0639\u064e", "‘a" # عَ
  sub "\u0639\u0650", "‘i" # عِ
  sub "\u0639\u064f", "‘u" # عُ
  # handle MacOS regex difference
  sub "\u0639\u064f\u0648", "‘ū" # عُو damma followed by و

  sub "\u0650\u064a", "ī" # ـِي kasra followed by ي
  sub "\u0650\u06cc", "ī" # ـِی kasra followed by ی (Farsi yeh)
  sub "\u0650\u064a\u0651\u064e", "īy" # ـِيَّ
  sub "\u0650\u064a", "iy", after: any(["\u064e", "\u064f"]) # ـِي kasra followed by ي, when a fatha or damma follows
  sub "\u064f\u0648", "ō" # ـُو damma followed by و
  sub "\u064e\u0627", "ā" # ـَا fatha followed by ا
  sub "\u064e\u0649", "ay" # ـَى fatha followed by ى which is ا not ي
  sub "\u064e\u0648\u0652", "aw" # ـَوْ
  sub "\u064e\u0648", "ow" # ـَو
  sub "\u064e\u064a\u0652", "ay" # ـَيْ
  sub "\u0650\u06cc\u0651\u064e", "īy" # ـِيَّ
  sub "\u064e\u064a", "aī" # ـَي
  sub "\u064e\u06cc", "aī" # ـَي
  sub "\u0649\u0670", "á" # ىٰ
  sub "\u0674", "-e" # ٴ
  sub "\u0654", "-e" #  ٔ
  # - '-ye'

  # An alif with mad ( آ ) is written only in the initial position by
  # BGN/PCGN standardization procedures, in keeping with Persian language
  # family standards of use of the Arabic alphabet. The same letter written
  # in a medial or final position is written . . .
  sub "\u0622", "ā" # آ

  # ta' marboota
  sub "\u0629", "at" # ة in the middle of the sentence
  sub "\u0629" + line_end, "ah"
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")

  # shadda

  sub "\u0628\u0651", "bb" # ب
  sub "\u062a\u0651", "tt" # ت
  sub "\u062b\u0651", "s̄s̄" # ث
  sub "\u062c\u0651", "jj" # ج
  sub "\u062d\u0651", "ḩḩ" # ح
  sub "\u062e\u0651", "khkh" # خ
  sub "\u062f\u0651", "dd" # د
  sub "\u0630\u0651", "z̄z̄" # ذ
  sub "\u0631\u0651", "rr" # ر
  sub "\u0632\u0651", "zz" # ز
  sub "\u0633\u0651", "ss" # س
  sub "\u0634\u0651", "shsh" # ش
  sub "\u0635\u0651", "şş" # ص
  sub "\u0636\u0651", "ẕẕ" # ض
  sub "\u0637\u0651", "ţţ" # ط
  sub "\u0638\u0651", "z̧z̧" # ظ
  sub "\u063a\u0651", "ghgh" # غ
  sub "\u0641\u0651", "ff" # ف
  sub "\u0642\u0651", "qq" # ق
  sub "\u0643\u0651", "kk" # ك
  sub "\u0644\u0651", "ll" # ل
  sub "\u0645\u0651", "mm" # م
  sub "\u0646\u0651", "nn" # ن
  sub "\u0647\u0651", "hh" # ه
  sub "\u0648\u0651", "ww" # و
  sub "\u064a\u0651", "yy" # ي

  sub "\u0621", "’" # ء
  sub "\u0626", "’" # ئ

  sub "\u0623", "" # أ
  sub "\u0625", "" # إ
  sub "\u0627", "ā" # ا

  # See note 6
  sub boundary + "\u0627\u0644", "al " # ال
  # '\uFE8E' : ''  # ﺎ

  # Sun letters
  sub boundary + "\u0627\u0644\u062a" + maybe("\u0651"), "at t" # الت
  sub boundary + "\u0627\u0644\u062b" + maybe("\u0651"), "as̄ s̄" # الث
  sub boundary + "\u0627\u0644\u062f" + maybe("\u0651"), "ad d" # الد
  sub boundary + "\u0627\u0644\u0630" + maybe("\u0651"), "az̄ z̄" # الذ
  sub boundary + "\u0627\u0644\u0631" + maybe("\u0651"), "ar r" # الر
  sub boundary + "\u0627\u0644\u0632" + maybe("\u0651"), "az z" # الز
  sub boundary + "\u0627\u0644\u0633" + maybe("\u0651"), "as s" # الس
  sub boundary + "\u0627\u0644\u0634" + maybe("\u0651"), "ash sh" # الش
  sub boundary + "\u0627\u0644\u0635" + maybe("\u0651"), "aş ş" # الص
  sub boundary + "\u0627\u0644\u0636" + maybe("\u0651"), "aẕ ẕ" # الض
  sub boundary + "\u0627\u0644\u0637" + maybe("\u0651"), "aţ ţ" # الط
  sub boundary + "\u0627\u0644\u0638" + maybe("\u0651"), "az̧ z̧" # الظ
  sub boundary + "\u0627\u0644\u0644" + maybe("\u0651"), "al l" # الل
  sub boundary + "\u0627\u0644\u0646" + maybe("\u0651"), "an n" # الن

  # consonant characters

  sub "\u0628", "b" # ب
  sub "\u067E", "p" # پ
  sub "\u062a", "t" # ت
  sub "\u067C", "ṯ" # ټ
  sub "\u062B", "s̄" # ث
  sub "\u062c", "j" # ج
  sub "\u0686", "ch" # ‫چ‬

  # # The variant form ج is seen infrequently and does not have a
  # # single Unicode encoding.
  sub "\u0681", "dz" # Note 2 # ‫ځ‬

  sub "\u0685", "ts" # Note 2 # ‫څ

  sub "\u062d", "ḩ" # ح
  sub "\u062e", "kh" # خ
  sub "\u062f", "d" # د
  sub "\u0689", "ḏ" # ‫ډ‬
  sub "\u0630", "z̄" # ذ
  sub "\u0631", "r" # ر
  sub "\u0693", "ṟ" # ړ
  sub "\u0632", "z" # ز
  sub "\u0698", "zh" # ‫ژ‬
  sub "\u0696", "z͟h" # ږ
  sub "\u0633", "s" # س
  sub "\u069A", "s͟h" # ښ
  sub "\u0634", "sh" # ش
  sub "\u0635", "ş" # ص
  sub "\u0636", "ẕ" # ض
  sub "\u0637", "ţ" # ط
  sub "\u0638", "z̧" # ظ
  sub "\u0639", "‘" # ع
  sub "\u063a", "gh" # غ
  sub "\u0641", "f" # ف
  sub "\u0642", "q" # ق
  sub "\u0643", "k" # ك
  sub "\u06A9", "k" # ک
  sub "\u06AF", "g" # ‫گ‬
  sub "\u0644", "l" # ل
  sub "\u0645", "m" # م
  sub "\u0646", "n" # ن
  sub "\u06BC", "ṉ" # ڼ
  sub "\u0647", "h" # ه
  sub "\u0648", "w" # و
  sub "\u064a", "y" # ي
  sub "\u0649", "y" # ى
  sub "\u06D0", "ē" # ې
  sub "\u06CD", "êy" # ‫ۍ‬
}

# POSTRULES
sub any("\u0061".."\uFFFF"), upcase, before: boundary, not_before: boundary + any("‘’'-")
# don't capitalize the definite article in the middle of a name
sub " At T", " at T" # الت
sub " As̄ S̄", " as̄ S̄" # الث
sub " Ad D", " ad D" # الد
sub " Az̄ Z̄", " az̄ Z̄" # الذ
sub " Ar R", " ar R" # الر
sub " Az Z", " az Z" # الز
sub " As S", " as S" # الس
sub " Ash Sh", " ash Sh" # الش
sub " Aş Ş", " aş Ş" # الص
sub " Aẕ Ẕ", " aẕ Ẕ" # الض
sub " Aţ Ţ", " aţ Ţ" # الط
sub " Az̧ Z̧", " az̧ Z̧" # الظ
sub " Al L", " al L" # الل
sub " An N", " an N" # الن
sub " Al ", " al " # ال

compose

}