metadata {

authority_id: bgnpcgn
id: 1968
language: iso-639-3:pus
source_script: Arab
destination_script: Latn
name: Romanization of Pashto (1968)
url: https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/693760/ROMANIZATION_OF_PASHTO.pdf
creation_date: 1968
confirmation_date: 2017-11
description: |
  Pashto is an Indo-Iranian language and is one of two
  nationally official languages in Afghanistan and one of
  five regionally recognised languages in Pakistan. The
  romanization system presented here may be applied to all
  Pashto geographical names. Although the BGN/PCGN policy for
  geographical names in Afghanistan is to apply the BGN/PCGN
  national system of romanization for Afghanistan (2007),
  which incorporates Dari elements, when applied to a Pashto
  geographical name, the romanized results of the BGN/PCGN
  national system for Afghanistan are the same as those of
  this Pashto romanization system. The Pashto alphabet uses
  a modified form of the Perso-Arabic script, and contains
  twelve additional consonants not present in standard
  Arabic, as well as three additional vowel characters and an
  additional vowel point. Consonants: پ ټ چ څ ځ ډ ړ ږ ژ ښ گ ڼ;
  Vowels: ى ۍ ې; Vowel point: ٙ. The points used in Arabic to
  mark short vowels and certain other diacritical marks are
  not written in Pashto. Consequently, a reference source may
  sometimes be required to aid correct identification of the
  standard spellings and proper vowels and elimination of
  dialectal and idiosyncratic variations. In the interests of
  clarity, a column showing vowel pointing from Arabic to
  indicate short vowels has been included in the examples
  below, alongside the unpointed form that will usually be
  encountered. However, it should be noted that the
  pronunciation of short vowels will vary. (Note: it is
  recommended that a font such as Scheherazade, available
  from www.sil.org, which includes the Unicode extended
  Arabic sub-range, be used to view this system.)

notes:
  - 1. Alif (ا) should be romanized as follows
    a. Initially, it indicates that the word begins with a vowel or
    diphthong; the alif itself is not romanized, but rather the
    short vowel it “carries” is romanized; e.g.,
    ميړ أَسلَم ژرندَه → Mīṟ Aslam Zhrandah.
    b. When it carries a maddah (آ) (see vowel table, row 3), it
    represents ā; e.g., آب بَند → Āb Band.
    c. Medially and finally it represents ā (see table 2, row 2);
    e.g., ماڼۍ → Māṉêy.
    d. Medially and finally in words of Arabic origin, alif may
    serve as the bearer of hamzah, e.g. رأس → ra’s. See also
    note 4.

  - 2. The characters tsē (څ) and dzē (ځ) may be romanized t͡s
    and d͡z (the combining double inverted breve, Unicode 0361,
    appearing over the digraph) when for special reasons it is
    desired that confusion be avoided between ت (t) plus س (s)
    and between د (d) plus ز (z), respectively.

  - 3. Occasionally the character sequences كه, زه, سه, and گه
    occur. They may be romanized k·h, z·h, s·h, and g·h in order
    to differentiate these romanizations from the digraphs kh,
    zh, sh, and gh, which are used to represent the characters
    خ, ژ, ش, and غ respectively.

  - 4. Hamzah (ء) should be romanized as follows
    a. In word-initial position, where it will appear either
    above or below alif (أ or إ), it indicates a short vowel and
    should not itself be romanized. In other positions it should
    be romanized by an apostrophe, e.g. جُزء → juz’.
    b. Yeh with hamzah (ئ) should be romanized êy, unless it
    represents the compound (iẕāfah) morpheme, in which case it
    is romanized according to note 9 below.

  - 5. The division of words utilized in Pashto writing is
    followed in romanization, except that the elements -ābād, -khwā,
    -shahr, -zādah, -zay and -ullāh are always romanized as part
    of the preceding word, e.g. رَحْمَت آباد → Raḩmatābād and
    رَحْمَت الله → Raḩmatullāh. However, when the word for God
    (الله) appears as a standalone word it should be written
    Allāh. Note also the “dagger alif” ( ٰ ) above the second ل
    (lām) in the word الله; this, like the short vowels, is not
    written in Pashto but should be romanized ā, like a
    full-size alif. Persian derivational endings such as -vand
    and endings of Turkish origin such as -lar, -lī, -lū, -i, -u,
    -si, and -su, should be written together with the preceding
    word.

  - 6. The Pashto preposition ‫د‬ should be romanized dê in
    agreement with its pronunciation, despite the fact that
    it is sometimes pointed with kasrah ( ِ ).

  - 7. In names of Arabic origin, the l of the definite article
    al/ul is assimilated before the ‘sun letters’ t, s̄, d,
    z̄, r, z, s, sh, ş, ẕ, ţ, z̧, l and n. In romanization,
    the article will be written al or its assimilated
    equivalent in name-initial position but ul or its
    assimilated equivalent elsewhere; the article should be
    separated from the name it precedes and should not be
    capitalized, except at the beginning of a name, e.g. جَبَل
    السَرَاج → Jabal us Sarāj.

  - 8. In Arabic names, a shaddah ( ّ ) is used to denote the
    doubling of a particular consonant character, e.g.
    مُحَمَّد → Muḩammad. However, in Pashto this ‘doubling’
    is frequently omitted in both Perso-Arabic script and the
    resulting romanization. Guidance on doubling may be taken
    from an authoritative names source, such as an Afghan
    government source or Pashto dictionary; for example, it is
    usual to see Ḩājī without and ‘Abbās with the doubled
    consonant. The doubled y consonant is almost always
    retained, as in Sayyid or Qayyūm.

  - 9. The iẕāfah morpheme is not a grammatical feature of
    Pashto and, if encountered in a linguistically hybrid
    geographical name (i.e. combining features of both Pashto
    and Dari), it should be treated according to the BGN/PCGN
    national system of romanization for Afghanistan, 2007, as -e,
    unless the preceding word ends with a silent heh (ه) or a
    vowel, when it should be shown -ye, e.g.
    غرِ حصار → Ghar-e Ḩişār; قَلعَهٔ نَو → Qal‘ah-ye Now.

  - 10. The character sequence خو, when followed by ا or ی,
    should be romanized khw, although the w is either not
    pronounced, or only weakly pronounced; e.g. خواجه →
    khwājah.

  - 11. An inventory of letter-diacritic combinations in addition to the unmodified letters of the
    basic Roman script is
    ‘ (U+2018)
    ’ (U+2019)
    Ā (U+0100)
    ā (U+0101)
    Á (U+00C1)
    á (U+00E1)
    Ḏ (U+0044+0331)
    ḏ (U+0064+0331)
    Ē (U+0112)
    ē (U+0113)
    Ê (U+00CA)
    ê (U+00EA)
    Ḩ (U+1E28)
    ḩ (U+1E29)
    Ī (U+012A)
    ī (U+012B)
    N̄ (U+004E+0304)
    n̄ (U+006E+0304)
    Ō (U+014C)
    ō (U+014D)
    Ṟ (U+0052+0331)
    ṟ (U+0072+0331)
    Ş (U+015E)
    ş (U+015F)
    S̄ (U+0053+0304)
    s̄ (U+0073+0304)
    Ṯ (U+0054+0331)
    ṯ (U+0074+0331)
    Ţ (U+0162)
    ţ (U+0163)
    Ū (U+016A)
    ū (U+016B)
    Z̧ (U+005A+0327)
    z̧ (U+007A+0327)
    Z̄ (U+005A+0304)
    z̄ (U+007A+0304)
    Ẕ (U+005A+0331)
    ẕ (U+007A+0331)
    Z͟H (U+005A+035F+0048)
    z͟h (U+007A+035F+0068)

}

tests {

test "بَغْلان", "Baghlān"
test "پُوټَكَى", "Pōṯakay"
test "شِيرِين تَگَاب", "Shīrīn Tagāb"
test "کُوْټ", "Kōṯ"
test "ثَابِر", "S̄ābir"
test "جَلال آبَاد", "Jalālābād"
test "چَارِيكَار", "Chārīkār"
test "ځَدْرَاڼ", "Dzadrāṉ"
test "څَوکۍ", "Tsowkêy"
test "حَضْرَتِ إِمَام", "Ḩaẕrat-e Imām"
test "خُوْسْت", "Khōst"
test "سْپِين بُوْلْدَک", "Spīn Bōldak"
test "ډَنْډ وَ پَتَان", "Ḏanḏ Wa Patān"
test "كَنْدَهَار", "Kandahār"
test "أَنْدَړ", "Andaṟ"
test "كُنْدُز", "Kunduz"
test "مِير أَسْلَم ژْرَنْدَه", "Mīr Aslam Zhrandah"
test "ږِيرَه", "Z͟hīrah"
test "سَمَنْگَان", "Samangān"
test "كښٙتَه كَلا", "Ks͟hêtah Kalā"
test "قَيْصَار", "Qayşār"
test "فَيض آبَاد", "Faīẕābād"
test "حَضْرَتِ سُلْطَان", "Ḩaẕrat-e Sulţān"
test "ظَاهِر كَلا", "Z̧āhir Kalā"
test "پُلِ عَلَم", "Pul-e ‘Alam"
test "غَزْنِي", "Ghaznī"
test "مَزَارِ شَرِيف", "Mazār-e Sharīf"
test "گَرْدېز", "Gardēz"
test "کَابُل", "Kābul"
test "مَيمَنَه", "Maīmanah"
test "خَان آبَاد", "Khānābād"
test "مَاڼۍ", "Māṉêy"
test "وَاخَان", "Wākhān"
test "يَنْگِي قَلعَه", "Yangī Qal‘ah"
test "مُرْغَاب کَابُل", "Murghāb Kābul"
test "گٙردُون", "Gêrdōn"
test "آب بَنْد", "Āb Band"
test "جَوزجَان", "Jowzjān"
test "مَیدان شَهْر", "Maīdān Shahr"
test "ډَنْډِ سُفْلىٰ", "Ḏanḏ-e Suflá"
test "جَبَل السَرَاج", "Jabal us Sarāj"

}

dependency "bgnpcgn-prs-Arab-Latn-2007", as: arablatn

stage {

run map.arablatn.stage.main

# CHARACTERS
parallel {

  sub "\u0650", "i" # ِ kasra
  sub "\u064f", "u" # ُ damma

  sub "\u0650" + boundary, "-e" # ِ word-final kasra marks the iẕāfah (Note 9)

  # Note 5: these elements are joined to the preceding word
  sub space + "\u0627\u0644\u0644\u0651\u064e\u0647", "ullāh" # الله
  sub space + "\u062E\u0648\u0627", "khwā" # خوا
  sub space + "\u0634\u064E\u0647\u0631", "shahr" # شهر
  sub space + "\u0632\u0627\u062F\u0629", "zādah" # زادة
  sub space + "\u0632\u064E\u064a", "zay" # زي
  sub "\u0652", "" # ْ sukun (not romanized)
  sub "\u0659", "ê" # ٙ zwarakay

  # Sun letters
  sub boundary + "\u0627\u0644\u062a" + maybe("\u0651"), "ut t" # الت
  sub boundary + "\u0627\u0644\u062b" + maybe("\u0651"), "us̄ s̄" # الث
  sub boundary + "\u0627\u0644\u062f" + maybe("\u0651"), "ud d" # الد
  sub boundary + "\u0627\u0644\u0630" + maybe("\u0651"), "uz̄ z̄" # الذ
  sub boundary + "\u0627\u0644\u0631" + maybe("\u0651"), "ur r" # الر
  sub boundary + "\u0627\u0644\u0632" + maybe("\u0651"), "uz z" # الز
  sub boundary + "\u0627\u0644\u0633" + maybe("\u0651"), "us s" # الس
  sub boundary + "\u0627\u0644\u0634" + maybe("\u0651"), "ush sh" # الش
  sub boundary + "\u0627\u0644\u0635" + maybe("\u0651"), "uş ş" # الص
  sub boundary + "\u0627\u0644\u0636" + maybe("\u0651"), "uẕ ẕ" # الض
  sub boundary + "\u0627\u0644\u0637" + maybe("\u0651"), "uţ ţ" # الط
  sub boundary + "\u0627\u0644\u0638" + maybe("\u0651"), "uz̧ z̧" # الظ
  sub boundary + "\u0627\u0644\u0644" + maybe("\u0651"), "ul l" # الل
  sub boundary + "\u0627\u0644\u0646" + maybe("\u0651"), "un n" # الن

  sub "\u0626", "êy" # ئ
}

# POSTRULES
sub any("\u0061".."\uFFFF"), upcase, before: boundary, not_before: boundary + any("‘’'-")
# don't capitalize the definite article in the middle of a name (Note 7)
sub " Ut T", " ut T" # الت
sub " Us̄ S̄", " us̄ S̄" # الث
sub " Ud D", " ud D" # الد
sub " Uz̄ Z̄", " uz̄ Z̄" # الذ
sub " Ur R", " ur R" # الر
sub " Uz Z", " uz Z" # الز
sub " Us S", " us S" # الس
sub " As S", " us S"   # needed to add it after porting, why?
sub " Ush Sh", " ush Sh" # الش
sub " Uş Ş", " uş Ş" # الص
sub " Uẕ Ẕ", " uẕ Ẕ" # الض
sub " Uţ Ţ", " uţ Ţ" # الط
sub " Uz̧ Z̧", " uz̧ Z̧" # الظ
sub " Ul L", " ul L" # الل
sub " Un N", " un N" # الن

compose

}