metadata {

authority_id: un
id: 2017
language: iso-639-2:ara
source_script: Arab
destination_script: Latn
name: Romanization of Arabic -- Unified Arabic Transliteration System (2017)
url: http://www.eki.ee/wgrs/rom1_ar.pdf
creation_date: 2017
confirmation_date: 2018-06
description: |
  The current United Nations recommended romanization
  system was approved in 2017 (resolution XI/3), based on
  the system adopted by Arabic experts at the conference
  held in Beirut in 2007, the Unified Arabic
  Transliteration System, taking into account the
  practical amendments and corrections carried out and
  agreed upon by the representatives of the Arabic-
  speaking countries at the Fourth Arab Conference on
  Geographical Names, held in Beirut in 2008, and some
  clarifications and amendments agreed in Riyadh in 20171.
  Previously, the United Nations had approved a
  romanization system in 1972 (resolution II/8), based on the
  system adopted by Arabic experts at the conference
  held at Beirut in 1971 with the practical amendments carried out
  and agreed upon by the representatives of the Arabic-speaking
  countries at their conference. The table was published in volume
  II of the conference report.
  In UN resolution XI/3 it is specifically stated that the
  system was recommended for the “romanization of the
  geographical names within those Arabic-speaking countries
  where this system is officially adopted”. There is
  evidence of its partial implementation in Jordan, Oman and
  Saudi Arabia. The UNGEGN Working Group on Romanization
  Systems intends to continue monitoring the UN system’s
  implementation across Arabic-speaking countries.
  In some countries there exist local romanization schemes
  or practices. The geographical names of Algeria, Djibouti,
  Mauritania, Morocco and Tunisia are generally rendered in
  the traditional manner which conforms to the principles of
  the French orthography.
  The previous UN-approved system is still found in
  considerable international usage.
  Arabic is written from right to left. The Arabic script
  usually omits vowel points and diacritical marks from
  writing which makes it difficult to obtain uniform results
  in the romanization of Arabic. It is essential to identify
  correctly the words which appear in any particular name
  and to know the standard Arabic-script spelling including
  the relevant vowels. One must also take into account
  dialectal and idiosyncratic deviations. The romanization
  is generally reversible though there may be some ambiguous
  letter sequences (dh, kh, sh, th) which may also point to
  combinations of Arabic characters in addition to the
  respective single characters.
notes:
  - |
    When the definite article al precedes a word beginning with
    one of the "sun letters" (t, th, d, dh, r, z, s, sh, s̱, ḏ, ṯ,
    d͟h, l, n) the l of the definite article is assimilated with
    the first consonant of the word: الشارقة Ash Shāriqah.
  - |
    The definite article is always written with a capital
    initial: الزيتون Az Zaytūn, البلد Al Balad, منية الضنية Minyat Aḏ
    Ḏinniyyah.
  - |
    Nunation is unlikely to be found in geographical names and
    the last letter remains silent: جبل = جبلٌ Jabal (not Jabalun).
  - |
    In order to disambiguate certain character sequences a
    middle dot (·) may be used: سهيلة S·haylah (cf. شيلة Shaylah), دهيب
    D·hayb (cf. ذيب Dhayb), أدهم Ad·ham (cf. أذم Adham).
  - |
    ta' marboota should be transliterated to 'ah' if it's in
    a definite article, or at the end of the sentence
    otherwise it should be transliterated to 'at'
    to handle words starting with AL and ending with ta' marboota
    which is pronounced as "ah" not "at" divided into multiple
    regex because lookbehind in ruby doesn't support variable length
  - |
    مَكّة should be transliterated to makkah, shadda above ك
    is to double the consonant, same applies to all arabic letters

}

tests {

# Examples taken from:
# https://unstats.un.org/unsd/geoinfo/geonames/
test "مِصر", "Mis̱r"
test "قَطَر", "Qaṯar"
test "المَغرِب", "Al Maghrib"
test "الجُمهُورِيَّة العِراقِيَّة", "Al Jumhūrīyah al ‘Irāqīyah"
test "جُمهُورِيَّة العِراق", "Jumhūrīyat al ‘Irāq"
test "جُمهُورِيَّة مِصر العَرَبِيَّة", "Jumhūrīyat Mis̱r al ‘Arabīyah"
test "بَغداد", "Baghdād"
test "تُونِس", "Tūnis"
test "السُعُودِيَّة", "As Su‘ūdīyah"
test "اليَمَن", "Al Yaman"
test "السُودان", "As Sūdān"
test "الجَزائِر", "Al Jazā'ir"
test "الجُمهُورِيَّة اللُبنانِيَّة", "Al Jumhūrīyah al Lubnānīyah"
test "أسمَرة", "Asmarah"
test "جِدَّة", "Jiddah"
test "مَكَّة", "Makkah"
test "الرِيَاض", "Ar Riyāḏ"

}

stage {

# CHARACTERS
parallel {

  # Tool used for Unicode finding:
  # https://www.branah.com/unicode-converter

  # pointing
  sub "\u064e", "a" # َ fatha
  sub "\u064e", "", after: "\u0629" # َ fatha followed by ta' marboota
  sub "\u064e", "", after: "a" + any("ht") # َ fatha followed by ta' marboota, handling different order of conversion
  sub "\u0650", "i" # ِ kasra
  sub "\u064f", "u" # ُ damma
  sub "\u0652", "" # ْ sokoon, see note A below

  # special pointed letters
  sub "\u0639\u064e", "‘a" # عَ
  sub "\u0639\u0650", "‘i" # عِ
  sub "\u0639\u064f", "‘ū" # عُ
  # handle MacOS regex difference
  sub "\u0639\u064f\u0648", "‘ū" # عُو damma followed by و

  sub "\u0650\u064a", "ī" # ـِي kasra followed by ي
  sub "\u0650\u064a\u0651\u064e", "īy" # ـِيَّ
  sub "\u0650\u064a", "iy", after: any("\u064e\u064f") # ـِي kasra followed by ي
  sub "\u064f\u0648", "ū" # ـُو damma followed by و
  sub "\u064e\u0627", "ā" # ـَا fatha followed by ا
  sub "\u064e\u0649", "á" # ـَى fatha followed by ى which is ا not ي
  sub "\u064e\u0648\u0652", "aw" # ـَوْ
  sub "\u064e\u064a\u0652", "ay" # ـَيْ
  sub "\u0622", "ā" # آ

  # (A) Marks absence of the vowel.
  # (B) Marks doubling of the consonant.

  # Sun letters
  sub boundary + "\u0627\u0644\u062a" + maybe("\u0651"), "at t" # الت
  sub boundary + "\u0627\u0644\u062b" + maybe("\u0651"), "ath th" # الث
  sub boundary + "\u0627\u0644\u062f" + maybe("\u0651"), "ad d" # الد
  sub boundary + "\u0627\u0644\u0630" + maybe("\u0651"), "adh dh" # الذ
  sub boundary + "\u0627\u0644\u0631" + maybe("\u0651"), "ar r" # الر
  sub boundary + "\u0627\u0644\u0632" + maybe("\u0651"), "az z" # الز
  sub boundary + "\u0627\u0644\u0633" + maybe("\u0651"), "as s" # الس
  sub boundary + "\u0627\u0644\u0634" + maybe("\u0651"), "ash sh" # الش
  sub boundary + "\u0627\u0644\u0635" + maybe("\u0651"), "as̱ s̱" # الص
  sub boundary + "\u0627\u0644\u0636" + maybe("\u0651"), "aḏ ḏ" # الض
  sub boundary + "\u0627\u0644\u0637" + maybe("\u0651"), "aṯ ṯ" # الط
  sub boundary + "\u0627\u0644\u0638" + maybe("\u0651"), "ad͟h d͟h" # الظ
  sub boundary + "\u0627\u0644\u0644" + maybe("\u0651"), "al l" # الل
  sub boundary + "\u0627\u0644\u0646" + maybe("\u0651"), "an n" # الن

  # TODO: shorten this fragment
  # ta' marboota
  sub "\u0629", "at" # ة in the middle of the sentence
  sub "\u0629" + line_end, "ah"
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")

  # shadda

  sub "\u0628\u0651", "bb" # ب
  sub "\u062a\u0651", "tt" # ت
  sub "\u062b\u0651", "thth" # ث
  sub "\u062c\u0651", "jj" # ج
  sub "\u062d\u0651", "ẖẖ" # ح
  sub "\u062e\u0651", "khkh" # خ
  sub "\u062f\u0651", "dd" # د
  sub "\u0630\u0651", "dhdh" # ذ
  sub "\u0631\u0651", "rr" # ر
  sub "\u0632\u0651", "zz" # ز
  sub "\u0633\u0651", "ss" # س
  sub "\u0634\u0651", "sh" # ش
  sub "\u0635\u0651", "s̱s̱" # ص
  sub "\u0636\u0651", "ḏḏ" # ض
  sub "\u0637\u0651", "ṯṯ" # ط
  sub "\u0638\u0651", "d͟hd͟h" # ظ
  sub "\u063a\u0651", "ghgh" # غ
  sub "\u0641\u0651", "ff" # ف
  sub "\u0642\u0651", "qq" # ق
  sub "\u0643\u0651", "kk" # ك
  sub "\u0644\u0651", "ll" # ل
  sub "\u0645\u0651", "mm" # م
  sub "\u0646\u0651", "nn" # ن
  sub "\u0647\u0651", "hh" # ه
  sub "\u0648\u0651", "ww" # و
  sub "\u064a\u0651", "yy" # ي

  sub "\u0626", "'" # ئ

  sub "\u0621", any(["’", ""]) # ء# see note A

  sub "\u0623", "a" # أ
  sub "\u0627", "ā" # ا

  # See note B
  sub boundary + "\u0627\u0644", "al " # ال
  # '\uFE8E' : ''  # ﺎ

  sub "\u0628", "b" # ب
  sub "\uFE91", "b" # ﺑ
  sub "\uFE92", "b" # ﺒ
  sub "\uFE90", "b" # ﺐ

  # See note C
  sub "\u062a", "t" # ت
  sub "\ufe97", "t" # ﺗ
  sub "\ufe98", "t" # ﺘ
  sub "\ufe96", "t" # ﺖ

  sub "\u062b", "th" # ث
  sub "\ufe9b", "th" # ﺛ
  sub "\ufe9c", "th" # ﺜ
  sub "\ufe9a", "th" # ﺚ

  sub "\u062c", "j" # ج
  sub "\ufe9f", "j" # ﺟ
  sub "\ufea0", "j" # ﺠ
  sub "\ufe9e", "j" # ﺞ

  sub "\u062d", "ẖ" # ح
  sub "\ufea3", "ẖ" # ﺣ
  sub "\ufea4", "ẖ" # ﺤ
  sub "\ufea2", "ẖ" # ﺢ

  sub "\u062e", "kh" # خ
  sub "\ufea7", "kh" # ﺧ
  sub "\ufea8", "kh" # ﺨ
  sub "\ufea6", "kh" # ﺦ

  sub "\u062f", "d" # د
  sub "\ufeaa", "d" # ﺪ

  sub "\u0630", "dh" # ذ
  sub "\ufeac", "dh" # ﺬ

  sub "\u0631", "r" # ر
  sub "\ufeae", "r" # ﺮ

  sub "\u0632", "z" # ز
  sub "\ufeb0", "z" # ﺰ

  sub "\u0633", "s" # س
  sub "\ufeb3", "s" # ﺳ
  sub "\ufeb4", "s" # ﺴ
  sub "\ufeb2", "s" # ﺲ

  sub "\u0634", "sh" # ش
  sub "\ufeb7", "sh" # ﺷ
  sub "\ufeb8", "sh" # ﺸ
  sub "\ufeb6", "sh" # ﺶ

  sub "\u0635", "s̱" # ص
  sub "\ufebb", "s̱" # ﺻ
  sub "\ufebc", "s̱" # ﺼ
  sub "\ufeba", "s̱" # ﺺ

  sub "\u0636", "ḏ" # ض
  sub "\ufebf", "ḏ" # ﺿ
  sub "\ufec0", "ḏ" # ﻀ
  sub "\ufebe", "ḏ" # ﺾ

  sub "\u0637", "ṯ" # ط
  sub "\ufec3", "ṯ" # ﻃ
  sub "\ufec4", "ṯ" # ﻄ
  sub "\ufec2", "ṯ" # ﻂ

  sub "\u0638", "d͟h" # ظ
  sub "\ufec7", "d͟h" # ﻇ
  sub "\ufec8", "d͟h" # ﻈ
  sub "\ufec6", "d͟h" # ﻆ

  sub "\u0639", "‘" # ع
  sub "\ufecb", "‘" # ﻋ
  sub "\ufecc", "‘" # ﻌ
  sub "\ufeca", "‘" # ﻊ

  sub "\u063a", "gh" # غ
  sub "\ufecf", "gh" # ﻏ
  sub "\ufed0", "gh" # ﻐ
  sub "\ufece", "gh" # ﻎ

  sub "\u0641", "f" # ف
  sub "\ufed3", "f" # ﻓ
  sub "\ufed4", "f" # ﻔ
  sub "\ufed2", "f" # ﻒ

  sub "\u0642", "q" # ق
  sub "\ufed7", "q" # ﻗ
  sub "\ufed8", "q" # ﻘ
  sub "\ufed6", "q" # ﻖ

  sub "\u0643", "k" # ك
  sub "\ufedb", "k" # ﻛ
  sub "\ufedc", "k" # ﻜ
  sub "\ufeda", "k" # ﻚ

  sub "\u0644", "l" # ل
  sub "\ufedf", "l" # ﻟ
  sub "\ufee0", "l" # ﻠ
  sub "\ufede", "l" # ﻞ

  sub "\u0645", "m" # م
  sub "\ufee3", "m" # ﻣ
  sub "\ufee4", "m" # ﻤ
  sub "\ufee2", "m" # ﻢ

  sub "\u0646", "n" # ن
  sub "\ufee7", "n" # ﻧ
  sub "\ufee8", "n" # ﻨ
  sub "\ufee6", "n" # ﻦ

  # See note C
  sub "\u0647", "h" # ه
  sub "\ufeeb", "h" # ﻫ
  sub "\ufeec", "h" # ﻬ
  sub "\ufeea", "h" # ﻪ

  sub "\u0648", "w" # و
  sub "\ufeee", "w" # ﻮ

  sub "\u064a", "y" # ي
  sub "\ufef3", "y" # ﻳ
  sub "\ufef4", "y" # ﻴ
  sub "\ufef1", "y" # ﻱ

  # (A) Not romanized word-initially.

  # (B) Not romanized, but see romanizations accompanying alif (ا) in the table for vowels.

  # (C) In certain endings, an original tā’ (ت) is written ة, i.e., like hā’ (ه) with two dots, and is known as tā’ marbūṯah. It is romanized h, except in the construct form of feminine nouns, where it is romanized t, instead.

  # Vowels, diphthongs and diacritical marks
  # (ـ stands for any consonant)
}

# POSTRULES
sub any("\u0061".."\uFFFF"), upcase, before: boundary, not_before: boundary + any("‘’'")
# don't capitalize defined article in the middle of a sentence
sub " At T", " at T" # الت
sub " Ath Th", " ath th" # الث
sub " Ad D", " ad D" # الد
sub " Adh Dh", " adh Dh" # الذ
sub " Ar R", " ar R" # الر
sub " Az Z", " az Z" # الز
sub " As S", " as S" # الس
sub " Ash Sh", " ash Sh" # الش
sub " As̱ S̱", " as̱ S̱" # الص
sub " Aḏ Ḏ", " aḏ Ḏ" # الض
sub " Aṯ Ṯ", " aṯ Ṯ" # الط
sub " Ad͟h D͟h", " ad͟h D͟h" # الظ
sub " Al L", " al L" # الل
sub " An N", " an N" # الن
sub " Al ", " al " # ال

}