metadata {

authority_id: odni
id: 2017
language: ics-630-01:ara
source_script: Arab
destination_script: Latn
name: ICS-630-01 Romanization of Arabic Personal Names (2015)
source: ICS-630-01 Annex A
creation_date: 2017
confirmation_date: 2018-06
description: |
  This system, adapted from the Board on Geographic Names, is
  the Intelligence Community (IC) standard for the
  transliteration of Arabic names that will be applied to all
  final written reports and products for IC consumers. It is
  not intended to eliminate variations of a name that can
  contribute forensic information. Rather, it is to provide
  an IC standard Romanized (English) transliteration from
  modern standard Arabic that can then be linked to forensic
  information in ways that will help identify the referent of
  the name. Ambiguities can result from the Romanization of
  Arabic names because the Arabic source generally omits
  short vowel markings, double consonant marks, and other
  diacritics that would clearly distinguish the name.
  Linguists use their experience with the language and aids
  such as on-line tools and name dictionaries to determine
  the exact Arabic and the appropriate transliteration into
  the Roman alphabet. In cases where an individual's name has
  already been transliterated, that is to be indicated -- as
  found -- in parentheses immediately following its rendition
  in the transliteration standard (e.g., Muhammad Khulud
  (Mohamed Khulood)). In addition, if the original
  Arabic-script spelling is known, that spelling should also
  appear in parentheses following the name, if possible, following
  best practices of the issuing organization and taking into
  consideration information system capabilities. This
  convention is designed to ensure that vital forensic
  information is not lost. For names of persons who are known
  to not be part of the Arabic-speaking community, use the
  relevant IC transliteration standard for names from that
  language (e.g., Mikhail, Yitzhak). A translator’s note may
  be used to clarify the known origin of the person. Spell
  names of individuals from languages that are written in
  Roman letters as they are spelled in those languages (e.g.,
  George Clooney, Jorge Garcia, Georges Pompidou). In the
  case of active senior government officials in the on-line
  CIA World Factbook and the online directory of Chiefs of
  State and Cabinet Members of Foreign Governments, the
  spellings given in these on-line reference works should be
  used in place of the IC Standard. For any individual who
  has at one time been listed in the Factbook or Chiefs of
  State directory but who no longer appears in those
  resources (i.e., is no longer a government official), the IC
  Standard spelling should appear first, with the spelling,
  if known, as it previously appeared in those resources
  listed within parentheses at the first usage. The primary
  goal of this system is to produce a consistent Romanized
  transcription of the name that is readable to the
  non-specialist. The system uses the 26 letters of the
  standard (English) Roman alphabet plus the apostrophe. Some
  ambiguities in the Romanized form will occur without the
  use of diacritics. However, within the context of a report,
  where additional information about the individual is
  provided, the referent will be clearly identified. This
  system will be used in conjunction with on-line tools, name
  dictionaries, and lists containing conventional spellings
  of names of well-known individuals.
notes: |
  - Long/Short Vowels: Long and short vowels are not
  distinguished in this system (e.g., Samir could be Saamir
  or Samiir in Arabic).

  - Double consonants: Double consonants represented by the
  Arabic shaddah are shown in most cases (e.g., Hassan,
  Muhammad). Exceptions: ’ayn and consonants represented by
  digraphs are not doubled (e.g., al-Qadhafi [not
  al-Qadhdhafi], Mubashir [not Mubashshir]).

  - Hamzah (glottal stop): The hamzah is represented by an
  apostrophe (’). Note that this is the same symbol used to
  represent another consonant, the ’ayn.

  - Ta’ marbutah (feminine ending marker): In the construct
  form or when pronounced “t”, it is represented with a Roman
  t. In all other cases, it is represented with an h.

  - Digraphs: No distinction is made between digraphs such as
  sh and single contiguous letters (e.g., s followed by h).

  - Definite article “al” (‘the’): Follows Arabic spelling
  rather than pronunciation. That is, sun letter assimilation
  is not shown in the Romanized form (e.g., ’Abd-al-Rahman,
  not ’Abd-ar-Rahman).

  - Diphthongs: the second element of the diphthong is
    represented by a y or a w (rather than an i or a u):
    Haytham, Faysal, Tawfiq, Rawdah.

  - Hyphens: Hyphens (-) are used to connect name elements
    within a name: ’Abd-al-Rahman, Abu-al-Bashar, Bin-Ladin.
    Exceptions: Names that incorporate “Allah” as part of the
    name (e.g., ’Abdallah, Nasrallah) and names marked by the
    lineage/family marker “Al” (e.g., Al Thani) are not
    hyphenated.

  - The definite article, “al”, within name phrases, is
    Romanized as al and not as ul: Nur-al-Din (not Nur-ul-Din).
    It is not capitalized when name-initial.

  - Names that incorporate Allah as part of the name retain the
    a of Allah rather than a grammatical marker u: ’Abdallah
    (not ’Abdullah).

  - Foreign names borrowed or appearing in Arabic are spelled
    according to the standard Western tradition: Georges,
    Michel. However, names of non-Arabic origin no longer
    considered foreign by Arabic speakers follow the IC
    conventions: Butrus (not Peter).

  - Prefix ‫بن‬ (bin ‘son of’) is Romanized Bin unless written
    with an alif, in which case it is Romanized as Ibn. The
    colloquial form Bu (‘father’) should not be standardized as
    Abu. These prefixes are capitalized.

  - In general, Romanization follows the Modern Standard
    Arabic (MSA) form rather than local pronunciation
    standards. For example, the letter ‫ج‬ (jim) is represented
    as a j even when pronounced as a “g” (e.g., Egyptian Gamal
    is Romanized as Jamal).

}

tests {

test "مِصر", "Miṣr"
test "قَطَر", "Qaṭar"
test "المَغرِب", "Al Maghrib"
test "الجُمهُورِيَّة العِراقِيَّة", "Al Jumhuriyah al ’Iraqiyah"
test "جُمهُورِيَّة العِراق", "Jumhuriyat al ’Iraq"
test "جُمهُورِيَّة مِصر العَرَبِيَّة", "Jumhuriyat Miṣr al ’Arabiyah"
test "بَغداد", "Baghdad"
test "تُونِس", "Tunis"
test "حَسّان", "Hassan"
test "مُحَمَّد", "Muhammad"
test "القَذَّافِي", "Al Qadhafi"
test "مُبَشِّر", "Mubashir"
test "الجَزائِر", "Al Jaza’ir"
test "عَبدالرَحمَن", "’Abd al Rahman"
test "هَيْثَم", "Haytham"
test "فَيْصَل", "Fayṣal"
test "تَوْفِيق", "Tawfiq"
test "رَوْضَة", "Rawḍah"
test "نُورُالدِين", "Nur al Din"
test "عَبدُاللَّه", "’Abdallah"

}

stage {

# CHARACTERS
parallel {

  # Tool used for Unicode finding:
  # https://www.branah.com/unicode-converter

  # pointing
  sub "\u064e", "a" # َ fatha
  sub "\u064e", "", after: "\u0629" # َ fatha followed by ta' marboota
  sub "\u064e", "", after: "a" + any("ht") # َ fatha followed by ta' marboota, handling different order of conversion
  sub "\u0650", "i" # ِ kasra
  sub "\u064f", "u" # ُ damma
  sub "\u0652", "" # ْ sokoon (no vowel), dropped

  sub "\u0650\u064a", "i" # ـِي kasra followed by ي
  sub "\u0650\u064a\u0651\u064e", "iy" # ـِيَّ
  sub "\u0650\u064a", "iy", after: any(["\u064e", "\u064f"]) # ـِي kasra followed by ي, then fatha or damma
  sub "\u064f\u0648", "u" # ـُو damma followed by و
  sub "\u064e\u0627", "a" # ـَا fatha followed by ا
  sub "\u064e\u0649", "á" # ـَى fatha followed by ى which is ا not ي
  sub "\u064e\u0648\u0652", "aw" # ـَوْ
  sub "\u064e\u064a\u0652", "ay" # ـَيْ
  sub "\u0622", "a" # آ

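  # ta' marboota
  # Default: "at" (construct form, when another word follows)
  sub "\u0629", "at" # ة
  # "ah" at the end of the input
  sub "\u0629" + line_end, "ah"
  # "ah" when the word itself starts with the definite article ال; one rule
  # per word length (2 to 13 characters between ال and ة)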
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
  sub "\u0629", "ah", before: boundary + "\u0627\u0644" + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff") + any("\u0600".."\u06ff")
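=begin
The ta' marboota rules above appear to encode the choice described in the
notes: "ah" at the end of the input or when the word itself carries the
definite article, and "at" otherwise (construct form). The following is a
minimal Python sketch of that decision, not the Interscript implementation.
It assumes whitespace-separated words with no diacritic after the final ة,
and it ignores the 2-to-13 character window used by the rules. The function
name ta_marbutah_ending is hypothetical.

```python
TA_MARBUTAH = "\u0629"              # ة
DEFINITE_ARTICLE = "\u0627\u0644"   # ال


def ta_marbutah_ending(words, index):
    """Return "ah" or "at" for words[index], or None if it has no final ة."""
    word = words[index]
    if not word.endswith(TA_MARBUTAH):
        return None
    is_last_word = index == len(words) - 1
    # Phrase-final words and words carrying the definite article keep "ah";
    # any other word is assumed to be in the construct form and takes "at".
    if is_last_word or word.startswith(DEFINITE_ARTICLE):
        return "ah"
    return "at"


# جمهورية العراق: the first word is in the construct form.
phrase = "\u062c\u0645\u0647\u0648\u0631\u064a\u0629 \u0627\u0644\u0639\u0631\u0627\u0642".split()
print(ta_marbutah_ending(phrase, 0))
```

The sketch works on whole words, whereas the rules above work on characters
and need one rule per distance between the article and the ة.
=end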

  # shadda

  sub "\u0628\u0651", "bb" # ب
  sub "\u062a\u0651", "tt" # ت
  sub "\u062b\u0651", "th" # ث
  sub "\u062c\u0651", "jj" # ج
  sub "\u062d\u0651", "hh" # ح
  sub "\u062e\u0651", "kh" # خ
  sub "\u062f\u0651", "dd" # د
  sub "\u0630\u0651", "dh" # ذ
  sub "\u0631\u0651", "rr" # ر
  sub "\u0632\u0651", "zz" # ز
  sub "\u0633\u0651", "ss" # س
  sub "\u0634\u0651", "sh" # ش
  sub "\u0635\u0651", "ṣṣ" # ص
  sub "\u0636\u0651", "ḍḍ" # ض
  sub "\u0637\u0651", "ṭṭ" # ط
  sub "\u0638\u0651", "ẓẓ" # ظ
  sub "\u063a\u0651", "gh" # غ
  sub "\u0641\u0651", "ff" # ف
  sub "\u0642\u0651", "qq" # ق
  sub "\u0643\u0651", "kk" # ك
  sub "\u0644\u0651", "ll" # ل
  sub "\u0645\u0651", "mm" # م
  sub "\u0646\u0651", "nn" # ن
  sub "\u0647\u0651", "hh" # ه
  sub "\u0648\u0651", "ww" # و
  sub "\u064a\u0651", "yy" # ي
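=begin
The shadda table above follows the doubling rule from the notes: a consonant
under a shaddah is doubled unless its Romanization is a digraph, and the
’ayn is never doubled. A minimal Python sketch of that rule follows. It is an
illustration only: the five-letter table and the names romanize_consonant and
romanize_consonants are assumptions, not part of this map or of Interscript.

```python
SHADDAH = "\u0651"
AYN_APOSTROPHE = "\u2019"

# Toy subset of the consonant table, for illustration only.
CONSONANTS = {
    "\u0633": "s",             # sin
    "\u0645": "m",             # mim
    "\u0630": "dh",            # dhal (digraph)
    "\u0634": "sh",            # shin (digraph)
    "\u0639": AYN_APOSTROPHE,  # 'ayn
}


def romanize_consonant(letter, doubled):
    """Romanize one consonant, doubling it only where the notes allow."""
    roman = CONSONANTS[letter]
    # Digraphs and the 'ayn stay single even under a shaddah.
    if doubled and len(roman) == 1 and roman != AYN_APOSTROPHE:
        return roman * 2
    return roman


def romanize_consonants(text):
    """Romanize known consonants, reading a following shaddah as doubling."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in CONSONANTS:
            doubled = i + 1 < len(text) and text[i + 1] == SHADDAH
            out.append(romanize_consonant(ch, doubled))
            i += 2 if doubled else 1
        else:
            i += 1  # skip anything outside the toy table
    return "".join(out)
```

The sketch expects the shaddah directly after its consonant, as the rules
above do. Input that places a vowel mark between the two would need
reordering first.
=end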

  sub "\u0626", "’" # ئ

  sub boundary + "\u0627\u0644\u0644\u0651\u064e\u0647", "Allah"

  sub non_word_boundary + maybe("\u064f") + "\u0627\u0644\u0644\u0651\u064e\u0647", "allah"

  sub "\u0621", any(["’", ""]) # ء

  sub boundary + "\u0627\u0644", "al " # ال
  sub non_word_boundary + maybe("\u064f") + "\u0627\u0644", " al " # ال in middle of composite name

  # '\uFE8E' : ''  # ﺎ

  sub "\u0623", "" # أ
  sub boundary + "\u0627", "" # ا
  sub "\u0627", "a" # ا
  sub "\u0628", "b" # ب
  sub "\u062a", "t" # ت
  sub "\u062b", "th" # ث
  sub "\u062c", "j" # ج
  sub "\u062d", "h" # ح
  sub "\u062e", "kh" # خ
  sub "\u062f", "d" # د
  sub "\u0630", "dh" # ذ
  sub "\u0631", "r" # ر
  sub "\u0632", "z" # ز
  sub "\u0633", "s" # س
  sub "\u0634", "sh" # ش
  sub "\u0635", "ṣ" # ص
  sub "\u0636", "ḍ" # ض
  sub "\u0637", "ṭ" # ط
  sub "\u0638", "ẓ" # ظ
  sub "\u0639", "’" # ع
  sub "\u063a", "gh" # غ
  sub "\u0641", "f" # ف
  sub "\u0642", "q" # ق
  sub "\u0643", "k" # ك
  sub "\u0644", "l" # ل
  sub "\u0645", "m" # م
  sub "\u0646", "n" # ن
  sub "\u0647", "h" # ه
  sub "\u0648", "w" # و
  sub "\u064a", "y" # ي
}

# POSTRULES
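# Capitalize the first letter of each word, but not a letter that follows a
# word-internal apostrophe (e.g., the i in Jaza’ir)
sub any("\u0061".."\uFFFF"), upcase, before: boundary, not_before: boundary + any("‘’'")

# Keep the definite article lowercase in the middle of a name
sub " Al ", " al " # ال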
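=begin
The two post-rules can be approximated with Python's re module to check the
expected casing of a Romanized string by hand. This is a sketch under the
assumption that "before:" acts as a lookbehind and that Python's \b behaves
like the boundary used here. The name capitalize_words is hypothetical.

```python
import re

# A character at the start of a word, unless it directly follows an
# apostrophe that itself sits on a word boundary (a word-internal apostrophe).
_WORD_INITIAL = re.compile(r"(?<!\b[\u2018\u2019'])\b[\u0061-\uffff]")


def capitalize_words(text):
    """Uppercase word-initial letters, then lowercase a mid-phrase article."""
    capitalized = _WORD_INITIAL.sub(lambda m: m.group(0).upper(), text)
    return capitalized.replace(" Al ", " al ")


print(capitalize_words("jumhuriyat al \u2019iraq"))
```

A leading apostrophe (’ayn or hamzah at the start of a word) does not block
capitalization of the next letter, because no word boundary precedes it.
=end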

}