Draft:Arabic-script-issues

From ASIWG Wiki

Revision as of 05:54, 26 July 2008 by Shahshah (Talk | contribs)

This is a draft document and is not suitable for use or citation.

Note: The following is a report of the informal brainstorming session that took place in Tehran on July 20-21. It deals with some Level 2 and Level 3 issues. IRNIC staff and Sarmad Hussain were present. Please comment.




Kinds of Homographs

The conditions below apply if the characters are confusable in even one of their shapes (for example, in only one positional form).

Shape Problems

Characters with exactly the same shape

  1. Letters
    • e.g., U+0643 ARABIC LETTER KAF and U+06A9 ARABIC LETTER KEHEH
  2. Numbers
    • e.g., U+0660 ARABIC-INDIC DIGIT ZERO and U+06F0 EXTENDED ARABIC-INDIC DIGIT ZERO
  3. Combining Marks
    • No example is known at present.

Characters not of exactly the same shape

Characters which may be confusable even though they do not have exactly the same shape

  • e.g., U+06A9 ARABIC LETTER KEHEH and U+06AA ARABIC LETTER SWASH KAF
  • e.g., U+06D0 ARABIC LETTER E and U+064A ARABIC LETTER YEH

Composition Problems

NFKC Problem

Certain composite forms that are encoded as a single character in the permitted portion of the Unicode table are not unified with their visually identical combining sequences by NFKC.

  • e.g., U+06CE ARABIC LETTER YEH WITH SMALL V and [U+06CC ARABIC LETTER FARSI YEH + U+065A ARABIC VOWEL SIGN SMALL V ABOVE] are visually the same.
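The gap can be checked with Python's standard `unicodedata` module. The sketch below is only an illustration: the helper name `unified_by_nfkc` is made up for this example, and the ALEF WITH MADDA ABOVE pair is added purely as a contrast case that NFKC does unify.

```python
import unicodedata

def unified_by_nfkc(a: str, b: str) -> bool:
    """Return True if NFKC maps both strings to the same code point sequence."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

# The example from the text: a single character versus a base + mark sequence.
yeh_small_v = "\u06CE"             # ARABIC LETTER YEH WITH SMALL V
farsi_yeh_plus_v = "\u06CC\u065A"  # FARSI YEH + VOWEL SIGN SMALL V ABOVE

# Contrast case (not from the text): this pair has a canonical decomposition.
alef_madda = "\u0622"              # ARABIC LETTER ALEF WITH MADDA ABOVE
alef_plus_madda = "\u0627\u0653"   # ALEF + MADDAH ABOVE

print(unified_by_nfkc(yeh_small_v, farsi_yeh_plus_v))  # False: NFKC keeps them apart
print(unified_by_nfkc(alef_madda, alef_plus_madda))    # True: NFKC composes the pair
```

Because U+06CE has no decomposition mapping, a registry that relies on NFKC alone would treat the two spellings as different labels.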

Similarity Problem

Composed sequences are visually confusable with at least one character in the permitted portion of the Unicode table

  • e.g., U+0628 ARABIC LETTER BEH and [U+066E ARABIC LETTER DOTLESS BEH + U+065C ARABIC VOWEL SIGN DOT BELOW]
    • This example applies only if both U+066E and U+065C are permitted
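The dependence on the permitted repertoire can be made explicit in code. The following is a minimal sketch, assuming a hand-maintained table of sequence-to-character confusables; the table `SEQUENCE_CONFUSABLES` and the function `active_confusables` are hypothetical names introduced here, not part of any existing tool.

```python
from __future__ import annotations

# Hypothetical table: combining sequence -> single character it resembles.
SEQUENCE_CONFUSABLES = {
    "\u066E\u065C": "\u0628",  # DOTLESS BEH + VOWEL SIGN DOT BELOW looks like BEH
}

def active_confusables(permitted: set[str]) -> dict[str, str]:
    """Keep only the entries whose code points are all in the permitted set."""
    return {
        sequence: single
        for sequence, single in SEQUENCE_CONFUSABLES.items()
        if single in permitted and all(ch in permitted for ch in sequence)
    }

# With all three code points permitted, the confusable pair is relevant.
everything = {"\u0628", "\u066E", "\u065C"}
# Without U+065C the sequence cannot be typed, so the problem disappears.
no_dot_below = {"\u0628", "\u066E"}

print(len(active_confusables(everything)))    # 1
print(len(active_confusables(no_dot_below)))  # 0
```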

Possible Normalization Methods

  1. Pre-determine one code for each set of homographs and map all occurrences to this code.
  2. Look at the code for the first occurrence of a possible homograph and limit all following occurrences of the same character to that code.
  3. Give the user the choice to select the language table(s) for a label and restrict the label to those language table(s). This can be done in two ways:
    • Get the label first, detect applicable language tables, and ask the user to select
    • First ask the user for the desired language table and then limit the label to that table
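The three methods above can be sketched as follows. Everything in this block is illustrative: the homograph sets and language tables are toy data, the choice of which code point is preferred in each set is an assumption, and method 2 is interpreted here as remapping later occurrences (rejecting the label instead would be an equally valid reading).

```python
# Toy homograph sets. Listing the preferred code point first is an assumption
# made for this sketch; the draft does not say which member should win.
HOMOGRAPH_SETS = [
    ("\u06A9", "\u0643"),  # KEHEH, KAF
    ("\u06F0", "\u0660"),  # EXTENDED ARABIC-INDIC DIGIT ZERO, ARABIC-INDIC DIGIT ZERO
]
CANONICAL = {ch: group[0] for group in HOMOGRAPH_SETS for ch in group}
SET_INDEX = {ch: i for i, group in enumerate(HOMOGRAPH_SETS) for ch in group}

# Toy language tables (hypothetical repertoires, far smaller than real ones).
LANGUAGE_TABLES = {
    "ar": {"\u0627", "\u0628", "\u0643", "\u064A"},
    "fa": {"\u0627", "\u0628", "\u06A9", "\u06CC"},
}

def normalize_fixed(label):
    """Method 1: map every homograph to its set's predetermined code point."""
    return "".join(CANONICAL.get(ch, ch) for ch in label)

def normalize_first_occurrence(label):
    """Method 2: the first homograph seen in a set fixes the code for the rest."""
    chosen = {}
    out = []
    for ch in label:
        index = SET_INDEX.get(ch)
        out.append(ch if index is None else chosen.setdefault(index, ch))
    return "".join(out)

def detect_tables(label):
    """Method 3, first way: list the language tables that cover the whole label."""
    return sorted(name for name, table in LANGUAGE_TABLES.items() if set(label) <= table)

def fits_table(label, name):
    """Method 3, second way: check a label against the table the user chose."""
    return set(label) <= LANGUAGE_TABLES[name]

mixed = "\u0643\u0627\u06A9"  # KAF, ALEF, KEHEH
print(normalize_fixed(mixed) == "\u06A9\u0627\u06A9")             # True
print(normalize_first_occurrence(mixed) == "\u0643\u0627\u0643")  # True
print(detect_tables("\u06A9\u0627\u0628"))                        # ['fa']
```

Method 1 gives the same result for every registrant, while method 2 preserves whichever code point the user typed first; method 3 avoids remapping altogether by rejecting labels that mix repertoires.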

Dealing with Homographs

Bundles

Give the registrant all the possible variants.
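A bundle can be enumerated as the Cartesian product of the variant choices at each position. This is a minimal sketch using only two of the homograph groups; the names `GROUPS`, `VARIANTS`, and `bundle` are hypothetical.

```python
from itertools import product

# Two illustrative homograph groups (KAF/KEHEH and YEH/FARSI YEH).
GROUPS = [
    ("\u0643", "\u06A9"),
    ("\u064A", "\u06CC"),
]
VARIANTS = {ch: group for group in GROUPS for ch in group}

def bundle(label):
    """Return every label obtained by swapping homographs position by position."""
    choices = [VARIANTS.get(ch, (ch,)) for ch in label]
    return {"".join(combo) for combo in product(*choices)}

label = "\u0643\u0627\u064A"  # KAF, ALEF, YEH
variants = bundle(label)
print(len(variants))  # 4: two choices for KAF times two choices for YEH
```

The bundle size is the product of the group sizes at each position, so it grows quickly for long labels with many confusable letters.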

User Decision

Normalize the registered label, then do one of the following (others?) with the normalized version:

  1. Register only the normalized version;
  2. Register the original version and keep normalized version in a table to check all other registrations against it. Registering two domains with the same normalized form is prohibited.
  3. Register the original version and keep normalized version in a table to check all other registrations against it. Variants can be registered only by the first registrant.
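Options 2 and 3 differ only in whether the first registrant may claim further variants. The sketch below assumes a toy normalization that maps KAF to KEHEH; the class `Registry`, its flag `allow_owner_variants`, and the registrant names are all hypothetical.

```python
def toy_normalize(label):
    """Stand-in normalization for this sketch: map KAF (U+0643) to KEHEH (U+06A9)."""
    return label.replace("\u0643", "\u06A9")

class Registry:
    def __init__(self, normalize, allow_owner_variants):
        self.normalize = normalize
        # False models option 2, True models option 3.
        self.allow_owner_variants = allow_owner_variants
        self.owner_of = {}  # normalized form -> first registrant
        self.labels = {}    # registered original label -> registrant

    def register(self, label, registrant):
        """Try to register a label; return True on success, False if blocked."""
        if label in self.labels:
            return False
        key = self.normalize(label)
        owner = self.owner_of.get(key)
        if owner is None:
            self.owner_of[key] = registrant
        elif not (self.allow_owner_variants and owner == registrant):
            return False
        self.labels[label] = registrant
        return True

kaf_label, keheh_label = "\u0643\u0627", "\u06A9\u0627"

option2 = Registry(toy_normalize, allow_owner_variants=False)
print(option2.register(kaf_label, "alice"))    # True
print(option2.register(keheh_label, "alice"))  # False: same normalized form

option3 = Registry(toy_normalize, allow_owner_variants=True)
print(option3.register(kaf_label, "alice"))    # True
print(option3.register(keheh_label, "bob"))    # False: not the first registrant
print(option3.register(keheh_label, "alice"))  # True: variant goes to the owner
```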

Warn the Registrant

Just give a warning during the registration process. Do nothing special.

Forget it

Do nothing special.

Tables

List of Homographs

Letter Groups

This list is not complete yet.

ALEF MAKSURA Group
  • U+0649 ARABIC LETTER ALEF MAKSURA
  • U+06CC ARABIC LETTER FARSI YEH
YEH Group
  • U+064A ARABIC LETTER YEH
  • U+06CC ARABIC LETTER FARSI YEH
KAF Group
  • U+0643 ARABIC LETTER KAF
  • U+06A9 ARABIC LETTER KEHEH
HEH DOACHASHMEE Group
  • U+0647 ARABIC LETTER HEH
  • U+06BE ARABIC LETTER HEH DOACHASHMEE
HEH GOAL Group
  • U+0647 ARABIC LETTER HEH
  • U+06C1 ARABIC LETTER HEH GOAL
  • U+06D5 ARABIC LETTER AE
HEH WITH YEH ABOVE Group
  • U+06C0 ARABIC LETTER HEH WITH YEH ABOVE
  • U+06C2 ARABIC LETTER HEH GOAL WITH HAMZA ABOVE
FEH Group
  • U+0641 ARABIC LETTER FEH
  • U+06A7 ARABIC LETTER QAF WITH DOT ABOVE
VEH Group
  • U+06A4 ARABIC LETTER VEH
  • U+06A8 ARABIC LETTER QAF WITH THREE DOTS ABOVE
DOTLESS FEH Group
  • U+06A1 ARABIC LETTER DOTLESS FEH
  • U+066F ARABIC LETTER DOTLESS QAF
    • This group exists only if U+066F is permitted
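Some code points appear in more than one group (U+06CC and U+0647), so a mapping to a single code per set needs the groups merged first. The following sketch encodes the groups above as data and merges overlapping ones with a small union-find. Treating confusability as transitive is an assumption of this sketch, not something the draft states.

```python
# The letter groups listed above, keyed by group name.
GROUPS = {
    "ALEF MAKSURA": [0x0649, 0x06CC],
    "YEH": [0x064A, 0x06CC],
    "KAF": [0x0643, 0x06A9],
    "HEH DOACHASHMEE": [0x0647, 0x06BE],
    "HEH GOAL": [0x0647, 0x06C1, 0x06D5],
    "HEH WITH YEH ABOVE": [0x06C0, 0x06C2],
    "FEH": [0x0641, 0x06A7],
    "VEH": [0x06A4, 0x06A8],
    "DOTLESS FEH": [0x06A1, 0x066F],
}

def merged_classes(groups):
    """Merge groups that share a code point; return sorted lists of code points."""
    parent = {}

    def find(cp):
        parent.setdefault(cp, cp)
        while parent[cp] != cp:
            parent[cp] = parent[parent[cp]]  # path halving
            cp = parent[cp]
        return cp

    for members in groups.values():
        root = find(members[0])
        for cp in members[1:]:
            parent[find(cp)] = root

    classes = {}
    for cp in parent:
        classes.setdefault(find(cp), set()).add(cp)
    return sorted(sorted(members) for members in classes.values())

for members in merged_classes(GROUPS):
    print(" ".join(f"U+{cp:04X}" for cp in members))
```

Under this assumption the nine groups collapse into seven classes, with the ALEF MAKSURA and YEH groups joined through U+06CC and the two HEH groups joined through U+0647.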

Comments