Bidirectional Support
Bidirectional support in IBM SDK: A user guide
Introduction
Arabic shaping options
The JAVABIDI system property
Known limitations
Introduction
Within Java, character data are manipulated as Unicode UTF-16 values. However, character data outside Java frequently conform to different encodings. For this reason, file input/output operations, along with conversion of bytes to characters and vice-versa, also involve conversion from an external encoding to UTF-16 and back. The external encoding may be explicitly specified (e.g. in the constructor of an InputStreamReader or OutputStreamWriter), or fall back to a default.
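As a minimal sketch of the conversion described above, the following example encodes a UTF-16 string to an external encoding through an OutputStreamWriter and decodes it back through an InputStreamReader, naming the encoding explicitly in both constructors. The class name and the choice of UTF-8 as the external encoding are assumptions made for illustration; any encoding supported by the JVM could be substituted.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingRoundTrip {

    /** Converts UTF-16 character data to bytes in the given external encoding. */
    static byte[] encode(String text, Charset external) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, external)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Converts bytes in the given external encoding back to UTF-16 character data. */
    static String decode(byte[] data, Charset external) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), external)) {
            char[] chunk = new char[256];
            int count;
            while ((count = reader.read(chunk)) != -1) {
                text.append(chunk, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Latin letters followed by three Hebrew letters (Alef, Bet, Gimel).
        String text = "abc \u05D0\u05D1\u05D2";
        byte[] external = encode(text, StandardCharsets.UTF_8);
        System.out.println("Round trip preserved the text: "
                + decode(external, StandardCharsets.UTF_8).equals(text));
    }
}
```

Omitting the Charset argument in either constructor would make the conversion fall back to the platform default encoding.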
Through its implementation of Unicode, Java supports many languages with various alphabets or scripts, among them Arabic and Hebrew, whose scripts are written from right to left. Because Arabic and Hebrew text is frequently mixed with text in other languages and with numbers, both of which are written from left to right, applications need to handle bidirectional (or Bidi) data.
Bidi data raises the level of diversity, as compared to non-Bidi data, because it may be stored not only in various encodings, but also in various layouts, each layout being a combination of rules relative to ordering of the characters (Arabic and Hebrew) and shaping of Arabic letters (choosing the appropriate shape of an Arabic letter among several possible).
For the same reasons that Java translates data from external encodings into the encoding used internally and vice-versa, it should transform Bidi data from external layouts to the layout used within Java, and vice-versa. For example, legacy applications store data in visual layout while Java APIs assume an implicit layout (also known as logical layout).
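To make the implicit (logical) layout more concrete, the sketch below uses the standard java.text.Bidi class to split a logically ordered string into directional runs. This is the general-purpose JDK class, not the Bidirectional Layout Engine API mentioned at the end of this guide, and it only analyzes the text; it does not perform the layout transformations that JAVABIDI enables. The class and method names are chosen for illustration.

```java
import java.text.Bidi;

public class LogicalRuns {

    /**
     * Lists the directional runs of a string stored in logical order.
     * Even levels are left-to-right runs, odd levels are right-to-left runs.
     */
    static String describeRuns(String logical) {
        Bidi bidi = new Bidi(logical, Bidi.DIRECTION_LEFT_TO_RIGHT);
        StringBuilder description = new StringBuilder();
        for (int run = 0; run < bidi.getRunCount(); run++) {
            description.append('[')
                    .append(bidi.getRunStart(run))
                    .append(',')
                    .append(bidi.getRunLimit(run))
                    .append(") level ")
                    .append(bidi.getRunLevel(run))
                    .append('\n');
        }
        return description.toString();
    }

    public static void main(String[] args) {
        // "abc " is typed first, then three Hebrew letters; this is the logical order.
        String logical = "abc \u05D0\u05D1\u05D2";
        System.out.print(describeRuns(logical));
    }
}
```

A visual layout would store the characters of the right-to-left run in the order in which they appear on the display, which is why data coming from legacy applications may need to be transformed before Java APIs can process it as expected.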
The Java SDK allows users to request that layout transformations be performed for Bidi data whenever encoding conversions are performed. This feature has been available since SDK version 1.4.1. To maintain compatibility with previous releases, these transformations are disabled by default. To enable them, users must assign an appropriate value to the system property JAVABIDI.
Arabic shaping options
Some Arabic characters need special handling during conversion between different code pages. Because they are not represented in all code pages, a normal conversion would result in substitute control characters (SUB), that is, a loss of data.
The characters with different representation across code pages are:
- Lam-Alef
- This is represented as a single character in code pages 420, 864 and 1046 used for visual presentation in addition to the Unicode Arabic Presentation Forms-B (uFExx range). It is represented as two characters Lam and Alef in code pages 425, 1089 and 1256 used for implicit representation in addition to the Unicode Arabic u06xx range.
- Tail of Seen family of characters
- The visual code pages 420, 864 and 1046 represent the final form of the Seen family of characters as 2 adjacent characters: the three quarters shape and the Tail. The implicit code pages 425, 1089, 1256 and the Unicode Arabic u06xx range do not represent the Tail character. In Unicode Arabic Presentation Forms-B (uFExx range), the final form for characters in the Seen family is represented as one character.
- Yeh-Hamza final form
- Code pages 420 and 864 have no unique character for the Yeh-Hamza final form; it is represented as 2 characters: Yeh final form and Hamza. In other code pages (such as 425, 1046, 1089, 1256 and Unicode), the Yeh-Hamza final form is represented as one character or two characters depending on the user's input, that is, whether it was entered with one keystroke (Yeh-Hamza key) or two keystrokes (Yeh key + Hamza key). A normal conversion from these code pages to 420 or 864 would convert the Yeh-Hamza final form character to the Yeh-Hamza initial form; special handling is needed to convert it to the Yeh final form and Hamza.
- Tashkeel or diacritic characters except for Shadda
- These characters are not represented in code pages 420 and 864. Conversion of Tashkeel from code pages 425, 1046, 1089, 1256 and Unicode to 420 or 864 results in SUB.
To avoid the loss of such characters during conversion, a set of Arabic shaping options is provided to handle them properly.
Arabic shaping options available in this release
For each character in the previous list, there is a set of available shaping options, described below:
- For Lam-Alef:
- Near
- When converting from visual to implicit code pages, each Lam-Alef character is expanded to Lam plus Alef, consuming the blank space next to it. If no blank space is available, the Lam-Alef character remains as is in the Unicode uFExx range; it will become a substitute control character (SUB) when converted to implicit single-byte code pages. When converting from implicit to visual code pages, the space resulting from Lam-Alef compression is positioned next to each generated Lam-Alef character.
- At Begin
- When converting from visual to implicit code pages, each Lam-Alef character is expanded to Lam plus Alef, consuming a blank space at the absolute beginning of the buffer*. If no blank space is available, the Lam-Alef character remains as is in the Unicode uFExx range; it will become a substitute control character (SUB) when converted to implicit single byte code pages. When converting from implicit to visual code pages, the space resulting from Lam-Alef compression is positioned at the absolute beginning of the buffer.
- At End
- When converting from visual to implicit code pages, each Lam-Alef character is expanded to Lam plus Alef, consuming a blank space at the absolute end of the buffer**. If no blank space is available, the Lam-Alef character remains as is in the Unicode uFExx range; it will become a substitute control character (SUB) when converted to implicit single-byte code pages. When converting from implicit to visual code pages, the space resulting from Lam-Alef compression is positioned at the absolute end of the buffer.
- Auto
- When converting from visual to implicit code pages, each Lam-Alef character is expanded to Lam plus Alef, consuming a blank space at the beginning of the buffer with respect to the orientation, i.e. buffer[0] in case of left to right and buffer[length - 1] in case of right to left. If no blank space is available, the Lam-Alef character remains as is in the Unicode uFExx range; it will become a substitute control character (SUB) when converted to implicit single-byte code pages. When converting from implicit to visual code pages, the space resulting from Lam-Alef compression is positioned at the beginning of the buffer with respect to the orientation.
- For Seen Tail:
- Near
- When converting from visual to implicit, the two-character final form of the Seen family (the three quarters shape followed by the Tail character) is converted to the one-character final form; the Tail is replaced by a space, which is positioned next to the Seen final form. When converting from implicit to visual, each one-character final form of the Seen family is converted to the two-character final form, consuming the space next to the Seen character. If no space is available, it is converted to a single character, the three quarters shape Seen.
- Auto
- For Yeh-Hamza:
- Near
- Conversion from visual to implicit converts each Yeh character followed by a Hamza character to a Yeh-Hamza character; the space resulting from the contraction process is positioned next to the generated Yeh-Hamza character. In conversion from implicit to visual, each Yeh-Hamza character is expanded to two characters (Yeh and Hamza), consuming the space located next to the original Yeh-Hamza character. If no space is available, it is converted to one character, which is Yeh.
- Auto
- For Tashkeel:
- Keep
- No special processing is done.
- Customized At Begin
- All Tashkeel characters except for Shadda are replaced by spaces. The resulting spaces are moved to the absolute beginning of the buffer*.
- Customized At End
- All Tashkeel characters except for Shadda are replaced by spaces. The resulting spaces are moved to the absolute end of the buffer**.
- Auto
Note:
- For all Arabic shaping options, the behavior of the Auto value will be enhanced in future releases to provide optimized support in more situations.
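The space bookkeeping behind the Near option for Lam-Alef can be sketched as follows. This is an illustrative approximation only, not the SDK's implementation: it handles just the plain Lam-Alef ligature (U+FEFB isolated form and U+FEFC final form), assumes that "next to" means the buffer position immediately after the ligature, always produces the isolated form when compressing, and ignores the reordering that a real visual/implicit conversion also performs. All names are hypothetical.

```java
public class LamAlefNear {
    static final char LAM = '\u0644';
    static final char ALEF = '\u0627';
    static final char LAM_ALEF_ISOLATED = '\uFEFB';
    static final char LAM_ALEF_FINAL = '\uFEFC';

    /**
     * Expands each Lam-Alef ligature into Lam plus Alef when the following
     * position holds a blank; otherwise the ligature is left untouched.
     */
    static char[] expandNear(char[] buffer) {
        char[] out = buffer.clone();
        for (int i = 0; i + 1 < out.length; i++) {
            boolean ligature = out[i] == LAM_ALEF_ISOLATED || out[i] == LAM_ALEF_FINAL;
            if (ligature && out[i + 1] == ' ') {
                out[i] = LAM;
                out[i + 1] = ALEF;
                i++; // skip the Alef that was just written
            }
        }
        return out;
    }

    /**
     * Compresses each Lam followed by Alef into one ligature and leaves the
     * freed position as a blank right after it, so the length is unchanged.
     */
    static char[] compressNear(char[] buffer) {
        char[] out = buffer.clone();
        for (int i = 0; i + 1 < out.length; i++) {
            if (out[i] == LAM && out[i + 1] == ALEF) {
                out[i] = LAM_ALEF_ISOLATED;
                out[i + 1] = ' ';
                i++; // skip the blank that was just written
            }
        }
        return out;
    }

    public static void main(String[] args) {
        char[] visual = {LAM_ALEF_ISOLATED, ' ', 'x'};
        char[] expanded = expandNear(visual);
        System.out.println("Length unchanged: " + (expanded.length == visual.length));
    }
}
```

Because the buffer length never changes, a ligature with no adjacent blank cannot be expanded, which corresponds to the case where the character stays in the presentation-forms range.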
Arabic shaping options to be implemented in future releases
The following Arabic shaping options are planned to be implemented or enhanced in future releases of the JDK.
- For Lam-Alef:
- Resize Buffer
- When converting from visual to implicit code pages, each Lam-Alef character is expanded to Lam plus Alef, and the buffer is enlarged to make room for the newly added Alef characters. When converting from implicit to visual code pages, every sequence of Lam followed by Alef is contracted to a Lam-Alef character, and the buffer is then reduced to eliminate the spaces resulting from the contraction process.
- For Seen Tail:
- At Begin
- Conversion from visual to implicit converts the final form of the Seen family which is represented by two characters (the three quarters shape and the Tail character) to the Seen family of characters final form represented by one character and replaces the Tail by a space. The spaces resulting from this process are moved to the absolute beginning of the buffer*. In conversion from implicit to visual, each Seen family of characters final form represented by one character is converted to the final form of the Seen family which is represented by two characters, consuming spaces at the absolute beginning of the buffer.
- At End
- Conversion from visual to implicit converts the final form of the Seen family which is represented by two characters (the three quarters shape and the Tail character) to the Seen family of characters final form represented by one character and replaces the Tail by a space. The spaces resulting from this process are moved to the absolute end of the buffer**. In conversion from implicit to visual, each Seen family of characters final form represented by one character is converted to the final form of the Seen family which is represented by two characters, consuming the spaces at the absolute end of the buffer.
- For Yeh-Hamza:
- One Cell
- In conversion from visual to implicit, each Yeh character followed by a Hamza character is contracted to a Yeh-Hamza character (one character); the resulting space is positioned next to the generated character. In conversion from implicit to visual, each Yeh-Hamza character is expanded to two characters (Yeh and Hamza), consuming the space located next to the original Yeh-Hamza character. The case where no space is available is not currently supported.
- Near
- To improve its behavior, this option will be modified in the next release so that, in conversion from visual to implicit, each Yeh character followed by a Hamza character remains as is (Yeh followed by Hamza). In conversion from implicit to visual, each Yeh-Hamza character is expanded to two characters (Yeh and Hamza), consuming the space located next to the original Yeh-Hamza character.
- At Begin
- In conversion from visual to implicit, each Yeh character followed by a Hamza character is contracted to a Yeh-Hamza character (one character); the resulting space is positioned at the absolute beginning of the buffer*. In conversion from implicit to visual, each Yeh-Hamza character is expanded to two characters (Yeh and Hamza), consuming the space located at the absolute beginning of the buffer.
- At End
- In conversion from visual to implicit, each Yeh character followed by a Hamza character is contracted to a Yeh-Hamza character (one character); the resulting space is positioned at the absolute end of the buffer**. In conversion from implicit to visual, each Yeh-Hamza character is expanded to two characters (Yeh and Hamza), consuming the space located at the absolute end of the buffer.
- For Tashkeel:
- Customized with Zero width
- All Tashkeel characters are converted to their correspondents as non-spacing (zero-width) characters.
- Customized with width
- All Tashkeel characters are converted to their correspondents as spacing characters. This option is not available in case of visual to implicit conversion because Tashkeel characters in the Arabic u06xx range are only represented using non-spacing (zero-width) characters.
* The absolute beginning of the buffer is buffer[0].
** The absolute end of the buffer is buffer[bufferlength - 1].
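The "absolute beginning" and "absolute end" conventions can be illustrated with the Tashkeel options Customized At Begin and Customized At End. The sketch below is not the SDK's code; it assumes that the Tashkeel characters concerned are the Unicode code points U+064B through U+0652, with Shadda (U+0651) excluded, and all names are hypothetical.

```java
import java.util.Arrays;

public class TashkeelSpaces {
    static final char SHADDA = '\u0651';

    /** Assumption: Tashkeel covers U+064B..U+0652; Shadda is always kept. */
    static boolean isRemovableTashkeel(char c) {
        return c >= '\u064B' && c <= '\u0652' && c != SHADDA;
    }

    /** Replaces Tashkeel by blanks and gathers the blanks at buffer[length - 1] and below. */
    static char[] customizedAtEnd(char[] buffer) {
        char[] out = new char[buffer.length];
        int next = 0;
        for (char c : buffer) {
            if (!isRemovableTashkeel(c)) {
                out[next++] = c;
            }
        }
        Arrays.fill(out, next, out.length, ' ');
        return out;
    }

    /** Replaces Tashkeel by blanks and gathers the blanks at buffer[0] and above. */
    static char[] customizedAtBegin(char[] buffer) {
        char[] out = new char[buffer.length];
        int next = buffer.length;
        for (int i = buffer.length - 1; i >= 0; i--) {
            if (!isRemovableTashkeel(buffer[i])) {
                out[--next] = buffer[i];
            }
        }
        Arrays.fill(out, 0, next, ' ');
        return out;
    }

    public static void main(String[] args) {
        // Beh, Fatha, Shadda, Beh: only the Fatha is replaced.
        char[] text = {'\u0628', '\u064E', '\u0651', '\u0628'};
        System.out.println("Blank at end: " + (customizedAtEnd(text)[3] == ' '));
        System.out.println("Blank at begin: " + (customizedAtBegin(text)[0] == ' '));
    }
}
```

In both variants the buffer keeps its original length, so fixed-width record layouts are not disturbed by the removal.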
The JAVABIDI system property
The JAVABIDI system property may be specified by adding -DJAVABIDI=xxxx to the command that launches Java, where "xxxx" represents parameters for the Bidi layout transformations.
JAVABIDI may be set to "NO", which is the default; in that case no Bidi layout transformations are performed, which is compatible with the behavior of previous releases.
When JAVABIDI is not set to "NO", its value may contain 1 to 3 parts, separated by commas without intervening spaces. Each part starts with a letter identifier followed by a value within parentheses.
The letter identifiers are:
S for the SBCS part which describes the Bidi attributes of the SBCS data consumed or produced by the conversions. Note: SBCS stands for "Single Byte Character Set" and designates the data as stored outside Java.
U for the Unicode part which describes the Bidi attributes of the Unicode data consumed or produced by the conversions.
C for the CodePage part which specifies one or more encodings: if this part is specified, only data with encodings listed in this part will be submitted to the Bidi layout transformation. If this part is omitted, the layout transformations will be performed for all encodings except Cp850.
Note: Applications should not try to modify the value of the JAVABIDI property after the initialization of the Java Virtual Machine. For performance reasons, JVM implementations may choose to check the value of JAVABIDI only at start-up time, so that any change applied later will have no effect.
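An application can inspect the property, without modifying it, through the standard System.getProperty call. The sketch below treats a missing property as "NO", matching the documented default; the class name and the launch command in the comment (including the application name MyApp) are assumptions for illustration.

```java
public class JavaBidiSetting {

    /** Returns the JAVABIDI value given at launch, or "NO" when it was not set. */
    static String currentValue() {
        return System.getProperty("JAVABIDI", "NO");
    }

    /** Layout transformations are requested for any value other than "NO". */
    static boolean isEnabled(String value) {
        return !"NO".equals(value);
    }

    public static void main(String[] args) {
        // Possible launch command (quoting may be needed in some shells):
        //   java "-DJAVABIDI=C(Cp420)" MyApp
        String value = currentValue();
        System.out.println("JAVABIDI=" + value + ", enabled=" + isEnabled(value));
    }
}
```

The comparison is exact because the part identifiers and values are case sensitive.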
S Part
The S part has the format: S(TOSHNALEYZ) with the following meaning:
| Symbol | Meaning | Valid Values | Default | Applicability |
|---|---|---|---|---|
| T | Text Type | I (Implicit), V (Visual) | V | Arabic and Hebrew |
| O | Orientation | L (LTR), R (RTL), C (Contextual LTR), D (Contextual RTL) | L | Arabic and Hebrew |
| S | Swapping | Y (Yes), N (No) | N | Arabic and Hebrew |
| H | Text Shaping | N (Nominal), S (Shaped), I (Initial), M (Middle), F (Final), B (Isolated) | S | Arabic only |
| N | Numerals | N (Nominal), H (National), C (Contextual) | N | Arabic only |
| A | Bidi Algorithm | U (Unicode), R (Roundtrip) | U | Arabic and Hebrew |
| L | Lam-Alef mode | R (Resize), N (Near), B (at Begin), E (at End), A (Auto) | A | Arabic only |
| E | Seen Tail mode | N (Near), B (at Begin), E (at End), A (Auto) | A | Arabic only |
| Y | Yeh-Hamza mode | O (One cell), N (Near), B (at Begin), E (at End), A (Auto) | A | Arabic only |
| Z | Tashkeel mode | K (Keep), Z (Zero width), W (with Width), B (at Begin), E (at End), A (Auto) | A | Arabic only |
Notes:
- The part identifier and the values are case sensitive.
- Values for one or more symbols may be specified as a hyphen ("-"); in that case the default value is applied.
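A value for the S part can be checked against the table above before it is passed to the JVM. The following sketch is a hypothetical helper, not part of the SDK; it assumes that all ten symbols must be present, in the order TOSHNALEYZ, and that a hyphen is accepted in any position.

```java
public class BidiAttributeCheck {

    // Valid letters for each symbol, in the order T O S H N A L E Y Z.
    static final String[] VALID = {
        "IV",      // T: text type
        "LRCD",    // O: orientation
        "YN",      // S: swapping
        "NSIMFB",  // H: text shaping
        "NHC",     // N: numerals
        "UR",      // A: Bidi algorithm
        "RNBEA",   // L: Lam-Alef mode
        "NBEA",    // E: Seen Tail mode
        "ONBEA",   // Y: Yeh-Hamza mode
        "KZWBEA"   // Z: Tashkeel mode
    };

    /** Returns true when every position holds a valid letter or a hyphen. */
    static boolean isValid(String attributes) {
        if (attributes == null || attributes.length() != VALID.length) {
            return false;
        }
        for (int i = 0; i < VALID.length; i++) {
            char c = attributes.charAt(i);
            // The comparison is case sensitive, as required by the notes above.
            if (c != '-' && VALID[i].indexOf(c) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid("VLNSNUNNNK"));
        System.out.println(isValid("vlnsnunnnk"));
    }
}
```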
U Part
The U part has the format: U(TOSHNALEYZ) with the following meaning:
| Symbol | Meaning | Valid Values | Default | Applicability |
|---|---|---|---|---|
| T | Text Type | I (Implicit), V (Visual) | I | Arabic and Hebrew |
| O | Orientation | L (LTR), R (RTL), C (Contextual LTR), D (Contextual RTL) | L | Arabic and Hebrew |
| S | Swapping | Y (Yes), N (No) | Y | Arabic and Hebrew |
| H | Text Shaping | N (Nominal), S (Shaped), I (Initial), M (Middle), F (Final), B (Isolated) | N | Arabic only |
| N | Numerals | N (Nominal), H (National), C (Contextual) | N | Arabic only |
| A | Bidi Algorithm | U (Unicode), R (Roundtrip) | U | Arabic and Hebrew |
| L | Lam-Alef mode | R (Resize), N (Near), B (at Begin), E (at End), A (Auto) | A | Arabic only |
| E | Seen Tail mode | N (Near), B (at Begin), E (at End), A (Auto) | A | Arabic only |
| Y | Yeh-Hamza mode | O (One cell), N (Near), B (at Begin), E (at End), A (Auto) | A | Arabic only |
| Z | Tashkeel mode | K (Keep), Z (Zero width), W (with Width), B (at Begin), E (at End), A (Auto) | A | Arabic only |
Notes:
- The part identifier and the values are case sensitive.
- Values for one or more symbols may be specified as a hyphen ("-"); in that case the default value is applied.
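The effect of hyphens can be shown by substituting the defaults from the S and U tables position by position. The helper below is a hypothetical sketch; the two default strings are simply the Default columns of the tables read in TOSHNALEYZ order.

```java
public class BidiDefaults {

    // Default column of the S table, in the order T O S H N A L E Y Z.
    static final String S_DEFAULTS = "VLNSNUAAAA";
    // Default column of the U table, in the same order.
    static final String U_DEFAULTS = "ILYNNUAAAA";

    /** Replaces each hyphen with the default value for the same position. */
    static String resolve(String attributes, String defaults) {
        if (attributes.length() != defaults.length()) {
            throw new IllegalArgumentException(
                    "Expected " + defaults.length() + " symbols: " + attributes);
        }
        StringBuilder resolved = new StringBuilder(attributes.length());
        for (int i = 0; i < attributes.length(); i++) {
            char c = attributes.charAt(i);
            resolved.append(c == '-' ? defaults.charAt(i) : c);
        }
        return resolved.toString();
    }

    public static void main(String[] args) {
        // The three hyphens stand for the A, L and E symbols.
        System.out.println(resolve("VLNSN---NK", S_DEFAULTS));
    }
}
```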
C Part
The C part has the format: C(xxx;yyy;zzz) where "xxx", "yyy", "zzz" represent Bidi code pages. When more than one code page is listed, the code pages must be separated by semicolons (";") without intervening spaces.
Bidi supported code pages
| Code page | Canonical name for NIO | Language |
|---|---|---|
| Cp420 | IBM-420 | Arabic |
| Cp424 | IBM-424 | Hebrew |
| Cp856 | IBM-856 | Hebrew |
| Cp862 | IBM-862 | Hebrew |
| Cp864 | IBM-864 | Arabic |
| Cp867 | IBM-867 | Hebrew |
| Cp1046 | IBM-1046 | Arabic |
| Cp1255 | windows-1255 | Hebrew |
| Cp1256 | windows-1256 | Arabic |
| ISO8859_6 | ISO8859_6 | Arabic |
| ISO8859_8 | ISO8859_8 | Hebrew |
| MacArabic | MacArabic | Arabic |
| MacHebrew | MacHebrew | Hebrew |
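Whether a given JVM actually provides these encodings can be checked with the standard Charset.isSupported method. The sketch below uses the NIO names exactly as listed in the table; the class name is hypothetical, and the result depends on the JVM and its installed charset providers, so no particular output is implied.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

public class BidiCodePages {

    // NIO names copied from the table of Bidi supported code pages.
    static final String[] NIO_NAMES = {
        "IBM-420", "IBM-424", "IBM-856", "IBM-862", "IBM-864", "IBM-867",
        "IBM-1046", "windows-1255", "windows-1256",
        "ISO8859_6", "ISO8859_8", "MacArabic", "MacHebrew"
    };

    /** Maps each listed name to whether the running JVM can resolve it. */
    static Map<String, Boolean> availability() {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String name : NIO_NAMES) {
            result.put(name, Charset.isSupported(name));
        }
        return result;
    }

    public static void main(String[] args) {
        availability().forEach((name, supported) ->
                System.out.println(name + ": " + (supported ? "available" : "not available")));
    }
}
```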
Examples of values for JAVABIDI
JAVABIDI=U(ILYNNUNNNK),S(VLNSNUNNNK),C(Cp420)
JAVABIDI=C(Cp420),S(VLNSNUNNNK),U(ILYNNUNNNK)
- The order of the part specifications is not significant.
JAVABIDI=U(ILYNNUNNNK),S(VLNSN---NK),C(Cp420;IBM-420)
- The hyphens in the S part represent default values for the corresponding symbols.
JAVABIDI=C(Cp420)
- Since both the S and the U parts are omitted, they receive default values for all the symbols.
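The structure of these values can be made explicit with a small parser. This is a hypothetical sketch for illustration, not the JVM's own parsing logic: it splits the value at commas, checks that each part is a letter S, U or C followed by a parenthesized value, and stores the parts in a map so that their order does not matter.

```java
import java.util.Map;
import java.util.TreeMap;

public class JavaBidiParser {

    /** Splits a JAVABIDI value into its S, U and C parts, keyed by identifier. */
    static Map<Character, String> parse(String value) {
        Map<Character, String> parts = new TreeMap<>();
        if ("NO".equals(value)) {
            return parts; // transformations disabled, nothing to parse
        }
        for (String part : value.split(",")) {
            boolean wellFormed = part.length() >= 3
                    && part.charAt(1) == '('
                    && part.endsWith(")");
            if (!wellFormed || "SUC".indexOf(part.charAt(0)) < 0) {
                throw new IllegalArgumentException("Malformed part: " + part);
            }
            // Keep only the text between the parentheses.
            parts.put(part.charAt(0), part.substring(2, part.length() - 1));
        }
        return parts;
    }

    public static void main(String[] args) {
        Map<Character, String> first = parse("U(ILYNNUNNNK),S(VLNSNUNNNK),C(Cp420)");
        Map<Character, String> second = parse("C(Cp420),S(VLNSNUNNNK),U(ILYNNUNNNK)");
        System.out.println("Same settings: " + first.equals(second));
    }
}
```

Splitting at commas is sufficient here because the code pages inside the C part are separated by semicolons rather than commas.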
Known limitations
This is the first release where support for Bidi data is implemented and limitations are known to exist.
- If an application program reads from a file or writes to a file pieces of text that do not constitute a logical unit, the Bidi layout transformations will not provide expected results. For instance, an application which reads or writes characters one at a time will not benefit from the new Bidi support. This limitation is not likely to be removed in future releases.
- When unmappable characters appear in SBCS data (characters which are not valid in the declared code page), they may cause previous and following data to be transformed independently from one another, which can lead to unexpected results.
- When an application reads or writes a unit of text (e.g. a line) which may cross the boundary between buffers used by the input or output file, the Bidi transformation may be done independently on the part of the text unit which is included in each buffer, leading to unexpected results. When the file is not too large, this can be avoided by setting the buffer size large enough to contain the whole file (e.g. by specifying the buffer size when constructing a BufferedInputStream or a BufferedOutputStream).
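The buffer-size workaround from the last limitation might look like the sketch below, which sizes a BufferedInputStream to hold the whole file and then reads complete lines rather than single characters. The class name, the 16 MB cap and the helper that writes a temporary file are assumptions made so that the example is self-contained; whether this fully avoids the limitation on a given JVM is not guaranteed by this sketch.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class WholeFileBuffer {

    // Hypothetical upper bound so that very large files do not exhaust memory.
    static final long MAX_BUFFER = 16L * 1024 * 1024;

    /** Reads all lines through a byte buffer sized to contain the whole file. */
    static List<String> readLines(Path file, Charset external) {
        try {
            int size = (int) Math.max(1, Math.min(Files.size(file), MAX_BUFFER));
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                    new BufferedInputStream(Files.newInputStream(file), size), external))) {
                List<String> lines = new ArrayList<>();
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line); // each line is handled as one logical unit
                }
                return lines;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper: writes the content to a temporary file and reads it back. */
    static List<String> roundTrip(String content) {
        try {
            Path file = Files.createTempFile("bidi-demo", ".txt");
            try {
                Files.write(file, content.getBytes(StandardCharsets.UTF_8));
                return readLines(file, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("abc \u05D0\u05D1\u05D2\nsecond line\n").size() + " lines read");
    }
}
```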
P.S. If you want to access the Bidirectional Layout Engine API directly, refer to the javadoc provided with the SDK package.