DiType User Guide/Linguistic Algorithms


Line-Breaking Algorithm

The following rules comprise DiType's line-breaking algorithm:

  1. Line-break is permitted if one of the following conditions is fulfilled:
    • Line-break is forced by the explicit linefeed characters: U+000A, U+000D, U+2028, and U+2029. Note, however, that by default DiType performs linefeed normalization, which treats all linefeed characters like spaces. Therefore, the linefeed characters actually force a line-break only if the linefeed-treatment attribute is set to "preserve".
    • Line-break is permitted at space characters: U+0009, U+0020, U+2000–U+200B, and U+3000.

  2. Line-break is not allowed in the following cases, unless one of the conditions of rule 1 is fulfilled:
    • Immediately preceding or following non-breaking spaces (U+00A0) and non-breaking hyphens (U+2011).
    • Immediately preceding trailing punctuation characters, closing brackets and quotes, small Katakana and Hiragana characters, superscript characters, etc.
    • Immediately after opening brackets and quotes, Spanish leading punctuation, currency symbols, etc.

  3. If the hyphenate attribute is set to "true" and all hyphenation conditions (hyphenation-push-character-count, hyphenation-remain-character-count, etc.) are satisfied, then line-break is permitted after a soft hyphen (U+00AD). A soft hyphen at the end of a line is replaced by the text specified in the hyphenation-character attribute; all other soft hyphens are suppressed.
  4. Unless prohibited by the above rules, line-break is permitted before or after CJK ideographic, Katakana, Hiragana, and Hangul characters.
  5. In all other cases, line-break is prohibited.

The algorithm will be refined in future versions of DiType as more feedback about non-European writing systems is received.
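The rules above can be sketched in a few lines of Python. This is only an illustration of the rule ordering, not DiType's implementation: the function names, the small sample sets standing in for "trailing punctuation", "opening brackets", and so on, and the CJK ranges are all assumptions, and the hyphenation count conditions of rule 3 are ignored.

```python
SOFT_HYPHEN = "\u00ad"
LINEFEEDS = {"\n", "\r", "\u2028", "\u2029"}
SPACES = {"\t", " ", "\u3000"} | {chr(cp) for cp in range(0x2000, 0x200C)}
NO_BREAK = {"\u00a0"}  # assumption: only the non-breaking space is modelled
# Small illustrative samples of the character classes named in rule 2.
CLOSING = set(".,;:!?)]}\u00bb\u201d")
OPENING = set("([{\u00ab\u201c\u00bf\u00a1$\u20ac\u00a3\u00a5")
# Assumed ranges: Hiragana/Katakana, CJK ideographs, Hangul syllables.
CJK_RANGES = [(0x3040, 0x30FF), (0x4E00, 0x9FFF), (0xAC00, 0xD7AF)]


def is_cjk(ch):
    return any(lo <= ord(ch) <= hi for lo, hi in CJK_RANGES)


def break_opportunities(text, hyphenate=True):
    """Return indices i such that a break is permitted after text[i]."""
    result = []
    for i in range(len(text) - 1):
        cur, nxt = text[i], text[i + 1]
        if cur in LINEFEEDS or cur in SPACES:
            # Rule 1. A linefeed is either a forced break ("preserve")
            # or normalized to a space; a break is permitted either way.
            result.append(i)
        elif cur in NO_BREAK or nxt in NO_BREAK:
            continue  # rule 2: around non-breaking characters
        elif nxt in CLOSING or cur in OPENING:
            continue  # rule 2: before closing / after opening characters
        elif cur == SOFT_HYPHEN:
            if hyphenate:
                result.append(i)  # rule 3
        elif is_cjk(cur) or is_cjk(nxt):
            result.append(i)  # rule 4
        # rule 5: everything else is prohibited
    return result


def render_line(line, hyphenation_character="-"):
    """Show a trailing soft hyphen as the hyphenation character; drop the rest."""
    visible = line.replace(SOFT_HYPHEN, "")
    return visible + hyphenation_character if line.endswith(SOFT_HYPHEN) else visible
```

For example, break_opportunities("Hello world") reports a single opportunity after the space, and render_line applied to a line ending in a soft hyphen yields the visible text followed by the hyphenation character.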


Hyphenation

DiType uses Unicode soft hyphen characters (U+00AD) to mark possible hyphenation points. These characters can either be present in the source XSL-FO document (for example, inserted by an external hyphenation application) or be added by DiType automatically before the source is passed to the formatter.

The hyphenator implements Liang's algorithm. DiType's distribution includes patterns for the following languages: English (American and British), French, German, Spanish, Russian, Polish, Italian, Finnish, Danish, Dutch, Croatian, Czech, Irish, Catalan, Hungarian, Interlingua, Basque, Greek, Latin, Galician, Slovenian, Swedish, Portuguese, Estonian, Icelandic, Norwegian, Turkish, Mongolian, Sanskrit, Bulgarian, Serbian. All patterns are borrowed from CTAN (the Comprehensive TeX Archive Network, http://www.ctan.org/), with some modifications for non-English patterns. More patterns can be added if necessary.
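As a rough illustration of how Liang's algorithm turns patterns into soft hyphens, here is a minimal Python sketch. The nine patterns are a toy set that happens to cover the word "hyphenation"; they are not DiType's pattern files. The function names and the remain and push parameters (loosely modelled on hyphenation-remain-character-count and hyphenation-push-character-count) are assumptions made for this example.

```python
import re

SOFT_HYPHEN = "\u00ad"

# Toy pattern set for illustration only.
TOY_PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at",
                "1na", "n2at", "1tio", "2io", "o2n"]


def parse_pattern(pattern):
    """Split 'hy3ph' into letters 'hyph' and values [0, 0, 3, 0, 0]."""
    letters = re.sub(r"\d", "", pattern)
    values = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            values[pos] = int(ch)  # value sits before the letter at index pos
        else:
            pos += 1
    return letters, values


def hyphenate(word, patterns=TOY_PATTERNS, remain=2, push=2):
    """Insert U+00AD at every position where the winning value is odd."""
    dotted = "." + word.lower() + "."  # dots mark the word boundaries
    points = [0] * (len(dotted) + 1)
    for letters, values in map(parse_pattern, patterns):
        start = dotted.find(letters)
        while start != -1:
            for offset, value in enumerate(values):
                index = start + offset
                points[index] = max(points[index], value)
            start = dotted.find(letters, start + 1)
    pieces = []
    for k, ch in enumerate(word):
        # points[k + 1] is the value between word[k - 1] and word[k].
        if points[k + 1] % 2 == 1 and remain <= k <= len(word) - push:
            pieces.append(SOFT_HYPHEN)
        pieces.append(ch)
    return "".join(pieces)
```

With the toy patterns, hyphenate("hyphenation") marks the word as hy|phen|ation, where each | stands for a soft hyphen. Odd values permit a break and even values forbid it; where several patterns compete for the same position, the highest value wins.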

Hyphenation Patterns

The hyphenator uses TeX format for hyphenation patterns. It recognizes the following sections in the pattern files:

  • patterns (for hyphenation patterns)
  • hyphenation (for exceptions)

Any other section in the pattern file is ignored. Hexadecimal escape codes (e.g. ^^ae) and control characters (^^A) are supported; they can be used to encode non-ANSI European characters. Additionally, DiType recognizes a set of \rm macros for accented characters: \^a is â (a with circumflex accent), \l is ł (Polish barred l), etc.
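The escape notation can be illustrated with a short Python sketch. It follows the usual TeX convention: two lowercase hexadecimal digits after ^^ give a character code, and any other single character after ^^ has its code shifted by 64. Mapping the resulting code directly to the Unicode code point of the same value is an assumption of this sketch; how DiType interprets the codes is not stated here, and the function name is hypothetical.

```python
import re

# "^^" followed by two lowercase hex digits, or by one 7-bit character.
_ESCAPE = re.compile(r"\^\^(?:([0-9a-f]{2})|([\x00-\x7f]))")


def decode_tex_escapes(text):
    """Replace TeX-style ^^ escapes with the characters they denote."""
    def replace(match):
        if match.group(1) is not None:
            # Hexadecimal form, e.g. ^^f6 -> code 0xF6.
            return chr(int(match.group(1), 16))
        # Control form, e.g. ^^A -> code 65 - 64 = 1.
        return chr(ord(match.group(2)) ^ 0x40)
    return _ESCAPE.sub(replace, text)
```

Under that assumption, decode_tex_escapes("sch^^f6n") gives "schön", and "^^A" decodes to the control character with code 1.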

Support for Right-to-Left Writing Systems

DiType supports both left-to-right and right-to-left text. The writing-mode attribute defines the ordering of characters within lines and the stacking direction of lines into paragraphs. It can be specified on the <fo:simple-page-master>, <fo:region-*>, <fo:table>, <fo:block-container>, and <fo:inline-container> elements. Its primary values are:

  • "lr-tb" : left-to-right, top-to-bottom. This is the default writing mode in XSL-FO; it is used by the majority of world languages, including English.
  • "rl-tb" : right-to-left, top-to-bottom. This mode is used by the Arabic writing system (adopted by many languages of the Middle East) and by the Hebrew and Syriac alphabets.
  • "tb-rl" : top-to-bottom, right-to-left. This way of writing is widely used for Japanese, and also for Chinese and other languages of East Asia.

Note: DiType supports only horizontal writing modes: "lr-tb" and "rl-tb".

The writing-mode attribute defines every aspect of the document organization: binding edge, column ordering in tables, text alignment in blocks, etc. It also sets the correspondence between relative directions ( before – after – start – end ) and absolutely oriented ones ( top – bottom – left – right ).
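The correspondence between relative and absolute directions can be written down as a small lookup table. The Python sketch below follows the XSL-FO definitions of the three writing modes; the names WRITING_MODES, SUPPORTED_BY_DITYPE, and resolve are made up for this example.

```python
# Relative direction -> absolute direction, per writing mode.
WRITING_MODES = {
    "lr-tb": {"before": "top", "after": "bottom", "start": "left", "end": "right"},
    "rl-tb": {"before": "top", "after": "bottom", "start": "right", "end": "left"},
    "tb-rl": {"before": "right", "after": "left", "start": "top", "end": "bottom"},
}

# Only the horizontal modes are supported, as noted above.
SUPPORTED_BY_DITYPE = {"lr-tb", "rl-tb"}


def resolve(writing_mode, relative):
    """Map a relative direction (before/after/start/end) to an absolute side."""
    if writing_mode not in SUPPORTED_BY_DITYPE:
        raise ValueError("unsupported writing mode: " + writing_mode)
    return WRITING_MODES[writing_mode][relative]
```

For instance, a block aligned to "start" resolves to the left edge in "lr-tb" and to the right edge in "rl-tb".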


Bidirectionality

Bidirectionality is the interleaving of text that is to be displayed in both directions: for example, operating instructions are in Hebrew, but the name of the product appears in the middle of the instructions, in English. In simple situations, the renderer handles bidirectionality on its own; there are, however, many situations where the desired resolution is ambiguous. For these situations, XSL defines a special element, <fo:bidi-override>, which allows altering the bidirectional behavior of the whole text or of its parts. It has two properties:

direction: sets the dominant direction for a span of text. Possible values are:
  • "ltr" — from left to right
  • "rtl" — from right to left

unicode-bidi: specifies the behavior of a text span with respect to the Unicode bidi algorithm. Possible values are:
  • "normal" — order characters by Unicode bidi.
  • "embed" — open a new level of embedding.
  • "bidi-override" — ignore directionality of the text and arrange characters in the order specified by the direction property.
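In plain Unicode text, the same effects are usually expressed with directional formatting characters: an embedding corresponds to LRE or RLE, an override to LRO or RLO, and both are closed by PDF. The Python sketch below shows that correspondence. It is an analogy meant to clarify the two properties, not a description of how DiType processes <fo:bidi-override>, and the function name bidi_wrap is hypothetical.

```python
LRE, RLE = "\u202a", "\u202b"  # left-to-right / right-to-left embedding
LRO, RLO = "\u202d", "\u202e"  # left-to-right / right-to-left override
PDF = "\u202c"                 # pop directional formatting


def bidi_wrap(text, direction="ltr", unicode_bidi="normal"):
    """Wrap text in the formatting characters matching the two properties."""
    if direction not in ("ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")
    if unicode_bidi == "normal":
        return text  # leave ordering to the Unicode bidi algorithm
    if unicode_bidi == "embed":
        opener = LRE if direction == "ltr" else RLE
    elif unicode_bidi == "bidi-override":
        opener = LRO if direction == "ltr" else RLO
    else:
        raise ValueError("unknown unicode-bidi value: " + unicode_bidi)
    return opener + text + PDF
```

An English product name inside Hebrew instructions would typically need at most an "embed" with direction "ltr"; "bidi-override" is the stronger tool, forcing every character into the stated direction regardless of its own directionality.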

Glyph Shaping

DiType supports contextual selection of Arabic positional glyph variants, known as glyph shaping. Shaping proceeds as follows: each character that belongs to the Arabic Unicode range U+0600–U+06FF is replaced by its counterpart in the Arabic Presentation Forms ranges U+FB50–U+FDFF and U+FE70–U+FEFF, in accordance with the Unicode rules for Arabic. Only basic changes are considered:

  • Substitution of initial, final, and medial forms
  • Insertion of lam-alef ligatures

Shaping occurs before font selection. For the algorithm to work, the following conditions must be met:

  • Fonts chosen for Arabic text spans shall cover all positional variants of the glyphs used. (You can specify a list of fonts; glyphs will be searched in all of them, following the usual rules for processing multiple font families.)
  • Positional variants are accessible through their Unicode codepoints.

This is the case for most TrueType fonts that support Traditional Arabic; however, DiType does not work with Simplified Arabic fonts.
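The substitution step can be sketched in Python using the compatibility decompositions that ship with the standard unicodedata module. This is a simplified illustration, not DiType's shaper: it covers only the Presentation Forms-B range U+FE70–U+FEFF, treats combining marks and the tatweel as ordinary non-joining characters, and all names in it are assumptions.

```python
import unicodedata

LAM = "\u0644"
TAGS = ("<isolated>", "<final>", "<initial>", "<medial>")


def build_forms():
    """Map base character(s) to {form name: presentation-form character}."""
    forms = {}
    for cp in range(0xFE70, 0xFF00):
        parts = unicodedata.decomposition(chr(cp)).split()
        if len(parts) < 2 or parts[0] not in TAGS:
            continue
        base = "".join(chr(int(p, 16)) for p in parts[1:])
        forms.setdefault(base, {})[parts[0].strip("<>")] = chr(cp)
    return forms


FORMS = build_forms()


def shape(text):
    """Replace Arabic letters by positional forms; build lam-alef ligatures."""
    units, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair[0] == LAM and pair in FORMS:
            units.append(pair)  # lam + alef becomes one ligature unit
            i += 2
        else:
            units.append(text[i])
            i += 1
    out = []
    for idx, unit in enumerate(units):
        forms = FORMS.get(unit)
        if forms is None:
            out.append(unit)
            continue
        # The previous unit connects forward only if it has an initial form;
        # the next unit connects backward only if it has a final form.
        prev_joins = idx > 0 and "initial" in FORMS.get(units[idx - 1], {})
        next_joins = idx + 1 < len(units) and "final" in FORMS.get(units[idx + 1], {})
        if prev_joins and next_joins and "medial" in forms:
            form = "medial"
        elif prev_joins and "final" in forms:
            form = "final"
        elif next_joins and "initial" in forms:
            form = "initial"
        else:
            form = "isolated"
        out.append(forms.get(form, unit))
    return "".join(out)
```

Because the output consists of presentation-form code points, the result can be rendered only with a font that exposes those code points directly, which is the condition stated in the list above.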
