Urdu word processing
URDU WORD PROCESSING: AN INTRODUCTION

By: Sharraf Hussain


Classes of Problems in Urdu Word Processing

Any computer application that handles Urdu text must confront two main classes of problems.

1. Contextual Formatting

2. Directional Layout

Contextual Formatting

Contextual formatting means determining a character’s proper presentation form according to its context.

  • In Urdu each character may have several presentation forms.

  • The proper form of a character in a text is determined by the presentation forms available for the character itself and for the characters surrounding it.

  • It also depends on the current presentation forms of the surrounding characters.

Contextual Joining Method

  • Urdu letters join to their neighbors within a word

  • Each joining letter is represented by four basic contextual forms

    • Rule: The software replaces the simple contextual forms by special graphics designed for the specific context.

    • Note: This kind of contextual formatting is only possible in Naskh script.
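The four basic contextual forms can be inspected directly in Unicode. The sketch below is only an illustration: it assumes the Unicode compatibility block "Arabic Presentation Forms-B" is an acceptable stand-in for the "special graphics" mentioned above, whereas real software would normally pick glyphs from a font. The function name `presentation_forms` is hypothetical.

```python
import unicodedata

def presentation_forms(letter):
    """Return {form_name: glyph} for one Arabic-script letter.

    Scans Arabic Presentation Forms-B (U+FE70..U+FEFF) and keeps the code
    points whose compatibility decomposition is exactly this letter.
    """
    target = f"{ord(letter):04X}"
    forms = {}
    for cp in range(0xFE70, 0xFF00):
        parts = unicodedata.decomposition(chr(cp)).split()
        # A matching entry looks like "<initial> 0628": a tag plus the base letter.
        if len(parts) == 2 and parts[1] == target:
            forms[parts[0].strip("<>")] = chr(cp)
    return forms

beh = presentation_forms("\u0628")   # ARABIC LETTER BEH, joins on both sides
alef = presentation_forms("\u0627")  # ARABIC LETTER ALEF, joins on one side only

for name, glyph in sorted(beh.items()):
    print(f"{name:9s} U+{ord(glyph):04X}")
```

BEH reports four forms (isolated, final, initial, medial), while ALEF reports only two, which is why the available forms of a letter matter when its neighbors are shaped.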

The Unicode Algorithm for Arabic Characters (Version 2.0)

  • Characters are divided into six groups:

  • Right-joining

  • Left-joining

  • Dual-joining

  • Join-causing (e.g. zero-width joiner)

  • Non-joining (e.g. zero-width non-joiner and space characters)

    • The use of non-joiner between two letters prevents them from forming a cursive connection with each other when rendered.

  • Transparent (e.g. harkats)

    • Harkats are marks that indicate vowels or other modifications of consonant letters.

  • Two subgroups are defined:

  • Right join-causing characters (including dual-joining, left-joining and join-causing characters)

  • Left join-causing characters (including dual-joining, right-joining and join-causing characters)

  • Seven rules are defined based on these classifications

    • http://www.unicode.org/unicode/uni2book/ch08.pdf (page 193)
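A minimal sketch of how such a classification can drive shaping. It is a simplification, not the seven Unicode rules themselves. The table `JOINING_TYPE` is a hypothetical, hand-written subset (Python's `unicodedata` module does not expose joining types), and the helper names are assumptions. Text is processed in logical (typing) order, so the "previous" character sits to the right on screen.

```python
# Hypothetical joining-type table covering only a handful of characters.
# D = dual-joining, R = right-joining, C = join-causing,
# U = non-joining, T = transparent.
JOINING_TYPE = {
    "\u0628": "D",  # BEH
    "\u0645": "D",  # MEEM
    "\u0627": "R",  # ALEF
    "\u062F": "R",  # DAL
    "\u200D": "C",  # ZERO WIDTH JOINER
    "\u200C": "U",  # ZERO WIDTH NON-JOINER
    "\u064E": "T",  # FATHA (a harkat)
}

def joining_type(ch):
    return JOINING_TYPE.get(ch, "U")  # unknown characters do not join

def neighbor_type(text, i, step):
    """Joining type of the nearest non-transparent neighbor, or 'U' at an edge."""
    j = i + step
    while 0 <= j < len(text) and joining_type(text[j]) == "T":
        j += step
    return joining_type(text[j]) if 0 <= j < len(text) else "U"

def joining_forms(text):
    """Return one form name per character (None for non-letters)."""
    forms = []
    for i, ch in enumerate(text):
        t = joining_type(ch)
        if t not in ("D", "R", "L"):
            forms.append(None)
            continue
        # Connects backwards if this letter can, and the previous one reaches forward.
        joins_prev = t in ("D", "R") and neighbor_type(text, i, -1) in ("D", "L", "C")
        # Connects forwards if this letter can, and the next one reaches backward.
        joins_next = t in ("D", "L") and neighbor_type(text, i, +1) in ("D", "R", "C")
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms

print(joining_forms("\u0628\u0627\u0628\u0627"))  # BEH ALEF BEH ALEF
```

Transparent marks are skipped when looking for a neighbor, and a non-joiner between two letters leaves both of them isolated.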

Contextual Formatting Algorithm in the Form of a Finite State Machine

  • In this algorithm the separate, last-joining, first-joining and middle-joining presentation forms are designated by the symbols A, B, C and D respectively. In contrast to the Unicode algorithm, there is no need to categorize the characters in this algorithm.

  • b[0] and b[1] are the first and second character respectively.

  • The state machine has two states: FIRST_CHAR_STATE and SECOND_CHAR_STATE.

  • In each state a character is entered and placed in the b[1] buffer.

  • In the first state:

    • We first determine the presentation form of b[0].

    • Based on the available presentation forms of b[1], we decide the presentation forms of b[0] and b[1].

    • Finally we move forward one character (i.e. the old b[1] becomes the new b[0]).

  • In the second state:

    • We decide on the presentation forms of b[0] and b[1] according to the available (and present) presentation forms of these characters.

    • Again we move forward one character.

[Flowchart of the state machine. Decision nodes: "FORM(b[0]) ?", "b[1] has FORM B?", "b[0] has FORM C?", "b[0] has FORM D?". Every branch ends with "Slide window +1".]

A = Separate

B = Last-Joining

C = First-Joining

D = Middle-Joining


b[0] = First Character

b[1] = Second Character
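A minimal sketch of the two-state machine described above. It is one possible reading of the slides, not the author's implementation. The assumptions are that a pair joins only when b[0] has a form that connects forward (C or D) and b[1] has FORM B, and that the `AVAILABLE` table (a hypothetical subset) lists the forms each letter offers.

```python
# Hypothetical table: which presentation forms each letter has available.
# A = separate, B = last-joining, C = first-joining, D = middle-joining.
AVAILABLE = {
    "\u0628": "ABCD",  # BEH
    "\u0645": "ABCD",  # MEEM
    "\u0627": "AB",    # ALEF never connects to the letter after it
    "\u062F": "AB",    # DAL never connects to the letter after it
}

FIRST_CHAR_STATE, SECOND_CHAR_STATE = "first", "second"

def available(ch):
    return AVAILABLE.get(ch, "A")  # anything unknown only has the separate form

def contextual_forms(text):
    """Return a string of A/B/C/D symbols, one per input character."""
    forms = ["A"] * len(text)
    state = FIRST_CHAR_STATE
    for i in range(len(text) - 1):
        b = (text[i], text[i + 1])            # the two-character window
        b1_has_form_b = "B" in available(b[1])
        if state == FIRST_CHAR_STATE:
            # b[0] is not connected to what came before it.
            if "C" in available(b[0]) and b1_has_form_b:
                forms[i], forms[i + 1] = "C", "B"
                state = SECOND_CHAR_STATE
        else:
            # b[0] currently holds FORM B from the previous step.
            if "D" in available(b[0]) and b1_has_form_b:
                forms[i], forms[i + 1] = "D", "B"
            else:
                state = FIRST_CHAR_STATE      # b[0] stays B; the chain ends here
        # Slide window +1: the old b[1] becomes the new b[0].
    return "".join(forms)

print(contextual_forms("\u0628\u0628\u0628"))  # three BEHs in a row
```

No character categories are needed: the decision at each step uses only the forms available for b[0] and b[1] and the form b[0] already holds.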

Directional Layout

The computer takes a sequence of right-to-left characters and places each letter in its proper relative position in the text line; this process can be called Directional Layout.

This second class of problem is caused by the fact that the Urdu script is written from right to left.
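For a line that contains only right-to-left characters, placement can be sketched as moving a pen leftwards from the right margin. The function name `layout_rtl` and the idea of passing glyph widths as plain numbers are assumptions made for illustration.

```python
def layout_rtl(widths, line_width):
    """Return the left x-coordinate of each glyph, given in logical order.

    The first character typed ends up at the right edge of the line and each
    following character is placed immediately to its left.
    """
    positions = []
    pen = line_width            # start at the right margin
    for width in widths:
        pen -= width            # move left by the glyph's advance width
        positions.append(pen)
    return positions

print(layout_rtl([10, 5, 8], line_width=100))
```

The stored order of the characters is untouched; only their screen positions run from right to left.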

Urdu Numbers

In right-to-left orientation with automatic counter flow enabled, typing an Urdu numeral automatically starts a special counter-flow mode in which the digits run left to right. Counter flow ends when you press any non-numeral key.

Example: three thousand four hundred fifty-six (3456)
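The effect of counter flow can be sketched at display time. The sketch assumes text is stored in typing order and that only digit runs flow against the line direction. The name `visual_order` is hypothetical, and combining marks and embedded left-to-right words are ignored.

```python
import re

def visual_order(logical):
    """Return characters in left-to-right screen order for a right-to-left line."""
    flipped = logical[::-1]     # the first character typed goes furthest right
    # Digit runs were reversed along with everything else; flip them back so
    # the most significant digit is on the left again.
    return re.sub(r"\d+", lambda m: m.group()[::-1], flipped)

word = "\u0642\u06cc\u0645\u062a"   # a sample Urdu word
line = word + " 3456"                # typed: word, space, number

screen = visual_order(line)
print([f"U+{ord(c):04X}" for c in screen])
```

On screen the number appears to the left of the word, yet it still reads 3456 from left to right.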



Design Goals

The term word processing is used here to focus on applications where the production of readable text is the major goal.

1. Minimal burden on the user

2. Maximum Transferability of text

3. Minimal idiosyncrasies in the internal text

4. Near typeset print quality

Minimal Burden on the User

  • Automatically present Urdu text in its proper format and layout.

  • The user should not be burdened with backward typing or confusing “modes”.

  • The user should be free to concentrate on the text’s content instead of its form.

Maximum Transferability of Text

  • It must be possible to copy text or numbers into a text editor for any language.

  • Text must remain compatible within a document editing window, between windows representing different documents, and with other application programs.

Minimum idiosyncrasies in the internal text sequence

Because other applications may be unaware of the special properties of Urdu text, the internally stored and transferred text sequence must be free of idiosyncrasies related to text directionality, such as substrings stored backwards (especially numbers) or embedded directional commands.
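A small illustration of why numbers stored backwards would hurt other applications. The "other application" here is just a few lines of Python that know nothing about Urdu. The sample string and the name `extract_numbers` are assumptions.

```python
import re

def extract_numbers(text):
    """Pull integers out of text the way a direction-unaware program might."""
    # \d matches any Unicode decimal digit, and int() accepts those digits too.
    return [int(run) for run in re.findall(r"\d+", text)]

urdu_word = "\u0642\u06cc\u0645\u062a"      # a sample Urdu word
urdu_3456 = "\u06f3\u06f4\u06f5\u06f6"      # digits three, four, five, six

stored_logically = urdu_word + " " + urdu_3456
stored_backwards = urdu_word + " " + urdu_3456[::-1]

print(extract_numbers(stored_logically))
print(extract_numbers(stored_backwards))
```

With logical storage the program recovers 3456; with the digits stored backwards it silently reads 6543.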

Near typeset quality

  • Professional looking text

Ligatures

  • Neighboring letters can fuse together to form a single graphic called a ligature.

  • A ligature itself has contextual joining forms.

  • Urdu word-processing software must be able to recognize the letter sequence lA in the text and then automatically display or print the correct form of the ligature.
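A minimal sketch of lA recognition, using the two lam-alef ligature code points from Unicode's Arabic Presentation Forms-B as stand-ins for font glyphs. The set `JOINS_TO_NEXT` is a hypothetical subset of letters that connect to the following letter, and the function name is an assumption.

```python
LAM, ALEF = "\u0644", "\u0627"
LAM_ALEF_ISOLATED = "\uFEFB"   # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
LAM_ALEF_FINAL = "\uFEFC"      # ARABIC LIGATURE LAM WITH ALEF FINAL FORM

# Hypothetical subset: letters that connect to the letter after them.
JOINS_TO_NEXT = {"\u0628", "\u0633", "\u0645", LAM}  # BEH, SEEN, MEEM, LAM

def apply_lam_alef(text):
    """Replace each LAM + ALEF pair with the ligature in its contextual form."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == LAM and text[i + 1:i + 2] == ALEF:
            # The ligature takes its final form when the letter before it
            # connects forward, and its isolated form otherwise.
            joined = i > 0 and text[i - 1] in JOINS_TO_NEXT
            out.append(LAM_ALEF_FINAL if joined else LAM_ALEF_ISOLATED)
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

shaped = apply_lam_alef("\u0633\u0644\u0627\u0645")  # SEEN LAM ALEF MEEM
print([f"U+{ord(c):04X}" for c in shaped])
```

The same letter pair yields two different graphics depending on what precedes it, which is the sense in which a ligature has its own contextual joining form.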

How to calculate ligature forms

  • Urdu has 5 letters that have the same initial shape as n, apart from their dots or Tuain, and 4 letters that have the same shape as j, apart from their dots.

It follows that the nj ligature is one of

5 × 4 = 20

structurally identical ligatures; but each of these has two contextual forms, so there are

20 × 2 = 40

forms in total.

Note: The shape of the nj ligature is not the same as that of the jn ligature.
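The count can be checked by enumeration. The letter and form names below are placeholders (the slides give only the counts of 5, 4 and 2), so every identifier in this sketch is hypothetical.

```python
from itertools import product

# Placeholder names: five letters shaped like n, four shaped like j.
N_LIKE = ["n1", "n2", "n3", "n4", "n5"]
J_LIKE = ["j1", "j2", "j3", "j4"]
CONTEXTUAL_FORMS = ["form1", "form2"]   # the two contextual forms per ligature

# Every (n-like, j-like) pair shares the same skeleton and differs only in dots.
ligatures = list(product(N_LIKE, J_LIKE))
# Each of those ligatures then exists in each contextual form.
variants = list(product(ligatures, CONTEXTUAL_FORMS))

print(len(ligatures), len(variants))
```

Enumerating this way gives 20 structurally identical ligatures and 40 forms once both contextual forms are counted.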

Possible combinations of ligatures?

  • The possible ligature combinations are too numerous to all be drawn in advance, and many of them would never occur in real text.

  • 16,000 possible ligature combinations have been discovered in Urdu so far. (Thanks to Mirza Jamil Ahmed)

Implementation of ligatures

  • Separate the basic skeleton of the ligature from the surrounding dots.

  • Assemble each ligature instance dynamically as needed.

Font Busting

  • The software first represents the letter sequence nj as a ligature skeleton plus a dot for the n and a dot for the j.

  • It then computes the contextual joining form of the skeleton and joins the skeleton into the word.

  • Finally, it places the dots at the appropriate “attachment points” stored with the skeleton.
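The three steps above can be sketched with a small data model. Everything here is hypothetical: the transliterated letter table, the shape names, the coordinates and the function names are invented for illustration, and the contextual-form step is left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Skeleton:
    name: str
    attachment_points: tuple  # one (x, y) point per letter, in letter order

# Hypothetical letter table: transliteration -> (bare shape, dot mark or None).
LETTERS = {
    "n": ("tooth", "dot above"),
    "b": ("tooth", "dot below"),
    "j": ("hook", "dot below"),
    "H": ("hook", None),
}

# One drawn skeleton serves every letter pair with the same bare shapes.
SKELETONS = {
    ("tooth", "hook"): Skeleton("tooth-hook", ((30, 42), (12, 8))),
}

def bust(sequence):
    """Split a letter sequence into its shared skeleton and per-letter marks."""
    shapes = tuple(LETTERS[ch][0] for ch in sequence)
    marks = tuple(LETTERS[ch][1] for ch in sequence)
    return SKELETONS[shapes], marks

def assemble(sequence):
    """Return the skeleton name and each mark placed at its attachment point."""
    skeleton, marks = bust(sequence)
    placed = [
        (mark, point)
        for mark, point in zip(marks, skeleton.attachment_points)
        if mark is not None
    ]
    return skeleton.name, placed

print(assemble("nj"))
print(assemble("bH"))
```

Both sequences reuse the same skeleton; only the marks placed on it differ, so the skeleton needs to be drawn once.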

Little problem with ligatures

  • Two ligatures may need to be joined in sequence when the last letter of the first ligature is also the first letter of the second.

  • It is then unclear which ligature to select.

Space (Sp-31)

  • The invisible character can be typed next to a normal character to trick it into assuming a joining form, and it can be typed between the s and m of smA to break up the automatic formation of the sm ligature.

  • In Urdu this invisible character is known as Space (Sp-31). It marks a break within connected Urdu words. An explicit space is achieved with Hard Space (Hs-65).

  • Unicode provides the zero-width joiner character (U+200D).

  • The availability of this simple override allows users to rely on the automatic formatting algorithm while still retaining final control over the appearance of the text.
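A minimal sketch of the override, using the transliterated example smA. The ligature table is hypothetical, matching is greedy from the start of the text, and U+200C ZERO WIDTH NON-JOINER stands in for the invisible break character. That choice is an assumption made for this sketch and is not stated in the slides.

```python
# Hypothetical ligature table, written in the slides' transliteration.
LIGATURES = {"sm", "mA", "lA"}
BREAK = "\u200c"   # assumed invisible break character (ZERO WIDTH NON-JOINER)

def segment(text):
    """Split text into display units: ligatures where possible, else letters."""
    units = []
    i = 0
    while i < len(text):
        if text[i] == BREAK:
            i += 1                  # invisible: produces no unit of its own
            continue
        pair = text[i:i + 2]
        if pair in LIGATURES:       # greedy: take the first ligature that fits
            units.append(pair)
            i += 2
        else:
            units.append(text[i])
            i += 1
    return units

print(segment("smA"))
print(segment("s" + BREAK + "mA"))
```

In smA both sm and mA are candidates because they share the m. The automatic choice takes sm, and typing the break character between s and m lets the user force mA instead.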

Write English Along With Urdu

  • Automatic mixed directional layout.

  • Directionality Variable

  • Downstream (dominant text)

  • Upstream (Insertion text)
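One way to picture downstream and upstream text is to split a stored line into directional runs. The sketch uses `unicodedata.bidirectional` from the Python standard library and a deliberately crude rule: spaces, digits and punctuation attach to the run before them. It is not the Unicode Bidirectional Algorithm, and the function names are assumptions.

```python
import unicodedata

def base_direction(ch):
    """Return 'rtl', 'ltr', or None for characters without a strong direction."""
    bidi = unicodedata.bidirectional(ch)
    if bidi in ("R", "AL"):
        return "rtl"
    if bidi == "L":
        return "ltr"
    return None   # spaces, digits, punctuation

def directional_runs(text):
    """Group text into (direction, substring) runs, in stored order."""
    runs = []
    for ch in text:
        d = base_direction(ch)
        if runs and (d is None or d == runs[-1][0]):
            runs[-1] = (runs[-1][0], runs[-1][1] + ch)   # extend the current run
        else:
            runs.append((d, ch))
    return runs

def upstream_runs(text, dominant="rtl"):
    """Runs whose direction opposes the dominant (downstream) direction."""
    return [run for d, run in directional_runs(text) if d not in (dominant, None)]

mixed = "\u0627\u0628 abc \u0628\u0627"   # Urdu letters, an English word, Urdu letters
print(directional_runs(mixed))
print(upstream_runs(mixed))
```

With right-to-left as the dominant direction, the English word is the only upstream run; switching `dominant` to "ltr" would make the Urdu runs upstream instead.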

Consequences of Automatic Layout

  • The directional variable controls the direction of the text.

  • The text is stored in the backward direction.

  • Searching algorithms.

  • Word wrapping algorithms
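The slides do not spell out how searching is affected. One plausible consequence, offered here only as an assumption, is that invisible override characters and harkats in the stored text can stop a plain substring search from matching. The sketch strips them before comparing, using Unicode general categories from the standard library; the function names are hypothetical.

```python
import unicodedata

def search_key(text):
    """Drop non-spacing marks (Mn) and invisible format characters (Cf)."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) not in ("Mn", "Cf")
    )

def contains(haystack, needle):
    """Substring search that ignores harkats and zero-width overrides."""
    return search_key(needle) in search_key(haystack)

# BEH + FATHA + ZERO WIDTH JOINER + ALEF, as a user might have typed it.
stored = "\u0628\u064e\u200d\u0627"
query = "\u0628\u0627"                  # plain BEH + ALEF

print(query in stored)                  # naive search
print(contains(stored, query))          # search on normalized keys
```

The naive search fails because of the two extra code points, while the normalized search succeeds.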