
1. www.RDI-eg.com Automatic Full Phonetic Transcription of Arabic Script with and without Language Factorization Based on research conducted by RDI's NLP group (2003-2009) http://www.RDI-eg.com/RDI/Technologies/Arabic_NLP.htm Mohsen Rashwan, Mohamed Al-Badrashiny, and Mohamed Attia Presented by Mohamed Attia Talk hosted by the Group of Computational Linguistics, Dept. of Computer Science, University of Toronto, Toronto, Canada, Oct. 7th, 2009

2. The Problem of Ambiguity with NLP • Numerous non-trivial NLP tasks that are handled via rule-based (i.e. language-factorizing) methods typically end up with multiple possible solutions/analyses; e.g. morphological analysis, PoS tagging, syntax analysis, lexical semantic analysis, etc. • This residual ambiguity arises from our incomplete knowledge of the underlying dynamics of the linguistic phenomenon, and maybe also from the lack of higher language-processing layers constraining that phenomenon; e.g. the absence of a semantic analysis layer constraining morphological and syntax analysis. • Statistical methods are well known to be among the most (if not the most) effective, feasible, and widely adopted approaches to automatically resolving that ambiguity.

3. Statistical disambiguation of factorized sequences of language entities

4. Intermediate Ambiguous NLP Tasks • Sometimes such ambiguous NLP tasks are sought not for the sake of their outputs themselves, but as an intermediate step to infer another, final output. • An example is the problem of automatically obtaining the phonetic transcription of given crude Arabic text w1 … wn, which can be directly inferred as a one-to-one mapping from the diacritics on the characters of the input words. But these diacritics are typically absent in MSA script! • The NLP solution to this TTS problem is to indirectly infer the diacritics d1 … dn by factorizing the crude input words via morphological analysis, PoS tagging, and the Arabic phonetic grammar. Slides 13 to 26 provide a review of these language factorization models; a sketch of the data flow follows this slide. • However, these language factorization processes are themselves highly ambiguous!
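To make the indirect, factorizing route concrete, here is a minimal runnable sketch of the data flow; every function is a trivial hypothetical stub standing in for the real models reviewed in slides 13 to 26, not RDI's actual code.

```python
# Hypothetical stubs illustrating the factorizing pipeline's data flow.

def morph_analyze(word):
    # Stub lexicon: each crude word maps to its candidate analyses.
    return {"ktb": ["kataba", "kutiba"]}.get(word, [word])

def disambiguate(lattice):
    # Stub for the statistical disambiguation of slides 8-12;
    # here it just takes the first candidate in each column.
    return [candidates[0] for candidates in lattice]

def case_ending(word, analysis):
    # Stub for PoS tagging (slide 23); '*' marks a toy case-ending diacritic.
    return analysis + "*"

def transcribe(words):
    """Infer the diacritics d1..dn indirectly via factorization."""
    lattice = [morph_analyze(w) for w in words]
    analyses = disambiguate(lattice)
    return [case_ending(w, a) for w, a in zip(words, analyses)]

print(transcribe(["ktb"]))  # ['kataba*']
```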

5. Arabic morphological analysis as an intermediate ambiguous language factorization towards the target output of the diacritics of i/p words

6. Why Not Go without Language Factorization Altogether!? • Some researchers argue that if statistical disambiguation is eventually deployed to get the most likely sequence of outputs, why not go fully statistical; i.e., un-factorize from the very beginning and give up the burden of rule-based methods? • For our example, this means that the statistical disambiguation (as well as the statistical language models) is built from manually diacritized text corpora in which the spelling characters and the full diacritics are both supplied for each word.

7. Cannot Cover, but How Accurate and How Fast? • The obvious answer to that question, in many such cases (including our example), is the problem of poor coverage when the input language entities are produced by a highly generative linguistic process; e.g. Arabic morphology. • However, the question may be refined so that it enquires about the performance (accuracy and speed) of statistically disambiguating un-factorized language entities (at least the frequent ones that can be covered without factorization) as compared to statistically disambiguating factorized language entities. • The rest of this presentation discusses 4 issues in this regard: • 1- The statistical disambiguation methodology deployed in both cases. • 2- The related Arabic NLP factorization models and the architecture of the factorizing system. • 3- The architecture of the hybrid (factorizing/un-factorizing) Arabic phonetic transcription system. • 4- Results analysis: the factorizing system vs. the hybrid system, and the hybrid system vs. other groups' systems.

8. 1- Statistical Disambiguation Methodology - Noisy Channel Model for Statistical Disambiguation Under the noisy channel model with the maximum a posteriori probability (MAP) criterion, the most likely interpretation is $\hat{I} = \arg\max_I P(I \mid O) = \arg\max_I P(O \mid I)\,P(I)$. • For our example, O is the sequence of crude Arabic i/p text words. • In the factorizing system, I is any valid sequence of factorizations, e.g. Arabic morphological analyses (quadruples), and $\hat{I}$ denotes the most likely one. • In the un-factorizing system, I is any valid sequence of diacritizations, and $\hat{I}$ denotes the most likely one.

9. 1- Statistical Disambiguation Methodology - Likelihood Probability In other pattern recognition problems, e.g. OCR and ASR, the term P(O|I), referred to as the likelihood probability, is modeled via probability distributions, e.g. HMMs. Our language factorization models enable us to do better: instead of modeling the availability of possible structures for a given i/p string in terms of probabilities, we treat it as a binary decision of whether the observed string complies with the formal rules of the factorization models or not. This simplifies the MAP formula into $\hat{I} = \arg\max_{I \in R(O)} P(I)$, where R(O) is the part of the factorization model's space corresponding to the observed input string; i.e.: • In the factorizing system, I is now restricted to only those factorized sequences that can generate (via synthesis) the input sequence, and $\hat{I}$ denotes the most likely one. • In the un-factorizing system, I is a possible sequence of diacritizations matching that i/p sequence, and $\hat{I}$ denotes the most likely one.
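The toy illustration below (ours, not RDI's implementation) shows the simplified rule in action: the likelihood P(O|I) collapses to a membership test, so only candidate sequences inside R(O) are scored at all. The candidate sets and the unigram SLM are hypothetical stand-ins.

```python
from itertools import product
from math import prod

# Hypothetical per-word candidate sets, i.e. R(O) position by position,
# with toy Latin-letter stand-ins for undiacritized Arabic words.
R = {"ktb": ["kataba", "kutiba", "kutub"],
     "alwld": ["alwaladu", "alwalada"]}

# A hypothetical unigram SLM P(q); the real system uses n-grams (slide 11).
P = {"kataba": 0.5, "kutiba": 0.2, "kutub": 0.3,
     "alwaladu": 0.7, "alwalada": 0.3}

def map_decode(observed):
    """Return the candidate sequence I in R(O) that maximizes P(I)."""
    return max(product(*(R[w] for w in observed)),
               key=lambda seq: prod(P[q] for q in seq))

print(map_decode(["alwld", "ktb"]))  # ('alwaladu', 'kataba')
```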

10. 1- Statistical Disambiguation Methodology - Statistical Language Models, and Search Space The term P(I) is conventionally called the (Statistical) Language Model (SLM). Let us replace the conventional symbol I with Q, which is more convenient for our specific problem. With the aid of the first graph in this presentation, the problem is now reduced to searching for the most likely sequence of $q_{i,f(i)}$, 1 ≤ i ≤ L; i.e. the one with the highest marginal probability through the lattice. This creates a Cartesian search space of size $\prod_{i=1}^{L} m_i$, where $m_i$ is the number of candidates at position i. The A* search algorithm is guaranteed to exit with the most likely path via the two tree-search strategies described on the next slide.

11. 1- Statistical Disambiguation Methodology - Lattice Search, and n-Gram Probabilities 1- Heuristic probability estimation of the rest of the path to be expanded next; this is called the h* function. Combined with: 2- Best-first tree expansion of the path with the highest sum of the start-to-expansion probability (the g function) plus the h* function. It is then required to estimate the marginal probability of any whole or partial possible path in the lattice. Via the chain rule and the attenuating-correlation assumption, this probability is approximated by $P(q_1 \dots q_L) \approx \prod_{i=1}^{L} P(q_i \mid q_{i-h} \dots q_{i-1})$, where h+1 is the maximum affordable length of the n-grams in the SLM. A runnable sketch of this search follows below.
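Below is a runnable sketch of this best-first A* lattice search, assuming a bigram SLM for concreteness; the h* function here sums optimistic per-column scores, so g + h* never underestimates the best completion and the first complete path popped is the most likely one. This is our illustration, not RDI's implementation.

```python
import heapq
from math import log

def a_star_lattice(lattice, logp):
    """lattice: a list of columns; column i holds the candidates q_{i,f(i)}.
    logp(prev, q): log-probability of candidate q given the previous
    candidate (a bigram SLM here; longer histories work the same way)."""
    n = len(lattice)
    # h*: optimistic per-column scores, maximized over every possible
    # history, so g + h* never underestimates the best completion.
    best = []
    for i, col in enumerate(lattice):
        prevs = [None] if i == 0 else lattice[i - 1]
        best.append(max(logp(p, q) for p in prevs for q in col))
    h = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        h[i] = h[i + 1] + best[i]
    # Best-first expansion: the heap is keyed on -(g + h*), so the path
    # with the highest estimated total probability is expanded first.
    heap = [(-h[0], 0.0, 0, ())]
    while heap:
        _, g, i, path = heapq.heappop(heap)
        if i == n:  # with an admissible h*, the first complete pop is optimal
            return path
        prev = path[-1] if path else None
        for q in lattice[i]:
            g2 = g + logp(prev, q)
            heapq.heappush(heap, (-(g2 + h[i + 1]), g2, i + 1, path + (q,)))

# Toy usage with a hypothetical bigram table; unseen bigrams get a floor.
bigram = {(None, "a1"): 0.6, (None, "a2"): 0.4,
          ("a1", "b1"): 0.1, ("a1", "b2"): 0.9,
          ("a2", "b1"): 0.8, ("a2", "b2"): 0.2}
lp = lambda prev, q: log(bigram.get((prev, q), 1e-6))
print(a_star_lattice([["a1", "a2"], ["b1", "b2"]], lp))  # ('a1', 'b2')
```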

12. 1- Statistical Disambiguation Methodology - Computing Probabilities of n-Grams with Zipfian Sparseness • These conditional probabilities are primarily calculated via the famous Bayesian formula. Due to Zipfian sparseness, the Good-Turing discount and Katz's back-off techniques are also deployed to obtain smooth distributions and reliable estimates of rare and unseen events, respectively. • While the DB of elementary n-gram probabilities P(q1 … qn), 1 ≤ n ≤ h+1, is built during the training phase, the task of statistical disambiguation at runtime is rendered to looking these probabilities up while running the A* search for the most likely path through the lattice.
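The sketch below illustrates one level of Katz back-off over discounted counts: discounting reserves probability mass from seen n-grams, and back-off spends it on the lower-order distribution for unseen events. The flat DISCOUNT constant stands in for the per-count Good-Turing discounts, and all counts are toy numbers, so this is a shape-of-the-computation sketch rather than RDI's estimator.

```python
counts = {  # hypothetical bigram and unigram counts from training
    ("kataba",): 50, ("alwaladu",): 40, ("kutiba",): 10,
    ("alwaladu", "kataba"): 30, ("alwaladu", "kutiba"): 5,
}
DISCOUNT = 0.5  # flat stand-in for the Good-Turing discount d_r

def p_katz(word, history):
    """P(word | history) with one level of Katz back-off."""
    hi = counts.get(history + (word,), 0)
    hist_total = sum(c for ng, c in counts.items()
                     if len(ng) == len(history) + 1 and ng[:-1] == history)
    if hi > 0:
        return (hi - DISCOUNT) / hist_total          # discounted ML estimate
    # Back-off: left-over mass alpha times the renormalized lower-order model.
    seen = [ng[-1] for ng in counts
            if len(ng) == len(history) + 1 and ng[:-1] == history]
    alpha = (DISCOUNT * len(seen)) / hist_total if hist_total else 1.0
    uni_total = sum(c for ng, c in counts.items() if len(ng) == 1)
    uni = counts.get((word,), 0) / uni_total
    denom = 1.0 - sum(counts.get((w,), 0) / uni_total for w in seen)
    return alpha * uni / denom if denom > 0 else alpha * uni

print(p_katz("kataba", ("alwaladu",)))    # seen bigram, discounted
print(p_katz("alwaladu", ("alwaladu",)))  # unseen bigram, backed off
```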

13. 2- Arabic NLP Factorization Models - Arabic Phonetic Transcription: Problem Definition Although Arabic is an intensively diacritized language, Modern Standard Arabic (MSA) is typically written by contemporary natives without diacritics! So it is the task of the NLP system to accurately infer all the missing diacritics of all the words in the input Arabic text, and also to amend those diacritics to account for the mutual phonetic effects among adjacent words in continuous pronunciation.

14. 2- Arabic NLP Factorization Models - Challenges of Arabic Phonetic Transcription • Modern Standard Arabic (MSA) is typically written without diacritics. • MSA script is typically full of common spelling mistakes. • The extremely derivative and inflective nature of Arabic necessitates treating it as a morpheme-based rather than a vocabulary-based language; the size of the generable Arabic vocabulary is on the order of billions! • One or more diacritics in about 65% of the words in Arabic text depend on the syntactic case ending of the word. • Lexical and syntax grammars alone produce a high average number of possible solutions for each word of the text (high ambiguity). • About 7.5% of open-domain Arabic text consists of transliterated words, which lack any constraining Arabic model. Moreover, many of these words are confusingly analyzable as normal Arabic words!

15. 2- Arabic NLP Factorization Models - The Ladder of NLP Layers; Undiscovered Levels • Theoretically speaking, NLP problems should be tackled combinatorially at all the NLP layers, which is yet far beyond the reach of the current state of the art. • Moreover, NLP researchers have not yet developed firm knowledge at all the NLP layers.

16. 2- Arabic NLP Factorization Models - Language Factorizations Deployed for Solving the Problem • Arabic morphological analysis (with statistical disambiguation) is deployed to retrieve the syntax-independent lexical phonetic info of each input Arabic word from its building morphemes. • Arabic PoS tagging (along with morphological analysis) is deployed to statistically infer the most likely syntax-dependent (case-ending) phonetic info of the i/p Arabic words. • For transliterated (foreign) words, the intra-word Arabic phonetic grammar is deployed to constrain the statistical search for the most likely diacritization matching the spelling of each input transliterated word. • The inter-word Arabic phonetic grammar is deployed (synthetically) to phonetically concatenate fully diacritized adjacent words of all kinds.

17. The Architecture of the Factorizing Arabic Phonetic Transcription System

18. 2- Arabic NLP Factorization Models - Arabic Morphological Structure: Morphemes • Arabic is a highly derivative and inflective language whose words can be decomposed into a relatively compact set of morphemes. • Our Arabic morphological model acknowledges the following inventories: • P: 260 prefixes. • Rd: 4,600 derivative roots. • Frd: 1,000 regular derivative patterns. • Fid: 300 irregularly derived words. • Rf: 260 roots of fixed words. • Ff: 300 fixed words. • Ra: 240 roots of Arabized words. • Fa: 290 Arabized words. • S: 550 suffixes. A toy sketch of how such inventories compose follows below.
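As a toy illustration, the sketch below enumerates analyses as (prefix, root, pattern, suffix) quadruples, the structures mentioned on slide 8; all morphemes here are Latin-letter stand-ins, not entries from RDI's lexicon, and the real model also covers the fixed and Arabized word classes listed above.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Quadruple:
    prefix: str   # one of the ~260 prefixes
    root: str     # e.g. one of the ~4,600 derivative roots
    form: str     # e.g. one of the ~1,000 regular derivative patterns
    suffix: str   # one of the ~550 suffixes

def analyses(word, prefixes, suffixes, stems):
    """Enumerate quadruples whose synthesis matches the spelling of `word`.
    `stems` maps a surface stem to its (root, pattern) decompositions."""
    out = []
    for p, s in product(prefixes, suffixes):
        if word.startswith(p) and word.endswith(s):
            stem = word[len(p):len(word) - len(s)]
            for root, form in stems.get(stem, []):
                out.append(Quadruple(p, root, form, s))
    return out

# Toy Latin-letter stand-ins for Arabic strings:
print(analyses("walkutub", prefixes=["", "w", "wal"], suffixes=["", "un"],
               stems={"kutub": [("ktb", "fuCuC")]}))
# [Quadruple(prefix='wal', root='ktb', form='fuCuC', suffix='')]
```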

19. 2- Arabic NLP Factorization Models - Arabic Morphological Structure: Lexicon A comprehensive Arabic lexicon has been built as the repository of the linguistic (orthographic, phonological, morphological, syntactic) description of each Arabic morpheme; all its possible mutual interactivities with other morphemes are registered as extensively as possible in a compact, structured format. This lexicon is the core of all our language factorizations.

20. 2- Arabic NLP Factorization Models - Canonical Structure of Arabic Morphology

21. The Multiplicity of Possible Arabic Lexical Analyses

22. 2- Arabic NLP Factorization Models - The Arabic Lexical Disambiguation Lattice After this process we obtain the diacritization of each Arabic word, except for the case-ending diacritics.

23. 2- Arabic NLP Factorization Models - The Arabic Case-Endings Disambiguation Lattice After this process we obtain the case-ending diacritics of each Arabic word.

24. 2- Arabic NLP Factorization Models - Inferring the Diacritization of Transliterated Words • Foreign names and terminology frequently appear as transliterated Arabic strings in real-life Arabic text, at a rate of about 7.5% ≈ 1/14. • These words are not constrained by the Arabic morphological or syntactic models. • A look-up-table-based approach is not a viable solution due to: • - Its incompleteness and poor coverage. • - Its intolerance of spelling variance. • - Its inability to attach Arabic infixes. • - Its lack of guaranteed compliance with Arabic phonology. • and, above all: • - The time-varying nature of this problem. • Our approach was therefore to go statistical at the phoneme level; however, this alone would generate too wide a search space and too high a perplexity to get good results. • To limit the search space, we constrain the search with another NLP model at the phonology layer: the intra-word Arabic phonetic grammar (a toy constraint check is sketched below).
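The sketch below illustrates the pruning idea only: the grammar rejects phonetically impossible candidates before the statistical search ranks the survivors. The two rules are deliberately simplified hypothetical stand-ins for the real intra-word grammar of slide 26.

```python
VOWELS = set("aiu")  # toy stand-ins for the short-vowel diacritics

def grammar_ok(candidate):
    """Reject phonetically impossible diacritizations (toy rules)."""
    # Rule 1 (hypothetical): no three bare consonants in a row.
    run = 0
    for ch in candidate:
        run = 0 if ch in VOWELS else run + 1
        if run >= 3:
            return False
    # Rule 2 (hypothetical): a word must not end in two bare consonants.
    return not (len(candidate) >= 2 and candidate[-1] not in VOWELS
                and candidate[-2] not in VOWELS)

def constrained_candidates(candidates, score):
    """Prune the lattice with the grammar, then rank survivors statistically."""
    legal = [c for c in candidates if grammar_ok(c)]
    return sorted(legal, key=score, reverse=True)

# Toy candidates for one transliterated word ("krstufar" is pruned):
cands = ["kristufar", "krstufar", "kiristufar"]
print(constrained_candidates(cands, score=lambda c: -len(c)))
```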

25. 2- Arabic NLP Factorization Models - Disambiguation Lattice of Transliterated Words After this process we obtain the most likely full diacritization of each transliterated word.

26. 2- Arabic NLP Factorization Models - Intra-Word Arabic Phonetic Grammar

27. 3- The Hybrid Factorizing/Un-factorizing Transcriptor - Adding the Un-factorizing Phonetic Transcriptor • The un-factorizing diacritizer simply tests the spelling of each input word against a dictionary of final-form words; i.e. a vocabulary list. • The possible diacritizations of each word in a sequence of input words that are all covered by that dictionary (henceforth a "segment") are directly retrieved without any language factorization. The resulting diacritization lattice of each segment is then statistically disambiguated. • Uncovered segments (along with the disambiguated diacritizations of the covered segments) are then sent to the factorizing transcriptor, which infers the most likely diacritization of the uncovered segments and phonetically concatenates the words in all segments. A sketch of this routing logic follows below.
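A sketch of the routing logic just described, under assumed control flow: maximal runs of dictionary-covered words form segments diacritized without factorization, and everything else falls through to the factorizing transcriptor. The back-ends here are trivial stand-ins for the components of the previous slides.

```python
from itertools import groupby

def split_segments(words, dictionary):
    """Yield (covered?, segment) pairs of maximal same-coverage runs."""
    for covered, run in groupby(words, key=lambda w: w in dictionary):
        yield covered, list(run)

def hybrid_diacritize(words, dictionary, disambiguate, factorize):
    out = []
    for covered, seg in split_segments(words, dictionary):
        if covered:
            # Lattice of dictionary diacritizations, then SLM disambiguation.
            out.extend(disambiguate([dictionary[w] for w in seg]))
        else:
            out.extend(factorize(seg))  # full rule-based route (slides 13-26)
    return out

# Toy usage with single-candidate dictionary entries and stub back-ends:
dic = {"ktb": ["kataba"], "alwld": ["alwaladu"]}
print(hybrid_diacritize(["alwld", "ktb", "XYZ"], dic,
                        disambiguate=lambda latt: [c[0] for c in latt],
                        factorize=lambda seg: [w + "?" for w in seg]))
# ['alwaladu', 'kataba', 'XYZ?']
```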

28. 3- The Hybrid Factorizing/Un-factorizing Transcriptor - The Architecture of the Hybrid Transcriptor

29. 4- Results Analysis - Experimental Evaluation of both Architectures Two sets of experiments and result analyses have been performed to evaluate our Arabic phonetic transcription work: • Experiments comparing the performance of the purely factorizing architecture with the hybrid factorizing/un-factorizing one. • Experiments comparing the performance of the better of our two architectures with the best-reported systems of rival R&D groups. The first set of experiments shows the hybrid architecture to outperform the purely factorizing one, while the second set shows our hybrid system to be superior to those of the rival groups.

30. 4- Results Analysis - Comparing with Best Rivals; Experimental Setup The two best rival systems reported in the published literature on the problem of full automatic Arabic phonetic transcription are: • N. Habash & O. Rambow's group at Columbia Univ., whose architecture is a language-factorizing one, with a Support Vector Machine statistical modeling/disambiguation tool (SVMTool); they also build an open-vocabulary SLM with Kneser-Ney smoothing using the SRILM toolkit (2007). • I. Zitouni, J. S. Sorensen, and R. Sarikaya's group at IBM's WRC, whose architecture is also a language-factorizing one, with a Maximum Entropy statistical modeling/disambiguation framework (2006). Both groups evaluated their performance by training and testing on LDC's Arabic Treebank of diacritized news stories (LDC2004T11; text, part 3, v1.0), published in 2004. This Arabic text corpus, which includes a total of 600 documents ≈ 340K words of AnNahar (Lebanese) newspaper text, is split into training data ≈ 288K words and test data ≈ 52K words.

31. 4- Results Analysis - Comparing with Best Rivals; Experimental Results In order to obtain a fair comparison with the work of Habash & Rambow's group and of Zitouni et al.'s group: • We used the same aforementioned training and test corpus from LDC's Treebank. • We adopted the same error-counting metrics when evaluating our hybrid system vs. theirs. As each of the other two groups deploys more sophisticated statistical tools than ours, one can attribute the superior performance of our system to hybridizing the un-factorizing transcriptor with the factorizing one in our system architecture.

32. 4- Results Analysis - Comparing the Factorizing to the Hybrid Architecture; Experimental Setup It is very insightful to know not only how much better the hybrid transcriptor is than the purely factorizing one, but also how the error margin evolves in both cases as the size of the annotated training text corpora increases. To this end, a domain-balanced annotated Arabic training corpus with a total size of 3,250K words has been developed (over years), in which manually supervised full Arabic morphological analysis and diacritization has been applied to every word. Another domain-balanced (tough) test set of 11K words has also been prepared in both annotated and un-annotated formats. At approximately log-scale steps of the training corpus size, the statistical models (with the same equivalent h) have been built, and the following metrics measured for each of the two architectures: • Error margin. • Average execution time per query. • Average size of the SLMs.

33. 4- Results Analysis - Comparing the Factorizing to the Hybrid Architecture; Experimental Results • Both systems asymptote to the same irreducible error margin. • Justification: despite being put in two different formats, the SLMs of both systems are built from the same data and hence have the same information content. • The hybrid system has a faster learning curve than the purely factorizing one. • Justification: the un-factorizing component suggests fewer candidate diacritizations (by looking the dictionary up) than the factorizing component (which generates all the possibilities), which in turn leads to less ambiguity. Due to NLP's Zipfian distribution, a small dictionary (built from small training data) can quickly capture the frequent words.

34. 4- Results Analysis - Comparing the Factorizing to the Hybrid Architecture; Experimental Results (cont'd) • The hybrid system has been found to be approximately twice as fast as the purely factorizing one in average execution time per transcription query. • Justification: the factorizing system spends time on extra language factorizations, while the hybrid system's slimmer lattice requires less A* search time. • The storage needed for the SLMs of the un-factorizing system has been found to be 8 times smaller (on average) than that of their equivalent counterparts in the purely factorizing one. • N.B. The storage needed for the SLMs of the hybrid system is the sum of those needed for the factorizing and un-factorizing components. • Justification: extra space is needed to store the much larger number of lower-order n-grams in the factorizing system than in the un-factorizing one.

35. Relevant Publications by: I- Competing Groups • (Columbia Univ. group) N. Habash, O. Rambow, Arabic Diacritization through Full Morphological Tagging, Proceedings of the Human Language Technologies Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2007. • (IBM group) I. Zitouni, J. S. Sorensen, R. Sarikaya, Maximum Entropy Based Restoration of Arabic Diacritics, Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL), Sydney, Australia, July 2006; http://www.ACLweb.org/anthology/P/P06/P06-1073.

36. Relevant Publications by: II- Our Group (RDI's) • 1- Rashwan, M., Al-Badrashiny, M., Attia, M., Abdou, S., Rafea, A., A Stochastic Arabic Diacritizer Based on a Hybrid of Factorized and Un-factorized Textual Features, IEEE Transactions on Audio, Speech, and Language Processing (TASLP), http://www.SignalProcessingSociety.org/Publications/Periodicals/TASLP (accepted, not yet published). • 2- Rashwan, M., Al-Badrashiny, M., Attia, M., Abdou, S., Rafea, A., A Stochastic Arabic Hybrid Diacritizer, 2009 IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE'09), http://caai.cn:8080/nlpke09/, Dalian, China, Sept. 2009. • 3- Al-Badrashiny, M., Automatic Diacritization for Arabic Texts, M.Sc. thesis, Dept. of Computer Engineering, Faculty of Engineering, Cairo University, June 2009: http://www.rdi-eg.com/rdi/Downloads/ArabicNLP/Mohamed-Badashiny_MSc-Thesis_June2009.pdf. • Continued on the next slide.

37. Relevant Publications by: II- Our Group (RDI's), cont'd • 4- Attia, M., Rashwan, M., Al-Badrashiny, M., Fassieh: a Semi-Automatic Visual Interactive Tool for the Morphological, PoS-Tags, Phonetic, and Semantic Annotation of Arabic Text, IEEE Transactions on Audio, Speech, and Language Processing (TASLP), Special Issue on Processing Morphologically Rich Languages, Vol. 17, Issue 5, pp. 916-925, http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?isnumber=5067414&arnumber=5075778&count=21&index=6, July 2009. • 5- Rashwan, M., Al-Badrashiny, M., Attia, M., Abdou, S., A Hybrid System for Automatic Arabic Diacritization, Proceedings of the 2nd International Conference on Arabic Language Resources and Tools, Cairo, Egypt, http://www.MEDAR.info/Conference_All/2009/index.php, Apr. 2009. • 6- Attia, M., Theory and Implementation of a Large-Scale Arabic Phonetic Transcriptor, and Applications, Ph.D. thesis, Dept. of Electronics and Electrical Communications, Faculty of Engineering, Cairo University, http://www.rdi-eg.com/rdi/technologies/papers.htm, Sept. 2005. • 7- Attia, M., A Large-Scale Computational Processor of the Arabic Morphology, and Applications, M.Sc. thesis, Dept. of Computer Engineering, Faculty of Engineering, Cairo University, http://www.rdi-eg.com/rdi/technologies/papers.htm, Jan. 2000.

38. Conclusions I- A given statistical disambiguation technique operating on either factorized or un-factorized sequences of linguistic entities asymptotes to the same disambiguation accuracy as the size of the annotated training corpora grows infinitely large. II- Disambiguating un-factorized sequences is easier to develop, computationally faster, and seems to have a faster "accuracy vs. training corpus size" learning curve. III- With highly generative linguistic phenomena (e.g. Arabic morphology), language factorization is necessary to handle the problem of coverage. IV- On the other hand, language factorization costs much R&D effort and is also more computationally expensive. V- In such cases, the optimal system can be built as a hybrid of the two approaches, so that the factorizing mode is resorted to only when some un-factorized entities in the i/p sequence are OOV.

39. Thank you for your attention. To probe further, please visit: http://www.RDI-eg.com/RDI/Technologies/Arabic_NLP.htm You may also contact: - Prof. Mohsen Rashwan: Mohsen_Rashwan@RDI-eg.com - Dr. Mohamed Attia: m_Atteya@RDI-eg.com
