One Size Fits All? A Simple Technique to Perform Several NLP Tasks




Presentation Transcript


  1. One Size Fits All? A Simple Technique to Perform Several NLP Tasks. Daniel Gayo Avello (University of Oviedo)

  2. Introduction
• blindLight is a modified vector model with applications to several NLP tasks:
• automatic summarization,
• categorization,
• clustering, and
• information retrieval.

  3. Vector Model vs. blindLight Model
Vector Model
• Document → D-dimensional vector of terms.
• D → number of distinct terms within the whole collection of documents.
• Terms → words / stems / character n-grams.
• Term weights → function of in-document term frequency, in-collection term frequency, and document length.
• Association measures → symmetric: Dice, Jaccard, Cosine, …
• Issues: document vectors are not stand-alone document representations, but representations with regard to the whole collection; curse of dimensionality → feature reduction; feature reduction when using n-grams as terms → ad hoc thresholds.
blindLight Model
• Different documents → different-length vectors.
• D? No collection → no D → no vector space!
• Terms → just character n-grams.
• Term weights → in-document n-gram significance (a function of document term frequency only).
• Similarity measure → asymmetric (in fact, two association measures); a kind of light pairwise alignment (A vs. B ≠ B vs. A).
• Advantages: document vectors are unique document representations, suitable for ever-growing document sets; bilingual IR is trivial; highly tunable by linearly combining the two association measures.
• Issues: not tuned yet, so poor performance with broad topics.

  4. What’s n-gram significance?
• Can we know how important an n-gram is within just one document, without regard to any external collection?
• A similar problem: extracting multiword items from text (e.g. European Union, Mickey Mouse, Cross Language Evaluation Forum).
• Solution by Ferreira da Silva and Pereira Lopes:
• Several statistical measures generalized to be applied to word n-grams of arbitrary length.
• A new measure: Symmetrical Conditional Probability (SCP), which outperforms the others.
• So, our proposal to answer the first question: if SCP reveals the most significant multiword items within just one document, it can be applied to rank the character n-grams of a document according to their significance.

  5. What’s n-gram significance? (cont.)
• Equations for SCP: (w1…wn) is an n-gram. Let’s suppose we use quad-grams and take (igni) from the text “What’s n-gram significance”. Its possible splits are:
• (w1…w1) / (w2…w4) = (i) / (gni)
• (w1…w2) / (w3…w4) = (ig) / (ni)
• (w1…w3) / (w4…w4) = (ign) / (i)
• For instance, p((w1…w1)) = p((i)) would be computed from the relative frequency of appearance within the document of n-grams starting with i (e.g. (igni), (ific), or (ican)).
• Likewise, p((w4…w4)) = p((i)) would be computed from the relative frequency of appearance within the document of n-grams ending with i (e.g. (m_si), (igni), or (nifi)).
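The SCP equations themselves appeared as figures on the original slide; the formula below is the standard SCP definition from Ferreira da Silva and Pereira Lopes, which the splits above appear to instantiate (a reconstruction, not a verbatim transcription):

\[
SCP\bigl((w_1 \ldots w_n)\bigr) = \frac{p\bigl((w_1 \ldots w_n)\bigr)^2}{Avp},
\qquad
Avp = \frac{1}{n-1} \sum_{i=1}^{n-1} p\bigl((w_1 \ldots w_i)\bigr)\, p\bigl((w_{i+1} \ldots w_n)\bigr)
\]

For the quad-gram (igni), Avp averages the products p((i))·p((gni)), p((ig))·p((ni)) and p((ign))·p((i)), which is exactly the list of splits above.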

  6. What’s n-gram significance? (cont.)
• The current implementation of blindLight uses quad-grams because…
• they provide better results than tri-grams, and
• their significances are computed faster than those of n-grams with n ≥ 5.
• How would it work if n-grams of different lengths were mixed within the same document vector? An interesting question to address in the future…
• Two example blindLight document vectors:
• Q document: Cuando despertó, el dinosaurio todavía estaba allí.
• T document: Quando acordou, o dinossauro ainda estava lá.
• Q vector (45 elements): {(Cuan, 2.49), (l_di, 2.39), (stab, 2.39), ..., (saur, 2.31), (desp, 2.31), ..., (ando, 2.01), (avía, 1.95), (_all, 1.92)}
• T vector (39 elements): {(va_l, 2.55), (rdou, 2.32), (stav, 2.32), ..., (saur, 2.24), (noss, 2.18), ..., (auro, 1.91), (ando, 1.88), (do_a, 1.77)}
• How can such vectors be numerically compared?
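A minimal Python sketch of how such a vector might be built, assuming raw SCP over quad-grams as the significance; the values on the slide (roughly 1.8 to 2.6) suggest the actual implementation applies an extra scaling, perhaps a logarithm, which is not reproduced here:

from collections import Counter

def quad_grams(text):
    """Overlapping character quad-grams; spaces shown as '_' as on the slides."""
    text = text.replace(" ", "_")
    return [text[i:i + 4] for i in range(len(text) - 3)]

def scp(ngram, counts, total):
    """Symmetrical Conditional Probability of one n-gram within one document.

    Prefix/suffix probabilities are estimated from the relative frequency of
    quad-grams starting/ending with that prefix/suffix, as described above.
    """
    n = len(ngram)
    p_full = counts[ngram] / total
    avp = 0.0
    for i in range(1, n):
        prefix, suffix = ngram[:i], ngram[i:]
        p_prefix = sum(c for g, c in counts.items() if g.startswith(prefix)) / total
        p_suffix = sum(c for g, c in counts.items() if g.endswith(suffix)) / total
        avp += p_prefix * p_suffix
    avp /= (n - 1)
    return p_full ** 2 / avp if avp > 0 else 0.0

def blindlight_vector(text):
    """Map a document to {quad-gram: significance}; no external collection needed."""
    grams = quad_grams(text)
    if not grams:
        return {}
    counts = Counter(grams)
    return {g: scp(g, counts, len(grams)) for g in counts}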

  7. Comparing blindLight doc vectors
• Some equations:
• Document total significance: SQ, the sum of the significances of every n-gram in document vector Q.
• Intersected document vector: Q Ω T, the n-grams shared by the two vectors being compared.
• Intersected document vector total significance: SQΩT.
• Pi and Rho “asymmetric” similarity measures: Π = SQΩT / SQ, Ρ = SQΩT / ST.

  8. Comparing blindLight doc vectors (cont.)
• Q doc vector: SQ = 97.52.
• T doc vector: ST = 81.92.
• Q Ω T vector: SQΩT = 20.48.
• Π = SQΩT / SQ = 20.48 / 97.52 = 0.21.
• Ρ = SQΩT / ST = 20.48 / 81.92 = 0.25.
• The dinosaur is still here…
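Building on the previous sketch, the comparison step might look as follows; which significance the real implementation keeps for a shared n-gram is not stated on the slides, so the use of the minimum here is an assumption:

def omega_intersect(q_vec, t_vec):
    """Ω-intersection: the n-grams common to both vectors.

    Assumption: each shared n-gram keeps the smaller of its two significances.
    """
    return {g: min(s, t_vec[g]) for g, s in q_vec.items() if g in t_vec}

def pi_rho(q_vec, t_vec):
    """Asymmetric similarity measures: Pi = S_QΩT / S_Q, Rho = S_QΩT / S_T."""
    s_qt = sum(omega_intersect(q_vec, t_vec).values())
    return s_qt / sum(q_vec.values()), s_qt / sum(t_vec.values())

# For example, with the two dinosaur sentences above:
# q = blindlight_vector("Cuando despertó, el dinosaurio todavía estaba allí.")
# t = blindlight_vector("Quando acordou, o dinossauro ainda estava lá.")
# pi, rho = pi_rho(q, t)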

  9. Clustering case study: “Genetic” classification of languages
The relation between ancestor and descendant languages is usually called a genetic relationship. Such relationships are displayed as a tree of language families. The comparative method looks for regular (i.e. systematic) correspondences in the lexicon and thus allows linguists to propose hypotheses about genetic relationships. Languages are subject not only to systematic changes but also to random ones, so the comparative method is “sensitive to noise”, especially when studying languages that diverged more than 10,000 years ago. Joseph H. Greenberg developed the so-called “mass lexical comparison” method, which compares large samples of equivalent words for two languages. Our experiment is quite similar to this mass comparison method and to the work done by Stephen Huffman using the Acquaintance technique.

  10. Clustering case study: “Genetic” classification of languages (cont.)
• Two different kinds of linguistic data:
• an orthographic version of the first three chapters of the Book of Genesis, and
• phonetic transcriptions of “The North Wind and the Sun”.
• The similarity measure used to compare document vectors was 0.5·Π + 0.5·Ρ.
• The clustering algorithm was similar to Jarvis-Patrick.
• The two resulting trees (clustering using orthographic data and clustering using phonetic data) are coherent with each other and consistent with linguistic theories.
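A brief sketch of the pairwise comparison that would feed such a clustering, reusing blindlight_vector and pi_rho from the earlier sketches and assuming the 0.5·Π + 0.5·Ρ combination; the Jarvis-Patrick-style grouping and tree drawing are left out:

from itertools import combinations

def similarity_matrix(samples):
    """Pairwise 0.5*Pi + 0.5*Rho similarities between language samples.

    `samples` maps a language name to its text (e.g. the Genesis chapters).
    """
    vectors = {lang: blindlight_vector(text) for lang, text in samples.items()}
    sims = {}
    for a, b in combinations(sorted(vectors), 2):
        pi, rho = pi_rho(vectors[a], vectors[b])
        sims[(a, b)] = 0.5 * pi + 0.5 * rho  # symmetric by construction
    return sims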

  11. Categorization case study: Language identification
• Categorization using blindLight is straightforward: each category vector is compared with the document, and the greater the similarity, the more likely the membership.
• Using the results of the previous experiment, the category vectors listed below were built to develop a language identifier. Many of them are “artificial”, obtained by intersecting several language vectors.
• The language identifier’s operation is simple. Let’s suppose an English sample of text:
• It is compared against Basque, Finnish, Italic, northGermanic and westGermanic.
• The most likely category is westGermanic, so…
• …it is compared against Dutch-German and English. The most likely is English, which is a final category.
• Category hierarchy: Basque; Finnish; Italic (Catalan-French: Catalan, French; Italian; Portuguese-Spanish: Portuguese, Spanish); northGermanic (Danish-Swedish: Danish, Swedish; Faroese; Norwegian); westGermanic (Dutch-German: Dutch, German; English).
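The slide does not show code for this descent, but a minimal sketch might look like the following; the tree literal mirrors the hierarchy above, category_vectors is assumed to hold the pre-built (possibly “artificial”) blindLight vectors, and using Π alone as the per-level similarity is an assumption:

# Leaf categories ({} values) are final answers; inner nodes hold merged vectors.
CATEGORY_TREE = {
    "Basque": {}, "Finnish": {},
    "Italic": {"Catalan-French": {"Catalan": {}, "French": {}},
               "Italian": {},
               "Portuguese-Spanish": {"Portuguese": {}, "Spanish": {}}},
    "northGermanic": {"Danish-Swedish": {"Danish": {}, "Swedish": {}},
                      "Faroese": {}, "Norwegian": {}},
    "westGermanic": {"Dutch-German": {"Dutch": {}, "German": {}},
                     "English": {}},
}

def identify_language(text, tree, category_vectors):
    """Descend the category tree, picking the most similar category at each level."""
    doc = blindlight_vector(text)
    best = None
    node = tree
    while node:
        best = max(node, key=lambda cat: pi_rho(doc, category_vectors[cat])[0])
        node = node[best]  # an empty dict marks a final category
    return best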

  12. Categorization case study: Language identification (cont.)
• Preliminary results using 1,500 posts from:
• soc.culture.basque
• soc.culture.catalan
• soc.culture.french
• soc.culture.galiza (Galician is not “known” by the identifier).
• soc.culture.german
• Posts were submitted in raw form, including the whole header, to check “noise tolerance”.
• It was found that actual samples of around 200 characters can be identified in spite of lengthy headers (500 to 900 characters).
• Results for Galician:
• As with the rest of the groups, plenty of spam (i.e. English posts).
• Most of the posts were written in Spanish.
• Posts actually written in Galician: 63% identified as Portuguese, 37% as Spanish. Graceful degradation?
• Results for the other languages: (table not included in the transcript).

  13. Information Retrieval using blindLight
• Π (Pi) and Ρ (Rho) can be linearly combined into different association measures to perform IR.
• Just two have been tested up to now: Π alone, and Π combined with a normalized Π·Ρ term, which performs slightly better. Ρ (and thus Π·Ρ) values are negligible when compared to Π, so a norm function scales Π·Ρ values into the range of Π values.
• IR with blindLight is pretty easy:
• For each document within the dataset a 4-gram vector is computed and stored.
• When a query is submitted to the system:
• A 4-gram vector (Q) is computed for the query text.
• For each doc vector (T):
• Q and T are Ω-intersected, obtaining Π and Ρ values.
• Π and Ρ are combined into a single association measure (e.g. piro).
• A list of documents ordered by decreasing similarity is built and returned to answer the query.
• Features and issues:
• No indexing phase: documents can be added at any moment (a plus).
• Comparing each query with every document is not really feasible with big data sets (a drawback).
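The slides name the combined measure only informally (“piro” and a norm function), so the scoring below is a sketch under the assumption that the measure is Π plus a Π·Ρ term min-max scaled into Π’s range; it reuses blindlight_vector and pi_rho from the earlier sketches:

def rank_documents(query_text, doc_vectors):
    """Score every stored document vector against the query and return ids, best first."""
    q = blindlight_vector(query_text)
    pis, piros = {}, {}
    for doc_id, t in doc_vectors.items():
        pi, rho = pi_rho(q, t)
        pis[doc_id], piros[doc_id] = pi, pi * rho
    lo, hi = min(piros.values()), max(piros.values())
    p_lo, p_hi = min(pis.values()), max(pis.values())

    def norm(x):
        # Scale a Pi*Rho value into the range of the Pi values (assumed form of "norm").
        return p_lo + (x - lo) * (p_hi - p_lo) / (hi - lo) if hi > lo else x

    scores = {d: pis[d] + norm(piros[d]) for d in doc_vectors}
    return sorted(scores, key=scores.get, reverse=True)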

  14. Bilingual IR with blindLight
• INGREDIENTS: two aligned parallel corpora (for instance, EuroParl); languages S(ource) and T(arget).
• METHOD:
• Take the original query written in natural language S (queryS).
• Chop the original query into chunks of 1, 2, …, L words.
• Find in the S corpus the sentences containing each of these chunks. Start with the longest chunks and, once you have found sentences for one chunk, delete its subchunks.
• Replace each of these S sentences with its equivalent T sentence.
• Compute an n-gram vector for every T sentence and Ω-intersect all the vectors obtained for each chunk.
• Mix all the Ω-intersected n-gram vectors into a unique query vector (queryT).
• Voilà! You have obtained a vector for a hypothetical queryT without having translated queryS.
• Example query (S = Spanish): Encontrar documentos en los que se habla de las discusiones sobre la reforma de instituciones financieras y, en particular, del Banco Mundial y del FMI durante la cumbre de los G7 que se celebró en Halifax en 1995. (“Find documents discussing the debates on the reform of financial institutions and, in particular, of the World Bank and the IMF during the G7 summit held in Halifax in 1995.”)
• Chunks: Encontrar; Encontrar documentos; Encontrar documentos en; …; instituciones; instituciones financieras; instituciones financieras y; …
• Spanish sentences containing “instituciones financieras”: (1315) …mantiene excelentes relaciones con las instituciones financieras internacionales. (5865) …el fortalecimiento de las instituciones financieras internacionales… (6145) La Comisión deberá estudiar un mecanismo transparente para que las instituciones financieras europeas…
• Their English equivalents: (1315) …has excellent relationships with the international financial institutions… (5865) …strengthening international financial institutions… (6145) The Commission will have to look at a transparent mechanism so that the European financial institutions…
• Ω-intersected n-grams for “instituciones financieras”: {al_i, anci, atio, cial, _fin, fina, ial_, inan_, _ins, inst, ions, itut, l_in, nanc, ncia, nsti, stit, tion, titu, tuti, utio}
• The resulting query vector mixes nice translated n-grams, nice un-translated n-grams, not-really-nice un-translated n-grams and some definitely-not-nice “noise” n-grams.
• Evaluation: we compared n-gram vectors for pseudo-translations with vectors for actual translations (source: Spanish, target: English).
• 38.59% of the n-grams within the pseudo-translated vectors are also within the actual-translation vectors.
• 28.31% of the n-grams within the actual-translation vectors are present in the pseudo-translated ones.
• A promising technique, but thorough work is still required.
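A sketch of the method above in Python; the parallel corpus is assumed to be a list of (source_sentence, target_sentence) pairs, the maximum chunk length L and the way vectors are “mixed” (here, keeping the maximum significance per n-gram) are assumptions, and blindlight_vector and omega_intersect come from the earlier sketches:

def pseudo_translate(query_s, parallel, max_len=3):
    """Build a target-language query vector without actually translating the query."""
    words = query_s.split()
    # Chunks of max_len .. 1 words, longest first, as the method prescribes.
    chunks = [" ".join(words[i:i + n])
              for n in range(max_len, 0, -1)
              for i in range(len(words) - n + 1)]
    matched = set()
    query_vec = {}
    for chunk in chunks:
        if any(chunk in longer for longer in matched):
            continue  # subchunk of an already matched, longer chunk: skip it
        t_vectors = [blindlight_vector(t) for s, t in parallel if chunk in s]
        if not t_vectors:
            continue
        matched.add(chunk)
        common = t_vectors[0]
        for v in t_vectors[1:]:
            common = omega_intersect(common, v)  # keep n-grams shared by all T sentences
        for gram, sig in common.items():
            query_vec[gram] = max(sig, query_vec.get(gram, 0.0))  # mix into queryT
    return query_vec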

  15. Information Retrieval Results
• Experiments with small collections:
• CACM (3,204 docs and 64 queries).
• CISI (1,460 docs and 112 queries).
• Results similar to those achieved by several systems, but not as good as those reached by, for instance, SMART.
• CLEF 2004 results:
• Monolingual IR on Russian documents: 72 documents found out of 123 relevant ones; average precision 0.14.
• Bilingual IR using Spanish to query English documents: 145 documents found out of 375 relevant ones; average precision 0.06.
• However, blindLight does not apply:
• stop word removal,
• stemming, or
• query term weighting.
• Problems arise especially with broad topics.

  16. Conclusions
• “Genetic” classification of languages (clustering) using blindLight:
• Coherent results for both orthographic and phonetic input.
• Results are also consistent with linguistic theories.
• Results useful for developing language identifiers.
• Language identification (categorization) using blindLight:
• Accuracy higher than 97%.
• Information-to-noise ratio around 2/7.
• Information retrieval performance must be improved. However:
• it is language independent, and
• bilingual IR is straightforward.
• To sum up, blindLight is an extremely simple technique which appears to be flexible enough to be applied to a wide range of NLP tasks, showing adequate performance in all of them.

  17. One Size Fits All? A Simple Technique to Perform Several NLP Tasks. Daniel Gayo Avello (University of Oviedo)
