
Machine Learning for Textual Information Access: Results from the SMART project

This paper presents the results from the SMART project, which investigates methods for cross-language information retrieval using dictionary adaptation and latent semantic methods based on Canonical Correlation Analysis.


Presentation Transcript


  1. Machine Learning for Textual Information Access: Results from the SMART project. Nicola Cancedda, Xerox Research Centre Europe. First Forum for Information Retrieval Evaluation, Kolkata, India, December 12th-14th, 2008.

  2. The SMART Project • Statistical Multilingual Analysis for Retrieval and Translation (SMART) • Information Society Technologies Programme • Sixth Framework Programme, “Specific Target Research Project” (STReP) • Start date: October 1, 2006 • Duration: 3 years • Objective: bring Machine Learning researchers to work on Machine Translation and CLIR

  3. The SMART Consortium

  4. The SMART Consortium

  5. Premise and Outline • Two classes of methods for CLIR investigated in SMART • Methods based on dictionary adaptation for the cross-language extension of the LM approach in IR • Latent semantic methods based on Canonical Correlation Analysis • Initial plan (reflected in abstract): to present both • ...but it would take too long, so: • Outline: • (Longish) introduction to state of the art in Canonical Correlation Analysis • A number of advances obtained by the SMART project • For lexicon adaptation methods: check out deliverable D 5.1 from the project website!

  6. Background: Canonical Correlation Analysis

  7. Canonical Correlation Analysis • Abstract view: • Word-vector representations of documents (or queries, or any other text spans) are only superficial manifestations of a deeper vector representation based on concepts. • Since they cannot be observed directly, these concepts are latent. • If two spans are translations of one another, their deep representations in terms of concepts are the same. • Can we recover (at least approximately) the latent concept space? Can we learn to map text spans from their superficial word appearance into their deep representation? • CCA: • Assume the mapping from the deep to the superficial representation is linear • Estimate the mapping from empirical data

  8. Five documents in the world of concepts [figure: five documents plotted as points in the latent concept space]

  9. The same five documents in two languages [figure: the same five documents as seen in each of the two language spaces]

  10. Finding the first Canonical Variates [figure: the two language views of the five documents, with candidate projection directions in each space]

  11. Finding the first Canonical Variates • Find the two directions, one for each language, such that the projections of the documents onto them are maximally correlated. • Assuming the data matrices X and Y are (row-wise) centered:

  \rho = \max_{w_x, w_y} \frac{w_x^\top X^\top Y \, w_y}{\sqrt{w_x^\top X^\top X \, w_x}\;\sqrt{w_y^\top Y^\top Y \, w_y}}

  • Maximizing the covariance (the numerator) "works back" the rotation; normalizing by the variances (the denominator) adjusts for "stretched" dimensions. • The first pair of canonical variates is expressed in the bases of X and Y respectively.

  12. Complexity: Finding the first Canonical Variate • Find the two directions, one for each language, such that the projections of the documents are maximally correlated • This turns out to be equivalent to finding the largest eigen-pair of a Generalized Eigenvalue Problem (GEP):

  \begin{pmatrix} 0 & X^\top Y \\ Y^\top X & 0 \end{pmatrix} \begin{pmatrix} w_x \\ w_y \end{pmatrix} = \lambda \begin{pmatrix} X^\top X & 0 \\ 0 & Y^\top Y \end{pmatrix} \begin{pmatrix} w_x \\ w_y \end{pmatrix}

  • Solving this dense GEP is cubic in the number of dimensions.
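  To make the construction concrete, here is a minimal NumPy/SciPy sketch of solving this GEP for the first canonical pair; it is not from the slides, and the small `eps` ridge term is an added assumption for numerical stability:

```python
import numpy as np
from scipy.linalg import eigh

def first_canonical_pair(X, Y, eps=1e-8):
    """Solve the CCA GEP of slide 12 for the first canonical pair.

    X: (m, nx) and Y: (m, ny) row-wise centered data matrices.
    eps is a small ridge term (an assumption, not in the slides)
    that keeps the right-hand-side matrix positive definite.
    """
    nx, ny = X.shape[1], Y.shape[1]
    Sxy = X.T @ Y
    A = np.block([[np.zeros((nx, nx)), Sxy],
                  [Sxy.T, np.zeros((ny, ny))]])
    B = np.block([[X.T @ X + eps * np.eye(nx), np.zeros((nx, ny))],
                  [np.zeros((ny, nx)), Y.T @ Y + eps * np.eye(ny)]])
    vals, vecs = eigh(A, B)          # symmetric-definite generalized eigenproblem
    w = vecs[:, -1]                  # eigenvector of the largest eigenvalue
    return vals[-1], w[:nx], w[nx:]  # first canonical correlation, w_x, w_y
```

  Solving this dense problem is cubic in n_x + n_y, which is what motivates the kernelized formulation on the following slides.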

  13. Finding further Canonical Variates • Assume we have already found the first i-1 pairs of canonical variates; the i-th pair again maximizes correlation, subject to its projections being uncorrelated with all previous ones. • This turns out to be equivalent to finding the remaining eigen-pairs of the same GEP.

  14. Examples from the Hansard Corpus

  15. Kernel CCA • Cubic complexity in the number of dimensions soon becomes intractable, especially with text • Also, it can be preferable to use similarity measures other than the inner product of (possibly weighted) document vectors •  Kernel CCA: move from the primal to the dual formulation, since it can be proved that w_{x,i} (resp. w_{y,i}) lies in the span of the columns of X (resp. Y)

  16. Complexity: Kernel CCA • The computation is again done by solving a GEP, now in the dual variables α and β, with kernel matrices K_x = X X^\top and K_y = Y Y^\top:

  \begin{pmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \lambda \begin{pmatrix} K_x^2 & 0 \\ 0 & K_y^2 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix}

  17. Overfitting • Problem: if m ≤ n_x and m ≤ n_y then there are (infinitely many) trivial solutions with perfect correlation: OVERFITTING • E.g. two (centered) points in R^2: given an arbitrary direction in the first space, we can always find one in the second whose unit-variance projections are perfectly correlated. Perfect correlation, no matter what direction! [figure: two points in each space, projected onto arbitrary directions with unit variances and unit covariance]

  18. Regularized Kernel CCA • We can regularize the objective function by trading correlation against a good account of variance in the two spaces, e.g. with a parameter κ ∈ [0, 1]:

  \rho = \max_{\alpha, \beta} \frac{\alpha^\top K_x K_y \, \beta}{\sqrt{\alpha^\top \left((1-\kappa) K_x^2 + \kappa K_x\right) \alpha}\;\sqrt{\beta^\top \left((1-\kappa) K_y^2 + \kappa K_y\right) \beta}}
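  A dense, minimal sketch of this regularized dual problem follows; the convex-combination form of the regularizer matches the formula above, while the tiny diagonal jitter is an added assumption to keep the solver stable:

```python
import numpy as np
from scipy.linalg import eigh

def regularized_kcca(Kx, Ky, kappa=0.1):
    """First canonical pair for regularized kernel CCA (slide 18).

    Kx, Ky: (m, m) centered kernel matrices; kappa in [0, 1] trades
    correlation against a good account of variance in the two spaces.
    """
    m = Kx.shape[0]
    Z = np.zeros((m, m))
    A = np.block([[Z, Kx @ Ky], [Ky @ Kx, Z]])
    Rx = (1 - kappa) * Kx @ Kx + kappa * Kx
    Ry = (1 - kappa) * Ky @ Ky + kappa * Ky
    B = np.block([[Rx, Z], [Z, Ry]]) + 1e-8 * np.eye(2 * m)  # jitter: assumption
    vals, vecs = eigh(A, B)
    return vals[-1], vecs[:m, -1], vecs[m:, -1]  # rho, alpha, beta
```

  With kappa = 0 this reduces to plain KCCA, which overfits exactly as slide 17 describes; kappa = 1 turns the criterion into a covariance-like (PLS-style) one.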

  19. Multiview CCA [figure: the same five documents as seen in several languages] • (K)CCA can take advantage of the "mutual information" between two languages... • ...but what if we have more than two? Can we benefit from multiple views? • Also known as Generalised CCA.

  20. Multiview CCA • There are many possible ways to combine the pairwise correlations between views (e.g. sum, product, min, ...). • Chosen approach: SUMCOR [Horst-61], which maximizes the sum of the pairwise correlations. With a slightly different regularization than above, this becomes a Multivariate Eigenvalue Problem (MEP) coupling all k views: schematically, for each view i,

  \sum_{j \neq i} K_i K_j \, \beta_j = \lambda_i \left( (1-\kappa) K_i^2 + \kappa K_i \right) \beta_i

  21. Multiview CCA • Multivariate Eigenvalue Problems (MEPs) are much harder to solve than GEPs: • [Horst-61] introduced an extension to MEPs of the standard power method for EPs, for finding only the first set of canonical variates • Naïve implementations would be quadratic in the number of documents, and would scale up to no more than a few thousand documents

  22. Innovations from SMART

  23. Innovations from SMART • Extensions of the Horst algorithm [Rupnik and Shawe-Taylor] • Efficient implementation linear in the number of documents • Version for finding many sets of canonical variates • New regression-CCA framework for CLIR [Rupnik and Shawe-Taylor] • Sparse KCCA [Hussain and Shawe-Taylor]

  24. Efficient Implementation of the Horst algorithm • The Horst algorithm starts with a random set of vectors, then iteratively multiplies by the MEP matrix and renormalizes until convergence. • The inner loop requires k^2 matrix-vector multiplications, each O(m^2). • Extension (1): exploiting the structure of the MEP matrix, one can refactor the computation and save an O(k) factor in the inner loop. • Extension (2): exploiting the sparseness of the document vectors, one can replace each multiplication with a kernel matrix (O(m^2)) by two multiplications with the document matrix (O(ms) each, where s is the maximum number of non-zero components in a document vector). The inner loop thus becomes O(kms) instead of O(k^2 m^2). • Leveraging this same sparsity, kernel inversions can be replaced by cheaper numerical linear-system resolutions.
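  A minimal sketch of the resulting inner loop, combining Extensions (1) and (2): each kernel product K_j β_j with K_j = X_j X_j^T is factored into two products with the (sparse) document matrix, and the sum over views is computed once per iteration. The plain Euclidean renormalization and the fixed iteration count are simplifying assumptions:

```python
import numpy as np

def horst_first_variates(Xs, iters=100, seed=0):
    """Horst power iteration for the first set of canonical variates.

    Xs: list of k centered document matrices of shape (m, n_j), ideally
    scipy.sparse, so each product costs O(ms) with s the maximum number
    of non-zero components per document vector.
    """
    rng = np.random.default_rng(seed)
    k, m = len(Xs), Xs[0].shape[0]
    betas = [rng.standard_normal(m) for _ in range(k)]
    for _ in range(iters):
        # t[j] = K_j beta_j, computed as X_j (X_j^T beta_j): O(ms) each
        t = [X @ (X.T @ b) for X, b in zip(Xs, betas)]
        total = sum(t)  # computed once per iteration: this is the O(k) saving
        for i, X in enumerate(Xs):
            v = X @ (X.T @ (total - t[i]))  # K_i * sum_{j != i} K_j beta_j
            betas[i] = v / np.linalg.norm(v)
    return betas
```

  Overall the inner loop is O(kms), against O(k^2 m^2) for the naïve kernel-matrix version.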

  25. Extended Horst algorithm for finding many sets of canonical variates • The Horst algorithm only finds the first set of k canonical variates • Extension (3): maintain projection matrices P_i^t that, at each iteration, project the β_i^t's onto the subspace orthogonal to all previously found canonical variates for view i. Finding d sets of canonical variates can be done in O(d^2 mks). This scales up!
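  A sketch of that projection step, under the assumption that the previously found variates for view i are stored as orthonormal vectors:

```python
import numpy as np

def project_out(beta, prev):
    """Project beta onto the subspace orthogonal to the previously
    found canonical variates of this view (Extension 3).

    prev: list of orthonormal vectors of shape (m,).
    """
    for u in prev:
        beta = beta - (u @ beta) * u
    return beta
```

  Applying this inside each Horst iteration forces convergence to the next, not-yet-found set of variates instead of the first one.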

  26. MCCA: Experiments • Experiments: mate retrieval with Europarl • 10 languages • 100,000 10-way aligned sentences for training • 7,873 10-way aligned sentences for testing • Document vectors: uni-, bi- and tri-grams (~200k features for each language), TF*IDF weighting and length normalization • MCCA used to extract d = 100-dimensional subspaces • Baseline alternatives for selecting the new basis: • k-means clustering centroids on concatenated multilingual document vectors • CL-LSI, i.e. LSI on the concatenated vectors

  27. Some example latent vectors

  28. MCCA experiment results • Measure: recall in Top 10, averaged over 9 languages

  29. MCCA experiment results • More realistic experiment: pseudo-queries are now formed from the top 5 TF*IDF-scoring components of each sentence
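  For concreteness, a minimal sketch of this pseudo-query construction, assuming dense TF*IDF sentence vectors:

```python
import numpy as np

def pseudo_query(tfidf_row, top=5):
    """Keep only the `top` highest TF*IDF components of a sentence
    vector, zeroing all others (slide 29's pseudo-queries)."""
    q = np.zeros_like(tfidf_row)
    keep = np.argsort(tfidf_row)[-top:]
    q[keep] = tfidf_row[keep]
    return q
```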

  30. Extension (4): Regression-CCA • Given a query q in one language, find the target-language vector w which is maximally correlated with it • Solution: a closed-form, regression-style expression for w • Given this "query translation" we can then find the closest target documents using the standard cosine measure • Promising initial results on the CLEF/GIRT dataset: better than standard CCA, but the method cannot take the thesaurus into account, so MAP is still not competitive with the best systems
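  A hypothetical sketch of the two steps, reading the "query translation" as a ridge regression from the source view to the target view; the exact closed form on the original slide is not reproduced here, so the formula below is an assumption consistent with the slide's description:

```python
import numpy as np

def translate_query(X, Y, q, kappa=1.0):
    """Map a source-language query vector q (nx,) to a target-language
    vector w (ny,) via a ridge-regression fit of Y on X. Assumption:
    this stands in for the slide's closed-form solution.
    X: (m, nx), Y: (m, ny) centered, aligned document matrices."""
    nx = X.shape[1]
    B = np.linalg.solve(X.T @ X + kappa * np.eye(nx), X.T @ Y)  # (nx, ny)
    return B.T @ q

def rank_by_cosine(D, w):
    """Rank target documents (rows of D) by cosine similarity to w."""
    sims = (D @ w) / (np.linalg.norm(D, axis=1) * np.linalg.norm(w) + 1e-12)
    return np.argsort(-sims)
```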

  31. Extension (5): Sparse KCCA • Seek sparsity in the dual solution: the first canonical variates are expressed as linear combinations of only relatively few documents • Improved efficiency • Can also be viewed as an alternative regularization • Both directions are constrained to the same set of indices i

  32. Sparse KCCA • For a fixed set of indices i, the GEP is solved restricted to the corresponding rows and columns • But how do we select i?
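  Before turning to the selection of i, a sketch of the fixed-support step, under the assumption that the restricted problem keeps the same form as the regularized GEP of slide 18 (dual vectors supported on the index set shrink every matrix from m x m to m x d or d x d):

```python
import numpy as np
from scipy.linalg import eigh

def skcca_fixed_support(Kx, Ky, idx, kappa=0.1):
    """Solve the regularized KCCA GEP restricted to support set idx."""
    Kxi, Kyi = Kx[:, idx], Ky[:, idx]  # (m, d) slices; Kx, Ky symmetric
    d = len(idx)
    Z = np.zeros((d, d))
    A = np.block([[Z, Kxi.T @ Kyi], [Kyi.T @ Kxi, Z]])
    Rx = (1 - kappa) * Kxi.T @ Kxi + kappa * Kx[np.ix_(idx, idx)]
    Ry = (1 - kappa) * Kyi.T @ Kyi + kappa * Ky[np.ix_(idx, idx)]
    B = np.block([[Rx, Z], [Z, Ry]]) + 1e-8 * np.eye(2 * d)  # jitter: assumption
    vals, vecs = eigh(A, B)
    return vals[-1], vecs[:d, -1], vecs[d:, -1]  # rho, alpha_i, beta_i
```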

  33. Sparse KCCA: Algorithms • Algorithm 1 (greedy, with deflation): initialize; for i = 1 to d do: select an index and deflate the kernel matrices; end for; solve the GEP for the index set i • Algorithm 2: set i to the indices of the top d values of the selection score; solve the GEP for the index set i • Deflation consists in transforming the matrices to reflect a projection onto the space orthogonal to the current basis in feature space

  34. Sparse KCCA: Mate retrieval experiments (Europarl, English-Spanish)

  Method      Train time    Test time
  KCCA        24693 sec.    27733 sec.
  SKCCA (1)    5242 sec.      698 sec.
  SKCCA (2)    1873 sec.      695 sec.

  35. SMART Website • Project presentation and deliverables: D 5.1 on lexicon-based methods, D 5.2 on CCA • http://www.smart-project.eu

  36. SMART - Dissemination and Exploitation • Platforms for showcasing developed tools:

  37. Thank you!

  38. Shameless plug • Cyril Goutte, Nicola Cancedda, Marc Dymetman and George Foster, eds: Learning Machine Translation, MIT Press, to appear in 2009.

  39. References • [Horst-61] P. Horst. Relations among m sets of measures. Psychometrika, 26:129-149, 1961.

  40. Self-introduction • Machine Learning (kernels for text) • Text Categorization • Grammar Learning • (Statistical) Machine Translation (ca. 2004) • Natural Language Generation
