# Kernel Canonical Correlation Analysis

##### Presentation Transcript

1. Kernel Canonical Correlation Analysis: Cross-language information retrieval • Blaz Fortuna, JSI, Slovenia

2. Input Two different views of the same data: • Text documents written in different languages • Images with attached text • …

3. Goal Find pairs of features from both views with the highest correlations • Example: words that co-appear in a document and its translation • Auto, Fahrzeug, … / car, vehicle, … • Fleisch, Hähnchen, Rindfleisch, Schweinerne, … / meat, chicken, beef, pork, …

4. Theory behind CCA • Documents are represented as pairs of vectors, one for each view • The result of CCA is a set of basis vectors for each view such that the correlation between the projections of the variables onto these basis vectors is mutually maximized
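
The objective on this slide can be written out as follows. This is the standard textbook form of CCA rather than a formula taken from the slides; the symbols are assumptions introduced here: $w_x$ and $w_y$ are the basis vectors for the two views, $C_{xx}$ and $C_{yy}$ are the within-view covariance matrices, and $C_{xy}$ is the cross-covariance matrix between the views.

```latex
\rho = \max_{w_x,\, w_y}
\frac{w_x^{\top} C_{xy}\, w_y}
     {\sqrt{w_x^{\top} C_{xx}\, w_x}\;\sqrt{w_y^{\top} C_{yy}\, w_y}}
```

Further pairs of basis vectors are obtained by maximizing the same ratio subject to being uncorrelated with the pairs already found.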

5. Kernelisation of CCA • The method can be rewritten so that feature vectors only appear inside inner products • We can use a kernel to calculate the inner products • Input documents do not need to be vectors (e.g. text documents together with a string kernel)
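
A minimal NumPy sketch of the kernelised method may help here. It assumes the dual form in which each basis vector is a combination of training examples, so that only the kernel matrices `Kx` and `Ky` are needed, and it adds a ridge-style regulariser `kappa`. The regulariser, the function names, and the toy data are all assumptions for illustration; the slides do not say which regularisation or solver was used.

```python
import numpy as np

def center_kernel(K):
    """Center a kernel matrix in feature space: K_c = H K H."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kcca(Kx, Ky, kappa=0.1, n_components=2):
    """Regularised KCCA sketch (assumed formulation, not the author's code).

    Maximises alpha^T Kx Ky beta subject to
    ||(Kx + kappa I) alpha|| = ||(Ky + kappa I) beta|| = 1,
    which reduces to an SVD of (Kx + kappa I)^-1 Kx Ky (Ky + kappa I)^-1.
    """
    n = Kx.shape[0]
    Kx, Ky = center_kernel(Kx), center_kernel(Ky)
    A = Kx + kappa * np.eye(n)
    B = Ky + kappa * np.eye(n)
    # Ky and B commute, so solve(B, Ky) equals Ky @ inv(B).
    T = np.linalg.solve(A, Kx) @ np.linalg.solve(B, Ky)
    U, s, Vt = np.linalg.svd(T)
    alpha = np.linalg.solve(A, U[:, :n_components])
    beta = np.linalg.solve(B, Vt[:n_components].T)
    return alpha, beta, s[:n_components]

# Toy paired data: both views contain a noisy copy of the same hidden signal z.
rng = np.random.default_rng(0)
z = rng.normal(size=(60, 1))
X = np.hstack([z + 0.1 * rng.normal(size=(60, 1)), rng.normal(size=(60, 2))])
Y = np.hstack([rng.normal(size=(60, 2)), z + 0.1 * rng.normal(size=(60, 1))])

# Linear kernels are used here; any positive semi-definite kernel
# (for example a string kernel on raw text) could be substituted.
alpha, beta, rho = kcca(X @ X.T, Y @ Y.T, kappa=0.1, n_components=2)
print(rho)  # regularised canonical correlations, largest first
```

Because the inputs only enter through `Kx` and `Ky`, swapping the linear kernel for another kernel changes nothing else in the sketch. Some regularisation is needed in practice: with full-rank kernel matrices the unregularised problem finds perfect but meaningless correlations on the training set.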

6. Cross-Language Text Mining • KCCA constructs a language-independent representation for text documents • Good part: documents from different languages can be compared using this representation • Bad part: a paired dataset is needed for training (can be avoided using machine translation tools)

7. KCCA and LSI • LSI discovers the statistically most significant co-occurrences of terms in documents • When a word appears in a document, what other words usually also appear? • KCCA matches terms from the first language with terms from the second based on co-occurrences • When a word appears in a document, does it also appear in its translation?
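
For comparison, the LSI side of this slide can be sketched as a truncated SVD of a term-document matrix. The tiny matrix, the function name `lsi`, and the choice of two latent dimensions below are made up for illustration and are not taken from the experiments.

```python
import numpy as np

def lsi(term_doc, k):
    """Truncated SVD of a (terms x documents) matrix.

    Returns term directions (terms x k) and document coordinates (docs x k).
    """
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    term_dirs = U[:, :k]
    doc_coords = Vt[:k].T * s[:k]
    return term_dirs, doc_coords

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rows: car, vehicle, meat, chicken. Columns: four toy documents.
term_doc = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])

term_dirs, doc_coords = lsi(term_doc, k=2)
sim_same_topic = cosine(doc_coords[0], doc_coords[1])
sim_other_topic = cosine(doc_coords[0], doc_coords[2])
print(sim_same_topic, sim_other_topic)
```

Documents 0 and 1 use "car" and "vehicle" in different proportions but end up on the same latent direction, because the two words co-occur. LSI works within one language in this way, while KCCA looks for such directions across a document and its translation.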

8. Text document retrieval • Query databases with multilingual documents • Documents from the database and the query are transformed into a language-independent representation • The nearest neighbours of the query in this representation are returned
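
The nearest-neighbour step can be sketched as follows, assuming the query and the documents have already been projected into the shared representation. The slide does not name a similarity measure, so cosine similarity is an assumption here, and the vectors and the function name `retrieve` are hypothetical.

```python
import numpy as np

def retrieve(query_vec, doc_vecs, top_k=10):
    """Rank documents by cosine similarity to the query (assumed measure)."""
    q = query_vec / np.linalg.norm(query_vec)
    D = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = D @ q
    order = np.argsort(-scores)[:top_k]
    return order.tolist(), scores[order].tolist()

# Hypothetical language-independent vectors for three documents and one query.
doc_vecs = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.7, 0.7],
])
query_vec = np.array([1.0, 0.05])

ranking, scores = retrieve(query_vec, doc_vecs, top_k=3)
print(ranking)
```

The language of the query and of the documents no longer matters at this point: each side is projected with the basis learned for its own view, and the comparison happens in the common space.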

9. Experiments • 36th Canadian Parliament proceedings corpus • Part of the documents was used for training • For testing, the 5 most relevant keywords were extracted from a document and used as a query • English queries, French documents: retrieval accuracy (top-ranked/top-ten-ranked) [%]
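
The top-ranked/top-ten-ranked accuracy can be computed along the following lines. The sketch assumes that a query counts as a hit when the document paired with its source (its translation) appears among the first k retrieved results; the slide does not spell this out, and the rankings below are toy values, not results from the experiment.

```python
def top_k_accuracy(rankings, targets, k):
    """Percentage of queries whose target document is among the top k results."""
    hits = sum(1 for ranked, target in zip(rankings, targets) if target in ranked[:k])
    return 100.0 * hits / len(targets)

# Toy rankings for three queries (document ids, best match first)
# and the id of the paired document each query should retrieve.
rankings = [[3, 1, 2], [0, 2, 1], [2, 0, 1]]
targets = [3, 2, 1]

acc_top1 = top_k_accuracy(rankings, targets, k=1)
acc_top2 = top_k_accuracy(rankings, targets, k=2)
print(acc_top1, acc_top2)
```

Setting `k=1` gives the top-ranked figure and `k=10` the top-ten-ranked figure.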

10. Text categorization • Categorize multilingual documents • All documents are transformed into a language-independent representation • A classifier is trained on the transformed labelled documents

11. Experiments • NTCIR-3 patent retrieval test collection • Japanese – English • SVM trained on English documents • Tested on both Japanese and English documents • Average precision [%]

12. Image-Text Retrieval • Retrieval of images based on a text query • No labels associated with images • Paired dataset: • Image retrieved from internet • Text on web page where image appeared

13. Experiments • Querying a database of images with text queries • Images were split into three clusters • The 10 or 30 images that best match the query are retrieved • In the first test, success is when the retrieved images have the same label as the query • In the second test, success is when the image that actually matched the query is retrieved

14. Images retrieved for the text query: “height: 6-11 weight: 235 lbs position: forward born: september 18, 1968, split, croatia college: none”

15. “at phoenix sky harbor on july 6, 1997. 757-2s7, n907wa phoenix suns taxis past n902aw teamwork america west america west 757-2s7, n907wa phoenix suns taxis past n901aw arizona at phoenix sky harbor on july 6, 1997.”

16. Future work • Use of machine translation for making a paired dataset • Experiments with the SVEZ-IJS English-Slovene ACQUIS Corpus • Sparse version of KCCA