CERN

CERN, European Organization for Nuclear Research. Automatic Keyword Assignment for High Energy Physics Literature. Arturo Montejo Ráez, ETT/SI Data Handling Group, CERN, Geneva (Switzerland). Joint Research Center, Ispra (Italy), 4 March 2002.

Presentation Transcript


  1. CERN, European Organization for Nuclear Research. Automatic Keyword Assignment for High Energy Physics Literature. Arturo Montejo Ráez, ETT/SI Data Handling Group, CERN, Geneva (Switzerland). Joint Research Center, Ispra (Italy), 4 March 2002

  2. European Organization for Nuclear Research, Data Handling Group, CERN. What we are going to see today... • Keyword assignment process • Why keywords? • How it is done for High Energy Physics papers • The HEPindexer project: Data, Algorithm, Experiments, Results • Future work. Automatic Keywording for HEP literature, Ispra (Italy), 4 March 2002

  3. Keyword assignment process (diagram labels: Authors, Indexer, Keyworded papers)

  4. Keyword assignment process

  5. Keyword assignment process. The document... • Full text paper • Stored in a database • Simplified representation needed

  6. Keyword assignment process. The thesaurus... • Controlled vocabulary of concepts • Relationships between keywords • Categories and subcategories • Can be domain specific • Can be translated into multiple languages

  7. Keyword assignment process. The thesaurus: a relational model for terms
     cheese
       MT  6016 processed agricultural produce
       BT1 milk product
       NT1 blue-veined cheese
       NT1 cow's milk cheese
       NT1 fresh cheese
       NT1 goat's milk cheese
       NT1 hard cheese
       NT1 processed cheese
       NT1 semi-soft cheese
       NT1 sheep's milk cheese
       NT1 soft cheese
       RT  cheese factory (6031)
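The relational entry on this slide can be pictured as a small record with one field per relation type. The following is only a minimal sketch: the class name `ThesaurusEntry`, its field names, and the helper `expand_query` are hypothetical and do not come from the presentation. Only the "cheese" data is taken from the slide, abbreviated to three narrower terms.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    """One thesaurus term and its relations (hypothetical structure)."""
    term: str
    microthesaurus: str = ""                       # MT: subject field of the term
    broader: list = field(default_factory=list)    # BT: more general terms
    narrower: list = field(default_factory=list)   # NT: more specific terms
    related: list = field(default_factory=list)    # RT: associated terms


# The "cheese" entry from the slide, abbreviated to three narrower terms.
cheese = ThesaurusEntry(
    term="cheese",
    microthesaurus="6016 processed agricultural produce",
    broader=["milk product"],
    narrower=["blue-veined cheese", "fresh cheese", "hard cheese"],
    related=["cheese factory"],
)


def expand_query(entry):
    """Return the term plus its narrower terms, e.g. to widen a search."""
    return [entry.term] + entry.narrower


print(expand_query(cheese))
```

Following the NT links is one way a search for a general concept could also retrieve documents indexed under its more specific terms.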

  8. Keyword assignment process. The thesaurus: a subject tree
     04 POLITICS
       0406 political framework
       0411 political party
       0416 electoral procedure and voting
       0421 parliament
       0426 parliamentary proceedings
       0431 politics and public safety
       0436 executive power and public service
     08 INTERNATIONAL RELATIONS
       0806 international affairs
       0811 cooperation policy
       0816 international balance
       0821 defence
     10 EUROPEAN COMMUNITIES
       1006 Community institutions and European civil service
       1011 Community law
       1016 European construction
       1021 Community finance

  9. Keyword assignment process. The indexer... • An expert in the domain of the documents • An expert in the use of the thesaurus • Heavy task • Does not always propose the same keywords • Expensive!

  10. Why keywords? • Allow documents to be indexed in a coherent way • Can be viewed like the "index" at the end of a book • Concepts that better represent the content • Human made (value added) • Meaningful • Can establish relations between documents • Multilingual

  11. Why keywords? Access to documents. But... we already have fulltext indexing!

  12. Why keywords? Classification: • To store (libraries) • To access (narrow searches) (diagram labels: Category 1, Category 2, Category 3)

  13. Why keywords? Crosslingual access (diagram labels: Razor?, Navaja, Razor, Couteau, Lametta)

  14. Why keywords? Multilingual comparison (diagram labels: Murder, Lametta, Razor, Fabbrica, Lametta, Razor)

  15. Why keywords? Advantages over fulltext searches: • No ambiguity • Better relevance and precision. More advanced tools for searching and classification are coming!

  16. Why keywords? The BIG problem... EXPENSIVE

  17. Why keywords? The BIG problem? EXPENSIVE?

  18. Why keywords? The BIG problem? EXPENSIVE?

  19. CERN • The world's largest particle physics centre • Explores what matter is made of, and what forces hold it together • Employs just under 3,000 people • 6,500 scientists come to CERN for their research

  20. How it is done for High Energy Physics papers. DESY: Deutsches Elektronen-Synchrotron (Hamburg, Germany) • DESY thesaurus • Group of indexers (students, experts...) • Only High Energy Physics related papers

  21. How it is done for High Energy Physics papers. The DESY thesaurus
     A
       *a4(2040) ('postulated particle, a4(2040)', was delta(2040))
       *a6(2450) ('postulated particle, a6(2450)', was delta(2450))
       *abelian
       *aberration
       absorption
       -absorptive model (model, absorption)
       accelerator
       . . .
     B
       B
       B anti-B
       B+
       B+L number
       B*(5320) (excited B)
       -B** ('B*2...', similar for B/s, etc.)
       *B*2(5732) (postulated particle, B*2(5732))
       B-
       -B-factory (B, particle source)
       B-L number
       . . .

  22. How it is done for High Energy Physics papers. The DESY thesaurus: • Few categories, rarely used • Only two types of keywords: main keywords (1,191) and secondary keywords (949) • No relationships between terms • Specific terminology

  23. How it is done for High Energy Physics papers. The DESY thesaurus: specific terminology • Energy declarations: 1.5-2.7 GeV-cms • Resonances: Delta (1232) • Reaction equations: anti-p p ---> K0 K- pi+ • Combinations: angular distribution, (photon), mass spectrum (pi+ pi- pi0) • Two-particle initial state: 'anti-p p', 'electron positron'

  24. How it is done for High Energy Physics papers. The problem: more than 500 preprints per week! (diagram labels: Physicists, Indexer)

  25. The HEPindexer project. The solution (diagram labels: Physicists, Indexer, Keyworded papers)

  26. The HEPindexer project • Use of IR techniques • Objective evaluation • Real-time answer • Easily portable • Fully integrable into CDS • Possibility of growing • Fully automatic mode and assistant tool

  27. The HEPindexer project (diagram labels: Keyword, Term, Keyworded papers (collection))

  28. The HEPindexer project (diagram labels: Documents, DESY keywords, Keyword, Term)

  29. The HEPindexer project: Data • 3,661 documents (2,441 in the training collection, 1,220 in the test collection) • 19,143 terms • 1,191 main keywords

  30. The HEPindexer project: Algorithm

  31. The HEPindexer project: Algorithm. Preprocessing • Punctuation • Lower case • Remove stop words • Stemming
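The four preprocessing steps listed on this slide can be sketched as a short pipeline. This is an illustration under stated assumptions, not the HEPindexer implementation: the stop-word list is a tiny hypothetical sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's.

```python
import string

# Hypothetical stop-word list; a real system would use a much longer one.
STOP_WORDS = {"the", "of", "a", "an", "in", "and", "is", "are", "for", "to"}

# Naive suffix stripper standing in for a real stemmer (e.g. Porter's).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    # 1. Punctuation: replace every punctuation character with a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # 2. Lower case, then split on whitespace.
    tokens = text.lower().split()
    # 3. Remove stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 4. Stemming.
    return [naive_stem(t) for t in tokens]


print(preprocess("Measuring the decays of B mesons in electron-positron collisions."))
```

Replacing punctuation with spaces, rather than deleting it, keeps hyphenated words such as "electron-positron" as two separate tokens.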

  32. The HEPindexer project: Algorithm • Weight term-document • Weight keyword-document • Weight keyword-term • Similarity keyword-document
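The transcript names four quantities on this slide but does not preserve their formulas. The sketch below shows one conventional way such weights could fit together, and every choice in it is an assumption: TF-IDF for the term-document weight, a binary keyword-document weight (1 if the indexer assigned the keyword), a keyword-term weight accumulated over the training documents, and cosine similarity between a keyword's term vector and a new document. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(docs, doc_keywords):
    """docs: list of token lists; doc_keywords: one keyword set per document."""
    n_docs = len(docs)
    df = Counter(t for tokens in docs for t in set(tokens))
    idf = {t: math.log(n_docs / df[t]) for t in df}
    kw_term = defaultdict(lambda: defaultdict(float))
    for tokens, keywords in zip(docs, doc_keywords):
        tf = Counter(tokens)
        for k in keywords:                   # keyword-document weight: 1 if assigned
            for t, f in tf.items():
                kw_term[k][t] += f * idf[t]  # add the TF-IDF term-document weight
    return idf, kw_term


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_keywords(tokens, idf, kw_term):
    """Order keywords by similarity to a new, preprocessed document."""
    tf = Counter(tokens)
    doc_vec = {t: f * idf[t] for t, f in tf.items() if t in idf}
    scores = {k: cosine(vec, doc_vec) for k, vec in kw_term.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy training collection, for illustration only.
docs = [
    ["quark", "mass", "lattice"],
    ["neutrino", "oscillation", "mass"],
    ["quark", "gluon", "plasma"],
]
doc_keywords = [{"quark"}, {"neutrino"}, {"quark"}]

idf, kw_term = train(docs, doc_keywords)
print(rank_keywords(["gluon", "quark", "lattice"], idf, kw_term))
```

In this toy run the new document shares no terms with the documents indexed under "neutrino", so that keyword scores zero and ranks last. A real system would cut the ranked list at a threshold or a fixed number of keywords.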

  33. The HEPindexer project: Experiments

  34. The HEPindexer project: Experiments (Venn diagram over the keywords in the training collection) • A: keywords proposed by DESY • B: keywords proposed by HEPindexer • A∩B: keywords proposed by both
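With A as the keywords proposed by DESY and B as those proposed by HEPindexer, the standard definitions are precision = |A∩B| / |B| and recall = |A∩B| / |A|. The sketch below computes both for a single document. The function name and the two keyword sets are made up for illustration and are not data from the experiments.

```python
def precision_recall(reference, proposed):
    """reference: set A (human indexer); proposed: set B (automatic system)."""
    common = reference & proposed  # A ∩ B
    precision = len(common) / len(proposed) if proposed else 0.0
    recall = len(common) / len(reference) if reference else 0.0
    return precision, recall


# Made-up keyword sets for one document, for illustration only.
desy = {"quark", "lattice field theory", "mass spectrum", "numerical calculations"}
hepindexer = {"quark", "mass spectrum", "gluon"}

p, r = precision_recall(desy, hepindexer)
print(f"precision={p:.3f} recall={r:.3f}")
```

Here two of the three proposed keywords are correct (precision 2/3), and two of the four reference keywords are found (recall 1/2).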

  35. The HEPindexer project: Results • 52.7% precision • 58.5% recall • Response in 2 seconds

  36. The HEPindexer project: Results

  37. The HEPindexer project: Results

  38. The HEPindexer project: Software • C++ / STL • UNIX • Command line interface • Digilib: Web interface (PHP) http://cern.ch/digilib • Installation on the CERN Document Server http://cds.cern.ch

  39. The HEPindexer project: Software

  40. The HEPindexer project: Software

  41. The HEPindexer project: Software

  42. Future Work • Automatic proposition of secondary keywords • Improve the algorithm (lemmatizer, multiwords, segmentation...) • Use of references to link documents based on common concepts • Specific algorithms for handling energies, particle decays, disintegrations, etc. • Agents • OAI • Apply Semantic Web approaches
