CERN. European Organization for Nuclear Research. Automatic Keyword Assignment for High Energy Physics Literature. Arturo Montejo Ráez ETT/SI Data Handling Group- CERN Geneva (Switzerland). Joint Research Center, Ispra (Italy) -4 March 2002. European Organization for Nuclear Research.

Presentation Transcript
slide1

CERN

European Organization for Nuclear Research

Automatic Keyword Assignment for High Energy Physics Literature

Arturo Montejo Ráez

ETT/SI Data Handling Group- CERN

Geneva (Switzerland)

Joint Research Center, Ispra (Italy) -4 March 2002

slide2

European Organization for Nuclear Research

Data Handling Group

CERN

What we are going to see today...

  • Keyword assignment process
  • Why keywords?
  • How it is done for High Energy Physics papers
  • The HEPindexer project:
      • Data
      • Algorithm
      • Experiments
      • Results
  • Future work

Automatic Keywording for HEP literature Ispra (Italy) 4 March 2002

slide3

Keyword assignment process

Indexer

Authors

Keyworded papers

slide4

Keyword assignment process

slide5

Keyword assignment process

The document...

  • Full text paper
  • Stored in a database
  • Simplified representation needed

slide6

Keyword assignment process

The thesaurus...

  • Controlled vocabulary of concepts
  • Relationships between keywords
  • Categories and subcategories
  • Can be domain specific
  • Can be translated into multiple languages

slide7

Keyword assignment process

The thesaurus: a relational model for terms

cheese
    MT 6016 processed agricultural produce
    BT1 milk product
    NT1 blue-veined cheese
    NT1 cow's milk cheese
    NT1 fresh cheese
    NT1 goat's milk cheese
    NT1 hard cheese
    NT1 processed cheese
    NT1 semi-soft cheese
    NT1 sheep's milk cheese
    NT1 soft cheese
    RT cheese factory (6031)
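
A minimal sketch of how such an entry might be held in memory, written in Python. It reads MT, BT, NT and RT as the usual thesaurus abbreviations for microthesaurus, broader term, narrower term and related term, which is an assumption since the slide does not spell them out. The class and function names are hypothetical and not part of any CERN software.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    """One thesaurus term with its relations (hypothetical structure)."""
    term: str
    microthesaurus: str = ""                       # MT line
    broader: list = field(default_factory=list)    # BT lines
    narrower: list = field(default_factory=list)   # NT lines
    related: list = field(default_factory=list)    # RT lines


def parse_entry(lines):
    """Parse a flat entry: the first line is the term, the rest are tagged relations."""
    entry = ThesaurusEntry(term=lines[0].strip())
    for line in lines[1:]:
        # Split "NT1 soft cheese" into the tag "NT1" and the value "soft cheese".
        tag, _, value = line.strip().partition(" ")
        if tag == "MT":
            entry.microthesaurus = value
        elif tag.startswith("BT"):
            entry.broader.append(value)
        elif tag.startswith("NT"):
            entry.narrower.append(value)
        elif tag == "RT":
            entry.related.append(value)
    return entry


cheese = parse_entry([
    "cheese",
    "MT 6016 processed agricultural produce",
    "BT1 milk product",
    "NT1 blue-veined cheese",
    "NT1 soft cheese",
    "RT cheese factory (6031)",
])
print(cheese.broader, cheese.narrower, cheese.related)
```

Keeping the relations as separate lists makes it cheap to walk from a term to its broader or narrower neighbours, which is what distinguishes a thesaurus from a flat keyword list.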

slide8

Keyword assignment process

The thesaurus: a subject tree

04 POLITICS
    0406 political framework
    0411 political party
    0416 electoral procedure and voting
    0421 parliament
    0426 parliamentary proceedings
    0431 politics and public safety
    0436 executive power and public service

08 INTERNATIONAL RELATIONS
    0806 international affairs
    0811 cooperation policy
    0816 international balance
    0821 defence

10 EUROPEAN COMMUNITIES
    1006 Community institutions and European civil service
    1011 Community law
    1016 European construction
    1021 Community finance

slide9

Keyword assignment process

The indexer...

  • An expert in the domain of the documents
  • An expert in the use of the thesaurus
  • Heavy task
  • Does not always propose the same keywords
  • Expensive!

slide10

Why keywords?

  • Allow documents to be indexed in a coherent way
  • Can be viewed like the "index" at the end of a book
  • Concepts that better represent the content
  • Human made (value added)
  • Meaningful
  • Can establish relations between documents
  • Multilingual

slide11

Why keywords?

Access to documents

But... we already have fulltext indexing!

slide12

Why keywords?

Classification:

  • To store (libraries)
  • To access (narrow searches)

Category 1

Category 2

Category 3

slide13

Why keywords?

Crosslingual access

[Diagram: the query "Razor?" and documents keyworded Navaja, Razor, Couteau, Lametta]

slide14

Why keywords?

Multilingual comparison

[Diagram labels: Murder, Lametta, Razor; Fabbrica, Lametta, Razor]

slide15

Why keywords?

Advantages over fulltext searches:

  • No ambiguity
  • Better relevance and precision

More advanced tools for searching and classification are coming!

slide16

Why keywords?

The BIG problem...

- E X P E N S I V E -

slide17

Why keywords?

The BIG problem?

E X P E N S I V E ?

slide18

Why keywords?

The BIG problem?

E X P E N S I V E ?

slide19

The CERN

  • The world's largest particle physics centre
  • Explores what matter is made of, and what forces hold it together
  • Employs just under 3000 people
  • 6500 scientists come to CERN for their research

slide20

How it is done for High Energy Physics papers

DESY: Deutsches Elektronen-Synchrotron (Hamburg, Germany)

  • DESY thesaurus
  • Group of indexers (students, experts...)
  • Only High Energy Physics related papers

slide21

How it is done for High Energy Physics papers

The DESY thesaurus

A

*a4(2040) ('postulated particle, a4(2040)', was delta(2040))

*a6(2450) ('postulated particle, a6(2450)', was delta(2450))

*abelian

*aberration

absorption

-absorptive model (model, absorption)

accelerator

. . .

B

B

B anti-B

B+

B+L number

B*(5320) (excited B)

-B** ('B*2...', similar for B/s, etc.)

*B*2(5732) (postulated particle, B*2(5732))

B-

-B-factory (B, particle source)

B-L number

. . .

slide22

How it is done for High Energy Physics papers

The DESY thesaurus:

  • Few categories, rarely used
  • Only two types of keywords:
      • main keywords (1191)
      • secondary keywords (949)
  • No relationships between terms
  • Specific terminology

slide23

How it is done for High Energy Physics papers

The DESY thesaurus: specific terminology

  • Energy declarations: 1.5-2.7 GeV-cms
  • Resonances: Delta (1232)
  • Reaction equations: anti-p p ---> K0 K- pi+
  • Combinations: angular distribution (photon), mass spectrum (pi+ pi- pi0)
  • Two-particle initial state: 'anti-p p', 'electron positron'
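
A rough sketch of how two of these forms, energy declarations and reaction equations, might be recognised with regular expressions in Python. The patterns are assumptions inferred only from the two examples on this slide. They are not the DESY grammar, and the function names are hypothetical.

```python
import re

# Energy declaration such as "1.5-2.7 GeV-cms":
# a value or a range, a unit, and an optional frame suffix.
ENERGY = re.compile(
    r"^(\d+(?:\.\d+)?)(?:-(\d+(?:\.\d+)?))?\s+([kMGT]?eV)(?:-(\w+))?$"
)

# Reaction equation such as "anti-p p ---> K0 K- pi+":
# initial state, an arrow made of dashes and ">", final state.
REACTION = re.compile(r"^(.+?)\s*-+>\s*(.+)$")


def parse_energy(text):
    """Return the energy range, unit and frame, or None if the text does not match."""
    m = ENERGY.match(text.strip())
    if m is None:
        return None
    low, high, unit, frame = m.groups()
    return {"low": float(low), "high": float(high or low), "unit": unit, "frame": frame}


def parse_reaction(text):
    """Return the initial-state and final-state particle lists, or None if no match."""
    m = REACTION.match(text.strip())
    if m is None:
        return None
    return {"initial": m.group(1).split(), "final": m.group(2).split()}


print(parse_energy("1.5-2.7 GeV-cms"))
print(parse_reaction("anti-p p ---> K0 K- pi+"))
```

Structured keywords like these carry numbers and particle names that ordinary word-based matching would split apart, which is presumably why they need separate handling.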

slide24

How it is done for High Energy Physics papers

The problem

Indexer

Physicists

More than 500 preprints per week!

slide25

The HEPindexer project

The solution

Physicists

Indexer

Keyworded papers

slide26

The HEPindexer project

  • Use of IR techniques
  • Objective evaluation
  • Real-time answers
  • Easily portable
  • Fully integrable into CDS
  • Possibility of growing
  • Usable fully automatically or as an aid for indexers

slide27

The HEPindexer project

Keyword Term

Keyworded papers (collection)

slide28

The HEPindexer project

Documents

DESY keywords

Keyword Term

slide29

Data

The HEPindexer project

  • 3,661 documents
      • 2,441 in the training collection
      • 1,220 in the test collection
  • 19,143 terms
  • 1,191 main keywords

slide30

Algorithm

The HEPindexer project

slide31

Algorithm

The HEPindexer project

Preprocessing

  • Punctuation
  • Lower case
  • Remove stop words
  • Stemming
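
The four preprocessing steps can be sketched in Python as follows. This is an illustration only: the stop-word list is a tiny made-up sample, and the suffix stripper is a crude stand-in for a real stemmer (the slide does not say which stemmer HEPindexer uses).

```python
import re

# Tiny illustrative stop list; a real system would use a much longer one.
STOP_WORDS = {"the", "of", "a", "an", "in", "and", "is", "are", "for", "to", "at"}

# Crude suffix stripping, standing in for a proper stemming algorithm.
SUFFIXES = ("ations", "ation", "ings", "ing", "ies", "es", "ed", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Turn raw text into a list of normalised terms."""
    text = re.sub(r"[^\w\s]", " ", text)                  # 1. remove punctuation
    tokens = text.lower().split()                         # 2. lower case, tokenise
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 3. remove stop words
    return [stem(t) for t in tokens]                      # 4. stemming


print(preprocess("Measurements of the decaying particles, at CERN!"))
```

The output of this step is the "simplified representation" of a document that the weighting stage works on.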

slide32

Algorithm

The HEPindexer project

Weight term - document

Weight keyword - document

Weight keyword - term

Similarity keyword - document
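
The formulas behind these four quantities did not survive in the transcript. The Python sketch below shows one common way to define them, under loudly stated assumptions: TF-IDF for the term-document weight, a 0/1 keyword-document weight (assigned or not), keyword-term weights obtained by summing over the training documents that carry the keyword, and cosine similarity between a keyword's term profile and a new document. None of this is confirmed to be the HEPindexer method. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def term_weights(docs):
    """Term-document weights as TF-IDF (assumed); docs is a list of token lists."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    weights = [{t: tf * idf[t] for t, tf in Counter(d).items()} for d in docs]
    return weights, idf


def keyword_profiles(doc_weights, doc_keywords):
    """Keyword-term weights: sum the term weights of documents carrying the keyword."""
    profiles = defaultdict(lambda: defaultdict(float))
    for weights, keywords in zip(doc_weights, doc_keywords):
        for k in keywords:  # keyword-document weight is 1 if assigned, else 0
            for t, w in weights.items():
                profiles[k][t] += w
    return profiles


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_keywords(tokens, profiles, idf):
    """Keyword-document similarity for a new document, best match first."""
    doc = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokens).items()}
    return sorted(((cosine(p, doc), k) for k, p in profiles.items()), reverse=True)


# Toy training data; terms and keyword assignments are invented for illustration.
train_docs = [["beam", "magnet", "ring"], ["photon", "absorb", "target"], ["beam", "target"]]
train_keywords = [["accelerator"], ["absorption"], ["accelerator", "absorption"]]

weights, idf = term_weights(train_docs)
profiles = keyword_profiles(weights, train_keywords)
ranked = rank_keywords(["magnet", "ring", "ring"], profiles, idf)
print(ranked)
```

Keywords whose similarity exceeds some threshold, or the top few in the ranking, would then be proposed for the new document.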

slide33

Experiments

The HEPindexer project

slide34

Experiments

The HEPindexer project

A ∩ B

Keywords in the training collection

A

B

A: keywords proposed by DESY

B: keywords proposed by HEPindexer
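
With A and B defined this way, the usual evaluation measures are precision = |A ∩ B| / |B| and recall = |A ∩ B| / |A|. The slide does not state the formulas, so reading the experiment in these standard terms is an assumption, as is the per-document computation in the Python sketch below. The sample keywords are invented.

```python
def precision_recall(reference, proposed):
    """Precision and recall of proposed keywords (B) against reference keywords (A)."""
    a, b = set(reference), set(proposed)
    common = a & b
    precision = len(common) / len(b) if b else 0.0   # share of proposals that are right
    recall = len(common) / len(a) if a else 0.0      # share of reference keywords found
    return precision, recall


desy = {"accelerator", "absorption", "B"}                   # A: reference keywords
hepindexer = {"accelerator", "B", "abelian", "aberration"}  # B: proposed keywords

p, r = precision_recall(desy, hepindexer)
print(f"precision={p:.3f} recall={r:.3f}")
```

Whether the reported figures are averaged per document or pooled over the whole test collection is not stated in the slides.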

slide35

Results

The HEPindexer project

Precision: 52.7 %

Recall: 58.5 %

Response in 2 seconds

slide36

Results

The HEPindexer project

slide37

Results

The HEPindexer project

slide38

Software

The HEPindexer project

  • C++ / STL
  • UNIX
  • Command line interface
  • Digilib: Web interface (PHP)

http://cern.ch/digilib

  • Installation on the CERN Document Server

http://cds.cern.ch

slide39

Software

The HEPindexer project

slide40

Software

The HEPindexer project

slide41

Software

The HEPindexer project

slide42

Future Work

  • Automatic proposition of secondary keywords
  • Improve the algorithm (lemmatizer, multiwords, segmentation...)
  • Use of references to link documents based on common concepts
  • Specific algorithms for handling energies, particle decays, disintegrations, etc.
  • Agents
  • OAI
  • Apply Semantic Web approaches
