Metadata as infrastructure for information retrieval and text mining

Metadata as Infrastructure for Information Retrieval and Text Mining. Prof. Ray R. Larson, University of California, Berkeley, School of Information.


Metadata as Infrastructure for Information Retrieval and Text Mining

Prof. Ray R. Larson

University of California, Berkeley, School of Information

NaCTeM – Ray R. Larson



Overview

  • Metadata as Infrastructure

    • What, Where, When and Who?

  • What are Entry Vocabulary Indexes?

    • Notion of an EVI

    • How are EVIs Built

  • Time Period Directories

    • Mining Metadata for new metadata


Metadata as Infrastructure

  • The difference between memorization and understanding lies in knowing the context and relationships of whatever is of interest. When setting out to learn about a new topic, a well-tested practice is to follow the traditional “5Ws and the H”: Who?, What?, When?, Where?, Why?, and How?


Metadata as Infrastructure

  • The reference collections of paper-based libraries provide a structured environment for resources, with encyclopedias and subject catalogs, gazetteers, chronologies, and biographical dictionaries, offering direct support for at least What, Where, When, and Who.

  • The digital environment does not yet provide an effective, and easily exploited, infrastructure comparable to the traditional reference library.


What?

Searching texts by topic, e.g. Dewey, LCSH, any subject index, or category scheme applied to documents.

Two kinds of mapping in every search:

  • Documents are assigned to topic categories, e.g. Dewey

  • Queries have to map to topic categories, e.g. Dewey’s Relativ Index from ordinary words/phrases to Decimal Classification numbers.

    Also mapping between topic systems, e.g. US Patent classification and International Patent Classification.


‘What’ searches involve mapping to controlled vocabularies

(Diagram: Thesaurus/Ontology, Texts.)


Start with a collection of documents.


Classify and index with a controlled vocabulary, or use a pre-indexed collection.

(Diagram: Index.)


Problem: Controlled vocabularies can be difficult for people to use.

  • For: “Wirtschaftspolitik”, in Library of Congress subj. Index, use: “Economic Policy”

  • Index term: “pass mtr veh spark ign eng”


“pass mtr veh spark ign eng” = “Automobile”

Solution: Entry Level Vocabulary Indexes.

(Diagram: Index, EVI.)


“What” and Entry Vocabulary Indexes

  • EVIs are a means of mapping from the user’s vocabulary to the controlled vocabulary of a collection of documents…


Building and Searching EVIs

Searching

  • User has a question but is unfamiliar with the domain he wants to search.

  • User selects a subject domain of interest. Domains to select from: Engineering, Medicine, Biology, Social science, etc.

  • Has an Entry Vocabulary Module been built? YES: use an existing EVI. NO: build one (see below).

  • Map user’s query to a ranked list of controlled vocabulary terms.

  • User selects search terms from the ranked list of terms returned by the EVI.

Building an Entry Vocabulary Module (EVI)

  • Internet DB indexed with a controlled vocabulary.

  • Download a set of training data.

  • Part of speech tagging (for noun phrases).

  • Extract terms (words and noun phrases) from titles and abstracts.

  • Build associations between extracted terms & controlled vocabularies.


Technical Details

Building an Entry Vocabulary Module (EVI)

  • Internet DB indexed with a controlled vocabulary.

  • Download a set of training data.

  • Part of speech tagging (for noun phrases).

  • Extract terms (words and noun phrases) from titles and abstracts.

  • Build associations between extracted terms & controlled vocabularies.
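The build steps above end in counting how often each extracted term co-occurs with each controlled-vocabulary class. The following is only a minimal sketch of that counting step, not the EVI implementation itself: the function name `contingency_tables`, the record layout, and the three toy training records are all assumptions made for illustration, and term extraction (part-of-speech tagging, noun phrases) is taken as already done.

```python
from collections import Counter


def contingency_tables(records):
    """Count term/class co-occurrence over training records.

    records: iterable of (terms, classes) pairs, each a set of strings.
    Returns {(term, cls): (a, b, c, d)} where
      a = records with the term and the class,
      b = records with the term but not the class,
      c = records with the class but not the term,
      d = records with neither.
    """
    n = 0
    term_df, class_df, pair_df = Counter(), Counter(), Counter()
    for terms, classes in records:
        n += 1
        term_df.update(set(terms))
        class_df.update(set(classes))
        pair_df.update((t, c) for t in set(terms) for c in set(classes))

    tables = {}
    for (term, cls), a in pair_df.items():
        b = term_df[term] - a
        c = class_df[cls] - a
        d = n - a - b - c
        tables[(term, cls)] = (a, b, c, d)
    return tables


# Toy training data (hypothetical): extracted terms paired with index terms.
training = [
    ({"automobile", "engine"}, {"pass mtr veh spark ign eng"}),
    ({"automobile", "safety"}, {"pass mtr veh spark ign eng"}),
    ({"economic", "policy"}, {"Economic Policy"}),
]
tables = contingency_tables(training)
```

Each resulting tuple fills one 2×2 table of term against class, which is the input an association measure needs.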


Association Measure

          C     ¬C
  t       a     b
  ¬t      c     d

where t is the occurrence of a term and C is the occurrence of a class in the training set, so that a, b, c and d are the counts of training records falling in each cell.


Association Measure

  • Maximum likelihood ratio (viz. Dunning):

$$W(C,t) = 2\,\bigl[\log L(p_1, a, a+b) + \log L(p_2, c, c+d) - \log L(p, a, a+b) - \log L(p, c, c+d)\bigr]$$

where

$$\log L(p, k, n) = k \log(p) + (n - k)\log(1 - p)$$

and

$$p_1 = \frac{a}{a+b}, \qquad p_2 = \frac{c}{c+d}, \qquad p = \frac{a+c}{a+b+c+d}$$
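As a worked illustration, the likelihood ratio weight W(C,t) can be computed directly from the four cell counts a, b, c, d. This sketch reads logL's second argument as the number of successes and its third as the number of trials, i.e. logL(p, k, n) = k log p + (n − k) log(1 − p), which is the reading under which the calls logL(p1, a, a+b) and logL(p2, c, c+d) are well defined. The function names are assumptions, and the convention 0·log 0 = 0 is added here to keep empty cells from raising errors.

```python
import math


def log_l(p, k, n):
    """k*log(p) + (n - k)*log(1 - p), treating 0*log(0) as 0."""
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1 - p)
    return total


def association_weight(a, b, c, d):
    """Log-likelihood ratio weight W(C, t) for one 2x2 table.

    Assumes both rows are non-empty (a + b > 0 and c + d > 0).
    """
    p1 = a / (a + b)                # share of records with the term that are in C
    p2 = c / (c + d)                # share of records without the term that are in C
    p = (a + c) / (a + b + c + d)   # overall share of records in C
    return 2 * (log_l(p1, a, a + b) + log_l(p2, c, c + d)
                - log_l(p, a, a + b) - log_l(p, c, c + d))


independent = association_weight(10, 10, 10, 10)
strong = association_weight(20, 0, 0, 20)
weak = association_weight(12, 8, 8, 12)
```

When term and class are independent (p1 = p2 = p) the weight is zero, and it grows as the term becomes more predictive of the class, so sorting classes by W(C,t) gives the ranked list an EVI returns for a term.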


Alternatively

  • Because the “evidence” terms for each class in an EVI can be treated as a document, you can also apply IR techniques: rank the classes against the query and use the top-ranked classes for classification or query expansion.
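A minimal sketch of this alternative follows, assuming a plain TF-IDF weighting with cosine similarity as the IR technique (the slides do not say which ranking model is used). The function `rank_classes` and the toy evidence lists are hypothetical.

```python
import math
from collections import Counter


def rank_classes(query_terms, class_evidence):
    """Rank classes by TF-IDF cosine similarity to the query.

    class_evidence: {class_label: list of evidence terms}, i.e. each class
    is treated as a pseudo-document made of its associated terms.
    Returns a list of (class_label, score), best first.
    """
    n = len(class_evidence)
    df = Counter(t for terms in class_evidence.values() for t in set(terms))
    idf = {t: math.log(1 + n / k) for t, k in df.items()}

    def vector(terms):
        # Terms unknown to every class carry no evidence and are dropped.
        tf = Counter(t for t in terms if t in idf)
        return {t: f * idf[t] for t, f in tf.items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm = (math.sqrt(sum(w * w for w in u.values()))
                * math.sqrt(sum(w * w for w in v.values())))
        return dot / norm if norm else 0.0

    q = vector(query_terms)
    scores = {label: cosine(q, vector(terms))
              for label, terms in class_evidence.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy evidence terms per class (hypothetical).
evidence = {
    "pass mtr veh spark ign eng": ["automobile", "car", "engine", "automobile"],
    "Economic Policy": ["economy", "policy", "government"],
}
ranking = rank_classes(["automobile", "engine"], evidence)
```

The top entries of `ranking` can then be offered to the user as candidate controlled-vocabulary terms, or appended to the query for expansion.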


Find “Plutonium” in Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil.

(Diagram: Statistical association, Digital library resources.)


EVI example

User Query: “Automobile”

  • EVI 1: Index term: “pass mtr veh spark ign eng”

  • EVI 2: Index term: “automobiles” OR “internal combustion engines”


But why stop there?

(Diagram: Index, EVI.)


“Which EVI do I use?”

(Diagram: several Indexes, with multiple EVIs to choose from.)


EVI to EVIs

(Diagram: EVI2 together with the individual EVIs and their Indexes.)


Find “Plutonium” in Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil.

Why not treat language the same way?


It is also difficult to move between different media forms

(Diagram: Thesaurus/Ontology, Texts, EVI, Numeric datasets.)


Searching across data types

  • Different media can be linked indirectly via metadata, but often (e.g. for socio-economic numeric data series) you also need to specify WHERE to get correct results


But texts associated with numeric data can be mapped as well…

(Diagram: Thesaurus/Ontology, Texts, EVI, EVI, captions, Numeric datasets.)


EVI to Numeric Data example

(Diagram of a flow with steps numbered 1 to 11. Labels: search interface 1, online catalog, EVI, LCSH, search results, marc, new query, search interface 2, numeric database, numeric table, captions.)


But there are also geographic dependencies…

(Diagram: Thesaurus/Ontology, Texts, EVI, EVI, Maps/Geo Data, captions, Numeric datasets.)


WHERE: Place names are problematic…

  • Variant forms: St. Petersburg, Санкт Петербург, Saint-Pétersbourg, . . .

  • Multiple names: Cluj, in Romania / Roumania / Rumania, is also called Klausenburg and Kolozsvar.

  • Name changes: Bombay → Mumbai.

  • Homographs: Vienna, VA, and Vienna, Austria;

    • 50 Springfields.

  • Anachronisms: No Germany before 1870

  • Vague, e.g. Midwest, Silicon Valley

  • Unstable boundaries: 19th century Poland; Balkans; USSR

  • Use a gazetteer!
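To make the gazetteer idea concrete, here is a small sketch of a lookup structure that handles variant names and homographs. The class and field names (`Place`, `Gazetteer`, `variant_names`) are assumptions rather than a real gazetteer schema, and the coordinates are approximate values included only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    preferred_name: str
    feature_type: str
    latitude: float       # decimal degrees, approximate
    longitude: float      # decimal degrees, approximate
    variant_names: tuple = ()


class Gazetteer:
    """Index every preferred and variant name to the places it may denote."""

    def __init__(self, places):
        self._index = {}
        for place in places:
            for name in (place.preferred_name, *place.variant_names):
                self._index.setdefault(name.casefold(), []).append(place)

    def lookup(self, name):
        # A homograph returns several candidates; the caller must disambiguate.
        return self._index.get(name.casefold(), [])


gazetteer = Gazetteer([
    Place("Cluj", "city", 46.77, 23.60, ("Klausenburg", "Kolozsvar")),
    Place("Vienna, Austria", "city", 48.21, 16.37, ("Vienna",)),
    Place("Vienna, VA", "town", 38.90, -77.26, ("Vienna",)),
])
matches = gazetteer.lookup("Vienna")
```

A variant name such as "Klausenburg" resolves to a single entry with coordinates, while "Vienna" returns both candidates, which makes the homograph problem explicit instead of hiding it.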


WHERE: Geo-temporal search interface. Place names are found in documents, the gazetteer provides latitude and longitude, and the places are displayed on a map, with a timebar.


Zoom on map. Click on place for a list of records. Click on record to display text.


Catalogs and gazetteers should talk to each other!

Catalog search

Gazetteer search

Geographic sort / display of catalog search result.


So geographic search becomes part of the infrastructure

(Diagram: Thesaurus/Ontology, Texts, EVI, Maps/Geo Data, Gazetteers, captions, Numeric datasets.)


WHEN: Search by time is also weakly supported…

  • Calendars are the standard for time

  • But people use the names of events to refer to time periods

  • Named time periods resemble place names in being:

    • Unstable: European War, Great War, First World War

    • Multiple: Second World War, Great Patriotic War

    • Ambiguous: “Civil war” in different centuries in England, USA, Spain, etc.

  • Places have temporal aspects and periods have geographical aspects: when the Stone Age occurred varies by region.


Similarity between place names and period names

  • Suggests a similar solution: A gazetteer-like Time Period Directory.

  • Gazetteer:

    • Place name – Type – Spatial markers (Lat & long) – When

  • Time Period Directory:

    • Period name – Type – Time markers (Calendar) – Where

  • Note the symmetry in the connections between Where and When.
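The Time Period Directory entry described above (period name, type, calendar markers, where) can be sketched as a record type, with the place used to disambiguate a shared period name. The class `TimePeriod` and the function `resolve` are hypothetical names, and whole years stand in for the calendar markers as a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimePeriod:
    name: str          # period name
    period_type: str   # type
    start_year: int    # time markers (calendar), simplified to years
    end_year: int
    place: str         # where


def resolve(directory, name, place=None):
    """Return the periods matching a name, optionally narrowed by place."""
    hits = [p for p in directory if p.name.casefold() == name.casefold()]
    if place is not None:
        hits = [p for p in hits if p.place == place]
    return hits


directory = [
    TimePeriod("Civil War", "war", 1642, 1651, "England"),
    TimePeriod("Civil War", "war", 1861, 1865, "USA"),
    TimePeriod("Civil War", "war", 1936, 1939, "Spain"),
]
ambiguous = resolve(directory, "civil war")
spanish = resolve(directory, "civil war", place="Spain")
```

The name alone yields three candidates; adding the Where component narrows them to one, mirroring how a gazetteer entry can be narrowed by When.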


Solution - Time Period Directories

  • Initial development involved mining the Library of Congress Subject Authority file for named time periods…


LC MARC Authorities Records

<USMARC>

<Fld001>sh 00000613 </Fld001>

<Fld151><a>Magdeburg (Germany)</a><x>History</x><y>Siege, 1550-1551</y></Fld151>

<Fld550><w>g</w><a>Sieges</a><z>Germany</z></Fld550>

<Fld670><a>Work cat.: 45053442: Besselmeier, S. Warhafftige history vnd beschreibung des Magdeburgischen Kriegs, 1552.</a></Fld670>

<Fld670><a>Cath. encyc.</a><b>(Magdeburg: besieged (1550-51) by the Margrave Maurice of Saxony)</b></Fld670>

<Fld670><a>Ox. encyc. reformation</a><b>(Magdeburg: ... during the 1550-1551 siege of Magdeburg ...)</b></Fld670>

</USMARC>
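A record in this form can be mined for a named time period with the standard library alone. The sketch below is one possible heuristic, not the project's actual extraction code: it assumes that the chronological subdivision `<y>` of field 151 follows the pattern "name, start year-end year" seen in the sample, and the function name `extract_time_period` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def extract_time_period(record_xml):
    """Return place, period name and years from a 151 heading, or None."""
    root = ET.fromstring(record_xml)
    heading = root.find("Fld151")
    if heading is None:
        return None
    chronology = heading.findtext("y")
    if chronology is None:
        return None
    # Assumed pattern: "<name>, <start year>[-<end year>]".
    match = re.search(r"(\d{3,4})(?:-(\d{3,4}))?", chronology)
    if match is None:
        return None
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else start
    return {
        "id": (root.findtext("Fld001") or "").strip(),
        "place": heading.findtext("a"),
        "period_name": chronology[:match.start()].rstrip(", "),
        "start_year": start,
        "end_year": end,
    }


# Abridged copy of the sample record (the 550 and 670 fields are omitted).
SAMPLE = (
    "<USMARC>"
    "<Fld001>sh 00000613 </Fld001>"
    "<Fld151><a>Magdeburg (Germany)</a><x>History</x>"
    "<y>Siege, 1550-1551</y></Fld151>"
    "</USMARC>"
)
period = extract_time_period(SAMPLE)
```

The result pairs a period name and date range with a place, which is the shape a Time Period Directory entry needs.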


Time periods by named location


Catalog Search Result


Web Interface - Access by map


Zoomable interface gives access to geographically focused info…


Web Interface - Access by timeline

Link initiates search of the Library of Congress catalog for all records relating to this time period.


WHEN and WHAT

  • These named time periods are derived from Library of Congress catalog subject headings, so they can be used for catalog searching, which finds books on topics important for that time period.


Time period directories link via the place (or time)

(Diagram: Thesaurus/Ontology, Texts, EVI, Maps/Geo Data, Gazetteers, captions, Numeric datasets, Time Period Directory, Time lines, Chronologies.)


WHEN, WHERE and WHO

  • Catalog records found from a time period search commonly include names of persons important at that time. Their names can be forwarded to, e.g., biographies in the Wikipedia encyclopedia.


Place and time are broadly important across numerous tools and genres including, e.g., language atlases, library catalogs, biographical dictionaries, bibliographies, archival finding aids, museum records, etc.

Biographical dictionaries are heavy on place and time: Emanuel Goldberg, Born Moscow 1881. PhD under Wilhelm Ostwald, Univ. of Leipzig, 1906. Director, Zeiss Ikon, Dresden, 1926-33. Moved to Palestine 1937. Died Tel Aviv, 1970.

Life as a series of episodes involving Activity (WHAT), WHERE, WHEN, and WHO else.
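The episode view of a life can be sketched as a small record type, using the Emanuel Goldberg entry above as data. The class `Episode`, its field names, and the helper `episodes_in` are assumptions for illustration, and whole years are the only time markers modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Episode:
    what: str                       # activity (WHAT)
    where: str                      # WHERE
    start_year: int                 # WHEN
    end_year: Optional[int] = None  # None for a single-year episode
    who_else: tuple = ()            # WHO else


goldberg = [
    Episode("Born", "Moscow", 1881),
    Episode("PhD", "Univ. of Leipzig", 1906, who_else=("Wilhelm Ostwald",)),
    Episode("Director, Zeiss Ikon", "Dresden", 1926, 1933),
    Episode("Moved", "Palestine", 1937),
    Episode("Died", "Tel Aviv", 1970),
]


def episodes_in(episodes, year):
    """Episodes whose time span covers the given year."""
    return [e for e in episodes
            if e.start_year <= year <= (e.end_year or e.start_year)]


in_1930 = episodes_in(goldberg, 1930)
places = [e.where for e in goldberg]
```

Each field is a hook into one of the other directories: `where` into a gazetteer, the years into a Time Period Directory, and `who_else` into other biographical entries.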


A new form of biographical dictionary would link to all

(Diagram: Biographical Dictionary linked to Thesaurus/Ontology, Texts, EVI, Maps/Geo Data, Gazetteers, captions, Numeric datasets, Time Period Directory, Time lines, Chronologies.)


A Metadata Infrastructure

  • RESOURCES: Audio, Images, Numeric Data, Objects, Texts, Virtual Reality, Webpages

  • CATALOGS: Archives, Historical Societies, Libraries, Museums, Public Television, Publishers, Booksellers

  • Learners, Dossiers

  • INTERMEDIA INFRASTRUCTURE:

    Facet    Authority Control          Special Display Tools
    WHAT     Thesaurus                  Syndetic Structure
    WHERE    Gazetteer                  Maps
    WHEN     Time Period Directory      Timelines
    WHO      Biographical Dictionary    Text and Images


Acknowledgements

  • Electronic Cultural Atlas Initiative project

  • This work was partially supported by the Institute of Museum and Library Services through a National Leadership Grant for Libraries, award number LG-02-04-0041-04, Oct 2004 - Sept 2006 entitled “Supporting the Learner: What, Where, When and Who” – See: http://ecai.org/imls2004

  • Michael Buckland, Fred Gey, Vivien Petras, Matt Meiske, Kim Carl

  • Contact: [email protected]
