
Information Retrieval to Knowledge Retrieval, one more step

Xiaozhong Liu

Assistant Professor

School of Library and Information Science

Indiana University Bloomington


What is Information?

What is Retrieval?

What is Information Retrieval?


Search for something based on the User's Information Need!!

How to express your information need?

Query


User Information Need!!

What is a good query?

What is a bad query?

Good query: query ≈ information need

Bad query: query ≠ information need

Query

Wait!!! The user NEVER makes mistakes!!!

It’s OUR job!!!


Task 1: Given a user's information need, how do we help the user (or automatically help them) propose a better query?

If there is a query…

Perfect query:

User input query:


User Information Need!!

What are good results?

What are bad results?

Given a query, how do we retrieve results?

Query

Results


Task 2: Given a (not perfect) query, how do we retrieve documents from the collection?

F(query, doc)

Very Large, Unstructured

Text Data!!!

Can you give me an example?


F(query, doc):

If the query term exists in the doc: yes, this is a result.

If the query term does NOT exist in the doc: no, this is not a result.
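A minimal Python sketch of this exact-match F(query, doc); the naive tokenizer and names are illustrative, not the lecture's code:

import re

def exact_match(query, doc):
    """F(query, doc): a result iff every query term occurs in the doc."""
    doc_terms = set(re.findall(r"\w+", doc.lower()))
    return all(t in doc_terms for t in re.findall(r"\w+", query.lower()))

print(exact_match("cat", "I love my cat."))              # True
print(exact_match("lovely cat", "This cat is lovely!"))  # True
print(exact_match("kitten", "I love my cat."))           # False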

Is there any problem in this function?

Brainstorm…


Query: Obama’s wife

Doc 1. My wife supports Obama’s new policy on…

Doc 2. Michelle, as the first lady of the United States…

Yes, this is a very challenging task!


Another problem

Collection size: 5 billion

Matching docs: 5

My algorithm successfully finds all 5 docs! In… 3 billion results…


User Information Need!!

How do we help the user find what they need among all the retrieved results?

Query

Results


Task 3: Given the retrieved results, how do we help the user find what they need?

If the retrieval algorithm retrieves 1 billion results from the collection, what will you do???

Search with Google and click "next"???

Yes, we can help users find what they need!


Query: Indiana University Bloomington

Can you read the results one by one?

Would you actually use it??


[Diagram: the User, with an Information Need, issues a Query to the System, which returns Results (steps 1, 2, 3)]


[Diagram: Information Retrieval at the center, applied to many media: Text, Map, Image, Music, ……]


[Diagram: Information Retrieval at the center; Text expands into web, scholar, document, blog, news; other media: Map, Image, Music, ……]

Documents vs. Database Records
  • Relational database records are typically made up of well-defined fields

Select * from students where GPA > 2.5

Can we handle text the same way? Find all the docs containing "Xiaozhong":

Select * from documents where text like ‘%xiaozhong%’

We need a more effective way to index the text!


Collection C: doc_1, doc_2, doc_3 ……… doc_N

Vocabulary V: w_1, w_2, w_3 ……… w_n

Document doc_i: d_i1, d_i2, d_i3 ……… d_im, where all d_ij ∈ V

Query q: q_1, q_2, q_3 ……… q_t, where q_x is a query term


Collection C: doc_1, doc_2, doc_3 ……… doc_N

V:        w_1  w_2  w_3  …  w_n
Doc_1      1    0    0       1
Doc_2      0    0    0       1
Doc_3      1    1    1       1
………
Doc_N      1    0    1       1
Query q:   0    1    0   …


Collection C: doc_1, doc_2, doc_3 ……… doc_N

Normalization is very important!

V:        w_1  w_2  w_3  …  w_n
Doc_1      3    0    0       9
Doc_2      0    0    0       7
Doc_3      2   11   21       1
………
Doc_N      7    0    1       2
Query q:   0    3    0   …


Collection C: doc_1, doc_2, doc_3 ……… doc_N

Normalization is very important!

V:        w_1   w_2   w_3  …  w_n     (weights)
Doc_1     0.41  0     0       0.62
Doc_2     0     0     0       0.12
Doc_3     0.42  0.11  0.34    0.13
………
Doc_N     0.01  0     0.19    0.24
Query q:  0     0.37  0    …


Term weighting

TF * IDF

Inverse document frequency: IDF = 1 + log(N/k)

N = total number of docs in the collection
k = total number of docs containing word w

Term frequency: TF = freq(w, doc) / |doc|

Or…

An effective way to weight each word in a document.
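A small sketch of this weighting scheme, using exactly the TF and IDF formulas above (function and variable names are illustrative):

import math
from collections import Counter

def tf_idf_vector(doc_tokens, collection):
    """Weight each word in a doc by TF * IDF:
       TF  = freq(w, doc) / |doc|
       IDF = 1 + log(N / k), where k docs (out of N) contain w."""
    N = len(collection)
    counts = Counter(doc_tokens)
    weights = {}
    for w, f in counts.items():
        k = sum(1 for d in collection if w in d)   # document frequency
        weights[w] = (f / len(doc_tokens)) * (1 + math.log(N / k))
    return weights

docs = [["i", "love", "my", "cat"],
        ["this", "cat", "is", "lovely"],
        ["yellow", "cat", "and", "white", "cat"]]
print(tf_idf_vector(docs[2], docs))  # "cat" is frequent, but occurs in every doc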


[Diagram: the Index at the center, with surrounding concerns: Retrieval Model? Ranking? Speed? Semantics? Space?]

The document representation must meet the requirements of the retrieval system.


Stemming

Education, Educate, Educational, Educating, Educations → Educat

Very effective for improving system performance.

Some risk! E.g. LA Lakers = LA Lake?
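For instance, NLTK's Porter stemmer collapses these variants (exact stems depend on the algorithm; this is just one common choice):

from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
for w in ["education", "educate", "educational", "educating", "educations"]:
    print(w, "->", stemmer.stem(w))   # all variants collapse to one stem

# The risk: stemming can conflate distinct meanings,
# e.g. a team name and a body of water may end up looking alike.
print(stemmer.stem("lakers"), stemmer.stem("lake"))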


Inverted index

Doc 1: I love my cat.

Doc 2: This cat is lovely!

Doc 3: Yellow cat and white cat.

Tokens: i, love, my, cat, this, is, lovely, yellow, and, white

Index terms (after stemming and stopword removal): i, love, cat, thi, yellow, white

i - 1

love - 1, 2

thi - 2

cat - 1, 2, 3

yellow - 3

white - 3

Do we lose something?


Inverted index

Doc 1: I love my cat.

Doc 2: This cat is lovely!

Doc 3: Yellow cat and white cat.

i - 1

love - 1, 2

thi - 2

cat - 1, 2, 3

yellow - 3

white - 3

i – 1:1

love – 1:1, 2:1

thi – 2:1

cat – 1:1, 2:1, 3:2

yellow – 3:1

white – 3:1

Do we still lose something?


Inverted index

Doc 1: I love my cat.

Doc 2: This cat is lovely!

Doc 3: Yellow cat and white cat.

i – 1:1

love – 1:1, 2:1

thi – 2:1

cat – 1:1, 2:1, 3:2

yellow – 3:1

white – 3:1

i – 1:1

love – 1:2, 2:4

thi – 2:1

cat – 1:4, 2:2, 3:2, 3:5

yellow – 3:1

white – 3:4

Why do you need position info?
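A compact sketch of a positional inverted index (no stemming or stopword removal, for brevity):

import re
from collections import defaultdict

def build_positional_index(docs):
    """term -> list of (doc_id, position), positions counted from 1."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs, start=1):
        for pos, term in enumerate(re.findall(r"\w+", text.lower()), start=1):
            index[term].append((doc_id, pos))
    return index

docs = ["I love my cat.", "This cat is lovely!", "Yellow cat and white cat."]
index = build_positional_index(docs)
print(index["cat"])   # [(1, 4), (2, 2), (3, 2), (3, 5)]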


Proximity of query terms

query: information retrieval

Doc 1: information retrieval is important for digital library.

Doc 2: I need some information about the dogs, my favorite is golden retriever.


Index – bag of words

query: information retrieval

Doc 1: information retrieval is important for digital library.

Doc 2: I need some information about the dogs, my favorite is golden retriever.

What’s the limitation of bag-of-words? Can we make it better?

n-gram:

Doc 1: information retrieval, retrieval is, is important, important for ……

bi-gram

Better semantic representation!

What’s the limitation?
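Generating the n-grams themselves is simple; a sketch:

def ngrams(tokens, n=2):
    """Consecutive n-word index terms (n=2 gives the bi-grams above)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

doc1 = "information retrieval is important for digital library".split()
print(ngrams(doc1))
# ['information retrieval', 'retrieval is', 'is important', ...]
# One limitation: the number of distinct index terms explodes as n grows.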


Index – bag of “phrase”?

Doc 1: …… big apple ……

Doc 2: …… apple ……

More precision, less ambiguous

How to identify phrases from documents?

  • Identify syntactic phrases using POS tagging
  • n-grams
  • From existing resources


Noise detection

What is the noise of a web page? Non-informative content…


Web Crawler - freshness

The Web is changing, but we cannot constantly check all the pages…

We need to find the most important pages and those that change frequently.

www.nba.com

www.iub.edu

www.restaurant????.com

Sitemap: a list of URLs for each host, with modification times and change frequencies.


Model

Mathematical modeling is frequently used with the objective to understand, explain, reason and predict behavior or phenomenon in the real world (Hiemstra, 2001).

e.g. some models help you predict tomorrow's stock price…


Vector Space Model

Hypothesis:

Retrieval and ranking problem = Similarity Problem!

Is that a good hypothesis? Why?

Retrieval Function: Similarity (query, Document)

Return a score!!! We can Rank the documents!!!


Vector Space Model

So, a query is just a short document.


Collection C: doc_1, doc_2, doc_3 ……… doc_N

V:        w_1   w_2   w_3  …  w_n
Doc_1     0.41  0     0       0.62
Doc_2     0     0     0       0.12
Doc_3     0.42  0.11  0.34    0.13
………
Doc_N     0.01  0     0.19    0.24
Query q:  0     0.37  0    …


Collection C: doc_1, doc_2, doc_3 ……… doc_N

V:        w_1   w_2   w_3  …  w_n
Doc_1     0.41  0     0       0.62   ← doc vector
Doc_2     0     0     0       0.12
Doc_3     0.42  0.11  0.34    0.13
………
Doc_N     0.01  0     0.19    0.24
Query q:  0     0.37  0    …         ← query vector

Similarity(query vector, doc vector)


Doc1: ……Cat……dog……cat……

Doc2: ……Cat……dog

Doc3: ……snake……

Query: dog cat

[Plot: docs as vectors in the 2-D (dog, cat) space: doc 1 at (1, 2), doc 2 at (1, 1), doc 3 at the origin]


Doc1: ……Cat……dog……cat……

Doc2: ……Cat……dog

Doc3: ……snake……

Query: dog cat

[Plot: the same (dog, cat) space; the query vector points in the same direction as doc 2, at angle θ from doc 1]

F (q, doc) = cosine similarity (q, doc)

Why Cosine?


Vector Space Model

Vocabulary V: w_1, w_2, w_3 ……… w_n

Dimension = n = vocabulary size

Document doc_i: d_i1, d_i2, d_i3 ……… d_in, where all d_ij ∈ V

Query q: q_1, q_2, q_3 ……… q_n. Same dimensional space!!!


Doc1: ……Cat……dog……cat……

Doc2: ……Cat……dog

Doc3: ……snake……

Query: dog cat

Try!
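Trying it with raw term counts on the (dog, cat) dimensions reproduces the picture above; a sketch:

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# (dog, cat) term counts for the three docs and the query "dog cat"
doc1, doc2, doc3, query = (1, 2), (1, 1), (0, 0), (1, 1)
for name, d in [("doc1", doc1), ("doc2", doc2), ("doc3", doc3)]:
    print(name, round(cosine(query, d), 3))
# doc2 = 1.0 (same direction as the query), doc1 ≈ 0.949, doc3 = 0.0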


Term weighting

Doc [ 0.42 0.11 0.34 0.13 ]: these are weights, but how are they computed?

TF * IDF

Inverse document frequency: IDF = 1 + log(N/k)

N = total number of docs in the collection
k = total number of docs containing word w

Term frequency: TF = freq(w, doc) / |doc|

Or…


More TF

Weighting is very important for the retrieval model!

We can improve TF by…

e.g. replace freq(term, doc) with log[freq(term, doc)]

  • BM25: TF saturation plus document length normalization (see the sketch below)
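A sketch of BM25's term-frequency component as commonly stated (k1 and b are free parameters; this is only the TF part, which is multiplied by an IDF factor per term):

def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Saturating TF with document length normalization."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))

for tf in [1, 2, 5, 20]:
    print(tf, round(bm25_tf(tf, doc_len=100, avg_doc_len=100), 3))
# 1.0, 1.375, 1.774, 2.075: each extra occurrence adds less and less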

Vector Space Model

But…

Bag-of-words assumption = words are independent!

Query = document? Maybe not true!

Vectors and SEO (Search Engine Optimization)…

Synonyms? Semantically related words?


How about these: TF, IDF, normalization, plus parameters…

Pivoted Normalization Method

Dirichlet Prior Method


Language model

Probability distribution over words

P (I love you) = 0.01

P (you love I) = 0.00001

P (love you I) = 0.0000001

If we have this information… we could build a generative model!

P(text | θ)


Language model - unigram

Generate text with the bag-of-words assumption (words are independent):

P (w1,w2,…wn) = P(w1) P(w2)…P(wn)

topic X = ???

[Word cloud for topic X: food, orange, desk, USB, computer, Apple, Unix, …, milk, sport, superbowl]


Doc: I’m using Mac computer… remote access another computer… share some USB device…

P(Doc | topic1) vs. P(Doc | topic2)

topic 1

topic 2

[Two document word clouds: food, orange, desk, USB, computer, Apple, Unix, …, milk, yogurt, iPad, NBA, sport, superbowl, NHL, score, information, unix, USB]


[Two word clouds: king, ghost, hamlet, play, romeo, juliet, … vs. iPad, iPhone 4S, TV, apple, play store, ……]


How to estimate???

topic X

[Word cloud: food, orange, desk, USB, computer, Apple, Unix, …, milk, sport, superbowl, with example estimates 10/10000, 1000/10000, 30/10000]

P("computer" | topic X): estimate by counting, if we have enough data, i.e. docs about topic X.


query: sport game watch

P(query | doc 1) vs. P(query | doc 2)

doc 1

doc 2

[Two document word clouds: food, orange, desk, USB, computer, Apple, Unix, …, milk, yogurt, iPad, NBA, sport, superbowl, NHL, score, information, unix, USB]


Given a document doc:

query likelihood: P(query | doc)

query term likelihood: P(q_i | doc)

Retrieval problem → query likelihood → term likelihood P(q_i | doc)

But a document is only a small sample of its topic… the data is sparse:

Smoothing!


Smoothing

P(q_i | doc): what if q_i is not observed in doc? Is P(q_i | doc) = 0?

We want to give this a non-zero score!!!

We can make it better!


Smoothing

  • First, it addresses the data sparseness problem. As a document is only a very small sample, the probability P (qi | Doc) could be zero for those unseen words (Zhai & Lafferty, 2004).
  • Second, smoothing helps to model the background (non-discriminative) words in the query.

Improve language model estimation by using Smoothing


Smoothing

  • Another smoothing method: use P(w | doc) if the word exists in doc, and back off to P(w | collection), the Collection Language Model, if it does not.

Interpolating the two (Jelinek-Mercer smoothing):

P(w | θ_doc) = (1 − λ) · P(w | doc) + λ · P(w | collection)
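A minimal query-likelihood scorer using this interpolation (a sketch; the tokenization and the λ value are assumptions, not the lecture's code):

import math
from collections import Counter

def query_log_likelihood(query, doc, collection, lam=0.5):
    """log P(query | doc) with P(w|doc) smoothed by the collection model."""
    doc_counts = Counter(doc)
    coll_counts = Counter(w for d in collection for w in d)
    coll_len = sum(coll_counts.values())
    score = 0.0
    for w in query:
        p_doc = doc_counts[w] / len(doc)       # maximum likelihood in the doc
        p_coll = coll_counts[w] / coll_len     # collection language model
        score += math.log((1 - lam) * p_doc + lam * p_coll)
    return score

docs = [["i", "love", "my", "cat"],
        ["this", "cat", "is", "lovely"],
        ["yellow", "cat", "and", "white", "cat"]]
for i, d in enumerate(docs, 1):
    print("doc", i, round(query_log_likelihood(["yellow", "cat"], d, docs), 3))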


Smoothing

  • We could use the collection language model. The resulting scoring function decomposes into components that act like term frequency, IDF, and document length normalization.

TF-IDF is closely related to the Language Model and other retrieval models.


Language model

  • Solid statistical foundation
  • Flexible parameter setting
  • Different smoothing methods

Language model in library?

  • If we have a paper… and a query…

Similarity(paper, query): the Vector Space Model

If query word not in the paper…

Score = 0

If we use language model…


Language model in library?

  • Likelihood of query given a paper can be estimated by:

P(query | θ) = α·P(query | paper) + β·P(query | author) + γ·P(query | journal) + ……

Likelihood of query given a paper & author & journal & ……


e.g. what’s the difference between web and doc retrieval???

F (doc, query)

vs

F (web page, query)

web page = doc + hyperlink + domain info + anchor text + metadata + …

Can you use those to improve system performance???


[Figure: topic trends over time: current interest vs. historical interest]

  • Diminishing topic
  • Hot topic
  • Regular topic

"Obama", Nov 5th 2008, after the election

Win

Create history

First black president

Wiki:Barack_Obama; Wiki:Election; win; success;

Wiki:President_of_the_United_States

Wiki:African_American; President

World; America; victory; record; first;

president; 44th; History; Wiki:Victory_Records; Entity:first_black_president;

Entity:first_black_president; Celebrate; black; african;

Wiki:Colin_Powell; Wiki:Secretary_of_State

Wiki:United_States

Wiki:Sarah_Palin; sarah; palin; hillary

Secret; Wiki:Hillary_Rodham_Clinton

Clinton; newsweek; club; cloth

Knowledge Retrieval System

How to represent knowledge?

How to help users propose knowledge-based queries?

Matching

Knowledge Representation

Query

Knowledge within Scientific Literature

Knowledge-based Information Need

How to match between the two?

Query Recommendation & Feedback

Query

Feedback

Query Recommendation

Evaluation – Domain Knowledge Generation

GOOD! but not PERFECT…

F measure comparison for Supervised Learning and Semi-Supervised Learning

Knowledge comes from…

System? Machine learning, but… modest performance…

User? No way! Very high cost! Authors won't contribute…

System + User? Possible!


WikiBackyard

Trigger: 1. The wiki page improves; 2. The machine learning model improves; 3. All other wiki pages improve; 4. The KR index improves!

Edit

ScholarWiki


Knowledge retrieval for scholarly publications…

  • Knowledge from paper
  • Knowledge from user
    • Knowledge feedback
    • Knowledge recommendation
  • Knowledge from User vs. from Machine learning
  • ScholarWiki (user) + WikiBackyard (machine)

Full text citation analysis

With further study of citation analysis, increasing numbers of researchers have come to doubt the reasonableness of assuming that the raw number of citations reflects an article’s influence (MacRoberts & MacRoberts 1996). On the other hand, full-text analysis has to some extent compensated for the weaknesses of citation counts and has offered new opportunities for citation analysis.

Content of each node?

Motivation of each citation?


With further study of citation analysis, increasing numbers of researchers have come to doubt the reasonableness of assuming that the raw number of citations reflects an article’s influence (MacRoberts & MacRoberts 1996). On the other hand, full-text analysis has to some extent compensated for the weaknesses of citation counts and has offered new opportunities for citation analysis.

Every word @ Citation Context will VOTE!!

Motivation? Topic? Reason??? Left and Right N words??

N = ??????????


With further study of citation analysis, increasing numbers of researchers have come to doubt the reasonableness of assuming that the raw number of citations reflects an article's influence (MacRoberts & MacRoberts 1996). On the other hand, full-text analysis has to some extent compensated for the weaknesses of citation counts and has offered new opportunities for citation analysis.

A word's influence decays with its distance from the citation!!!

Closer words make a more significant contribution!!
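A sketch of this decaying vote, with an exponential decay chosen purely for illustration (the window size and decay rate are assumptions):

def citation_context_votes(tokens, cite_pos, window=10, decay=0.8):
    """Words near a citation vote for it; weight decays with distance."""
    votes = {}
    lo = max(0, cite_pos - window)
    hi = min(len(tokens), cite_pos + window + 1)
    for i in range(lo, hi):
        if i == cite_pos:
            continue
        w = decay ** abs(i - cite_pos)
        votes[tokens[i]] = votes.get(tokens[i], 0.0) + w
    return votes

sent = ("researchers doubt that raw citation counts reflect influence "
        "CITE full text analysis compensates for the weaknesses").split()
print(citation_context_votes(sent, sent.index("CITE"), window=4))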


How about language model? Each node and edge represented by a language model?

High dimensional space! Word difference?


Topic modeling – each node is represented by a topic distribution (Prior Distribution); each edge is represented by a topic distribution (Transitioning Probability Distribution)


Supervised topic modeling

Each topic has a label (YES! We can interpret each topic)

We DO KNOW the total number of topics

Each paper is a mixture (a probability distribution) over the author-given keywords.


Each paper: p_key_i(paper) = p(z_key_i | abstract, title)

With further study of citation analysis, increasing numbers of researchers have come to doubt the reasonableness of assuming that the raw number of citations reflects an article's influence (MacRoberts & MacRoberts 1996). On the other hand, full-text analysis has to some extent compensated for the weaknesses of citation counts and has offered new opportunities for citation analysis.


Paper importance

Domain credit: 100, shared evenly by 4 publications (pub 1 … pub 4): 25 each.

If we have 3 topics (keywords): key1, key2, key3

P(key1 | text) = 0.6
P(key2 | text) = 0.15
P(key3 | text) = 0.25

Key1-Pub1 credit: 25 * 0.6 = 15

[Diagram: two citation edges point into pub 1, carrying key1 with weights 0.8 and 0.2]

P(key1 | citation) = 0.8
P(key2 | citation) = 0.1
P(key3 | citation) = 0.1

Key1-Citation1 credit: 25 * 0.6 * [0.8/(0.8+0.2)] = 12

Evenly share the credits?

A citation is important if: 1. the citation focuses on an important topic; 2. other citations focus on other topics.
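The arithmetic behind the slide's numbers, as a tiny worked script:

# 4 publications evenly share the domain credit of 100
domain_credit, n_pubs = 100, 4
pub_credit = domain_credit / n_pubs                 # 25 per publication

# pub 1's text is about key1 with probability 0.6
key1_pub1 = pub_credit * 0.6                        # 25 * 0.6 = 15

# two citations carry key1 with weights 0.8 and 0.2;
# citation 1 receives its normalized share
key1_citation1 = key1_pub1 * 0.8 / (0.8 + 0.2)      # 15 * 0.8 = 12
print(pub_credit, key1_pub1, key1_citation1)        # 25.0 15.0 12.0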


Paper importance

Domain credit: 100; if we have 3 keywords (key1, key2, key3), each of the 4 publications starts with a credit vector [25, 25, 25] over the keywords.

Key1-Pub1 credit: 25 * 0.6

Key1-Citation1 credit: 25 * 0.6 * [0.8/(0.8+0.2)]

[Diagram: citation edges (weights 0.8, 0.2) propagate credit among pub 1 … pub 4; after propagation the per-keyword credit vectors differ, e.g. [27, 27, 26] and [29, 26, 28]]

This yields a domain publication ranking, a domain keyword topical ranking, and a topical citation tree.

The citation count between a paper pair is IMPORTANT!


Different citations make different contributions to different topics (keywords) of the citing publication.


Citation transitioning topic prior

Publication/venue/author topic prior


Literature Review: Citation Recommendation

Input: Paper Abstract

Output: A list of ranked citations

MAP and NDCG evaluation


Given a paper abstract:

Word level match (language model)

Topic level match (KL-Divergence)

Topic importance

Use Inference Network to integrate each hypothesis


Citation Recommendation

Inference Network

Publication Topical Prior

Topic match

Content Match

PageRank

Full-text PageRank (greedy match)

Full-text PageRank (topic modeling)


Input: a paper abstract. Output: ranked citations with graded relevance:

[3] YES 3
[2] YES 2
[6] NO 0
[8] NO 0
[10] YES 1
[1] NO 0
……

MAP (cite or not?)

NDCG (important citation?)
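A sketch of both metrics on exactly this ranked output (one common DCG variant; the grades are the relevance levels 3/2/1/0 above):

import math

def average_precision(grades):
    """MAP ingredient: binary relevance, relevant iff grade > 0."""
    hits, total = 0, 0.0
    for i, g in enumerate(grades, start=1):
        if g > 0:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0

def ndcg(grades):
    """Graded relevance: discounted gain vs. the ideal reordering."""
    dcg = sum(g / math.log2(i + 1) for i, g in enumerate(grades, 1))
    ideal = sum(g / math.log2(i + 1)
                for i, g in enumerate(sorted(grades, reverse=True), 1))
    return dcg / ideal if ideal else 0.0

grades = [3, 2, 0, 0, 1, 0]                  # the ranked list above
print(round(average_precision(grades), 3))   # ≈ 0.867
print(round(ndcg(grades), 3))                # ≈ 0.976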


Based on topic inference, 30 seconds

Based on greedy match, 1 second


CONCLUSION

  • Information Retrieval
    • Index
    • Retrieval Model
    • Ranking
    • User feedback
    • Evaluation
  • Knowledge Retrieval
    • Machine Learning
    • User Knowledge
    • Integration
    • Social Network Analysis