
Information Retrieval to Knowledge Retrieval, one more step

Xiaozhong Liu

Assistant Professor

School of Library and Information Science

Indiana University Bloomington


What is Information?

What is Retrieval?

What is Information Retrieval?


Search for something based on the User's Information Need!!

How to express your information need?

Query


User Information Need!!

What is a good query?

What is a bad query?

Good query: query ≈ information need

Bad query: query ≠ information need

Query

Wait!!! The user NEVER makes mistakes!!!

It’s OUR job!!!


Task 1: Given a user's information need, how do we help the user (or automatically help them) propose a better query?

If there is a query…

Perfect query:

User input query:


User Information Need!!

What are good results?

What are bad results?

Given a query, how do we retrieve results?

Query

Results


Task 2: Given a (not perfect) query, how do we retrieve documents from the collection?

F(query, doc)

Very Large, Unstructured

Text Data!!!

Can you give me an example?


F(query, doc):

If the query term exists in the doc: yes, this is a result.

If the query term does NOT exist in the doc: no, this is not a result.
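A minimal Python sketch of this exact-match F(query, doc); the naive tokenizer and names are illustrative, not the lecture's code:

import re

def exact_match(query, doc):
    """F(query, doc): a result iff every query term occurs in the doc."""
    doc_terms = set(re.findall(r"\w+", doc.lower()))
    return all(t in doc_terms for t in re.findall(r"\w+", query.lower()))

print(exact_match("cat", "I love my cat."))              # True
print(exact_match("lovely cat", "This cat is lovely!"))  # True
print(exact_match("kitten", "I love my cat."))           # False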

Is there any problem in this function?

Brainstorm…


Query: Obama’s wife

Doc 1. My wife supports Obama’s new policy on…

Doc 2. Michelle, as the first lady of the United States…

Yes, this is a very challenging task!


Another problem

Collection size: 5 billion

Matching docs: 5

My algorithm successfully finds all 5 docs! In… 3 billion results…


User Information Need!!

How do we help the user find what they need among all the retrieved results?

Query

Results


Task 3: Given the retrieved results, how do we help the user find what they need?

If the retrieval algorithm retrieves 1 billion results from the collection, what will you do???

Search with Google and click "next"???

Yes, we can help users find what they need!


Query: Indiana University Bloomington

Can you read the results one by one?

Would you actually use it??


[Diagram: the User, with an Information Need, issues a Query to the System, which returns Results (steps 1, 2, 3)]


[Diagram: Information Retrieval at the center, applied to many media: Text, Map, Image, Music, ……]


[Diagram: Information Retrieval at the center; Text expands into web, scholar, document, blog, news; other media: Map, Image, Music, ……]

Documents vs. Database Records
  • Relational database records are typically made up of well-defined fields

Select * from students where GPA > 2.5

Can we handle text the same way? Find all the docs containing "Xiaozhong":

Select * from documents where text like ‘%xiaozhong%’

We need a more effective way to index the text!


Collection C: doc_1, doc_2, doc_3 ……… doc_N

Vocabulary V: w_1, w_2, w_3 ……… w_n

Document doc_i: d_i1, d_i2, d_i3 ……… d_im, where all d_ij ∈ V

Query q: q_1, q_2, q_3 ……… q_t, where q_x is a query term


Collection C: doc_1, doc_2, doc_3 ……… doc_N

V:        w_1  w_2  w_3  …  w_n
Doc_1      1    0    0       1
Doc_2      0    0    0       1
Doc_3      1    1    1       1
………
Doc_N      1    0    1       1
Query q:   0    1    0   …


Collection C: doc_1, doc_2, doc_3 ……… doc_N

Normalization is very important!

V:        w_1  w_2  w_3  …  w_n
Doc_1      3    0    0       9
Doc_2      0    0    0       7
Doc_3      2   11   21       1
………
Doc_N      7    0    1       2
Query q:   0    3    0   …


Collection C: doc_1, doc_2, doc_3 ……… doc_N

Normalization is very important!

V:        w_1   w_2   w_3  …  w_n     (weights)
Doc_1     0.41  0     0       0.62
Doc_2     0     0     0       0.12
Doc_3     0.42  0.11  0.34    0.13
………
Doc_N     0.01  0     0.19    0.24
Query q:  0     0.37  0    …


Term weighting

TF * IDF

Inverse document frequency: IDF = 1 + log(N/k)

N = total number of docs in the collection
k = total number of docs containing word w

Term frequency: TF = freq(w, doc) / |doc|

Or…

An effective way to weight each word in a document.
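A small sketch of this weighting scheme, using exactly the TF and IDF formulas above (function and variable names are illustrative):

import math
from collections import Counter

def tf_idf_vector(doc_tokens, collection):
    """Weight each word in a doc by TF * IDF:
       TF  = freq(w, doc) / |doc|
       IDF = 1 + log(N / k), where k docs (out of N) contain w."""
    N = len(collection)
    counts = Counter(doc_tokens)
    weights = {}
    for w, f in counts.items():
        k = sum(1 for d in collection if w in d)   # document frequency
        weights[w] = (f / len(doc_tokens)) * (1 + math.log(N / k))
    return weights

docs = [["i", "love", "my", "cat"],
        ["this", "cat", "is", "lovely"],
        ["yellow", "cat", "and", "white", "cat"]]
print(tf_idf_vector(docs[2], docs))  # "cat" is frequent, but occurs in every doc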


[Diagram: the Index at the center, with surrounding concerns: Retrieval Model? Ranking? Speed? Semantics? Space?]

The document representation must meet the requirements of the retrieval system.


Stemming

Education, Educate, Educational, Educating, Educations → Educat

Very effective for improving system performance.

Some risk! E.g. LA Lakers = LA Lake?
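For instance, NLTK's Porter stemmer collapses these variants (exact stems depend on the algorithm; this is just one common choice):

from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
for w in ["education", "educate", "educational", "educating", "educations"]:
    print(w, "->", stemmer.stem(w))   # all variants collapse to one stem

# The risk: stemming can conflate distinct meanings,
# e.g. a team name and a body of water may end up looking alike.
print(stemmer.stem("lakers"), stemmer.stem("lake"))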


Inverted index

Doc 1: I love my cat.

Doc 2: This cat is lovely!

Doc 3: Yellow cat and white cat.

Tokens: i, love, my, cat, this, is, lovely, yellow, and, white

Index terms (after stemming and stopword removal): i, love, cat, thi, yellow, white

i - 1

love - 1, 2

thi - 2

cat - 1, 2, 3

yellow - 3

white - 3

Do we lose something?


Inverted index

Doc 1: I love my cat.

Doc 2: This cat is lovely!

Doc 3: Yellow cat and white cat.

i - 1

love - 1, 2

thi - 2

cat - 1, 2, 3

yellow - 3

white - 3

i – 1:1

love – 1:1, 2:1

thi – 2:1

cat – 1:1, 2:1, 3:2

yellow – 3:1

white – 3:1

Do we still lose something?


Inverted index

Doc 1: I love my cat.

Doc 2: This cat is lovely!

Doc 3: Yellow cat and white cat.

i – 1:1

love – 1:1, 2:1

thi – 2:1

cat – 1:1, 2:1, 3:2

yellow – 3:1

white – 3:1

i – 1:1

love – 1:2, 2:4

thi – 2:1

cat – 1:4, 2:2, 3:2, 3:5

yellow – 3:1

white – 3:4

Why do you need position info?
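A compact sketch of a positional inverted index (no stemming or stopword removal, for brevity):

import re
from collections import defaultdict

def build_positional_index(docs):
    """term -> list of (doc_id, position), positions counted from 1."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs, start=1):
        for pos, term in enumerate(re.findall(r"\w+", text.lower()), start=1):
            index[term].append((doc_id, pos))
    return index

docs = ["I love my cat.", "This cat is lovely!", "Yellow cat and white cat."]
index = build_positional_index(docs)
print(index["cat"])   # [(1, 4), (2, 2), (3, 2), (3, 5)]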


Proximity of query terms

query: information retrieval

Doc 1: information retrieval is important for digital library.

Doc 2: I need some information about the dogs, my favorite is golden retriever.


Index – bag of words

query: information retrieval

Doc 1: information retrieval is important for digital library.

Doc 2: I need some information about the dogs, my favorite is golden retriever.

What’s the limitation of bag-of-words? Can we make it better?

n-gram:

Doc 1: information retrieval, retrieval is, is important, important for ……

bi-gram

Better semantic representation!

What’s the limitation?
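Generating the n-grams themselves is simple; a sketch:

def ngrams(tokens, n=2):
    """Consecutive n-word index terms (n=2 gives the bi-grams above)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

doc1 = "information retrieval is important for digital library".split()
print(ngrams(doc1))
# ['information retrieval', 'retrieval is', 'is important', ...]
# One limitation: the number of distinct index terms explodes as n grows.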


Index – bag of “phrase”?

Doc 1: …… big apple ……

Doc 2: …… apple ……

More precision, less ambiguous

How to identify phrases from documents?

  • Identify syntactic phrases using POS tagging
  • n-grams
  • From existing resources


Noise detection

What is the noise of a web page? Non-informative content…


Web Crawler - freshness

The Web is changing, but we cannot constantly check all the pages…

We need to find the most important pages and those that change frequently.

www.nba.com

www.iub.edu

www.restaurant????.com

Sitemap: a list of URLs for each host, with modification times and change frequencies.


Model

Mathematical modeling is frequently used with the objective to understand, explain, reason and predict behavior or phenomenon in the real world (Hiemstra, 2001).

e.g. some models help you predict tomorrow's stock price…


Vector Space Model

Hypothesis:

Retrieval and ranking problem = Similarity Problem!

Is that a good hypothesis? Why?

Retrieval Function: Similarity (query, Document)

Return a score!!! We can Rank the documents!!!


Vector Space Model

So, a query is just a short document.


Collection C: doc_1, doc_2, doc_3 ……… doc_N

V:        w_1   w_2   w_3  …  w_n
Doc_1     0.41  0     0       0.62
Doc_2     0     0     0       0.12
Doc_3     0.42  0.11  0.34    0.13
………
Doc_N     0.01  0     0.19    0.24
Query q:  0     0.37  0    …


Collection C: doc_1, doc_2, doc_3 ……… doc_N

V:        w_1   w_2   w_3  …  w_n
Doc_1     0.41  0     0       0.62   ← doc vector
Doc_2     0     0     0       0.12
Doc_3     0.42  0.11  0.34    0.13
………
Doc_N     0.01  0     0.19    0.24
Query q:  0     0.37  0    …         ← query vector

Similarity(query vector, doc vector)


Doc1: ……Cat……dog……cat……

Doc2: ……Cat……dog

Doc3: ……snake……

Query: dog cat

[Plot: docs as vectors in the 2-D (dog, cat) space: doc 1 at (1, 2), doc 2 at (1, 1), doc 3 at the origin]


Doc1: ……Cat……dog……cat……

Doc2: ……Cat……dog

Doc3: ……snake……

Query: dog cat

[Plot: the same (dog, cat) space; the query vector points in the same direction as doc 2, at angle θ from doc 1]

F (q, doc) = cosine similarity (q, doc)

Why Cosine?


Vector Space Model

Vocabulary V: w_1, w_2, w_3 ……… w_n

Dimension = n = vocabulary size

Document doc_i: d_i1, d_i2, d_i3 ……… d_in, where all d_ij ∈ V

Query q: q_1, q_2, q_3 ……… q_n. Same dimensional space!!!


Doc1: ……Cat……dog……cat……

Doc2: ……Cat……dog

Doc3: ……snake……

Query: dog cat

Try!
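Trying it with raw term counts on the (dog, cat) dimensions reproduces the picture above; a sketch:

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# (dog, cat) term counts for the three docs and the query "dog cat"
doc1, doc2, doc3, query = (1, 2), (1, 1), (0, 0), (1, 1)
for name, d in [("doc1", doc1), ("doc2", doc2), ("doc3", doc3)]:
    print(name, round(cosine(query, d), 3))
# doc2 = 1.0 (same direction as the query), doc1 ≈ 0.949, doc3 = 0.0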


Term weighting

Doc [ 0.42 0.11 0.34 0.13 ]: these are weights, but how are they computed?

TF * IDF

Inverse document frequency: IDF = 1 + log(N/k)

N = total number of docs in the collection
k = total number of docs containing word w

Term frequency: TF = freq(w, doc) / |doc|

Or…


More TF

Weighting is very important for the retrieval model!

We can improve TF by…

e.g. replace freq(term, doc) with log[freq(term, doc)]

  • BM25: TF saturation plus document length normalization (see the sketch below)
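A sketch of BM25's term-frequency component as commonly stated (k1 and b are free parameters; this is only the TF part, which is multiplied by an IDF factor per term):

def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Saturating TF with document length normalization."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))

for tf in [1, 2, 5, 20]:
    print(tf, round(bm25_tf(tf, doc_len=100, avg_doc_len=100), 3))
# 1.0, 1.375, 1.774, 2.075: each extra occurrence adds less and less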

Vector Space Model

But…

Bag-of-words assumption = words are independent!

Query = document? Maybe not true!

Vectors and SEO (Search Engine Optimization)…

Synonyms? Semantically related words?


How about these: TF, IDF, normalization, plus parameters…

Pivoted Normalization Method

Dirichlet Prior Method


Language model

Probability distribution over words

P (I love you) = 0.01

P (you love I) = 0.00001

P (love you I) = 0.0000001

If we have this information… we could build a generative model!

P(text | θ)


Language model - unigram

Generate text with the bag-of-words assumption (words are independent):

P (w1,w2,…wn) = P(w1) P(w2)…P(wn)

topic X = ???

[Word cloud for topic X: food, orange, desk, USB, computer, Apple, Unix, …, milk, sport, superbowl]


Doc: I’m using Mac computer… remote access another computer… share some USB device…

P(Doc | topic1) vs. P(Doc | topic2)

topic 1

topic 2

[Two document word clouds: food, orange, desk, USB, computer, Apple, Unix, …, milk, yogurt, iPad, NBA, sport, superbowl, NHL, score, information, unix, USB]


[Two word clouds: king, ghost, hamlet, play, romeo, juliet, … vs. iPad, iPhone 4S, TV, apple, play store, ……]


How to estimate???

topic X

[Word cloud: food, orange, desk, USB, computer, Apple, Unix, …, milk, sport, superbowl, with example estimates 10/10000, 1000/10000, 30/10000]

P("computer" | topic X): estimate by counting, if we have enough data, i.e. docs about topic X.


query: sport game watch

P(query | doc 1) vs. P(query | doc 2)

doc 1

doc 2

[Two document word clouds: food, orange, desk, USB, computer, Apple, Unix, …, milk, yogurt, iPad, NBA, sport, superbowl, NHL, score, information, unix, USB]


Given a document doc:

query likelihood: P(query | doc)

query term likelihood: P(q_i | doc)

Retrieval problem → query likelihood → term likelihood P(q_i | doc)

But a document is only a small sample of its topic… the data is sparse:

Smoothing!


Smoothing

P(q_i | doc): what if q_i is not observed in doc? Is P(q_i | doc) = 0?

We want to give this a non-zero score!!!

We can make it better!


Smoothing

  • First, it addresses the data sparseness problem. As a document is only a very small sample, the probability P (qi | Doc) could be zero for those unseen words (Zhai & Lafferty, 2004).
  • Second, smoothing helps to model the background (non-discriminative) words in the query.

Improve language model estimation by using Smoothing


Smoothing

  • Another smoothing method: use P(w | doc) if the word exists in doc, and back off to P(w | collection), the Collection Language Model, if it does not.

Interpolating the two (Jelinek-Mercer smoothing):

P(w | θ_doc) = (1 − λ) · P(w | doc) + λ · P(w | collection)
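A minimal query-likelihood scorer using this interpolation (a sketch; the tokenization and the λ value are assumptions, not the lecture's code):

import math
from collections import Counter

def query_log_likelihood(query, doc, collection, lam=0.5):
    """log P(query | doc) with P(w|doc) smoothed by the collection model."""
    doc_counts = Counter(doc)
    coll_counts = Counter(w for d in collection for w in d)
    coll_len = sum(coll_counts.values())
    score = 0.0
    for w in query:
        p_doc = doc_counts[w] / len(doc)       # maximum likelihood in the doc
        p_coll = coll_counts[w] / coll_len     # collection language model
        score += math.log((1 - lam) * p_doc + lam * p_coll)
    return score

docs = [["i", "love", "my", "cat"],
        ["this", "cat", "is", "lovely"],
        ["yellow", "cat", "and", "white", "cat"]]
for i, d in enumerate(docs, 1):
    print("doc", i, round(query_log_likelihood(["yellow", "cat"], d, docs), 3))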


Smoothing

  • We could use the collection language model. The resulting scoring function decomposes into components that act like term frequency, IDF, and document length normalization.

TF-IDF is closely related to the Language Model and other retrieval models.


Language model

  • Solid statistical foundation
  • Flexible parameter setting
  • Different smoothing methods

Language model in library?

  • If we have a paper… and a query…

Similarity(paper, query): the Vector Space Model

If query word not in the paper…

Score = 0

If we use language model…


Language model in library?

  • Likelihood of query given a paper can be estimated by:

P(query | θ) = α·P(query | paper) + β·P(query | author) + γ·P(query | journal) + ……

Likelihood of query given a paper & author & journal & ……


e.g. what’s the difference between web and doc retrieval???

F (doc, query)

vs

F (web page, query)

web page = doc + hyperlink + domain info + anchor text + metadata + …

Can you use those to improve system performance???


[Figure: topic trends over time: current interest vs. historical interest]

  • Diminishing topic
  • Hot topic
  • Regular topic

"Obama", Nov 5th 2008, after the election

Win

Create history

First black president

Wiki:Barack_Obama; Wiki:Election; win; success;

Wiki:President_of_the_United_States

Wiki:African_American; President

World; America; victory; record; first;

president; 44th; History; Wiki:Victory_Records; Entity:first_black_president;

Entity:first_black_president; Celebrate; black; african;

Wiki:Colin_Powell; Wiki:Secretary_of_State

Wiki:United_States

Wiki:Sarah_Palin; sarah; palin; hillary

Secret; Wiki:Hillary_Rodham_Clinton

Clinton; newsweek; club; cloth

Knowledge Retrieval System

How to represent knowledge?

How to help users propose knowledge-based queries?

Matching

Knowledge Representation

Query

Knowledge within Scientific Literature

Knowledge-based Information Need

How to match between the two?

Query Recommendation & Feedback

Query

Feedback

Query Recommendation

Evaluation – Domain Knowledge Generation

GOOD! but not PERFECT…

F measure comparison for Supervised Learning and Semi-Supervised Learning

Knowledge comes from…

System? Machine learning, but… modest performance…

User? No way! Very high cost! Authors won't contribute…

System + User? Possible!


WikiBackyard

Trigger: 1. The wiki page improves; 2. The machine learning model improves; 3. All other wiki pages improve; 4. The KR index improves!

Edit

ScholarWiki


Knowledge retrieval for scholarly publications…

  • Knowledge from paper
  • Knowledge from user
    • Knowledge feedback
    • Knowledge recommendation
  • Knowledge from User vs. from Machine learning
  • ScholarWiki (user) + WikiBackyard (machine)

Full text citation analysis

With further study of citation analysis, increasing numbers of researchers have come to doubt the reasonableness of assuming that the raw number of citations reflects an article’s influence (MacRoberts & MacRoberts 1996). On the other hand, full-text analysis has to some extent compensated for the weaknesses of citation counts and has offered new opportunities for citation analysis.

Content of each node?

Motivation of each citation?


With further study of citation analysis, increasing numbers of researchers have come to doubt the reasonableness of assuming that the raw number of citations reflects an article’s influence (MacRoberts & MacRoberts 1996). On the other hand, full-text analysis has to some extent compensated for the weaknesses of citation counts and has offered new opportunities for citation analysis.

Every word @ Citation Context will VOTE!!

Motivation? Topic? Reason??? Left and Right N words??

N = ??????????


With further study of citation analysis, increasing numbers of researchers have come to doubt the reasonableness of assuming that the raw number of citations reflects an article's influence (MacRoberts & MacRoberts 1996). On the other hand, full-text analysis has to some extent compensated for the weaknesses of citation counts and has offered new opportunities for citation analysis.

A word's influence decays with its distance from the citation!!!

Closer words make a more significant contribution!!
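A sketch of this decaying vote, with an exponential decay chosen purely for illustration (the window size and decay rate are assumptions):

def citation_context_votes(tokens, cite_pos, window=10, decay=0.8):
    """Words near a citation vote for it; weight decays with distance."""
    votes = {}
    lo = max(0, cite_pos - window)
    hi = min(len(tokens), cite_pos + window + 1)
    for i in range(lo, hi):
        if i == cite_pos:
            continue
        w = decay ** abs(i - cite_pos)
        votes[tokens[i]] = votes.get(tokens[i], 0.0) + w
    return votes

sent = ("researchers doubt that raw citation counts reflect influence "
        "CITE full text analysis compensates for the weaknesses").split()
print(citation_context_votes(sent, sent.index("CITE"), window=4))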


How about language model? Each node and edge represented by a language model?

High dimensional space! Word difference?


Topic modeling – each node is represented by a topic distribution (Prior Distribution); each edge is represented by a topic distribution (Transitioning Probability Distribution)


Supervised topic modeling

Each topic has a label (YES! We can interpret each topic)

We DO KNOW the total number of topics

Each paper is a mixture (a probability distribution) over the author-given keywords.


Each paper: p_key_i(paper) = p(z_key_i | abstract, title)

With further study of citation analysis, increasing numbers of researchers have come to doubt the reasonableness of assuming that the raw number of citations reflects an article's influence (MacRoberts & MacRoberts 1996). On the other hand, full-text analysis has to some extent compensated for the weaknesses of citation counts and has offered new opportunities for citation analysis.


Paper importance

Domain credit: 100, shared evenly by 4 publications (pub 1 … pub 4): 25 each.

If we have 3 topics (keywords): key1, key2, key3

P(key1 | text) = 0.6
P(key2 | text) = 0.15
P(key3 | text) = 0.25

Key1-Pub1 credit: 25 * 0.6 = 15

[Diagram: two citation edges point into pub 1, carrying key1 with weights 0.8 and 0.2]

P(key1 | citation) = 0.8
P(key2 | citation) = 0.1
P(key3 | citation) = 0.1

Key1-Citation1 credit: 25 * 0.6 * [0.8/(0.8+0.2)] = 12

Evenly share the credits?

A citation is important if: 1. the citation focuses on an important topic; 2. other citations focus on other topics.
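The arithmetic behind the slide's numbers, as a tiny worked script:

# 4 publications evenly share the domain credit of 100
domain_credit, n_pubs = 100, 4
pub_credit = domain_credit / n_pubs                 # 25 per publication

# pub 1's text is about key1 with probability 0.6
key1_pub1 = pub_credit * 0.6                        # 25 * 0.6 = 15

# two citations carry key1 with weights 0.8 and 0.2;
# citation 1 receives its normalized share
key1_citation1 = key1_pub1 * 0.8 / (0.8 + 0.2)      # 15 * 0.8 = 12
print(pub_credit, key1_pub1, key1_citation1)        # 25.0 15.0 12.0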


Paper importance

Domain credit: 100; if we have 3 keywords (key1, key2, key3), each of the 4 publications starts with a credit vector [25, 25, 25] over the keywords.

Key1-Pub1 credit: 25 * 0.6

Key1-Citation1 credit: 25 * 0.6 * [0.8/(0.8+0.2)]

[Diagram: citation edges (weights 0.8, 0.2) propagate credit among pub 1 … pub 4; after propagation the per-keyword credit vectors differ, e.g. [27, 27, 26] and [29, 26, 28]]

This yields a domain publication ranking, a domain keyword topical ranking, and a topical citation tree.

The citation count between a paper pair is IMPORTANT!


Different citations make different contributions to different topics (keywords) of the citing publication.


Citation transitioning topic prior

Publication/venue/author topic prior


Literature Review: Citation Recommendation

Input: Paper Abstract

Output: A list of ranked citations

MAP and NDCG evaluation


Given a paper abstract:

Word level match (language model)

Topic level match (KL-Divergence)

Topic importance

Use Inference Network to integrate each hypothesis


Citation Recommendation

Inference Network

Publication Topical Prior

Topic match

Content Match

PageRank

Full-text PageRank (greedy match)

Full-text PageRank (topic modeling)


Input: a paper abstract. Output: ranked citations with graded relevance:

[3] YES 3
[2] YES 2
[6] NO 0
[8] NO 0
[10] YES 1
[1] NO 0
……

MAP (cite or not?)

NDCG (important citation?)
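A sketch of both metrics on exactly this ranked output (one common DCG variant; the grades are the relevance levels 3/2/1/0 above):

import math

def average_precision(grades):
    """MAP ingredient: binary relevance, relevant iff grade > 0."""
    hits, total = 0, 0.0
    for i, g in enumerate(grades, start=1):
        if g > 0:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0

def ndcg(grades):
    """Graded relevance: discounted gain vs. the ideal reordering."""
    dcg = sum(g / math.log2(i + 1) for i, g in enumerate(grades, 1))
    ideal = sum(g / math.log2(i + 1)
                for i, g in enumerate(sorted(grades, reverse=True), 1))
    return dcg / ideal if ideal else 0.0

grades = [3, 2, 0, 0, 1, 0]                  # the ranked list above
print(round(average_precision(grades), 3))   # ≈ 0.867
print(round(ndcg(grades), 3))                # ≈ 0.976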


Based on topic inference, 30 seconds

Based on greedy match, 1 second


CONCLUSION

  • Information Retrieval
    • Index
    • Retrieval Model
    • Ranking
    • User feedback
    • Evaluation
  • Knowledge Retrieval
    • Machine Learning
    • User Knowledge
    • Integration
    • Social Network Analysis