
Dragon Star Program Course (龙星计划课程): Information Retrieval (信息检索) Course Overview & Background


Presentation Transcript


  1. Dragon Star Program Course (龙星计划课程): Information Retrieval (信息检索) Course Overview & Background ChengXiang Zhai (翟成祥), Department of Computer Science, Graduate School of Library & Information Science, Institute for Genomic Biology, Statistics, University of Illinois, Urbana-Champaign. http://www-faculty.cs.uiuc.edu/~czhai, czhai@cs.uiuc.edu

  2. Outline • Course overview • Essential background • Probability & statistics • Basic concepts in information theory • Natural language processing

  3. Course Overview

  4. Course Objectives • Introduce the field of information retrieval (IR) • Foundation: Basic concepts, principles, methods, etc • Trends: Frontier topics • Prepare students to do research in IR and/or related fields • Research methodology (general and IR-specific) • Research proposal writing • Research project (to be finished after the lecture period)

  5. Prerequisites • Proficiency in programming (C++ is needed for assignments) • Knowledge of basic probability & statistics (would be necessary for understanding algorithms deeply) • Big plus: knowledge of related areas • Machine learning • Natural language processing • Data mining • …

  6. Course Management • Teaching staff • Instructor: ChengXiang Zhai (UIUC) • Teaching assistants: • Hongfei Yan (Peking Univ) • Bo Peng (Peking Univ) • Course website: http://net.pku.edu.cn/~course/cs410/ • Course group discussion: http://groups.google.com/group/cs410pku • Questions: First post the questions on the group discussion forum; if questions are unanswered, bring them to the office hours (first office hour: June 23, 2:30-4:30pm)

  7. Format & Requirements • Lecture-based: • Morning lectures: Foundation & Trends • Afternoon lectures: IR research methodology • Readings are usually available online • 2 Assignments (based on morning lectures) • Coding (C++), experimenting with data, analyzing results, open explorations (~5 hours each) • Final exam (based on morning lectures): 1:30-4:30pm, June 30. • Practice questions will be available

  8. Format & Requirements (cont.) • Course project (Mini-TREC) • Work in teams • Phase I: create test collections (~ 3 hours, done within lecture period) • Phase II: develop algorithms and submit results (done in the summer) • Research project proposal (based on afternoon lectures) • Work in teams • 2-page outline done within lecture period • full proposal (5 pages) due later

  9. Coverage of Topics: IR vs. TIM • Text Information Management (TIM) and Information Retrieval (IR) overlap heavily; IR also covers multimedia, etc., beyond text • IR and TIM will be used interchangeably in this course

  10. What is Text Info. Management? • TIM is concerned with technologies for managing and exploiting text information effectively and efficiently • Importance of managing text information • The most natural way of encoding knowledge • Think about scientific literature • The most common type of information • How much textual information do you produce and consume every day? • The most basic form of information • It can be used to describe other media of information • The most useful form of information!

  11. Text Management Applications • Access: select information • Mining: create knowledge • Organization: add structure/annotations

  12. Examples of Text Management Applications • Search • Web search engines (Google, Yahoo, …) • Library systems • … • Recommendation • News filter • Literature/movie recommender • Categorization • Automatically sorting emails • … • Mining/Extraction • Discovering major complaints from email in customer service • Business intelligence • Bioinformatics • … • Many others…

  13. Elements of Text Info Management Technologies • Natural language content analysis of raw text underlies everything • Information Access: search, filtering → retrieval applications • Information Organization: categorization, clustering • Knowledge Acquisition: extraction, mining → mining applications, summarization, visualization • Focus of the course: primarily the information access / retrieval side of this picture

  14. Text Management and Other Areas • The stack: User → TM Applications → TM Algorithms → Text • On the user/application side: human-computer interaction, information science, computer science, software engineering, the Web • On the algorithm side: probabilistic inference, machine learning, natural language processing • On the text side: storage, compression

  15. Related Areas • Information Retrieval sits among several neighboring areas • Applications: Web, bioinformatics, library & information science, … • Models: machine learning, pattern recognition, data mining, statistics, optimization, natural language processing • Systems: databases, software engineering, computer systems, algorithms

  16. Publications/Societies (Incomplete) • Learning/Mining: ICML, NIPS, UAI, AAAI, ACM SIGKDD • Applications: WWW, ISMB, RECOMB, PSB • Information Science: ASIS, JCDL • Information Retrieval: ACM SIGIR, ACM CIKM, TREC, HLT • NLP: ACL, COLING, EMNLP, ANLP • Databases: ACM SIGMOD, VLDB, PODS, ICDE • Software/systems: SOSP, OSDI

  17. Schedule: available at http://net.pku.edu.cn/~course/cs410/

  18. Essential Background 1: Probability & Statistics

  19. Prob/Statistics & Text Management • Probability & statistics provide a principled way to quantify the uncertainties associated with natural language • Allow us to answer questions like: • Given that we observe “baseball” three times and “game” once in a news article, how likely is it about “sports”? (text categorization, information retrieval) • Given that a user is interested in sports news, how likely would the user use “baseball” in a query? (information retrieval)

  20. Basic Concepts in Probability • Random experiment: an experiment with uncertain outcome (e.g., tossing a coin, picking a word from text) • Sample space: all possible outcomes, e.g., tossing 2 fair coins, S = {HH, HT, TH, TT} • Event: E ⊆ S; E happens iff the outcome is in E, e.g., E = {HH} (all heads), E = {HH, TT} (same face); impossible event ({}), certain event (S) • Probability of an event: 0 ≤ P(E) ≤ 1, s.t. • P(S) = 1 (outcome always in S) • P(A ∪ B) = P(A) + P(B) if A ∩ B = ∅ (e.g., A = same face, B = different face)
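
To make these definitions concrete, here is a minimal sketch (not from the original slides) in C++, the course's assignment language, that enumerates the two-coin sample space and computes event probabilities by counting outcomes, assuming all outcomes are equally likely:

```cpp
// A minimal sketch: enumerate the sample space of tossing two fair coins
// and compute event probabilities by counting outcomes, assuming all four
// outcomes are equally likely.
#include <iostream>
#include <string>
#include <vector>

int main() {
    // Sample space S = {HH, HT, TH, TT}
    std::vector<std::string> S = {"HH", "HT", "TH", "TT"};

    // P(E) = |E| / |S| for equally likely outcomes
    auto prob = [&](auto event) {
        int count = 0;
        for (const auto& o : S) if (event(o)) ++count;
        return static_cast<double>(count) / S.size();
    };

    double pAllHeads = prob([](const std::string& o) { return o == "HH"; });
    double pSameFace = prob([](const std::string& o) { return o[0] == o[1]; });
    double pDiffFace = prob([](const std::string& o) { return o[0] != o[1]; });

    std::cout << "P(all heads) = " << pAllHeads << "\n";   // 0.25
    std::cout << "P(same face) = " << pSameFace << "\n";   // 0.5
    std::cout << "P(diff face) = " << pDiffFace << "\n";   // 0.5
    // Same face and different face are disjoint, so their probabilities add to P(S) = 1
    std::cout << "P(same) + P(diff) = " << pSameFace + pDiffFace << "\n";
    return 0;
}
```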

  21. Basic Concepts of Prob. (cont.) • Conditional probability: P(B|A) = P(A ∩ B)/P(A) • P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B) • So, P(A|B) = P(B|A)P(A)/P(B) (Bayes' Rule) • For independent events, P(A ∩ B) = P(A)P(B), so P(A|B) = P(A) • Total probability: if A1, …, An form a partition of S, then P(B) = P(B ∩ S) = P(B ∩ A1) + … + P(B ∩ An) (why?) • So, P(Ai|B) = P(B|Ai)P(Ai)/P(B) = P(B|Ai)P(Ai)/[P(B|A1)P(A1) + … + P(B|An)P(An)] • This allows us to compute P(Ai|B) based on P(B|Ai)
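
The short sketch below applies Bayes' rule together with the total probability formula. The prior P(sport) and the likelihoods of observing "baseball" are invented numbers used purely for illustration; they are not values from the course.

```cpp
// Hypothetical example: A1 = "topic is sport" and A2 = "topic is not sport"
// form a partition; B = "the word 'baseball' appears". All numbers below
// are made up for illustration.
#include <iostream>

int main() {
    double pSport = 0.3;                 // P(A1), prior
    double pNotSport = 1.0 - pSport;     // P(A2)
    double pBgivenSport = 0.2;           // P(B|A1)
    double pBgivenNotSport = 0.01;       // P(B|A2)

    // Total probability: P(B) = P(B|A1)P(A1) + P(B|A2)P(A2)
    double pB = pBgivenSport * pSport + pBgivenNotSport * pNotSport;

    // Bayes' rule: P(A1|B) = P(B|A1)P(A1) / P(B)
    double pSportGivenB = pBgivenSport * pSport / pB;

    std::cout << "P(B) = " << pB << "\n";
    std::cout << "P(sport | baseball) = " << pSportGivenB << "\n"; // ~0.90
    return 0;
}
```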

  22. Interpretation of Bayes' Rule • Hypothesis space: H = {H1, …, Hn}; evidence: E • P(Hi|E) = P(E|Hi)P(Hi)/P(E) • P(Hi|E) is the posterior probability of Hi; P(Hi) is the prior probability of Hi; P(E|Hi) is the likelihood of the data/evidence if Hi is true • If we want to pick the most likely hypothesis H*, we can drop P(E): H* = argmaxi P(E|Hi)P(Hi)

  23. Random Variable • X: S → ℝ (a "measure" of the outcome), e.g., number of heads, all same face?, … • Events can be defined according to X: E(X=a) = {si | X(si) = a}, E(X ≥ a) = {si | X(si) ≥ a} • So, probabilities can be defined on X: P(X=a) = P(E(X=a)), P(X ≥ a) = P(E(X ≥ a)) • Discrete vs. continuous random variables (think of "partitioning the sample space")

  24. An Example: Doc Classification • Sample space: S = {x1, …, xn} (for 3 topics and four words, n = ?) • Each outcome is a document labeled with a topic and with 0/1 indicators for the words "the", "computer", "game", "baseball": x1: [sport 1 0 1 1], x2: [sport 1 1 1 1], x3: [computer 1 1 0 0], x4: [computer 1 1 1 0], x5: [other 0 0 1 1], … • Events: Esport = {xi | topic(xi) = "sport"}, Ebaseball = {xi | baseball(xi) = 1}, Ebaseball,computer = {xi | baseball(xi) = 1 & computer(xi) = 0} • Conditional probabilities: P(Esport | Ebaseball), P(Ebaseball | Esport), P(Esport | Ebaseball,computer), … • Thinking in terms of random variables: Topic T ∈ {"sport", "computer", "other"}, "Baseball" B ∈ {0, 1}, …; P(T="sport" | B=1), P(B=1 | T="sport"), … • An inference problem: suppose we observe that "baseball" is mentioned; how likely is the topic to be "sport"? P(T="sport" | B=1) ∝ P(B=1 | T="sport") P(T="sport") • But P(B=1 | T="sport") = ?, P(T="sport") = ?

  25. Getting to Statistics ... • P(B=1 | T="sport") = ? (parameter estimation) • If we see the results of a huge number of random experiments, we can estimate it by the relative frequency: P(B=1 | T="sport") ≈ count(docs with topic "sport" that mention "baseball") / count(docs with topic "sport") • But what if we only see a small sample (e.g., 2)? Is this estimate still reliable? • In general, statistics has to do with drawing conclusions about the whole population based on observations of a sample (data)
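
The relative-frequency estimate above can be computed directly from the five toy documents of slide 24; the sketch below does exactly that counting, again only as an illustration:

```cpp
// Sketch: estimate P(T="sport") and P(B=1|T="sport") by relative frequency
// from the five toy documents of slide 24 (topic + 0/1 indicator for "baseball").
#include <iostream>
#include <string>
#include <vector>

struct Doc { std::string topic; int baseball; };

int main() {
    std::vector<Doc> docs = {
        {"sport", 1}, {"sport", 1}, {"computer", 0}, {"computer", 0}, {"other", 1}};

    int nSport = 0, nSportWithBaseball = 0;
    for (const auto& d : docs) {
        if (d.topic == "sport") {
            ++nSport;
            if (d.baseball == 1) ++nSportWithBaseball;
        }
    }

    double pSport = static_cast<double>(nSport) / docs.size();               // P(T="sport")
    double pBgivenSport = static_cast<double>(nSportWithBaseball) / nSport;  // P(B=1|T="sport")

    // Unnormalized posterior for the inference problem of slide 24:
    // P(T="sport"|B=1) is proportional to P(B=1|T="sport") * P(T="sport")
    std::cout << "P(T=sport)     = " << pSport << "\n";        // 2/5
    std::cout << "P(B=1|T=sport) = " << pBgivenSport << "\n";  // 2/2 = 1
    std::cout << "P(B=1|T=sport)*P(T=sport) = " << pBgivenSport * pSport << "\n";
    // With only 2 "sport" documents, the estimate P(B=1|T="sport")=1 is unreliable,
    // which is exactly the small-sample problem raised on this slide.
    return 0;
}
```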

  26. Parameter Estimation • General setting: • Given a (hypothesized & probabilistic) model that governs the random experiment • The model gives a probability of any data, p(D|θ), that depends on the parameter θ • Now, given actual sample data X={x1,…,xn}, what can we say about the value of θ? • Intuitively, take your best guess of θ -- "best" means "best explaining/fitting the data" • Generally an optimization problem

  27. Maximum Likelihood vs. Bayesian • Maximum likelihood estimation • “Best” means “data likelihood reaches maximum” • Problem: small sample • Bayesian estimation • “Best” means being consistent with our “prior” knowledge and explaining data well • Problem: how to define prior?

  28. Illustration of Bayesian Estimation • Data: X = (x1, …, xN) • Likelihood: p(X|θ) • Prior: p(θ) • Posterior: p(θ|X) ∝ p(X|θ) p(θ) • The figure marks three points on the θ axis: the prior mode, the posterior mode, and θml (the ML estimate); the posterior mode lies between the prior mode and the ML estimate, and moves toward the ML estimate as more data are observed
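
One common concrete instance of this picture is a Beta prior on the success probability of a binary outcome; this choice, and the prior parameters and data counts below, are assumptions for illustration only, not necessarily the model in the original figure.

```cpp
// Beta-Bernoulli sketch: prior Beta(a, b) on theta = P(success),
// data = k successes out of n trials. All numbers are illustrative assumptions.
#include <iostream>

int main() {
    double a = 5.0, b = 5.0;   // prior Beta(5,5): prior mode = (a-1)/(a+b-2) = 0.5
    int k = 2, n = 2;          // tiny sample: 2 successes in 2 trials

    double priorMode = (a - 1.0) / (a + b - 2.0);
    double mlEstimate = static_cast<double>(k) / n;   // theta_ml = k/n = 1.0
    // Posterior is Beta(a + k, b + n - k); its mode is:
    double posteriorMode = (a + k - 1.0) / (a + b + n - 2.0);

    std::cout << "prior mode     = " << priorMode << "\n";      // 0.5
    std::cout << "ML estimate    = " << mlEstimate << "\n";     // 1.0 (overfits the tiny sample)
    std::cout << "posterior mode = " << posteriorMode << "\n";  // 0.6, between prior and ML
    return 0;
}
```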

  29. Maximum Likelihood Estimate • Data: a document d with counts c(w1), …, c(wN), and length |d| • Model: multinomial distribution M with parameters {p(wi)} • Likelihood: p(d|M) ∝ ∏i p(wi)^c(wi), so the log-likelihood is l(d|M) = Σi c(wi) log p(wi) • Maximum likelihood estimator: M = argmaxM p(d|M) • We'll tune p(wi) to maximize l(d|M) subject to Σi p(wi) = 1 • Use the Lagrange multiplier approach: maximize Σi c(wi) log p(wi) + λ(Σi p(wi) − 1) • Setting partial derivatives to zero gives c(wi)/p(wi) + λ = 0, i.e., p(wi) = −c(wi)/λ; the constraint then gives λ = −|d| • ML estimate: p(wi) = c(wi)/|d|
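
A minimal sketch of this estimator, p(wi) = c(wi)/|d|, with made-up word counts:

```cpp
// Sketch: ML estimate of a multinomial (unigram) model from one document,
// i.e., p(w) = c(w, d) / |d|, as derived above. The toy counts are assumptions.
#include <iostream>
#include <map>
#include <string>

int main() {
    // Word counts c(w) for a hypothetical document d
    std::map<std::string, int> counts = {
        {"the", 4}, {"game", 2}, {"baseball", 3}, {"computer", 1}};

    int docLength = 0;                       // |d| = sum of counts
    for (const auto& [w, c] : counts) docLength += c;

    std::cout << "ML estimates p(w) = c(w)/|d|:\n";
    for (const auto& [w, c] : counts) {
        double p = static_cast<double>(c) / docLength;
        std::cout << "  p(" << w << ") = " << p << "\n";
    }
    // Note: words not seen in d get probability 0 under the ML estimate,
    // which is one reason smoothing is needed in practice.
    return 0;
}
```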

  30. What You Should Know • Probability concepts: • sample space, event, random variable, conditional probability, multinomial distribution, etc. • Bayes' formula and its interpretation • Statistics: know how to compute the maximum likelihood estimate

  31. Essential Background 2: Basic Concepts in Information Theory

  32. Information Theory • Developed by Shannon in the 40s • Maximizing the amount of information that can be transmitted over an imperfect communication channel • Data compression (entropy) • Transmission rate (channel capacity)

  33. Basic Concepts in Information Theory • Entropy: Measuring uncertainty of a random variable • Kullback-Leibler divergence: comparing two distributions • Mutual Information: measuring the correlation of two random variables

  34. Entropy: Motivation • Feature selection: • If we use only a few words to classify docs, what kind of words should we use? • P(Topic| “computer”=1) vs p(Topic | “the”=1): which is more random? • Text compression: • Some documents (less random) can be compressed more than others (more random) • Can we quantify the “compressibility”? • In general, given a random variable X following distribution p(X), • How do we measure the “randomness” of X? • How do we design optimal coding for X?

  35. Entropy: Definition • Entropy H(X) measures the uncertainty/randomness of random variable X: H(X) = −Σx p(x) log2 p(x) (with the convention 0 log 0 = 0) • Example: X = outcome of a coin toss with P(Head) = p, so H(X) = −p log2 p − (1−p) log2 (1−p) • Plotted against P(Head) over [0, 1.0], H(X) is 0 at p = 0 and p = 1 and reaches its maximum of 1 bit at p = 0.5
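
The sketch below evaluates this binary-entropy formula for a few values of P(Head), confirming that H(X) peaks at 1 bit when the coin is fair:

```cpp
// Sketch: entropy of a coin with P(Head) = p, i.e.
// H = -p log2 p - (1-p) log2 (1-p), showing the maximum at p = 0.5.
#include <cmath>
#include <iostream>

// Convention: 0 * log2(0) = 0
double plog2p(double p) { return p > 0.0 ? p * std::log2(p) : 0.0; }

double binaryEntropy(double p) { return -plog2p(p) - plog2p(1.0 - p); }

int main() {
    for (double p : {0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0}) {
        std::cout << "P(Head) = " << p
                  << "  H(X) = " << binaryEntropy(p) << " bits\n";
    }
    // Output rises from 0 at p=0, peaks at 1 bit when p=0.5, and falls back to 0 at p=1.
    return 0;
}
```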

  36. Entropy: Properties • Minimum value of H(X): 0 • What kind of X has the minimum entropy? • Maximum value of H(X): log M, where M is the number of possible values for X • What kind of X has the maximum entropy? • Related to coding

  37. Interpretations of H(X) • Measures the "amount of information" in X • Think of each value of X as a "message" • Think of X as a random experiment (20 questions) • Minimum average number of bits needed to compress values of X • The more random X is, the harder it is to compress • A fair coin has the maximum information, and is hardest to compress • A biased coin has some information, and can be compressed to <1 bit on average • A completely biased coin has no information, and needs only 0 bits

  38. Conditional Entropy • The conditional entropy of a random variable Y given another X expresses how much extra information one still needs to supply on average to communicate Y, given that the other party knows X: H(Y|X) = −Σx,y p(x,y) log2 p(y|x) = H(X,Y) − H(X) • H(Topic | "computer") vs. H(Topic | "the")?
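
The following sketch computes H(Y|X) from a small joint distribution over a binary "word occurs" variable X and a binary topic variable Y; the joint probabilities are invented for illustration:

```cpp
// Sketch: conditional entropy H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x),
// computed from a small, made-up joint distribution.
#include <cmath>
#include <iostream>

int main() {
    // p(x, y): rows x in {word absent, word present}, columns y in {sport, other}
    double p[2][2] = {{0.35, 0.35},    // x = 0
                      {0.25, 0.05}};   // x = 1

    double hYgivenX = 0.0;
    for (int x = 0; x < 2; ++x) {
        double px = p[x][0] + p[x][1];             // marginal p(x)
        for (int y = 0; y < 2; ++y) {
            if (p[x][y] > 0.0) {
                double pyGivenX = p[x][y] / px;    // p(y|x)
                hYgivenX -= p[x][y] * std::log2(pyGivenX);
            }
        }
    }
    std::cout << "H(Y|X) = " << hYgivenX << " bits\n";
    // If X told us nothing about Y, H(Y|X) would equal H(Y); the more X
    // predicts Y (e.g., a topical word vs. "the"), the smaller H(Y|X).
    return 0;
}
```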

  39. Cross Entropy H(p,q) • What if we encode X with a code optimized for a wrong distribution q? • Expected # of bits: H(p,q) = −Σx p(x) log2 q(x) • Intuitively, H(p,q) ≥ H(p), and mathematically the difference H(p,q) − H(p) is exactly the KL-divergence D(p||q) ≥ 0 (next slide), with equality iff p = q

  40. Kullback-Leibler Divergence D(p||q) • What if we encode X with a code optimized for a wrong distribution q? How many bits would we waste? D(p||q) = Σx p(x) log2 [p(x)/q(x)] = H(p,q) − H(p) (also called relative entropy) • Properties: D(p||q) ≥ 0; D(p||q) ≠ D(q||p) in general; D(p||q) = 0 iff p = q • KL-divergence is often used to measure the distance between two distributions • Interpretation: • For fixed p, D(p||q) and H(p,q) vary in the same way • If p is an empirical distribution, minimizing D(p||q) or H(p,q) is equivalent to maximizing likelihood
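
A short sketch computing H(p), H(p,q), and D(p||q) for two made-up three-outcome distributions, verifying numerically that D(p||q) = H(p,q) − H(p) ≥ 0:

```cpp
// Sketch: H(p), H(p,q), and D(p||q) for two small distributions over the
// same three outcomes. The distributions are made up for illustration.
#include <cmath>
#include <iostream>
#include <vector>

int main() {
    std::vector<double> p = {0.5, 0.3, 0.2};   // "true" distribution
    std::vector<double> q = {0.3, 0.3, 0.4};   // "wrong" coding distribution

    double hP = 0.0, hPQ = 0.0, dPQ = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        hP  -= p[i] * std::log2(p[i]);          // entropy H(p)
        hPQ -= p[i] * std::log2(q[i]);          // cross entropy H(p,q)
        dPQ += p[i] * std::log2(p[i] / q[i]);   // KL-divergence D(p||q)
    }

    std::cout << "H(p)    = " << hP  << " bits\n";
    std::cout << "H(p,q)  = " << hPQ << " bits (>= H(p))\n";
    std::cout << "D(p||q) = " << dPQ << " bits (= H(p,q) - H(p))\n";
    return 0;
}
```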

  41. Cross Entropy, KL-Div, and Likelihood • Let p be the empirical distribution of words in a document d: p(w) = c(w,d)/|d| • Likelihood of d under a model q: p(d|q) = ∏w q(w)^c(w,d) • log likelihood: log p(d|q) = Σw c(w,d) log q(w) = |d| Σw p(w) log q(w) = −|d| H(p,q) • So maximizing the likelihood of d under q is equivalent to minimizing the cross entropy H(p,q), or equivalently the KL-divergence D(p||q) • Criterion for selecting a good model • Perplexity(p) = 2^H(p) (the effective number of equally likely outcomes)
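
The sketch below checks this relation on made-up counts and a made-up model q: the log2-likelihood of the document equals −|d|·H(p_emp, q), and the perplexity is 2 raised to that cross entropy.

```cpp
// Sketch of the relation above: for a document with counts c(w) and a model q,
// log2-likelihood = sum_w c(w) log2 q(w) = -|d| * H(p_emp, q), and the model's
// perplexity on the document is 2^{H(p_emp, q)}. Counts and q are invented.
#include <cmath>
#include <iostream>
#include <vector>

int main() {
    std::vector<int>    c = {4, 3, 2, 1};              // word counts c(w) in document d
    std::vector<double> q = {0.4, 0.3, 0.2, 0.1};      // model distribution q(w)

    int docLength = 0;
    for (int cw : c) docLength += cw;                  // |d| = 10

    double logLik = 0.0, crossEntropy = 0.0;
    for (std::size_t i = 0; i < c.size(); ++i) {
        double pEmp = static_cast<double>(c[i]) / docLength;   // empirical p(w)
        logLik       += c[i] * std::log2(q[i]);
        crossEntropy -= pEmp * std::log2(q[i]);
    }

    std::cout << "log2 p(d|q)       = " << logLik << "\n";
    std::cout << "-|d| * H(p_emp,q) = " << -docLength * crossEntropy << "\n"; // same value
    std::cout << "Perplexity        = " << std::pow(2.0, crossEntropy) << "\n";
    return 0;
}
```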

  42. Mutual Information I(X;Y) • Comparing two distributions, p(x,y) vs p(x)p(y): I(X;Y) = Σx,y p(x,y) log2 [p(x,y)/(p(x)p(y))] = H(X) − H(X|Y) = H(Y) − H(Y|X) • Properties: I(X;Y) ≥ 0; I(X;Y) = I(Y;X); I(X;Y) = 0 iff X & Y are independent • Interpretations: • Measures how much the uncertainty of X is reduced given info. about Y • Measures correlation between X and Y • Related to the "channel capacity" in information theory • Examples: I(Topic; "computer") vs. I(Topic; "the")? I("computer"; "program") vs. I("computer"; "baseball")?
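
The sketch below computes I(X;Y) directly from the definition, using the same kind of small, invented joint distribution as in the conditional-entropy example:

```cpp
// Sketch: I(X;Y) = sum_{x,y} p(x,y) log2 [ p(x,y) / (p(x)p(y)) ],
// computed from a small joint distribution (invented for illustration),
// with X = whether a word occurs and Y = topic.
#include <cmath>
#include <iostream>

int main() {
    // Joint p(x,y); rows x in {0,1}, columns y in {sport, other}
    double p[2][2] = {{0.35, 0.35},
                      {0.25, 0.05}};

    double px[2] = {p[0][0] + p[0][1], p[1][0] + p[1][1]};   // marginals p(x)
    double py[2] = {p[0][0] + p[1][0], p[0][1] + p[1][1]};   // marginals p(y)

    double mi = 0.0;
    for (int x = 0; x < 2; ++x)
        for (int y = 0; y < 2; ++y)
            if (p[x][y] > 0.0)
                mi += p[x][y] * std::log2(p[x][y] / (px[x] * py[y]));

    std::cout << "I(X;Y) = " << mi << " bits\n";
    // I(X;Y) = 0 exactly when p(x,y) = p(x)p(y) for all x, y (independence);
    // a topical word like "computer" should have higher I with the topic than "the".
    return 0;
}
```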

  43. What You Should Know • Information theory concepts: entropy, cross entropy, relative entropy, conditional entropy, KL-div., mutual information • Know their definitions, how to compute them • Know how to interpret them • Know their relationships

  44. Essential Background 3: Natural Language Processing

  45. What is NLP? How can a computer make sense out of a string such as the Arabic text "يَجِبُ عَلَى الإنْسَانِ أن يَكُونَ أمِيْنَاً وَصَادِقَاً مَعَ نَفْسِهِ وَمَعَ أَهْلِهِ وَجِيْرَانِهِ وَأَنْ يَبْذُلَ كُلَّ جُهْدٍ فِي إِعْلاءِ شَأْنِ الوَطَنِ وَأَنْ يَعْمَلَ عَلَى مَا …" or the Spanish text "La listas actualizadas figuran como Aneio I."? • What are the basic units of meaning (words)? • What is the meaning of each word? • How are words related with each other? • What is the "combined meaning" of words? • What is the "meta-meaning"? (speech act) • Handling a large chunk of text • Making sense of everything • These questions correspond to the classical levels of analysis: morphology, syntax, semantics, pragmatics, discourse, inference, …

  46. An Example of NLP • Sentence: "A dog is chasing a boy on the playground" • Lexical analysis (part-of-speech tagging): Det Noun Aux Verb Det Noun Prep Det Noun • Syntactic analysis (parsing): noun phrases ("a dog", "a boy", "the playground"), the complex verb "is chasing", the prep phrase "on the playground", combined into a verb phrase and then a sentence • Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1). • Inference: with the rule Scared(x) if Chasing(_,x,_), we can derive Scared(b1) • Pragmatic analysis (speech act): a person saying this may be reminding another person to get the dog back…

  47. If we can do this for all the sentences, then … BAD NEWS: Unfortunately, we can’t. General NLP = “AI-Complete”

  48. NLP is Difficult! • Natural language is designed to make human communication efficient. As a result, • we omit a lot of "common sense" knowledge, which we assume the hearer/reader possesses • we keep a lot of ambiguities, which we assume the hearer/reader knows how to resolve • This makes EVERY step in NLP hard • Ambiguity is a "killer" • Common sense reasoning is a prerequisite

  49. Examples of Challenges • Word-level ambiguity: E.g., • “design” can be a noun or a verb (Ambiguous POS) • “root” has multiple meanings (Ambiguous sense) • Syntactic ambiguity: E.g., • “natural language processing” (Modification) • “A man saw a boy with a telescope.” (PP Attachment) • Anaphora resolution: “John persuaded Bill to buy a TV for himself.” (himself = John or Bill?) • Presupposition: “He has quit smoking.” implies that he smoked before.
