
Q & A Based Internet Information Acquisition

Presentation Transcript


  1. Q & A Based Internet Information Acquisition Xiaoyan Zhu Tsinghua University zxy-dcs@tsinghua.edu.cn

  2. Tsinghua University, founded in 1911

  3. 13 Schools and 54 Departments, with 7,080 faculty members and 25,173 students

  4. National Lab of Information Science and Technology 1,000 people

  5. Department of CS • Faculty members: 120 • Students: 661 undergraduate, 545 master, 298 PhD

  6. Joint Research Center: Tsinghua – Waterloo Research Center for Internet Information Acquisition • Title of the project • Breaking the barriers to internet access • Mission • Enable internet information to be widely and conveniently accessed by the disadvantaged: people who do not read English, people who do not have internet access, and visually impaired people. • Goal • To minimize language barriers and make over 19.2 billion English internet pages accessible to 1.5 billion Chinese people. • To enable 580 million Chinese cell phone users to access the internet, even if they do not have an internet connection. • To enable 16 million visually impaired people in China, and many more in Canada and the world, to access the internet conveniently. • Supported by the International Research Chair Initiative project from the International Development Research Center, Canada (1 million Canadian dollars)

  7. http://ciia.cs.tsinghua.edu.cn/Project_WebSite/index.jsp

  8. TALK OUTLINE • How we get information from the internet • Research topics • Information similarity measure • Information extraction • Information summarization • Future work

  9. There are more than 20 billion web pages in the world, and the number is increasing at a rate of 200,000 per day. From http://en.disitu.com/2008/03/10/how-many-website-in-the-world/

  10. General search engines

  11. More than 20 billion + 5 billion (70?) web pages indexed. Almost 1,000,000,000 MB of user-generated data every hour. Almost 4,000,000,000 user queries per day. Almost 17,600,000 daily visitors.

  12. Vertical search engines • Travel Search • Shopping (or Product) Search • Employment Search • ……

  13. 15,000,000 registered users. 20,000,000 services provided every year. 1,000,000 daily visitors.

  14. Community Q&A Systems

  15. Q&A Systems Input: natural language questions Return: sophisticated and tidy answers!

  16. QUANTA − Q&A based information acquisition [diagram] • Keyword search: Requirement → Keywords → URL → Web pages → Information, with the user refining keywords and repeating the loop • QUANTA: Requirement → Natural language query → Information • To make information acquisition more intelligent, convenient, and effective • To minimize language barriers and make over 19.2 billion English internet pages accessible to 1.5 billion Chinese people • To enable 580 million Chinese cell phone users to access the internet, even if they do not have an internet connection • Q&A, Knowledge, Power

  17. THU QUANTA http://166.111.138.87/quanta/index.jsp

  18. QUANTA’s result

  19. Summary • General engines: keyword search; more powerful for shorter queries. • Vertical engines: question answering; more useful for detailed questions in a specific domain. • Community Q&A: question answering; more popular for various complex questions. • Q&A systems: complement web keyword search; enhance existing cQA and search services; leverage existing knowledge in the question and answer forms and their authors; get answers automatically from the internet at any time, anywhere, for everyone!

  20. Problems • General search engines (keyword search) • Powerful for short queries, but not for long queries and questions • Loss of information → returns irrelevant results • Vertical search engines (question answering) • Powerful for specialized questions → domain limited • cQA systems (question answering) • Is the “best answer” really the best? • Too many redundant answers for the same question. • The answer is not available in real time for a new question.

  21. Problems (cont.) • Q&A systems: the complexity of the questions they can handle is limited • QUANTA • Knowledge based • Powerset • Database based • Wolfram Alpha

  22. TALK OUTLINE • How we can get information from the internet • Research topics • Information similarity measure • Information extraction • Information summarization • Future work

  23. Research topics ongoing (from theoretical study to application) • Information Similarity • Information Extraction • Sentiment analysis, opinion mining • Information summarization • Question analysis and classification • Candidate generation, ranking, evaluation • Recommendation for related content

  24. Our work • Information theory based information similarity measures • Conditional information distance: distance between named entities in a specific context • Min distance measure: distance between two objects for the partial matching problem • Information Distance among Many Objects: comprehensive and typical information selection, e.g. review mining, multi-document summarization • Information extraction • Relationship extraction • Redundancy removal • Summarization • Based on Information Distance • Based on Graph Centrality

  25. TALK OUTLINE • How we can get information from the internet • Research topics • Information similarity measure • Information extraction • Information summarization • Future work

  26. Information Distance [diagram: from Kolmogorov complexity to the information distance family, Dmax(x,y) and dmax(x,y), the conditional variants Dmax(x,y|c) and dmax(x,y|c), the min distance Dmin(x,y) and dmin(x,y), and the multi-object distance Dmax(x1,x2,…)]

  27. Information Distance • Original theories: • Information distance: Dmax(x, y) • Normalized information distance: dmax(x, y) • Proposed by our group: • Conditional Information Distance (CID/NCID): Dmax(x, y|c), dmax(x, y|c) • minDistance: Dmin(x, y), dmin(x, y) • Information Distance among Many Objects (IDMO): Dmax(x1, …, xn)

  28. Motivation of NCID (Normalized Conditional Information Distance) • Information distance (Bennett C H, Gács P, Li M, Vitányi P, and Zurek W, 1998): Dmax(x, y) = max{K(x|y), K(y|x)}, and its normalized form, named NID: dmax(x, y) = max{K(x|y), K(y|x)} / max{K(x), K(y)}, where K(x) is the Kolmogorov complexity of x and K(x|y) is the Kolmogorov complexity of x conditioned on y. NID is an absolute, universal, and application-independent distance measure between any two sequences; it has been applied to evolution tree construction, language classification, music classification, plagiarism detection, and data mining for images and time sequences such as heart-rhythm data. • However, when people try to use it with Google, it sometimes becomes difficult and meaningless. For example, take “fan” with “CPU” and with “star”: their NIDs are 0.60 and 0.58, respectively, which are almost the same and tell us nothing.
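NID itself is not computable, since K is not computable. One common practical stand-in (a general technique, not necessarily the estimator used in this work) is the normalized compression distance, where the length of a real compressor's output replaces K. A minimal Python sketch:

import zlib

def c(s: str) -> int:
    # Compressed length in bytes: a rough, computable stand-in for K(s).
    return len(zlib.compress(s.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    # Normalized compression distance, a computable proxy for dmax(x, y).
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

print(ncd("the fan cools the CPU", "the CPU needs a fan"))          # related texts
print(ncd("the fan cools the CPU", "a star is a ball of plasma"))   # unrelated texts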

  29. Experiment results of NCID [figure omitted; label: “Regular expression”]

  30. Approximation of NCID In practice, dc(x, y), the conditional distance given c, is approximated from co-occurrence counts: dc(x, y) = [max{log f(x,c), log f(y,c)} − log f(x,y,c)] / [log f(c) − min{log f(x,c), log f(y,c)}], where f(x) is the number of elements in which x occurs, f(x,c) is the number of elements in which x and c both occur, and f(y,c) and f(x,y,c) are defined similarly. By definition, 0 ≤ f(x,y,c) ≤ f(x,c), f(y,c) ≤ f(c). When f(x,y,c) = 0: if f(x,c) · f(y,c) > 0, then dc(x,y) = ∞ (infinite); otherwise, dc(x, y) is undefined.
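A small sketch of this count-based approximation in Python; the counts f(·) would come from, say, search-engine hit counts or a document collection, and the argument names simply mirror the f(·,·) notation above (the example numbers are hypothetical):

from math import log, inf

def d_c(f_xc: int, f_yc: int, f_xyc: int, f_c: int) -> float:
    # Conditional count-based distance dc(x, y):
    #   f_xc  = number of elements containing x and c
    #   f_yc  = number of elements containing y and c
    #   f_xyc = number of elements containing x, y and c
    #   f_c   = number of elements containing c
    if f_xyc == 0:
        if f_xc * f_yc > 0:
            return inf                      # x and y never co-occur under c
        raise ValueError("dc(x, y) is undefined when f(x,c) = f(y,c) = 0")
    numerator = max(log(f_xc), log(f_yc)) - log(f_xyc)
    denominator = log(f_c) - min(log(f_xc), log(f_yc))
    return numerator / denominator

# hypothetical counts with context c = "computer"
print(d_c(f_xc=1200, f_yc=900, f_xyc=300, f_c=50000))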

  31. Motivation of min Distance Partial matching problem • Example: What city is Lake Washington by? (Seattle, Bellevue, Kirkland) • Seattle, the correct and most popular answer, carries much more private information than the other answers, relative to the information it shares with “Lake Washington”. • Problem of max distance • The information shared between the question and the answer is “diluted” by the answer's private information, which makes the system select an unpopular candidate, Bellevue, whose shared information is “denser”. • Can we remove the irrelevant information within a coherent theory and give the most popular city, Seattle, a chance?

  32. minDistance measure • Define Dmin and dmin • Properties of min Distance • Universal • No partial matching problem • Nonnegative and symmetric • It does not satisfy the triangle inequality • Publications: ACM KDD-07, Bioinformatics, JCST
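Assuming Dmin(x, y) = min{K(x|y), K(y|x)} with dmin(x, y) = Dmin(x, y) / min{K(x), K(y)} as its normalized form (my reading of the published min-distance definition), a rough compression-based sketch shows how the min variant behaves on the Lake Washington example. The c()/cond() proxies for K are illustrative, not the paper's estimator:

import zlib

def c(s: str) -> int:
    # Compressed length: crude stand-in for K(s).
    return len(zlib.compress(s.encode("utf-8")))

def cond(x: str, y: str) -> int:
    # Crude proxy for K(x|y): extra bytes needed to encode x once y is known.
    return max(c(y + x) - c(y), 0)

def d_max(x: str, y: str) -> float:
    return max(cond(x, y), cond(y, x)) / max(c(x), c(y))

def d_min(x: str, y: str) -> float:
    return min(cond(x, y), cond(y, x)) / min(c(x), c(y))

question = "What city is Lake Washington by?"
candidates = {
    "Seattle": "Seattle, a large city on Lake Washington, plus much other private information",
    "Bellevue": "Bellevue, on Lake Washington",
}
for name, text in candidates.items():
    # d_max penalizes the answer's private information; d_min does not.
    print(name, round(d_max(question, text), 2), round(d_min(question, text), 2))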

  33. min Distance's problem Triangle inequality problem d(human, horse) > d(human, Centaur) + d(Centaur, horse) An information distance must reflect what “we think” is similar, and what “we think” is similar apparently does not really satisfy the triangle inequality. min Distance has proved reasonable and successful in applications.

  34. Motivation of IDMO • In many data mining applications, we are more interested in mining information from many, not just two, information-carrying entities, for example: • What is the public opinion on the United States presidential election, from the blogs? • What do the customers say about a product, from the reviews? • Which article, among many, covers the news most comprehensively? Or is most typical of one particular news item?

  35. IDMO measure • Define Dmax(x1, …, xn) • Dmax(x1, ..., xn) = Em(x1, ..., xn) = min { |p| : U(xi, p, j) = xj, for all i, j } • Conditional Dmax(x1, …, xn) • Dmax(x1, ..., xn|c) = Em(x1, ..., xn|c) = min { |p| : U(xi, p, j|c) = xj, for all i, j } • Most representative object • Left-hand side: the most “comprehensive” object, which contains the most information about all of the others • Right-hand side: the most “typical” object, which is similar to all of the others
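One way to read the “comprehensive” vs. “typical” distinction operationally (an interpretation for illustration, not the paper's exact estimator) is to approximate K with a compressor and pick the object that is cheapest to describe from the others (typical), or from which the others are cheapest to describe (comprehensive):

import zlib

def c(s: str) -> int:
    return len(zlib.compress(s.encode("utf-8")))

def cond(x: str, y: str) -> int:
    # Proxy for K(x|y).
    return max(c(y + x) - c(y), 0)

def most_typical(docs):
    # Object that is cheapest to describe given each of the others.
    return min(docs, key=lambda d: sum(cond(d, other) for other in docs if other != d))

def most_comprehensive(docs):
    # Object from which all of the others are cheapest to describe.
    return min(docs, key=lambda d: sum(cond(other, d) for other in docs if other != d))

reviews = [
    "battery life is great and the screen is sharp",
    "great battery life",
    "the screen is sharp, the speakers are weak, and battery life is great",
]
print("typical:      ", most_typical(reviews))
print("comprehensive:", most_comprehensive(reviews))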

  36. Document Summarization by IDMO Multi-document summarization: generate the most “typical” summary, following the right-hand side of Em. Ranked first under the overall responsiveness measure on both TAC 2008 (33 research groups, 58 submissions) and TAC 2009 (26 research groups, 52 submissions). Publications: ICDM-09

  37. Review Mining by IDMO Comprehensive and typical review selection: select the most “comprehensive” and the most “typical” reviews. We have studied the relationship between a review's sentiment rating and its textual content and developed a rating estimation system. The system based on our theory is very helpful for customers. Publications: ACM CIKM-08, WI-09

  38. TALK OUTLINE • How we can get information from the internet • Research topics • Information similarity measure • Information extraction • Information summarization • Future work

  39. Information Extraction • Relationship extraction • Supervised learning • Unsupervised learning

  40. Interaction extraction • Statistical algorithms • Low precision, high recall • Pattern matching algorithms • Manual pattern generation • High precision, low recall • Bad at generalization • Automatic pattern generation • Good balance between precision and recall • Good at generalization

  41. Pattern generation and optimization • Pattern generation: extract patterns automatically. • Dynamic alignment / dynamic programming algorithm (see the alignment sketch below) • Pattern optimization: reduce and merge patterns to increase generalization power, and hence the recall and precision rates. • Supervised machine learning algorithm • Approach based on the MDL principle • Data labeling is required • Semi-supervised machine learning algorithm • Approach with a ranking function and a heuristic evaluation algorithm • Relatively little data labeling is required
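As a rough illustration of the dynamic-alignment idea (a simplified LCS-style sketch in Python, not the project's actual algorithm), two tag sequences can be merged into a shared pattern, with non-matching stretches collapsed into wildcards:

def align_patterns(seq_a, seq_b, wildcard="*"):
    # Merge two tag sequences into a shared pattern via an LCS-style
    # dynamic-programming alignment; mismatched stretches become wildcards.
    n, m = len(seq_a), len(seq_b)
    # lcs[i][j] = length of the longest common subsequence of seq_a[i:], seq_b[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if seq_a[i] == seq_b[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    pattern, i, j = [], 0, 0
    while i < n and j < m:
        if seq_a[i] == seq_b[j]:
            pattern.append(seq_a[i])
            i += 1
            j += 1
        else:
            if not pattern or pattern[-1] != wildcard:
                pattern.append(wildcard)
            if lcs[i + 1][j] >= lcs[i][j + 1]:
                i += 1
            else:
                j += 1
    if (i < n or j < m) and (not pattern or pattern[-1] != wildcard):
        pattern.append(wildcard)    # leftover tail on either side
    return pattern

# tag sequences such as "PTN binds to PTN" vs. "PTN associates PTN"
print(align_patterns(["PTN", "VBZ", "IN", "PTN"], ["PTN", "VBZ", "IN", "PTN"]))
print(align_patterns(["PTN", "VBZ", "PTN"], ["PTN", "VBZ", "IN", "PTN"]))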

  42. Pattern Set • The best pattern set should satisfy: • the least amount of error in the output and the least redundancy in the pattern set; • the maximum number of sentences matched by at least one pattern. • The problem is how to strike this balance: a trade-off between the complexity of the model (the pattern set) and the fitness of the model to the data (reflected in the performance of the system).

  43. Pattern optimization (MDL based) • What is the Minimum Description Length principle? • Proposed by Rissanen in 1978 as a tool to resolve the trade-off between generalization and accuracy. • The MDL principle can be applied without an analytical form of the risk function. • The MDL principle states that, given some data D, the best model M_MDL in the set of all models consistent with the data is the one that minimizes the sum of the length, in bits, of the description of the model and the length, in bits, of the description of the data with the aid of the model: M_MDL = argmin_M { l(M) + l(D|M) }, where l(M) and l(D|M) denote, respectively, the description length of the model M and that of the data D using model M.

  44. Implementation of the MDL principle • The MDL principle can be viewed from the standpoint of Kolmogorov complexity (Li and Vitanyi, 1997): M_MDL = argmin_M { K(M) + K(D|M) }, where K(·) is the Kolmogorov complexity, which is likewise not computable. The MDL principle looks for an optimal balance between the regularities (in the model) and the randomness remaining in the data, that is, a trade-off between the complexity of the model and the fitness of the model to the data. • The problem is how to make the principle computable.

  45. MDL based algorithm • Examples of patterns (tag sequences): • {PTN VBZ IN PTN: *; binds, associates; to, with; *} • {PTN VBZ PTN: *; binds, associates, activate; *} • Pattern set P = {p1, p2, …, pn}, where pattern pi = mi1 mi2 … miJi • Finally, the optimal pattern set P* is obtained by minimizing the description length of the pattern set together with the cost of its errors, roughly P* = argmin_P { Σi Σj c(mij) + γ · d(I, I*) }, where γ is a constant and c(mij) is the number of words in the jth component of pattern pi; I* is the expected interaction set while I is the set extracted by pattern set P, so d(I, I*) is the number of differences between I and I*.
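A toy Python sketch of this MDL-style trade-off, under the simplifying assumptions that the model cost is the total number of words across all pattern components and the data cost is γ times the number of disagreements between extracted and expected interactions; the actual scoring in the papers may differ:

def mdl_score(pattern_set, extracted, gold, gamma=2.0):
    # Model cost: total number of words across all pattern components.
    model_cost = sum(len(comp.split()) for pat in pattern_set for comp in pat)
    # Data cost: gamma * |I symmetric-difference I*|, i.e. spurious plus missed interactions.
    data_cost = gamma * len(set(extracted) ^ set(gold))
    return model_cost + data_cost

# Patterns as tuples of components, mirroring {PTN VBZ IN PTN: *; binds, associates; to, with; *}
p1 = ("*", "binds associates", "to with", "*")
p2 = ("*", "binds associates activate", "*")

gold = {("A", "binds", "B"), ("C", "activates", "D")}

# A small pattern set that misses an interaction: model cost 6, data cost 10, total 16.
print(mdl_score([p1], extracted={("A", "binds", "B")}, gold=gold, gamma=10.0))
# A larger pattern set that recovers it: model cost 11, data cost 0, total 11,
# so the extra model cost is outweighed by the reduced data cost.
print(mdl_score([p1, p2], extracted=gold, gold=gold, gamma=10.0))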

  46. Experiment results (1) • X-coordinate: the number of deleted patterns • Y-coordinates: the MDL value, precision, and recall of the system, respectively

  47. Experiment results (2) Publications: Bioinformatics 2004, 2005

  48. Pattern optimization (semi-supervised) • Why a semi-supervised algorithm is proposed • There are many kinds of relationships between proteins, genes, and diseases. • Data annotation is too expensive. • The key points are the ranking function and the evaluation algorithm

  49. Semi-supervised Algorithms • Novel ranking function HD • where p.positive is the number of correct instances matched by the pattern p and p.negative is the number of false instances. The parameter β is a threshold that controls p/n. If p/n > β, HD is an increasing function of (p+n): if several patterns have the same p/n exceeding β, a pattern with larger (p+n) has a higher rank and is more likely to be kept. If p/n < β, the first term is negative, so a pattern with larger (p+n) has a lower rank. Thus different ranking strategies apply depending on p/n. • Heuristic Evaluation Algorithm (HEA) • Reduces redundancy among patterns with an optimization function
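The slide's actual HD formula is not reproduced in this transcript; purely as an illustration of the behavior described above (the score increases with (p+n) when p/n exceeds β and decreases otherwise), here is a hypothetical ranking function in Python, not the published formula:

from math import log

def hd_rank(positive: int, negative: int, beta: float = 2.0) -> float:
    # Hypothetical ranking score for a pattern (NOT the published HD formula).
    #   positive: number of correct instances matched by the pattern (p)
    #   negative: number of false instances matched by the pattern (n)
    #   beta:     threshold on the ratio p/n
    # The leading factor is positive when p/n > beta, so the score grows with
    # (p + n); when p/n < beta it is negative and the score shrinks instead.
    n = max(negative, 1)                    # avoid division by zero
    return (positive / n - beta) * log(positive + negative + 1)

patterns = {"p1": (40, 10), "p2": (8, 2), "p3": (5, 10)}
for name, (pos, neg) in sorted(patterns.items(),
                               key=lambda kv: hd_rank(*kv[1]), reverse=True):
    print(name, round(hd_rank(pos, neg), 3))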
