Relevance in big data: an IR perspective

Jian-Yun Nie, University of Montreal


  1. Relevance in big data: an IR perspective • Jian-Yun Nie, University of Montreal

  2. Big data dream • Data available when needed • Data can be processed at a click • Finding needed data as required • Relating data from different sources • Understanding the data correctly • Making appropriate inference/prediction • …

  3. Implicit assumptions underlying some dreams • Data are well structured • We understand what a field means and how it is connected to others • E.g. transactional data • Data are precisely valued • A value has a standard representation and a unique meaning • E.g. Date=20130325

  4. What we can do under the assumptions • Retrieving exactly what we want • Formulate a query formally • The result is what we want • E.g. all of a client's transactions in US dollars, all people who traveled to Canada in 2012 • Discovering patterns and relations among data • Exploit data structure and values • E.g. People who buy bread also buy butter. • Connecting data from different sources • E.g. Finding a picture of the house of a person who bought a fridge a day ago

  5. Some realities in big data • Data are often unstructured • Data values are expressed in flexible and imprecise ways • Date=2013/06/25 vs. 25/06/13 vs. 13年6月25号 (Chinese for "June 25, '13") • Date=beginning of summer in 2013 • Date=几天以前 (Chinese for "a few days ago") • Difficulties in making sense of data (understanding) • E.g. the word “china” in a text • Content of an image or a video sequence • Who/what is shown? • Data may still be too large to be processed completely • All the surveillance videos • All the texts on the web
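
The date variants on the slide above can be made concrete with a minimal sketch. The list of candidate formats and the function name `normalize_date` are assumptions made for illustration, and reading "25/06/13" as day/month/year is a guess. Vague expressions such as "beginning of summer in 2013" have no exact value and are returned as `None`.

```python
from datetime import date, datetime

# Assumed candidate formats; the day-first reading of "25/06/13" is a guess.
CANDIDATE_FORMATS = ("%Y/%m/%d", "%Y%m%d", "%d/%m/%y", "%y年%m月%d号")

def normalize_date(raw):
    """Return a date if any candidate format parses the string, else None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue  # this format does not fit, try the next one
    return None  # imprecise or relative expressions cannot be matched exactly

examples = ["2013/06/25", "25/06/13", "13年6月25号",
            "beginning of summer in 2013", "几天以前"]
normalized = {raw: normalize_date(raw) for raw in examples}
```

The first three strings map to the same date, while the last two stay unresolved, which is the gap that approximate (relevance-based) matching has to fill.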

  6. Format of big data • Unstructured texts • Webpages • Microblog posts • Automatic speech recognition output • Multimedia data • Sensor data • Surveillance cameras • Pictures • … • How can we query these data?

  7. Basic retrieval in big data • Query: • We can no longer always formulate a requirement in SQL or the like • E.g. finding posts relating to a social event • A picture of someone • We no longer query for data, but for information • Query = expression of some characteristics of the information • Retrieval: • Cannot rely on simple matches of values • There is no unique set of answers • Criterion: Relevance • How relevant is a piece of information to a query?

  8. Data retrieval vs. Information retrieval • Data = structured • Query = exact formal specification • Retrieval = exact value match • Answer = exact set • Information = unstructured • Query = approximate specification • Retrieval = concept match • Answer = ranked list, more or less relevant
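
The contrast on the slide above can be sketched in a few lines. The records, documents, and function names below are invented for illustration, and simple term overlap stands in for "concept match"; a real system would use a proper retrieval model.

```python
from collections import Counter

# Hypothetical toy records and documents, invented for illustration.
transactions = [
    {"client": "A", "currency": "USD", "amount": 120},
    {"client": "A", "currency": "CAD", "amount": 80},
    {"client": "B", "currency": "USD", "amount": 45},
]
documents = {
    "d1": "street festival downtown this weekend with live music",
    "d2": "traffic report for the downtown area",
    "d3": "live music festival tickets on sale",
}

def data_retrieval(records, **conditions):
    """Exact match: a record is either in the answer set or not."""
    return [r for r in records
            if all(r.get(k) == v for k, v in conditions.items())]

def information_retrieval(docs, query):
    """Approximate match: score documents by query-term overlap, then rank."""
    query_terms = query.lower().split()
    scored = []
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        score = sum(counts[t] for t in query_terms)
        if score > 0:
            scored.append((doc_id, score))
    # Highest score first; ties broken by document id for determinism.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

exact = data_retrieval(transactions, client="A", currency="USD")
ranked = information_retrieval(documents, "music festival downtown")
```

`exact` is a set of records that satisfy the query completely, while `ranked` is an ordered list in which every document is only more or less relevant.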

  9. Dealing with textual data/information • A large part of available data (likely in Big Data) is textual • Webpages • Microblog discussions • Automatic speech recognition output • … • IR has been in the big textual data era for some years

  10. How do we process big textual data? • Infrastructure: computer clusters, cloud • Distributed computing: MapReduce • Understanding texts (NLP): Indexing, information extraction, named entities, … • Retrieval models: define a matching and ranking function • Goal: approximate the user’s relevance judgments
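
As a rough sketch of how MapReduce and indexing fit together, the map, shuffle, and reduce steps for building an inverted index can be simulated in a single process. The corpus and function names are made up for illustration; a real job would run on a cluster framework rather than in plain Python.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Emit (term, doc_id) pairs, one per distinct term in the document."""
    for term in sorted(set(text.lower().split())):
        yield term, doc_id

def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(term, doc_ids):
    """Build the posting list for one term."""
    return term, sorted(doc_ids)

def build_inverted_index(corpus):
    pairs = [pair for doc_id, text in corpus.items()
             for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())

# Hypothetical three-document corpus.
corpus = {
    "d1": "big data retrieval",
    "d2": "information retrieval models",
    "d3": "big models",
}
index = build_inverted_index(corpus)
```

Each map call depends only on its own document and each reduce call only on its own term, which is what lets the work be spread over many machines.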

  11. Relevance in IR • No formal definition of relevance • Relevance is variable, depending on query, user, time, … • IR: defining models to approximate user relevance as closely as possible • Boolean model, language model • Learning-to-rank: Learn an approximate function of relevance from samples • Learning from users (query logs)
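
One common form of the language model mentioned above is query likelihood: rank documents by the probability that their term distribution generates the query. The sketch below assumes a unigram model with Jelinek-Mercer smoothing and a mixing weight `lam=0.5`; these choices, the documents, and the function names are illustrative, not taken from the slides.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, coll_counts, coll_len, lam=0.5):
    """log P(query | doc) under a unigram model with Jelinek-Mercer smoothing."""
    doc_terms = doc.lower().split()
    doc_counts = Counter(doc_terms)
    score = 0.0
    for term in query.lower().split():
        p_doc = doc_counts[term] / len(doc_terms) if doc_terms else 0.0
        p_coll = coll_counts[term] / coll_len
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            return float("-inf")  # term unseen in the whole collection
        score += math.log(p)
    return score

def rank(query, docs, lam=0.5):
    """Return document ids ordered from most to least likely to match."""
    all_terms = [t for text in docs.values() for t in text.lower().split()]
    coll_counts = Counter(all_terms)
    scores = {d: query_log_likelihood(query, text, coll_counts, len(all_terms), lam)
              for d, text in docs.items()}
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical documents.
docs = {
    "d1": "china exports rose sharply",
    "d2": "fine china plates and cups",
    "d3": "weather report",
}
ranking = rank("china exports", docs)
```

The score is only an approximation of user relevance: the model has no way of knowing whether "china" refers to the country or to porcelain.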

  12. Examples of using web data • Usage frequency • What usage is correct in English? • Correlation between economic development and search behavior • Higher correlation with search in future → more developed • Using approximate statistics based on relevant data

  13. Relevance as a basic concept in big data • Retrieving a relevant subset of data corresponding (more or less) to a criterion • User does not know how data is organized • Mining correlations/relationships between subsets of data / ranking lists • E.g. Are people active in microblogs also active in society? • Prediction using more abstract information concepts instead of data • E.g. youngsters familiar with IT will …

  14. Assessing the quality of a retrieval model • Any IR system can retrieve a set of answers • A system is not very useful if it only happens to find some relevant ones for a query from time to time • Ideally, it should always do so • Implemented relevance ≈ User relevance • Criteria of quality • Desired answers vs. system answers • Precision + Recall • nDCG, …
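
The quality criteria named above can be computed as follows. This is a minimal sketch using the standard set-based definitions of precision and recall and the log2-discounted form of nDCG; the document ids and relevance grades are invented.

```python
import math

def precision_recall(retrieved, relevant):
    """Set-based precision and recall of a retrieved list against desired answers."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall

def dcg(gains):
    """Discounted cumulative gain: the gain at rank r is divided by log2(r + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(ranked_ids, graded_relevance, k=None):
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    k = len(ranked_ids) if k is None else k
    gains = [graded_relevance.get(d, 0) for d in ranked_ids[:k]]
    ideal = sorted(graded_relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical judgments: higher grade means more relevant.
grades = {"d1": 3, "d2": 2, "d3": 1}
system_ranking = ["d1", "d4", "d2"]

p, r = precision_recall(system_ranking, grades.keys())
score = ndcg(system_ranking, grades)
```

Recall needs the full set of desired answers in the denominator, whereas precision only needs judgments on what the system returned.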

  15. Quality of a system on big data • Retrieval system: Is the data retrieved relevant? • How can we assess the quality of retrieval in big data? • Precision is easy to assess • Recall is impossible to assess • nDCG is not always enough • May depend on applications

  16. Should we assess the quality of a mining system? • Success story: A system can successfully mine the relation between “buying bread” and “buying butter”. • Informed mining: Humans make a hypothesis • However, what if the system also mines a large number of irrelevant relations at the same time? • Still useful: Can provide possible relations to human experts • More useful: find candidate relations of higher quality • Precision/Recall • Can we define standard answers (data or relations to be mined)? Let’s dream!
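
One way to judge a mined relation such as "buying bread" and "buying butter" is with the usual association-rule measures of support, confidence, and lift. The slides do not name these measures; they are used here as an illustrative assumption, and the baskets are invented.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    n = len(baskets)
    if n == 0:
        return 0.0, 0.0, 0.0
    n_a = sum(1 for b in baskets if antecedent in b)
    n_c = sum(1 for b in baskets if consequent in b)
    n_both = sum(1 for b in baskets if antecedent in b and consequent in b)
    support = n_both / n                         # how often both occur together
    confidence = n_both / n_a if n_a else 0.0    # P(consequent | antecedent)
    lift = confidence / (n_c / n) if n_c else 0.0  # above 1: positive association
    return support, confidence, lift

# Hypothetical shopping baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "jam"},
    {"milk"},
    {"butter", "milk"},
]
support, confidence, lift = rule_metrics(baskets, "bread", "butter")
```

Thresholds on such measures filter out some irrelevant relations, but deciding which surviving relations are actually useful still requires judged "standard answers", as the slide points out.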

  17. From data processing to information processing • Data → Information → Knowledge • Data retrieval: exact representation • Information: flexible representation • Mining knowledge from data → Mining knowledge from information • E.g. bad weather → more traffic jams? • Relevance as a basic notion in big data • Access relevant data/information • Mining relevant properties on relevant data/information

  18. Thanks
