1 / 33

Web Search and Mining “ WMa ” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC)

Web Search and Mining “ WMa ” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC). (Venue may change). Motivation. Few areas of computer science have had as much impact on our lives in the last 15 years as the Internet, the Web, search engines, e-commerce, and social media

olin
Download Presentation

Web Search and Mining “ WMa ” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC)

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Web Search and Mining“WMa”CS635 Autumn 2013Mon Thu 6:30—8:00pmLCH31 (third floor LHC) (Venue may change)

  2. Motivation • Few areas of computer science have had as much impact on our lives in the last 15 years as the Internet, the Web, search engines, e-commerce, and social media • Rapid rise of some of the largest and most accomplished companies on earth • Draws heavily from data structures, algorithms, databases, probability and statistics, machine learning, parallel computing, economics and game theory, social sciences, …(“whatever works”)

  3. Highlights Arthur C. Clarke predicts that satellites would "bring the accumulated knowledge of the world to your fingertips" using a console combining the functionality of the photocopier, telephone, television and a small computer, allowing data transfer and video conferencing around the globe. 1970 1990 Infrastructure Tim Berners-Lee &co build first prototype at CERN 1992 Mosaic Web brower developed Crawling 1994 Netscape, Yahoo! founded Indexing 1995 Alta Vista launched by Digital Equipment Corp PageRank 1996 Backrub, later called Google, founded 1998 Goto.com introduces paid search and ads De-spamming Text mining 2003 Google launches AdSense Auctions 2004 Facebook founded Social media Personalization 2005 YouTube founded Query+clicklog mining Collaborativefiltering 2006 Twitter founded Local search 2009 Bing launched by Microsoft

  4. CS635 and CS728 • CS635 = Web search and mining = first course on Web information retrieval • Indexing, query processing, ranking • Corpus and document models, text mining • Hyperlinks and social network analysis • Large scale data analysis: “big data” • CS728 = Organization of Web information = second course with more recent topics • Bootstrapping knowledge bases • Named entity and relation recognition • Open domain knowledge extraction • Semantic annotations and search

  5. Vs. CS101 • Usually CS635 runs in Autumn and CS728 runs in Spring • Taught CS101 in Spring 2012 • Largely recovered from the trauma in Autumn 2012, didn’t teach anything • Therefore offered CS635 in Spring 2013 • Offering CS635 again in Autumn 2013 to reset the clock • To avoid falling asleep, may sneak in new material

  6. Administrivia • http://soumen.in/teach/cs635a2013/ • soumen.in = www.cse.iitb.ac.in/~soumen = soumen.cse.iitb.ac.in • https://moodle.iitb.ac.in/course/view.php?id=2550 • Meant primarily for Mtech1, Btech3, PhD • Mtech2, Btech4 ok, a little late for research? • Weak prerequisites: prob+stat, databases • If you can read and understand a fair bit of math, you will need ~10 days of self-study to pick up the relevant pieces from basic machine learning

  7. Course design: books • Basic material from a few books • Manning-Raghavan-Schutze online book • Baeza-Yates and Ribeiro-Neto • Managing Gigabytes (plumbing tips) • MTW second edition in progress, notes will be available for many segments

  8. Course design: research papers • Papers written much faster than you can write a book • Digesting papers: critical survival skill • They sound like nothing left to do • You get credit for pointing out loopholes, flaws, and further ideas to follow from existing papers

  9. Evaluation • Homeworks • All homeworks and exams from all earlier offerings are effectively (ungraded) homeworks (many with solutions) — take advantage! • Hands-on work (coding, simulations and measurements) • Projects: depends on class size, interest • Exams • Limited time, in-class • Problems “served on a platter” • Hard to pose and solve real systems issues

  10. Multidisciplinary synthesis • Content: hypermedia, markup standards, text and semistructured data models • Activity: linking, blogging, selling, spamming • Algorithms: graphs, indexing, compression, string processing, ranking • Statistics: models for text and link graph, catching (link) spam, profiling queries • Language: tagging, extraction • Plumbing: networking, storage, distributed systems, map-reduce, “big data”, cloud …

  11. CS635 syllabus • Text indexing, search • Document and corpus models • Ranking, learning to rank • Whole-document labeling/classification • Measuring and modeling social networks • Hyperlink assisted search and mining • Web sampling, crawling, monitoring

  12. CS728 syllabus • Entity and relation annotation • Information extraction and integration • Coreference resolution • Bootstrapping Web knowledge bases • Adding graph structure to text search • Labeling and ranking in graph data models

  13. Text indexing, search and ranking • Boolean queries involving word occurrence in documents • Inverted index design, construction, compression, updates; query processing • Vector space model, relevance ranking, recall, precision, F1, break-even • Probabilistic ranking, belief networks • Fast top-k search • Similarity search: minhash, random projections, locality-preserving hash

  14. Document and corpus models • Token, compound, phrase, stems, stopwords • Language and typography issues • Practical computer representations • Bag of words, corpus, term-document matrix • Probabilistic generative models • Modeling multi-topic corpus and documents • Statistical notions of semantic similarity • Text clustering applications

  15. Diversity Query suggestion Query = SL100

  16. Whole-document labeling/classification • Topics on Yahoo!, spam vs. non-spam, etc. • Bayesian classification using generative corpus models • Conditional probabilistic classification • Discriminative classification • Transductive, semi-supervised and active labeling • Exploiting graph connectivity for improved labeling accuracy • Machine learning needed here

  17. Label coupling Instance Local feature variable Label variable Topic Topic CS705, IT655

  18. Measuring and modeling social networks • Prestige, centrality, co-citation • Web graph: degree, diameter, dense subgraphs, giant bow-tie • Generative models: preferential attachment, copying links, “Googlearchy” • Link locality and content locality • Generating synthetic social networks • Compressing and indexing large social networks, reference compression, connectivity server, reachability index

  19. Internet Gnutella Web

  20. Hyperlink assisted search and mining • Review of spectral graph theory CS725 • Hyperlink induced topic search (HITS) and Google’s Pagerank • Large-scale computational issues • Stability, topology sensitivity, spam resilience • Other random walks: SALSA, PHITS, maxent walks, absorbing walks, SimRank, … • Personalized and topic-sensitive Pagerank • Viral marketing, social networks • Link prediction, recommender systems

  21. Web sampling, crawling, monitoring • Plumbing: DNS, TCP/IP, HTTP, HTML • Large-scale crawling issues: concurrency, network load, shared work pool, spider traps, politeness • Setting crawl priorities using graph properties and page contents • Sampling Web pages using random walks • Monitoring change and refreshing crawls • Driven by query workload • Driven by search and advertising customers?

  22. Adding graph structure to text search • XML and related (largely) tree data models • Typed entity-relationship networks • Path expressions, integrating word queries with path matches; indexing, query processing • Spreading activation queries: find entity of specified type “near” matching predicates • Steiner query: explain why two or more entities (or words) are closely related

  23. “Find a person near IBM and IITB”

  24. Entity annotation in text • Need a label catalog • Michael Jordan, Stuart Russell • Collective annotation

  25. Name a physicist who searchedfor intelligent life in the cosmos  type=physicist NEAR “cosmos”… Where was Sagan born? type=region NEAR “Sagan” When was Sagan born?  type=timepattern=isDDDD NEAR“Sagan” “born” Bridging the Structured-Unstructured Gap abstraction entity is-a region person city scientist time district physicist year state astronomer hasDigit isDDDD Born in New York in 1934 , Sagan wasa noted astronomer whose lifelong passionwas searching for intelligent life in the cosmos.

  26. Mining the Web as a noisy database

  27. Mining Web tables • “Organically grown” Web tables • Much relational information • No coordinated schema • Use social entity and type catalogs to label

  28. Small answer subgraph search http://www.cse.iitb.ac.in/banks/

  29. Bootstrapping Web knowledge bases • Hearst, 1992; KnowItAll (Etzioni+ 2004) • T such as x, x and other Ts, x or other Ts, T x, x is a T, x is the only T, … • Google sets

  30. Information carnivores at work KO :: India Pakistan Cricket SeriesA web site by Khalid Omar, sort of live from Karachi, Pakistan. “cities such as [probe]” “[probe] and othercities”, “[probe] is a city”, etc. • “Garth Brooks is a country” [singer],“gift such as wall” [clock] • “person like Paris” [Hilton],“researchers like Michael Jordan” (which one?)

  31. Sample output • author; “Harry Potter” • J K Rowling, Ron • person; “Eiffel Tower” • Gustave, (Eiffel), Paris • director; Swades movie • Ashutosh Gowariker, Ashutosh Gowarikar • What can search engines do to help? • Cluster mentions and assign IDs • Allow queries for IDs — expensive! • “Harry Potter” context in “Ron is an author” Ambiguity andextremely skewedWeb popularity

  32. Summary • Understand Web-scale text data processing • Crawling, indexing, search, ranking, link mining, personalization • Mix of theory and practice • VC for “big data” and “cloud” exceeds that for “social ___” • Jobs! • Are you “the one”?

More Related