CSCI5250/ENGG5106: Information Retrieval and Search Engines

CSCI5250/ENGG5106: Information Retrieval and Search Engines. Lecture 1: Introduction and Boolean Information Retrieval. Prof. Michael R. Lyu. Outline: Administrative; Overall Course Introduction; Boolean Retrieval System (Ch. 1 of IR Book); Inverted Index; Processing Boolean Queries

    Presentation Transcript
    1. CSCI5250/ENGG5106: Information Retrieval and Search Engines Lecture 1: Introduction and Boolean Information Retrieval Prof. Michael R. Lyu

    2. Outline • Administrative • Overall Course Introduction • Boolean Retrieval System (Ch.1 of IR Book) • Inverted Index • Processing Boolean Queries • Query Optimization

    3. Outline • Administrative • Overall Course Introduction • Boolean Retrieval System • Inverted Index • Processing Boolean Queries • Query Optimization

    4. Motivation • Do you want to work in these companies?

    5. Motivation of the Course • To understand the infrastructure and techniques behind search engines. • To know the existing literature and research challenges in the area of Information Retrieval. • To learn how to organize and manage huge amounts of information, such as that on the Web. • To carry out a real project in Information Retrieval and/or Search Engines.

    6. Textbook • Introduction to Information Retrieval • Christopher Manning, Associate Professor of Linguistics and Computer Science at Stanford • Prabhakar Raghavan, Consulting Professor of Computer Science at Stanford, Vice President of Engineering at Google, previously Head of Yahoo! Research • Hinrich Schütze, Chair of Theoretical Computational Linguistics, Institute for Natural Language Processing, University of Stuttgart

    7. Textbook • Amazon • Link • PDF of the book for online viewing • http://www-nlp.stanford.edu/IR-book/

    8. Instructors • Prof. Michael R. Lyu 呂榮聰 • www.cse.cuhk.edu.hk/~lyu • Room 927; lyu@cse • TA: HU Junjie 胡俊傑 • www.cse.cuhk.edu.hk/~jjhu • Room 1024; jjhu@cse • TA: ZHAO Tong 趙桐 • www.cse.cuhk.edu.hk/~tzhao • Room 1024; tzhao@cse

    9. Time, Venue, and Website • Lecture • Monday from 9:30 am to 12:15 pm • LHC 104 & G04 (Y.C. Liang Hall, 潤昌堂) • Tutorial • Tuesday from 10:30 am to 11:15 am • LSB LT4 • Course URL • http://www.cse.cuhk.edu.hk/csci5250 • http://www.cse.cuhk.edu.hk/engg5106 • Course email: csci5250@cse / engg5106@cse

    10. Grade Assessment Scheme • Two Assignments (20%) • Written assignments • Some programming • One Midterm Examination (40%) • 9:30am – 12:15pm, November 3, 2014 • Open one A4-size paper (double sided fine) • One Project (40%) • Presentations • Report

    11. Class Project • Project is for everyone • 3-4 persons per project group • Each group is to design and implement a search engine of its choice • Email the names and student IDs of your group to csci5250@cse or engg5106@cse by Friday • Project specification and schedule will be assigned next Monday and published on the course website.

    12. STUDENT EXPECTATIONS • a positive, respectful, and engaged academic environment inside and outside the classroom; • to attend classes at regularly scheduled times without undue variations, and to receive before term-end adequate make-ups of classes that are canceled due to leave of absence of the instructor; • to receive a course syllabus • to consult with the instructor and tutors through regularly scheduled office hours or a mutually convenient appointment;

    13. STUDENT EXPECTATIONS • to have reasonable access to University facilities and equipment for assignments and/or objectives; • to have access to guidelines on University’s definition of academic misconduct; • to have reasonable access to grading instruments and/or grading criteria for individual assignments, projects, or exams and to review graded material; • to consult with each course’s faculty member regarding the petition process for graded coursework.

    14. FACULTY EXPECTATIONS • a positive, respectful, and engaged academic environment inside and outside the classroom; • students to appear for class meetings on time; • to select qualified course tutors; • students to appear at office hours or a mutually agreed appointment for official academic matters; • full attendance at examinations, midterms, presentations, and laboratories;

    15. FACULTY EXPECTATIONS • students to be prepared for class, appearing with appropriate materials and having completed assigned readings and homework; • full engagement within the classroom, including focus during lectures, appropriate and relevant questions, and class participation; • to cancel class due to emergency situations and to cover missed material during subsequent classes; • students to act with integrity and honesty. • CUHK has zero tolerance for plagiarism. Read: http://www.cuhk.edu.hk/policy/academichonesty/

    16. Outline • Administrative • Overall Course Introduction • Boolean Retrieval System • Inverted Index • Processing Boolean Queries • Query Optimization

    17. Definition of Information Retrieval • Information retrieval (IR) is finding material (usually documents) of an unstructured nature that satisfies an information need from within large collections (usually stored on computers)

    18. Information Retrieval • A hot topic in both industry and the research community

    19. Information Retrieval • Conferences related to IR • SIGIR • WWW • AAAI • CIKM • WSDM • KDD • TREC • ECIR • ACL • EMNLP • COLING • …

    20. Search Engine Issues • Domain of Information • Size, type, etc. • Search Interface • User Interface • Hardware Systems • Scaling Problems • Performance Issues • Search Accuracy • Search Speed

    21. Anatomy of A Search Page

    22. Anatomy of A Search Result Page

    23. Anatomy of A Search Result Page

    24. (circa 1997) Anatomy of A Search Engine

    25. Search Engine Modules • Crawling • Storage • Indexing • Queries

    26. Crawler • Web crawling (the downloading of web pages) is done by several distributed crawlers • The URLserver sends lists of URLs to be fetched to the crawlers • The web pages that are fetched are then sent to the storeserver • The storeserver then compresses and stores the web pages in a repository • Every web page is assigned an associated ID number called a docID

    27. Indexer • The indexing function is performed by the indexer and the sorter. • The indexer reads the repository, uncompresses the documents, and parses them. • Each document is converted into a set of word occurrences called hits. • The hits record the word, position in document, an approximation of font size, and capitalization. • The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. • It parses out all the links in every web page and stores important information about them in an anchors file. • This file contains enough information to determine where each link points from and to, and the text of the link.

    28. URLresolver • The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. • It puts the anchor text into the forward index, associated with the docID that the anchor points to. • It also generates a database of links, which are pairs of docIDs. • The links database is used to compute PageRanks for all the documents.
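    The URLresolver step can be illustrated with a minimal Python sketch. It is not the original implementation: the function name `resolve_anchors`, the tuple layout of the anchors file, and the example.org URLs are all assumptions made for illustration. Only `urllib.parse.urljoin` is a real standard-library call.

```python
from urllib.parse import urljoin

def resolve_anchors(anchors, doc_ids):
    """Turn anchor records into docID link pairs and per-target anchor text.

    anchors: list of (source_url, href, anchor_text) tuples (hypothetical layout).
    doc_ids: dict mapping absolute URL -> docID (hypothetical lookup table).
    """
    links = []          # pairs (source docID, target docID)
    anchor_texts = {}   # target docID -> anchor texts that point to it
    for source_url, href, text in anchors:
        # Relative URL -> absolute URL, resolved against the linking page.
        target_url = urljoin(source_url, href)
        if source_url not in doc_ids or target_url not in doc_ids:
            continue  # skip pages that have no docID in this toy table
        src, dst = doc_ids[source_url], doc_ids[target_url]
        links.append((src, dst))
        # Anchor text is credited to the page the link points to.
        anchor_texts.setdefault(dst, []).append(text)
    return links, anchor_texts

# Toy data, invented for this example.
doc_ids = {
    "http://example.org/a/index.html": 1,
    "http://example.org/a/b.html": 2,
    "http://example.org/c.html": 3,
}
anchors = [
    ("http://example.org/a/index.html", "b.html", "page b"),
    ("http://example.org/a/index.html", "../c.html", "page c"),
    ("http://example.org/a/b.html", "/c.html", "see c"),
]
links, anchor_texts = resolve_anchors(anchors, doc_ids)
print(links)
print(anchor_texts)
```

    The list of docID pairs plays the role of the links database that a link-analysis step such as PageRank would consume.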

    29. Sorter • The sorter takes the barrels, which are sorted by docID, and re-sorts them by wordID to generate the inverted index. • The sorter also produces a list of wordIDs and offsets into the inverted index. • A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. • The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.
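    The re-sorting idea can be sketched in a few lines of Python. This is a simplified illustration, not the original sorter: hits are assumed to be plain `(docID, wordID, position)` tuples, and the name `invert` is hypothetical.

```python
from itertools import groupby

def invert(forward_hits):
    """Re-sort hits ordered by docID into an inverted index keyed by wordID.

    forward_hits: list of (doc_id, word_id, position) tuples (assumed layout).
    Returns a dict mapping word_id -> list of (doc_id, position).
    """
    # Sort by wordID first, then docID and position, so that each
    # postings list comes out ordered by docID.
    by_word = sorted(forward_hits, key=lambda h: (h[1], h[0], h[2]))
    inverted = {}
    for word_id, group in groupby(by_word, key=lambda h: h[1]):
        inverted[word_id] = [(doc_id, pos) for doc_id, _, pos in group]
    return inverted

# Toy forward index, sorted by docID (values invented for this example).
forward = [(1, 10, 0), (1, 20, 1), (2, 20, 0), (2, 10, 3)]
inverted = invert(forward)
print(inverted)
```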

    30. Topics to Cover • Boolean retrieval • The term vocabulary & postings lists • Dictionaries and tolerant retrieval • Scoring, term weighting & the vector space model • Computing scores in a complete search system • Probabilistic information retrieval • Text classification & Naive Bayes • Matrix decompositions & latent semantic indexing • Web search basics • Web crawling and indexes • Link analysis • Multimedia Information Retrieval

    31. Information Retrieval and Search Engine (diagram labels; chapter numbers in parentheses) • Web Crawling (20) • Document Parsing (1,2) • Indices • Indexing (2,20) • Ads (19) • Scoring (6,7,9,11,12,18) • Classification (13,14,15) • Result Clustering (16,17) • Quality Ranking (21) • Query • Query Processing (3)

    32. Crawling (Ch. 20) • Initialize queue with URLs of known seed pages • Repeat • Take URL from queue • Fetch and parse page • Extract URLs from page • Add URLs to queue • Fundamental assumption: The Web is well linked
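    The crawling loop above can be sketched as follows. To keep the example self-contained and runnable offline, fetching and parsing are replaced by a caller-supplied function over an in-memory toy web. The names `crawl`, `fetch_links`, `max_pages`, and `toy_web` are assumptions for illustration, and the `seen` set is an addition so the same URL is not queued twice.

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl: returns URLs in the order they were fetched.

    fetch_links(url) stands in for "fetch and parse page, extract URLs".
    """
    queue = deque(seeds)   # initialize queue with known seed pages
    seen = set(seeds)      # URLs already queued (avoids re-fetching)
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()            # take URL from queue
        order.append(url)
        for link in fetch_links(url):    # fetch, parse, extract URLs
            if link not in seen:
                seen.add(link)
                queue.append(link)       # add new URLs to queue
    return order

# Toy link graph standing in for the Web (invented for this example).
toy_web = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": [], "e": ["a"]}
visited = crawl(["a"], lambda url: toy_web.get(url, []))
print(visited)
```

    Page "e" is never reached because nothing links to it, which is exactly why the approach depends on the Web being well linked.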

    33. Crawling (Ch. 20) • How do we distribute the crawler so we can scale up? • We can’t index everything: we need to subselect. How? • How do we fight against spam and spider traps? • What are basic requirements a crawler should meet?

    34. Document Parsing (Ch. 2) • What decisions should we make when parsing a document? • Language? Character set? Tokenization?

    35. Indexing (Ch. 2, 20) • Why do we need an index? • To gain the speed benefits of indexing at retrieval time, we need to build the index in advance • Search query: Brutus Calpurnia

    36. Indexing (Ch. 2, 20) • If we employ a distributed crawler, how do we construct distributed indices? • Partition by terms? • Partition by documents?
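    As a rough sketch of why an index built in advance pays off at query time, the following Python example builds a small inverted index and answers the query Brutus Calpurnia, interpreted here as Brutus AND Calpurnia (an assumption). The four documents are invented toy data, tokenization is a plain lowercase whitespace split, and the function names are hypothetical.

```python
def build_index(docs):
    """Build term -> sorted list of docIDs (a postings list) ahead of time."""
    index = {}
    for doc_id in sorted(docs):
        # set() so each document appears at most once per term
        for term in set(docs[doc_id].lower().split()):
            index.setdefault(term, []).append(doc_id)
    return index

def intersect(p1, p2):
    """Merge two sorted postings lists, keeping docIDs present in both."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

# Toy collection (invented for this example).
docs = {
    1: "Brutus killed Caesar",
    2: "Calpurnia warned Caesar",
    3: "Brutus and Calpurnia spoke",
    4: "Caesar was ambitious",
}
index = build_index(docs)
hits = intersect(index.get("brutus", []), index.get("calpurnia", []))
print(hits)
```

    At query time only the two postings lists are read, rather than every document in the collection.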

    37. Query Processing (Ch. 3) • How do we deal with wildcard queries? E.g. mon*, *mon • How do we do spelling correction? E.g., googel->google
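    Both questions can be made concrete with a small Python sketch, under stated assumptions. Trailing wildcards such as mon* are answered by binary search over a sorted vocabulary, and leading wildcards such as *mon by the same search over reversed terms. Spelling correction is approximated by picking the vocabulary term with the smallest Levenshtein edit distance. The vocabulary and all function names are invented for illustration, and real systems use more elaborate structures.

```python
from bisect import bisect_left

def prefix_matches(sorted_terms, prefix):
    """Return all terms in a sorted list that start with prefix."""
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        matches.append(term)
    return matches

def suffix_matches(vocab, suffix):
    """Answer *suffix by prefix search over the reversed terms."""
    reversed_terms = sorted(term[::-1] for term in vocab)
    return sorted(t[::-1] for t in prefix_matches(reversed_terms, suffix[::-1]))

def edit_distance(a, b):
    """Levenshtein distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

# Toy vocabulary (invented for this example).
vocab = sorted(["google", "lemon", "monday", "money", "month", "salmon"])
print(prefix_matches(vocab, "mon"))   # query mon*
print(suffix_matches(vocab, "mon"))   # query *mon
correction = min(vocab, key=lambda term: edit_distance("googel", term))
print(correction)
```

    Under plain Levenshtein distance, googel and google are two edits apart, because swapping two adjacent letters counts as two substitutions.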

    38. Classification (13,14,15) • How does a computer know whether a news article is about technology or health?

    39. Clustering (16,17) • Document clustering is the process of grouping a set of documents into clusters of similar documents

    40. Quality Ranking (21) • There are millions of documents relevant to the query “information retrieval”; how do we rank them? • Some spam pages repeat keywords many times; how do we downgrade their rankings?

    41. Scoring (6,7,9,11,12,18) • Goal: measure how relevant a document is to a query • Term frequency, inverse document frequency • Vector space modeling • Relevance feedback • Probabilistic information retrieval • Language modeling • Latent semantic indexing • …
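    To give a feel for the first item, here is a minimal tf-idf scoring sketch in Python. It assumes one common weighting, raw term frequency multiplied by log10(N / df), summed over the query terms. Other variants exist, and the function name and the three documents are invented for illustration.

```python
import math
from collections import Counter

def tf_idf_scores(query, docs):
    """Score each document as the sum over query terms of tf * log10(N / df)."""
    n = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)  # term frequency within this document
        scores[doc_id] = sum(
            tf[t] * math.log10(n / df[t])
            for t in query.lower().split()
            if df[t] > 0  # ignore query terms absent from the collection
        )
    return scores

# Toy collection (invented for this example).
docs = {
    1: "information retrieval and search",
    2: "information theory",
    3: "search engines search the web",
}
scores = tf_idf_scores("search retrieval", docs)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

    Document 1 ranks first because "retrieval" occurs in only one document and so carries a higher inverse document frequency than "search".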

    42. Outline • Administrative • Overall Course Introduction • Boolean Retrieval System (Ch.1 of IR Book) • Inverted Index • Processing Boolean Queries • Query Optimization

    43. Motivation of this Lecture • Introduce the simplest form of information retrieval system • Boolean information retrieval • Understand each component of the Boolean information retrieval system

    44. Does Google Use the Boolean Model? • On Google, the default interpretation of a query [w1 w2 . . . wn] is w1 AND w2 AND . . . AND wn

    45. Cases where you get hits that do not contain one of the wi • Anchor text • Anchor text usually gives relevant descriptive and contextual information about the content of the link's destination • <a href="http://en.wikipedia.org/wiki/Main_Page">Wikipedia</a> Anchor text: “Wikipedia”

    46. Cases where you get hits that do not contain one of the wi • Page contains variant of wi (morphology, spelling correction, synonym) • Long queries (n is large) • Boolean expression which generates very few hits

    47. Simple Boolean vs. Ranking of Result Set • Simple Boolean retrieval returns matching documents in no particular order • Google (and most well-designed Boolean engines) rank the result set: they rank good hits (according to some estimator of relevance) higher than bad hits.

    48. Outline • Administrative • Overall Course Introduction • Boolean Retrieval System (Ch.1 of IR Book) • Inverted Index • Processing Boolean Queries • Query Optimization