
Information Retrieval and Web Search

Vasile Rus, PhD vrus@memphis.edu www.cs.memphis.edu/~vrus/teaching/ir-websearch/. Information Retrieval and Web Search. Administrivia Why Information Retrieval? Information Overload. Outline. Web Site: http://www.cs.memphis.edu/~vrus/teaching/ir-websearch/ Instructor Vasile Rus, PhD


Presentation Transcript


  1. Information Retrieval and Web Search. Vasile Rus, PhD; vrus@memphis.edu; www.cs.memphis.edu/~vrus/teaching/ir-websearch/

  2. Outline
     • Administrivia
     • Why Information Retrieval?
     • Information Overload

  3. General Information
     • Web Site: http://www.cs.memphis.edu/~vrus/teaching/ir-websearch/
     • Instructor: Vasile Rus, PhD
     • Office: 323 Dunn Hall
     • Office Hours: 323 Dunn Hall; T-R 10:00-11:00AM
     • Phone: x5259
     • E-mail: vrus@memphis.edu
     • TA: Shanshan Gao (office hours: TBD)

  4. Why Attend This Class?
     • It will help you cope with the information overload problem
     • It will allow you to design and implement solutions for handling large collections of information
     • It is FUN! (hopefully)

  5. Syllabus
     • Week 1: Introduction to IR and Web Search
     • Week 2: Introduction to PERL
     • Week 3: Classic IR: Boolean and Vectorial Models
     • Week 4: More IR Models
     • Week 5: Evaluation in IR
     • Week 6: Query Operations and Languages
     • Week 7: Text Properties, Text Operations
     • Week 8: NO CLASS – FALL BREAK, Indexing and Searching, Review
     • Week 9: MIDTERM, WWW and Web Search Intro

  6. Syllabus (cont’d)
     • Week 10: Web Search
     • Week 11: Text Categorization
     • Week 12: Text Clustering
     • Week 13: Question Answering
     • Week 14: Advanced IR Models, THANKSGIVING HOLIDAY
     • Week 15: Project Presentations, Review
     • Week 16: Final Exam

  7. To be successful you need to
     • Read the syllabus
     • Understand the structure of the course
     • Read the general policies
     • Attend classes and participate by asking questions and/or contributing related remarks
     • Explore the course website

  8. To be successful you need to (cont’d)
     • Try to enjoy the programming assignments
     • Don't limit yourself to what is asked in class

  9. Grading
     • Assignments: 6-8 (or more) assignments, 35%
     • Project: 30%
     • 2 Exams: Midterm (15%), Final (15%)
     • Active Participation, Presentations: 5%
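The weights on the grading slide add up to 100%, so a final score can be read as a weighted average of the component scores. The following is a minimal sketch in Python, assuming every component is scored on a 0-100 scale (the slide does not say so); the names `WEIGHTS`, `final_score`, and the example scores are hypothetical.

```python
# Component weights in percent, as listed on the grading slide.
WEIGHTS = {
    "assignments": 35,
    "project": 30,
    "midterm": 15,
    "final": 15,
    "participation": 5,
}


def final_score(scores):
    """Weighted average of component scores (assumed to be on a 0-100 scale)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores, for illustration only.
example = {
    "assignments": 90,
    "project": 80,
    "midterm": 70,
    "final": 85,
    "participation": 100,
}

print(sum(WEIGHTS.values()))  # 100
print(final_score(example))   # 83.75
```

Keeping the weights as integer percentages avoids floating-point drift when checking that they sum to 100.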

  10. Grading
     • A score within 2.5 points above or below a cut-off earns a + or - next to your letter grade. For example, 89 has a letter equivalent of B+
     • Exception: 90-91 will give you A-, 92 to 96 will give you A, and anything above 97 means A+
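One possible reading of the +/- rule can be sketched in Python. The slide does not list the cut-offs, so the sketch assumes the common 90/80/70/60 scale; it also assumes that F carries no modifier and that a score of exactly 97 counts as A+. The names `CUTOFFS`, `MARGIN`, and `letter_grade` are hypothetical, and this is an illustration rather than the instructor's actual procedure.

```python
# Assumed cut-offs; the slide does not state them.
CUTOFFS = [("A", 90), ("B", 80), ("C", 70), ("D", 60)]
MARGIN = 2.5  # points above or below a cut-off that earn a - or +


def letter_grade(score):
    """Map a numeric score to a letter grade under the stated assumptions."""
    # Exception given on the slide for the A range.
    if score >= 90:
        if score < 92:
            return "A-"
        return "A" if score < 97 else "A+"  # 97 itself as A+ is an assumption
    if score < 60:
        return "F"  # assumption: no +/- modifiers on F
    # Pair each letter with the cut-off of the next higher letter.
    for (letter, cutoff), (_, upper) in zip(CUTOFFS[1:], CUTOFFS):
        if score >= cutoff:
            if score >= upper - MARGIN:
                return letter + "+"  # within 2.5 below the next cut-off
            if score < cutoff + MARGIN:
                return letter + "-"  # within 2.5 above this letter's cut-off
            return letter


print(letter_grade(89))  # B+
print(letter_grade(90))  # A-
```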

  11. Other Issues
     • Attendance can help you when your grade is on the borderline
     • PhD students need to make a class presentation (besides the project presentation)
     • General announcements are posted on the web site frequently! Please check it as often as possible
     • If you notice any inconsistencies on the website (broken links, misspellings, etc.) please notify me. Thank you!

  12. Bibliography
     • REQUIRED: Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval
     • RECOMMENDED (!): Frakes & Baeza-Yates, Information Retrieval: Data Structures and Algorithms
     • RECOMMENDED (!): C. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval

  13. Office Hours and Extra Help
     • I'll be available in my office at the following times: TR 10:00AM - 11:00AM, or by appointment
     • You must send me an email to set up an appointment
     • If you just knock on my door without notice, chances are that I'll be busy
     • The TA’s office hours can be found on the website
     • Please use the office hours!

  14. Assignment Submission
     • You will have on average one to two weeks from the date the work is assigned
     • Late submissions are not accepted
     • In exceptional cases you may have a 48-hour grace period at the cost of 50% of the grade (you should ask for it before the due date)

  15. Programming Assignments
     • Programming submissions are electronic (using a form or email) AND on paper
     • File names should contain your name and the assignment number, e.g.: vasileRus.Assignment01.sh
     • The code should be well indented and contain lots of comments; see the Recommended code-style guidelines on the website
     • Each file should contain a header as given in the next slide
     • If multiple files are submitted, pack them using gzip, tar, etc.
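The naming convention can be illustrated with a small Python helper. This is a sketch built only from the single example on the slide (vasileRus.Assignment01.sh): the lower-case first name, capitalized last name, and two-digit assignment number are inferred from that example, and the function name `submission_filename` is hypothetical.

```python
def submission_filename(first_name, last_name, assignment_number, extension):
    """Build a file name shaped like the slide's example, vasileRus.Assignment01.sh."""
    # Lower-case first name followed by capitalized last name, as in "vasileRus".
    author = first_name.lower() + last_name.capitalize()
    # Two-digit, zero-padded assignment number (inferred from "Assignment01").
    return f"{author}.Assignment{assignment_number:02d}.{extension}"


print(submission_filename("Vasile", "Rus", 1, "sh"))  # vasileRus.Assignment01.sh
```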

  16. File Header

      /*************************************
       * Name: FileName, Package name if necessary
       * Assignment: assignment ID
       * Description: a text describing the assignment
       * Author: Your Name
       * Date: put here the due date
       * Comments: any comments you think are necessary
       *************************************/

  17. Plagiarism
     • Plagiarism is not tolerated. If caught, you'll be given a grade of 0 (zero) and disciplinary action will be taken
     • It's OK to help friends who may have problems; this is actually a good learning tool, but it is not OK to share code or answers
     • If they need it, help them or discuss with them, but never show them your code
     • I may (and I will) ask you to demonstrate and explain your programs

  18. Exams
     • During exams you should sit as far from each other as possible
     • As a rule of thumb, leave at least one chair between you and any other student
     • Usually, all exams are closed book
     • Exams are normally made of true-false questions, multiple-choice questions, and “open” questions (programming or not)
     • There are no make-up exams

  19. Questions

  20. Information Overload: “The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all.” (W.H. Auden)

  21. Information Overload

  22. Coping With It! • “reserve large blocks of time on your calendar, don’t answer the phone, and return calls in short bursts once or twice a day” (Drucker, 1967)

  23. Coping With It! • some combination of focusing, filtering, and forgetting • It requires a tremendous amount of self-discipline, and we can’t do it alone: in our teams and across the whole organization, we need to establish a set of norms that support a more productive way of working. • “Multitasking is not heroic; it’s counterproductive” • http://www.mckinsey.com/insights/organization/recovering_from_information_overload

  24. Coping With It! • We have to admit, for example, that we do feel satisfied when we can respond quickly to requests and that doing so somewhat validates our desire to feel so necessary to the business that we rarely switch off. There’s nothing wrong with these feelings, but we need to consider them alongside their measurable cost to our long-term effectiveness. No one would argue that burning up all of a company’s resources is a good strategy for long-term success, and that is equally true of its leaders and their mental resources.

  25. What kinds of information are there?
     • Text books, periodicals, WWW, memos, ads; published/refereed
     • Film
     • Photos, other images
     • Broadcast TV, radio
     • Telephone conversations
     • Databases

  26. How much information is there? (Estimates courtesy of Hal Varian and Peter Lyman)
     • Original: http://www.sims.berkeley.edu/emc
     • Newer: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/

  27. How Much Information?
     • Stored Information: Print, Film, Optical, Magnetic
     • Communicated: Internet, Broadcast, Phone, Mail

  28. Print: Annual Production
     • Books: 968,735 = 8 Terabytes (compressed image)
     • Newspapers: 22,643 = 25 Terabytes
     • Journals: 40,000 = 2 Terabytes
     • Magazines: 80,000 = 10 Terabytes
     • Office Documents: 12x10^9 pages = 312 Terabytes
     • TOTAL: 357 Terabytes

  29. Print
     • Library of Congress printed book collection: about 18 million books, about 130 Terabytes (compressed image)
     • For all of LC we should also assume:
        • 13M photographs, 5MB each = 65 TB
        • 4M maps, say 200 TB
        • 500K files, 1GB each = 500 TB
        • 3.5M sound recordings, ~2000 TB
     • Grand total: 3 petabytes (~3000 terabytes)
     • Books in Print (which you can buy TODAY): 3.2 million titles, about 26 Terabytes
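The Library of Congress grand total can be checked by adding up the per-collection estimates from the slide. The sketch below is in Python and assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes), which the slide does not specify; the variable names are illustrative.

```python
# Decimal storage units (an assumption; the slide does not say decimal vs. binary).
MB, GB, TB, PB = 10**6, 10**9, 10**12, 10**15

# Per-collection estimates copied from the slide.
collection_bytes = {
    "printed books": 130 * TB,
    "photographs": 13_000_000 * 5 * MB,  # 13M photographs at 5 MB each
    "maps": 200 * TB,
    "files": 500_000 * 1 * GB,           # 500K files at 1 GB each
    "sound recordings": 2000 * TB,
}

total_bytes = sum(collection_bytes.values())

for name, size in collection_bytes.items():
    print(f"{name}: {size // TB} TB")
print(f"total: {total_bytes // TB} TB (about {total_bytes / PB:.1f} PB)")
# total: 2895 TB (about 2.9 PB)
```

The sum comes to 2,895 TB, which the slide rounds to 3 petabytes.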

  30. Film and Image
     • Photographs = 410 Petabytes per year
     • Movies = 16 Terabytes (commercial production of about 4000 films)
     • X-Rays = 12 Petabytes

  31. Optical Media
     • CD-Music: 90,000 items = 58 TB
     • CD-ROM: 3,000 items = 3 TB
     • DVD-Video: 5,000 items = 22 TB
     • Total: 83 TB

  32. Magnetic Media
     • Audio Tape: 184,200,000 = 184.2 Petabytes
     • Video Tape: 355,000,000 = 1420
     • Floppy disks = 0.07
     • Removable disks = 1.69
     • Hard Disks = 500

  33. Totals Stored Per Year (Terabytes/Year)

      Medium     Type of content     Upper Bound    Lower Bound
      Paper      Books                         8              7
                 Newspapers                   25             20
                 Periodicals                  12             12
                 Office documents            312            312
                 SUBTOTAL                    357            351
      Film       Photographs             410,000        100,000
                 Cinema                       16             16
                 X-Rays                   12,000         12,000
                 SUBTOTAL                422,000        112,016
      Optical    Music CDs                    58             40
                 Data CDs                      3              3
                 DVDs                         22             22
                 SUBTOTAL                     83             65
      Magnetic   Camcorder               300,000        300,000
                 Disk drives           2,555,000      1,000,200
                 SUBTOTAL              2,855,000      1,300,200
      TOTAL                            3,277,440      1,412,632
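The subtotals and grand totals in this table can be cross-checked with a few lines of Python. The figures are copied from the slide; the lower bound for disk drives is taken as 1,000,200, a value inferred from the magnetic subtotal (1,300,200 minus 300,000 for camcorder tape). The names `STORED`, `SLIDE_SUBTOTALS`, and `subtotal` are hypothetical.

```python
# Terabytes per year as (upper bound, lower bound), copied from the slide.
STORED = {
    "Paper": {
        "Books": (8, 7),
        "Newspapers": (25, 20),
        "Periodicals": (12, 12),
        "Office documents": (312, 312),
    },
    "Film": {
        "Photographs": (410_000, 100_000),
        "Cinema": (16, 16),
        "X-Rays": (12_000, 12_000),
    },
    "Optical": {
        "Music CDs": (58, 40),
        "Data CDs": (3, 3),
        "DVDs": (22, 22),
    },
    "Magnetic": {
        "Camcorder": (300_000, 300_000),
        # Lower bound inferred from the subtotal: 1,300,200 - 300,000.
        "Disk drives": (2_555_000, 1_000_200),
    },
}

# Subtotals exactly as printed on the slide.
SLIDE_SUBTOTALS = {
    "Paper": (357, 351),
    "Film": (422_000, 112_016),
    "Optical": (83, 65),
    "Magnetic": (2_855_000, 1_300_200),
}


def subtotal(medium):
    """Sum the upper and lower bounds over all content types of one medium."""
    rows = list(STORED[medium].values())
    return (sum(upper for upper, _ in rows), sum(lower for _, lower in rows))


for medium in STORED:
    print(medium, "computed:", subtotal(medium), "slide:", SLIDE_SUBTOTALS[medium])

# Grand totals obtained by adding the slide's own subtotals.
slide_total = tuple(sum(s[i] for s in SLIDE_SUBTOTALS.values()) for i in (0, 1))
print(slide_total)  # (3277440, 1412632)
```

Every computed subtotal matches the slide except the film upper bound, where the rows sum to 422,016 and the slide shows the rounded 422,000; the slide's grand total is consistent with the rounded figure.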

  34. Human Memory
     • Landauer 86: the human brain holds 200MB
     • Looked at the rate of information intake, the rate of forgetting, and the amount of information adults need for normal tasks
     • 6B people on earth implies a total memory of all people alive of about 1,200 petabytes
     • Another way: estimate that people take in a byte/sec; a lifetime is 25,000 days or 2B sec, so the result is 2 GB (doesn’t count synthesizing new info)
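Both human-memory estimates are simple multiplications and can be reproduced in Python. The sketch assumes decimal units (1 MB = 10^6 bytes, 1 PB = 10^15 bytes) and uses the slide's round figure of 2 billion seconds for a lifetime; the variable names are illustrative.

```python
# Decimal storage units (an assumption; the slide does not specify).
MB, GB, PB = 10**6, 10**9, 10**15

# Estimate 1: 200 MB per brain times the world population.
brain_bytes = 200 * MB
population = 6_000_000_000
total_memory_pb = brain_bytes * population // PB
print(total_memory_pb)  # 1200

# Estimate 2: one byte per second over a lifetime of about 2 billion seconds.
intake_bytes_per_sec = 1
lifetime_seconds = 2_000_000_000
lifetime_intake_gb = intake_bytes_per_sec * lifetime_seconds // GB
print(lifetime_intake_gb)  # 2

# For reference, 2 billion seconds expressed in days (86,400 seconds per day).
lifetime_days = lifetime_seconds / 86_400
print(round(lifetime_days))  # 23148
```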

  35. Summary
     • Administrivia
     • Why Information Retrieval

  36. Next: Introduction to Information Retrieval
