INF 141: Information Retrieval Discussion Session Week 3 – Winter 2010 TA: Sara Javanmardi
How to submit Answers For Assignment 3 • Create a PDF file containing your answers to the general and extra credit questions. For the programming question, create a txt file containing the answers and a jar file containing the code. Put all files in a folder, name the folder <StudentID>_<StudentID>_<StudentID>_Assignment03, zip it, and submit it to EEE (only one of the team members needs to submit the .zip file).
Grading Assignments 1 & 2 • Come to my office at ICS1:408E (all team members) at one of the following time slots • Wednesday, Jan 19, 3:30 pm to 6 pm • Monday, Jan 24, 9 am to 12 pm • I will ask you to explain your algorithm and to run your code on a test input file. I might also ask some other general questions related to these two assignments.
Quiz 1 • Next week Jan 26, in the discussion class • Closed Book • All material covered in weeks 1, 2, 3.
Assignment 3 • You can get it from http://www.ics.uci.edu/~sjavanma/IR/Assignments/Assignment3/ • Deadline Jan 30
Crawler4j • http://code.google.com/p/crawler4j/ • Read the sample usage • Download: • crawler4j-2.2.zip: unzip it and put the 'lib' folder in your Java project. • crawler4j-dependencies-lib.zip: unzip it and add the .jar files to the 'lib' folder. Create a folder called 'resources' and put the .properties file in it.
Main Classes • Create two classes https://crawler4j.googlecode.com/svn/trunk/crawler4j/src/edu/uci/ics/crawler4j/example/simple/ • Controller • MyCrawler
Controller Class • At this point you should see some errors. Oops, we forgot to import the .jar files!
Add External JARs • Select all, press Open, and then OK.
MyCrawler: Main Methods • shouldVisit(WebURL url) • Should I put this URL in the frontier or not? • visit(Page page) • How should I process this page coming from the head of the frontier? • page.getWebURL().getURL(); • page.getHTML(); • page.getText(); • page.getURLs() • Example https://crawler4j.googlecode.com/svn/trunk/crawler4j/src/edu/uci/ics/crawler4j/example/advanced/MyCrawler.java
An Example https://crawler4j.googlecode.com/svn/trunk/crawler4j/src/edu/uci/ics/crawler4j/example/advanced/MyCrawler.java
Content Articles

public static boolean isArticle(String titlePartOfUrl) {
    if (titlePartOfUrl.startsWith("Image:")
            || titlePartOfUrl.startsWith("Wikipedia:")
            || titlePartOfUrl.startsWith("Category:")
            || titlePartOfUrl.startsWith("Special:")
            || titlePartOfUrl.startsWith("Image_talk:")
            || titlePartOfUrl.startsWith("Portal:")
            || titlePartOfUrl.startsWith("Wikipedia_talk:")
            || titlePartOfUrl.startsWith("User:")
            || titlePartOfUrl.startsWith("Template:")
            || titlePartOfUrl.startsWith("Template_talk:")
            || titlePartOfUrl.startsWith("Help:")
            || titlePartOfUrl.startsWith("Talk:")
            || titlePartOfUrl.startsWith("User_talk:")
            || titlePartOfUrl.startsWith("Category_talk:")
            || titlePartOfUrl.startsWith("Media:")
            || titlePartOfUrl.startsWith("MediaWiki:")
            || titlePartOfUrl.startsWith("File:")
            || titlePartOfUrl.startsWith("MediaWiki_Talk:")) {
        return false;
    }
    return true;
}

http://en.wikipedia.org/wiki/Bing_search_engine
http://en.wikipedia.org/wiki/Category:Bing
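The filter above can be exercised without running the crawler: take the title part of a Wikipedia URL (everything after the last '/') and pass it to isArticle. A minimal, self-contained sketch; the extractTitle helper and the prefix array are my rearrangement for the demo, not part of the assignment code:

```java
// Self-contained check of the namespace filter from the slide.
// extractTitle is an illustrative helper, not crawler4j API.
public class ArticleFilterDemo {
    public static boolean isArticle(String titlePartOfUrl) {
        // Same prefixes as the slide, collected into an array for readability.
        String[] nonArticlePrefixes = {
            "Image:", "Wikipedia:", "Category:", "Special:", "Image_talk:",
            "Portal:", "Wikipedia_talk:", "User:", "Template:", "Template_talk:",
            "Help:", "Talk:", "User_talk:", "Category_talk:", "Media:",
            "MediaWiki:", "File:", "MediaWiki_Talk:"
        };
        for (String prefix : nonArticlePrefixes) {
            if (titlePartOfUrl.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    // The title part of a Wikipedia URL is everything after the last '/'.
    public static String extractTitle(String url) {
        return url.substring(url.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        // The two example URLs from the slide: one article, one category page.
        System.out.println(isArticle(extractTitle("http://en.wikipedia.org/wiki/Bing_search_engine")));
        System.out.println(isArticle(extractTitle("http://en.wikipedia.org/wiki/Category:Bing")));
    }
}
```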
Main Questions To Answer • How to count unique terms; what data structure? • How to write the results to file(s)? • How to handle the concurrency problems that might occur? • static • synchronized • AtomicInteger
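One way these pieces can fit together, sketched under my own naming (TermCounter, recordPage, and the whitespace tokenization are illustrative, not the assignment's required design): a static concurrent set shared by all crawler threads holds the unique terms, and an AtomicInteger counts pages without explicit locking.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of thread-safe counting shared across crawler threads.
// Class and method names are illustrative.
public class TermCounter {
    // One set shared by all threads. ConcurrentHashMap.newKeySet() gives a
    // thread-safe Set, so add() needs no synchronized block.
    private static final Set<String> uniqueTerms = ConcurrentHashMap.newKeySet();

    // AtomicInteger makes the increment atomic across threads.
    private static final AtomicInteger pagesVisited = new AtomicInteger();

    // Called from each crawler's visit(Page) with the page text.
    public static void recordPage(String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                uniqueTerms.add(term);
            }
        }
        pagesVisited.incrementAndGet();
    }

    public static int uniqueTermCount() { return uniqueTerms.size(); }
    public static int pageCount() { return pagesVisited.get(); }
}
```

Because both fields are static, every MyCrawler instance (one per thread) updates the same counters.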
Example: IO • Option 1: there is only one file and all threads (crawlers) write to it. • Option 2: each thread has its own file, and I merge all the files when the threads (crawlers) are done.
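The second option avoids locking entirely. A hedged sketch of that strategy (class name, file-name pattern, and paths are all illustrative): each thread writes lines to its own file, and a merge step concatenates them once every thread has finished.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Sketch of per-thread output files merged after all threads finish.
public class PerThreadOutput {
    // Each crawler thread writes to its own file, so no locking is needed.
    public static void writeThreadOutput(Path dir, int threadId, List<String> lines)
            throws IOException {
        Files.write(dir.resolve("crawler-" + threadId + ".txt"), lines);
    }

    // After all threads are done, concatenate the per-thread files into one.
    public static void merge(Path dir, Path merged) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(merged);
             DirectoryStream<Path> files =
                     Files.newDirectoryStream(dir, "crawler-*.txt")) {
            for (Path p : files) {
                for (String line : Files.readAllLines(p)) {
                    out.write(line);
                    out.newLine();
                }
            }
        }
    }
}
```

With the first option (a single shared file), every write would instead have to go through one synchronized stream, like the static PrintStream on the next slide.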
Sample Code Snippet

private static PrintStream out;
static {
    try {
        out = new PrintStream("/home/sara/Wikipedia2/Train-Test-Features/testUserPageStatus.txt");
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
}

public MyCrawler() {
}