INF 141: Information Retrieval Discussion Session Week 3 – Winter 2010 TA: Sara Javanmardi
How to submit Answers For Assignment 3 • Create a PDF file containing your answers to the general and extra credit questions. For the programming question, create a txt file containing the answers and a jar file containing the code. Put all files in a folder, name the folder <StudentID>_<StudentID>_<StudentID>_Assignment03, zip it, and submit it to EEE (only one of the team members needs to submit the .zip file).
Grading Assignments 1 & 2 • Come to my office at ICS1:408E (all team members) at one of the following time slots • Wednesday, Jan 19, 3:30 pm to 6 pm • Monday, Jan 24, 9 am to 12 pm • I will ask you to explain your algorithm and to run your code on a test input file. I might also ask some other general questions related to these two assignments.
Quiz 1 • Next week Jan 26, in the discussion class • Closed Book • All material covered in weeks 1, 2, 3.
Assignment 3 • You can get it from http://www.ics.uci.edu/~sjavanma/IR/Assignments/Assignment3/ • Deadline Jan 30
Crawler4j • http://code.google.com/p/crawler4j/ • Read the sample usage • Download: • crawler4j-2.2.zip: unzip it and put the 'lib' folder in your Java project. • crawler4j-dependencies-lib.zip: unzip it and add the .jar files to the 'lib' folder. Create a folder called 'resources' and put the .properties file in it.
Main Classes • Create two classes https://crawler4j.googlecode.com/svn/trunk/crawler4j/src/edu/uci/ics/crawler4j/example/simple/ • Controller • MyCrawler
Controller Class • At this point you should see some errors. Oops, we forgot to import the .jar files!
Add External JARs • Select all, press Open, and then OK.
MyCrawler: Main Methods • shouldVisit(WebURL url) • Should I put this URL in the frontier or not? • visit(Page page) • How should I process this page coming from the head of the frontier? • page.getWebURL().getURL(); • page.getHTML(); • page.getText(); • page.getURLs() • Example https://crawler4j.googlecode.com/svn/trunk/crawler4j/src/edu/uci/ics/crawler4j/example/advanced/MyCrawler.java
An Example https://crawler4j.googlecode.com/svn/trunk/crawler4j/src/edu/uci/ics/crawler4j/example/advanced/MyCrawler.java
Content Articles

public static boolean isArticle(String titlePartOfUrl) {
    if (titlePartOfUrl.startsWith("Image:")
            || titlePartOfUrl.startsWith("Wikipedia:")
            || titlePartOfUrl.startsWith("Category:")
            || titlePartOfUrl.startsWith("Special:")
            || titlePartOfUrl.startsWith("Image_talk:")
            || titlePartOfUrl.startsWith("Portal:")
            || titlePartOfUrl.startsWith("Wikipedia_talk:")
            || titlePartOfUrl.startsWith("User:")
            || titlePartOfUrl.startsWith("Template:")
            || titlePartOfUrl.startsWith("Template_talk:")
            || titlePartOfUrl.startsWith("Help:")
            || titlePartOfUrl.startsWith("Talk:")
            || titlePartOfUrl.startsWith("User_talk:")
            || titlePartOfUrl.startsWith("Category_talk:")
            || titlePartOfUrl.startsWith("Media:")
            || titlePartOfUrl.startsWith("MediaWiki:")
            || titlePartOfUrl.startsWith("File:")
            || titlePartOfUrl.startsWith("MediaWiki_Talk:")) {
        return false;
    }
    return true;
}

http://en.wikipedia.org/wiki/Bing_search_engine
http://en.wikipedia.org/wiki/Category:Bing
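The filter above can be exercised without running the crawler: take the title part of a Wikipedia URL (everything after the last '/') and pass it to isArticle. A minimal, self-contained sketch; the extractTitle helper and the prefix array are my rearrangement for the demo, not part of the assignment code:

```java
// Self-contained check of the namespace filter from the slide.
// extractTitle is an illustrative helper, not crawler4j API.
public class ArticleFilterDemo {
    public static boolean isArticle(String titlePartOfUrl) {
        // Same prefixes as the slide, collected into an array for readability.
        String[] nonArticlePrefixes = {
            "Image:", "Wikipedia:", "Category:", "Special:", "Image_talk:",
            "Portal:", "Wikipedia_talk:", "User:", "Template:", "Template_talk:",
            "Help:", "Talk:", "User_talk:", "Category_talk:", "Media:",
            "MediaWiki:", "File:", "MediaWiki_Talk:"
        };
        for (String prefix : nonArticlePrefixes) {
            if (titlePartOfUrl.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    // The title part of a Wikipedia URL is everything after the last '/'.
    public static String extractTitle(String url) {
        return url.substring(url.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        // The two example URLs from the slide: one article, one category page.
        System.out.println(isArticle(extractTitle("http://en.wikipedia.org/wiki/Bing_search_engine")));
        System.out.println(isArticle(extractTitle("http://en.wikipedia.org/wiki/Category:Bing")));
    }
}
```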
Main Questions To Answer • How to count unique terms; what data structure? • How to write the results to file(s)? • How to handle the concurrency problems that might occur? • static • synchronized • AtomicInteger
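One way these pieces can fit together, sketched under my own naming (TermCounter, recordPage, and the whitespace tokenization are illustrative, not the assignment's required design): a static concurrent set shared by all crawler threads holds the unique terms, and an AtomicInteger counts pages without explicit locking.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of thread-safe counting shared across crawler threads.
// Class and method names are illustrative.
public class TermCounter {
    // One set shared by all threads. ConcurrentHashMap.newKeySet() gives a
    // thread-safe Set, so add() needs no synchronized block.
    private static final Set<String> uniqueTerms = ConcurrentHashMap.newKeySet();

    // AtomicInteger makes the increment atomic across threads.
    private static final AtomicInteger pagesVisited = new AtomicInteger();

    // Called from each crawler's visit(Page) with the page text.
    public static void recordPage(String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                uniqueTerms.add(term);
            }
        }
        pagesVisited.incrementAndGet();
    }

    public static int uniqueTermCount() { return uniqueTerms.size(); }
    public static int pageCount() { return pagesVisited.get(); }
}
```

Because both fields are static, every MyCrawler instance (one per thread) updates the same counters.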
Example: IO • Option 1: there is only one file and all threads (crawlers) write to it. • Option 2: each thread has its own file, and I merge all the files when the threads (crawlers) are done.
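The second option avoids locking entirely. A hedged sketch of that strategy (class name, file-name pattern, and paths are all illustrative): each thread writes lines to its own file, and a merge step concatenates them once every thread has finished.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Sketch of per-thread output files merged after all threads finish.
public class PerThreadOutput {
    // Each crawler thread writes to its own file, so no locking is needed.
    public static void writeThreadOutput(Path dir, int threadId, List<String> lines)
            throws IOException {
        Files.write(dir.resolve("crawler-" + threadId + ".txt"), lines);
    }

    // After all threads are done, concatenate the per-thread files into one.
    public static void merge(Path dir, Path merged) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(merged);
             DirectoryStream<Path> files =
                     Files.newDirectoryStream(dir, "crawler-*.txt")) {
            for (Path p : files) {
                for (String line : Files.readAllLines(p)) {
                    out.write(line);
                    out.newLine();
                }
            }
        }
    }
}
```

With the first option (a single shared file), every write would instead have to go through one synchronized stream, like the static PrintStream on the next slide.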
Sample Code Snippet

private static PrintStream out;
static {
    try {
        out = new PrintStream("/home/sara/Wikipedia2/Train-Test-Features/testUserPageStatus.txt");
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
}

public MyCrawler() {
}