
Introduction to YouSeer

Discover the overview, components, and advantages of YouSeer, a complete and flexible open source search engine that integrates Heritrix and Solr. Learn about its architecture, workflow, and demo setup.


Presentation Transcript


  1. Introduction to YouSeer Partha Mukherjee pom5109@ist.psu.edu

  2. Outline • Overview • YouSeer components • Heritrix • Solr • Demo

  3. Overview • YouSeer is a complete and powerful open source search engine, available on SourceForge, that integrates the open source crawler Heritrix with the open source indexer Solr/Lucene • Java-based, and runs successfully on Windows • Requirements: – 512 MB RAM, 6.5 GB of hard disk space – Java 1.6 (Java 1.5 also works)

  4. Advantages of YouSeer • Built on top of scalable components – Tested on 23M documents, while Solr and Heritrix can scale to billions • Very flexible, and easy to extend • Modifying the index and the ingestion module is easy • The crawler supports complicated crawling policies

  5. YouSeer Components • Heritrix: – The Internet Archive’s crawler – Reported to scale up to 1B documents – Written in Java, and has a web interface • Apache Solr: – Open source enterprise search server based on Lucene – Has a REST-like API – Supports caching, distributed search, and index replication

  6. YouSeer Architecture [Diagram; component labels: Cache, Request, Apache Tomcat, DB, Heritrix, WWW, Storage, File System, Middleware, Apache Solr]

  7. Heritrix Workflow • 1) Choose a URI from among those scheduled • 2) Fetch that URI • 3) Analyze or archive the results • 4) Select discovered URIs of interest, and add them to those scheduled • 5) Note that the URI is done, and repeat
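The five steps above can be sketched as a small loop. This is only an illustration of the workflow, not Heritrix code: the function names (`crawl`, `fake_fetch`, `same_site`) and the in-memory "web" are assumptions made up for the example, and a real crawler would fetch over HTTP and apply politeness rules.

```python
from collections import deque


def crawl(seeds, fetch, in_scope, max_uris=100):
    """Run the five-step loop; return URIs in the order they were finished."""
    scheduled = deque(seeds)          # URIs waiting to be fetched
    seen = set(scheduled)             # everything ever scheduled, to avoid repeats
    done = []

    while scheduled and len(done) < max_uris:
        uri = scheduled.popleft()     # 1) choose a URI from those scheduled
        links = fetch(uri)            # 2) fetch it; 3) analysis yields its outlinks
        for link in links:            # 4) keep discovered URIs of interest
            if in_scope(link) and link not in seen:
                seen.add(link)
                scheduled.append(link)
        done.append(uri)              # 5) note the URI as done, then repeat
    return done


# Hypothetical stand-in for real HTTP fetching: a tiny in-memory link graph.
FAKE_WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/"],
}


def fake_fetch(uri):
    return FAKE_WEB.get(uri, [])


def same_site(uri):
    return uri.startswith("http://a.example/")


order = crawl(["http://a.example/"], fake_fetch, same_site)
print(order)
```

The `in_scope` callback plays the role of a crawling policy: here it keeps the crawl on one site, so the link to `b.example` is discovered but never scheduled.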

  8. Heritrix Crawl Result • By default, Heritrix writes everything it crawls to disk as Internet Archive ARC files • By default, Heritrix writes compressed ARC files • The compression is done with gzip • Each record (which contains a document) is gzipped separately • All gzipped records are concatenated to make up one file of multiple gzipped members
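The "multiple gzipped members" layout can be demonstrated with Python's standard library. The sketch below builds a toy file from three separately gzipped byte strings and reads the members back one at a time. It does not parse the ARC record headers; the helper name `split_gzip_members` and the sample data are assumptions for illustration only.

```python
import gzip
import zlib


def split_gzip_members(data):
    """Decompress each gzip member in turn and return the payloads as a list."""
    records = []
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        member = zlib.decompressobj(wbits=31)
        records.append(member.decompress(data) + member.flush())
        # Bytes after the end of this member belong to the next member.
        data = member.unused_data
    return records


# Toy "archive": each record is gzipped on its own, then all are concatenated.
docs = [b"first document", b"second document", b"third document"]
archive = b"".join(gzip.compress(d) for d in docs)

records = split_gzip_members(archive)   # one payload per gzip member
whole = gzip.decompress(archive)        # the gzip module also reads all members at once
```

Because every record is its own gzip member, a reader can start decompressing at any member boundary without unpacking the whole file.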

  9. Apache Solr • Very popular distribution of Lucene • Easy to configure and optimize – All modifications are made in XML files – No need to touch the code • The index has a schema, similar to a database schema – Think of the index as a table in a database whose columns you have to define

  10. Solr Schema Example
      <field name="url" type="string" indexed="true" stored="true"/>
      <field name="title" type="text" indexed="true" stored="true"/>
      <field name="keywords" type="text_ws" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
      <field name="creationDate" type="date" indexed="true" stored="true"/>
      <field name="rating" type="sint" indexed="true" stored="true"/>
      <field name="published" type="boolean" indexed="true" stored="true"/>
      <field name="content" type="text" indexed="true" stored="true"/>
      <field name="all" type="text" indexed="true" stored="true" multiValued="true"/>

  11. Solr Documents • Solr accepts well-formed XML documents:
      <add>
        <doc>
          <field name="URL">www.cnn.com</field>
          <field name="title">CNN Breaking News – Obama wins</field>
          <field name="content">Barack Obama is the 44th president of the USA</field>
          <field name="pubDate">2008-11-06T23:59:59.999Z</field>
        </doc>
      </add>
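An add message of this shape is safer to generate with an XML library than by string concatenation, because special characters in titles or content then get escaped automatically. A minimal sketch, assuming the field names from the slide; the helper `build_add_xml` and the sample values are made up for the example, and sending the result to Solr is left out.

```python
import xml.etree.ElementTree as ET


def build_add_xml(fields):
    """Build an <add><doc>...</doc></add> message from a dict of field values."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value            # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


xml_payload = build_add_xml({
    "URL": "www.cnn.com",
    "title": "News & weather",        # contains a character that must be escaped
    "pubDate": "2008-11-06T23:59:59.999Z",
})
print(xml_payload)
```

Each field name used here would have to match a field defined in the Solr schema for the document to be accepted.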

  12. YouSeer workflow • Waits for the crawled documents to be written • Iterates over the compressed files and processes the documents • Extracts the textual content of each document and parses its metadata • Generates an XML file as output – Each custom extractor appends its result to this file • This XML file is submitted to the index
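The idea that each custom extractor appends its result to one output document can be sketched as a chain of small functions. This is not YouSeer's actual extractor interface, which the slides do not show: the signatures, the function names, and the regex-based HTML handling below are all assumptions chosen to keep the example short.

```python
import re
import xml.etree.ElementTree as ET


def extract_title(url, html):
    """Hypothetical extractor: return the page title as a (name, value) pair."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return [("title", match.group(1).strip())] if match else []


def extract_text(url, html):
    """Hypothetical extractor: strip tags and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", html)
    return [("content", " ".join(text.split()))]


def ingest(documents, extractors):
    """Run every extractor on every document and collect the fields in one <add>."""
    add = ET.Element("add")
    for url, html in documents:
        doc = ET.SubElement(add, "doc")
        ET.SubElement(doc, "field", name="url").text = url
        for extractor in extractors:          # each extractor appends its own fields
            for name, value in extractor(url, html):
                ET.SubElement(doc, "field", name=name).text = value
    return ET.tostring(add, encoding="unicode")


pages = [("http://a.example/",
          "<html><title>Hello</title><body><p>Some text</p></body></html>")]
output = ingest(pages, [extract_title, extract_text])
print(output)
```

Adding a new kind of metadata then means writing one more function and adding it to the list, without touching the loop.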

  13. Demo: Configuration • The Solr schema is already configured in your installation • Solr is installed on Tomcat • The Heritrix web interface listens on port 8080 by default, the same port as the Apache Tomcat server • So change it to another port number, e.g. ./heritrix -p 9000

  14. Demo • Download the virtual machine image from http://sourceforge.net/projects/youseer/files/VM/youseer.0.1/fedora-11-i386.zip/download – Unzip fedora-11-i386.zip – The virtual image is a Linux VMware image • To run the VM, download and install VMware Player from: http://www.vmware.com/products/player/ – Double-click the VMware virtual machine configuration icon

  15. Demo

  16. Demo • Log in to YouSeer with the password “heritrixsolr” – You are now in a virtual Linux environment running inside Windows • When leaving the VM environment: – Log out of YouSeer (“youseer -> quit”) – Shut down the VM (“shutdown”) – Press Ctrl + Alt to return to your local machine

  17. Demo

  18. Demo • About to start Heritrix (the crawler): – In the VM, open a terminal – Go to the apps directory (cd apps) – There you will find Solr, Tomcat, heritrix-1.14.3, and other applications – Don’t forget to start the Solr server before running Heritrix • Go to apache-solr…/example/ • Locate the jar file “start.jar” and run it • Solr should keep running the whole time

  19. Demo

  20. Demo

  21. Demo • Now open another terminal, or another tab in the same terminal – Go to heritrix-1.14.3 under /home/apps – Run Heritrix with the following command-line arguments • ./heritrix -p XXXX --admin=nameX:passwordX • Now open the browser in the VM and enter the URL – http://localhost:XXXX – The Heritrix UI appears (username = nameX, password = passwordX)

  22. Demo: Heritrix

  23. Demo: Heritrix

  24. Demo: Heritrix

  25. Demo: Heritrix • Configure the first job • The most important parameter is the user agent, under Configurations – Enter a valid URL and email address – Enter http://www.psu.edu – And your OWN email address – Do not run more than 5 threads • This avoids overloading the machine and crashing the system

  26. Demo: Heritrix

  27. Demo: Features of Heritrix

  28. Demo: More features

  29. Demo : Heritrix

  30. Demo • ARC files are written to: – ~/crawler/heritrix-1.14.3/jobs/JOB-NAME/arcs • To start Tomcat, enter start-tomcat – Solr will start automatically • The YouSeer ingestion module (middleware) is located under: – ~/youseer/release • Add a folder entry to the Apache web server configuration file – Retrieve cached copies of documents from the ARC files – Use the URL of Solr to post the documents – Specify the number of worker threads that process the documents – java -jar YouSeer.jar [IndexURL] [Path_ARCfiles] [Cached_virtual_Folder] [Number_of_Threads] [wait_Time]

  31. Demo • To index the documents crawled by Heritrix: – Navigate to ~/youseer/release – Run: java -jar YouSeer.jar http://localhost:8983/solr/update /absolute/path/to/arc/files /cachingDirectory 1 0 • The arguments, in order: – The Solr URL – The full path to the ARC files – The virtual directory that maps to the cached files – The number of threads (please keep it below 5) – The waiting time between retries

  32. Demo

  33. Comments • YouSeer records which ARC files have been processed in a database; the default name is submitted.db – If you want to re-ingest the documents, map the virtual directory within the Tomcat directory – Update the submitted.db file – Execute $ path="/cached" docBase="/heritrix-1.14.3/jobs/JOB_NAME/arcs" crossContext="false" debug="0" reloadable="true"/ • The search interface: – http://localhost:8080/youseerui

  34. Shots

  35. Test case (http://pike.psu.edu)

  36. Test Case(:pike)

  37. References • http://youseer.sourceforge.net/doc/Tutorial.pdf • http://crawler.archive.org/articles/user_manual/ • http://lucene.apache.org/solr/tutorial.html • To download the components separately: – https://sourceforge.net/projects/youseer/ – https://sourceforge.net/projects/archive-crawler/files/archive-crawler%20(heritrix%201.x)/ – http://www.apache.org/dyn/closer.cgi/lucene/solr

  38. THANK YOU
