
Web Scraping Using Nutch and Solr 2/3

A short presentation ( part 2 of 3 ) describing the use of the open source tools Nutch and Solr to crawl the web and process the resulting data.


Presentation Transcript


  1. Web Scraping Using Nutch and Solr - Part 2
  • The following example assumes that you have
  • Watched “Web Scraping with Nutch and Solr” ( part 1 )
  • The identity of that movie is cAiYBD4BQeE
  • Set up a Linux-based Nutch/Solr environment
  • Run the web scrape shown in that movie
  • Now we will
  • Clean up that environment
  • Web scrape a parameterised URL
  • View the URLs in the data

  2. Empty Nutch Database
  • Clean up the Nutch crawl database
  • Previously we used apache-nutch-1.6/nutch_start.sh
  • This contained the -dir crawl option
  • Which created the apache-nutch-1.6/crawl directory
  • That directory contains our Nutch data
  • Clean it as follows ( see the sketch below )
  • cd apache-nutch-1.6; rm -rf crawl
  • This is safe only because it contained dummy data !
  • The next run of the script will create the directory again
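A minimal shell sketch of the clean-up above, assuming the part 1 layout where the -dir crawl option created the crawl directory; the directory test is just a guard so the commands are safe to re-run:

    # Remove the dummy Nutch crawl data left over from part 1.
    cd apache-nutch-1.6
    if [ -d crawl ]; then
        rm -rf crawl    # the next run of nutch_start.sh recreates it
    fi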

  3. Empty Solr Database
  • Clean the Solr database via an update URL
  • Bookmark this URL
  • Only use it if you need to empty your data
  • Send the following delete query ( with the Solr server running )
  • http://localhost:8983/solr/update?commit=true with the body '<delete><query>*:*</query></delete>'
  • A curl sketch of this request follows below
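The delete query is easiest to send with curl. A sketch, assuming a default Solr install on localhost:8983; commit=true plus the XML delete body is what actually empties the index:

    # Delete every document in the Solr index and commit the change.
    curl 'http://localhost:8983/solr/update?commit=true' \
         -H 'Content-Type: text/xml' \
         --data-binary '<delete><query>*:*</query></delete>'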

  4. Set up Nutch
  • Now we will do something more complex
  • Web scrape a URL that has parameters i.e.
  • http://<site>/<function>?var1=val1&var2=val2
  • This web scrape will
  • Have extra URL characters '?=&'
  • Need greater search depth
  • Need better URL filtering ( the stock filter rejects these characters, see below )
  • Remember that you need to get permission to scrape a third party web site
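For reference, the stock apache-nutch-1.6/conf/regex-urlfilter.txt contains a skip rule that rejects exactly these query characters, which is why the part 1 setup never followed parameterised URLs:

    # Default entry in conf/regex-urlfilter.txt - drops any URL
    # containing ?, *, !, @ or =, i.e. most parameterised URLs.
    -[?*!@=]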

  5. Nutch Configuration
  • Change the seed file for Nutch
  • apache-nutch-1.6/urls/seed.txt
  • In this instance I will use a URL of the form
  • http://somesite.co.nz/Search?DateRange=7&industry=62
  • ( this is not a real URL, just an example )
  • Change the conf/regex-urlfilter.txt entries i.e.
  • # skip URLs containing certain characters
  • -[*!@]
  • # accept anything else
  • +^http://([a-z0-9]*\.)*somesite.co.nz\/Search
  • This will only consider somesite Search URLs ( full file sketch below )
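Putting the two edits together, a sketch of both changed files, assuming the example somesite.co.nz URL above. Note that escaping the dots in the hostname ( \. ) is slightly stricter than the slide's regex, since an unescaped dot matches any character:

    # apache-nutch-1.6/urls/seed.txt - the single seed URL
    http://somesite.co.nz/Search?DateRange=7&industry=62

    # apache-nutch-1.6/conf/regex-urlfilter.txt - edited entries
    # skip URLs containing certain characters ( ? = & now allowed )
    -[*!@]
    # accept only somesite Search URLs
    +^http://([a-z0-9]*\.)*somesite\.co\.nz/Search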

  6. Run Nutch
  • Now run Nutch using the start script
  • cd apache-nutch-1.6 ; ./nutch_start.sh
  • Monitor for errors in the Solr admin log window
  • The Nutch crawl should end with
  • crawl finished: crawl
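The start script itself was written in part 1; a minimal sketch of what it is assumed to contain, using the standard Nutch 1.6 crawl command ( the -depth and -topN values here are illustrative and may differ from your script ):

    #!/bin/bash
    # nutch_start.sh - crawl from urls/seed.txt and index into Solr
    bin/nutch crawl urls \
        -solr http://localhost:8983/solr/ \
        -dir crawl \
        -depth 3 \
        -topN 50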

  7. Checking Data
  • The data should have been indexed in Solr
  • In the Solr Admin window
  • Set 'Core Selector' = collection1
  • Click 'Query'
  • In the Query window set the fl field = url
  • Click 'Execute Query'
  • The result ( next slide ) shows the filtered list of URLs in Solr
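The same check can be run from the command line, which is handy for scripting. A sketch against the default collection1 core; fl=url restricts the response to the url field, just like the admin form:

    # List the first 20 indexed URLs as JSON.
    curl 'http://localhost:8983/solr/collection1/select?q=*:*&fl=url&wt=json&rows=20'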

  8. Checking Data
  • ( screenshot: the Solr query results, listing the url field values )

  9. Results
  • Congratulations, you have completed your second crawl
  • With parameterised URLs
  • With more complex URL filtering
  • With a Solr Query search

  10. Contact Us
  • Feel free to contact us at
  • www.semtech-solutions.co.nz
  • info@semtech-solutions.co.nz
  • We offer IT project consultancy
  • We are happy to hear about your problems
  • You pay for just the hours that you need
  • To solve your problems
