
Web Scraping Using Nutch and Solr 3/3

A short presentation (part 3 of 3) describing the use of the open source tools Nutch and Solr to crawl the web and process the resulting data.





Presentation Transcript


1. Solr Extracting Data
   • Start this session with a fully indexed Solr repository
   • Movie cAiYBD4BQeE showed the installation
   • Movie Th5Scvlyt-E showed the Nutch web crawl
   • This movie will show how to
     • Extract data from Solr
     • Extract to XML or CSV
     • Show the aim of loading the data into a data warehouse
   • This movie assumes you know Linux

2. Solr Extracting Data
   • Progress so far; the greyed-out area is yet to be examined

3. Checking Solr Data
   • The data should have been indexed in Solr
   • In the Solr Admin window
     • Set 'Core Selector' = collection1
     • Click 'Query'
     • In the Query window set the fl field = url
     • Click 'Execute Query'
   • The result (next slide) shows the filtered list of URLs in Solr; the same check can be run from the shell, as sketched below
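A minimal sketch of the same check using curl, assuming the default Solr 4.x port (8983) and the collection1 core shown in the admin window; rows limits how many documents come back:

    # Ask collection1 for up to 10 documents, returning only the url field
    curl 'http://localhost:8983/solr/collection1/select?q=*:*&fl=url&rows=10&wt=json'

The single quotes stop the shell from interpreting the * and & characters in the URL.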

  4. Checking Solr Data

5. How To Extract
   • How could we get at the Solr data?
     • In the admin console via a query
     • Via an HTTP Solr select call
     • Via a curl -o call using the Solr HTTP select (examples below)
   • What format of data suits this purpose?
     • XML
     • Comma separated values (CSV)
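A minimal sketch of the last two routes, assuming the default port and a url field in the index:

    # In a browser (or any HTTP client) the select handler answers directly:
    #   http://localhost:8983/solr/select?q=*:*&fl=url&wt=xml
    # The same request from the shell, saved to a file with curl -o:
    curl -o sample.xml 'http://localhost:8983/solr/select?q=*:*&fl=url&wt=xml'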

6. How To Extract
   • We want to extract two columns from Solr
     • tstamp, url
   • We want to extract as CSV (the csv in the call below could equally be xml; both variants are sketched after this list)
   • We want to extract to a file
   • So we will use an HTTP call
     • http://localhost:8983/solr/select?q=*:*&fl=tstamp,url&wt=csv
   • We will also use a curl call
     • curl -o <csv file> '<http call>'
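Since the wt parameter only selects the response writer, the same query can be saved as CSV or XML; a minimal sketch, assuming the tstamp and url fields that the Nutch schema provides:

    # CSV output: a header line, then one row per indexed document
    curl -o result.csv 'http://localhost:8983/solr/select?q=*:*&fl=tstamp,url&wt=csv'
    # XML output: the same documents, wrapped in <doc> elements
    curl -o result.xml 'http://localhost:8983/solr/select?q=*:*&fl=tstamp,url&wt=xml'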

7. How To Extract
   • Create a bash file in the Solr install directory (the full script is sketched below)
     • cd solr-4-2-1/extract ; touch solr_url_extract.bash
     • chmod 755 solr_url_extract.bash
   • Add contents to the bash file
     • #!/bin/bash
     • curl -o result.csv 'http://localhost:8983/solr/select?q=*:*&fl=tstamp,url&wt=csv'
     • mv result.csv result.csv.$(date +"%Y%m%d.%H%M%S")
   • Now run the bash script
     • ./solr_url_extract.bash
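Assembled, the script looks like the sketch below. Note the plain double quotes in the date format (the curly quotes on the original slide are not shell quoting characters) and the single quotes around the URL, which stop the shell from interpreting & and *:

    #!/bin/bash
    # solr_url_extract.bash - dump the tstamp and url of every indexed document
    curl -o result.csv 'http://localhost:8983/solr/select?q=*:*&fl=tstamp,url&wt=csv'
    # Timestamp the output so repeated runs do not overwrite each other
    mv result.csv result.csv.$(date +"%Y%m%d.%H%M%S")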

8. Check Output
   • Now we check whether we have data (the shell commands are sketched below)
   • ls -l shows
     • result.csv.20130506.124857
   • Checking the content, wc -l shows 11 lines
   • Checking the content, head -2 shows
     • tstamp,url
     • 2013-05-04T01:56:58.157Z,http://www.mysite.co.nz/Search?DateRange=7& ...
   • Congratulations, you have extracted data from Solr
   • It is in CSV format, ready to be loaded into a data warehouse
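A minimal sketch of the same checks from the shell, using the timestamped file name shown on the slide:

    ls -l result.csv.*                                # the extract file exists
    wc -l result.csv.20130506.124857                  # 11 lines: 1 header + 10 rows
    head -2 result.csv.20130506.124857                # header plus the first data row
    tail -n +2 result.csv.20130506.124857 | wc -l     # count the data rows alone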

9. Possible Next Steps
   • Choose more fields to extract from the data
   • Allow the Nutch crawl to go deeper
   • Allow the Nutch crawl to collect much more data
   • Look at facets in the Solr data (a sample facet query is sketched below)
   • Load the CSV files into a data warehouse staging schema
   • The next movie will show the next step in progress
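As a taste of the facet step, a minimal sketch of a facet query, assuming the host field defined by the standard Nutch Solr schema; rows=0 suppresses the document list so only the counts come back:

    # Count indexed documents per host without returning the documents themselves
    curl 'http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=host&wt=json'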

10. Contact Us
   • Feel free to contact us at
     • www.semtech-solutions.co.nz
     • info@semtech-solutions.co.nz
   • We offer IT project consultancy
   • We are happy to hear about your problems
   • You pay only for the hours you need to solve them
