An overview of where to start when bootstrapping a search project: crawl with Apache Nutch and index directly into Solr, build a seed list from the DMOZ RDF dump, and use LingPipe or OpenNLP for entity extraction, with Freebase for entity identification and taxonomies. Links to tutorials and example code are included.
Search Bootstrapping: How / Where to Get Started
Crawling
• Start with Nutch
  • http://nutch.apache.org/
• Index directly to Solr
  • http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/
• Create a seed list from DMOZ RDF
  • http://www.dmoz.org/rdf.html
  • http://wiki.apache.org/nutch/NutchTutorial
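The seed-list step can be sketched in a few lines of Python. This is only an illustration, not the tooling from the Nutch tutorial: it assumes the DMOZ content dump lists each page as an `<ExternalPage about="...">` element, and the function names (`iter_urls`, `sample_seeds`, `write_seed_file`) are hypothetical. It takes a random subset of the URLs in one pass, since the full dump is far larger than a first crawl needs.

```python
import html
import random
import re
from typing import Iterable, Iterator, List

# Assumed shape of a page entry in the DMOZ content dump:
#   <ExternalPage about="http://example.com/">
EXTERNAL_PAGE = re.compile(r'<ExternalPage\s+about="([^"]+)"')


def iter_urls(lines: Iterable[str]) -> Iterator[str]:
    """Yield every URL found in an ExternalPage opening tag."""
    for line in lines:
        match = EXTERNAL_PAGE.search(line)
        if match:
            # The dump is XML, so '&' in a URL appears as '&amp;'.
            yield html.unescape(match.group(1))


def sample_seeds(lines: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Reservoir-sample up to k URLs without loading the dump into memory."""
    rng = random.Random(seed)
    reservoir: List[str] = []
    for i, url in enumerate(iter_urls(lines)):
        if i < k:
            reservoir.append(url)
        else:
            # Replace an existing pick with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = url
    return reservoir


def write_seed_file(urls: Iterable[str], path: str) -> None:
    """Write one URL per line, the plain-text form a crawler seed list takes."""
    with open(path, "w", encoding="utf-8") as out:
        for url in urls:
            out.write(url + "\n")
```

A typical use would be to open the downloaded dump as a text file, pass the file object to `sample_seeds` with the number of seeds wanted, and write the result with `write_seed_file` into the directory that holds the crawl's seed URLs.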
Understanding Content
• Entity Extraction
  • LingPipe http://alias-i.com/lingpipe/
  • OpenNLP http://incubator.apache.org/opennlp/
• Entity Identification / Taxonomies
  • Freebase http://www.freebase.com/
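To make the entity identification idea concrete, here is a minimal dictionary-lookup sketch in Python. It is not the LingPipe or OpenNLP API, both of which are Java libraries that typically rely on trained models; it only shows the simpler idea of matching text against a known list of names and their types, the role a taxonomy such as Freebase could play. The `GAZETTEER` entries and the `find_entities` function are hypothetical and hand-written for illustration.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical hand-written gazetteer: lowercased name -> entity type.
# A real system might load such names from a taxonomy instead.
GAZETTEER: Dict[str, str] = {
    "apache nutch": "software",
    "solr": "software",
    "new york": "location",
}


def find_entities(
    text: str, gazetteer: Dict[str, str], max_words: int = 3
) -> List[Tuple[str, str]]:
    """Return (name, type) pairs, preferring the longest match at each position."""
    tokens = re.findall(r"\w+", text.lower())
    found: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in gazetteer:
                found.append((candidate, gazetteer[candidate]))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no phrase starts here; move to the next token
    return found
```

Calling `find_entities("Apache Nutch indexes pages into Solr.", GAZETTEER)` would return the two software entries. Exact lookup like this misses spelling variants and ambiguous names, which is where the statistical extractors listed above come in.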
Some Additional Links
• Basic Web Page Parser
  • https://github.com/pjaol/Webcrawler
• Example of OpenNLP usage
  • https://github.com/pjaol/entity_extractor