www.honeynetproject.ca Date: May 24, 2009 Prepared by: Serge Gorbunov ( serge@gserge.com )

Web Crawling

Content • What is web crawling? • What can you do with web crawlers? • Googlebot • Web crawling process • Simple architecture • Advanced architecture – Monkey-Spider • What is the hidden web, and why crawl it? • Questions to ask “before you crawl”

What is web crawling? • The process of browsing the Internet/WWW to build an index of data, analyze it, and store it for future reference • Crawlers must be able to download many pages in a short period of time and keep already downloaded pages up to date • Crawlers are used by search engines, marketing companies, researchers and others

What can you do with web crawlers? • “Download” the Internet • Quickly find important information for marketing purposes • Study societies/nations/groups of people • Analyze malware, spyware and “junk” on the Internet • Count the most frequently repeated words and letters in the world

Googlebot • Crawler used by Google to index and cache the Web • Step 1: Visit a number of pages -> extract all links -> visit all linked pages -> extract all links -> etc. • Step 2: An algorithm called PageRank assesses a page's importance by how many other Web pages link to it and by the importance of those linking pages

Googlebot The PageRank of a page is roughly based on the number of inbound links and on the PageRank of the pages providing those links. PageRank is reported on a scale from 0 to 10
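The idea that a page's rank depends on both the number and the rank of its inbound links can be illustrated with a small power-iteration sketch. This is a simplified textbook version, not Google's production algorithm: the damping factor of 0.85, the iteration count, and the four-page `toy_web` graph are all assumptions made for illustration, and the scores it produces are probabilities that sum to 1 rather than the 0 to 10 scale mentioned above.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Returns a dict of page -> score; the scores sum to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump" term).
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current rank evenly among its outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "hub" is linked to by every other page.
toy_web = {
    "hub": ["a"],
    "a": ["hub"],
    "b": ["hub", "a"],
    "c": ["hub"],
}
scores = pagerank(toy_web)
```

In this toy graph "hub" ends up with the highest score because it has the most inbound links, and "a" outranks "b" and "c" because it is linked from the highly ranked "hub".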

Googlebot • Web Page URL: http://facebook.com • PageRank: 9/10 • Web Page URL: http://honeynet.org • PageRank: 6/10 • Web Page URL: http://honeynetproject.ca • PageRank: 3/10

Web crawling process • Two things must be established before starting to “crawl”: • 1) Find a starting point: a list of initial URLs to start the search from (seeds) • Start from some known links • Use web search engines • 2) Determine a scope: how wide the crawl should go • Maximum link hops to include (URLs up to a given number of links away from a seed) • Maximum transitive hops to include (URLs up to a given number of transitive hops away)
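The two decisions above, seeds and scope, can be sketched as a breadth-first traversal with a hop limit. This is a minimal sketch under stated assumptions: `crawl`, `get_links`, and the `toy_links` graph are hypothetical names, and an in-memory dictionary stands in for real page downloads so the example runs offline. Only plain link hops are modelled, not transitive hops.

```python
from collections import deque


def crawl(seeds, get_links, max_hops):
    """Breadth-first crawl from the seeds, limited to max_hops link hops.

    get_links(url) must return the URLs linked from url.
    Returns a dict of in-scope URL -> hop distance from the nearest seed.
    """
    hops = {url: 0 for url in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if hops[url] == max_hops:
            continue  # at the edge of the scope: do not follow its links
        for link in get_links(url):
            if link not in hops:
                hops[link] = hops[url] + 1
                queue.append(link)
    return hops


# Hypothetical link graph standing in for real pages.
toy_links = {
    "http://seed.example/": ["http://seed.example/a", "http://seed.example/b"],
    "http://seed.example/a": ["http://seed.example/c"],
    "http://seed.example/b": [],
    "http://seed.example/c": ["http://seed.example/d"],
}
visited = crawl(["http://seed.example/"], lambda u: toy_links.get(u, []), max_hops=2)
```

With `max_hops=2` the page `http://seed.example/d` stays out of scope, because it is three links away from the seed.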

Simple Architecture

Simple Architecture • Queue – a list of pages to be processed/downloaded • Scheduler and revisiting policy – once the Web has been “downloaded”, its content will most likely be out of date, so the revisiting policy must decide when to re-download pages • Downloader: • Parallelization - downloading many pages in parallel • Serialization - downloading only one page at a time at maximum speed
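The two downloader strategies can be compared in a short sketch. Everything here is illustrative: `download_serial`, `download_parallel`, and `fake_fetch` are hypothetical names, and `fake_fetch` replaces a real HTTP request so the code runs without network access. The scheduler and revisiting policy are left out.

```python
from concurrent.futures import ThreadPoolExecutor


def download_serial(urls, fetch):
    """Download one page at a time, in queue order."""
    return {url: fetch(url) for url in urls}


def download_parallel(urls, fetch, workers=4):
    """Download several pages at once using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input URLs.
        return dict(zip(urls, pool.map(fetch, urls)))


def fake_fetch(url):
    """Hypothetical stand-in for an HTTP request, so the sketch runs offline."""
    return "<html>content of %s</html>" % url


queue = ["http://example.com/%d" % i for i in range(8)]
serial_pages = download_serial(queue, fake_fetch)
parallel_pages = download_parallel(queue, fake_fetch)
```

Both functions return the same pages. With a real `fetch` that waits on the network, the threaded version would typically finish sooner, at the cost of putting more simultaneous load on the crawled servers.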

Advanced Architecture – Monkey-Spider

Monkey-Spider Seeder • Generates a list of URLs • Method 1: Web search • Method 2: Extract URLs from spam emails • The monitoring seeder constantly reseeds previously found malicious content over time from the malware database
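Method 2 can be approximated with a regular expression over the message text. This is a rough sketch, not the seeder's actual implementation: `URL_PATTERN`, `extract_seed_urls`, and the sample message are assumptions, and the pattern is deliberately simple, so it will miss obfuscated links that real spam often uses.

```python
import re

# Deliberately simple pattern: a scheme followed by non-space, non-bracket,
# non-quote characters.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_seed_urls(email_text):
    """Return the unique URLs found in the email body, in order of appearance."""
    seen = set()
    seeds = []
    for match in URL_PATTERN.findall(email_text):
        url = match.rstrip(".,;:!?)")  # drop trailing sentence punctuation
        if url not in seen:
            seen.add(url)
            seeds.append(url)
    return seeds


# Hypothetical spam message.
spam = (
    "Cheap pills at http://pills.example/buy?id=7, order now!\n"
    'Click <a href="http://win.example/prize">here</a> or visit '
    "http://pills.example/buy?id=7."
)
seeds = extract_seed_urls(spam)
```

The repeated link is returned only once, so the resulting list can be handed to the crawler queue without duplicate seeds.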

Monkey-Spider Web Crawling - Heritrix • Open source • Queues the generated URLs from the seeder • Stores the crawled contents on the file server while generating detailed log files • Multi-threaded design • Link extraction • Web and JMX interface
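Link extraction, one of the steps listed above, can be illustrated with Python's standard-library HTML parser. This sketch is not how Heritrix (a Java application) implements it: `LinkExtractor` and `extract_links` are hypothetical names, and only `href` attributes of `<a>` tags are handled, whereas a production crawler also looks at frames, scripts, stylesheets and other sources of URLs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Parse the HTML and return the absolute link targets in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


page = '<p><a href="/about">About</a> <a href="http://other.example/x">X</a></p>'
links = extract_links("http://site.example/index.html", page)
```

Resolving relative links with `urljoin` matters because the queue needs absolute URLs that can be fetched independently of the page they were found on.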

Monkey-Spider Web Crawling - Heritrix

Monkey-Spider Malware analysis • Scanner extracts ARC files • Analyzes the content with multiple antivirus (AV) scanners • Identified malware and malicious Web sites are stored in the malware directory • Information regarding the malicious content is stored in the database
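The scan-and-record part of this pipeline can be sketched as follows. The sketch makes several assumptions: ARC extraction is skipped, the two entries in `scanners` are hypothetical stand-ins for real antivirus engines, and the `malware` table layout and the in-memory SQLite database are invented for illustration rather than taken from the project.

```python
import hashlib
import sqlite3


def scan_with_all(content, scanners):
    """Run every scanner on the content; return {scanner_name: verdict or None}."""
    return {name: scan(content) for name, scan in scanners.items()}


def record_findings(db, url, content, verdicts):
    """Store one row per scanner that flagged the content."""
    digest = hashlib.sha256(content).hexdigest()
    for scanner, verdict in verdicts.items():
        if verdict is not None:
            db.execute(
                "INSERT INTO malware (url, sha256, scanner, verdict) VALUES (?, ?, ?, ?)",
                (url, digest, scanner, verdict),
            )
    db.commit()


# Hypothetical scanners: real ones would invoke antivirus engines.
scanners = {
    "scanner_a": lambda data: "Test.Signature" if b"EVIL" in data else None,
    "scanner_b": lambda data: None,
}

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE malware (url TEXT, sha256 TEXT, scanner TEXT, verdict TEXT)")
sample = b"...EVIL payload..."
record_findings(db, "http://bad.example/x.exe", sample, scan_with_all(sample, scanners))
rows = db.execute("SELECT url, scanner, verdict FROM malware").fetchall()
```

Keeping one row per scanner verdict makes it possible to see later which engines agreed on a sample, which is the point of scanning with more than one.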

Crawling the hidden web • Tapping into unexplored information • Improving user experience • Because many Web users rely heavily on search engines to locate information, search engines influence how users perceive the Web • Users do not necessarily perceive what actually exists on the Web, but what is indexed by search engines

Hidden Web database model • Textual database • A site that mainly contains plain-text documents • Simple search interface where users type a list of keywords into a single search box • Structured database • Multi-attribute relational data • Multi-attribute search interfaces

Textual crawler • The crawler has to generate a query and issue it to the Web site • Download the result index page and follow its links to download the actual pages • Everything comes down to the queries submitted to the search interface • Some studies suggest that the hidden web is about 500 times larger than the public web
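The query, download, and follow loop described above can be sketched against a toy site. All names here are hypothetical (`crawl_hidden_site`, `toy_search`, the `site` dictionary), and the keyword strategy, which simply reuses every word seen in downloaded documents, is a naive assumption chosen for brevity. Choosing good queries is the hard part of a real hidden-web crawler.

```python
def crawl_hidden_site(search, fetch, start_keywords, max_queries=10):
    """Query a search-box site repeatedly and collect the documents it returns.

    search(keyword) returns the document URLs on the result index page.
    fetch(url) returns the text of one document.
    New query keywords are taken from the text of downloaded documents.
    """
    pending = list(start_keywords)
    tried = set()
    documents = {}
    while pending and len(tried) < max_queries:
        keyword = pending.pop(0)
        if keyword in tried:
            continue
        tried.add(keyword)
        for url in search(keyword):
            if url not in documents:
                documents[url] = fetch(url)
                # Words from new documents become candidate queries.
                pending.extend(documents[url].lower().split())
    return documents


# Hypothetical site: three documents reachable only through the search box.
site = {
    "/doc/1": "honeypot malware",
    "/doc/2": "malware crawler",
    "/doc/3": "crawler queue",
}


def toy_search(keyword):
    """Stand-in for the site's search interface: exact word match."""
    return [url for url, text in site.items() if keyword in text.split()]


found = crawl_hidden_site(toy_search, site.get, ["honeypot"])
```

Starting from the single keyword "honeypot", the crawler reaches all three documents because each downloaded page supplies a keyword that uncovers the next one. The `max_queries` limit bounds the load placed on the site.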

Questions to ask “Before you crawl” • What information are you looking for? • What sites to crawl? • What content to crawl? • How to extract links from the crawled content? • What crawling performance is necessary? • What data to store, where, and in what format? • How to analyze the data?

Let’s crawl

