
Data collection with Web crawlers (Web-crawl graphs)


Presentation Transcript


  1. Data collection with Web crawlers (Web-crawl graphs)

  2. further experience:
     • technical/technological
       • “treading lightly”
       • incremental versus batch crawling
       • HTTP headers
       • character sets and malformed headers/URLs
       • shallow/deep queries
     • methodological
       • minimise modification/distortion of data
       • maximise accessibility to the data

  3. incremental versus batch crawling
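
  Only the heading survives on this slide. As a minimal sketch of the incremental side, assuming pages are revisited with a conditional GET so that an unchanged page costs a 304 response rather than a full re-download; the function and the date handling are illustrative, not taken from the slides:

    import urllib.request
    from urllib.error import HTTPError

    def refetch_if_modified(url, last_crawl_http_date):
        """Re-fetch a page only if it changed since the last crawl.

        last_crawl_http_date is an HTTP-date string recorded on the
        previous visit, e.g. 'Wed, 01 Mar 2006 10:00:00 GMT'.
        Returns the new body, or None if the server reports 304.
        """
        request = urllib.request.Request(
            url, headers={"If-Modified-Since": last_crawl_http_date})
        try:
            with urllib.request.urlopen(request) as response:
                return response.read()      # page changed: take the new copy
        except HTTPError as err:
            if err.code == 304:             # Not Modified: keep the old copy
                return None
            raise

  A batch crawl, by contrast, simply re-downloads everything on each run; the conditional-GET approach trades a little bookkeeping for much lighter load on the crawled servers.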

  4. HTTP headers
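
  Again only the heading remains. One hedged sketch of the request side: a crawler that identifies itself and “treads lightly” by sending contact details in its headers. Every name and address below is a placeholder, not the project's actual configuration:

    import urllib.request

    # Illustrative request headers for an identifiable, polite crawler;
    # the crawler name and contact address are placeholders.
    POLITE_HEADERS = {
        "User-Agent": "example-crawler/0.1 (+http://example.org/crawler)",
        "From": "crawler-admin@example.org",
        "Accept-Charset": "utf-8, iso-8859-1;q=0.5",
    }

    def fetch(url):
        request = urllib.request.Request(url, headers=POLITE_HEADERS)
        with urllib.request.urlopen(request) as response:
            # The response headers matter too: Content-Type carries the
            # declared character set (see the next slide).
            return response.headers, response.read()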

  5. character sets and malformed headers/URLs
     • cannot assume ASCII
       • WISER needs support for EU languages!
       • characters are no longer bytes
     • cannot assume either HTTP headers or HTML URLs are well formed
       • may contain arbitrary characters
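
  A defensive decoding sketch along these lines, assuming the character set is taken from the Content-Type header when one is usable, and that undecodable bytes are replaced rather than allowed to abort the crawl; the fallback encoding and the re-quoting rule are assumptions, not something the slides prescribe:

    import urllib.parse

    def decode_body(headers, raw_bytes):
        """Decode a response body without assuming ASCII or a
        well-formed Content-Type header."""
        # email.message.Message.get_content_charset copes with a missing
        # or malformed charset parameter by returning None.
        charset = headers.get_content_charset() or "iso-8859-1"
        try:
            return raw_bytes.decode(charset, errors="replace")
        except LookupError:               # unknown/garbage charset name
            return raw_bytes.decode("iso-8859-1", errors="replace")

    def requote_url(url):
        """Re-encode a URL that may contain arbitrary (non-ASCII)
        characters, leaving URL syntax and existing %-escapes intact."""
        return urllib.parse.quote(url, safe=":/?&=#%~+@,;")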

  6. blinker (Weblink crawler) development
     blinker is a stable parameterised link crawler based on standard software components
     • objectives
       • to identify problems in crawling, e.g. non-standard servers, malformed data
       • to demonstrate ethical crawling
       • to provide Web-crawl graphs
       • to compare the effect of varying crawling parameters
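
  The slides name blinker only at this level of detail, so the following is purely hypothetical: one way the knobs of a “parameterised” link crawler might be grouped, with every field name invented for illustration rather than taken from blinker's interface.

    from dataclasses import dataclass

    @dataclass
    class CrawlParameters:
        """Illustrative parameters for a parameterised link crawler;
        these names are hypothetical, not blinker's actual API."""
        max_depth: int = 5              # how far to follow links
        delay_seconds: float = 1.0      # pause between requests ("treading lightly")
        incremental: bool = True        # conditional GETs versus full batch
        collect_shallow_queries: bool = True   # see the next slide
        stay_within_hosts: tuple = ()          # restrict the crawl to these hosts

  Varying such parameters run by run is what makes it possible to compare their effect on the resulting Web-crawl graphs.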

  7. shallow/deep queries
     • the query URL problem
       • query URLs are not necessarily dynamic
       • they are routinely collected by search engine crawlers
       • they may lead to recursion, but recursion is not eliminated by ignoring them
     • collecting shallow queries is a compromise
       • a shallow query is a query URL from a Web page that does not itself have a query URL
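
  The shallow-query rule is concrete enough to state in code. A sketch assuming a “query URL” means a URL with a non-empty query component; the function names are illustrative:

    import urllib.parse

    def is_query_url(url):
        """True if the URL has a query component, e.g. page.cgi?id=3."""
        return urllib.parse.urlparse(url).query != ""

    def should_collect(link_url, containing_page_url):
        """Collect ordinary links always; collect a query URL only when
        the page it was found on is not itself a query URL -- the
        'shallow query' compromise that bounds recursion."""
        if not is_query_url(link_url):
            return True
        return not is_query_url(containing_page_url)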

  8. (further) methodological goals
     • minimise modification/distortion of data
     • maximise accessibility
     These are discussed next in more detail in the context of using XML to exchange Web-crawl graphs.
