Introduction to Web Crawling and Regular Expression

CSC4170 Web Intelligence and Social Computing

Tutorial 1

Tutor: Tom Chao Zhou

Email: [email protected]


Outline

  • Course & Tutors Information

  • Introduction to Web Crawling

    • Utilities of a crawler

    • Features of a crawler

    • Architecture of a crawler

  • Introduction to Regular Expression

  • Appendix


Course and Tutors Information

  • Course homepage:

    • http://wiki.cse.cuhk.edu.hk/irwin.king/teaching/csc4170/2009

  • Tutors:


Utilities of a crawler

  • Names: Web crawler, also called a spider.

  • Definition:

    • A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. (Wikipedia)

  • Utilities:

    • Gather pages from the Web.

    • Support a search engine, perform data mining and so on.

  • Objects gathered:

    • Text, video, image and so on.

    • Link structure.


Features of a crawler

  • Must provide:

    • Robustness: resilience to spider traps, for example:

      • Infinitely deep directory structures: http://foo.com/bar/foo/bar/foo/...

      • Pages filled with a very large number of characters.

    • Politeness: respect which pages may be crawled and which may not

      • Robots Exclusion Protocol: robots.txt

      • Example: http://blog.sohu.com/robots.txt

        • User-agent: *

        • Disallow: /manage/
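
The robots.txt rules above can be checked programmatically. The following is a minimal sketch using Python's standard `urllib.robotparser`; it parses the two rules from memory rather than downloading the file. The user-agent name `ExampleCrawler` and the two page paths are made up for illustration and do not come from the slides.

```python
from urllib.robotparser import RobotFileParser

# The two rules quoted on the slide, as they would appear in robots.txt.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /manage/",
]

def build_parser(lines):
    """Parse robots.txt rules that are already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser

robots = build_parser(ROBOTS_LINES)

# "ExampleCrawler" and both paths are hypothetical, used only for illustration.
blocked = robots.can_fetch("ExampleCrawler", "http://blog.sohu.com/manage/settings")
allowed = robots.can_fetch("ExampleCrawler", "http://blog.sohu.com/index.html")
print(blocked, allowed)  # False True
```

A real crawler would typically fetch and cache the robots.txt file once per host and consult it before every request to that host.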


Features of a crawler (Cont’d)

  • Should provide:

    • Distributed

    • Scalable

    • Performance and efficiency

    • Quality

    • Freshness

    • Extensible


Architecture of a crawler

[Figure: crawler architecture diagram. Labeled components: www, DNS, Fetch, Parse, Content Seen?, Doc Fingerprint, URL Filter, Robots templates, Dup URL Elim, URL set, URL Frontier.]


Architecture of a crawler (Cont’d)

  • URL Frontier: contains the URLs yet to be fetched in the current crawl. Initially a seed set is stored in the URL Frontier, and the crawler begins by taking a URL from the seed set.

  • DNS: Domain Name System resolution. Looks up the IP address for a domain name.

  • Fetch: generally uses the HTTP protocol to retrieve the page at the URL.

  • Parse: the page is parsed. Text (along with images, videos, etc.) and links are extracted.
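
As an illustration of the Parse step, here is a minimal sketch that extracts visible text and link targets from an HTML string with Python's standard `html.parser`. The class name `LinkExtractor`, the helper `parse_page`, and the sample HTML are assumptions made for this example; a production crawler would need more careful handling of scripts, encodings, and malformed markup.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag and the non-empty text fragments."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)

def parse_page(html):
    """Return (text, links) extracted from an HTML string."""
    extractor = LinkExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.text_parts), extractor.links

# Hypothetical page content, for illustration only.
SAMPLE = (
    '<html><body><p>Hello</p>'
    '<a href="/wiki/About">About</a> '
    '<a href="http://example.com/">Ex</a></body></html>'
)
text, links = parse_page(SAMPLE)
print(links)  # ['/wiki/About', 'http://example.com/']
```

The extracted links may be relative (such as `/wiki/About`), so they still have to be resolved against the page URL before they can be fetched.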


Architecture of a crawler (Cont’d)

  • Content Seen?: tests whether a web page with the same content has already been seen at another URL. This requires a way to compute a fingerprint of a web page.

  • URL Filter:

    • Decides whether the extracted URL should be excluded from the frontier (e.g. because of robots.txt).

    • URLs should be normalized: a relative link is resolved against the URL of the page on which it appears.

      • en.wikipedia.org/wiki/Main_Page

      • <a href="/wiki/Wikipedia:General_disclaimer" title="Wikipedia:General disclaimer">Disclaimers</a>

  • Dup URL Elim: the URL is checked against the set of URLs already seen, and duplicates are eliminated.
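
The three checks above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the design of any particular crawler: the fingerprint is a plain SHA-256 hash over whitespace-normalized text, so it only detects exact duplicates, not near-duplicates; the `http://` scheme on the Wikipedia URL is assumed, since the slide omits it; and the names `fingerprint`, `normalize`, and `SeenFilter` are hypothetical.

```python
import hashlib
from urllib.parse import urljoin, urldefrag

def fingerprint(page_text):
    """Exact-match fingerprint: SHA-256 over whitespace-normalized text."""
    normalized = " ".join(page_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def normalize(base_url, href):
    """Resolve a possibly relative link against its page and drop any #fragment."""
    absolute = urljoin(base_url, href)
    return urldefrag(absolute).url

class SeenFilter:
    """Remember page fingerprints and URLs that were already encountered."""

    def __init__(self):
        self.seen_fingerprints = set()
        self.seen_urls = set()

    def content_seen(self, page_text):
        """Return True if identical content was seen before; record it otherwise."""
        fp = fingerprint(page_text)
        if fp in self.seen_fingerprints:
            return True
        self.seen_fingerprints.add(fp)
        return False

    def is_new_url(self, url):
        """Return True the first time a URL is offered, False afterwards."""
        if url in self.seen_urls:
            return False
        self.seen_urls.add(url)
        return True

# The relative link from the slide, resolved against the page it appears on.
page = "http://en.wikipedia.org/wiki/Main_Page"  # scheme assumed
resolved = normalize(page, "/wiki/Wikipedia:General_disclaimer")
print(resolved)  # http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer
```

In a large crawl the two sets would not fit in memory as plain Python sets; they stand in here for whatever disk-backed or distributed structure a real system uses.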


Architecture of a crawler (Cont’d)

  • Other issues:

    • Housekeeping tasks:

      • Log crawl progress statistics: URLs crawled, frontier size, etc. (Every few seconds)

      • Checkpointing: a snapshot of the crawler’s state (the URL frontier) is committed to disk. (Every few hours)

    • Priority of URLs in URL frontier:

      • Change rate.

      • Quality.

    • Politeness:

      • Avoid repeated fetch requests to a host within a short time span.

      • Otherwise the crawler may be blocked by the host.
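
To make the priority and politeness points concrete, here is a minimal sketch of a frontier that returns the highest-priority URL whose host has not been contacted too recently. The class name `PoliteFrontier`, the two-second delay, the priority scores, and the example hosts are all assumptions for illustration; the slides do not specify how priorities are computed or how long to wait. Time is passed in explicitly so the behavior is easy to follow.

```python
import heapq
from urllib.parse import urlsplit

class PoliteFrontier:
    """Priority queue of URLs with a minimum gap between requests to one host."""

    def __init__(self, min_delay_seconds=2.0):
        self.min_delay = min_delay_seconds
        self._heap = []          # entries: (-priority, insertion order, url)
        self._counter = 0        # tie-breaker that keeps insertion order
        self._next_allowed = {}  # host -> earliest time of the next fetch

    def add(self, url, priority=0.0):
        """Queue a URL; a higher priority (e.g. quality, change rate) pops first."""
        heapq.heappush(self._heap, (-priority, self._counter, url))
        self._counter += 1

    def pop_ready(self, now):
        """Return the best URL whose host may be contacted at `now`, else None."""
        deferred = []
        chosen = None
        while self._heap:
            entry = heapq.heappop(self._heap)
            host = urlsplit(entry[2]).netloc
            if self._next_allowed.get(host, 0.0) <= now:
                self._next_allowed[host] = now + self.min_delay
                chosen = entry[2]
                break
            deferred.append(entry)  # host is still cooling down
        for entry in deferred:
            heapq.heappush(self._heap, entry)
        return chosen

frontier = PoliteFrontier(min_delay_seconds=2.0)
frontier.add("http://a.example/1", priority=0.9)
frontier.add("http://a.example/2", priority=0.8)
frontier.add("http://b.example/1", priority=0.1)

first = frontier.pop_ready(now=0.0)   # a.example/1 (highest priority)
second = frontier.pop_ready(now=0.5)  # b.example/1 (a.example must wait)
third = frontier.pop_ready(now=1.0)   # None (only a.example is left, too soon)
fourth = frontier.pop_ready(now=2.0)  # a.example/2 (delay has elapsed)
```

Note how the lower-priority URL on `b.example` is fetched before the second `a.example` URL: politeness can override priority.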


Regular Expression

  • Usage:

    • Regular expressions provide a concise and flexible means for identifying strings of text of interest, such as particular characters, words or patterns of characters.

  • Today’s target:

    • Introduce the basic principle.

  • A tool to verify the regular expression: Regex Tester

    • http://www.dotnet2themax.com/blogs/fbalena/PermaLink,guid,13bce26d-7755-441e-92b3-1eb5f9e859f9.aspx


Regular Expression

  • Metacharacter

    • Similar to the wildcard in Windows, e.g.: *.doc

  • Target: Detect the email address


Regular Expression

  • \b: matches the beginning or end of a word (a word boundary).

    • E.g.: \bhi\b finds hi only as a whole word, not inside words such as him or history

  • \w: matches a letter, a digit, or an underscore.

  • .: matches any character except the newline

  • *: the element before * can be repeated any number of times (including zero)

    • \bhi\b.*\bLucy\b

  • +: content before + can be repeated one or more times

  • []: matches any one of the characters listed inside it

    • E.g.: \b[aeiou]+[a-zA-Z]*\b

  • {n}: repeat exactly n times

  • {n,}: repeat n or more times

  • {n,m}: repeat n to m times
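
The metacharacters above can be tried out with Python's standard `re` module. The sample strings below are made up for illustration; the patterns are the ones from the list, plus `\w` combined with the three `{}` quantifier forms.

```python
import re

# \bhi\b matches "hi" only as a whole word: not inside "this" or "history".
whole_word = re.findall(r"\bhi\b", "hi, this is history; say hi")
print(whole_word)  # ['hi', 'hi']

# .* lets any characters (except newline) appear between the two words.
greeting = re.search(r"\bhi\b.*\bLucy\b", "hi there, my name is Lucy")
print(greeting.group(0))  # hi there, my name is Lucy

# One or more lowercase vowels, then any letters: words that start with a vowel.
vowel_words = re.findall(r"\b[aeiou]+[a-zA-Z]*\b", "an open door is easy")
print(vowel_words)  # ['an', 'open', 'is', 'easy']

# The three quantifier forms applied to \w (letter, digit, or underscore).
sample = "a bb ccc dddd"
exactly_three = re.findall(r"\b\w{3}\b", sample)   # ['ccc']
two_or_more = re.findall(r"\b\w{2,}\b", sample)    # ['bb', 'ccc', 'dddd']
two_to_three = re.findall(r"\b\w{2,3}\b", sample)  # ['bb', 'ccc']
```

The patterns are written as raw strings (`r"..."`) so that Python does not interpret the backslashes before the regex engine sees them.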


Regular Expression

  • Target: Detect the email address

  • Specifications:

    • Form: A@B

    • A: any combination of English letters a to z, digits, or the characters . _ % + -

    • B: cse.cuhk.edu.hk or cuhk.edu.hk (English characters)

  • Answer:
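
The transcript does not preserve the answer. One pattern that satisfies the specification above, offered only as a sketch, is shown below. It reads the specification as "local part @ domain" and assumes that the domain is a sequence of dot-separated groups of English letters, as in the two examples given; it also uses a non-capturing group `(?:...)`, which the earlier slides do not introduce. The addresses in the sample text are invented.

```python
import re

# Local part: letters, digits, and . _ % + -   (one or more)
# Domain:     dot-separated groups of letters, with at least one dot
EMAIL = re.compile(r"\b[a-z0-9._%+-]+@[a-z]+(?:\.[a-z]+)+\b")

def find_emails(text):
    """Return every substring of `text` that matches the EMAIL pattern."""
    return EMAIL.findall(text)

# Hypothetical addresses, for illustration only.
sample = "Contact alice_01@cse.cuhk.edu.hk or bob.lee@cuhk.edu.hk, not bob@ or @cuhk."
found = find_emails(sample)
print(found)  # ['alice_01@cse.cuhk.edu.hk', 'bob.lee@cuhk.edu.hk']
```

This pattern is deliberately narrower than real email syntax: it accepts only lowercase letters, and it rejects domains that contain digits or hyphens.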


Appendix

  • Mercator Crawler:

    • http://mias.uiuc.edu/files/tutorials/mercator.pdf

  • Regular Expression tutorial:

    • http://www.regular-expressions.info/tutorial.html


