

Presentation Transcript


  1. Agenda • Overview of the project • Resources

  2. CS172 Project • First Phase: crawling • Second Phase: indexing, ranking

  3. Phase 1 Options • Web data • Need to come up with your own crawling strategy • Twitter data • Can use a third-party library for the Twitter Streaming API • Still need some web crawling

  4. Crawling • 1. getNext() from the Frontier (e.g., www.cs.ucr.edu, www.cs.ucr.edu/~vagelis) and download the contents of the page • 2. Parse the downloaded file to extract the links on the page • 3. Clean and normalize the extracted links • 4. Store the extracted links in the Frontier (addAll(List<URLs>))
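A minimal sketch of this loop in Java, assuming a queue-based frontier; the helper methods downloadPage, extractLinks, and cleanAndNormalize are hypothetical names here and correspond to the steps covered on the following slides:

```java
import java.net.URL;
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class Crawler {
    private final Queue<URL> frontier = new ArrayDeque<>();  // URLs waiting to be crawled
    private final Set<String> visited = new HashSet<>();     // avoid re-crawling duplicates

    public void crawl(URL seed) throws Exception {
        frontier.add(seed);
        while (!frontier.isEmpty()) {
            URL next = frontier.poll();                          // 1. getNext() from the Frontier
            if (!visited.add(next.toString())) continue;
            String html = downloadPage(next);                    // 1. download contents of the page
            List<String> rawLinks = extractLinks(html);          // 2. parse and extract links
            List<URL> links = cleanAndNormalize(rawLinks, next); // 3. clean and normalize
            frontier.addAll(links);                              // 4. store links in the Frontier
        }
    }

    // Hypothetical helpers, sketched on the following slides
    String downloadPage(URL url) throws Exception { return ""; }
    List<String> extractLinks(String html) { return Collections.emptyList(); }
    List<URL> cleanAndNormalize(List<String> links, URL base) { return Collections.emptyList(); }
}
```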

  5. 1. Download File Contents
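The slide's code does not appear in the transcript; a minimal sketch of downloading a page's contents with java.net.URL might look like this:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class PageDownloader {
    // Downloads the raw HTML of a page as a String using java.net.URL
    static String downloadPage(URL url) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }
}
```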

  6. 2. Parsing HTML to extract links • This is what you will see when you download a page; notice the HTML tags.

  7. 2. Parsing HTML file • Write your own parser. Some suggestions: parse the HTML file as XML; two parsing methods • SAX (Simple API for XML) • DOM (Document Object Model) • Or use an existing library • JSoup (http://jsoup.org/), which can also be used to download the page • HTML Parser (http://htmlparser.sourceforge.net/)
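To illustrate the library route, a small sketch using JSoup to download a page and pull out its links (assumes JSoup is on the classpath):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.ArrayList;
import java.util.List;

public class LinkExtractor {
    // Downloads a page with JSoup and extracts all href values from <a> tags
    static List<String> extractLinks(String pageUrl) throws Exception {
        Document doc = Jsoup.connect(pageUrl).get();   // JSoup can also download the page
        List<String> links = new ArrayList<>();
        for (Element a : doc.select("a[href]")) {
            links.add(a.attr("abs:href"));             // "abs:" resolves relative URLs against the page
        }
        return links;
    }
}
```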

  8. 2. Parsing HTML file • Things to think about • How do you handle malformed HTML? A browser can still display it, but how do you handle it?

  9. 3. Clean extracted URLs • Some URL entries found while crawling www.cs.ucr.edu: • /intranet/ • /inventthefuture.html • systems.engr.ucr.edu • news/e-newsletter.html • http://www.engr.ucr.edu/sendmail.html • http://ucrcmsdev.ucr.edu/oucampus/de.jsp?user=D01002&site=cmsengr&path=%2Findex.html • /faculty/ • / • /about/ • #main • http://www.pe.com/local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece?ssimg=532988#ssStory533104

  10. 3. Clean extracted URLs • What to avoid • Parse only http links (avoid ftp, https, or any other protocol) • Avoid duplicates • Bookmarks: #main – bookmarks should be stripped off • Self paths: / • Avoid downloading PDFs or images • /news/GraphenePublicationsIndex.pdf • It's OK to download them, but you cannot parse them • Take care of invalid characters in URLs • Space: www.cs.ucr.edu/vagelis hristidis • Ampersand: www.cs.ucr.edu/vagelis&hristidis • These characters should be encoded, else you will get a MalformedURLException
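A rough sketch of these filters, assuming the links have already been resolved to absolute URLs (normalization is covered on the next slide); the file-extension list is illustrative, not exhaustive:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlCleaner {
    // Returns null if the URL should be skipped, otherwise the cleaned URL
    static URL clean(String raw) {
        // Strip bookmarks such as "#main"
        int hash = raw.indexOf('#');
        if (hash >= 0) raw = raw.substring(0, hash);
        if (raw.isEmpty() || raw.equals("/")) return null;      // self paths
        // Skip files we cannot parse (PDFs, images)
        String lower = raw.toLowerCase();
        if (lower.endsWith(".pdf") || lower.endsWith(".jpg")
                || lower.endsWith(".png") || lower.endsWith(".gif")) return null;
        try {
            URL url = new URL(raw);
            if (!url.getProtocol().equals("http")) return null; // keep only http links
            return url;
        } catch (MalformedURLException e) {
            return null;                                        // malformed URL: skip it
        }
    }
}
```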

  11. Normalize Links Found on the Page • Relative URLs: these URLs have no host address • E.g., while crawling www.cs.ucr.edu/faculty you find URLs such as: • Case 1: /find_people.php • A “/” at the beginning means the path starts from the root of the host (www.cs.ucr.edu in this case) • Case 2: all • No “/” means the path is relative to the current path • Normalize them (respectively) to • www.cs.ucr.edu/find_people.php • www.cs.ucr.edu/faculty/all

  12. Clean extracted URLs • Different parts of the URL: • http://www.pe.com:8080/local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece?ssimg=532988#ssStory533 • Protocol: http • Port: 8080 • Host: www.pe.com • Path: /local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece • Query: ssimg=532988 • Bookmark: ssStory533

  13. java.net.URL Has methods that can separate different parts of the URL. getProtocol: http getHost: www.pe.com getPort: -1 getPath: /local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece getQuery: ssimg=532988 getFile: /local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece?ssimg=532988
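For example (using a shortened form of the slide's URL to keep the lines readable):

```java
import java.net.URL;

public class UrlParts {
    public static void main(String[] args) throws Exception {
        // Shortened version of the slide's example URL
        URL url = new URL("http://www.pe.com/local-news/article.ece?ssimg=532988#ssStory533104");
        System.out.println(url.getProtocol()); // http
        System.out.println(url.getHost());     // www.pe.com
        System.out.println(url.getPort());     // -1 (no explicit port in this URL)
        System.out.println(url.getPath());     // /local-news/article.ece
        System.out.println(url.getQuery());    // ssimg=532988
        System.out.println(url.getFile());     // path + "?" + query
        System.out.println(url.getRef());      // ssStory533104 (the bookmark)
    }
}
```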

  14. Normalizing with java.net.URL • You can normalize URLs with simple string manipulation and methods from the java.net.URL class. • Here is the snippet for normalizing “Case 1” root-relative URLs
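A minimal sketch of what such a snippet might look like (the method name normalizeRootRelative is illustrative), using java.net.URL and the page currently being crawled as context:

```java
import java.net.URL;

public class UrlNormalizer {
    // Case 1: a link starting with "/" is relative to the root of the current host
    static URL normalizeRootRelative(URL currentPage, String link) throws Exception {
        if (link.startsWith("/")) {
            return new URL(currentPage.getProtocol(), currentPage.getHost(),
                           currentPage.getPort(), link);
        }
        // Case 2: no leading "/" means relative to the current path;
        // the URL(URL context, String spec) constructor resolves this for you
        return new URL(currentPage, link);
    }
}
```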

  15. Crawler Ethics • Some websites don’t want crawlers swarming all over them. • Why? • Increases load on the server • Private websites • Dynamic websites • …

  16. Crawler Ethics • How does the website tell you (the crawler) whether and what is off limits? • Two options • Site-wide restrictions: robots.txt • Webpage-specific restrictions: meta tag

  17. Crawler Ethics: robots.txt • A file called “robots.txt” in the root directory of the website • Example: http://www.about.com/robots.txt • Format: User-Agent: <crawler name> Disallow: <don’t-follow paths> Allow: <can-follow paths>
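A small hypothetical robots.txt, just to make the format concrete (the paths are made up):

```
User-Agent: *
Disallow: /private/
Disallow: /intranet/
Allow: /news/
```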

  18. Crawler Ethics: robots.txt • What should you do? • Before starting on a new website: • Check if robots.txt exists. • If it does, download it and parse it for all inclusions and exclusions for the “generic crawler”, i.e., User-Agent: * • Don’t crawl anything in the exclusion list, including sub-directories
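A simplified sketch of such a check; it only handles the generic User-Agent: * section and prefix-matches Disallow paths, so a real parser would need more care:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class RobotsChecker {
    // Collects Disallow paths from the "User-agent: *" section of the site's robots.txt
    static List<String> disallowedPaths(String host) {
        List<String> disallowed = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new URL("http://" + host + "/robots.txt").openStream()))) {
            boolean inGenericSection = false;
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.toLowerCase().startsWith("user-agent:")) {
                    inGenericSection = line.substring(11).trim().equals("*");
                } else if (inGenericSection && line.toLowerCase().startsWith("disallow:")) {
                    String path = line.substring(9).trim();
                    if (!path.isEmpty()) disallowed.add(path);
                }
            }
        } catch (Exception e) {
            // No robots.txt (or it is unreadable): treat nothing as disallowed
        }
        return disallowed;
    }

    // A URL is allowed if its path does not start with any disallowed prefix
    static boolean isAllowed(URL url, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (url.getPath().startsWith(prefix)) return false;
        }
        return true;
    }
}
```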

  19. Crawler Ethics: Website-Specific Meta Tags • Some webpages have one of the following meta-tag entries: • <META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW"> • <META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW"> • <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"> • Options: • INDEX or NOINDEX • FOLLOW or NOFOLLOW
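A small sketch of reading these meta tags with JSoup, assuming the page has already been parsed into a Document:

```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MetaRobots {
    // Reads the robots meta tag (if present) and reports whether the page
    // may be indexed and whether its links may be followed
    static boolean[] indexAndFollow(Document doc) {
        boolean index = true, follow = true;   // defaults when no robots meta tag is present
        for (Element meta : doc.select("meta")) {
            if (meta.attr("name").equalsIgnoreCase("robots")) {
                String content = meta.attr("content").toUpperCase();
                if (content.contains("NOINDEX")) index = false;
                if (content.contains("NOFOLLOW")) follow = false;
            }
        }
        return new boolean[] { index, follow };
    }
}
```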

  20. Twitter data collection • Collect through the Twitter Streaming API • https://dev.twitter.com/docs/platform-objects/tweets, where you can check the data schema. • Rate limit: you will get up to 1% of the whole Twitter traffic, so you can get about 4.3M tweets per day (about 2 GB) • You need a Twitter account for that. Check https://dev.twitter.com/

  21. Third-party library: Twitter4j for Java • You can find support for other languages as well. • Well documented, with code examples, e.g., http://twitter4j.org/en/code-examples.html

  22. Important Fields • You should save at least the following fields: • Text • Timestamp • Geolocation • User of the tweet • Links
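A minimal sketch of collecting these fields with Twitter4j's streaming API; it assumes your Twitter credentials are configured (e.g., in twitter4j.properties) and leaves storage as a TODO:

```java
import twitter4j.Status;
import twitter4j.StatusAdapter;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;
import twitter4j.URLEntity;

public class TweetCollector {
    public static void main(String[] args) {
        // Credentials (consumer key/secret, access token/secret) come from twitter4j.properties
        TwitterStream stream = new TwitterStreamFactory().getInstance();
        stream.addListener(new StatusAdapter() {
            @Override
            public void onStatus(Status status) {
                // The fields the project asks you to save
                String text = status.getText();
                java.util.Date timestamp = status.getCreatedAt();
                twitter4j.GeoLocation geo = status.getGeoLocation();   // may be null
                String user = status.getUser().getScreenName();
                for (URLEntity url : status.getURLEntities()) {
                    String link = url.getExpandedURL();  // crawl these later, in a separate process
                }
                // TODO: write the fields to your data store
            }
        });
        stream.sample();   // ~1% random sample of the public stream
    }
}
```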

  23. Crawl links in Tweets • Tweets may contain links. • They may contain useful information, e.g., links to news articles. • After collecting the tweets, use another process to crawl the links. • Crawling is slower, so you may not want to crawl a link right after you get the tweet.
