
Scalable Web Crawling and Basic Transactions



Presentation Transcript


  1. Scalable Web Crawling and Basic Transactions Zachary G. Ives University of Pennsylvania CIS 455 / 555 – Internet and Web Systems June 6, 2014

  2. Administrivia • Emailed list of project partners due Friday • … For those without 4-person groups, I will try to assign / merge groups over the weekend • … This might result in breaking up some 3-person groups

  3. Mercator: Scalable Web Crawler • Expands a “URL frontier” • Avoids re-crawling same URLs • Also considers whether a document has been seen before • Every document has signature/checksum info computed as it’s crawled

  4. Mercator Web Crawler • The crawl loop: • Dequeue a frontier URL • Fetch the document • Record it into the RewindInputStream (RIS) • Check against fingerprints to verify it's new • Extract hyperlinks • Filter unwanted links • Check if each URL is repeated (compare its hash) • Enqueue new URLs
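As a concrete picture, here is a minimal single-threaded sketch of that loop in Java. All class and method names (CrawlLoop, fetch, extractLinks, wanted) are hypothetical stand-ins, not Mercator's actual code, and the real system runs this across many threads:

    import java.security.MessageDigest;
    import java.util.*;

    // Hypothetical sketch of the crawl loop above; names are illustrative,
    // not taken from the actual Mercator implementation.
    class CrawlLoop {
        private final Deque<String> frontier = new ArrayDeque<>();   // URL frontier
        private final Set<String> seenUrlHashes = new HashSet<>();   // URLs already enqueued
        private final Set<String> seenDocSigs = new HashSet<>();     // content fingerprints

        void step() throws Exception {
            String url = frontier.poll();                // 1. dequeue frontier URL
            if (url == null) return;
            byte[] doc = fetch(url);                     // 2. fetch document
            if (!seenDocSigs.add(md5Hex(doc))) return;   // 3-4. fingerprint check: seen before?
            for (String link : extractLinks(doc)) {      // 5. extract hyperlinks
                if (!wanted(link)) continue;             // 6. filter unwanted links
                if (seenUrlHashes.add(md5Hex(link.getBytes())))  // 7. URL-repeat check via hash
                    frontier.add(link);                  // 8. enqueue new URL
            }
        }

        // Stubs standing in for the real fetcher, link extractor, and filter:
        byte[] fetch(String url) { return new byte[0]; }
        List<String> extractLinks(byte[] doc) { return new ArrayList<>(); }
        boolean wanted(String link) { return true; }

        static String md5Hex(byte[] data) throws Exception {
            StringBuilder hex = new StringBuilder();
            for (byte b : MessageDigest.getInstance("MD5").digest(data))
                hex.append(String.format("%02x", b));
            return hex.toString();
        }
    }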

  5. Mercator’s Polite Frontier Queues • Tries to go beyond breadth-first approach – want to have only one crawler thread per server • Distributed URL frontier queue: • One subqueue per worker thread • The worker thread is determined by hashing the hostname of the URL • Thus, only one outstanding request per web server
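A sketch of the hostname-to-subqueue assignment described above (the class name and the use of String.hashCode are illustrative; Mercator's actual hash function may differ):

    import java.net.URI;

    class FrontierRouter {
        // Assign each URL to a worker subqueue by hashing its hostname, so
        // that all requests to one web server funnel through a single thread.
        static int subqueueFor(String url, int numWorkers) throws Exception {
            String host = new URI(url).getHost();               // e.g. "www.upenn.edu"
            return Math.floorMod(host.hashCode(), numWorkers);  // same host -> same subqueue
        }
    }

Because every URL for a given host maps to the same subqueue, and each subqueue is drained by a single thread, there is never more than one outstanding request per server.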

  6. Mercator’s HTTP Fetcher • First, needs to ensure robots.txt is followed • Caches the contents of robots.txt for the various web sites as it crawls them • Designed to be extensible to other protocols • The authors had to write their own HTTP requestor in Java – the Java version of the day didn’t have timeouts • Today, you can use setSoTimeout() • You can also use Java non-blocking I/O if you wish: • http://www.owlmountain.com/tutorials/NonBlockingIo.htm • But Mercator uses multiple threads and synchronous I/O
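A minimal sketch of a fetch with a socket read timeout via setSoTimeout(), the modern fix for the hang the Mercator authors had to work around (HTTP/1.0 only; no redirects, chunked encoding, or robots.txt handling; names are made up):

    import java.io.*;
    import java.net.Socket;

    class TimeoutFetch {
        // Fetch a path from a host with a read timeout, so a stalled server
        // can't hang the crawler thread forever. A timeout surfaces as a
        // SocketTimeoutException (a subclass of IOException).
        static String fetch(String host, String path) throws IOException {
            try (Socket sock = new Socket(host, 80)) {
                sock.setSoTimeout(5000);    // give up on reads after 5 seconds
                PrintWriter out = new PrintWriter(sock.getOutputStream(), true);
                out.print("GET " + path + " HTTP/1.0\r\nHost: " + host + "\r\n\r\n");
                out.flush();
                BufferedReader in = new BufferedReader(
                    new InputStreamReader(sock.getInputStream()));
                StringBuilder resp = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) resp.append(line).append('\n');
                return resp.toString();
            }
        }
    }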

  7. Other Caveats • Infinitely long URL names (good way to get a buffer overflow!) • Aliased host names • Alternative paths to the same host • Can catch most of these with signatures of document data (e.g., MD5) • Crawler traps (e.g., CGI scripts that link to themselves using a different name) • May need to have a way for human to override certain URL paths – see Section 5 of paper
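Two of these caveats lend themselves to simple guards, sketched below; the length cap and class names are made up, and DNS resolution is just one plausible way to collapse aliased host names (content signatures, as in the earlier sketch, handle alternative paths to the same document):

    import java.net.InetAddress;
    import java.net.URI;

    class UrlGuards {
        static final int MAX_URL_LENGTH = 2048;  // arbitrary cap; defends against
                                                 // unbounded URL names

        // Reject pathologically long URLs before buffering them anywhere.
        static boolean sane(String url) {
            return url.length() <= MAX_URL_LENGTH;
        }

        // Resolve a hostname to its IP so that aliased host names (two names,
        // one server) collapse to a single identity for politeness checks.
        static String canonicalHost(String url) throws Exception {
            String host = new URI(url).getHost();
            return InetAddress.getByName(host).getHostAddress();
        }
    }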

  8. Mercator Document Statistics (60M pages) • [Figure: histogram of document sizes]
     PAGE TYPE      PERCENT
     text/html      69.2%
     image/gif      17.9%
     image/jpeg      8.1%
     text/plain      1.5%
     pdf             0.9%
     audio           0.4%
     zip             0.4%
     postscript      0.3%
     other           1.4%

  9. Further Considerations • May want to prioritize certain pages as being most worth crawling • Focused crawling tries to prioritize based on relevance • May need to refresh certain pages more often

  10. Web Search Summarized • Two important factors: • An indexing and ranking scheme that allows the most relevant documents to be ranked highest • A crawler that manages to (1) be well-mannered, (2) avoid traps, and (3) scale • We’ll be using Pastry to distribute the work of crawling and to distribute the data (what Google calls “barrels”)

  11. We Need More Than Synchronization What needs to happen when you… • Click on “purchase” on Amazon? Suppose you purchased by credit card? • Use online bill-paying services from your bank? • Place a bid in an eBay-like auction system? • Order music from iTunes? What if your connection drops in the middle of downloading? Is this more than a case of making a simple Web Service (-like) call?

  12. Transactions Are a Means of Handling Failures • There are many (especially, financial) applications where we want to create atomic operations that either commit or roll back • This is one of the most basic services provided by database management systems, but we want to do it in a broader sense • Part of “ACID” semantics…

  13. ACID Semantics • Atomicity: operations are atomic, either committing or aborting as a single entity • Consistency: the state of the data is internally consistent • Isolation: all operations act as if they were run by themselves • Durability: all writes stay persistent!

  14. A Problem Confronted by eBay • eBay wants to sell an item to: • The highest bidder, once the auction is over, or • The person who’s first to click “Buy It Now!” • But: • What if the bidder doesn’t have the cash? • A solution: • Record the item as sold • Validate the PayPal or credit card info with a 3rd party • If not valid, discard this bidder and resume in prior state
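A sketch of that recipe in Java; every name here is hypothetical, and a production version would wrap the whole sequence in a transaction (the subject of the next slides):

    // Illustrative sketch of the slide's recipe; all names are made up.
    class AuctionCheckout {
        static class Item { String soldTo; boolean open = true; }

        interface PaymentValidator { boolean validate(String paymentInfo); }

        // Tentatively record the sale, validate payment with a 3rd party,
        // and roll back to the prior state if validation fails.
        static void sell(Item item, String bidder, String paymentInfo,
                         PaymentValidator validator) {
            item.soldTo = bidder;                    // 1. record the item as sold
            item.open = false;
            if (!validator.validate(paymentInfo)) {  // 2. validate PayPal / card info
                item.soldTo = null;                  // 3. discard this bidder and
                item.open = true;                    //    resume in the prior state
            }
        }
    }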

  15. “No Payment” Isn’t the Only Source of Failure • Suppose we start to transfer the money, but a server goes down partway through the sequence:
      Purchase:
        sb = Seller.bal
        bb = Buyer.bal
        Write Buyer.bal = bb - $100
        Write Item.sellTo = Buyer
        -- CRASH! --
        Write Seller.bal = sb + $100   (never executed: the buyer was charged, the seller never paid)

  16. Providing Atomicity and Consistency • Database systems provide transactions with the ability to abort a transaction upon some failure condition • Based on transaction logging – record all operations so they can be undone as necessary • Database systems also use the log to perform recovery from crashes • Undo all of the steps of any partially complete (uncommitted) transaction • Redo the steps of committed transactions in their entirety • This is part of a recovery protocol called ARIES
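A toy in-memory undo log makes the idea concrete. This is only a sketch of the undo half of the story, not ARIES: a real DBMS writes log records to stable storage before the data pages ("write-ahead logging") and also uses the log to redo committed work during crash recovery.

    import java.util.*;

    // Toy transaction with an undo log: record each overwritten value
    // before writing, so abort() can restore the prior state.
    class ToyTransaction {
        private final Map<String, Integer> db;       // the "database"
        private final Deque<Map.Entry<String, Integer>> undoLog = new ArrayDeque<>();

        ToyTransaction(Map<String, Integer> db) { this.db = db; }

        void write(String key, int newValue) {
            // Log the old value first (missing keys treated as 0 for simplicity).
            undoLog.push(Map.entry(key, db.getOrDefault(key, 0)));
            db.put(key, newValue);                    // then apply the write
        }

        void abort() {                                // undo in reverse order
            while (!undoLog.isEmpty()) {
                Map.Entry<String, Integer> e = undoLog.pop();
                db.put(e.getKey(), e.getValue());
            }
        }

        void commit() { undoLog.clear(); }            // nothing left to undo
    }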

  17. The Need for Isolation • Suppose eBay seller S has a bank account that we’re depositing money into as people buy:
      S = Accounts.Get(1234)
      Write S.bal = S.bal + $50
    • What if two purchases occur simultaneously, from two different servers on different continents?

  18. Concurrent Deposits • This update code is represented as a sequence of read and write operations on “data items” (which for now should be thought of as individual accounts), where S is the data item representing the seller’s account #1234:
      Deposit 1:                Deposit 2:
        Read(S.bal)               Read(S.bal)
        S.bal := S.bal + $50      S.bal := S.bal + $10
        Write(S.bal)              Write(S.bal)
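The same read-modify-write hazard in Java (a hypothetical unsynchronized Account class; the two threads can interleave exactly as on the next slide):

    // Unsynchronized account: two threads running deposit() concurrently
    // can both Read the same balance before either Writes, losing a deposit.
    class Account {
        private int bal = 0;

        void deposit(int amount) {
            int s = bal;          // Read(S.bal)
            s = s + amount;       // S.bal := S.bal + amount
            bal = s;              // Write(S.bal)
        }

        public static void main(String[] args) throws InterruptedException {
            Account acct = new Account();
            Thread t1 = new Thread(() -> acct.deposit(50));  // Deposit 1: +$50
            Thread t2 = new Thread(() -> acct.deposit(10));  // Deposit 2: +$10
            t1.start(); t2.start();
            t1.join(); t2.join();
            System.out.println(acct.bal);  // usually 60, but can be 50 or 10
        }
    }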

  19. A “Bad” Concurrent Execution • Only one action (e.g. a read or a write) can actually happen at a time for a given database, and we can interleave deposit operations in many ways • In this bad interleaving (time runs downward), both deposits read the old balance before either writes, so Deposit 1’s update is lost:
      Deposit 1: Read(S.bal)
      Deposit 2: Read(S.bal)
      Deposit 1: S.bal := S.bal + $50
      Deposit 2: S.bal := S.bal + $10
      Deposit 1: Write(S.bal)
      Deposit 2: Write(S.bal)      ← BAD! overwrites Deposit 1’s write; the $50 is lost

  20. A “Good” Execution • The previous execution would have been fine if the accounts were different (i.e. one were S and one were T), i.e., if the transactions were independent • The following execution is a serial execution, which runs one transaction after the other (time runs downward):
      Deposit 1: Read(S.bal)
      Deposit 1: S.bal := S.bal + $50
      Deposit 1: Write(S.bal)
      Deposit 2: Read(S.bal)
      Deposit 2: S.bal := S.bal + $10
      Deposit 2: Write(S.bal)      ← GOOD! both deposits take effect

  21. Good Executions • An execution is “good” if it is serial (transactions are executed atomically and consecutively) or serializable (i.e. equivalent to some serial execution) • The interleaving below touches two different data items, so it is equivalent to executing Deposit 1 then Deposit 3, or vice versa • Why allow this instead of insisting on serial execution? Concurrency: neither transaction has to wait for the other
      Deposit 1: Read(S.bal)
      Deposit 3: Read(T.bal)
      Deposit 1: S.bal := S.bal + $50
      Deposit 3: T.bal := T.bal + $10
      Deposit 1: Write(S.bal)
      Deposit 3: Write(T.bal)

  22. Concurrency Control • A means of ensuring that transactions are serializable • There are many methods, of which we’ll see one • Lock-based concurrency control (2-phase locking) • Optimistic concurrency control (no locks – based on timestamps) • Multiversion CC • …
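As a preview of the lock-based approach, here is a minimal two-phase-locking sketch (illustrative only; a real lock manager adds shared vs. exclusive lock modes, deadlock handling, and typically holds locks until commit):

    import java.util.concurrent.locks.ReentrantLock;

    // Minimal two-phase locking: acquire every lock the transaction needs
    // (growing phase) before releasing any (shrinking phase).
    class TwoPhaseLocking {
        static class Account {
            final ReentrantLock lock = new ReentrantLock();
            int bal;
            Account(int bal) { this.bal = bal; }
        }

        static void transfer(Account from, Account to, int amount) {
            // Growing phase: acquire all locks. (Locking accounts in a fixed
            // global order would avoid deadlock; omitted here for brevity.)
            from.lock.lock();
            to.lock.lock();
            try {
                from.bal -= amount;
                to.bal   += amount;
            } finally {
                // Shrinking phase: release locks; no further acquisitions.
                to.lock.unlock();
                from.lock.unlock();
            }
        }
    }

The two-phase property (never acquire a lock after releasing one) is what guarantees the resulting executions are serializable.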
