
Robust Hyperlinks


Presentation Transcript


  1. Robust Hyperlinks Thomas A. Phelps Robert Wilensky

  2. Problem: Broken Links
  • Links dangle because the resource has been deleted, renamed, moved, or otherwise changed.
  • Proposed solutions:
    • additional naming schemes (URNs, handles, Persistent Uniform Resource Locators (PURLs), or Common Names)
    • monitoring and notification to ensure referential integrity (e.g., Ingham et al. (1996), Mind-it, Macskassy and Shklar (1997), Francis et al. (1995))
  • None have been widely adopted, probably because they depend on administrative buy-in.
  • Archiving the web (Alexa) would be a solution
    • if you believe a persistent complete archive scales
    • and are comfortable with all your errors archived forever
    • and have no data behind a firewall.

  3. Proposed Solution: “Robust Hyperlinks”
  • Design links so that, even when the target moves (and mutates), one can still locate it with high probability.
  • Note that the burden is on the link creator, so that
    • no administrative buy-in is required
    • links can be made robust on a piecemeal basis.

  4. Requirements for Robust Hyperlinks
  • Provide high likelihood of successful dereferencing when an item is moved, but largely unchanged.
  • Performance should degrade gracefully as document content changes from its state at the time the hyperlink was created.
  • Should not impose a performance penalty when not used.
  • Storage required for a robust hyperlink must be trivially small
    • so that it is practical to make all URLs robust.
  • Implementation in clients or via proxies should be straightforward so as to encourage widespread adoption.
  • Should be largely non-interfering with clients and services that do not support them.
  • Making a hyperlink robust should be cheap and fully automated.
    • An author should be able to point to a hyperlink or site, and have it automatically become robust.

  5. General Idea: Enhance URLs with “signatures”
  • Add to a URL a “signature”: a small piece of document content.
  • When “traditional” (i.e., address-based) dereferencing fails, do “signature-based” (i.e., content-based) dereferencing:
    • Pass the signature to some search service, and hope that the target will be prominent among a very small result set.
  • Two issues:
    • Computing small, yet effective signatures
    • Adding them innocuously to hyperlinks

  6. Computing Small, Effective Signatures
  • “Lexical” signatures: the top n words of a document according to the TF-IDF measure.
  • Almost all the desiderata are obviously met.
  • Question is, how big a signature is needed to locate a document more or less uniquely on the Web?
  • Inktomi says there are approximately 1 billion web pages now.
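The TF-IDF selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `corpus_df` and `corpus_size` are hypothetical stand-ins for document-frequency statistics that a real system would obtain from a web-scale index.

```python
import math
import re
from collections import Counter

def lexical_signature(doc_text, corpus_df, corpus_size, n=5):
    """Pick the top-n words of a document by TF-IDF.

    corpus_df maps a term to the number of corpus documents containing
    it; corpus_size is the total number of documents in the corpus.
    """
    words = re.findall(r"[a-zA-Z]+", doc_text.lower())
    tf = Counter(words)

    def tfidf(word):
        # Terms never seen in the corpus get document frequency 1,
        # i.e., they are treated as maximally rare.
        df = corpus_df.get(word, 1)
        return tf[word] * math.log(corpus_size / df)

    return sorted(tf, key=tfidf, reverse=True)[:n]
```

Rare terms (including one-off words and misspellings) score highest, which is exactly why the example signatures later in the deck look so odd.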

  7. Answer: 5 words!
  • I.e., a signature of 5 words will, in most cases, cause search engines to return the target document within the top few hits.
  • Actually, a smaller signature would probably suffice just to locate exact matches, but the extra length provides robustness and room for growth.

  8. Some Examples
  • Signature for Randy Katz’s home page is
    • “Californa ISRG Culler rimmed gaunt”
  • Here is what happens when we feed this signature to HotBot:

  9. Another Example
  • Signature for Endeavour home page is
    • “amplifies Endeavour leverages Charting Expedition”
  • Here is what happens when we feed this signature to Google:

  10. Examples: Report rank in result set of original URL

  11. Examples (cont’d)

  12. Examples (cont’d)
  • In most cases, only one or two documents are returned by the more “stringent” searches.

  13. Content-based Dereferencing Strategies
  • Query a set of engines, and return the top few results from each one.
    • In our sample set, each hyperlink is successfully dereferenced by this strategy.
  • Make “stringent” queries to one or more engines; if that fails, make progressively less stringent queries.
    • Performing the Google query, and then performing the AltaVista query if Google fails, locates the desired reference in all but one case.
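The fallback strategy above can be sketched as follows. This is only one plausible reading of "progressively less stringent" (here: requiring fewer signature terms per query); `fetch` and `search` are placeholder callables standing in for a real HTTP client and real search-engine interfaces.

```python
def dereference(url, signature, engines, fetch, search):
    """Address-based dereferencing first; on failure, fall back to
    progressively less stringent content-based queries.

    fetch(url) returns the page or None; search(engine, terms)
    returns a ranked list of result URLs.
    """
    if fetch(url) is not None:
        return url  # traditional dereferencing succeeded

    for engine in engines:
        terms = list(signature)
        while terms:
            results = search(engine, terms)
            if results:
                # Hope the target is prominent in a small result set.
                return results[0]
            terms.pop()  # relax the query by dropping a term
    return None
```

A real agent would also present the small result set to the user rather than blindly taking the top hit.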

  14. Why Does This Work?
  • If there are 500,000 distinct terms on the Web, then the number of distinct combinations of 5 terms is greater than 3×10^28.
  • If the web is populated by documents whose most characteristic terms are uniformly drawn at random, the probability that more than one document matches a set of 5 characteristic terms is very small.
  • There is lots of room for the randomness assumption to be off:
    • Many documents contain very infrequently used words.
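A quick back-of-the-envelope check of these numbers (note that 500,000^5 counts ordered 5-term sequences; unordered combinations would be about 120 times fewer, still vastly more than the number of pages):

```python
# Back-of-the-envelope check of the slide's numbers.
vocab = 500_000        # assumed distinct terms on the Web
pages = 1_000_000_000  # ~1 billion pages (Inktomi's estimate)

sequences = vocab ** 5           # ordered 5-term signatures
print(sequences > 3 * 10**28)    # True (3.125e28)

# Under the uniform-random model, the chance that any particular
# 5-term signature is shared by a second page is vanishingly small:
print(pages / sequences)         # ~3.2e-20
```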

  15. Augmenting URLs with Signatures
  • Robust URLs (incompatible)
    • Make a new syntax, à la XPointer.
  • Robust URLs (mostly compatible)
    • http://www.something.dom/a/b/c?lexical-signature="w1+w2+w3+w4+w5"
    • Aware agents (client or proxy) strip the signature before traditional dereferencing.
    • Turns out widely used servers mostly ignore the extra query string:
      • true for all of our examples,
      • for Apache (50% of the market), Microsoft Internet Information Server (24%), and Netscape Enterprise (7%),
      • and for many cgi-bin scripts.
  • Robust Link Elements
    • <a href="http://www.something.dom/a/b/c" lexical-signature="w1+w2+w3+w4+w5">click here</a>
    • Completely transparent to unaware clients.
    • Not amenable to a proxy agent; can’t easily pass the signature around.
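What an aware agent does with the mostly-compatible form can be sketched with Python's standard `urllib.parse`; this is an illustrative sketch, and it tolerates (but does not require) the quotes shown around the signature in the slide.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def split_robust_url(url):
    """Separate a mostly-compatible robust URL into the plain URL and
    its lexical signature, as an aware client or proxy would do before
    traditional dereferencing."""
    parts = urlsplit(url)
    kept, signature = [], None
    for key, value in parse_qsl(parts.query):
        if key == "lexical-signature":
            # '+' decodes to spaces; strip optional surrounding quotes.
            signature = value.strip('"').split(" ")
        else:
            kept.append((key, value))
    plain = urlunsplit(parts._replace(query=urlencode(kept)))
    return plain, signature
```

An unaware server simply sees (and, per the survey above, mostly ignores) the unexpected query parameter.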

  16. Support for Robust Hyperlinks
  • Ideally, support in the browser.
    • Current version of MVD supports robust URLs:
      • generates signatures and “robust reports”,
      • all bookmarks are automatically made robust,
      • does signature-based dereferencing when traditional dereferencing fails.
  • Robust Proxy Module
    • Strips signatures from robust URLs.
    • If the server returns error code 404, engages in signature-based dereferencing.
    • Can also sign unsigned URLs and redirect to the client, so the robust URL can be bookmarked, etc.
  • Robust Proxy Service
    • http://www.myproxyserver.dom/cgi-bin?url="http://www.something.dom/a/b/c"&lex-signature=w1+w2+w3+w4+w5
  • All agents can maintain a URL remapping list.
  • Would be nice if search engines supported robustness in various ways.
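The URL remapping list mentioned above can be as simple as the following hypothetical sketch: once signature-based dereferencing re-finds a moved page, the agent remembers the new location so later dereferences skip the search step entirely.

```python
class RemapCache:
    """Per-agent URL remapping list (illustrative sketch)."""

    def __init__(self):
        self._moved = {}

    def resolve(self, url):
        # Follow a recorded move; otherwise return the URL unchanged.
        return self._moved.get(url, url)

    def record(self, old_url, new_url):
        # Called after signature-based dereferencing succeeds.
        self._moved[old_url] = new_url
```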

  17. Limitations
  • Non-indexed documents
    • behind firewalls, cgi-bin scripts
    • non-indexed data types
  • Non-textual resources
  • Resources with highly variable content
  • Duplicates
  • Variation in search engine performance

  18. Extensions
  • Signature Creation
    • Examine signatures upon creation; improve them.
    • Incrementally create signatures to deal with changing content.
  • Signature Variation
    • To improve robustness, preclude:
      • all signature terms occurring in the same sentence
      • many words that appear to be misspellings (hard)
      • many words that occur only once in a document.
  • Signature Dereferencing Strategies
    • Compute signatures of documents in the result set and match those.
  • Robust Hyperlink Agents
    • Perform signature-based dereferencing under some circumstances, even if traditional dereferencing succeeds.
    • Apply to other tasks, e.g., plagiarism detection.
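Two of the signature-variation heuristics above can be sketched directly; these are crude illustrations (substring matching, naive sentence splitting), not the authors' implementation.

```python
import re
from collections import Counter

def drop_singletons(doc_text, candidates):
    """Prefer signature terms that occur more than once in the
    document, so a single edit cannot remove them."""
    counts = Counter(re.findall(r"[a-z]+", doc_text.lower()))
    return [t for t in candidates if counts[t.lower()] > 1]

def all_in_one_sentence(doc_text, signature):
    """Flag a signature whose terms all occur in the same sentence;
    deleting that one sentence would destroy the whole signature."""
    sentences = re.split(r"[.!?]", doc_text.lower())
    return any(all(t.lower() in s for t in signature) for s in sentences)
```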

  19. Other Robust Reference Ideas
  • Robust Intra-document References
    • use content of local context plus tree location
  • Robust hyperlinks work by building on top of other Internet resources (e.g., search engines). What other types of services might there be? (Your idea here.)
