
Web Data Management


lauren

Presentation Transcript


  1. Web Data Management COSC 4806

  2. Introduction • The ‘world wide web’ • a vast, widely distributed collection of semi-structured multimedia documents • heterogeneous collection of documents • documents in the form of web pages • documents connected via hyperlinks

  3. World Wide Web • The web is growing rapidly • Business organizations increasingly presenting information on the Web • ‘Business on the highway’ • Myriad of raw data to be processed for information

  4. World Wide Web • The web is a fast growing, distributed & non-administered global information resource • WWW allows access to text, images, video, sound and graphical data • Ever-increasing number of businesses building web servers • A chaotic environment to locate information of interest • Lost in hyperspace syndrome

  5. World Wide Web • Characteristics of the WWW: • it’s a set of directed graphs • data is heterogeneous, self-describing & schema less • unstructured, deeply nested information • no central authority for information management • dynamic information vs. static information • web information discovery – search engines

  6. World Wide Web • Rapid growth of web: • In 1994, the WWW grew by 1,758% • June 1993 - 130 • June 1994 - 1,265 • Dec. 1994 - 11,576 • April 1995 - 15,768 • July 1995 - 23,000+ • January 2005 - 11.5 billion publicly indexed web pages

  7. World Wide Web • .com domains on the rise, as of July 2006: • 76,683,115 hosts for ‘com’ domains • 10,232,188 hosts for ‘edu’ domains • 185,919,955 hosts for ‘net’ domains • 727,773 hosts for ‘gov’ domains • 1,933,551 hosts for ‘mil’ domains • 1,660,470 hosts for ‘org’ domains

  8. World Wide Web • The exponential growth of the Internet is reflected in the number of hosts on the net • 1,000 in 1984 • 10,000 in 1987 • 100,000 in 1989 • 1,000,000 in 1992 • 10,000,000 in 1996 • 100,000,000 in 2000 • 171,638,297 in 2003 • 489,774,269 in July 2007 • Net Timeline (http://www.pbs.org/internet/timeline/) • Internet Domain Survey (http://www.isc.org/ds/)

  9. World Wide Web • Distribution of hosts (worldwide) • US 195,138,696 • European Union 22,000,414 • Japan 21,304,292 • Germany 7,657,162 • Netherlands 6,781,729 • South Korea 5,433,591 • Australia 5,351,622 • UK 4,688,307 • Brazil 4,392,693 • Taiwan 3,838,383

  10. World Wide Web • Popular search methods • email 77% • Search engine 63% • Get news 46% • Job related search 29% • Instant messaging 18% • Online banking 18% • Chat room 8% • Travel reservation 5% • Read blogs 3% • Online auction 3%

  11. World Wide Web • Key limitations of search engines: • do not exploit hyperlinks • search limited to string matching • queries evaluated on archived data rather than up-to-date data; no indexing on current data • low accuracy; replicated results • no further manipulation possible

  12. World Wide Web • Key limitations of search engines (contd.): • ERROR 404! • No efficient document management • Query results cannot be further manipulated • No efficient means for knowledge discovery

  13. World Wide Web • more issues.. • specifying/understanding what information is wanted • the high degree of variability of accessible information • the variability in conceptual vocabulary or “ontology” used to describe information • complexity of querying unstructured data

  14. World Wide Web • contd. • complexity of querying structured data • uncontrolled nature of web-based information content • determining which information sources to search/query

  15. World Wide Web • Search engine capabilities: • Selection of language • Keywords with disjunction, adjacency, presence, absence, ... • Word stemming (Hotbot) • Similarity search (Excite) • Natural language (LycosPro) • Restrict by modification date (Hotbot) or range of dates (AltaVista) • Restrict result types (e.g., must include images) (Hotbot) • Restrict by geographical source (content or domain) (Hotbot) • Restrict within various structured regions of a document (titles or URLs) (LycosPro); (summary, first heading, title, URL) (Opentext)

  16. World Wide Web • Search & Retrieval.. • Using several search engines is better than using only one • Percentage of the web covered by each search engine: • Hotbot 34% • AltaVista 28% • Northern Light 20% • Excite 14% • Infoseek 10% • Lycos 3%

  17. World Wide Web • Schemes to locate information: • Supervised links between sites • ask at the reference desk • Gopher (Univ. of Minnesota): menu format with links both to sites and content • Classification of documents • search in the catalog • Archie (McGill Univ.): system to automatically gather, index and serve information from all anonymous FTP sites • Automated searching • wander around the library • Use META tags to gather metadata • Spiders (robots, web crawlers)

  18. World Wide Web • Popular search engines.. • Year 2000: AltaVista, Yahoo, HotBot • Year 2001: Google, NorthernLight, AltaVista

  19. World Wide Web • Boolean search in AltaVista..

  20. World Wide Web • Specifying field content in HotBot..

  21. World Wide Web • Natural language interface in AskJeeves

  22. World Wide Web • Examples of search strategies: • Rank web pages based on popularity • Rank web pages based on word frequency • Match query to an expert database • The major search engines use a mixed strategy

  23. World Wide Web • Frequency based ranking: • Library analogue: Keyword search • Basic factors in HotBot ranking of pages: - words in the title - keyword meta tags - word frequency in the document - document length
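The four factors on slide 23 can be combined into a toy scoring function. This is a minimal sketch under stated assumptions, not HotBot's actual formula, which the slides do not give. The `Page` record, the function name `frequency_score`, and the weights `w_title` and `w_meta` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    # Hypothetical page record; the field names are illustrative only.
    title: str
    body: str
    meta_keywords: list = field(default_factory=list)


def frequency_score(page, term, w_title=3.0, w_meta=2.0):
    """Score a page for one query term using the four factors on the slide.

    The weights are arbitrary assumptions, not values from any real engine.
    """
    term = term.lower()
    words = page.body.lower().split()
    if not words:
        return 0.0
    # Word frequency divided by document length, so that a long page
    # is not favored merely for being long.
    tf = words.count(term) / len(words)
    in_title = term in page.title.lower().split()
    in_meta = term in [k.lower() for k in page.meta_keywords]
    # Booleans count as 0 or 1 when multiplied by the weights.
    return tf + w_title * in_title + w_meta * in_meta
```

A page that mentions the term in its title and meta keywords outranks one that only mentions it once in a longer body, which matches the intent of the listed factors.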

  24. World Wide Web • Alternative word frequency measures: • Excite uses a thesaurus to search for what you want, rather than what you ask for • AltaVista allows you to look for words that occur within a set distance of each other • NorthernLight weighs results by search term sequence, from left to right

  25. World Wide Web • Popularity based ranking: • Library analogue: citation index • The Google strategy for ranking pages: - Rank is based on the number of links to a page - Pages with a high rank have many other web pages linking to them - The formula is on the Google help page
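The simplest reading of slide 25, "rank is based on the number of links to a page", is an in-link count. The sketch below implements only that simplification. It is not Google's formula: PageRank additionally weights each link by the rank of the page it comes from. The function name `inlink_rank` and the edge-list input format are assumptions.

```python
from collections import Counter


def inlink_rank(links):
    """Rank pages by the number of distinct pages that link to them.

    `links` is an iterable of (source, target) pairs. Duplicate links and
    self-links are ignored, like self-citations in a citation index.
    """
    distinct = {(src, dst) for src, dst in links if src != dst}
    counts = Counter(dst for _, dst in distinct)
    # Highest in-link count first; ties are broken by page name.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Pages that nobody links to do not appear in the result at all, which is one reason real engines mix this signal with content-based factors.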

  26. World Wide Web • More on popularity ranking: • The Google philosophy is also applied by others, such as NorthernLight • HotBot measures popularity of a page by how frequently users have clicked on it in past search results

  27. World Wide Web • Expert Databases, Yahoo • An expert database contains predefined responses to common queries • A simple approach is subject directory, e.g. in Yahoo!, which contains a selection of links for each topic • The selection is small, but can be useful • Library analogue: Trustworthy references

  28. World Wide Web • Expert Databases, AskJeeves • AskJeeves has predefined responses to various types of common queries • These prepared answers are augmented by a meta-search, which searches other SEs • Library analogue: Reference desk

  29. World Wide Web • Example, best wines in France; AskJeeves

  30. World Wide Web • Best wines in France; HotBot

  31. World Wide Web • Best wines in France; Google

  32. World Wide Web • Linux in Iceland; Google

  33. World Wide Web • Linux in Iceland; HotBot

  34. World Wide Web • Linux in Iceland; AskJeeves

  35. Web Data Management • Web Data Management; key objectives • Design a suitable data model to represent web information • Development of web algebra and query language, query optimization • Maintenance of Web data - view maintenance • Development of knowledge discovery and web mining tools • Web warehouse • Data integration, secondary storages, indexes

  36. Web Data Management • Limitations of the web.. • Applications cannot consume HTML • HTML wrapper technology is brittle • Companies merge, need interoperability

  37. Web Data Management • Paradigm Shift • New Web standards – XML • XML generated by applications and consumed by applications • Data exchange - Across platforms: enterprise interoperability - Across enterprises • Web: from documents to data

  38. Web Data Management • Database challenges: • Query optimization and processing • Views and transformations • Data warehousing and data integration • Mediators and query rewriting • Secondary storages • Indexes

  39. Web Data Management • DBMS needs paradigm shift too • Web data differs from database data - self describing, schema less - structure changes without notice - heterogeneous, deeply nested - irregular documents and data mixed - designed by document experts, not DB experts - need Web Data Management

  40. Web Data Management • Web data representation • HTML - Hypertext Markup Language - fixed grammar, no regular expressions - Simple representation of data - good for simple data and intended for human consumption - difficult to extract information • SGML - Standard Generalized Markup Language - good for publishing deeply structured documents • XML - Extensible Markup Language - a subset of SGML

  41. Web Data Management • Terminology • HTML - Hypertext Markup Language • HTTP - Hypertext Transfer Protocol • URL - Uniform Resource Locator • example - <URL> := <protocol>://<host>/<path>/<filename>[<#location>] where - <protocol> is http, ftp, gopher - <host> is an internet address … - <#location> is a textual label in the file
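The URL structure on slide 41 can be checked with Python's standard `urllib.parse.urlsplit`. This is a small sketch. The helper name `split_url`, the dictionary keys (chosen to mirror the slide's terms), and the example.org address are assumptions for illustration.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Split a URL into the parts named on the slide."""
    parts = urlsplit(url)
    # urlsplit returns one `path` field; separate the trailing file name
    # from the directory part to match the slide's <path>/<filename>.
    directory, _, filename = parts.path.rpartition("/")
    return {
        "protocol": parts.scheme,
        "host": parts.netloc,
        "path": directory,
        "filename": filename,
        "location": parts.fragment,
    }


example = split_url("http://www.example.org/docs/intro/index.html#section2")
```

Note that the `#location` part is resolved by the client within the retrieved file; it is not part of what identifies the file on the host.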

  42. Web Data Management • Prevalent, persistent and informative • HTML documents (now XML) created by humans or applications • Accessed day in and day out by Humans and Applications • Persistent HTML documents • Can database technology help?

  43. Web Data Management • Some recent research projects • Web Query System - W3QS, WebSQL, AKIRA, NetQL, RAW, WebLog, Araneus • Semi structured Data Management - LOREL, UnQL, WebOQL, Florid • Website Management System - STRUDEL, Araneus • Web Warehouse - WHOWEDA

  44. Web Data Management • Main tasks.. • Modeling and Querying the Web • view web as directed graph • content and link based queries - example - find the pages that contain the word “Clinton” and are linked from a page containing the word “Monica”
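The example query on slide 44 combines a content condition on two pages with a link condition between them. A minimal sketch follows, assuming the web fragment is already available in memory as a dictionary from URL to page text plus a list of (source, target) links. The function name `pages_linked_from` and this representation are hypothetical; none of the query systems named in these slides is being reproduced here.

```python
def pages_linked_from(pages, links, source_word, target_word):
    """Return pages containing `target_word` that are linked from a page
    containing `source_word`.

    `pages` maps a URL to its text; `links` is a list of (source, target)
    URL pairs, i.e. the arcs of the directed graph.
    """
    def contains(url, word):
        # Whole-word, case-insensitive match on whitespace-separated tokens.
        return word.lower() in pages.get(url, "").lower().split()

    return sorted({
        dst for src, dst in links
        if contains(src, source_word) and contains(dst, target_word)
    })
```

With `source_word="Monica"` and `target_word="Clinton"` this evaluates the slide's query over the given fragment: the content test filters nodes and the link test filters arcs.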

  45. Web Data Management • Main tasks contd. • Information Extraction and integration • wrapper - program to extract a structured representation of the data; a set of tuples from HTML pages. - mediator: integration of data - software that accesses multiple sources from a uniform interface • Web Site Construction and Restructuring - creating sites - modeling the structure of web sites - restructuring data
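A wrapper in the sense of slide 45 turns an HTML page into a set of tuples. The sketch below uses Python's standard `html.parser.HTMLParser` and handles only the simplest case, a flat table of `<tr>`/`<td>` rows. The class name `TableWrapper` and the helper `extract_tuples` are hypothetical. Real wrappers must cope with nested and irregular markup, which is why an earlier slide calls the technology brittle.

```python
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Collect each <tr> of a flat HTML table as a tuple of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(tuple(self._row))
            self._row = None
            self._cell = None


def extract_tuples(html):
    """Run the wrapper over an HTML string and return the extracted tuples."""
    parser = TableWrapper()
    parser.feed(html)
    parser.close()
    return parser.rows
```

A mediator would sit one level above: it would call several such wrappers, one per source, and present their tuples through a single uniform interface.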

  46. Web Data Management • What to model? • Structure of Web sites • Internal structure of web pages • Contents of web sites in finer granularities

  47. Web Data Management • Data representation of Web data • Graph Data Models • Semi structured Data Models (also graph based)

  48. Web Data Management • Graph data model • Labeled graph data model where nodes represent web pages & arcs represent links between pages • Labels on arcs can be viewed as attribute names • Regular path expression queries
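A regular path expression query over the labeled graph of slide 48 can be sketched as follows. The representation (a dictionary from node to a list of `(label, target)` arcs), the function name `path_query`, the use of "." to join labels, and the depth bound are all assumptions made for this illustration; labels are assumed not to contain ".". The sketch enumerates bounded paths and matches them with Python's `re` module, which is simple but not how an optimized query engine would evaluate such queries.

```python
import re


def path_query(graph, start, pattern, max_depth=4):
    """Return nodes reachable from `start` along a label path matching `pattern`.

    `graph` maps a node to a list of (label, target) arcs. The labels along
    a path are joined with '.' and matched against the regular expression.
    `max_depth` bounds the path length so that cycles terminate.
    """
    regex = re.compile(pattern)
    found = set()
    stack = [(start, [])]
    while stack:
        node, labels = stack.pop()
        if labels and regex.fullmatch(".".join(labels)):
            found.add(node)
        if len(labels) < max_depth:
            for label, target in graph.get(node, []):
                stack.append((target, labels + [label]))
    return sorted(found)
```

For instance, the pattern `r"dept\.course"` follows a `dept` arc and then a `course` arc, while `r".*staff"` reaches any node at the end of a path whose last label is `staff`.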

  49. Web Data Management • Semi structured data models • Irregular data structure; no fixed schema is known in advance, and the schema may be implicit in the data • Schema may be large and may change frequently • Schema is descriptive rather than prescriptive; it describes the current state of the data, but violations of the schema are still tolerated

  50. Web Data Management • Semi structured data models • Data is not strongly typed; for different objects the values of the same attributes may be of differing types. (heterogeneous sources) • No restriction on the set of arcs that emanate from a given node in a graph or on the types of the values of attributes • Ability to query the schemas; arc variables which get bound to labels on arcs, rather than nodes in the graph
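The properties on slides 49 and 50 (no fixed schema, attributes whose values differ in type across objects, and a schema that describes rather than constrains) can be illustrated with plain Python dictionaries. This is a sketch only. The function name `describe_schema` and the sample records are hypothetical and are not taken from any of the systems named in these slides.

```python
def describe_schema(objects):
    """Derive a descriptive schema: attribute name -> set of observed type names.

    Nothing is rejected: the schema is computed from the data and merely
    records what is currently there.
    """
    schema = {}
    for obj in objects:
        for attr, value in obj.items():
            schema.setdefault(attr, set()).add(type(value).__name__)
    return schema


# Two objects from heterogeneous sources: `phone` is an integer in one and
# a list of strings in the other, and only one object has an `email`.
people = [
    {"name": "Alice", "phone": 5551234},
    {"name": "Bob", "phone": ["555-1234", "555-9876"], "email": "bob@example.org"},
]
schema = describe_schema(people)
```

Querying `schema` itself, for example asking which attributes carry more than one type, corresponds to the slide's point that the schema is data that can be queried.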
