Web Data Management

Web Data Management COSC 4806

Introduction • The ‘world wide web’ • a vast, widely distributed collection of semi-structured multimedia documents • heterogeneous collection of documents • documents in the form of web pages • documents connected via hyperlinks

World Wide Web • The web is growing rapidly • Business organizations increasingly presenting information on the Web • ‘Business on the highway’ • Myriad of raw data to be processed for information

World Wide Web • The web is a fast growing, distributed & non-administered global information resource • WWW allows access to text, images, video, sound and graphical data • Ever-increasing number of businesses building web servers • A chaotic environment to locate information of interest • Lost in hyperspace syndrome

World Wide Web • Characteristics of the WWW: • it’s a set of directed graphs • data is heterogeneous, self-describing & schema less • unstructured, deeply nested information • no central authority for information management • dynamic information vs. static information • web information discovery – search engines

World Wide Web • Rapid growth of web: • In 1994, WWW grew by 1758 % !! • June 1993 - 130 • June 1994 - 1265 • Dec. 1994 - 11,576 • April 1995 - 15,768 • July 1995 - 23,000+ • January 2005 – 11.5 billion publicly-indexed web pages

World Wide Web • .com domains on the rise, as of July 2006: • 76,683,115 hosts for ‘com’ domains • 10,232,188 hosts for ‘edu’ domains • 185,919,955 hosts for ‘net’ domains • 727,773 hosts for ‘gov’ domains • 1,933,551 hosts for ‘mil’ domains • 1,660,470 hosts for ‘org’ domains

World Wide Web • The exponential growth of the Internet is reflected in the number of hosts on the net • 1,000 in 1984 • 10,000 in 1987 • 100,000 in 1989 • 1,000,000 in 1992 • 10,000,000 in 1996 • 100,000,000 in 2000 • 171,638,297 in 2003 • 489,774,269 in July 2007 • Net Timeline (http://www.pbs.org/internet/timeline/) • Internet Domain Survey (http://www.isc.org/ds/)

World Wide Web • Distribution of hosts (worldwide) • US 195,138,696 • European Union 22,000,414 • Japan 21,304,292 • Germany 7,657,162 • Netherlands 6,781,729 • South Korea 5,433,591 • Australia 5,351,622 • UK 4,688,307 • Brazil 4,392,693 • Taiwan 3,838,383

World Wide Web • Popular online activities • Email 77% • Search engine 63% • Get news 46% • Job related search 29% • Instant messaging 18% • Online banking 18% • Chat room 8% • Travel reservation 5% • Read blogs 3% • Online auction 3%

World Wide Web • Key limitations of search engines: • do not exploit hyperlinks • search limited to string matching • queries evaluated on archived data rather than up-to-date data; no indexing on current data • low accuracy; replicated results • no further manipulation possible

World Wide Web • Key limitations of search engines (contd.): • ERROR 404! • No efficient document management • Query results cannot be further manipulated • No efficient means for knowledge discovery

World Wide Web • more issues.. • specifying/understanding what information is wanted • the high degree of variability of accessible information • the variability in conceptual vocabulary or “ontology” used to describe information • complexity of querying unstructured data

World Wide Web • contd. • complexity of querying structured data • uncontrolled nature of web-based information content • determining which information sources to search/query

World Wide Web • Search Engines capabilities: • Selection of language • Keywords with disjunction, adjacency, presence, absence, ... • Word stemming (Hotbot) • Similarity search (Excite) • Natural language (LycosPro) • Restrict by modification date (Hotbot) or range of dates (AltaVista) • Restrict result types (e.g., must include images) (Hotbot) • Restrict by geographical source (content or domain) (Hotbot) • Restrict within various structured regions of a document (titles or URLs) (LycosPro); (summary, first heading, title, URL) (Opentext)

World Wide Web • Search & Retrieval.. • Using several search engines is better than using only one • Search engine coverage (% of web covered): • Hotbot 34 • AltaVista 28 • Northern Light 20 • Excite 14 • Infoseek 10 • Lycos 3

World Wide Web • Schemes to locate information: • Supervised links between sites • ask at the reference desk • Gopher (Univ. of Minnesota): menu format with links both to sites and content • Classification of documents • search in the catalog • Archie (McGill Univ.): system to automatically gather, index and serve information from all anonymous FTP sites • Automated searching • wander around the library • Use META tags to gather metadata • Spiders (robots, web-crawlers)

World Wide Web • Popular search engines.. • Year 2000: AltaVista, Yahoo, HotBot • Year 2001: Google, NorthernLight, AltaVista

World Wide Web • Boolean search in AltaVista..

World Wide Web • Specifying field content in HotBot..

World Wide Web • Natural language interface in AskJeeves

World Wide Web • Examples of search strategies: • Rank web pages based on popularity • Rank web pages based on word frequency • Match query to an expert database • The major search engines use a mixed strategy

World Wide Web • Frequency based ranking: • Library analogue: Keyword search • Basic factors in HotBot ranking of pages: - words in the title - keyword meta tags - word frequency in the document - document length
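
The factors listed on this slide can be combined into a toy scoring function. The following is a minimal sketch, not HotBot's actual formula: the `Page` type, the `score` function and the three weights are assumptions chosen only for illustration.

```python
from dataclasses import dataclass, field
import re


@dataclass
class Page:
    title: str
    body: str
    meta_keywords: list = field(default_factory=list)


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(page, query, w_title=3.0, w_meta=2.0, w_body=1.0):
    """Toy frequency-based score; the weights are illustrative assumptions."""
    title_words = tokenize(page.title)
    meta_words = [w for k in page.meta_keywords for w in tokenize(k)]
    body_words = tokenize(page.body)
    total = 0.0
    for term in tokenize(query):
        total += w_title * title_words.count(term)
        total += w_meta * meta_words.count(term)
        # Divide by document length so that long pages are not favoured
        # merely because they repeat a word more often.
        if body_words:
            total += w_body * body_words.count(term) / len(body_words)
    return total


pages = [
    Page("French wine guide", "wine regions of france and their wine", ["wine", "france"]),
    Page("Linux news", "kernel releases and distributions", ["linux"]),
]
ranked = sorted(pages, key=lambda p: score(p, "wine france"), reverse=True)
```

Dividing the body count by the document length is one simple way to account for the "document length" factor; other normalisations are equally plausible.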

World Wide Web • Alternative word frequency measures: • Excite uses a thesaurus to search for what you want, rather than what you ask for • AltaVista allows you to look for words that occur within a set distance of each other • NorthernLight weighs results by search term sequence, from left to right
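
The proximity idea (words that occur within a set distance of each other) can be sketched as follows. This is an illustrative assumption about how such a test might be written, not AltaVista's implementation; `within_distance` and its parameters are hypothetical names.

```python
import re


def within_distance(text, word_a, word_b, max_gap):
    """Return True if word_a and word_b occur at most max_gap words apart."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == word_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == word_b.lower()]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)


doc = "Linux user groups meet every month in Reykjavik, Iceland"
near = within_distance(doc, "linux", "iceland", 10)  # the words are 8 positions apart
far = within_distance(doc, "linux", "iceland", 3)
```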

World Wide Web • Popularity based ranking: • Library analogue: citation index • The Google strategy for ranking pages: - Rank is based on the number of links to a page - Pages with a high rank have a lot of other web pages that link to them - The formula is on the Google help page
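
Ranking by the number of links to a page can be sketched by counting in-links over a link graph. The graph below is hypothetical, and raw in-link counting is a deliberate simplification: the published PageRank algorithm additionally weights each link by the rank of the page it comes from.

```python
from collections import Counter

links = {  # hypothetical link graph: page -> pages it links to
    "a.html": ["c.html", "b.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html", "a.html"],
}


def inlink_counts(links):
    """Count the distinct pages linking to each page, ignoring self-links."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):
            if target != source:
                counts[target] += 1
    return counts


counts = inlink_counts(links)
# Most-linked pages first; ties are broken alphabetically.
ranking = sorted(counts, key=lambda p: (-counts[p], p))
```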

World Wide Web • More on popularity ranking: • The Google philosophy is also applied by others, such as NorthernLight • HotBot measures popularity of a page by how frequently users have clicked on it in past search results

World Wide Web • Expert Databases, Yahoo • An expert database contains predefined responses to common queries • A simple approach is subject directory, e.g. in Yahoo!, which contains a selection of links for each topic • The selection is small, but can be useful • Library analogue: Trustworthy references

World Wide Web • Expert Databases, AskJeeves • AskJeeves has predefined responses to various types of common queries • These prepared answers are augmented by a meta-search, which searches other SEs • Library analogue: Reference desk

World Wide Web • Example, best wines in France; AskJeeves

World Wide Web • Best wines in France; HotBot

World Wide Web • Best wines in France; Google

World Wide Web • Linux in Iceland; Google

World Wide Web • Linux in Iceland; HotBot

World Wide Web • Linux in Iceland; AskJeeves

Web Data Management • Web Data Management; key objectives • Design a suitable data model to represent web information • Development of web algebra and query language, query optimization • Maintenance of Web data - view maintenance • Development of knowledge discovery and web mining tools • Web warehouse • Data integration, secondary storages, indexes

Web Data Management • Limitations of the web.. • Applications cannot consume HTML • HTML wrapper technology is brittle • Companies merge, need interoperability

Web Data Management • Paradigm Shift • New Web standards – XML • XML generated by applications and consumed by applications • Data exchange - Across platforms: enterprise interoperability - Across enterprises • Web: from documents to data
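
A small sketch of XML being generated by one application and consumed by another, using Python's standard `xml.etree.ElementTree` module. The `order` record and its element names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Producer: an application serialises a record as XML text.
order = ET.Element("order", attrib={"id": "42"})
ET.SubElement(order, "customer").text = "Acme Corp"
item = ET.SubElement(order, "item", attrib={"qty": "3"})
item.text = "widget"
payload = ET.tostring(order, encoding="unicode")

# Consumer: another application parses the same text back into data,
# without any screen-scraping of a human-oriented page.
parsed = ET.fromstring(payload)
customer = parsed.findtext("customer")
qty = int(parsed.find("item").get("qty"))
```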

Web Data Management • Database challenges: • Query optimization and processing • Views and transformations • Data warehousing and data integration • Mediators and query rewriting • Secondary storages • Indexes

Web Data Management • DBMS needs paradigm shift too • Web data differs from database data - self describing, schema less, - structure changes without notice, - heterogeneous, deeply nested, - irregular documents and data mixed - designed by document expert, but not DB expert - need Web Data Management

Web Data Management • Web data representation • HTML - Hypertext Markup Language - fixed grammar, no regular expressions - Simple representation of data - good for simple data and intended for human consumption - difficult to extract information • SGML - Standard Generalized Markup Language - good for publishing deeply structured documents • XML - Extensible Markup Language - a subset of SGML

Web Data Management • Terminology • HTML - Hypertext Mark-up Language • HTTP - Hypertext Transfer Protocol • URL - Uniform Resource Locator • example - <URL>:=<protocol>://<host>/<path>/<filename>[<#location>] where - <protocol> is http, ftp, gopher - <host> is internet address … - #location is a textual label in the file
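
The URL components named on this slide can be pulled apart with Python's standard `urllib.parse.urlsplit`. The URL below is a made-up example; the variable names simply mirror the slide's terms.

```python
from urllib.parse import urlsplit

url = "http://www.example.org/docs/intro/index.html#section2"
parts = urlsplit(url)

protocol = parts.scheme      # the <protocol> part
host = parts.netloc          # the <host> part
# Split the path component into its directory part and the file name.
path, _, filename = parts.path.rpartition("/")
location = parts.fragment    # the label after '#'
```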

Web Data Management • Prevalent, persistent and informative • HTML documents (now XML) created by humans or applications • Accessed day in and day out by Humans and Applications • Persistent HTML documents • Can database technology help?

Web Data Management • Some recent research projects • Web Query System - W3QS, WebSQL, AKIRA, NetQL, RAW, WebLog, Araneus • Semi structured Data Management - LOREL, UnQL, WebOQL, Florid • Website Management System - STRUDEL, Araneus • Web Warehouse - WHOWEDA

Web Data Management • Main tasks.. • Modeling and Querying the Web • view web as directed graph • content and link based queries - example - find the pages that contain the word “Clinton” and have a link from a page containing the word “Monica”
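
The example query combines a content condition with a link condition. A minimal sketch over a hypothetical in-memory crawl (the `pages` dictionary and the function names are assumptions, not part of any of the query systems named in these slides):

```python
pages = {  # hypothetical crawl: URL -> (text, outgoing links)
    "p1": ("Monica gave an interview", ["p2", "p3"]),
    "p2": ("President Clinton spoke today", ["p3"]),
    "p3": ("Weather report", []),
    "p4": ("Clinton visits Ottawa", []),
}


def contains(url, word):
    """Content condition: does the page text contain the word?"""
    text, _ = pages[url]
    return word.lower() in text.lower().split()


def linked_from(source_word, target_word):
    """Pages containing target_word that are linked from a page containing source_word."""
    result = set()
    for url, (_, out_links) in pages.items():
        if contains(url, source_word):
            result.update(t for t in out_links
                          if t in pages and contains(t, target_word))
    return sorted(result)


answer = linked_from("Monica", "Clinton")
```

Here `p4` also mentions "Clinton" but is excluded, because no page containing "Monica" links to it.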

Web Data Management • Main tasks contd. • Information Extraction and integration • wrapper - a program that extracts a structured representation of the data (a set of tuples) from HTML pages - mediator - software that integrates data by accessing multiple sources through a uniform interface • Web Site Construction and Restructuring - creating sites - modeling the structure of web sites - restructuring data
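
A wrapper in the sense used here can be sketched with Python's standard `html.parser` module. The `TableWrapper` class and the sample HTML are hypothetical; the sketch assumes the data sits in plain `<tr>`/`<td>` rows, which is exactly the kind of layout assumption that makes wrappers brittle.

```python
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Collects the cell text of each <tr> as one tuple."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None        # cells of the row currently being read
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(tuple(self._row))
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row[-1] += data.strip()


html = """
<table>
  <tr><td>Hotbot</td><td>34</td></tr>
  <tr><td>AltaVista</td><td>28</td></tr>
</table>
"""
wrapper = TableWrapper()
wrapper.feed(html)
wrapper.close()
tuples = wrapper.rows
```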

Web Data Management • What to model? • Structure of Web sites • Internal structure of web pages • Contents of web sites in finer granularities

Web Data Management • Data representation of Web data • Graph Data Models • Semi structured Data Models (also graph based)

Web Data Management • Graph data model • Labeled graph data model where nodes represent web pages & arcs represent links between pages • Labels on arcs can be viewed as attribute names • Regular path expression queries
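
A labeled graph and a regular path expression query can be sketched as below. The site graph is hypothetical, and the evaluation strategy (enumerate label paths up to a depth bound and match them against a Python regular expression) is a simplification chosen for brevity, not the method of any particular web query system.

```python
import re

# Hypothetical site graph: node -> list of (arc label, target node).
graph = {
    "home": [("dept", "cs"), ("dept", "math")],
    "cs": [("course", "cosc4806"), ("staff", "smith")],
    "math": [("course", "math1000")],
    "cosc4806": [("instructor", "smith")],
    "math1000": [],
    "smith": [],
}


def path_query(graph, start, pattern, max_depth=5):
    """Return nodes reachable from start by a path whose arc labels,
    joined with '/', fully match the regular expression pattern."""
    regex = re.compile(pattern)
    found = set()
    stack = [(start, [])]
    while stack:
        node, labels = stack.pop()
        if labels and regex.fullmatch("/".join(labels)):
            found.add(node)
        # The depth bound guarantees termination even on cyclic graphs.
        if len(labels) < max_depth:
            for label, target in graph.get(node, []):
                stack.append((target, labels + [label]))
    return sorted(found)


courses = path_query(graph, "home", r"dept/course")
people = path_query(graph, "home", r"dept/(course/instructor|staff)")
```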

Web Data Management • Semi structured data models • Irregular data structure, no fixed schema known and may be implicit in the data • Schema may be large and may change frequently • Schema is descriptive rather than prescriptive; describes current state of data, but violations of schema still tolerated

Web Data Management • Semi structured data models • Data is not strongly typed; for different objects the values of the same attributes may be of differing types. (heterogeneous sources) • No restriction on the set of arcs that emanate from a given node in a graph or on the types of the values of attributes • Ability to query the schemas; arc variables which get bound to labels on arcs, rather than nodes in the graph
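
These properties can be illustrated with plain Python dictionaries standing in for semi-structured objects. The data and helper names are hypothetical; `labels_with_value` plays the role of an arc variable, since it returns labels rather than nodes.

```python
# Hypothetical semi-structured objects: the same attribute may hold values
# of different types, and objects need not share the same attributes.
db = [
    {"name": "Smith", "phone": 5551234, "office": "B-210"},
    {"name": {"first": "Ann", "last": "Lee"}, "phone": "555-9876"},
    {"name": "Wong", "email": ["wong@example.org", "w@example.org"]},
]


def labels_with_value(objects, predicate):
    """Arc variable: return every label that has some value satisfying predicate."""
    return sorted({label for obj in objects
                   for label, value in obj.items() if predicate(value)})


def types_of(objects, label):
    """Schema query: which value types occur under a given label?"""
    return sorted({type(obj[label]).__name__ for obj in objects if label in obj})


string_labels = labels_with_value(db, lambda v: isinstance(v, str))
phone_types = types_of(db, "phone")
name_types = types_of(db, "name")
```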
