Class Number – CS 412

Class Number – CS 412. Web Data MGMT and XML. Instructor – Sanjay Madria. Lesson Title - Introduction.


Presentation Transcript


  1. Class Number – CS 412 Web Data MGMT and XML Instructor – Sanjay Madria Lesson Title - Introduction

  2. The link for the Real Player live stream is: • http://movie.umr.edu/ramgen/encoder/liveCS412F03.rm • The link to view the archived Real Player lecture at 28 and 56 kbps is: • http://movie.umr.edu/ramgen/CoursesF02/CS412F03/CS412Lec082803kbs2856.rm • (The date portion, 082803, changes for each recorded class) • The link to view the archived Real Player lecture at 200 kbps is: • http://movie.umr.edu/ramgen/CoursesF03/CS412F03/CS412Lec082803kbs200.rm • For example, to watch the lecture for 15 September using Real Player, change the date so that the file name reads “CS412Lec091503kbs200.rm”
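The date substitution described on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `archived_lecture_url` is made up here, the base path is copied from the 200 kbps link above, and the MMDDYY layout is inferred from the two file names shown on the slide.

```python
from datetime import date

# Base path copied from the 200 kbps link on the slide.
ARCHIVE_BASE = "http://movie.umr.edu/ramgen/CoursesF03/CS412F03/"


def archived_lecture_url(lecture_date: date) -> str:
    """Build the 200 kbps archive URL for a lecture date.

    Assumes the file name embeds the date as MMDDYY, as in
    CS412Lec082803kbs200.rm (28 August 2003).
    """
    stamp = lecture_date.strftime("%m%d%y")
    return f"{ARCHIVE_BASE}CS412Lec{stamp}kbs200.rm"


print(archived_lecture_url(date(2003, 9, 15)))
```

For 15 September 2003 the file name part comes out as CS412Lec091503kbs200.rm, matching the example on the slide.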

  3. Web Data Management and XML Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla madrias@umr.edu

  4. WWW • Huge, widely distributed, heterogeneous collection of semi-structured multimedia documents in the form of web pages connected via hyperlinks.

  5. World Wide Web • The Web is growing fast • More business organizations are putting information on the Web • Business on the highway • Myriad raw data to be processed for information

  6. As the WWW grows, it becomes more chaotic • The Web is a fast-growing, distributed, non-administered global information resource • WWW allows access to text, image, video, sound and graphic data • More business organizations creating web servers • More chaotic environment in which to locate information of interest • Lost in hyperspace syndrome

  7. Characteristics of WWW • WWW is a set of directed graphs • Data in the WWW is heterogeneous, self-describing and schemaless • Unstructured information, deeply nested • No central authority to manage information • Dynamic versus static information • Web information discovery - search engines

  8. Web is Growing! • In 1994, WWW grew by 1758%!! • June 1993 - 130 • June 1994 - 1,265 • Dec. 1994 - 11,576 • April 1995 - 15,768 • July 1995 - 23,000+ • 2000 - !!!!!

  9. ‘COM’ domains are increasing! • As of July 1995, 6.64 million host computers on the Internet: • 1.74 million are ‘com’ domains • 1.41 million are ‘edu’ domains • 0.30 million are ‘net’ • 0.27 million are ‘gov’ • 0.22 million are ‘mil’ • 0.20 million are ‘org’

  10. The number of Internet hosts exceeded... • 1,000 in 1984 • 10,000 in 1987 • 100,000 in 1989 • 1,000,000 in 1992 • 10,000,000 in 1996 • 100,000,000 in 2000

  11. Top web countries • 1. Canada (1) 80% • 2. US (4) 140% • 3. Ireland (3) 110% • 4. Iceland (2) 68% • 5. UK (14) 336% • 6. Malta (5) 155% • 7. Australia (6) 133% • 8. Singapore (11) 207% • 9. New Zealand (7) 101 • 10. Sweden (9) 101% • 11. Israel (12) 112% • 12. Cyprus (8) 72% • 13. Hong Kong (15) 148% • 14. Norway (10) 64% • 15. Switzerland (13) 75% • 16. Denmark (16) 105%

  12. How users find web sites • Indexes and search engines 75 • UseNet newsgroups 44 • Cool lists 27 • New lists 24 • Listservers 23 • Print ads 21 • Word-of-mouth and e-mail 17 • Linked web advertisement 4

  13. Limitations of Search Engines • Do not exploit hyperlinks • Search is limited to string matching • Queries are evaluated on archived data rather than up-to-date data; no indexing on current data • Low accuracy • Replicated results • No further manipulation possible

  14. Limitations of Search Engines • ERROR 404! • No efficient document management • Query results cannot be further manipulated • No efficient means for knowledge discovery

  15. More PROBLEMS • Specifying/understanding what information is wanted • High degree of variability of accessible information • Variability in conceptual vocabulary or “ontology” used to describe information • Complexity of querying unstructured data

  16. Complexity of querying structured data • Uncontrolled nature of web-based information content • Determining which information sources to search/query

  17. Search Engine Capabilities • Selection of language • Keywords with disjunction, adjacency, presence, absence, ... • Word stemming (HotBot) • Similarity search (Excite) • Natural language (Lycos Pro) • Restrict by modification date (HotBot) or range of dates (AltaVista) • Restrict result types (e.g., must include images) (HotBot) • Restrict by geographical source (content or domain) (HotBot) • Restrict within various structured regions of a document: titles or URLs (Lycos Pro); summary, first heading, title, URL (Opentext)

  18. Search & Retrieval: search engines and % of web covered • HotBot 34 • AltaVista 28 • Northern Light 20 • Excite 14 • Infoseek 10 • Lycos 3 • Using several search engines is better than using only one • Source: Lawrence, S., and Giles, C.L., “Searching the World Wide Web,” Science 280, pp. 98-100, 1998.

  19. Schemes to locate information • Supervised links between sites (library analogue: ask at the reference desk) • Classification of documents (library analogue: search in the catalog) • Automated searching (library analogue: wander around the library)

  20. The most popular search engines • Year 2000: AltaVista, Yahoo, HotBot • Year 2001: Google, Northern Light, AltaVista

  21. Boolean search in AltaVista

  22. Specifying field content in HotBot

  23. Natural language interface in AskJeeves

  24. Three examples of search strategies • Rank web pages based on popularity • Rank web pages based on word frequency • Match query to an expert database • All the major search engines use a mixed strategy in ranking web pages and responding to queries

  25. Rank based on word frequency • Library analogue: Keyword search • Basic factors in HotBot ranking of pages: • words in the title • keyword meta tags • word frequency in the document • document length
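A rough sketch of how the four factors on this slide could be combined into one score. This is not HotBot's actual formula, which the slide does not give: the weights, the page dictionary layout and the example.org URLs are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query, page, title_weight=3.0, keyword_weight=2.0):
    """Score one page for a query (weights are illustrative assumptions)."""
    title_words = tokenize(page["title"])
    keyword_words = tokenize(page["keywords"])  # stands in for keyword meta tags
    body_words = tokenize(page["body"])
    total = 0.0
    for term in tokenize(query):
        total += title_weight * title_words.count(term)
        total += keyword_weight * keyword_words.count(term)
        if body_words:
            # Word frequency divided by document length, so long pages
            # do not win simply by repeating a word more often.
            total += body_words.count(term) / len(body_words)
    return total


def rank_pages(query, pages):
    """Return page URLs ordered from highest to lowest score."""
    ordered = sorted(pages, key=lambda page: score(query, page), reverse=True)
    return [page["url"] for page in ordered]


# Hypothetical pages, invented for this example.
pages = [
    {"url": "http://example.org/linux", "title": "Linux in Iceland",
     "keywords": "linux", "body": "a page that mentions wine once"},
    {"url": "http://example.org/wine", "title": "French wine guide",
     "keywords": "wine, france", "body": "wine regions of france and their wine"},
]
print(rank_pages("wine", pages))
```

The wine guide ranks first because the query term appears in its title and keywords as well as in its body.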

  26. Alternative word frequency measures • Excite uses a thesaurus to search for what you want, rather than what you ask for • AltaVista allows you to look for words that occur within a set distance of each other • Northern Light weights results by search term sequence, from left to right
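The proximity idea attributed to AltaVista above can be illustrated with a small check on word positions. This is a sketch of the general technique, not AltaVista's implementation; the function name is hypothetical and punctuation handling is deliberately left out.

```python
def within_distance(text, word_a, word_b, max_distance):
    """Return True if word_a and word_b occur at most max_distance words apart."""
    words = text.lower().split()
    positions_a = [i for i, word in enumerate(words) if word == word_a]
    positions_b = [i for i, word in enumerate(words) if word == word_b]
    return any(abs(i - j) <= max_distance
               for i in positions_a for j in positions_b)


print(within_distance("best wines in france", "wines", "france", 2))
```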

  27. Rank based on popularity • Library analogue: citation index • The Google strategy for ranking pages: • Rank is based on the number of links to a page • Pages with a high rank have a lot of other web pages that link to them • The formula is on the Google help page
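As a simplified illustration of ranking by the number of links to a page, the sketch below just counts inbound links. It is not the formula the slide refers to on the Google help page; the function name and the sample link list are assumptions.

```python
from collections import Counter


def rank_by_inlinks(links):
    """Order pages by how many distinct other pages link to them.

    links is an iterable of (source, target) pairs. Ties are broken
    alphabetically so the result is deterministic.
    """
    counts = Counter()
    pages = set()
    for source, target in set(links):  # ignore repeated identical links
        pages.update((source, target))
        if source != target:  # a page linking to itself does not count
            counts[target] += 1
    return sorted(pages, key=lambda page: (-counts[page], page))


# Hypothetical link structure between three pages.
links = [("a", "c"), ("b", "c"), ("c", "a"), ("a", "b")]
print(rank_by_inlinks(links))
```

Page "c" comes first because two other pages link to it, while "a" and "b" each have one inbound link.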

  28. More on popularity ranking • The Google philosophy is also applied by others, such as Northern Light • HotBot measures the popularity of a page by how frequently users have clicked on it in past search results

  29. Expert databases: Yahoo! • An expert database contains predefined responses to common queries • A simple approach is a subject directory, e.g. in Yahoo!, which contains a selection of links for each topic • The selection is small, but can be useful • Library analogue: Trustworthy references

  30. Expert databases: AskJeeves • AskJeeves has predefined responses to various types of common queries • These prepared answers are augmented by a meta-search, which searches other search engines • Library analogue: Reference desk

  31. Best wines in France: AskJeeves

  32. Best wines in France: HotBot

  33. Best wines in France: Google

  34. Linux in Iceland: Google

  35. Linux in Iceland: HotBot

  36. Linux in Iceland: AskJeeves

  37. Web Data Management is the Key

  38. Key Objectives • Design a suitable data model to represent web information • Development of web algebra and query language, query optimization • Maintenance of Web data - View Maintenance • Development of knowledge discovery and web mining tools • Web warehouse • Web data integration, secondary storage, indexes

  39. Limitations of the Web Today • Applications cannot consume HTML • HTML wrapper technology is brittle • Companies merge and need interoperability fast

  40. Paradigm Shift • New Web standards – XML • XML generated by applications and consumed by applications • Data exchange • Across platforms: enterprise interoperability • Across enterprises • Web: from documents to data
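To make "XML generated by applications and consumed by applications" concrete, here is a minimal round trip using Python's standard `xml.etree.ElementTree` module. The `order` and `item` element names and both function names are invented for this sketch; they do not come from the lecture.

```python
import xml.etree.ElementTree as ET


def produce_order(order_id, items):
    """Producer side: serialize (name, quantity) pairs as an XML string."""
    root = ET.Element("order", id=str(order_id))
    for name, quantity in items:
        item = ET.SubElement(root, "item", quantity=str(quantity))
        item.text = name
    return ET.tostring(root, encoding="unicode")


def consume_order(xml_text):
    """Consumer side: recover the order id and items from the XML string."""
    root = ET.fromstring(xml_text)
    items = [(item.text, int(item.get("quantity")))
             for item in root.findall("item")]
    return root.get("id"), items


message = produce_order(42, [("wine", 6), ("cheese", 2)])
print(message)
print(consume_order(message))
```

The consumer needs no screen scraping: the element and attribute names carry the structure, which is the contrast with HTML drawn on the previous slide.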

  41. Database challenges • Query optimization and processing • Views and transformations • Data warehousing and data integration • Mediators and query rewriting • Secondary storage • Indexes

  42. DBMS needs paradigm shift too • Web data differs from database data: • self-describing, schemaless • structure changes without notice • heterogeneous, deeply nested, irregular • documents and data mixed • Designed by document experts, not database experts • Need web data management

  43. Web Data Representation • HTML - Hypertext Markup Language • fixed grammar, no regular expressions • simple representation of data • good for simple data and intended for human consumption • difficult to extract information • SGML - Standard Generalized Markup Language • good for publishing deeply structured documents • XML - Extensible Markup Language • a subset of SGML

  44. Terminology • HTML - Hypertext Markup Language • HTTP - Hypertext Transfer Protocol • URL - Uniform Resource Locator • example - <URL> := <protocol>://<host>/<path>/<filename>[#<location>] where • <protocol> is http, ftp, gopher • <host> is an internet address … • <location> is a textual label in the file.
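The URL components listed on this slide map onto what Python's standard `urllib.parse.urlsplit` returns. The sketch below splits a URL into those parts; the helper name `url_parts` and the `#links` label in the sample URL are assumptions added for illustration.

```python
from urllib.parse import urlsplit


def url_parts(url):
    """Split a URL into the components named on the Terminology slide."""
    parts = urlsplit(url)
    # Everything before the last "/" is the path, the rest is the file name.
    path, _, filename = parts.path.rpartition("/")
    return {
        "protocol": parts.scheme,
        "host": parts.netloc,
        "path": path,
        "filename": filename,
        "location": parts.fragment,
    }


# The "#links" label is hypothetical; the rest is the NCSA URL from the slides.
print(url_parts("http://www.ncsa.uiuc.edu/General/Internet/WWW/HTMLPrimer.html#links"))
```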

  45. Links are specified as <A HREF="Destination URL">Anchor Text</A> • “Destination URL” is the URL of the destination document, and “Anchor Text” is the text displayed as the anchor. • Example: • <A HREF="http://www.ntu.edu.sg/">Nanyang Technological University</A> • Links can be absolute or relative • <A HREF="AtlanticStates/NYStats.html">New York</A> is a relative URL • <A HREF="http://www.ncsa.uiuc.edu/General/Internet/WWW/HTMLPrimer.html">NCSA's Beginner's Guide to HTML</A> is an absolute address
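The difference between absolute and relative links shows up when a program extracts anchors from a page. The sketch below uses Python's standard `html.parser` and `urllib.parse.urljoin`; the class name `LinkExtractor` and the base URL `http://example.org/usa/index.html` are assumptions, since the slide does not say which page the relative link lives on.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect (absolute URL, anchor text) pairs from <A HREF=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                # Relative URLs are resolved against the page's own URL;
                # absolute URLs pass through unchanged.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


html = ('<A HREF="AtlanticStates/NYStats.html">New York</A> '
        '<A HREF="http://www.ntu.edu.sg/">Nanyang Technological University</A>')
extractor = LinkExtractor("http://example.org/usa/index.html")  # hypothetical base
extractor.feed(html)
for url, text in extractor.links:
    print(url, "|", text)
```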

  46. World Wide Web • HTML documents (soon, XML) created by humans or applications. • Prevalent, persistent and informative • Accessed day in and day out by humans and applications. • Persistent HTML documents!!! Can database technology help?

  47. Current Research Projects • Web Query System: W3QS, WebSQL, AKIRA, NetQL, RAW, WebLog, Araneus • Semistructured Data Management: LOREL, UnQL, WebOQL, Florid • Website Management System: STRUDEL, Araneus • Web Warehouse: WHOWEDA, Xylem.com

  48. Main Tasks • Modeling and Querying the Web • view the web as a directed graph • content- and link-based queries • example - find the pages that contain the word “clinton” and have a link from a page containing the word “monica”.
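The example query on this slide combines a content condition with a link condition. A minimal sketch over an in-memory graph is shown below; the page identifiers, page texts and function name are all hypothetical, and a real web query system such as those listed on the previous slide would use its own query language instead.

```python
def pages_linked_from(pages, links, source_word, target_word):
    """Find pages containing target_word that are linked from a page
    containing source_word.

    pages maps a page id to its text; links is a list of (source, target) edges.
    """
    def contains(page_id, word):
        return word in pages[page_id].lower().split()

    return sorted({target for source, target in links
                   if contains(source, source_word)
                   and contains(target, target_word)})


# Hypothetical four-page web, viewed as a directed graph.
pages = {
    "p1": "monica interview transcript",
    "p2": "president clinton statement",
    "p3": "clinton biography",
    "p4": "weather report",
}
links = [("p1", "p2"), ("p4", "p3"), ("p1", "p4")]
print(pages_linked_from(pages, links, "monica", "clinton"))
```

Only "p2" qualifies: "p3" also mentions the target word, but the page linking to it does not contain the source word.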

  49. Information Extraction and Integration • Wrapper - a program that extracts a structured representation of the data (a set of tuples) from HTML pages • Mediator - data integration software that accesses multiple sources through a uniform interface • Web Site Construction and Restructuring • creating sites • modeling the structure of web sites • restructuring data
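A wrapper in the sense used on this slide can be as small as a parser that turns the rows of an HTML table into tuples. The sketch below is one possible illustration using Python's standard `html.parser`; the class name `TableWrapper` and the sample table (built from the coverage figures on slide 18) are assumptions. It also hints at why wrappers are brittle: any change to the page layout breaks the extraction.

```python
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Extract one tuple per table row, using only <td> cells."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows made of <th> cells stay empty
                self.rows.append(tuple(self._row))
            self._row = None


html = ("<table><tr><th>Search engine</th><th>% web covered</th></tr>"
        "<tr><td>HotBot</td><td>34</td></tr>"
        "<tr><td>AltaVista</td><td>28</td></tr></table>")
wrapper = TableWrapper()
wrapper.feed(html)
print(wrapper.rows)
```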

  50. What to Model • Structure of Web sites • Internal structure of web pages • Contents of web sites in finer granularities
