Why Databases?

CS545 intro: Why Databases? September 2001. Gio Wiederhold, Stanford University, www-db.stanford.edu/people/gio.html. Abstract: The distinction between storing data in files and in databases is that databases are intended to be used by multiple programs and types of users.

Presentation Transcript


  1. CS545 intro Why Databases? September 2001 Gio Wiederhold Stanford University www-db.stanford.edu/people/gio.html CS545 Intro

  2. Abstract
     The distinction between storing data in files and in databases is that databases are intended to be used by multiple programs and types of users. Databases have been available in various forms since 1958. The major paper defining database functionality in a formal sense is due to Ted Codd, of IBM, published in 1970. Information is created by applying knowledge (encoded as programs or rules) to collected data and messages received. Data and computation resources are provided by a variety of suppliers, public and private. The number of potential suppliers and their autonomy also create information overload. To cope with these issues, novel intermediate services are needed, opening up new opportunities. Many traditional relationships among consumers and vendors will change. The autonomy of the suppliers causes heterogeneity and inconsistencies. The semantics of diverse sources are captured by their ontologies, the collection of terms and their relationships as used in the domain of discourse for the source. When sources are to be related we rely on their ontologies to make the linkages. Creating a sound algebra encompassing the required operations allows manipulation and composition of the interoperation process.

  3. Outline
     • Motivation and Functions needed
     • Early Inventions
     • Architecture
     • Formal basis
     • Breadth of applicability
     • Unsolved problems
     • Research Directions

  4. Files versus Databases
     Files: provide input and output for a program (transient)
     • Devices: paper tape (ASCII), cards, magnetic tapes
     • Examples:
       • FORTRAN: tapes 1-5 input, 5 standard in (80-column cards); tapes 6-7 output, 6 print (120 cols), 7 punch (80 cols); still visible in files, IBM VM OS
       • UNIX: standard in > standard out
       • Data processing: in > program > out = in > program > out = in > program > out ....
     Databases: storage (persistent, reliable, random access)
     • Enabled by disk technology, starting in 1960 (5 MB)
     • Many users, i.e., many (small) programs
     • Example:
       • BOMP – Bill-of-materials (inventory), airline seats, processing

  5. Files
     • Files: a means for programs to store data for later use
     • The initial program determines
       • what data are being stored (all? – memory dump [LISP])
       • how it is being stored – structure and format
       • when it is being stored and available
     • Successor programs must follow these decisions
       • often the successor program is another invocation of the initial program
     • Problems
       • One program requires a different structure than another: BOMP
       • Data must be available rapidly, incrementally:
         • class assignments
         • seat reservations
         • library checkout
       • Programs must be available continuously, depend on data

  6. Databases
     • Data are intended to be used by many programs
       • Often small – transactions
       • Various subsets of all the relevant data
       • Structural transformations: Bill-of-Materials
     Programs:
     • Input program: records parts being delivered (Supplier :> parts)
     • Output program: records parts being consumed (Products :> parts)
     • Inventory: Suppliers, Products :> parts

  7. BoMPs are common
     Data sources | Stuff | Data consumption
     • Supplier | Parts | Product-Assemblies
     • Clinical-labs | Observations | Patient-Records
     • Employees | Salary & Tasks | Productivity
     • Accidents | Reports | Failure-Analysis
     • Flights | Seats | Passengers
     • Classes | Grades | Student-Performance
     • . . .
     Two directions / hierarchies needed for data access: data sources and data consumption.
     Solutions?

  8. Design Problem & Solutions
     [Figure: parts P linked to suppliers s1, s2, s3, ..., sn and to consumers c1, c2, c3, ..., cm]
     Conceptual model
     • Supplier program:
       • Use a hierarchy: supplier :> parts supplied (1:n)
     • Consumer program:
       • Use a hierarchy: consumer :> parts used (1:m)
     Actual solution in memory:
     • Matrix: if it exceeds memory, then either supplier or consumer part accesses become costly
     Actual solution beyond memory:
     1. redundant transformed data
     2. pointer and index structures
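The "pointer and index structures" option can be illustrated with a minimal Python sketch. It is an assumption-laden illustration, not the design from the lecture: the class `PartsInventory` and all method names are hypothetical. The sparse supplier/consumer-by-part matrix is never materialized; one index is kept per access direction, so both hierarchies can be followed cheaply at the price of updating two structures on every write.

```python
from collections import defaultdict


class PartsInventory:
    """Sparse parts matrix held as one index per access direction.

    Hypothetical illustration; names and layout are assumptions.
    """

    def __init__(self):
        self.by_supplier = defaultdict(set)     # supplier -> parts supplied (1:n)
        self.by_consumer = defaultdict(set)     # consumer -> parts used (1:m)
        self.suppliers_of = defaultdict(set)    # part -> suppliers (inverse index)
        self.consumers_of = defaultdict(set)    # part -> consumers (inverse index)

    def record_delivery(self, supplier, part):
        # Input program: forward and inverse index are updated together.
        self.by_supplier[supplier].add(part)
        self.suppliers_of[part].add(supplier)

    def record_usage(self, consumer, part):
        # Output program: same idea for the consumption direction.
        self.by_consumer[consumer].add(part)
        self.consumers_of[part].add(consumer)

    def suppliers_for_consumer(self, consumer):
        """Cross both hierarchies: which suppliers feed a given consumer?"""
        result = set()
        for part in self.by_consumer.get(consumer, ()):
            result |= self.suppliers_of.get(part, set())
        return result


inventory = PartsInventory()
inventory.record_delivery("s1", "bolt")
inventory.record_delivery("s2", "bolt")
inventory.record_delivery("s2", "nut")
inventory.record_usage("c1", "bolt")
print(sorted(inventory.suppliers_for_consumer("c1")))  # ['s1', 's2']
```

Only the cells that actually exist are stored, which matters when the matrix is sparse; the cost is that every delivery or usage touches two indexes.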

  9. Factors influencing design
     • Size --- memories are getting bigger, problems too
     • Density of matrix:
       • suppliers supply only some parts, overlapping
       • products consume only some parts, overlapping
     • Performance requirements:
       • supplier response can be less critical
         • airline seats made available versus seats being sold
         • laboratory data obtained versus patient records needed
     • Usage patterns:
       • batches versus single item accesses
       • linked according to yet other criteria

  10. DBMSs: Database Management Systems
      • Collection of the software needed to manage databases
      • Components:
        • Storage management – intertwined with the operating system
        • Query and update processor – uses the schema
        • Schema interpreter and compiler
        • Transaction management and concurrency control/protection – also jointly with OS
        • Logger for backup
        • Recovery programs
      • Large, complex, not all features always needed
      • Many fewer vendors now than 10 years ago

  11. Inventions – 1 – Data Description
      • Schemas [McGee, 1958]: program independence
        • A symbolic description of each column, to be interpreted by update and retrieval programs as well as users
        • Allows programs to use subsets
        • Allows columns to be added without affecting current programs
      • Compilation of Schemas [1975] = avoids interpretation cost
        • requires keeping track of last update for auto-recompile
      • Views [Chamberlin et al., 1976]: bounded schemas
        • Database administrator defines schema subset for user roles
        • Can be compiled for fast execution
        • Must be recompiled when base schema or view is changed
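The schema and view ideas can be tried out with Python's built-in `sqlite3` module. This is a small sketch, not part of the lecture; the table `employee`, the view `staff_directory`, and the helper `list_names` are made-up names. It shows a program that names only the columns it needs continuing to work after a column is added, and a view that exposes a schema subset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")
conn.execute("INSERT INTO employee (name, salary) VALUES ('Ada', 100.0)")


def list_names(connection):
    """An existing program: it refers to columns by name and uses a subset."""
    cursor = connection.execute("SELECT name FROM employee ORDER BY id")
    return [row[0] for row in cursor]


before = list_names(conn)

# Add a column. The program above is unaffected because the schema,
# not the program, describes where each column lives.
conn.execute("ALTER TABLE employee ADD COLUMN task TEXT")
conn.execute("INSERT INTO employee (name, salary, task) VALUES ('Bob', 90.0, 'testing')")
after = list_names(conn)

# A view: a bounded schema for a role that should not see salaries.
conn.execute("CREATE VIEW staff_directory AS SELECT id, name, task FROM employee")
cursor = conn.execute("SELECT * FROM staff_directory")
view_columns = [column[0] for column in cursor.description]

print(before, after, view_columns)
```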

  12. Inventions – 2 – access trees
      • Indexes [Landauer 1963]: balanced trees
        • Efficient ancillary access path
        • Requires updating to stay current
      • Multiple Indexes [DavisLin 1965]: multi-attribute-based access
        • Multiple ancillary access paths
        • Allows access by multiple paths
        • Requires much updating to stay current
      • B-trees [Bayer, 1972]: index updateability
        • Index blocks are kept only 50%-100% full for mostly fast update
        • Improves performance greatly for indexes
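The trade-off named above (fast ancillary access, extra work on every update) can be seen in a toy index. The sketch below is a simplified stand-in, assuming a sorted list searched with Python's `bisect` module; it is not a balanced tree or a B-tree, and `SortedIndex`, `add_row`, and `move` are hypothetical names.

```python
import bisect


class SortedIndex:
    """Ancillary access path: sorted (key, row_id) pairs, binary-searched."""

    def __init__(self):
        self._entries = []  # always kept sorted

    def insert(self, key, row_id):
        bisect.insort(self._entries, (key, row_id))

    def remove(self, key, row_id):
        i = bisect.bisect_left(self._entries, (key, row_id))
        if i < len(self._entries) and self._entries[i] == (key, row_id):
            del self._entries[i]

    def lookup(self, key):
        """Return the row ids whose indexed attribute equals key."""
        # (key,) sorts before every (key, row_id), so this finds the first match.
        i = bisect.bisect_left(self._entries, (key,))
        row_ids = []
        while i < len(self._entries) and self._entries[i][0] == key:
            row_ids.append(self._entries[i][1])
            i += 1
        return row_ids


rows = {}                  # row_id -> record (the base data)
by_name = SortedIndex()    # multiple indexes: one per access attribute
by_city = SortedIndex()


def add_row(row_id, name, city):
    rows[row_id] = {"name": name, "city": city}
    by_name.insert(name, row_id)   # every index must be updated
    by_city.insert(city, row_id)


def move(row_id, new_city):
    by_city.remove(rows[row_id]["city"], row_id)  # keep the index current
    by_city.insert(new_city, row_id)
    rows[row_id]["city"] = new_city


add_row(1, "Chen", "Palo Alto")
add_row(2, "Codd", "San Jose")
add_row(3, "Bayer", "Palo Alto")
move(2, "Palo Alto")
print(by_city.lookup("Palo Alto"))  # [1, 2, 3]
```

Each additional index adds another structure that every insert and update must maintain, which is the "requires much updating" cost on the slide.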

  13. Inventions – 3 – structures
      • Hierarchical Structures [IMS, 1963]: dense data structures
        • Trees mapped to sequential structures for fast access to sparse data
        • Fast access when many related values are needed
        • Costly to update, often done periodically
        • Must be combined with trees for multiple-access paths
      • Triple storage [Feldman, 1969]: arbitrary structures
        • All data represented by object-attribute-value entries
        • High cost when many related values are needed
      Note that these two conflict – in today's database implementations performance has won out over flexibility
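A minimal Python sketch of object-attribute-value storage makes the flexibility/performance conflict concrete. The function names (`put`, `values_of`, `record_of`) and the sample data are illustrative assumptions, and the linear scan stands in for whatever access method a real triple store would use.

```python
from collections import defaultdict

triples = set()  # every fact is one (object, attribute, value) entry


def put(obj, attribute, value):
    """Any attribute can be attached to any object; no schema change is needed."""
    triples.add((obj, attribute, value))


def values_of(obj, attribute):
    return {v for (o, a, v) in triples if o == obj and a == attribute}


def record_of(obj):
    """Reassemble all related values of one object, entry by entry.

    This is where the cost shows: a hierarchical record would hold
    these values together, here each one is a separate entry.
    """
    record = defaultdict(set)
    for o, a, v in triples:
        if o == obj:
            record[a].add(v)
    return dict(record)


put("flight12", "origin", "SFO")
put("flight12", "seat", "14A")
put("flight12", "seat", "14B")
put("passenger7", "name", "Lee")
put("passenger7", "meal", "vegetarian")  # a new attribute, added freely

print(record_of("flight12"))
```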

  14. Inventions – 4 – model foodfight
      • Relational Model [Codd 1970] = tabular model, with an algebraic set of operations, normalization
        • Formalization enabled understanding, dissemination
        • No inter-relation semantics, specified when query is made
        • Later constraints were added, implicitly defining keys, connections
      • Hierarchical (also applied to one view of BOMPs) = describe hierarchical connections among data records, no algebra
        • An attempt to describe earlier, simple implementations in model terms
      • Network – generalization of BOMP = describe structure, procedural navigation in near-arbitrarily linked data
        • Strong inter-record connections, needed for locating data
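The "algebraic set of operations" of the relational model can be sketched in a few lines of Python over lists of dictionaries. This is an informal illustration only: the function names and the `supplies` / `uses` relations are assumptions, and a real DBMS would optimize such expressions rather than run nested loops. Note that the connection between the two relations is stated in the query (the join), not stored in the data.

```python
def select(relation, predicate):
    """Keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(relation, attributes):
    """Keep only the named attributes and drop duplicate rows."""
    seen, result = set(), []
    for row in relation:
        key = tuple(row[a] for a in attributes)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result


def natural_join(left, right):
    """Combine rows that agree on all attributes the two relations share."""
    result = []
    for l_row in left:
        for r_row in right:
            common = l_row.keys() & r_row.keys()
            if all(l_row[a] == r_row[a] for a in common):
                result.append({**l_row, **r_row})
    return result


supplies = [
    {"supplier": "s1", "part": "bolt"},
    {"supplier": "s2", "part": "bolt"},
    {"supplier": "s2", "part": "nut"},
]
uses = [
    {"product": "c1", "part": "bolt"},
    {"product": "c2", "part": "nut"},
]

# Which suppliers contribute to product c1? The link via "part" exists only in the query.
c1_parts = select(uses, lambda row: row["product"] == "c1")
suppliers_for_c1 = project(natural_join(supplies, c1_parts), ["supplier"])
print(suppliers_for_c1)  # [{'supplier': 's1'}, {'supplier': 's2'}]
```

Because each operation takes relations and returns a relation, the operations compose, which is what makes formal optimization possible.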

  15. Why did the relational model win?
      • Relational Model DBMSes: Sequel, QUEL, SQL
        • Formality – allowed essential optimization algorithms
        • Restrictions – such as normalization, provide guidance
        • Teachability – exposed principles:
          • can't teach only from examples
        • DBMS independence – safety blanket for mission-critical users
          • But implementations added features
          • Use least common set of features?
          • Hard to enforce once a system has been bought
          • Few suppliers remain {ORACLE, IBM, MS, mySQL}
      • ER model [Chen, 1976] = Focuses on design, can be mapped to multiple implementations
        • Few tools for direct translation
        • Poor maintenance of model, ignored when DBs are expanded

  16. Databases and the Web
      • HTML presentation: HyperText Markup Language = Data are transformed for human consumption, external refs
        • Often hierarchical – object-oriented view
        • If there was a schema, it is now hidden
      • XML presentation = Schema data is embedded
        • Much flexibility
        • Much more space when entries are small
        • Requires interpretation for viewing, e.g. by XSLT
      • RDF (Resource Description Framework) = Triple representation: object-attribute-value
        • Great flexibility
        • Uncertain implementation
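The space cost of embedding schema information can be checked with a short Python sketch using the standard `xml.etree.ElementTree` module. The record, the element name `reservation`, and the identifier `reservation1` are invented for illustration; the triple list is a plain object-attribute-value rendering, not actual RDF syntax.

```python
import xml.etree.ElementTree as ET

record = {"flight": "UA12", "seat": "14A", "passenger": "Lee"}

# Schema kept elsewhere: the row itself carries values only.
csv_row = ",".join(record.values())

# XML: the field names travel with every entry.
element = ET.Element("reservation")
for name, value in record.items():
    ET.SubElement(element, name).text = value
xml_text = ET.tostring(element, encoding="unicode")

# Triple form: one object-attribute-value entry per field.
triples = [("reservation1", name, value) for name, value in record.items()]

print(len(csv_row), csv_row)
print(len(xml_text), xml_text)
print(triples)
```

With entries this small, the markup is several times larger than the values it wraps, which is the "much more space when entries are small" point on the slide.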

  17. Information overload, Data starvation
      • More databases
        • public & corporate
      • Faster communication
        • digital
        • packeting: TCP-IP, ATM
      • World-wide connectivity
        • Internet & Intranets
        • world-wide web
      • Disintermediation
        • ubiquitous publishing

  18. Change in Supply vs Demand
      "What information consumes is rather obvious: it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention, and a need to allocate that attention efficiently among the overabundance of information sources that might consume it." [Herbert Simon]

  19. Making data relevant
      • Data reduction
      • Data abstraction
      • Level changing
      • Summarization
      • Exception search
      • Level change to integrate with other data sources
      • Follow Customer Model: hierarchical, divide-and-conquer, a common paradigm

  20. Recording Data and Knowledge
      Information is created at the confluence of data -- the state & knowledge -- the ability to select and project the state into the future
      [Figure: a Data Loop and a Knowledge Loop; labels: Storage, Education, Selection, Abstraction, Integration, Summarization, Experience, Decision-making, State changes, Action]

  21. Transforming Data to Information
      • Application Layer: users at workstations
      • Mediation Layer: value-added services
      • Foundation Layer: data and simulation resources

  22. Functions inside Mediation
      [Figure labels: Selection, Transform, Integration, articulation, Summarize; heterogeneous resources]

  23. Function of Mediation
      Apply Domain-specific Specialist Knowledge to add value
      • to locate data sources
      • to convert for consistency
      • to integrate from diverse sources
      • to describe data for processing
      • to abstract for insight / models
      • to extrapolate to new situations
      • to summarize for presentation
      INFORMATION

  24. Many projects, many sources
      [Figure: Environmental Restoration at INEL (Lockheed Martin Idaho National Engineering Laboratory), "Undoing 50 years of messes ….". Labels: sources ERIS and IEDMS, wrappers, mediator, other mediators, CORBA, OEM, QEM; languages MSL [Stanford], OQL [ODMG], MQL [ISX]. ISX - Stanford Univ., June 1998]

  25. From Schemas to Ontologies
      • Ontologies allow communication among partners in enterprises (rarely in machine-readable form)
      • Relationships determine meaning - parent, school, company
      • Databases use ontologies during design in their E-R diagrams (implicitly) and to represent the leaf nodes in their schemas
      • Variable and class names in software
      • Knowledge bases use term ontologies (often explicitly), add class definitions (to hold instances), constraints, and operations among the terms

  26. Ontology: components
      We represent the contents and structure of a language by its ontology:
      • a set of well-defined terms, which delimit the domain of discourse
      • relationships among those terms, chosen from a limited set
      A formalizable subset of expert knowledge

  27. Heterogeneity among Domains
      If interoperation involves distinct domains, mismatch ensues
      • Autonomy conflicts with consistency,
        • Local Needs have Priority,
        • Outside uses are a Byproduct
      Heterogeneity must be addressed
      • Platform and Operating Systems
      • Representation and Access Conventions
      • Naming and Ontologies

  28. Unsolved problem in Interoperation
      Common assumption in assembling and integrating distributed information resources
      • The language used by the resources is the same
      • Sublanguages used by the resources are subsets of a globally consistent language
      This assumption is provably false.
      Working towards the goal of global consistency is
      1. naïve -- the goal cannot be achieved
      2. inefficient -- languages are efficient in local contexts

  29. Large Ontologies: good or bad?
      • Have all the Knowledge together
        • simple for customers of KBs
        • hard for owners of KBs, must synchronize with many others
        • in the limit -- everybody must be globally consistent
      • Large KB will cover multiple / all domains
        • created by a committee -- slow
        • maintained by a committee -- costly to impossible
      • Differences in level of abstraction -- efficiency
        • homeowner: nail
        • carpenter: sinker, brad, boxnail, . . .

  30. Evolution of mediation
      [Figure: stages a. to e.; applications A1-A6, integrators I1-I2, mediators M1-M2 (mediators network), wrappers W1-W3, datasources D1-D6]

  31. Definition*
      A mediator is a software module that exploits encoded knowledge about certain sets or subsets of data to create information for a higher layer of applications. It should be small and simple, so that it can be maintained by one expert or, at most, a small and coherent group of experts.
      * Wiederhold: IEEE Computer, March 1992

  32. Interfaces
      Human ↔ Computer {x-widgets, HTML}
      Application ↔ Mediator {OQL, KQML, ...}
      Mediator ↔ Data sources {SQL, TQL, XML, …}
      Data ← real world {sensors, clerks, …}

  33. An Integration Architecture
      [Figure: a Client Application (portfolios for each company) in dialog with a Mediator; below it, Wrappers over the sources: stock market prices (Ticker Tape) and business reports]
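The layering in this architecture can be sketched in Python. Everything here is hypothetical: the classes `TickerWrapper`, `ReportWrapper`, and `PortfolioMediator`, their methods, and the sample data are assumptions made for illustration, not the system shown in the lecture. Wrappers hide each source's native form; the mediator holds the domain knowledge (how to combine prices and reports into a valuation) and is the only thing the client application talks to.

```python
class TickerWrapper:
    """Hypothetical wrapper: presents a price feed as symbol -> price."""

    def __init__(self, quotes):
        self._quotes = quotes

    def price(self, symbol):
        return self._quotes[symbol]


class ReportWrapper:
    """Hypothetical wrapper: presents business reports as symbol -> company name."""

    def __init__(self, reports):
        self._reports = reports

    def company(self, symbol):
        return self._reports[symbol]["company"]


class PortfolioMediator:
    """Small module that turns data from both sources into information."""

    def __init__(self, ticker, reports):
        self._ticker = ticker
        self._reports = reports

    def valuation(self, holdings):
        """holdings maps symbol -> number of shares."""
        positions = []
        for symbol, shares in sorted(holdings.items()):
            positions.append({
                "company": self._reports.company(symbol),
                "value": shares * self._ticker.price(symbol),
            })
        return {"positions": positions,
                "total": sum(p["value"] for p in positions)}


# Client application: knows only the mediator, not the sources.
mediator = PortfolioMediator(
    TickerWrapper({"AAA": 10.0, "BBB": 2.5}),
    ReportWrapper({"AAA": {"company": "Alpha Corp"}, "BBB": {"company": "Beta Inc"}}),
)
summary = mediator.valuation({"AAA": 3, "BBB": 4})
print(summary["total"])  # 40.0
```

If a source changes its format, only its wrapper needs to change; if the valuation rule changes, only the mediator does.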

  34. Status of Mediation Technology
      Today: Handcrafted
      • Expert consults with programmer
      • Programmer codes the knowledge needed
      • Resource changes require advice, program update
      Future: Generated from models
      • Domain Expert maintains models
      • Specification determines functions
      • Resource changes trigger regeneration

  35. A mediator is not static software: Knowledge ages
      [Figure: a mediator (Software & People: models, programs, rules, caches, . . .) between the Application Interface and the Resource Interfaces; roles: Owner / Creator, Maintainer, Lessor - Seller, Advertiser; pressures: changes of user needs, domain changes, resource changes]

  36. Domain Specialization
      Empowerment: autonomously maintainable
      • Knowledge Acquisition (20% effort) &
      • Knowledge Maintenance (80% effort *)
      to be performed by
      • Domain specialists
      • Professional organizations
      • Field teams of modest size
      * based on experience with software

  37. Roles
      Computer Scientists
      • Provide tools: adaptation, integration, matching, composing
      • Assess standards
      • Assure scalability
      Domain Experts
      • Learn to use the tools
      • Select resources
      • Assess their value
      • Rank their quality
      • Resolve semantics
      • Get client feedback
      • Give feedback

  38. Mediation Research Topics
      • Mediator management and maintenance
      • Representation of knowledge and customer models
      • Balancing dynamic and warehouse solutions
      • Formalization of semantic heterogeneities
        • many levels and types
        • roles for wrappers vs. mediators vs. applications
        • scalability by partitioning -- make it simple!
      • Domain Ontologies --- tools, validation, . . .
      • Effect of object paradigm and method-based access
      • Service and business models
      • New types of information systems

  39. Long Range Science Vision
      [Figure labels:]
      • Artificial Intelligence: knowledge mgmt, domain expertise, uncertainty
      • Systems Engineering: analysis, documentation, costing
      • Databases: access, storage, algebras
      • Integration Methods
      • Integration Science
      • GIS

  40. Fat versus thin mediators
      • Too thin: insufficient added value
      • Too fat: hard to compose
      • Too narrow: few customers
      • Too broad: hard to maintain, needs a committee
      [Figure: a "Just right" region; axes: service scope, domain scope]

  41. Maintenance is good for you
      [Figure: relative annual maintenance cost (0-100%) against lifetime in years (0-13, ?) for automobile, hardware, and software; depreciation = 1 / lifetime]

  42. Client-Server Architecture
      • Fast build of clients by resource reuse
      • Changes (x) are difficult, can affect many clients
      [Figure: client systems connected to data and simulation resources; a change x marked at a resource]

  43. Systems with Mediators
      [Figure: three layers: Applications, Mediators, Data Resources. Gio Wiederhold, 1995]

  44. Growth through Reuse
      [Figure: a New Application added above Prior & Revised Mediators and Extended Data Resources. Gio Wiederhold, 1995]

  45. Linear O(n) Cost of Growth -- now O(n²)
      • Data changes only affect some mediators; only in their domain
      • Mediators can
        1. supply old information to n-1 prior applications
        2. provide better information to the new application
        3. be partially or completely reused
      • New applications, using the new data, can be developed and inserted dynamically
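One way to read the O(n) versus O(n²) claim is as a count of interfaces that must be built and maintained. The Python sketch below rests on simplifying assumptions that are not stated on the slide: in the direct case every application is coded against every source, and in the mediated case every application and every source connects once to a shared mediator layer. The function names are hypothetical.

```python
def direct_interfaces(n_apps, n_sources):
    """Worst case without mediators: each application talks to each source."""
    return n_apps * n_sources


def mediated_interfaces(n_apps, n_sources):
    """With a mediator layer: one interface per application plus one per source."""
    return n_apps + n_sources


def cost_of_adding_application(n_sources, mediated):
    """New interfaces needed when one more application is introduced."""
    return 1 if mediated else n_sources


for n in (5, 10, 20):
    print(n, direct_interfaces(n, n), mediated_interfaces(n, n))
```

Under these assumptions, with 10 applications and 10 sources the direct design needs 100 interfaces and the mediated one 20, and each new application costs one interface instead of one per source.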

  46. Assigning maintenance responsibility
      a. Source data quality – supplier database, files, or web pages
      b. Interface to the source – wrapper, supplier or vendor for supplier
      c. Source selection – expert specialist in mediator
      d. Source quality assessment – customer input to mediator
      e. Semantic interoperation – specialist group providing input to the mediator
      f. Consistency and metadata information – mediator service operation or warehouse
      g. Informal, pragmatic integration – client services with customer input
      h. User presentation formats – client services with customer input
      [Figure labels: Sources, Services, Customers]
