Application software – office packets, databases and data warehouses.

Introduction to Computer Systems (8) Application software –office packets, databases and data warehouses. PiotrMielecki Ph. D. http://www.wssk.wroc.pl/~mielecki mielecki@wssk.wroc.pl Piotr.Mielecki@pwr.wroc.pl

Office packets. 1.1. Definitions. • Office packet of software, usually called an office application suite or productivity suite is a set of applications intended to be used by typical office worker and/or knowledge workers. The components are generally distributed together, have a consistent user interface and usually can interact with each other, sometimes in ways that the operating system would not normally allow – mechanisms like Object Linking and Embedding (OLE), for example. • Most of office application suites include at least: • Word processor – (more formally known as document preparation system), which is an application used for the production (including composition, editing, formatting, and possibly printing) of any sort of printable material, stored as the electronic document. • Spreadsheet – rectangular table (or grid) of arranged information (financial very often). The electronic spreadsheet supports automatic calculations (mathematical, statistical, financial etc.) and tools for graphical presentations (different diagrams).

In addition to these, the suite may contain: • Presentation program – designed for preparing sets of electronic slides, usually based on contents of other electronic documents and different multimedia formats. The applications of this kind have usually poor support for edition of multimedia files or documents, but the OLE mechanism (advanced graphic editor as an OLE server, for example) can be used to speed-up work on presentation. MS-PowerPoint has become a standard for applications of this kind. • Desktop database tool – the application which can be used to create small (desktop) databases or as the client and/or report-generator for remote client-server (SQL) databases. Actually MS-Access is the most popular application of this kind. • Graphics suite – the application (or set of applications) designed for editing different bitmap and vector graphical formats. Actually the more advanced graphics suites (like CorelDraw! or Adobe Photoshop, for example) are not included in the particular office packages, but they can be OLE servers for them. On the other hand, the OpenOffice.org suite has its own, not very poor vector graphic editor.

Other additional components of the office suite: • Communication tools, including: • e-mail client and/or • Personal Information Manager (PIM) or groupware package. Microsoft Outlook is a good example of this kind. • Programming language designed for supporting automatic processing of documents and data included in these documents (like Visual Basic for Applications in MS-Office, for example).

1.2. Common problems. • One of the most important problems is the standard format (or set of formats) for electronic documents. The attempts were (and still are) made to establish a format suitable for different office packets (from different manufacturers).Using the format not dependent on the particular office suite would be much better for all the customers – they could change the office software without changing the document formats, still having access to older documents. On the other hand, they could exchange electronic documents between each-other (via e-mail, for example) not necessary using the same office software to display them and work on them. Of course, the large manufacturers like Microsoft are not interested in developing universal standards. • The Rich Text Format (RTF) is one of the most well-known formats for word processors (formatted text files), but its implementation in MS-Word is still not quite correct. • The ISO/IEC 26300 Open Document Format (ODF), based on XML and supporting formatted text files, spreadsheets, diagrams and presentations is an alternative for ”closed” formats like DOC (DOCX), XLS or PPT. The international non-profit committee which is developing this standard is Organization for the Advancement of Structured Information Standards (OASIS).

The other important thing is good support for work in organized groups. That means the document-flow management inside the organization’s structure, in many locations sometimes. In the small team of people it’s relatively easy to share the documents in Local Area Network (LAN) environment, sometimes using Virtual Private Network (VPN) channels to connect with company’s LAN from remote locations and interchanging document-files just accessing them from the shared folders or using common e-mail. • The large companies have problems with safe and efficient flow of different documents (and their subsequent versions), created by single people or by workgroups together. Application software which can support these tasks is rather sophisticated and usually database-oriented (it means that the efficient database server with appropriate client software is the center of the company’s document-flow system). IBM Lotus Notes / Domino is now probably most advanced system of this kind. ”Domino” is the name of the server software package, while ”Lotus” is client application (a bit similar to MS-Outlook) which organizes the user’s work. Another well-known solutions are Microsoft Exchange (relatively poor in comparison with Lotus but much cheaper and fully integrated with ActiveDirectory, based on non-SQL database and MS-Outlook as the only client) and Novell GroupWise.

2. Databases and Data Warehouses. 2.1. Definitions. • Database is a structured collection of recordsor data that is stored in a computer system so that a computer program or person using a query language (SQL in most of cases) can consult it to answer queries or append new data. The records retrieved in answer to queries are information that can be used to make decisions. The computer software used to manage and query a database is known as a Database Management System (DBMS). • The central concept of a database is that of a collection of records, or pieces of information. Typically, for a given database, there is a structural description of the type of facts held in that database: this description is known as a schema. The schema describes the objects that are represented in the database, and the relationships among them. There are a number of different ways of organizing a schema, that is, of modeling the database structure – they are known as database models (or data models).

The data model in most common use today is the relational model, which represents all information in the form of multiple related tables, each consisting of rows and columns (the formal definition uses mathematical terminology and is much less understandable, so rather not used by IT professionals). • This model represents relationships by the use of values common for more than one table – keys. Other models such as the hierarchical model and the network model use a more explicit representation of relationships

2.2. Relational databases. • Three key terms are used extensively in relational database models: • Relations – a relation is a table with columns and rows (notice: term relation doesn’t mean the relationship between two or more tables). • Attributes – the named columns of the relation are called attributes. • Domains – the domain is the set of values the attributes are allowed to take. • The basic data structure of the relational model is the table, where information about a particular kind of objects – entity (a student, for example) is represented in columns and rows (also called tuples). Thus, the ”relation” refers to the various different tables in the database – a relation is a set of tuples. The columns enumerate the various attributes of the entity (the student’s name, first name, unique ID number and date of birth, for example), and a row is an actual instance of the entity (particular student) that is represented by the relation. As a result, each tuple (or simply row) of the table of students represents various attributes of a single student.

All relations (tables) in a relational database have to adhere to some basic rules to be qualified as relations: • the ordering of columns is not important in a table, • there can’t be identical tuples (rows) in a table, • each tuple will contain a single (only one at the moment) value for each of its attributes i.e. each tuple has an atomic value. • A relational database contains multiple tables, each similar to the one in the ”flat” database model. One of the advantages of the relational model is that any value occurring in two different records (belonging to the same table or more frequently to different tables), implies a relationship among those two records. In order to enforce explicit integrity constraints, relationships between records in tables can also be defined explicitly, by identifying parent-child relationships characterized by assigning cardinality: • 1 to 1 (or 0), • 1 to Many, • Many to Many.

Tables can also have a designated single attribute or a set of attributes that can act as a ”key”, which can be used to uniquely identify each tuple in the table. Such a unique key is called a primary key. • Keys are commonly used to join or combine data from two or more tables. For example, the Students table may contain a column named Faculty which contains a value that matches the key of a Faculties table (notice that one student can have relationships with many faculties, so this example with only one Faculty column in the Students table is maybe too simplified). • Keys are also critical in the creation of indices (indexes), which facilitate fast retrieval of data from large tables. • Any column can be a key, or multiple columns can be grouped together into a compound key.

2.3. Relational operations. • Users (or programs) request data from a relational database by sending it a query that is written in a special language, usually a dialect of Structured Query Language (SQL). Although SQL was originally intended for end-users, it is much more common for SQL queries to be embedded into software that provides an easier user interface (see the chapter about Data Warehouses). In response to a query, the database returns a result set, which is just a list of rows from one or more (related) tables, containing the answers. • The simplest query is just to return all the rows from a particular, single table which interests us at this moment: • SELECT * FROM <table> • More often, the rows are filtered in some way to return just the answer wanted: • SELECT <column_list> FROM <table> WHERE <condition> • Very often data from multiple tables are combined into one, by doing a JOIN. There are a number of relational operations in addition to JOIN.

2.4. Database normalization. • Database normalization is a technique for designing relational database tables to minimize duplication of information and, in so doing, to safeguard the database against certain types of logical or structural problems (data anomalies). • For example, when multiple instances of a given piece of information occur in a table, the possibility exists that these instances will not be kept consistent when the data within the table is updated, leading to a loss of data integrity. • A table that is sufficiently normalized is less sensitive for problems of this kind. • On the other hand, in systems designed to hold the important electronic documentation (medical, for example) it’s very important to save all the subsequent versions of each document, so sometimes very similar or even identical tuples can be found. Good design should lead to distinguish between them (using the ”version number” column, for example).

Database theory describes a table’s degree of normalization in terms of normal forms of successively higher degrees. A table in third normal form (3NF), for example, is consequently in second normal form (2NF) as well. Higher degrees of normalization typically involve more tables and create the need for a larger number of joins, which can reduce performance. Accordingly, more highly normalized tables are typically used in database applications involving many isolated transactions, while less normalized tables tend to be used in database applications that do not need to map complex relationships between data entities and data attributes. • Although the normal forms are often defined informally in terms of the characteristics of tables, rigorous definitions of the normal forms are concerned with the characteristics of pure mathematical constructs known as relations (more theoretical approach than just tables). Whenever information is represented relationally, it is meaningful to consider the extent to which the representation is normalized.

2.5. Data Warehouses. 2.5.1. Definitions. • A Data Warehouse is something more than database-supported information system or simply database itself. It’s the main repository of all organization’s historical data, its ”corporate memory”. It contains the raw materials for management’s decision support systems like Enterprise Resource Planning (ERP) systems first of all. • The critical factor leading to the use of a Data Warehouse is that a data analyst can perform complex queries and analysis, such as data mining, on the information without slowing down the operation systems or application-level information systems(sometimes called Operational Systems) which are supporting everyday work (accounting, human-resource management and so on).

Formally we can define a Data Warehouse in the following terms: • Subject-oriented – the data in the database is organized so that all the data elements relating to the same real-world event or object are linked together. • Time-variant – all the changes to the data in the database are tracked and recorded so that reports can be produced showing changes over time. • Non-volatile – data in the database is never over-written nor deleted; once committed, the data is static, read-only, but retained for future reporting. • Integrated – the database contains data from most or all of an organization’s operational applications, and that this data is made consistent. • Operational Systems are optimized for simplicity and speed of modification, using Online Transaction Processing (OLTP) for data entry and retrieval, database normalization and an entity-relationship model for clear design. The data warehouse is optimized for reporting and analysis (Online Analytical Processing – OLAP). Frequently data in Data Warehouses are heavily denormalised, summarised or stored in a dimension-based model. However, this is not always required to achieve very short query response times.

2.5.2. History. • Data Warehouses are a distinct type of computer database that were first developed during the late 1980-ties and early 1990-ties. They were developed to meet a growing demand for management information and analysis that could not be met by operational systems. Operational systems were unable to meet this need for a range of reasons: • The processing load of sophisticated reporting was not neutral for the response time of the operational systems, so the everyday work in strongly computer-supported organizations was slower. • The database designs of operational systems were not optimized for advanced information analysis and reporting. • Most organizations had more than one operational system (several domain subsystems), so company-wide reporting couldn’t be supported from a single system. • Development of reports in operational systems often required writing specific computer programs which was slow, difficult to use and expensive.

As a result, separate computer databases began to be built that were specifically designed to support management information and advanced analysis purposes. • These Data Warehouses were able to bring in data from a range of different data sources, such as mainframe computers, minicomputers, as well as personal computers and office automation software such as spreadsheets, and integrate this information in a single place. • This capability, coupled with user-friendly reporting tools and freedom from operational impacts, has led to a growth of this type of computer systems. • As technology improved (lower cost for more performance) and user requirements increased (faster data load cycle times and more features), Data Warehouses have evolved through several fundamental stages: • Off-line Operational Databases – Data Warehouses in this initial stage are developed by simply copying the database of an operational system to an off-line server, where the processing load of reporting does not impact on the operational system’s performance.

Off-line Data Warehouses – Data Warehouses in this stage of evolution are updated on a regular time cycle (usually daily, weekly or monthly) from the operational systems and the data is stored in an integrated reporting-oriented data structure (different from the operational). In this case Extract-Transform-Load (ETL) processes performed by enterprise -level integration tools are responsible for supplying data from operational systems. • Real Time Data Warehouses – Data Warehouses at this stage are updated on a transaction or event basis, every time an operational system performs a transaction (e.g. an order or a delivery or a booking etc.). On-line Enterprise Service Bus (ESB) is responsible for supplying data from operational systems. • Integrated Data Warehouses – Data Warehouses at this stage are used to generate activity or transactions that are passed back into the operational systems for use in the daily activity of the organization. Today’s application-level protocols, defined for data interchange between different domain-oriented applications used by the organization or some organizations working together (HL7 in medicine, for example), were designed on this concept and can be used with ESB subsystems.

2.5.3. Architecture of Data Warehouses. • The concept of ”data warehousing” dates back at least to the mid-1980-ties, and possibly earlier. In essence, it was intended to provide an architectural model for the flow of data from operational systems to decision support environments. It attempted to address the various problems associated with this flow, and the high costs associated with it. • In the absence of such architecture, there usually existed an enormous amount of redundancy in the delivery of management information. In larger corporations it was typical for multiple decision support projects to operate independently, each serving different users but often requiring much of the same data. • The process of gathering, cleaning and integrating data from various sources, often legacy systems (old computer systems or application programs that continue to be used because the organization doesn’t want to replace or redesign them), was typically replicated for each decision support project. Moreover, legacy systems were frequently being revisited as new requirements emerged, each requiring a different view of the legacy data.

Based on analogies with real-life warehouses, Data Warehouses were intended as large-scale collection/storage/staging areas for corporate data. • From here data could be distributed to ”retail stores” or ”data marts” which were tailored for access by decision support users (or ”consumers”). While the Data Warehouse was designed to manage the bulk supply of data from its ”suppliers” (e.g. operational systems), and to handle the organization and storage of this data, the ”retail stores” or ”data marts” could be focused on packaging and presenting selections of the data to end-users, to meet specific management information needs. • Somewhere along the way this analogy and architectural vision was lost, as some vendors and industry speakers redefined the Data Warehouse as simply a management reporting database. This is a subtle but important deviation from the original vision of the Data Warehouse as the hub of a management information architecture, where the decision support systems were actually the ”data marts” or ”retail stores”.

2.5.4. Data storage models for Data Warehouses. • The goal of a Data Warehouse is to bring data together from a variety of existing databases to support management and reporting needs. The generally accepted principle is that data should be stored at its most elementary level because this provides for the most useful and flexible basis for use in reporting and information analysis. • However, because of different focus on specific requirements, there can be alternative methods for design and implementing data warehouses. There are two leading approaches to organizing the data in a data warehouse: the dimensional approach and the normalized approach.

The dimensional approach is very useful in data mart design, but it can result in serious problems of long term data integration and abstraction complications when used in a Data Warehouse. In the ”dimensional” approach, transaction data is partitioned into either a measured ”facts”, which are generally numeric data that captures specific values or ”dimensions”, which contain the reference information that gives each transaction its context: • As an example, a sales transaction could be broken up into facts such as the number of products ordered and the price paid, and dimensions such as date, customer, product (and its price), geographical location and salesperson. • The main advantage of a dimensional approach is that the Data Warehouse is easy for business staff with limited IT experience to understand and use. • Also, because the data is pre-joined into the dimensional form, the data warehouse tends to operate very quickly. • The main disadvantage of the dimensional approach is that it’s difficult to add or change the stored information later, if the company changes the way in which it makes business.

The normalized approach uses database normalization. In this method, the data in the Data Warehouse is stored in 3rd normal form. Tables are then grouped together by subject areas that reflect the general definition of the data (customer, product, finance, etc.). • The main advantage of this approach is that it is quite straightforward to add new information into the database. • The primary disadvantage is that because of the number of tables involved, it can be rather slow to produce information and reports. • Furthermore, since the segregation of facts and dimensions is not explicit in this type of data model, it is difficult for users to join the required data elements into meaningful information without a precise understanding of the data structure(having detailed documentation about database and more advanced skills).

Operational (Domain) Systems Business Intelligence (BI) System ERP BI applications (Data Marts) Sales Accounting Analytical workers Data Warehouse • Integration Tools: • ETL • ESB Personal Resources Production

Application software – office packets, databases and data warehouses.

Application software – office packets, databases and data warehouses.

Presentation Transcript

Data Mining: Association

Business Driven Information Systems 2e

XML and Databases

The Software Infrastructure for Electronic Commerce

Intro to Databases (using Microsoft Access)

Temporal Databases

Data Communication Basics

Data

CHAPTER 2: Data Warehouse: The Building Blocks

Data Management: Databases and Organizations Richard Watson

Chapter 6 DATABASES, DATA WAREHOUSES AND OLAP

Statistics software for the LHC

Essential Software Architecture

Chapter 19: Distributed Databases

Tutorial Document of ITM638 Data Warehousing and Data Mining

Accessing Your Data

Application Security

Chapter 13 Databases and Information Management

SNP Resources: Finding SNPs Databases and Data Extraction

DATABASES AND WEB TECHNOLOGY

15-826: Multimedia Databases and Data Mining