Vertical Processing of Big Data: A New Paradigm in Data Mining

DATA An emerging idea in data mining is to vertically structure data and to process it across those vertical structures. This is in contrast to the traditional method of structuring data into horizontal structures and processing down those horizontal structures (The horizontal structures are called records, e.g., an employee file containing horizontal employee records, likely made up fields such as, Name, Address, Salary, Phone, etc.) Again, Horizontal Processing of Vertical Data (HPVD) is an alternative to the traditional Vertical Processing of Horizontal Data (VPHD). Do we need to structure and process data differently than we have in the past? If so, why? What has changed? Data (digital data) has gotten really BIG and processing speed and power has not kept pace. Data has gotten very wide (in terms of the number of columns or attributes or fields - width of the file, table or dataset) in some cases, but the main explosion has been in the number of instances of the data (The number of rows, tuples or records - the depth of the file, table or data set). Thus the vertical processing of data in traditional VPHD can take forever! How big (deep) is BIG DATA these days and how big will it get? Example: US Library of Congress is storing EVERY tweet sent since Twitter launched in 2006. Each tweet record contains 50 fields (not really terribly wide) but there are billions+ of them (very deep). Let's assume each tweet records is about 1000 bits wide (so, on average, 20 bits per field). Let's estimate approximately 1 trillion tweets from 1 billion tweeters, to 1 billion tweetees over 10 years of tweeting? As a full 3-dimensional dataset, that's 1030 data items (1012 *109 * 109) That's BIG! Is it going to get even bigger? Yes. Let’s look at how the definition of big data has evolved just over my work lifetime My first CS job was as THE technician at St. John’s Univ IBM 1620 Computer Center. 1. I turned the 1620 switch on. 2. I waited for the ready light bulb to come on (~15 minutes) 3. I put the Op/Sys punch card stack on the card reader (~4 inch high stack of cards like ) 4. I put the FORTRAN compiler card stack on the reader (~3 inch high stack) 5. I put the FORTRAN program card stack on the reader (~2 inch high stack) 6. The 1620 produced an object code stack which I read in (~1 inch high stack) 7. I read in the object stack and a 1964 BIG DATA stack (~40 inch high stack) How high would the Big Data STACK reach today if it were put on punch cards? Let's be conservative and assume an exabyte (218 bytes) of data on cards. So how high is an exabyte punch card stack? Take a guess...............? We're being conservative because the US LoC tweet DB may be > ~1030B (if it's fully and losslessly stored in all dimensions: tweet, tweeter, tweetee, hashtag, …)). That exabyte stack of punch cards would reach to JUPITER! So, in my work lifetime, BIG DATA has gone from a 40 inch high stack of cards to one that reaches all the way to Jupiter! What will happen to BIG DATA over your work lifetime? I don’t know, but I predict that any estimate you make will be way low. I must deal with the LoC Tweet Dataset which would reach Jupiter as a punch card stack. To do so, I find that I need to structure it losslessly as 1000 extendable vertical pTrees and write programs to process horizontally across those 1000 vertical structures. You may have to deal with a Tweet Dataset that would reach the end of space (if on cards), but you can structure it losslessly as 1000 extendable vertical pTrees and write programs to process across those 1000 vertical structures horizontally (or deal with it in some new way – there is always a better way to do anytbing!!!). The next generation may have to deal with a data file that creates new space, but if we do our job as data mining researchers, they weill be able to structure it losslessly by 1000 extendable vertical pTrees and write programs to process across those 1000 vertical structures horizontally or deal with it effectively in some new way that you invented!! (or that they invent). You will be able to use my Horizonal processing code! The next generation will be able to use my code (or your new code)! It seems clear to me that BIG DATA WILL HAVE TO BE COMPRESSED and that data will have to be VERTICALLY structured. Let's take a quick look at how one might organize and compressed vertical data (and keep in mind that, so far, we have found that the compression step is unnecessary because processors are so fast at bit AND/OR/COMPLEMENT operations these days/).

predicate Trees = pTrees: slice by column (4 vertical structures). A Vertical Data Structuring Traditional Vertical Processing of Horizontal Data (VPHD) e.g., find the number of occurences of 7 0 1 4 using vertical pTreesfind number occurrences of 7 0 1 4 R[A1] R[A2] R[A3] R[A4] R(A1 A2 A3 A4) 2 7 6 1 6 7 6 0 3 7 5 1 2 7 5 7 3 2 1 4 2 2 1 5 7 0 1 4 7 0 1 4 010 111 110 001 011 111 110 000 010 110 101 001 010 111 101 111 011 010 001 100 010 010 001 101 111 000 001 100 111 000 001 100 010 111 110 001 011 111 110 000 010 110 101 001 010 111 101 111 011 010 001 100 010 010 001 101 111 000 001 100 111 000 001 100 for Horizontally structured, record-oriented data, one scans vertically = pure1? true=1 pure1? false=0 R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43 0 1 0 1 1 1 1 1 0 0 0 1 0 1 1 1 1 1 1 1 0 0 0 0 0 1 0 1 1 0 1 0 1 0 0 1 0 1 0 1 1 1 1 0 1 1 1 1 0 1 1 0 1 0 0 0 1 1 0 0 0 1 0 0 1 0 0 0 1 1 0 1 1 1 1 0 0 0 0 0 1 1 0 0 1 1 1 0 0 0 0 0 1 1 0 0 pure1? false=0 pure1? false=0 pure1? false=0 0 0 0 1 0 01 0 1 0 1 0 1 1. Whole thing pure1? false  0 P11 P12 P13 P21 P22 P23 P31 P32 P33 P41 P42 P43 2. Left half pure1? false  0 P11 0 0 0 0 0 01 3. Right half pure1? false  0 0 0 0 0 1 0 0 10 01 0 0 0 1 0 0 0 0 0 0 0 1 01 10 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 ^ ^ ^ ^ ^ ^ ^ 0 0 1 0 1 4. Left half of rt half? false0 7 0 1 4 0 0 1 0 0 01 5. Rt half of right half? true1 0 *23 0 0 *22 =2 0 1 *21 *20 0 1 0 To count (7,0,1,4)s use 111000001100P11^P12^P13^P’21^P’22^P’23^P’31^P’32^P33^P41^P’42^P’43 = vertically slice off each bit position (12 vertical structures) then compress each bit slice into a treeusing a predicate (We will walk thru the compression of R11 into pTree, P11 ) =2 Base 10 Base 2 R11 0 0 0 0 0 0 1 1 Imagine an excillion records, not just 8 (We need speed!). Record truth of predicate: "purely 1-bits" in a tree, recursively on halves, until the half is pure. More typically, we compress strings of bits not single bits (eg, ≥64 bit strings or strides). So far, we have found that the step of compressing the vertical bit strips or slices into tree structures (pTrees) is unnecessary because processors are so fast at bit AND/OR/COMPLEMENT operations ). P11 But it's pure0 so this branch ends

ANSI schema model Levels of Data Abstraction defined by various schem” levels View 1 View 2 View 3 Many views, Physical schema. Conceptual (logical) schema Views describe how users see data (different data models for different views) Conceptual schema defines logical structure of entire data enterprise Physical schema describes underlying files and indexes used. Conceptual Schema Schemas are defined using Data Definition Languages or DDLs; data are modified/queried using Data Manipulation Languages or DMLs. A typical DBMS has a layered architecture. Structure of a DBMS This is one of several possible architectures. Another with a little more detail on next slide. Physical Schema QUERIES from users (or Transactions or user-workload requests) SQL (or some other User Interface Language) QUERY OPTIMIZATION LAYER Relational Operators (Select, Project, Join) DATABASE OPERATOR LAYER  File processing operators (open,closefile,read/write record FILE MANAGER LAYER (provide the file concept)  Buffer managment operators (read/flush page) BUFFER MANAGER LAYER Disk transfer operators (malloc, read/write block DISK SPACE MANAGER LAYER  DB on DISK Query Opt/Execution DISK SPACE MANAGER deals with space on disk offers an interface to higher layers (mainly the BUFFER MGR) consisting of: allocate/deallocate space; read/write block can be implement on a raw disk system directly, then it would likely access data as follows: read block b of track t of cylinder c on disk d or can use OS file system (OS file = sequence of bytes) then it would likely access data as follows: read bytes b of file f and then the Operating System file manager would translate that into read block b of track t of cylinder c on disk d most systems do not use OS files system for portability reasons and to avoid OS file size peculiarities (limitations) Relational Operators Layers w conc ctrl/ rec Files and Access Methods BUFFER MANAGERpartitions the main memory allocated to the DBMS into buffer page frames,brings pages to and from disk as requested by higher layers (mainly the FILE Mgr). Buffer Management FILE MANAGER supports the file concept to higher layers (DBMS file = collection of records and pages of records), supports access paths to the data in those files (e.g., Indexes). Not all Higher level DBMS code recognizes/ uses page concept. Almost all DBMS use the record concept, though. Disk Space Management DATABASE OPERATOR LAYERimplements physical data model operators (e.g., relational operators; select, project, join. QUERY OPTIMIZER produces efficient execution plans for answering user queries (e.g., execution plans as trees of relational operators: select, project, join, union, intersect translated from, e.g., SQL queries). DB SQL is not adequate to answer all user-database questions, e.g., Knowledge workers working on Data Warehouses ask "what if" questions (On-Line Analytic Processing or OLAP) not retrieval questions (SQL)

name ssn lot Employee name since name dname ssn lot ssn budget lot did Employee Works_In Employee Department super-visor subor-dinate Reports_To Degree=2 relationship between entities, Employees and Departments. Must specify the “role” of each entity to distinguish them. Overview of Database Design Conceptual design: • What are the entities and relationships in the enterprise? • What information about these entities and relationships should be stored in the database? • What integrity constraints or business rules should be enforced? A database `schema’ Model diagram answers these question pictorially (Entity-Relationship or ER diagrams). • Then one maps the ER diagrams into a relational schema (using the Data Definition Language provided) Entity:Real-world object type distinguishable from other object types. An entity is described using a set of Attributes. Each entity set has a key. (which is the chosen identifier attribute(s) and is underlined in these notes) Each attribute has a domain.(allowable value universe) Relationship: Association among two or more entities. E.g., Employee Jonesworks inPharmacy department. Relationships can have attributes too! Degree=2 relationship between an entity and Itself? E.g., Employee Reports_To Employee.

name ssn lot since name dname Employee ssn lot did budget m n Covering yes hours_worked hourly_wages since name dname ISA Employee Manages Department contractid Overlap allowed ssn budget lot did Contract_Emp Hourly_Emp Works_In Employee Department 1 m 1-to-1 1-to Many Many-to-1 Many-to-Many Relationship Cardinality Constraints (1-many) e.g., Manages: It may be required that each dept has at most 1 manager. (1-1) Manages: In addition it may be required that each manager manages at most 1 department. (many-to-many) Works_In: An employee can work in many departments. A dept can have many employees working in it. Participation Constraints: Every department may have to have a manager? This is an example of total participation constraint: the participation of Department in Manages is said to be total (vs. partial). since since name name dname dname ssn did did budget budget lot ISA (`is a’) Hierarchies total We can use attribute inheritance to save repeating shared attributes. Departments Employees Manages If we declare an ISA relationship among entity types, e.g., A ISA B (every instance of A entity is also an instance entity of entity B), then B entities “inherit” A entity attributes e.g., every Hourly_Emp ISA Employee every Contract_Emp ISA Employee Hourly_Emps and Contract_Emps can have their own separate attributes also. Works_In Overlap constraints: Can Joe be an Hourly_Emp and a Contract_Emp? (Allowed/disallowed) Covering constraints: Does every Employee entity also have to be an Hourly_Emp or a Contract_Emp entity? (Yes/no) since

Relational database: a set of relations • Relation: made up of 2 parts: • Instance or occurrence : a table, with rows and columns. #Rows = cardinality, #fields = degree • Schema or type:specifiesname of relation & name, type of each attribute • Students(sid: string, name: string, login: string, age: integer, gpa: real). • Strictly, a relation is a setof tuples but it is common to think of it as a table (sequence of rows made up of a sequence of attribute values) sid name login age gpa 53666 Jones jones@cs 18 3.4 name login Jones jones@cs SQL is one of the simplest languages on earth - very English-like! Specify what, not how. • E.g., SELECT attributes FROM relations WHERE condition What columns you want What rows you want. A major strength of the relational model: supports simple, powerful querying of data. Queries can be written intuitively (specifying what, not how), DBMS is responsible for evaluation. DBMS does your programming! Allows a module called the optimizer to extensively re-order operations (even combine similar operations from different concurrent requests), and still ensure that the answer does not change. Find all 18 year old students (a selection) SELECT * FROM Students S WHERE S.gpa=3.4 Find just names and logins (a projection), replace 1st line: SELECT S.name, S.login FROM Students S WHERE S.age=18

What does the following query produce? SELECT S.name, E.cid FROM Students S, Enrolled E WHERE S.sid=E.sid AND E.grade=“A” suceeds Joined But Select fails sid name login age gpa 53666 Jones jones@cs 18 3.4 53650 Smith smith@ee 18 3.2 we get: S.name E.cid Smith Topology112 Querying Multiple Relations (Join, implemented using nested loop – alternative 1) Where also used to combine (join) S & E

Destroying and Altering Relations (also DDL) Destroys the relation Students. The schema information and the tuples are deleted. DROP TABLE Students ALTER TABLE Students ADD COLUMN Year: integer Schema of Students is altered by adding a new field; every current tuple is extended, e.g., with null in the new field. Adding and Deleting Tuples insert a single tuple using: INSERT INTO Students (sid, name, login, age, gpa) VALUES (53688, ‘Smith’, ‘smith@ee’, 18, 3.2) delete all tuples satisfying some condition (e.g., name = Smith): DELETE FROM Students S WHERE S.name = ‘Smith’ many powerful variants of these commands are available! is a relation constructed from stored or base relations. Store a definition of it, rather than the instance (actual tuples) View CREATE VIEW YoungActiveStudents (name, grade) AS SELECT S.name, E.grade FROM Students S, Enrolled E WHERE S.sid = E.sid and S.age<21 Views can be dropped using the DROP VIEW command. How to handle DROP TABLE if there’s a view on the table? DROP TABLE command has options to let user specify this. Views can be used to present necessary information (or a summary), while hiding details in underlying relation(s). Integrity Constraint, IC:condition that must be true for any instance in the database; e.g., domain constraints.ICs are specified when (or after) relations are created. ICs are checked when relations are modified. A legalinstance of a relation is one that satisfies all its ICs. DBMS should not allow illegal instances. Avoids data entry errors, too! Primary Key Constraints: A set of fields is a key (strictly speaking, a candidate key) for a relation if it satisfies: • (Uniqueness condition) No two distinct tuples can have same values in the key (which may be a composite) • 2. (Minimality condition) The Uniqueness condition is not true for any subset of a composite key. If Part 2 is false, it’s called a superkey (for superset of a key). There’s always at least one key for a relation, one of the keys is chosen (by DBA) to be the primary key, the primary record identification or lookup column(s), E.g., sidis a key for Students. Set {sid, gpa} is a superkey. Foreign keys and Referential Integrity : A field (or set of fields) in one relation used to `refer’ to a tuple in another relation. (by listing the the primary key value in the second relation.) Like a `logical pointer’. E.g. sid in ENROLL is a foreign key referring to sid in Students (sid is the primary key of S)If all foreign key constraints are enforced, a special integrity constraint, referential integrity , is achieved, i.e., no dangling references E.g., if Referential Integrity is enforced (and it almost always is) an Enrolled record cannot have a sid that is not present in Students (students cannot enroll in courses until they register in the school)

Foreign Keys Enrolled Students Only students in Students should be allowed to enroll for courses. Enforcing Referential Integrity: Consider Students and Enrolled; sid in Enrolled is a foreign key that references Students. What should be done if an Enrolled tuple with a non-existent student id is inserted? (Reject it!). What should be done if a Students tuple is deleted? Also delete all Enrolled tuples referring to it? Disallow deletion if an Enrolled tuple refers to it? Set sid in Enrolled tuples that refer to it to a default sid? (sometimes there is a “default default, e.g., set sid in Enrolled tuples to a special value null, denoting `not applicable’ if no other default is specified.) Referential Integrity in SQL: supports all 4 options on deletes and updates. • Default action = NO ACTION(the violating delete/update request is rejected) CASCADE(also delete all tuples that refer to deleted tuple) SET NULL / SET DEFAULT(sets foreign key value of referencing tuple) CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2), PRIMARY KEY (sid,cid), FOREIGN KEY (sid) REFERENCES Students ON DELETE CASCADE ON UPDATE SET NULL)

ICs are based on the real-world enterprise semantics that is described in the DB. I.e., the users decide semantics, not the DB experts! Why? We can check a DB instance to see if an IC is violated, but we can NEVER infer an IC by only looking at the data instances. sid name login age gpa 53666 Jones jones@cs 18 3.4 53688 Smith smith@ee 18 3.5 An IC is a statement about all possibleinstances! It is not a statement that can be inferred from the set of currently existing instances. If ICs were inferred from current instances, then when a relation is newly created and has, say, just 2 tuple, many, many ICs would be inferred (e.g., in the table at right, The system might infer that students MUST be 18 or that names have to be 5 characters or worse yet, that gpa ranking must be the same as alphabetical name ordering! Key and foreign key ICs are the most common. Who decides primary key, etc.?The Database design expert? NO! Not in isolation, anyway. Someone from the enterprise who understands the data and the procedures should be consulted. The following story illustrates this point. CAST: Mr. Goodwrench = MG (parts manager); Pointy-headed Dbexpert = Ph D Ph D I've looked at your data, and decided Part Number (P#) will be designated the primary key for the relation, PARTS(P#, COLOR, WT, TIME-OF-ARRIVAL). MG You're the expert. Ph D Well, according to what I’ve learned in school, P# should be the primary key, because IT IS the lookup attribute! . . . later MG Why is lookup so slow? Ph D You do store parts in the stock room ordered by P#, right? MG No. We store by weight! When a shipment comes in, I take each part into the back room and throw it as far as I can. The lighter ones go further than the heavy ones so they get ordered by weight! Ph D But, but… weight doesn't have Uniqueness property! Parts with the same weight end up together in a pile! MG No they don't. I tire quickly, so the first one I throw goes furthest. Ph D Then we’ll use a composite primary key, (weight, time-of-arrival). MG We get our keys primarily from Curt’s Lock and Key. Point: This conversation should have taken place during the 1st meeting.

ATTRIBUTES are written next to Entity they describe. An ER Example: COMPANY is described to us as follows: Company is organized into depts - each with a name, number, manager. - Each manager has a startdate. - Each dept can have several locations. Departments control projects - each with a name, number, location. Each employee has a name, SSN, sex, address, salary, birthdate, dept, supervisor. - An employee may work on several projects (not necessarily all controlled by his dept) for which we keep hoursworked by project. Each employee dependent has a name, sex, birthdate, relationship. In ER diagrams, entities are represented in boxes: |EMPLOYEE| |DEPENDENT| |DEPT| |PROJECT| An attribute (or property) of an entity describes that entity. An ENTITY has a TYPE, including name and list of its attributes. ENTITY TYPE SCHEMA describes the common structure shared by all entities of that type. Project (Name, Num,Location, Dept) ENTITY INSTANCE = individual occurrence of an entity of a particular type at a particular time (Dome, 46, 19 Ave N & Univ, Athletics) (IACC, 52, Bolley & Centennial, C.S.) (Bean Res, 31, 12 Ave N & Federal, P.S.) . . . Entity Type does not change often. Entity instances get added, changed often - very dynamic CATEGORIES OF ATTRIBUTES: COMPOSITE ATTRIBUTE = subdivided into smaller parts with indep meaning, e.g., Name of Employee subdivided into FName, Minit, LName. Indicated: Name (FName, Minit, LName) Also, WorksOn may be a composite attr of Employee of Project and Hours: WorksOn (Project, Hours) SINGLE-VALUED ATTRIBUTE: one value per entry. MULTIVALUED ATTRIBUTE (repeating grp) have multiple values per entry: eg, Locs (as attribute of Dept since Dept can have multiple locs) Multivalued Attribute, use {Locations} WorksOn may be a mutlivaluedattr of Employee as well composite: {WorksOn (Project,Hours)} DERIVED ATTRIBUTE: value can be calculated from other attribute values. eg, Age calculated from BirthDate and CurrentDate. KEY ATTRIBUTE: Each value can occur at most once. (has uniqueness) Used to identify entity instances. We will * key attribute(s). ATTRIBUTE DOMAIN: Set of values that may be assigned (Value Set). Thus the Preliminary Design of Entity Types for COMPANY db is. *Name-------------. *Name-------------. *Number-----------| *Number-----------| Locations--------|--|DEPARTMENT| Location---------|-|_PROJECT| Manager----------| ControlDepartment' ManagerStartDate-' .---Name |--*SSN .--Employee |__EMPLOYEE__|----|---Sex |-*DependentName |---Address |_DEPENDENT|--|--Sex |---Salary |--BirthDate |---BirthDate `--Relationship |---Department |---Supervisor `---WorksOn Relationships have TYPEs (consisting of names of entities, relationship). A Relationship type diagram for a relationship between EMPLOYEE and DEPARTMENT called "WorksFor" is diagrammed: (in a roundish box) |EMPLOYEE|-( WorksFor )-|DEPARTMENT| RELATIONSHIP INSTANCEs for above relationship might be, eg: ( John Q. Smith, Athletics ), ( Fred T. Brown, Comp. Sci.), ( Betty R. Hahn, Business ). . . RELATIONSHIP DEGREE: Number of participating entities (usually 2). If an entity participates more than once in the same relationship, then ROLE NAMES are needed to distinguish multiple participations. eg, Supervisor, Supervisee in Supervision relationship (Called Reflexive Relationships. One decision that has to be made is to decide whether attribute or relationship is the appropriate way to model, e.g., "WorksOn". Above we modeled it as an attribute of EMPLOYEE {WorksOn(Project,Hours)} The fact that it is multivalued and composite (involving another entity, project) ssuggest that it would be better to model it as a relationship (i.e., it makes a very complex attribute!) WORKS_FOR(EMPLOYEE, DEPARTMENT)

CONSTRAINTS ON A RELATIONSHIP CARDINALITY CONSTRAINT can be 1-to-1 many-to-1 1-to-many or many-to-many 1-1: MANAGES(EMPLOYEE, DEPARTMENT) Each manager MANAGES 1 dept Each dept is MANAGED-BY 1 manager Many-1: WORKS_FOR(EMPLOYEE, DEPARTMENT) Each employee WORKS_FOR 1 dept. Each dept is WORKED_FOR by many emps Many-Many: WORKS_ON(EMPLOYEE, PROJECT) Each employee WORKS_ON many projects. Each project WORKED_ON by many emps PARTICIPATION CONSTRAINT (for an entity in a relationship) can be Total, Partial or Min-Max Total: Every EMPLOYEE WORKS_FOR some DEPARTMENT Partial: Not every EMPLOYEE MANAGES some DEPT RELATIONSHIP can have ATTRIBUTES (properties), eg, Hours for WORKS_ON Relationship, Manager_Start_Date in MANAGES rel. ER Example COMPANY Entity-Relationship Diagram (showing the Schema) (double connecting lines means "total" while single line means partial participation.) ( MANAGES ) 1|| |1 || (WORKS_FOR) | *Name-----------. || 1|| many|| | *Number---------| || || || / {Locations}-----|- DEPARTMENT // / number_employees' /1 // / .----' // / ( CONTROLS ) // / many // / || // / || (SUPERVISE) // / || | | // / || 1| |many // / || 'er| |'ee // / || |____|_______//_____/ .Name(FN,Mi,LN) || |_EMPLOYEE__________|---|-*SSN || // | |-Sex || // | |-Address || Hours-. // | |-Salary || | /many | `-BirthDate \\ (WORKS_ON) | \\ 1| \\ many| | \\ || ( Dependent_0f ) \\ || |many *Nane-. \\_______||___ || *Numb-|--| PROJECT | || Locatn' || || *DependentName---. . Sex--------------|--|| DEPENDENT || BirthDate--------| Relationship-----' 6 RELATIONSHIPS; (role names, if any, above) CARDINALITY RELATIONSHIP ATTRIBUTES ----------- ------------ -(partic below) 1:1 MANAGES (EMPLOYEE, DEPARTMENT) partial total 1:many WORKS_FOR (DEPARTMENT, EMPLOYEE) total total many:many WORKS_ON (EMPLOYEE, PROJECT) total total 1:many CONTROLS (DEPT, PROJECT) partial total Reflexive relationship with role names supervisor supervisee 1:many SUPERVISION (EMP, EMPLOYEE) partial partial 1:many DEPENDENTS_OF ( EMP, DEPENDENT) partial total

Data Issues and Problems: CENTRALITY OF DATA Data are central to every computer program. If a program has no data, there is no input, no output, no constants, no variables... It is hard to imagine a program with no data. Thus, virtually all programs are data management programs and therefore, virtually all computing involves data management. However, not the all data in computer programs is RESIDUALIZED. RESIDUALIZED data is data stored and managed after the termination of the program that generated it (for reuse later). Database Management Systems (DBMSs) store and manage residualized data. WHAT IS A DATABASE? HUGE VOLUME (EVERYONE HAS LOTS OF DATA AVAILABLE TO THEM TODAY!) Data are collected much faster than data are process or managed. NASA's Earth Observation System (EOS), alone, has collected over 15 petabytes of data already (15,000,000,000,000,000 bytes). Most of it will never be use! Most of it will never be seen! Why not? There's so much volume, usefulness of much of it will never be discovered. SOLUTION: Reduce the volume and raise the information density through structuring, querying, filtering, mining, summarizing, aggregating... That's the main task of Data and Database workers today! Claude Shannon's information theory principle comes into play here: More volume means less information. A simple illustration: Which phone book has more useful information? (both have the same 4 data granules; Smith, Jones, 234-9814, 231-7237) BOOK-1BOOK-2 Name NumberName Number Smith 234-9816 Smith 234-9816 Jones 231-7237 Smith 231-7237 Jones 234-9816 Jones 231-7237 The Red Book has no useful information! Data analysis, querying and mining reduce volume and raises info level Proper Structuring of datamay be the second most important task in data and database system! At the highest level, is the decision whether a data set should be structured as horizontal or vertical data (or some combination). Another important task to be addressed in data systems work today is RESIDUALIZATION OF DATA MUCH WELL-STRUCTURED DATA IS DISCARDED PREMATURELY DBs are about storing data persistently, for later use. RESIDUALIZING DATA may be the third most important task! There are many definitions in the literature. An integrated shared repository of operational data of interest to an enterprise. INTEGRATED: it must be the unification of several distinct files. SHARED: same data can be used by more than 1 user (concurrently) REPOSITORY: implies "persistence". OPERATIONAL DATA: data on accounts, parts, patients, students, employees, genes, stock, pixels,... By contrast, nonoperational incl. I/O data, transient data in buffers, queues... ENTERPRISE: bank, warehouse, hospital, school, corp, gov agency, person.. What is a DBMS? A program which organizes and manages access to residual data. Databases also contains METADATA also (data on the data). Metadata is non-user data which contains the descriptive information about the data and database organization (i.e., Catalog data). Why use a DBMS? COMPACTNESS (saves space - no paper files necessary). EASE OF USE(less drudgery, more of the organizational and search work done by the system; user specifies what, not how). CENTRALIZED CONTROL (by DB Administrator (DBA) and by the CEO). REDUCES REDUNDANCY(1 copy is enough, but concurrent use must be controlled NO INCONSISTENCIES(again, since there is only 1 copy necessary). ENFORCE STANDARDS(corporate, dept, industry, national, international). INTEGRITY CONSTRAINTS(automatically maintained) (e.g., GENDER=male => MAIDEN_NAME=null). BALANCE REQUIREMENTS(even conflicting requirements? DataBase Administrator (DBA) can optimize for the whole company). DATA INDEPENDENCE(occurs because apps are immune to storage structure and access strategy changes.

Almost all commerical databases today are HORIZONTAL. That is, the contain horizontally structure data. Horizontal data is data is formed into files of horizontal records of a common type.HORIZONTAL DATA TERMINOLOGY FIELDS, RECORDS, FILES: stored (physical, on disk) logical (as viewed by user) FIELDS, RECORDS, FILES type (e.g., datatype) occurrences (instances) TYPE: defines structure/expected contents (time-independent - changes only upon DB reorg) OCCURRENCE: actual data instances at a given time (time-dependent - changes with every insert/delete/update) STORED RECORDis a named horizontal concatenation of related stored fields. e.g., | Jones | John | 412 Elm St | Fargo | ND | 58102 | an instance field names City St Zip Lname Fname Address Lname(char25), Fname(char15), Address(char20), City(char15), St(char2), Zip(char5) field types Employee | Lname | Fname | Address | City | St | Zip | record and field names EoF End of File marker | Jones | John | 412 Elm |Fargo| ND|58102| record instance | Smith | James | 415 Oak | Mhd |MN|56560| record instance | Thom | Bob | 12 Main | Mhd |MN|56560| record instance | Trath | Phil | 234 12St |Fargo|ND |58105| record instance . . . HORIZONTAL DATA Jones STORED FIELDis the smallest unit of stored data in a database. e.g., is Lname stored field occurrence . Char 25=metadata type of that occurrence. STORED FILEis a named collection of all occurrences of 1 type of stored record.

| Jones | John | 412 Elm |Fargo| ND|58102| | Smith | James | | Thom | Bob | 12 Main | Mhd |MN|56560| | Trath | | 415 Oak | Mhd |MN|56560| | Jones | John |Fargo| ND| | Phil | 234 12St |Fargo|ND |58105| EoF | How these entities are stored and how they are viewed or known to users may differ. They may be known to the users in various logical variations. A logical record based on the 1st occuring employee record above might be: Stored continued The employee file type IS the common employee record type (+ possibly, some other type characteristics, e.g., max-#-records) In todays storage device world, there is only linear storage space, so the 2-D picture of a stored file, strictly speaking, not possible in physical storage media today. Some day there may be truly 2-D storage (e.g., holographic storage) and even 3-D. A more accurately depiction of the store Employee file (as stored on linear storage): So we also have LOGICAL FIELD = smallest unit of logical data LOGICAL RECORD= named collection of related logical fields. LOGICAL FILE = named collection of occurrences of 1 type of logical record which may or may not correspond to the physical entities.

For conceptualizing (logically) and storing (physically) data in a database we have horizontal and vertical models. Here are some of the HORIZONTAL MODELS for files of horizontal records: (in which processing is typically done through vertical scans, e.g., Get and process 1st record. Get and process next record ...) RELATIONAL (simple flat unordered files or relations of records of tuples of unordered field values) TABULAR (ordered files of ordered fields) INVERTED LIST (Tabular with an access paths (index?) on every field) HIERARCHICAL (files with hierarchical links) NETWORK(files with record chains) OBJECT-RELATIONAL (Relational with "Large OBject" (LOBs) fields) (attributes which point to or contain complex objects). DATA MODELS RELATIONAL DATA MODEL The only construct allowed is a [simple, flat] relation for both entity description and relationship definition STUDENT COURSE S# SNAME LCODE C# CNAME SITE |25|CLAY |NJ5101| |8 |DSDE |ND | |32|THAISZ|NJ5102| |7 |CUS |ND | |38|GOOD |FL6321| |6 |3UA |NJ | |17|BAID |NY2091| |5 |3UA |ND | |57|BROWN |NY2092| ENROLL LOCATION S# C# GRADE LCODE STATUS |32|8 | 89 | |NJ5101| 1 | |32|7 | 91 | |NJ5102| 1 | |25|7 | 68 | |FL6321| 4 | |25|6 | 76 | |NY2091| 3 | |32|6 | 62 | |NY2092| 3 | |38|6 | 98 | |17|5 | 96 | Here are some of the VERTICAL MODELS (for vertical vectors or trees of attribute values, processing is typically through logical horizontal AND/OR programs). BINARY STORAGE MODEL (Copeland ~1986) (This model used vertical value and bit vectors. It has virtually dissappeared!) BIT TRANSPOSE FILES (Wang ~1988) (This model used vertical bit files. It has also virtually dissappeared!) VIPER STRUCTURES (~1998) (Used vertical bit vectors for data mining.) PREDICATE-Trees or Ptrees (This model and technology is patented by NDSU and uses vertical bit trees) (~1997). The STUDENT and COURSE relations represent entities The LOCATION relation represents a relationship between the LCODE and STATUS attributes (1-to-many). The ENROLL relations represents a relationshipbetween Student and Course entities (a many-many relationship)

HIERARCHICAL DATA MODELentities=records relationships=links of records forming trees 25|CLAY|OTBK 32|THAISZ|KNB 38|GOOD|GTR STUDENTS 7|CUS 8|DSDE 6|3UA COURSES 6|3UA 7|CUS 6|3UA ND|68 ND|89 NJ|98 ENROLLMENTS NJ|76 ND|62 ND|91 EX: root type is STUDENT (with attributes S#, NAME, LOCATION), dependent type is COURSE (with attributes C#, CNAME), 2nd-level dependent type ENROLLMENT (with attributes, GRADE, LOC) If the typical workload involves producing class lists for students, this organization is very good. Why? If the typical workload is producing course enrollment lists for professors, this is very poor. Why? The problem with the Hierarchical Data Model is that it almost always favors a particular workload category (at the expense of the others).

NETWORK DATA MODELentities = records relationships = owner-member chains (sets) 25| CLAY | MJ511 32 | THAISZ | NJ512 STUDENT records 68 76 89 91 62 ENROLLMENT records page1 RRN S# ST STATE-INDEX | 0 | 25 |NJ| RID STATE | 1 | 32 |NJ| |1,2| FL | | 2 | 38 |FL| |1,0| NJ | | 3 | 47 |NY| |1,1| NJ | |1,3| NY | page2 |2,0| NY | | 0 | 57 |NY| | | | | | | | | | | | | 8|DSDE|ND 7|CUS |ND 6|3UA |NJ COURSE records many-to-many relationships easily accomodated EX: 3 entities (STUDENT ENROLLMENT COURSE) 2 owner-member chains: STUDENT-ENROLLMENT COURSE-ENROLLMENT Easy to insert (create new record and reset pointers), delete (reset pointers), update (always just 1 copy to worry about, ZERO REDUNDANCY!) network approach: fast processing, complicated structure (usually requires data processing shop) Again, it favors one workload type over others. INVERTED LIST MODEL (TABULAR): Flat Ordered Files (like relational except there's intrinsic order visible to user programs on both tuples and attributes). Order is usually "arrival order", meaning each record is given a unique "Relative Record Number" or RRN when it is inserted. - RRNs never change (unless there is a reorganization). Programs can access records by RRN. Physical placement of records on pages is in RRN order ("clustered on RRN" so that application programs can efficiently retrieve in RRN order. Indexes, etc can be provided for other access paths (and orderings). Object Relational Model (OR model) is like relational model except repeating groups are allowed (many levels of repeating groups - even nested repeating groups) and Pointers to very complex structures are allowed. (LOBs for Large OBjects, BLOBs for Binary Large OBjects, etc. for storing, e.g., pictures, movies, and other binary large objects.

Vertical Processing of Big Data: A New Paradigm in Data Mining

Vertical Processing of Big Data: A New Paradigm in Data Mining

Presentation Transcript

Data Mining: Data

Data Mining: Data

Data Mining: Data

Data, Data Everywhere

Data, Data, and more Data

Data, Data, and more Data

Data Data Data

Data Mining: Data

Data ! Data! Data!

Data, Data Everywhere….

Faster data – Better data – Cheaper data

DATA? What DATA?

Data Mining: Data

Data Mining: Data

Data Mining: Data

Data Mining: Data

Data Begets Data

Data, Data, Everywhere...

Data, Data, and More Data

Data Mining: Data

Data Mining: Data