
Challenges and Opportunities in Efficient Management of Big Data

Explore the challenges, classification, and sources of Big Data, as well as methods to overcome them. Learn about current achievements and the future of Big Data management.

Presentation Transcript


1. Challenges and Opportunities in Efficient Management of Big Data
Assoc. Professor Bela Stantic
Director, IIIS Big Data and Smart Analytics Lab
Deputy Head of School, School of Information and Communication Technology
Institute for Integrated and Intelligent Systems (IIIS)

2. Overview
• Facts about traditional DBMSs
• The catch-phrase "Big Data": what it is, its forms and characteristics
• Sources of Big Data and their classification into groups
• Challenges and methods to overcome them, for each group
• Current achievements
• Concluding remarks: the future of Big Data management

3. Traditional Database Management Systems
[Architecture diagram: background processes, System Global Area, redo log files, data files, control files, servers, users]
• Disk based and SQL oriented
• Data is stored in heavily encoded disk blocks
• A main-memory buffer pool improves performance; blocks are moved between memory buffers and disk
• Optimization targets CPU usage and disk I/Os (the best way to execute SQL commands)
• The fundamental operation is a read or an update of a table row

4. Traditional Database Management Systems
• Indexing is done by B-trees, in clustered or unclustered form, or by hash indexes [Comer, 79]
• Concurrency control: dynamic row-level locking (ACID)
• Crash recovery: write-ahead log, ARIES [Mohan, 93]
• Replication: mostly by updating the primary node, then shipping the log over the network to the other sites and rolling forward
• Due to these proven capabilities it is considered that traditional database management systems fit all purposes
• Commercial traditional DBMSs successfully absorbed new concepts and trends in the past (object databases, XML databases)
• These brief details indicate robustness

5. One size does not fit all!
• Three decades of commercial DBMS development can be summed up in a single phrase: "one size fits all"
• Growth in application domains and database sizes, as well as the variety of captured data, commonly called "Big Data", began to cause problems for this technology, which proved too heavyweight to answer the requirements of new demands
• Stonebraker argued in his 2006 paper that one size does not fit all
• A system originally designed and optimized for business data processing is not the best solution for data-centric applications
• The traditional DBMS architecture is no longer the best option for the whole database market, and the commercial world will need to venture into a collection of independent database concepts

6. What is Big Data?
Image source: http://pivotcon.com/three-ways-to-use-big-data-to-be-a-better-marketer/

7. Big Data
• Dataset size: beyond the ability of current systems to collect, process, retrieve and manage
• Collections in traditional legacy databases and data warehouses, social networking and media, mobile devices, logs, data streams generated by remote sensors and other IT hardware, e-science (computational biology, bioinformatics, genomics, astronomy, etc.)
• A typical feature of most Big Data is the absence of a schema, which causes problems with integration
• The value of information explodes when data can be linked
• Google tracked the spread of influenza from search queries (Nature paper, 2009)
Image source: http://flickr.com/photos/53242483@N00/5839399412

8. Characteristics of Big Data 1. Volume
• Data volume is increasing exponentially; in the next five years these data volumes will grow by a factor of 8
• eBay holds 90 PB; Facebook stores 50 billion photos
• The Large Hadron Collider produces 40 TB/sec
• The amount of information individuals create themselves (writing documents, taking pictures, downloading music, etc.) is far less than the amount of information being created about them

9. Characteristics of Big Data 2. Velocity
• How quickly data is being produced, and how quickly it must be processed to meet the demand for extracting useful information
• Sources of high-velocity data include website log files, devices that log events, etc.
• Social media: Twitter handles 500 million tweets/day (peak 150,000 tweets/sec); Facebook has 800 million active users and 4.5 billion likes/day
• The value of this data degrades over time: useful information must be extracted in a timely manner, otherwise it loses meaning

10. Characteristics of Big Data 3. Variety
• A single application can generate and collect many types of data
• Many formats and types: structured, unstructured, semi-structured, text (even in different languages), media, etc.
• This causes problems not only for storage and efficient management, but also for mining and analyzing the data
• To extract knowledge, all these types of data need to be linked together

11. Characteristics of Big Data 4. Veracity
• Relates to the reliability of data and the predictability of inherently imprecise data with noise and abnormalities
• Data must be meaningful to the problem being analyzed
• The biggest challenge is to keep your data clean

12. Characteristics of Big Data
• Variety and Velocity work against the Veracity of the data: they decrease the ability to cleanse the data before information is extracted
• 5th V, Value: is the data worthwhile, and does it have value for the business?
• 6th V, Variability: different meanings of data
• 7th V, Volatility: how long the data remains valid and relevant to the analysis and decisions

13. Sources of Big Data
• We are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner
• Today's data can generally be divided into three main groups:
• Online Transaction Processing (OLTP): essentially all forms of legacy systems, driven by ACID properties
• Online Analytical Processing (OLAP), or data warehousing
• Real-Time Analytic Processing (RTAP): new types of data, together with huge volume and velocity, pose significant demands, and it is evident that traditional database management systems cannot answer them in real time. This group is starting to dominate, both in the size of the collected data and in the useful information it can provide

14. Efficient Management of Data in OLTP
• Data is stored in the form of rows in relations
• Each relation has a primary key
• There can be one or more secondary indexes
• B-tree indexes are the default
• Hash-based indexes are also available
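As a toy illustration of the two index types above (not tied to any particular DBMS), a sorted list can stand in for a clustered B-tree and a Python dict for a hash index; the sample rows are invented:

```python
import bisect

# Toy table of (id, name) rows, "clustered" by primary key: kept sorted
# so range scans are sequential, as in a chain of B-tree leaves.
rows = sorted([(3, "c"), (1, "a"), (7, "g"), (5, "e")])
keys = [k for k, _ in rows]

def btree_range(lo, hi):
    """Range scan over [lo, hi] via binary search -- B-trees support this."""
    i = bisect.bisect_left(keys, lo)
    j = bisect.bisect_right(keys, hi)
    return rows[i:j]

# Hash index: O(1) equality lookups, but no ordered range scans.
hash_index = {k: v for k, v in rows}

print(btree_range(2, 6))   # [(3, 'c'), (5, 'e')]
print(hash_index[7])       # g
```

The contrast motivates why B-trees are the default: most OLTP workloads mix equality and range predicates, and only the ordered structure serves both.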

15. Efficient Management of Data in OLTP
• Spatial data is most often managed by the R-tree family of indexes [Guttman, 1984], while some external structures are also utilized
• Efficient management of temporal data represents a challenge:
• The RI-tree [Kriegel, 2000]: upper and lower composite indexes with dedicated query transformations
• The TD-tree [Stantic, 2010]
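The RI-tree's internals are beyond a slide, but the query transformation it builds on can be sketched: an interval (s, e) intersects a query window [qs, qe] iff s <= qe and e >= qs, two simple predicates that ordinary composite B-tree indexes can each serve. A minimal sketch with made-up intervals, showing only the transformed predicate:

```python
# Toy set of valid-time intervals (start, end).
intervals = [(1, 4), (3, 8), (6, 9), (10, 12)]

def intersecting(qs, qe):
    """Intervals intersecting [qs, qe]: s <= qe AND e >= qs.
    In an RI-tree-style scheme each conjunct maps onto a composite
    B-tree index instead of this linear scan."""
    return [(s, e) for (s, e) in intervals if s <= qe and e >= qs]

print(intersecting(5, 7))   # [(3, 8), (6, 9)]
```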

16. Multidimensional Data in OLTP
• Efficient management of multidimensional data in commercial environments:
• Multiple secondary indexes
• Compound indexes: a compound index outperforms secondary indexes when the result set is greater than 0.000015% of the relation's population
• At a high number of dimensions a full table scan is faster (the curse of dimensionality)!
• The UB-tree [Bayer, 2000]
• The VG-Curve [Stantic, 2013]
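The UB-tree linearizes multidimensional keys with a Z-order (Morton) space-filling curve so that a plain B-tree can index them. A rough sketch of the bit-interleaving step, with invented points:

```python
def z_value(x, y, bits=8):
    """Interleave the bits of x and y into one Morton (Z-order) value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # x bits at even positions
        z |= ((y >> i) & 1) << (2 * i + 1)    # y bits at odd positions
    return z

# Points close together in 2-D tend to be close on the one-dimensional
# curve, so a standard B-tree over z_value can answer box queries.
points = [(2, 3), (2, 2), (50, 60), (3, 3)]
print(sorted(points, key=lambda p: z_value(*p)))
# [(2, 2), (2, 3), (3, 3), (50, 60)]
```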

17. Efficient Management of Data in OLAP
• At present OLAP also relies on traditional DBMSs; however, the requirement for ACID properties can be relaxed because operations are mostly read-only
• The main attention is given to speed, and therefore OLAP systems are mostly based on indexes
• B-tree indexes and a variety of bitmap indexes [Maayan, 1985] with different compression schemes are utilized
• To improve response time, different methods of identifying and materializing views have also been proposed in the literature [Stantic, 2006]
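A bitmap index can be sketched in a few lines: one bit-vector per distinct value of a low-cardinality column, so conjunctive predicates become cheap bitwise ANDs. The columns here are invented sample data:

```python
# Two low-cardinality OLAP columns over six rows.
region = ["EU", "US", "EU", "APAC", "US", "EU"]
status = ["paid", "open", "paid", "paid", "open", "open"]

def bitmap(column):
    """One integer bit-vector per distinct value; bit i = row i matches."""
    maps = {}
    for i, v in enumerate(column):
        maps[v] = maps.get(v, 0) | (1 << i)
    return maps

region_bm = bitmap(region)
status_bm = bitmap(status)

# WHERE region = 'EU' AND status = 'paid'  ->  bitwise AND of two vectors
hits = region_bm["EU"] & status_bm["paid"]
print([i for i in range(len(region)) if hits >> i & 1])   # [0, 2]
```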

18. Efficient Management of Data in OLAP
• Column stores can be used; they can perform faster than row stores by a factor of 50 or even 100
• The fact table joins with dimension tables, which provide details about products, customers, stores, etc. These tables are organized in a star or snowflake schema
• Compression is also much easier and more productive in a column store, as each block holds only one kind of attribute
• As the advantages of column stores have been recognized, several column-oriented databases are available, including Sybase IQ, Amazon, Google BigQuery, Vertica, Teradata, etc.
• In 2012 IBM released DB2 (code name Galileo) with row and column access control, which enables fine-grained control
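Why compression is easier on single-attribute blocks can be shown with simple run-length encoding (the column values are made up; production column stores use more elaborate schemes, but the principle is the same):

```python
from itertools import groupby

# A sorted column compresses extremely well: long runs of equal values.
country_column = ["AU"] * 4 + ["DE"] * 3 + ["US"] * 5

def rle_encode(col):
    """Run-length encoding: (value, run length) pairs."""
    return [(v, len(list(g))) for v, g in groupby(col)]

encoded = rle_encode(country_column)
print(encoded)   # [('AU', 4), ('DE', 3), ('US', 5)]

# Aggregation can run directly on the runs, without decompressing:
print(sum(n for v, n in encoded if v != "US"))   # 7
```

A row store cannot do this, because consecutive bytes in a block belong to different attributes with unrelated values.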

19. Efficient Management of Data in RTAP
• In the past, several main sources generated data and everyone else consumed it
• Today all of us generate data, and all of us are also consumers of this shared data
• This new pattern of data generation, and the number of users, requires more feasible scaling solutions than traditional database architectures offer
• To efficiently exploit this new resource there is a need to scale both infrastructures and techniques

20. RTAP Data Management Architectures
• One of the biggest issues is the scalability of traditional DBMSs
• Vertical scaling (also called scale-up)
• Horizontal scaling (scale-out) can ensure scalability more effectively and cheaply
• Data is distributed horizontally across the network: "share nothing"
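One common way to realize the share-nothing horizontal distribution above is hash partitioning of rows across nodes. A minimal sketch (the node names and keys are invented; real systems typically use consistent hashing so that adding a node does not reshuffle every key):

```python
import hashlib

# Three shared-nothing nodes; each row lives on exactly one of them.
NODES = ["node-0", "node-1", "node-2"]

def node_for(key):
    """Route a row to a node by hashing its key: deterministic, so any
    client can locate the owning node without a central catalog."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

for user in ["alice", "bob", "carol"]:
    print(user, "->", node_for(user))
```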

21. Data Management Architectures
• Hadoop is now used as a highly scalable data-intensive platform
• The open-source software Hadoop is based on the MapReduce framework and the Hadoop Distributed File System (HDFS)
• The complexity of data-processing tasks in such architectures is minimized using programming models like MapReduce
• However, an effective implementation of the relational join operation in MapReduce requires a special approach to both data distribution and indexing
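The special treatment a MapReduce join needs can be illustrated with the standard reduce-side (repartition) join: both inputs emit (join key, tagged record) pairs, the shuffle groups them by key, and the reducer pairs the two sides. A plain-Python sketch with invented tables:

```python
from collections import defaultdict

orders = [(1, "book"), (2, "pen"), (1, "lamp")]   # (customer_id, item)
customers = [(1, "Ann"), (2, "Bo")]               # (customer_id, name)

def map_phase():
    """Emit (join_key, tagged_record) for both inputs."""
    for cid, item in orders:
        yield cid, ("O", item)
    for cid, name in customers:
        yield cid, ("C", name)

def reduce_phase(pairs):
    """Group by key (the shuffle), then pair customer and order records."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    out = []
    for vals in groups.values():
        names = [v for tag, v in vals if tag == "C"]
        items = [v for tag, v in vals if tag == "O"]
        out.extend((name, item) for name in names for item in items)
    return sorted(out)

print(reduce_phase(map_phase()))
# [('Ann', 'book'), ('Ann', 'lamp'), ('Bo', 'pen')]
```

Note that every record of both inputs crosses the network during the shuffle, which is exactly why data distribution and indexing matter for join performance on Hadoop.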

22. NoSQL Databases
• To accommodate data with no schema, and for Real-Time Analytic Processing, NoSQL databases have been utilized
• NoSQL means "not only SQL" or "no SQL at all"
• The NoSQL concept dates back to the 1990s. These systems provide simpler scalability and improved performance compared to traditional relational databases
• In NoSQL databases the data model is described intuitively, without any formal foundations

23. NoSQL Databases
• In general, NoSQL databases can be classified into the following groups:
• Key-value stores: SimpleDB, Redis, Memcached, Dynamo, Voldemort
• Column-oriented: BigTable, HBase, Hypertable, Cassandra, PNUTS
• Document-oriented: MongoDB, CouchDB
• To improve performance, some NoSQL databases are in-memory databases (Redis and Memcached)
• ACID properties are not implemented fully; databases may be only eventually consistent or weakly consistent
• List of NoSQL databases: http://nosql-database.org/
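Eventual consistency is often resolved with a last-write-wins rule, a common fallback in Dynamo-style key-value stores when replicas diverge. A minimal single-node sketch of that reconciliation rule (the class and timestamps are illustrative, not any product's API):

```python
class KVStore:
    """Toy key-value store with last-write-wins reconciliation."""

    def __init__(self):
        self.data = {}   # key -> (timestamp, value)

    def put(self, key, value, ts):
        # A write is applied only if it is at least as recent as the
        # stored version; a stale replicated write is silently dropped.
        old = self.data.get(key)
        if old is None or ts >= old[0]:
            self.data[key] = (ts, value)

    def get(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

replica = KVStore()
replica.put("user:1", "draft", ts=100)
replica.put("user:1", "final", ts=200)
replica.put("user:1", "stale", ts=150)   # late replicated write: ignored
print(replica.get("user:1"))             # final
```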

24. NewSQL Databases
• A more general category of parallel DBMSs, called NewSQL databases, is designed to scale out horizontally on shared-nothing machines while ensuring transparent partitioning, still guaranteeing ACID properties, and employing lock-free concurrency control
• Applications interact with the database primarily using SQL
• NewSQL provides performance and scalability beyond what traditional DBMSs, or Hadoop, can offer
• For example, a comparison of Hadoop and Vertica showed that query times for Hadoop were far slower, by 1-2 orders of magnitude
• In the context of Big Analytics, most NewSQL representatives are suited to real-time analytics (e.g., ClustrixDB, Vertica, and VoltDB) or columnar storage (Vertica)
• In general, however, the performance of NewSQL databases is still a problem

25. Concluding Remarks
• It is evident that "one size does not fit all". Different types of data and different requirements pose different challenges and needs, which can be answered efficiently only with different techniques and different concepts
• This is an interesting time in databases: there are many new database ideas as well as products
• In the past, commercial traditional database management systems successfully added features to address different trends and needs, and as a result they became too large and too resource-demanding
• None of the newly proposed concepts will, in the near future, guarantee the requirements of OLTP application domains; a new product that satisfied all requirements would most likely also be too complex and resource-demanding
• Therefore traditional database management systems are here to stay for OLTP applications, at least for a while
• However, OLTP is expected to move toward main-memory systems, so current indexing methods will need to evolve into main-memory search

26. Concluding Remarks
• In relation to OLAP, answering the needs of business intelligence, data warehouses will grow bigger and bigger
• To work efficiently with this volume of data, data warehouses will need to be column-store based
• We are already witnessing this change
• However, data warehouses will need to head in the direction of defining and building an execution engine that processes SQL without a MapReduce layer (similar to the Hive concept)
• Such a concept could successfully compete in the data warehousing market

27. Concluding Remarks
• In Real-Time Analytic Processing, changes are expected to be driven by demand for complex analytics that can predict the future, not just provide valuable information from data
• This will become possible as more and more data containing useful information is captured from different sources
• One of the new sources will be the "Internet of Things" (RFID)
• A big challenge will be to process all that data efficiently and extract useful information in real time
• The current NoSQL database concept might survive, but it will only be utilized in applications with schema-later concepts and, obviously, where ACID properties are not essential (e.g., logs)
• We already witness main NoSQL databases such as Cassandra and MongoDB moving toward SQL, and also toward ACID
• Initially NoSQL meant "no SQL", then "not only SQL"; at present the most accurate definition would be "not yet SQL"

28. Concluding Remarks
• The Hadoop stack will need to change into something different, as it is only good for a very small fraction of applications, namely those that are highly parallel
• HDFS most likely will not survive, as it is very inefficient
• New data management architectures, e.g. distributed file systems and NoSQL databases, can solve Big Data problems only partially
• NewSQL systems are addressing these shortcomings; several promising systems in this category are VoltDB, Hana (SAP), and SQLFire
• Distributed multidimensional indexing will become more important as the number of nodes grows, and it will need to overcome communication overhead, which is excessive when routing query requests between compute nodes
• Also, the current index framework cannot meet the velocity of Big Data, because the communication overhead of synchronizing the local and global indexes is huge
• To overcome these problems a completely new index framework will most likely need to be introduced, and we are working on it
