The Age of Big Data
Thanks to Kayvan Tirdad at York University and Kalapriya Kannan at IBM
Contents
Introduction: Explosion in Quantity of Data
Big Data Characteristics
Cost Problem (example)
Importance of Big Data
Usage Example in Big Data
Some Challenges in Big Data
Other Aspects of Big Data
What we have / What we want
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
As of 2012, limits on the size of data sets that are feasible to process in a reasonable amount of time were on the order of exabytes of data. Scientists regularly encounter limitations due to large data sets in many areas, including meteorology, genomics, connectomics, complex physics simulations, and biological and environmental research. The limitations also affect Internet search, finance and business informatics. Data sets grow in size in part because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification readers, and wireless sensor networks. The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, 2.5 exabytes (2.5×10^18 bytes) of data were created every day. The challenge for large enterprises is determining who should own big data initiatives that straddle the entire organization.
Big data is difficult to work with using most relational database management systems and desktop statistics and visualization packages, requiring instead "massively parallel software running on tens, hundreds, or even thousands of servers". What is considered "big data" varies depending on the capabilities of the organization managing the set, and on the capabilities of the applications that are traditionally used to process and analyze the data set in its domain. "For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration."
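The "massively parallel" processing quoted above typically follows a partition, process, merge pattern. The following is a minimal single-process Python sketch of that pattern, not a real cluster framework; the function names (`count_words`, `merge_counts`) and the sample partitions are hypothetical and chosen only for illustration.

```python
from collections import Counter
from functools import reduce

def count_words(partition):
    """Map step: count words in one partition (one server's share of the data)."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts

def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right

# Hypothetical sample data, split into partitions as if spread across servers.
partitions = [
    ["big data is big", "data grows"],
    ["big clusters process data"],
]

# On a cluster, each partition would be processed on a different machine.
partial_counts = [count_words(p) for p in partitions]
total = reduce(merge_counts, partial_counts, Counter())
print(total["big"], total["data"])
```

Because each partition is counted independently, the map step could in principle be spread over any number of servers, with only the small partial results needing to be merged.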
This data is “big data.”
1 yottabyte
X 6,000,000 = 1 (40 TB/S)
Airbus A380
640 TB per flight
Twitter generates approximately 12 TB of data per day
New York Stock Exchange: 1 TB of data every day
Storage capacity has doubled roughly every three years since the 1980s
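The doubling periods quoted in this deck (every 40 months, or roughly every three years) can be turned into an implied yearly growth rate. This is a minimal Python sketch that assumes smooth exponential growth, which the slides do not state; the helper name `annual_growth_rate` is hypothetical.

```python
def annual_growth_rate(doubling_months):
    """Yearly growth rate implied by a doubling period, assuming exponential growth."""
    return 2 ** (12 / doubling_months) - 1

# 40 months and 36 months ("roughly three years") are the two figures in the text.
for months in (40, 36):
    rate = annual_growth_rate(months)
    print(f"doubling every {months} months ~ {rate:.1%} growth per year")
```

Under that assumption, both figures correspond to capacity growing by roughly a quarter each year.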
Average Monthly Temperature of land and ocean
How big is the Big Data?
- What is big today may not be big tomorrow
Big Data Vectors (3Vs)
"Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization"
Cost of processing 1 petabyte of data with 1000 nodes?
1 PB = 10^15 B = 1 million gigabytes = 1 thousand terabytes
2000 × $3,060 = $6,120,000
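The unit conversions and the multiplication above can be checked with a short Python sketch. It assumes decimal (SI) units, as the slide does, and it only reproduces the slide's figures without interpreting them: the slide does not say what the 2000 and the $3,060 stand for, so the variable names `quantity` and `unit_cost_usd` are neutral placeholders.

```python
# Decimal (SI) unit sizes in bytes.
PETABYTE = 10 ** 15
TERABYTE = 10 ** 12
GIGABYTE = 10 ** 9

gigabytes_per_pb = PETABYTE // GIGABYTE
terabytes_per_pb = PETABYTE // TERABYTE

# Figures copied from the slide; their meaning is not stated there.
quantity = 2000
unit_cost_usd = 3060
total_cost_usd = quantity * unit_cost_usd

print(f"1 PB = {gigabytes_per_pb:,} GB = {terabytes_per_pb:,} TB")
print(f"{quantity} x ${unit_cost_usd:,} = ${total_cost_usd:,}")
```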
Growing at almost 10% a year (roughly twice as fast as the software business)
US 2012 Election
- data mining for individualized ad targeting
- Orca big-data app
- YouTube channel (23,700 subscribers and 26 million page views)
- Ace of Spades HQ
- predictive modeling
- drive traffic to other campaign sites
- Facebook page (33 million "likes")
- YouTube channel (240,000 subscribers and 246 million page views)
- a contest to dine with Sarah Jessica Parker
- Every single night, the team ran 66,000 computer simulations
- Reddit!!!
- Amazon Web Services
Data Analysis prediction for US 2012 Election
Media continued reporting the race as very tight
Drew Linzer, June 2012
332 for Obama,
206 for Romney
Nate Silver's FiveThirtyEight blog
Predicted Obama had an 86% chance of winning
Predicted all 50 states correctly
Sam Wang, the Princeton Election Consortium
Put the probability of Obama's re-election at more than 98%
Oakland Athletics baseball team and its general manager Billy Beane
- The Oakland A's front office took advantage of more analytical gauges of player performance to field a team that could compete successfully against richer competitors in MLB
- Oakland had approximately $41 million in salary; the New York Yankees had $125 million in payroll that same season
- Oakland was forced to find players undervalued by the market
- Moneyball had a huge impact on other teams in MLB
And there is a Moneyball movie!
Six Provocations for Big Data
1- Automating Research Changes the Definition of Knowledge
2- Claims to Objectivity and Accuracy are Misleading
3- Bigger Data are not always Better data
4- Not all Data are equivalent
5- Just because it is accessible doesn’t make it ethical
6- Limited access to big data creates new digital divides
Five Big Questions about Big Data:
1- What happens in a world of radical transparency, with data widely available?
2- If you could test all your decisions, how would that change the way you compete?
3- How would your business change if you used big data for widespread, real-time customization?
4- How can big data augment or even replace management?
5- Could you create a new business model based on data?
'Big data' is similar to 'small data', but bigger
... but bigger data requires different approaches:
Techniques, tools, architecture
… with an aim to solve new problems
Or old problems in a better way
4.6 billion camera phones worldwide
30 billion RFID tags today (1.3B in 2005)
12+ TBs of tweet data every day
100s of millions of GPS-enabled devices sold annually
? TBs of data every day
2+ billion people on the Web by end 2011
25+ TBs of log data every day
76 million smart meters in 2009… 200M by 2014
Understand and navigate federated big data sources
Federated Discovery and Navigation
Hadoop File System
Manage & store huge volume of any data
Structure and control data
Manage streaming data
Text Analytics Engine
Analyze unstructured data
Integrate and govern all data sources
Integration, Data Quality, Security, Lifecycle Management, MDM