Boston KM Forum. How big data becomes actionable information Tweaked version of Gilbane big data presentation Other Gilbane Conference impressions And some open source/content management market dynamics slides Discussion. Big Data 101 Agenda. Big data in context Recap Risks
Boston KM Forum • How big data becomes actionable information • Tweaked version of Gilbane big data presentation • Other Gilbane Conference impressions • And some open source/content management market dynamics slides • Discussion
Big Data 101 Agenda • Big data in context • Recap • Risks • Recommendations
Big Data in Context • What is “big data”? • Unhelpfully, both “big data” and “NoSQL,” generally considered a key part of the big data wave, are defined more in terms of what they aren’t than what they are • A typical big data definition (Wikipedia): • “[…] data sets that grow so large that they become awkward to work with using on-hand database management tools” • Often associated with Gartner’s volume, variety (and complexity), and velocity model • Also value and veracity considerations
Big Data in Context • Why is big data a big deal now? • The need to deal with really big data sources, e.g., Web site logs, social network activities, and sensor network feeds • Commoditized hardware, software, and networking • Capability and price/performance curves that continue to defy all economic “laws” • Cloud services with radical new capability/cost equations • Maturation and uptake of related open source software, especially Hadoop • Powerful and often no- or low-cost
Big Data in Context • Why is big data a big deal now (continued)? • Market enthusiasm for “NoSQL” systems • Which often simply means Hadoop • Useful and often “open source”/public domain data sources and services • Mainstreaming of semantic tools and techniques • Overall: many things that used to be complex, expensive, and scarce • Are now relatively straightforward, inexpensive, and abundant
Big Data in Context • Big data reality checks • Most decision-makers don’t want big data per se; instead, they probably want • Relevant, accurate, and timely answers to big questions • Including alerts pertaining to questions they may or may not have asked yet • The ability to purposefully analyze information without having to master arcane technologies • It’s more about the ability to formulate and ask big questions (and to effectively analyze and act on answers) than it is about related technologies
Hadoop • Hadoop is often considered central to big data • Originating with Google’s MapReduce architecture, Apache Hadoop is an open source architecture for distributed processing on networks of commodity hardware • From Wikipedia: • “‘Map’ step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes • ‘Reduce’ step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve”
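The map and reduce steps quoted above can be sketched in a few lines of plain Python. This is a minimal single-process illustration of the pattern, not Hadoop's API: the function names (`map_step`, `shuffle`, `reduce_step`, `word_count`) and the sample input are assumptions made for this example, and a real cluster would distribute the chunks to worker nodes rather than loop over them locally.

```python
from collections import defaultdict


def map_step(chunk):
    """Map: turn one chunk of input into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the mapped values by key so each key is reduced once."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_step(key, values):
    """Reduce: combine all values for one key into a single answer."""
    return key, sum(values)


def word_count(chunks):
    """Run map, shuffle, and reduce in one process (hypothetical driver)."""
    mapped = [pair for chunk in chunks for pair in map_step(chunk)]
    grouped = shuffle(mapped)
    return dict(reduce_step(key, values) for key, values in grouped.items())


# Each string stands in for a sub-problem handed to a worker node.
counts = word_count(["big data big questions", "big answers"])
print(counts)
```

Because each call to `map_step` depends only on its own chunk, the chunks can be processed on separate machines, which is what makes the pattern a fit for networks of commodity hardware.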
Hadoop commercial application domains (from Wikipedia) include • Log and/or clickstream analysis of various kinds • Marketing analytics • Machine learning and/or sophisticated data mining • Image processing • Processing of XML messages • Web crawling and/or text processing • General archiving, including of relational/tabular data, e.g. for compliance
Hadoop • Hadoop is popular and rapidly evolving • Most leading information management vendors have embraced Hadoop • There is now a Hadoop ecosystem
Meanwhile, Back in the Googleplex • Dremel, BigQuery, Spanner, and other really big data projects
A NoSQL Taxonomy • From the NoSQL Wikipedia article:
NoSQL Perspectives • The “NoSQL” meme confusingly conflates • Document database requirements • Best served by XML DBMS (XDBMS) • Physical database model decisions on which only DBAs and systems architects should focus • And which are more complementary than competitive with DBMS • Object databases, which have floundered for decades • But with which some application developers are nonetheless enamored, for minimized “impedance mismatch,” despite significant information management compromises • Semantic (e.g., RDF) models • Also more complementary than competitive with RDBMS/XDBMS • Also consider: the “traditional” DBMS players can leverage the same underlying technology power curves
Data as a Service • The (single source of) truth is out there?... • High-quality data sources are being commoditized • Value is shifting to the ability to discern and leverage conceptual connections, not just to manage big databases • Some resources and developments to explore • Social networking graphs and activities • Data.com (Salesforce.com) • Data.gov • Google Knowledge Graph • Linked Data • Microsoft Windows Azure Data Marketplace • Wikidata.org • Wolfram Alpha
Mainstreaming Semantics • Tools and techniques applied in search of more meaning, e.g., • Vocabulary management • Disambiguation and auto-categorization • Text mining and analysis • Context and relationship analysis • It’s still ideal to help people capture and apply data and metadata in context • Semantic tools/techniques are complementary
Mainstreaming Semantics • The Semantic Web is still more vision than reality • But Google, Microsoft, Yahoo, and Yandex, for example, are improving Web searches by capturing and applying more metadata and relationships via schema.org schemas in Web pages • And Google’s Knowledge Graph is about “things, not strings,” with, as of mid-2012, “500 million objects, as well as more than 3.5 billion facts about and relationships between these different objects”
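To make the schema.org point concrete, here is a minimal sketch of page metadata that describes a "thing" and one of its relationships. It assumes JSON-LD, which is one of the syntaxes schema.org markup can be written in; the slides do not say which syntax is meant. `Person`, `Organization`, `name`, `jobTitle`, and `worksFor` are schema.org terms, while the helper `person_markup` and all the values are hypothetical placeholders.

```python
import json


def person_markup(name, job_title, org_name):
    """Build schema.org metadata for a person (hypothetical helper)."""
    return {
        "@context": "https://schema.org",
        "@type": "Person",
        "name": name,
        "jobTitle": job_title,
        # A typed relationship to another thing, not just a string.
        "worksFor": {"@type": "Organization", "name": org_name},
    }


# Placeholder values, for illustration only.
markup = person_markup("Example Person", "Analyst", "Example Org")
json_ld = json.dumps(markup, indent=2)

# The block a Web page would embed so that crawlers can read the metadata.
script_tag = f'<script type="application/ld+json">\n{json_ld}\n</script>'
print(script_tag)
```

The nested `worksFor` object is the part that carries a relationship between two typed things, which is the kind of structure the "things, not strings" phrase refers to.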
Recap • Commoditization and cloud • Very significant new opportunities • Hadoop and related frameworks • Complementary to RDBMS and XDBMS • NoSQL • Likely headed for meme-bust… • Data services • Game-changing potential • Semantic tools and techniques • Rapidly gaining momentum
Risks • The potential for an ever-expanding set of information silos • Focus on minimized redundancy and optimized integration • GIGO (garbage in, garbage out) at super-scale • New opportunities for unprecedented self-inflicted damage, for organizations that don’t model or query effectively • Cognitive overreach • The potential for information workers to create and act on nonsensical queries based on poorly designed and/or misunderstood information models • Skills gaps can create competitive disadvantages • Modeling, query formulation, and data analysis • Critical thinking and information literacy
Recommendations • Aim high: big data is in many respects just getting started… • A lot of technology recycling but also significant and disruptive innovation • Work to build consensus among stakeholders on the opportunities and risks • Focus on human skills – e.g., critical thinking and information literacy • For now, an instance of the most creative and powerful type of semantic big data processor we know of is between your ears
[End of tweaked Gilbane presentation]
Gilbane 2012 Impressions • The big themes • Cloud • Social • Mobile • Big data • Web • Other recurring themes • Open source: enterprise-ready for many domains
Gilbane 2012 Impressions • Projections • Consolidation ahead for W*M and ECM vendors • Likely to be accelerated by market uptake of native XML information management systems • And rediscovery of the utility of modern DBMSs • Along with SQL/XML (e.g., XQuery) synergy • Cloud as accelerator • Ridiculously low entry cost and complexity, relative to earlier on-premises alternatives • Tipping point with other shifts to cloud, e.g., for social, CRM/SFA, and public data sources
Gilbane 2012 Impressions • Projections • New challenges and opportunities for IT groups • Potential to derive unprecedented value from both existing and new information resources • Transition systems to “the cloud” • With or without IT assistance… • Blurring boundaries • Application, document, page… • Ability to apply and capture data and metadata in context, e.g., activity streams
Gilbane 2012 Impressions • Projections • The next critical IT scarcity is not about technology • It is instead the number of people who can • Think critically and structure problems/scenarios • Understand and apply conceptual models • Formulate queries and objectively analyze results • And generally get into an event/action routine, for work and personal activities • Growing awareness of the critical need for information responsibility • Producer: information quality, integrity, context… • Consumer: information literacy; critical and purposeful thinking
Reference Slides • Content management + open source • Hypertext
Hypertext • Criteria from a 2006 Burton Group report: • A content model based on collections of information items and links • Pervasive support for info item labels • Typed and bidirectional info item relationships • A means of creating, organizing, and sharing info item collections • Journaling (tracking info item changes) • Robust access control privilege management
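Several of these criteria map directly onto a small data structure. The sketch below is one possible reading, not the Burton Group's model: the class and method names (`Item`, `Collection`, `add_item`, `link`, `outgoing`, `incoming`) and the sample items are assumptions made for illustration. It covers labeled items, typed links that can be followed in both directions, and a journal of changes; access control and sharing are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One information item with free-form labels."""
    item_id: str
    text: str
    labels: set = field(default_factory=set)


class Collection:
    """A collection of items, typed links, and a change journal."""

    def __init__(self):
        self.items = {}    # item_id -> Item
        self.links = []    # (source_id, link_type, target_id)
        self.journal = []  # append-only record of changes

    def add_item(self, item_id, text, labels=()):
        self.items[item_id] = Item(item_id, text, set(labels))
        self.journal.append(("add_item", item_id))

    def link(self, source_id, link_type, target_id):
        if source_id not in self.items or target_id not in self.items:
            raise KeyError("both ends of a link must be in the collection")
        self.links.append((source_id, link_type, target_id))
        self.journal.append(("link", source_id, link_type, target_id))

    def outgoing(self, item_id):
        """Links that start at this item."""
        return [(t, dst) for src, t, dst in self.links if src == item_id]

    def incoming(self, item_id):
        """Links that point at this item (the reverse direction)."""
        return [(t, src) for src, t, dst in self.links if dst == item_id]


notes = Collection()
notes.add_item("hadoop", "Distributed processing framework", {"big data"})
notes.add_item("mapreduce", "Map and reduce programming model", {"big data"})
notes.link("hadoop", "implements", "mapreduce")

print(notes.outgoing("hadoop"))
print(notes.incoming("mapreduce"))
print(notes.journal)
```

Storing each link once and querying it from either end is a simple way to get bidirectional relationships without keeping two copies in sync.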
Discussion peter@okellyassociates.com