
The Digital Universe. Scientific Data – Science of Data (Algorithmic Information Theoretical Analyses)

András Benczúr, ELTE Faculty of Informatics. Supported by the following project: „Independent steps in science”, ELTE TÁMOP-4.2.2/B-10/1-2010-0030.


Presentation Transcript


  1. The Digital Universe. Scientific Data – Science of Data (Algorithmic Information Theoretical Analyses). András Benczúr, ELTE Faculty of Informatics. Supported by the following project: „Independent steps in science”, ELTE TÁMOP-4.2.2/B-10/1-2010-0030.

  2. Latest Press Releases. CERN awards major contract for computer infrastructure hosting to Wigner Research Centre for Physics in Hungary (08.05.2012). CERN today signed a contract with the Wigner Research Centre for Physics in Budapest for an extension to the CERN data centre. Under the new agreement, the Wigner Centre will host CERN equipment that will substantially extend the capabilities of the LHC Computing Grid Tier-0 activities and provide the opportunity for business continuity solutions to be implemented. This contract is initially until 31 December 2015, with the possibility of up to four one-year extensions thereafter.

  3. Recent News. Wigner-DataCenter at the Wigner Research Institute: Tier-0 center for LHC computing, a 150M EUR investment. Rolf-Dieter Heuer: 20 years of participation of Hungarian physicists in CERN. A new high-tech data connection between Budapest and CERN, a challenging new project that will change the way computing supports research in Europe. Some history: Gy. Vesztergombi, DATA Grid initiative, 1999. Hungarian projects: Demo-Grid, EGEE-I, II, III, Hungarian Grid Competence Center, Hungrid, Cluster Grid, Desktop-Grid.

  4. Recent News. Big data has the power to change scientific research from a hypothesis-driven field to one that’s data-driven, Farnam Jahanian, chief of the National Science Foundation’s Computer and Information Science and Engineering Directorate, said Wednesday (two weeks ago). The term big data refers generally to the mass of new information created by the Internet and by scientific tools such as the Hubble Telescope and the Large Hadron Collider. The emerging field of big data analysis is aimed at sorting through the massive volume of that data (whether it’s social media posts, video clips, satellite feeds or the reaction of accelerated particles) to gather intelligence and spot new patterns.

  5. Recent News. Federal officials announced in March that the government will invest $200 million in research grants and infrastructure building for big data. The investment was spawned by a June 2011 report from the President's Council of Advisors on Science and Technology, which found a gap in the private sector's investment in basic research and development for big data.

  6. Digital Universe and Semantic Gap. Mankind has given birth to a new universe, the Digital Universe. The majority of our data and information is somewhere inside it, in digital form of some kind. Even new observations (from the LHC, digital sensors, cameras, etc.) first enter it in digital form. The conjecture on the growing semantic gap between human beings and computers: as the size of databases grows, the length of queries grows at least logarithmically, and may grow linearly. According to the IDC estimate in [4], the size of the Digital Universe will grow by a factor of 9 in the next five years; that is, it doubles roughly every one and a half years.
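
The two growth figures on this slide can be checked against each other with a few lines of arithmetic. The following is a minimal sketch; the function names are illustrative and not from the presentation, and steady exponential growth is assumed.

```python
import math

def doubling_time(factor: float, years: float) -> float:
    """Doubling time (in years) implied by growing by `factor` over `years`,
    assuming steady exponential growth."""
    return years * math.log(2) / math.log(factor)

def growth_factor(doubling_years: float, years: float) -> float:
    """Growth factor over `years` implied by a fixed doubling time."""
    return 2 ** (years / doubling_years)

# A factor of 9 in five years implies a doubling time a little above 1.5 years.
print(round(doubling_time(9, 5), 2))

# Doubling exactly every 1.5 years would give a factor of about 10 in five years.
print(round(growth_factor(1.5, 5), 1))
```

Under that assumption the two statements agree to within rounding: "factor 9 in five years" and "doubling every one and a half years" describe nearly the same growth rate.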

  7. Digital Universe and Semantic Gap. The Digital Universe contains only the substitutions, or encodings, of information, independently of what information means. Inside the Digital Universe the physical processes are either transformations of signals from one form to another or materialized computations.

  8. Digital Universe and Semantic Gap. Paradoxically, inside the Digital Universe the basic components, the physically (even if only temporarily) existing digits as bits and bytes, have no semantic meaning, only an operational, computational or transformational one. The observer’s meanings at the very end of the interaction with the real world lie in the mappings of real-world things to a formal computable model. This mapping is the kernel of filling the gap between human beings and computers.

  9. Digital Universe and Semantic Gap. H. Mason: a data scientist needs three skills: (1) mathematics: modeling the data, building the model; (2) engineering: implementing data processing; (3) finding insight and telling stories about the data, asking the right questions, which is the hardest task. We need them to fill the SEMANTIC GAP. P. Gelsinger: „Thirty years ago we didn’t have CS departments, now every quality school on the planet has one. Now, nobody has a data-science department. In thirty years every school on the planet will have one.” In: „Big Data’s Big Problem: Little Talent” (The Wall Street Journal, 04/29/2012).

  10. Motivation. 1967, Debrecen, Colloquium on Information Theory: „Where does information come from?” (from the past), S. Watanabe, abstract. The question was raised for inductive inference and for deductive inference. „Human mind, being an information transducer, it can lose but not gain information.” So the Digital Universe, being an information transducer, can lose but not gain information.

  11. Motivation. Today: Where is information? In the Digital Universe. The Digital Universe can lose but not gain information. Information is collected in it. In 2011, 1.8 zettabytes of data will be created. Is information there? There are signals only. How can we gain information from it? By computation. Computation: signal transformation. How much information? What is information?

  12. Motivation. Data volume on the Net. Estimate: the data on the Web doubles every 11 to 18 months. Exabyte: the size of new data in the year 1998. IDC research: the size of new data in 2011 will exceed 1.8 zettabytes (1.8 × 10^21 bytes). Upper estimate for human-written code: 10^8 programmers, 8 hours daily, one keystroke (one byte) per second: new programs in one year amount to about 10^15 bytes.
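
The upper estimate above is plain multiplication and can be reproduced directly. This sketch assumes 365 working days per year (the slide does not say how many days are counted); the variable names are illustrative.

```python
PROGRAMMERS = 10**8          # from the slide
HOURS_PER_DAY = 8            # from the slide
BYTES_PER_SECOND = 1         # one keystroke = one byte, from the slide
DAYS_PER_YEAR = 365          # assumption: typing every day of the year

SECONDS_PER_HOUR = 3600
bytes_per_year = (PROGRAMMERS * HOURS_PER_DAY * SECONDS_PER_HOUR
                  * BYTES_PER_SECOND * DAYS_PER_YEAR)

IDC_2011_BYTES = 1.8e21      # 1.8 zettabytes, the IDC figure quoted on the slide

print(f"hand-typed program text per year: {bytes_per_year:.2e} bytes")
print(f"ratio of all new data to typed code: {IDC_2011_BYTES / bytes_per_year:.1e}")
```

Even with these generous assumptions, hand-typed program text stays around 10^15 bytes per year, roughly a million times smaller than the total volume of new data.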

  13. Motivation. Next generation science, data intensive science (Jim Gray, Alex Szalay et al. 2005). „Scientists generate new data much faster than they can analyze it. It all looks like an optical illusion.” (Hugh Kieffert) Big Data, Scientific Data.

  14. The Data-Scope Project: 6 PB storage, 500 GB/s sequential IO, 20M IOPS, 130 TFlops (Thursday, February 2, 2012). “Data is everywhere, never be at a single location. Not scalable, not maintainable.” (Alex Szalay) An interview by Nicole Hemsoth with Dr. Alexander Szalay, Data-Scope team lead, is available under the title The New Era of Computing: An Interview with "Dr. Data".

  15. Semantic Gap. The semantic gap between two persons. The semantic gap between a person and a computer. The effect of growing data volume on the semantic gap: the law of algorithmic information theory.

  16. Mathematics: Information Theory. Mathematical theories of information deal with quantitative properties. They mainly deal with the objective parts of information (the representation and its mapping to referents). The subjective aspect, the semantics of the referents, is the problem of the observer. In [1], P.J. Denning summarizes the discussion on the definition of information as follows: “The formal definitions of data (objective symbols) and information (subjective meaning) do not help me to design computers and algorithms. … Still, what information is remains an open question.”

  17. Mathematics: Information Theory. If we want to get closer to the notion of information from the point of view of mathematical models, we have to investigate carefully what is measured by the entropy functions. According to Kolmogorov [2], we can measure the quantity of information in three ways. All three measures are related to the length of a description and not to the meaning of information. They are connected to the length of an optimal digital code.

  18. Measures of information quantity. Kolmogorov: three approaches • Probabilistic: Shannon entropy • Algorithmic: Kolmogorov entropy • Combinatorial: uniform code length for all elements of the set
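
To make the three approaches concrete, the sketch below computes each kind of quantity for a small byte string. Two caveats are assumptions of this example, not claims from the slides: the Shannon figure is the empirical per-symbol entropy of the message itself, and the Kolmogorov entropy is not computable, so the length of a zlib-compressed copy is used only as a crude upper-bound stand-in.

```python
import math
import zlib
from collections import Counter

def shannon_entropy_bits(message: bytes) -> float:
    """Probabilistic measure: empirical Shannon entropy in bits per symbol."""
    counts = Counter(message)
    n = len(message)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def combinatorial_bits(set_size: int) -> float:
    """Combinatorial measure: uniform code length log2|S| for any element of S."""
    return math.log2(set_size)

def compressed_bits(message: bytes) -> int:
    """Stand-in for the algorithmic measure: bit length of a zlib-compressed copy.
    This is only an upper bound (up to a constant) on the Kolmogorov entropy."""
    return 8 * len(zlib.compress(message, 9))

regular = b"ab" * 500  # a highly regular 1000-byte message

print(shannon_entropy_bits(regular))   # bits per symbol, ignoring symbol order
print(combinatorial_bits(256))         # bits needed to index one of 256 byte values
print(compressed_bits(regular))        # bits in a short description of the whole message
```

The symbol-by-symbol Shannon estimate charges one bit per symbol because it only sees the frequencies of `a` and `b`, while the compressed description exploits the repetition and is far shorter. All three numbers are description lengths; none says anything about meaning.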

  19. Mathematics: Information Theory. In the Shannon model the expected value of the code length is minimized, whilst the Kolmogorov entropy measures the minimal length of codes used by the Universal Reference Machine. In both models we don’t know what information is; we only know that there is a way to construct or reconstruct it from a signal of given length. We don’t know what information is, we only know how much of it there is. To process information you have to understand meaning. Meaning should be in the eye of the beholder.

  20. Basics of Algorithmic Information Theory. The two basic principles of algorithmic information theory: different things need different encodings; decoding needs computable functions.

  21. Basic techniques: 1) counting the number of code words of given lengths; 2) using a reference machine that enumerates a set of decoding functions. Invariance theorem. The algorithmic information quantity: the length of the shortest codeword used by the Universal Turing machine as reference machine. l(p): the length of code p.

  22. Conditional Kolmogorov entropy • Definition: [formula missing in transcript] • Prefix entropy: choose the prefix Universal Turing machine U(p, y) as reference machine
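
The formulas on this slide did not survive in the transcript. For orientation, the standard textbook definitions that match the wording of slides 21 and 22 are sketched below; the symbols C_U, K_U and c_V are notation assumed here and may differ from what the slide actually showed.

```latex
% Conditional Kolmogorov entropy of x given y, relative to a universal machine U
% (l(p) is the length of code p, as on slide 21):
C_U(x \mid y) = \min \{\, l(p) : U(p, y) = x \,\}

% Prefix entropy: the same minimum, with U a universal prefix machine
% (no valid program is a proper prefix of another valid program):
K_U(x \mid y) = \min \{\, l(p) : U(p, y) = x \,\}, \qquad U \text{ a prefix machine}

% Invariance theorem: for any other reference machine V there is a constant c_V
% that does not depend on x or y, such that
C_U(x \mid y) \le C_V(x \mid y) + c_V
```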

  23. Conditional Kolmogorov entropy. The measure of the algorithmic information quantity, the Kolmogorov entropy, is not suited to direct investigation of the Digital Universe. Only the construction of the Universal Reference Machine is important, as a measurement tool for finding approximate quantitative analyses of the behavior of the Digital Universe.

  24. Querying a computer: a model. Participants: the computer Watson and the person Holmes. Watson: the content of its data system is M, which contains the codes of programs, Prog. Watson answers a query (request) Q if there exists P in Prog such that P computes some answer A from Q and M. The reference to P must be given in Q.

  25. Querying a computer: a model. The person Holmes: the conscious content of the brain is knowledge K, which contains a part on „Thinking”: the ability to articulate and codify knowledge, cognitive processes, mental mechanisms. Holmes should articulate and codify a formal query Q for retrieving data A from Watson. This process is called filling the semantic gap between Holmes and Watson.

  26. In our simple model Holmes submits the query Q and Watson answers A. Q contains some reference to a program P in M used to compute the answer A = P(Q, M). Now the conditional Kolmogorov entropy [formula missing in transcript] (the law of no information growth). Meaning: the length of the shortest query used by U. Practical limitation: strong only for large A and Q.

  27. New reference machine: M with Prog inside. The reference machine used in the definition of the Kolmogorov entropy utilizes the possibility of enumerating every computable function, and it is a bit far from practical applications. Following the basic idea in the construction of the reference machine, we can consider M with Prog inside as the reference machine. (At any time, the best approximation of the Universal Reference Machine is in the Digital Universe.)

  28. The conditional algorithmic entropy of A given M is the length of the shortest query for which Watson gives the answer A. In notation: [formula missing in transcript] (note: q contains a reference to p). An important difference from the universal Turing machine is that Watson contains a collection of facts in M (a finite oracle). We can measure the querying efficiency of Holmes in getting answer A from Watson as [formula missing in transcript].
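
The Watson model of slides 24 to 28 can be illustrated with a toy in which the shortest query is found by brute force. Everything concrete in this sketch is hypothetical: the contents of `M`, the two programs in `PROG`, the convention that the first character of a query names the program, and the tiny alphabet that keeps the search small.

```python
from itertools import product

# Hypothetical data store M: a finite collection of facts.
M = {"h": "holmes", "w": "watson", "hw": "baker street"}

# Hypothetical Prog: programs stored alongside the data, each named by one character.
PROG = {
    "g": lambda arg, m: m.get(arg),                          # look up a fact by key
    "c": lambda arg, m: str(len(m)) if arg == "" else None,  # count the facts
}

def watson(query, m):
    """Compute A = P(Q, M): Q's first character references P, the rest is P's input."""
    if not query or query[0] not in PROG:
        return None  # no program is referenced, so Watson cannot answer
    return PROG[query[0]](query[1:], m)

def shortest_query_length(answer, m, alphabet="cghw", max_len=3):
    """Length of the shortest query for which Watson returns `answer`,
    found by trying all queries in order of increasing length."""
    for length in range(1, max_len + 1):
        for chars in product(alphabet, repeat=length):
            if watson("".join(chars), m) == answer:
                return length
    return None  # not obtainable within max_len

print(shortest_query_length("3", M))             # answered by the query "c"
print(shortest_query_length("holmes", M))        # answered by the query "gh"
print(shortest_query_length("baker street", M))  # answered by the query "ghw"
```

The search plays the role of the minimum in the definition: the entropy of an answer relative to this Watson is the length of the first query that produces it. An answer that no stored program can produce from `M` has no finite query length at all.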

  29. Quantitative modelling of the human-computer interaction. Suppose that today Holmes solves a problem D after entering query Q and retrieving some information A from Watson. This means that, using a human reasoning “program” R, Holmes obtains the solution S from D, K, Q and A: R(D, K, Q, A) = S. Note: the semantics of A is relative to Q.

  30. Douglas Adams: The Hitchhiker’s Guide to the Galaxy. “Tell us!” “All right,” said Deep Thought. “The Answer to the Great Question…” “Yes…!” “Of Life, the Universe and Everything…” said Deep Thought, “Is Forty-two.” “You have never actually known what the question is.” “So once you do know what the question actually is, you know what the answer means.”

  31. Individual information measure. Similarly to Watson, we can introduce information measures for Holmes. The need to query Watson means that he cannot give a solution S, even if the problem is formulated in the form of D, so [formula missing in transcript]. Explanation: K is closed.

  32. Model fitting. Model fitting between the problem domain of D and a pre-coded model in M is necessary for codifying the query Q. During this process the knowledge about M contained in K plays an important role in formulating an efficient query. M may also contain some information about K; this is the possibility of personalization. All this influences the semantic gap in formulating the query Q. Explanation: the role of stochastic modelling. Problem of (scientific) databases: mapping the semantics of measurement information to a computational data model.

  33. Information no-growth law revisited. In formulating the query Q, Holmes uses K and the problem description D. Having added Q to M, he receives back some information that has been added to M by someone else. If the answer A is sufficient for the solution S, then there is no semantic gap. Otherwise, in order to obtain the solution S from K, D, Q and A, he uses some process R not codified for Watson. Another semantic gap arises: codifying R into a code QR, so that Watson gives the answer SR for QR.

  34. How can we use the model? Estimate the cardinality of the sets of possible answers, questions and problems, and then estimate the average length of queries and answers. Let us fix the present situation as above. As M grows, the code lengths of a new query and answer with the same semantics as the former A also grow.

  35. The effect of growing M. The conditional entropy of the answer A according to the reference machine (Watson, the Digital Universe, or the Universal Turing machine) uses the condition that M is given. How will the conditional entropy vary when we add some new data (digital signals) to M? Denoting the new content by M’, we can ask what the new conditional entropy of the same answer A is.

  36. The effect of growing M. The number of possible answers grows exponentially, so the number of queries also grows exponentially. The typical query length grows linearly.

  37. Example: subset query. M encodes n elements of a set. A query retrieves a subset. Number of queries and answers: 2^n. Average length of queries and answers: c·n. Adding m new elements to the set: number of queries and answers: 2^(n+m); average length of queries and answers: c·(n+m). The average length is independent of the reference machine.
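
The counting argument behind this example can be sketched in a few lines: whatever the reference machine, 2^n different answers need 2^n different queries, and there are only 2^k binary strings of length k. The function names below are illustrative, and the code model (distinct binary strings, the empty string allowed) is an assumption of this sketch.

```python
from itertools import combinations

def subset_queries(elements):
    """Yield every subset of `elements`; each one is a possible answer."""
    for r in range(len(elements) + 1):
        yield from combinations(elements, r)

def min_average_code_length(num_messages):
    """Smallest possible average length (in bits) when `num_messages` distinct
    messages are each given a distinct binary string, shortest strings first."""
    total, assigned, length = 0, 0, 0
    while assigned < num_messages:
        take = min(2 ** length, num_messages - assigned)  # strings of this length
        total += take * length
        assigned += take
        length += 1
    return total / num_messages

n, m = 10, 5
print(len(list(subset_queries(range(4)))))   # number of subsets of a 4-element set
print(min_average_code_length(2 ** n))       # lower bound on the average for n elements
print(min_average_code_length(2 ** (n + m))) # lower bound after adding m elements
```

The lower bound stays within two bits of n, so the average query length grows linearly with the number of stored elements no matter how cleverly the queries are encoded, which is the sense in which the slide calls it independent of the reference machine.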

  38. The threat of a growing semantic gap. The size of queries and answers exceeds the processing capacity of a human being. The difference between the information quantity of K (human knowledge) and M (the World’s data) is growing exponentially. The same will be true for the common knowledge of a group of people, and finally for mankind.

  39. World’s Data. Conducted by Revolution Analytics at the Joint Statistical Meeting held in Miami from July 30 through Aug. 4, the survey shows that 97% of data scientists believe "big data" analytics technology currently is falling short of enterprise needs. Specifically, the 200 or so scientists surveyed highlighted three obstacles to running analytics on big data: * the inherent complexities of big data software * problems applying valid statistical models to the data * a general lack of insight into what the data means

  40. Evolution of info-communication technologies will help us. Search engines, concentration (Google, Yahoo, Ms Explorer, Mozilla, …). Distributed and parallel technologies: HPC, clusters, Grid, Cloud, … Social networking: Twitter, blogging, YouTube, Facebook, … Semantic technologies (Semantic Web, RDF, OWL, …). Data mining, data warehousing, OLAP, Big Data, NoSQL.

  41. World’s Data. Unstructured data (files, email, video) will account for 90% of all data created over the next decade. The number of servers managing the world’s data stores will grow by ten times. The bad news: the number of IT professionals available to manage all that data will grow to only 1.5 times today’s level. They simply won’t keep pace with demand. (Threat of a growing semantic gap.) New data sources: embedded systems, sensors in clothing, medical devices, buildings, …

  42. Data intensive science. Next generation science, by Jim Gray, Alex Szalay et al. 2005. „Scientists generate new data much faster than they can analyze it. It all looks like an optical illusion.” (Hugh Kieffert)

  43. Jim Gray’s Law of Data Engineering. Scientific computing is revolving around data. Need scale-out solutions for analyses.

  44. Jim Gray: The Big Picture. [Diagram: facts flow in from Experiments & Instruments, Other Archives, Literature and Simulations; questions go in, answers come out.] The Big Problems • Data ingest • Managing a petabyte • Common schema • How to organize it? • How to reorganize it? • How to coexist with others? • Data Query and Visualization tools • Support/training • Performance • Execute queries in a minute • Batch (big) query scheduling

  45. The Big Picture, extended. [Diagram: the Digital Universe, holding M and Prog, receives facts from Experiments & Instruments, Other Archives, Literature and Simulations, together with codes (Programs) and Documents; Questions go in, Answers come out; one example fact is labeled "She is XY".] The Big Problems • Data ingest • Managing a petabyte • Common schema • How to organize it? • How to reorganize it? • How to coexist with others? • Data Query and Visualization tools • Support/training • Performance • Execute queries in a minute • Batch (big) query scheduling

  46. Computational Statistics. Unstructured data in M, such as recorded facts about stochastic and random phenomena, needs queries formulated in terms of computational statistics. From MIT Technology Review, Jan/Feb 2010, p. 24, Mike Lynch (cofounder of Autonomy): Why can’t Google’s algorithms search unstructured information? To process unstructured information you have to understand meaning. Meaning should be in the eye of the beholder.

  47. Theory of Algorithmic Statistics. Two-part code: a description of a set, followed by a conditional encoding of the element within that set. Kolmogorov’s structure function. The description of the set S is the structural part; it gives the regular or statistical properties of x, and usually has some natural meaning. The second part, the long code, is the random component. Now, probably, the random part of the Digital Universe is much larger than the discovered structure.
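
A two-part code can be made concrete with one deliberately simple model class. In this sketch the set S is "all binary strings of length n with exactly k ones"; that choice, the assumption that n is known to the decoder, and the function name are all illustrative and not taken from the presentation.

```python
import math

def two_part_code_length(x):
    """Return (model_bits, data_bits) for a binary string x.

    Part 1 (structure): name the set S by giving k, the number of ones;
    with n known, this costs log2(n + 1) bits.
    Part 2 (random component): the index of x inside S, log2 C(n, k) bits.
    """
    n, k = len(x), x.count("1")
    model_bits = math.log2(n + 1)
    data_bits = math.log2(math.comb(n, k))
    return model_bits, data_bits

for x in ("0" * 64, "01" * 32):
    model_bits, data_bits = two_part_code_length(x)
    print(f"{x[:8]}...  structure: {model_bits:.1f} bits, random part: {data_bits:.1f} bits")
```

For the all-zero string the set has a single element, so the description of S carries everything and the random part is empty. For the alternating string this model class only captures the count of ones, so almost all of the code length lands in the "random" part even though the string is obviously regular: how much counts as structure depends on which sets one is able to describe.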

  48. The three Universes: the Universe; the Universe in a human brain; the Digital Universe. Three different pasts to be observed. „Where does information come from?” (from the past). Research: force and provoke Nature (a Universe) to produce and show a past that we have not yet observed.

  49. DEMON OF THE SECOND KIND "We want the Demon, you see, to extract from the dance of atoms only information that is genuine, like mathematical theorems, fashion magazines, blueprints, historical chronicles, or a recipe for ion crumpets, or how to clean and iron a suit of asbestos, and poetry too, and scientific advice, and almanacs, and calendars, and secret documents, and everything that ever appeared in any newspaper in the Universe, and telephone books of the future…" (Stanislaw Lem, The Cyberiad)

  50. DEMON OF THE SECOND KIND. A Demon of the Second Kind is a fictional machine that writes factual statements, but only all too well. It appears in the short story "The Sixth Sally," which is part of the novel The Cyberiad by Stanislaw Lem. In the story, two clever, space-traveling robots (Trurl and Klapaucius) fall into the clutches of an evil robot, the giant pirate Pugg. This pirate does not want to rob them of gold or silver; instead, he wants information. Specifically, Pugg tells his two captives that he will forcibly hold them until they tell him everything they know. Faced with the possibility of spending eons reciting all their knowledge, Trurl and Klapaucius offer the pirate a bargain. If he promises to let them go afterwards, the pair will build him a Demon of the Second Kind, a special machine that can print out an infinite amount of information.
