Microsoft Research: Big Databases at Your Fingertips

Learn about the innovative big database research projects at Microsoft Research, including TerraServer and SkyServer. Discover how Microsoft Research collaborates with the US Dept. of Homeland Security and tackles big data challenges.


Presentation Transcript


  1. Microsoft Research and Big Databases: Information at Your Fingertips • Jim Gray & Tom Barclay (gray@microsoft.com & TBarclay@Microsoft.com), Microsoft Research • Presentation to US Dept. of Homeland Security, 7 April 2004

  2. Outline • Overview of Microsoft Research • Big-Database Research • TerraServer: Geospatial app • SkyServer: data mining app • Q&A

  3. Most R&D Is D. How to Do Basic Research in Industry? Critical questions (from Rick Rashid): • How can I create and maintain a world-class research organization in an industrial setting? • How do I keep the lines of communication open between product teams and researchers? • How do I get new technology into products quickly?

  4. Approach: Adapt the Academic Model • Organizational goal: Advance state of the art • University organizational model • Flat structure, critical mass groups • Open research environment • Aggressive publication in peer-reviewed literature • Frequent visitors, daily seminars • Strong ties to University Research • Nearly 15% of basic research budget directly invested in Universities • Lab grants, research grants, fellowships, etc. • Hundreds of interns and visitors

  5. Microsoft Research • Founded in 1991 • Staff of over 700 in over 55 areas • Internationally recognized research teams • Lab locations: • Redmond, Washington, USA 75% • Cambridge, United Kingdom 10% • Beijing, People’s Republic of China 10% • Mountain View, California, USA 5% • San Francisco, California, USA 1%

  6. Microsoft Research: Expanding the State of the Art • Thousands of peer-reviewed publications • 10%…30% of papers at our focus conferences: graphics, programming, systems, data management… • Community leadership • Professional societies • Journals • Conferences • Mentoring Interns • Hosting academic summers and sabbaticals • Special workshops

  7. BARC’s Research Agenda • Scalable Servers • TerraServer – US map online • SkyServer – All astronomy data online • Databases • Advancing databases and data storage • Media Management • Organizing your digital shoebox

  8. How Can HLS & MSR Cooperate? • Lots of research at MSR on HLS-relevant areas: • Data mining and visualization • Distributed systems • Cryptography, security, … • Etc. • Invite MS Researchers to HLS: • workshops • study groups • HLS visiting scientists at MSR?

  9. Outline • Overview of Microsoft Research • Big-Database Research • TerraServer: Geospatial app • SkyServer: data mining app • Q&A

  10. Numbers: Terabytes and Gigabytes are BIG! • Mega – a house in California • Giga – a very rich person (billionaire) • Tera – roughly the national debt • Peta – more than all the money in the world • A Gigabyte: the Human Genome • A Terabyte: a 150-mile-long shelf of books

  11. How much information is there? • Soon everything can be recorded and indexed • Most bytes will never be seen by humans • Data summarization, trend detection, and anomaly detection are key technologies • See Mike Lesk, How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html • See Lyman & Varian, How much information: http://www.sims.berkeley.edu/research/projects/how-much-info/ • Figure: scale running Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta, with markers for A Book, A Photo, Movie, All books (words), All Books MultiMedia, and Everything! Recorded • Small prefixes (powers of ten): 3 milli, 6 micro, 9 nano, 12 pico, 15 femto, 18 atto, 21 zepto, 24 yocto

  12. e-Science Has BIG DATA • Data captured by instruments, or data generated by simulator • Processed by software • Placed in files or a database • Scientist analyzes files / database • Virtual laboratories • Networks connecting e-Scientists • Strong support from funding agencies • Better use of resources • Primitive today

  13. The eScience Big Picture • Diagram labels: Experiments & Instruments, Other Archives, Literature, Simulations (each a source of facts); questions; answers • The Big Problems: • Data ingest • Managing a petabyte • Common schema • How to organize it? • How to reorganize it? • How to coexist with others? • Query and Vis tools • Support/training • Performance • Execute queries in a minute • Batch query scheduling

  14. e-Science is Data Mining • There are LOTS of data • People cannot examine most of it • Need computers to do analysis • Manual or Automatic Exploration • Manual: person suggests hypothesis, computer checks hypothesis • Automatic: computer suggests hypothesis, person evaluates significance • Given an arbitrary parameter space: • Data Clusters • Points between Data Clusters • Isolated Data Clusters • Isolated Data Groups • Holes in Data Clusters • Isolated Points • Nichol et al. 2001. Slide courtesy of and adapted from Robert Brunner @ CalTech.

  15. Data Analysis • Looking for • Needles in haystacks – the Higgs particle • Haystacks: Dark matter, Dark energy • Needles are easier than haystacks • Global statistics have poor scaling • Correlation functions are N^2, likelihood techniques N^3 • As data and computers grow at same rate, we can only keep up with N log N • A way out? • Discard notion of optimal (data is fuzzy, answers are approximate) • Don’t assume infinite computational resources or memory • Requires combination of statistics & computer science
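
The scaling point on this slide can be checked with a little arithmetic. The sketch below assumes, purely for illustration, that data volume and compute speed both double each period; the function name `relative_runtime` and the starting size are hypothetical and not from the talk.

```python
import math


def relative_runtime(cost, n0, periods):
    """Run time relative to period 0, assuming data size AND compute
    speed both double every period (an illustrative assumption)."""
    base = cost(n0)
    times = []
    for p in range(periods + 1):
        n = n0 * 2 ** p      # data volume after p doublings
        speed = 2 ** p       # compute speed after p doublings
        times.append(cost(n) / speed / base)
    return times


# Cost models named on the slide.
COSTS = {
    "N log N": lambda n: n * math.log2(n),
    "N^2": lambda n: n ** 2,
    "N^3": lambda n: n ** 3,
}

if __name__ == "__main__":
    for name, cost in COSTS.items():
        row = relative_runtime(cost, n0=1_000_000, periods=5)
        print(f"{name:8s}", [round(t, 2) for t in row])
```

Under these assumptions the N^2 run time doubles every period and the N^3 run time quadruples, while the N log N run time grows only by the slowly increasing logarithmic factor.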

  16. Outline • Overview of Microsoft Research • Big-Database Research • TerraServer: Geospatial app • SkyServer: data mining app • Q&A

  17. TerraServer/TerraService • http://terraService.Net/ • http://TerraServer-USA.com/ • US Geological Survey Photo (DOQ) & Topo (DRG) images • On Internet since June 1998 • Operated by Microsoft • Cross Indexed with • Demographics, • A web service • 20 TB data source • 10 M web hits/day

  18. USGS Image Data • Digital OrthoQuads (DOQ): • Digitized aerial imagery • 15 TB, 280,000 files uncompressed • 96% coverage of conterminous US • 1 meter resolution • < 15 years old • Digital Raster Graphics (DRG): • Scanned topo maps • 1 TB compressed TIFF, 65,000 files • 100% U.S. coverage • 1:24,000, 1:100,000 and 1:250,000 scale maps • Maps vary in age • Urban Area: • 1 foot resolution • Natural Color • 133 major U.S. cities • 30 available 2004 • 2001 or later • Produced by NIMA for Homeland Security

  19. Image Coverage • 100% U.S., Topo Maps (light green), 2m to 1024m resolution • 96% of 48 Conterminous States (dark green), Ortho Imagery, 1m to 1024m resolution • Urban Area Cities: Seattle, Portland, Stockton, Modesto, Fresno, Sacramento, Chicago, Orlando, Atlanta, Amarillo, Houston, Lubbock, Springfield, Birmingham, Dallas, Albuquerque, Oklahoma City, El Paso, Lincoln, Lexington, Tampa, Washington DC, Mobile, Ft Wayne, Colorado Springs, Baton Rouge, …

  20. User Interface Concept • Display Imagery: • 316 m 200 x 200 pixel images • 7 level image pyramid • Resolution 1 meter/pixel to 64 meter/pixel • Navigation Tools: • 1.5 m place names • “Click-on” Coverage map • Longitude and Latitude search • U.S. Address Search • External Geo-Spatial Links to: • USGS On-line Stream Gauges • Home Advisor Demographics • Home Advisor Real Estate • Encarta Articles • Stream flow gauges • Concept: User navigates an ‘almost seamless’ image of earth • Click on image to zoom in • Buttons to pan NW, N, NE, W, E, SW, S, SE • Links to switch between Topo, Imagery, and Relief data • Links to Print, Download and view meta-data information
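
The 7-level pyramid and the 200 x 200 pixel tiles on this slide can be sketched as follows. The only figures taken from the slide are the tile size, the level count, and the 1 to 64 meter/pixel range; the doubling between levels is inferred from those figures, and the level numbering and the `tile_index` addressing scheme are hypothetical, not TerraServer's actual tile identifiers.

```python
TILE_PIXELS = 200  # tile edge length in pixels (from the slide)


def pyramid_levels(finest_m_per_px=1, levels=7):
    """List (level, meters_per_pixel, tile_edge_meters) for each level,
    assuming resolution halves (meters/pixel doubles) at every step."""
    out = []
    for level in range(levels):
        res = finest_m_per_px * 2 ** level
        out.append((level, res, res * TILE_PIXELS))
    return out


def tile_index(easting_m, northing_m, level, finest_m_per_px=1):
    """Hypothetical addressing: column and row of the tile that contains
    a projected point (meters), on a simple grid anchored at (0, 0)."""
    edge = finest_m_per_px * 2 ** level * TILE_PIXELS
    return int(easting_m // edge), int(northing_m // edge)


if __name__ == "__main__":
    for level, res, edge in pyramid_levels():
        print(f"level {level}: {res} m/pixel, tile covers {edge} m x {edge} m")
    print(tile_index(550_000, 5_270_000, level=0))
```

Zooming in by one level then amounts to moving to the tile with half the ground edge length that contains the clicked point.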

  21. New “Urban Area” Data Microsoft Campus at 4 meter resolution “Redundant Bunch 1” Ball field at .25 meter resolution

  22. Software Architecture (diagram) • Tiers: Load Programs, Web Server, Database Server • Components: WinForm App • C# Classes • TerraServer Web Pages, Services, Classes (C#) • TerraServer Stored Procedures (T-SQL) • SQL Server 2000 • ASP.NET 1.1 • ADO.NET 1.1 • .NET Framework 1.1 • IIS 6.0 • Windows 2003 Server

  23. TerraServer Becomes a Web Service: TerraServer.net -> TerraService.Net • Web server is for people. • Web Service is for programs. • The end of screen scraping • No faking a URL: pass real parameters. • No parsing the answer: data formatted into your address space. • Hundreds of users but a specific example: • US Department of Agriculture Lighthouse app. • USDA has internal TerraServer

  24. Web Service Methods (http://terraservice.net) • Place Search: GetPlaceFacts, GetPlaceList, GetPlaceListInRect, CountPlacesInRect • Projection: ConvertLonLatPtToUtmPt, ConvertUtmPtToLonLatPt, ConvertLonLatToNearestPlace, GetTheme, GetLatLonMetrics • Tile: GetAreaFromPt, GetAreaFromRect, GetAreaFromTileId, GetTileMetaFromLonLatPt, GetTileMetaFromTileId, GetTile (Image) • Landmark: GetLandmarkTypes, CountOfLandmarkPointsByRect, GetLandmarkPointsByRect, CountOfLandmarkShapesByRect, GetLandmarkShapesByRect

  25. Sample Apps (http://terraservice.net) • TerraServer Web Services: Terra-Tile-Service, Landmark-Service • Service functions: Get image meta-data • Query TS Gazetteer • Retrieve TS ImageTiles • Projection conversions • Web Map Client: • OpenGIS “like” • Landmarks layered on TerraServer imagery • Geo-coded data of well-known objects (points), e.g. Schools, Golf Courses, Hospitals, etc. • Polygons of well-known objects (shapes), e.g. Zip Codes, Cities, etc. • Fat Map Client: • Visual Basic / C# Windows Form • Access Web Services for all data

  26. Hardware Evolution • 1998 – 2000: DEC Alpha 8400, StorageWorks DAS • 1 x 8 x 440 MHz RISC processors, 2 GB RAM • 2.5 TB RAID-5, 9 GB SCSI drives, 7 racks • $2.1M (World’s Largest PC) – “Single Server Scale Up” • 2000 – 2003: 4-node Compaq Windows 2000 DataCenter Cluster, StorageWorks SAN • 4 x 8 x 700 MHz Intel (Xeon) processors, 4 GB RAM each • 18 TB RAID-10 (triple mirrored), 73 GB drives, 4 racks • $1.6M – “High Availability Large Scale Cluster” • 2004 – …: “White-box Storage Bricks” • Low Cost Availability • 4 copies of the data • RAID1 SATA Mirroring • 2 redundant “Bunches” • Spare brick to repair failed brick (2N+1 design) • Web Application “bunch aware” • Load balances between redundant databases • Fails over to surviving database on failure • ~$100K capital expense
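
The "bunch aware" behavior described on this slide (load balance between redundant databases, fail over to the survivor) can be sketched as a small router. This is a minimal illustration, not the TerraServer web application's code: the class name `BunchAwareRouter`, the round-robin policy, and the bunch names are all assumptions.

```python
import itertools


class BunchAwareRouter:
    """Hypothetical router over redundant database "bunches"."""

    def __init__(self, bunches):
        self.bunches = list(bunches)
        self.healthy = set(self.bunches)
        self._cycle = itertools.cycle(self.bunches)

    def mark_failed(self, bunch):
        """Take a bunch out of rotation after a failure is detected."""
        self.healthy.discard(bunch)

    def mark_repaired(self, bunch):
        """Return a known bunch to rotation once it is repaired."""
        if bunch in self.bunches:
            self.healthy.add(bunch)

    def pick(self):
        """Round-robin over bunches, skipping any marked as failed."""
        for _ in range(len(self.bunches)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy bunch available")


if __name__ == "__main__":
    router = BunchAwareRouter(["bunch1", "bunch2"])
    print([router.pick() for _ in range(4)])   # alternates between bunches
    router.mark_failed("bunch1")
    print([router.pick() for _ in range(3)])   # only the survivor is used
```

Round-robin is just one possible balancing policy; the slide does not say which policy the application used.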

  27. Outline • Overview of Microsoft Research • Big-Database Research • TerraServer: Geospatial app • SkyServer: data mining app • Q&A

  28. Virtual Observatory • http://www.astro.caltech.edu/nvoconf/ • http://www.voforum.org/ • Premise: Most data is (or could be) online • So, the Internet is the world’s best telescope: • It has data on every part of the sky • In every measured spectral band: optical, x-ray, radio, … • As deep as the best instruments (2 years ago). • It is up when you are up. The “seeing” is always great (no working at night, no clouds, no moons, no …). • It’s a smart telescope: links objects and data to literature on them.

  29. Why Astronomy Data? • It has no commercial value • No privacy concerns • Can freely share results with others • Great for experimenting with algorithms • It is real and well documented • High-dimensional data (with confidence intervals) • Spatial data • Temporal data • Many different instruments from many different places and many different times • Federation is a goal • The questions are interesting • How did the universe form? • There is a lot of it (petabytes) • Figure panels: ROSAT ~keV, DSS Optical, 2MASS 2 µm, IRAS 25 µm, IRAS 100 µm, GB 6 cm, NVSS 20 cm, WENSS 92 cm

  30. Time and Spectral Dimensions: The Multiwavelength Crab Nebula • Crab star, 1054 AD • X-ray, optical, infrared, and radio views of the nearby Crab Nebula, which is now in a state of chaotic expansion after a supernova explosion first sighted in 1054 A.D. by Chinese astronomers. Slide courtesy of Robert Brunner @ CalTech.

  31. SkyServer.SDSS.org • A modern archive • Raw Pixel data lives in file servers • Catalog data (derived objects) lives in Database • Online query to any and all • Also used for education • 150 hours of online Astronomy • Implicitly teaches data analysis • Interesting things • Spatial data search • Client query interface via Java Applet • Query interface via Emacs • Popular -- 1% of TerraServer • Cloned by other surveys (a template design) • Web services are core of it.

  32. Demo of SkyServer • Shows standard web server • Pixel/image data • Point and click • Explore one object • Explore sets of objects (data mining)

  33. Data Federations of Web Services • Massive datasets live near their owners: • Near the instrument’s software pipeline • Near the applications • Near data knowledge and curation • Super Computer centers become Super Data Centers • Each Archive publishes a web service • Schema: documents the data • Methods on objects (queries) • Scientists get “personalized” extracts • Uniform access to multiple Archives • A common global schema Federation

  34. Federation: SkyQuery.Net • Combine 4 archives initially • Just added 10 more • Send query to portal; portal joins data from archives. • Problem: want to do multi-step data analysis (not just single query). • Solution: allow personal databases on portal • Problem: some queries are monsters • Solution: “batch schedule” on portal server, which deposits the answer in the personal database.
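
The two solutions on this slide (personal databases on the portal, and batch scheduling of monster queries) can be sketched together. This is a hypothetical illustration only: the class `PortalSketch`, the cost estimate passed in by the caller, and the 60-second online limit are assumptions, not the SkyQuery implementation.

```python
from collections import deque


class PortalSketch:
    """Hypothetical portal: quick queries answer online, long ones are
    queued and their results deposited in the user's personal database."""

    def __init__(self, online_limit_s=60):
        self.online_limit_s = online_limit_s   # assumed cutoff for online work
        self.batch_queue = deque()
        self.mydb = {}                         # user -> {table name: rows}

    def submit(self, user, table, run, estimated_s):
        """Run `run()` now if it looks cheap, otherwise queue it for batch."""
        if estimated_s <= self.online_limit_s:
            return run()
        self.batch_queue.append((user, table, run))
        return None

    def run_batch(self):
        """Drain the queue, storing each answer in the owner's personal DB."""
        while self.batch_queue:
            user, table, run = self.batch_queue.popleft()
            self.mydb.setdefault(user, {})[table] = run()


if __name__ == "__main__":
    portal = PortalSketch()
    print(portal.submit("alice", "quick", lambda: [1, 2, 3], estimated_s=2))
    portal.submit("alice", "monster", lambda: list(range(5)), estimated_s=3600)
    portal.run_batch()
    print(portal.mydb["alice"]["monster"])
```

Keeping the batch result in the personal database lets a later query use it as input, which is what makes multi-step analysis possible without moving data off the portal.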

  35. SkyQuery Structure • Each SkyNode publishes: • Schema Web Service • Database Web Service • Portal: • Plans query (2 phase) • Integrates answers • Is itself a web service • Diagram labels: ImageCutout, SkyQuery Portal, 2MASS, INT, SDSS, FIRST

  36. SkyQuery: http://skyquery.net/ • Distributed Query tool using a set of web services • Four astronomy archives from Pasadena, Chicago, Baltimore, Cambridge (England). • Feasibility study, built in 6 weeks • Tanu Malik (JHU CS grad student) • Tamas Budavari (JHU astro postdoc) • With help from Szalay, Thakar, Gray • Implemented in C# and .NET • Allows queries like: SELECT o.objId, o.r, o.type, t.objId FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t WHERE XMATCH(o,t)<3.5 AND AREA(181.3,-0.76,6.5) AND o.type=3 AND (o.i - t.m_j)>2

  37. SkyNode Basic Web Services • Metadata information about resources • Waveband • Sky coverage • Translation of names to universal dictionary (UCD) • Simple search patterns on the resources • Cone Search • Image mosaic • Unit conversions • Simple filtering, counting, histogramming • On-the-fly recalibrations
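
The Cone Search pattern listed on this slide returns the objects within a given angular radius of a sky position. The sketch below shows the geometry only, using the standard haversine formula for great-circle separation; the function names, the dictionary-based catalog, and the sample coordinates are hypothetical, and the linear scan stands in for whatever spatial index a real archive would use.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two (RA, Dec) positions
    given in degrees, computed with the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cone_search(catalog, ra, dec, radius_deg):
    """Return catalog entries within radius_deg of (ra, dec).
    Brute-force scan over a list of dicts with 'ra' and 'dec' keys."""
    return [obj for obj in catalog
            if angular_separation_deg(ra, dec, obj["ra"], obj["dec"]) <= radius_deg]


if __name__ == "__main__":
    # Made-up objects for illustration only.
    catalog = [
        {"id": "a", "ra": 181.30, "dec": -0.76},
        {"id": "b", "ra": 181.35, "dec": -0.76},
        {"id": "c", "ra": 182.00, "dec": -0.76},
    ]
    hits = cone_search(catalog, ra=181.3, dec=-0.76, radius_deg=0.1)
    print([obj["id"] for obj in hits])
```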

  38. Portals: Higher Level Services • Built on Atomic Services • Perform more complex tasks • Examples • Automated resource discovery • Cross-identifications • Photometric redshifts • Outlier detections • Visualization facilities • Goal: • Build custom portals in days from existing building blocks (like today in IRAF or IDL)

  39. MyDB Added to SkyQuery • Let users add personal DB (1 GB for now) • Use it as a workbook • Online and batch queries • Moves analysis to the data • Users can cooperate (share MyDB) • Still exploring this • Diagram labels: ImageCutout, SkyQuery Portal, MyDB, 2MASS, INT, SDSS, FIRST

  40. The Big Picture • Diagram labels: Experiments & Instruments, Other Archives, Literature, Simulations (each a source of facts); questions; answers • The Big Problems: • Data ingest • Managing a petabyte • Common schema • How to organize it? • How to reorganize it? • How to coexist with others? • Query and Vis tools • Support/training • Performance • Execute queries in a minute • Batch query scheduling

  41. Outline • Overview of Microsoft Research • Big-Database Research • TerraServer: Geospatial app • SkyServer: data mining app • Q&A

  42. Grid and Web Services Synergy • I believe the Grid will be many web services that share data (computrons are free) • IETF standards provide • Naming • Authorization / Security / Privacy • Distributed Objects: Discovery, Definition, Invocation, Object Model • Higher level services: workflow, transactions, DB, … • Synergy: commercial Internet & Grid tools
