Time Series Center a next generation search engine using semantics, machine learning, and GPGPU

Time Series Center a next generation search engine using semantics, machine learning, and GPGPU. Pavlos Protopapas (Harvard CfA and SEAS), Rahul Dave, Gabriel Wachman, Matthias Lee, Roni Khardon. time series center: short overview, what it is about and what we do. database, web interface:

Presentation Transcript


  1. Time Series Center a next generation search engine using semantics, machine learning, and GPGPU. Pavlos Protopapas (Harvard CfA and SEAS), Rahul Dave, Gabriel Wachman, Matthias Lee, Roni Khardon

  2. time series center: short overview, what it is about and what we do. database, web interface: data model/database design, web services, web interface (demo). analysis: classification using kernels, results. search engine: morphological searches, GPGPU architecture, web interface (demo). overview

  3. idea: create the largest collection of time series in the world and make interesting scientific discoveries. recipe: 5 tons of data, 1 dozen people with science questions, 2-3 people with skills, 2 tons of hardware. focus: astronomy (light curves = time series). We have other data too, such as labor data, real estate data, heart monitor data, archeological data, brain activity, etc. time series center

  4. Cyclist's heart rate; TCP packets; Cadence Design stock prices; fish catch of the North-East Pacific. time series everywhere

  5. Supporting Exact Indexing of Shapes under Rotation Invariance with Arbitrary Representations and Distance Measures. VLDB 2006: 882-893. Eamonn Keogh · Li Wei · Xiaopeng Xi · Michail Vlachos · Sang-Hee Lee · Pavlos Protopapas

  6. the fish

  7. data. MACHO (microlensing survey): 66 million objects, 1000 flux observations per object in 2 bands (wavelengths). TAOS (outer solar system survey): 100000 objects, 100K flux observations per object, 4 telescopes. ESSENCE (supernovae survey): thousands of objects, about a hundred observations. Minor Planet Center light curves: a few hundred objects, a few hundred observations. Pan-STARRS: billions of objects (video mode data too). OGLE (microlensing and extra-solar planet surveys). MMT variability studies, some HAT-NET. SDSS 82. EROS. SuperMACHO (another microlensing survey): close to a million objects, 100 flux observations per object. DASCH: Digital Access to a Sky Century @ Harvard

  8. astronomy. eclipsing binaries: determining masses and distances to objects. extra-solar planets: either discovery of extra-solar planets or statistical estimates of the abundance of planetary systems. cosmology: supernovae from Pan-STARRS will help determine cosmological constants. asteroids, trans-Neptunian objects, etc.: using occultation signals for the detection of the smaller objects at the edge of the Solar System. AGN: automatic classification of AGN from the time series. variable stars: study of variable stars. microlensing: address the dark matter question. and many more

  9. outlier/anomaly detection: find the anomalous cases. clustering: unsupervised clustering could help identify new subclasses. classification: automatic classification, either supervised or semi-supervised. motif detection: finding patterns, especially at low signal-to-noise ratio. scalability: analyzing a large data set requires efficient algorithms that scale linearly in the number of time series. representation: the feature space representation of the time series (Fourier transform, wavelets, piecewise linear, and symbolic methods). distance metric: determines similarities between time series.

  10. size: data volumes in astronomy, medicine, and other fields are presently exploding. The Time Series Center needs to be prepared for data rates starting in the tens of gigabytes per night, scaling up to terabytes soon. Parallel file systems (GPFS, Lustre, NFS) do not perform well. interplay: the interplay between the algorithms used to study the time series and the database indexing of the time series itself must be optimized. speed: real-time access and real-time processing. distributed computing and disbursement of data: standards (VO), subscription queries, etc. computational challenges

  11. disk: ~100 TB of disk (GPFS, Lustre, NFS). computing nodes: part of the Odyssey cluster at Harvard (~5362 cores). db server: dual core with 16 GB of memory and 2 TB of disk. a few servers for development. exotic: dedicated GPGPU machine with an Nvidia GTX 285 (2 GB, 240 cores, core speed 1400 MHz); GPU cluster (Nikola) with 16 machines, with Nvidia Tesla T10 GPUs attached to each node. web server: 2 dual machines. [who cares] hardware

  12. source: usno_id=xxx, tsc_id=yyy. astobject: A.23.1234, A.123.5632, B.71.12. Survey A, Field 23; Survey A, Field 132; Survey B, Field 71. data model

  13. data model

  14. source: usno_id=xxx, tsc_id=yyy. astobject: SURVEY: MACHO, FIELD: 012, STAR_ID: 87. astobject: SURVEY: OGLE, FIELD: 12, STAR_ID: 0894343. LCOT: BAND: R. LCOT: BAND: B. SNIPPET (shown three times): START_TIME: 01/13/98, END_TIME: 05/1/98. data model
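
One reading of the data model slides is a hierarchy: a source holds one or more survey-specific astobjects, each astobject has per-band LCOT records, and each LCOT is split into time-bounded snippets. The dataclasses below are a minimal sketch of that reading. The class names, field names, and the nesting itself are assumptions taken from the slide labels, not the Time Series Center schema, and the slides do not expand the LCOT abbreviation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Snippet:
    """A time-bounded piece of one band's time series."""
    start_time: str  # kept as the slide's date strings, e.g. "01/13/98"
    end_time: str


@dataclass
class Lcot:
    """Per-band record; 'LCOT' is the slide's label and is not expanded there."""
    band: str
    snippets: List[Snippet] = field(default_factory=list)


@dataclass
class AstObject:
    """One survey's view of a source."""
    survey: str
    field_name: str  # 'FIELD' on the slide; renamed to avoid clashing with dataclasses.field
    star_id: str
    lcots: List[Lcot] = field(default_factory=list)


@dataclass
class Source:
    """Survey-independent object, carrying the cross-survey identifiers."""
    usno_id: str
    tsc_id: str
    astobjects: List[AstObject] = field(default_factory=list)


def example_source() -> Source:
    """Build the slide's example; which astobject owns which band is assumed."""
    macho = AstObject(
        survey="MACHO",
        field_name="012",
        star_id="87",
        lcots=[
            Lcot("R", [Snippet("01/13/98", "05/1/98")]),
            Lcot("B", [Snippet("01/13/98", "05/1/98")]),
        ],
    )
    ogle = AstObject(survey="OGLE", field_name="12", star_id="0894343")
    return Source(usno_id="xxx", tsc_id="yyy", astobjects=[macho, ogle])
```

Keeping the survey-specific identifiers on the astobject and the cross-survey ones (usno_id, tsc_id) on the source is what lets the same star be linked across MACHO and OGLE in this sketch.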

  15. who: entity: Wachman, OGLE. using: algorithm/method: SVM/kNN, CMD. what is it about: source: usno_id=xxx, tsc_id=yyy; astobject: SURVEY: MACHO, FIELD: 012, STAR_ID: 87; LCOT: BAND: R; SNIPPET: START_TIME: 01/13/98, END_TIME: 05/1/98. judgement: Variable: Eclipsing Binary, Cepheid, Quasar. variability data model

  16. Pulsating, Eruptive, Rotating, Intense Variable X-ray Sources, … RR Lyrae, Cepheid, DCEPS, CWB, BCEP, … variable tree

  17. why web services? grandma can program. ASCII and JSON. the web site is a web services client. you can program with wget, curl, perl, python, etc. next? VAO compatibility, i.e. XML. we provide a Python language API, for use inside and outside the cluster. web services

  18. ★ everything is web services ★ every query gets a UID ★ with the UID: summary, page info, results; use it to build extra queries. GET: http://timemachine.iic.harvard.edu/search/lcdb/astobject/filter=survey__exact:ASAS3/ POST: /astobject/justquery/ searchtext: survey = ASAS3;variable = EBI queryname: "Myquery" prevquerycontext: "68b994ca-08a3-4d79-affd-8cc1636aa6c2" privacy: false. web services
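
The GET and POST examples on this slide can be assembled from a script with the Python standard library. The sketch below only builds the URL and the request object and sends nothing. The helper names are hypothetical, and three details are assumptions that the slide does not confirm: that the POST path shares the prefix of the GET URL, that the fields are sent form-encoded, and that privacy is sent as the string "true" or "false".

```python
from urllib.parse import urlencode
from urllib.request import Request

# Prefix of the GET URL shown on the slide.
BASE_URL = "http://timemachine.iic.harvard.edu/search/lcdb"


def build_filter_url(field, lookup, value, base_url=BASE_URL):
    """Build a GET filter URL of the form .../astobject/filter=<field>__<lookup>:<value>/."""
    return f"{base_url}/astobject/filter={field}__{lookup}:{value}/"


def build_justquery_request(searchtext, queryname, prevquerycontext=None,
                            privacy=False, base_url=BASE_URL):
    """Build (but do not send) a POST request carrying the slide's query fields."""
    form = {
        "searchtext": searchtext,
        "queryname": queryname,
        "privacy": "true" if privacy else "false",  # assumed encoding
    }
    if prevquerycontext is not None:
        # UID of an earlier query, used to build on its results.
        form["prevquerycontext"] = prevquerycontext
    body = urlencode(form).encode("utf-8")  # assumed form encoding
    return Request(base_url + "/astobject/justquery/", data=body, method="POST")


url = build_filter_url("survey", "exact", "ASAS3")
request = build_justquery_request(
    "survey = ASAS3;variable = EBI",
    "Myquery",
    prevquerycontext="68b994ca-08a3-4d79-affd-8cc1636aa6c2",
)
```

Passing the UID of a previous query as prevquerycontext is how the slide's "use it to build extra queries" idea would look from a client, under the assumptions above.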

  19. demo timemachine

  20. classification

  21. we have millions of light curves. triangle inequality, tree structures. Euclidean distance between z-normalized light curves Q and C. what is similar?
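
The similarity measure named on this slide, Euclidean distance between z-normalized light curves Q and C, can be sketched in a few lines of plain Python. The function names are illustrative, and the sketch assumes both curves are sampled on the same grid with equal length, which the slide does not state. The last helper shows why the triangle inequality matters for tree structures: the distance from a query to a candidate can be bounded from below using a shared pivot, so some candidates can be skipped without computing their distance.

```python
import math


def z_normalize(values):
    """Shift to zero mean and scale to unit standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0.0:
        return [0.0] * n  # a flat curve carries no shape information
    return [(v - mean) / std for v in values]


def euclidean_distance(q, c):
    """Plain Euclidean distance between two equal-length sequences."""
    if len(q) != len(c):
        raise ValueError("light curves must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))


def znorm_distance(q, c):
    """Euclidean distance after z-normalizing both light curves."""
    return euclidean_distance(z_normalize(q), z_normalize(c))


def lower_bound(d_query_pivot, d_pivot_candidate):
    """Triangle inequality: d(q, c) >= |d(q, p) - d(p, c)| for any pivot p."""
    return abs(d_query_pivot - d_pivot_candidate)


# Two curves with the same shape but different offset and scale are
# (numerically) at distance zero once z-normalized.
same_shape = znorm_distance([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
```

In a search, a candidate whose lower bound already exceeds the best distance found so far cannot be the nearest neighbour, which is the pruning step that a metric tree relies on.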

  22. DEVICE: GeForce GTX 285. 30 multiprocessors, 8 cores per multiprocessor, 240 cores. Each core has 2 ports = 3 instructions. 1.4 GHz • ~0.9 TFLOPS. 512 threads / multiprocessor • 15360 threads • organized in: 1 block per multiprocessor (3D array); grid of blocks. multiprocessor 1: 8 cores, 8 KB shared memory; multiprocessor 2: 8 cores, 8 KB shared memory; … 30. texture memory, constant memory, 2 GBytes DDR2 global memory. HOST (CPU). GPGPU
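
The totals on this slide follow from the per-multiprocessor figures by simple multiplication, as the short calculation below shows. The peak estimate rests on an assumption: it reads the slide's "3 instructions" as three floating-point operations per core per clock cycle. Under that reading the theoretical peak comes out at about 1 TFLOPS, the same order as the ~0.9 TFLOPS quoted on the slide.

```python
MULTIPROCESSORS = 30
CORES_PER_MULTIPROCESSOR = 8
THREADS_PER_MULTIPROCESSOR = 512
CLOCK_GHZ = 1.4
# Assumption: "3 instructions" is read as 3 floating-point operations per cycle.
FLOPS_PER_CORE_PER_CYCLE = 3

total_cores = MULTIPROCESSORS * CORES_PER_MULTIPROCESSOR
max_resident_threads = MULTIPROCESSORS * THREADS_PER_MULTIPROCESSOR
# cores * flops/cycle * GHz gives GFLOPS; divide by 1000 for TFLOPS.
peak_tflops = total_cores * FLOPS_PER_CORE_PER_CYCLE * CLOCK_GHZ / 1000.0

print(total_cores, max_resident_threads, round(peak_tflops, 3))
```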

  23. compile: C/C++ goes to gcc, CUDA goes to nvcc. // Kernel definition: __global__ void VecAdd(float* A, float* B, float* C) { ... } int main() { ... cudaMalloc((void**)&d_A, size); // Kernel invocation: VecAdd<<<1, N>>>(d_A, d_B, d_C); } CUDA

  24. CUDA RESULT

  25. Results

  26. demo fshare

  27. demo fshare

  28. demo fshare

  29. demo fshare

  30. done: built the infrastructure to host data, search, distribute; methodology; computational; new discoveries. doing: more data (EROS, SDSS 82); more hardware (mainly disk), ~50 TB; renormalized db, slice indexing. data model/database design, web services, web interface (demo). analysis: classification using kernels, results. search engine: morphological searches, GPGPU architecture, web interface (demo). summary
