- 480 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about 'Database Systems Research on Data Mining' - salena

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

### Database Systems Research on Data Mining

Reference:

Ordonez, C, Garcia-Garcia, J, Database Systems Research on Data Mining,Proc. ACM SIGMOD 2010, p.000-999 (tutorial).

Global Outline

1. Data mining models and algorithms (JG,15 min)

1.1 Preprocess to get Data set

1.2 Data set

1.3 Data Mining Models

1.4 Data Mining Algorithms

2.Processing alternatives (JG, 35 min)

2.1 Inside DBMS: SQL

2.2 Outside DBMS: MapReduce

2.3 Example

3. Storage and Optimizations (CO, 35 min)

3.1 Layouts: Horizontal and Vertical

3.2 Optimizations: Algorithmic and Systems

1. Data Mining Models & Algorithms

Data set preparation

Data set

Data mining models and patterns

Algorithms

1.1 Data set preparation [CDHHL1999,KDD] [GCBLRVPP1997,JDMKD] [O2004,DMKD]

In practice, 80% of project time

significant SQL code writing; some tools help query writing

iterative: between modeling and data set prep

Little attention in research literature

query optimization mostly in OLAP context

new operators: PIVOT, horizontal aggregations

research issue: can algorithms directly analyze 3NF tables?

Data set preparation [GO2010,DKE] [OG2008,DSS]

Overall goal: getting data set for analysis

Database processing generally required

normalized (many 3NF) databases cannot be directly analyzed

joins, aggregations and pivoting (transposing)

Data cleaning

remove outliers

null replacement

repair referential integrity

Data transformation: categorical columns; rescale; code

1.2 Data set

Data set with n records

Each has attributes: numeric, discrete or both (mixed)

Focus of the tutorial, d dimensions

Generally,

High d makes problem mathematically more difficult

Extra column G/Y

Example of data sethorizontal layout

n=5, d=3 and G/Y

1.3 Data Mining Models [STA1998,SIGMOD]

Models, coming mostly from statistics and machine learning

based on matrix computations, probability and calculus

time dimension not considered

Patterns, mostly combinatorial

association rules, cubes, sequences and graphs

important, but not considered in tutorial

quite different algorithms from models

no strong statistical foundation

Common data mining models [DLR1977,RSS]

Unsupervised:

math: simpler

task: clustering, dimensionality reduction

models: KM, EM, PCA/SVD, FA

statistical tests overlap both

Supervised

math: tuning and validation than unsupervised

task: classification, regression

models: decision trees, Naïve Bayes, Bayes, linear/logistic regression, SVM, neural nets

Data mining models characteristics

Multidimensional

tens, hundreds of dimensions

feature selection and dimensionality reduction

Represented & computed with matrices & vectors

data set: set of vectors or set of records; all numeric, mixed attributes

model: numeric=matrices, discrete: histograms

intermediate computations: matrices and histograms

Data mining Major tasks

Model computation

focus of most research

generally requires matrix computations

complex and slow algorithms (iterative)

large n makes it slower

Scoring data set

assumes model exists

useful for tuning, testing and model exchange

fast: generally requires only one pass over X

research issue: not studied enough in literature

1.4 Data Mining Algorithmsinput and output

Input: data set X with n records, d dimensions

Output: model, quality

Parameters (representative; vary a lot):

k (clusters, principal components, discrete states)

epsilon for stopping (accuracy, convergence, local optima)

feature/variable selection (algorithm dependent, step-wise or now bayesian statistics)

Data Mining Algorithms [ZRL1996,SIGMOD]

Behavior with respect to data set X:

one pass, few passes

multiple passes, convergence, bigger issue (most algorithms)

Time complexity:

Research issues:

preserve time complexity in SQL/MapReduce

incremental learning

2.1 Inside DBMS

- Assumption:
- data records are in the DBMS; exporting slow
- row-based storage (not column-based)
- Programming alternatives:
- SQL and UDFs: SQL code generation (JDBC), precompiled UDFs. Extra: SP, embedded SQL, cursors
- Internal C Code (direct access to file system and mem)
- DBMS advantages:
- important: storage, queries, security
- maybe: recovery, concurrency control, integrity, transactions

Inside DBMSSQL code:CREATE + SELECT, Consider Layout[CDDHW2009,VLDB]

- Vertical layout: A(i,j,v), B(i,j,v)
- A*B: SELECT A.i, B.j
- , sum(A.v * B.v)
- FROM A JOIN B ON A.j = B.i
- GROUP BY A.i, B.j

- CREATE TABLE
- Row storage: Clustered (to group rows of pivoted tables), Block size (for large tables)
- Index: primary (gen. for pk, critical for joins), secondary (may help joins & searches)
- SELECT
- Basic mechanism to write queries; standard across DBMSs, arbitrarily complex queries, including arithmetic expressions

Inside DBMSUser-Defined Function (UDF)

- Classification:
- Scalar UDF
- Aggregate UDF
- Table UDF
- Programming:
- Called in a SELECT statement
- C code or similar language
- API provided by DBMS, in C/C++
- Data type mapping

Inside DBMSUDF pros and cons

- Advantages:
- arrays and flow control
- flexibility in code writing and no side effects
- No need to modify DBMS internal code
- In general, simple data types
- Limitations:
- OS and DBMS architecture dependent, not portable
- No I/O capability, no side effects
- Null handling and fixed memory allocation
- Memory leaks with arrays (matrices): fenced/protected mode

Inside DBMSScalar UDF [DNPT2006,SAC]

Memory allocation in the stack

Returns one value of simple data type

Basic SQL data types (e.g. int, float, char)

May support UDT

Call & return value with every row

Useful for vector operations

Inside DBMSAggregate UDF [JM1998,SIGMOD]

Table scan

Memory allocation in the heap

GROUP BY extend their power

Also require handling nulls

Advantage: parallel & multithreaded processing

Drawback: returns a single value, not a table

DBMSs: SQL Server, PostgreSQL,Teradata, Oracle, DB2, among others

Useful for model computations

Inside DBMSUDF: Aggregate User-Defined Function

1. Initialization: allocates variable storage

2. Accumulate: processes every record, aggregate some value (vector). Bottleneck.

3. Merge: consolidates partial results from multiple threads

4. Terminate: final processing, return result

Inside DBMSTable UDF [BRKPHK2008,SIGMOD]

Main difference with aggregate UDF: returns a table (instead of single value)

Also, it can take several input values

Called in the FROM clause in a SELECT

Stream: no parallel processing, external file

Computation power same as aggregate UDF

Suitable for complex math operations and algorithms

Since result is a table it can be joined

DBMS: SQL Server ,DB2, Oracle,PostgreSQL

Inside DBMSInternal C code [LTWZ2005,SIGMOD], [MYC2005,VLDB] [SD2001,CIKM]

Advantages:

access to file system (table record blocks),

physical operators (scan, join, sort, search)

main memory, data structures, libraries

hardware optimizations: multithreading, multicore, caching RAM, caching LI/L2

Disadvantages:

requires careful integration with rest of system

not available to end users and practitioners

may require exposing functionality with DM language or SQL

Inside DBMSPhysical Operators [DG1992,CACM] [SMAHHH2007,VLDB] [WH2009,SIGMOD]

- Serial DBMS (one CPU, maybe RAID):
- table Scan
- join: hash join, sort merge join, nested loop
- external merge sort
- Parallel DBMS (shared-nothing):
- even row distribution, hashing
- parallel table scan
- parallel joins: large/large (sort-merge, hash); large/short (replicate short)
- distributed sort

2.2 Outside DBMS

- Alternatives:
- MapReduce
- Packages, libraries, Java/C++
- Issue: I/O bottleneck

Outside DBMSMapReduce [DG2008,CACM]

- Parallel processing; simple; shared-nothing
- Functions are programmed in a high-level programming language (e.g. Java, Python); flexible.
- <key,value> pairs processed in two phases:
- map(): computation is distributed and evaluated in parallel; independent mappers
- reduce(): partial results are combined/summarized
- Can be categorized as inside/outside DBMS, depending on level of integration with DBMS
- DBMS integration: Greenplum, Aster Data, Teradata...

Outside DBMSMapReduce Files and Processing

- File Types:
- Text Files: Common storage (e.g. CSV files.)
- SequenceFiles: Efficient processing
- Custom InputFormat (rarely used.)
- Processing:
- Points are sorted by “key” before sending to reducers
- Small files should be merged
- Partial results are stored in file system
- Intermediate files should be managed in SequenceFiles for efficiency

Outside DBMSPackages, libraries, Java/C++ [ZHY2009,CIDR][ZZY2010,ICDE]

- Statistical and data mining packages:
- exported flat files; proprietary file formats
- Memory-based (processing data records, models, internal data structures)
- Programming languages:
- Arrays
- flexibility of control statements
- Limitation: large number of records
- Packages: R, SAS, SPSS, KXEN,Matlab, WEKA

30/60

2.3 Naïve Bayes ExampleHorizontal layout

NB

one pass

Gaussian, sufficient statistics (NLQ)

Example in:

SQL

UDF

MapReduce

Data Structures

public double N;

public double[] L;

public double[] Q;

Naïve BayesSQL (2 passes, n L Q, triangular Q )

/*Lower triangular for PCA and LR*/

SELECT sum(X1*X1), null, ... ,null

,sum(X2*X1), sum(X2*X2), ... ,null

...

,sum(Xd*X1), sum(Xd*X2), ... ,sum(Xd*Xd)

FROM X

3. Storage and Optimizations

- Storage layouts:
- horizontal
- vertical
- Optimizations:
- algorithmic: general
- systems-oriented: SQL and MapReduce

Horizontal LayoutDBMS Data Set

X(i,X1,…,Xd,G/Y)

- Most common format in DM
- Join/Aggregations/arithmetic

expressions, pivot

36/60

Horizontal LayoutDBMS[DM2006,SIGMOD]

- Physical operator (most common):
- Table scan (default in SQL query or UDF)
- External Sort or hash table:
- SQL Group By query
- UDF Group By
- Join algorithm :
- SQL queries
- Not required in UDF

Horizontal LayoutDBMS

- Size of table: n rows
- Limited DDL control in SQL
- limited number of columns
- requires assigning point id i (Primary index)
- Clustered storage by block on i allows processing several rows at the same time

Horizontal LayoutDBMS: X1+X2+X3

String query = ‘SELECT i’;

String sum =‘’;

for( int h = 1 ; h <= d; h++ ) {

query += ‘, X’+h;

sum += ‘X’+h+’+’;

}

query += ‘,‘ +sum.substring(0,sum.length()-1)

+’ FROM X’;

JAVA

double X[4], SUM=0.0;

int i = 1,h=0, d=3;

while( fscanf(fp,"%lf%lf%lf%d",&X[1],&X[2],&X[3]) != EOF )

{

SUM = 0.0;

for( h = 1; h <= d; h++ ) {

SUM += X[h];

}

printf("%d\t%g\r\n",i,SUM);

i++;

}

SELECT i, x1, x2, x3, X1+X2+X3 FROM X

C++

- No arrays.
- Access to dimensions through SQL generation (Java/ C++)

Example:

39/60

Vertical LayoutDBMS-X(i,h,v,g)

When d exceeds the DBMS limits;d>n

Index by point i and dimension h

Clustered row storage by i (correctness in UDF, efficiency in SQL query)

Queries require: joins & aggregations

Columns as subscripts

Size <= dn rows (sparse)

Two tables with n rows with same PK can be joined in time O(n) using hash join

40/60

Vertical LayoutDBMS: X1+X2+X3

double X, SUM = 0.0;

int h, old_i = 0, i;

while( fscanf(fp,"%d%d%lf",&i,&h,&X) != EOF ) {

if ( old_i != i ) {

if ( old_i != 0 )

printf("%d\t%g\r\n",old_i,SUM);

old_i = i;

SUM = X;

} else {

SUM += X;

}

}

printf("%d\t%g\r\n",i,SUM);

String query = ‘SELECT i, SUM(v) FROM X’;

query += ‘GROUP BY i’

SELECT i, SUM(v) FROM X GROUP BY i

JAVA

C++

Requires using UDF functions

SQL statements using [index] joins are required

Horizontal LayoutMapReduce

X.csv

Line number represents the point ID (implicit)

No indexes in general

Flat file

42/60

3.2 OptimizationsAlgorithmic & Systems

Algorithmic

90% research, many efficient algorithms

accelerate/reduce computations or convergence

database systems focus: reduce I/O

approximate solutions

Systems (SQL, MapReduce)

Platform: parallel DBMS server vs cluster of computers

Programming: SQL/C++ versus Java

Algorithmic [ZRL1996,SIGMOD]

Implementation: data set available as flat file, binary file required for random access

May require data structures working in main memory and disk

Programming not in SQL: C/C++ are preferred languages, although Java becoming common

MapReduce is becoming popular

Assumption d<<n: n has received more attention

Issue: d>n produces numerical issues and large covariance/correlation matrix (larger than X)

Algorithmic Optimizations [STA1998,SIGMOD] [ZRL1996,SIGMOD][O2007,SIGMOD]

Exact model computation:

summaries: sufficient statistics (Gaussian pdf), histograms, discretization

accelerate convergence, reduce iterations

faster matrix operations: * +

Approximate model computation:

Sampling: efficient in time O(s)

Incremental:

math: escape local optima (EM), reseed

database systems: favor table scan

Systems OptimizationsDBMS [O2006,TKDE], [ORD2010,TKDE]

SQL query optimization

mathematical equations as queries

Turing-complete: SQL code generation and programming language

UDFs as optimization

substitute key mathematical operations

push processing into RAM memory

Systems OptimizationsDBMS SQL query [O2004,DMKD]

Denormalization

Issue: Query rewriting (optimizer falls short)

Index depends on layout

Horizontal layout:

indexed by i

d may be an issue, thus vertical partition

Vertical layout:

storage: clustered by point

indexing by subscript

Use specific join algorithm

Systems OptimizationsDBMS SQL query [O2006,TKDE] [OP2010,TKDE],[OP2010,DKE] ,[MC2002,ICDM]

Join:

denormalized storage: model, intermediate tables

favor hash joins over mrg-srt: both tables PI on i

secondary indexing for join: sort-merge join

Aggregation (compression):

push group-by before join: watch out nulls and high cardinality columns like point i

synchronized table scans: several SELECTs on same table; examples: unpivoting; 2+ models

Sampling: O(s), random access, truly random; error

Systems Optimization DBMS UDF [HLS2005,TODS] [O2007,TKDE]

UDFs can substitute SQL code

UDFs can express complex math computations

Scalar UDFs: vector operations

Aggregate UDFs: compute data set summaries in parallel

Table UDFs: stream model; external temporary file

MapReduce[ABASR2009,VLDB] [CDDHW2009,VLDB][SADMPPR2010,CACM]

Data set

keys as input, partition data set

text versus sequential file

loading into file system may be required

Parallel processing

high cardinality keys: i

handle skewed distributions

reduce row redistribution in Map( )

Main memory processing

MapReduceProcessing[DG2008,CACM][FPC2009,PVLDB] [PHBB2009,PVLDB]

Modify Block Size

Disable Block Replication

Delay reduce()

Tune M and R (memory allocation and number)

Several M use the same R

Avoid full table scans by using subfiles (requires naming convention)

combine() in map() to shrink intermediate files

SequenceFiles as input with custom data types.

MapReduceIssues

Loading, converting to binary may be necessary

Input key generally OK if high cardinality

Skewed map key distribution

Key redistribution (lot of message passing)

SQL vs MapReduceProcessing & I/O Bottleneck (bulk load) [PPRADMS2009,SIGMOD] [O2010,TKDE]

Import and Model Computation Times for SQL and MR (times in secs).

*MR times include conversion into a SequenceFile.

Research Issues Both: SQL and MapReduce [BFR1998,KDD], [CFB1999,ICDE][SADMPPR2010,CACM]

Fast data mining algorithms solved? Yes, but not considering data sets are stored in a DBMS

SQL and MR have many similarities: shared-nothing

Fast load/unload interfaces between both systems; tighter integration

General tradeoffs in speed and programming: horizontal vs vertical layout

Incremental algorithms

one pass (streams) versus parallel processing

reduce passes/iterations

Research Issues on Each [ABASR2009,VLDB], [CDDHW2009,VLDB [CKLRSS2009,VLDB]

DBMS:

C++/Java libraries generating SQL code, pushing processing: Oracle, Teradata, SAS, KXEN

Internal C code: commercial DBMSs, open-source?

Study aggregate UDFs for complex models; extend Table UDF support: I/O bottleneck, streams

Extend SQL with more DM primitives and constructs or forget extending SQL for DM?

Specialized DBMS, middleware: SciDB, RIOT

MapReduce:

SQL+MapReduce: Greenplum, Aster, Teradata

MapReduce only: Mahout

MapReduce for query processing and data mining: especially joins, aggregations OK

Thank you… Q&A

Special thanks:

Carlos Garcia-Alvarado

DBMS Group at UH:

Sasi K. Pitchaimalai

Mario Navas

Zhibo Chen

References

[ABASR2009,VLDB] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proc. VLDB Endow., pages 922-933, 2009.

[BFR1998,KDD] P. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In ACM KDD Conference, pages 9-15, 1998.

[BRKPHK2008,SIGMOD] J.A Blakeley, V. Rao, I. Kunen, A. Prout, M. Henaire, and C. Kleinerman. .NET database programmability and extensibility in microsoft SQL server. In ACM SIGMOD, pages 1087-1098. 2008.

[CFB1999,ICDE] S. Chaudhuri, U. Fayyad, and J. Bernhardt. Scalable classification over SQL databases. ICDE, 00:470, 1999.

[CDHHL1999,KDD] J. Clear, D. Dunn, B. Harvey, M.L. Heytens, and P. Lohman. Non-stop SQL/MX primitives for knowledge discovery. In ACM KDD Conference, pages 425-429, 1999.

[CDDHW2009,VLDB] J. Cohen, B. Dolan, M. Dunlap, J. Hellerstein, and C. Welton. MAD skills: New analysis practices for big data. In VLDB Conference, pages 1481-1492, 2009.

[CKLRSS2009,VLDB] A demonstration of SciDB: a science-oriented DBMS. In VLDB Conference, pages 1534-1537,2009.

[DG2008,CACM] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107-113,2008.

[DLR1977,RSS] A.P. Dempster, N.M. Laird, and D. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of The Royal Statistical Society, 39(1):1-38, 1977.

[DM2006,SIGMOD] A. Deshpande and S. Madden. MauveDB: supporting model-based user views in database systems. In SIGMOD Conference, pages 73-84, 2006.

References

[DG1992,CACM] D. DeWitt, J. Gray. Parallel database systems: the future of high performance database systems. In Communications of the ACM, 35(6): 85-98, 1992.

[DNPT2006,SAC] A. Dorneich, R. Natarajan, E.P.D. Pednault, and F. Tipu. Embedded predictive modeling in a parallel relational database. In SAC, pages 569-574, 2006.

[FPC2009,PVLDB] E. Friedman, P. Pawlowski, and J. Cieslewicz. SQL/MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions. PVLDB, 2(2):1402-1413, 2009.

[GO2010,DKE] J. García-García, C. Ordonez: Extended aggregations for databases with referential integrity issues. Data Knowl. Eng. 69(1): 73-95 (2010).

[GCBLRVPP1997,JDMKD] J. Gray and S. Chaudhuri and A. Bosworth and A. Layman and D. Reichart and M. Venkatrao and F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. J. Data Mining and Knowledge Discovery., 1(1):29-53,1997.

[HLS2005,TODS] Z. He, B. S. Lee, and R. Snapp. Self-tuning cost modeling of user-defined functions in an object-relational DBMS. ACM Trans. Database Syst., 30(3):812-853, 2005.

[JM1998,SIGMOD] M. Jaedicke and B. Mitschang. On parallel processing of aggregate and scalar functions in object-relational DBMS. In ACM SIGMOD Conference, pages 379-389, 1998.

[LTWZ2005,SIGMOD] C. Luo, H. Thakkar, H. Wang, and C. Zaniolo. A native extension of SQL for mining data streams. In ACM SIGMOD, pages 873-875, New York, NY, USA, 2005.

[MC2002,ICDM] B.L. Milenova and M.M. Campos. O-cluster: Scalable clustering of large high dimensional data sets. In Proc. IEEE ICDM Conference, page 290, Washington, DC, USA, 2002.

[MYC2005,VLDB] B.L. Milenova, J. Yarmus, and M.M. Campos. SVM in Oracle database 10g: Removing the barriers to widespread adoption of support vector machines. In VLDB Conference, 2005.

[NCFB2001,ICDE] A. Netz, S. Chaudhuri, U. Fayyad, and J. Berhardt. Integrating data mining with SQL databases: OLE DB for data mining. In IEEE ICDE Conference, pages 379-387, 2001.

References

[O2004,DMKD] C. Ordonez. Horizontal aggregations for building tabular data sets. In ACM SIGMOD Data Mining & Knowledge Discovery Workshop (DMKD), pages 35-42, 2004.

[O2006,TKDE] C. Ordonez. Integrating K-means clustering with a relational DBMS using SQL. IEEE Transactions on Knowledge and Data Engineering (TKDE), 18(2):188-201, 2006.

[O2007,SIGMOD] C.Ordonez. Building Statistical Models and Scoring with UDFs. In SIGMOD Conference, pages 1005-1016, 2007.

[O2010,TKDE] C. Ordonez. Statistical Model Computation with UDFs. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2010

[OP2010,TKDE] C. Ordonez, S.K. Pitchaimalai. Bayesian Classifiers Programmed in SQL. IEEE Transactions on Knowledge and Data Engineering (TKDE), 22(1):139-144, 2010.

[OP2010,DKE] C. Ordonez, S.K. Pitchaimalai. Fast UDFs to Compute Sufficient Statistics on Large Data Sets exploiting Caching and Sampling, Data and Knowledge Engineering Journal (DKE), 2010.

[OG2008,DSS] C. Ordonez, J. García-García: Referential integrity quality metrics. Decision Support Systems 44 (2): 495-508 (2008)

[O2003,JLINUX] M. Owens. Embedding an SQL database with SQLite. Linux J., 2003(110):2, 2003.

[PHBB2009,PVLDB] B. Panda, J. Herbach, S. Basu, and R.J. Bayardo. PLANET: Massively parallel learning of tree ensembles with MapReduce. PVLDB, 2(2):1426-1437, 2009.

[PPRADMS2009,SIGMOD] A. Pavlo, E. Paulson, A. Rasin, D. Abadi, D.J. DeWitt, S. Madden, and Stonebraker, M. A comparison of approaches to large-scale data analysis. In SIGMOD Conference, pages 165-178, 2009.

[STA1998,SIGMOD] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: alternatives and implications. In ACM SIGMOD, pages 343-354, 1998.

[SD2001,CIKM] K. Sattler and O. Dunemann. SQL database primitives for decision tree classifiers. In ACM CIKM Conference, pages 379-386, 2001.

References

[NCFB2001,ICDE] A. Netz, S. Chaudhuri, U. Fayyad, and J. Berhardt. Integrating data mining with SQL databases: OLE DB for data mining. In IEEE ICDE Conference, pages 379-387, 2001.

[SMAHHH2007,VLDB] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. The end of an architectural era: (it's time for a complete rewrite). In VLDB, pages 1150-1160, 2007.

[WH2009,SIGMOD] F. M. Waas and J. M. Hellerstein. Parallelizing extensible query optimizers. In SIGMOD Conference, pages 871-878, 2009.

[ZHY2009,CIDR] Y. Zhang, H. Herodotou, and J. Yang. Riot: I/O-efficient numerical computing without SQL. In CIDR, 2009.

[ZZY2010,ICDE] Y. Zhang, W. Zhang, J. Yang. I/O-Efficient Statistical Computing with RIOT. In ICDE, 2010.

[ZRL1996,SIGMOD] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In ACM SIGMOD Conference, pages 103-114, 1996.

Download Presentation

Connecting to Server..