Database systems research on data mining
Download
1 / 65

database systems research on data mining - PowerPoint PPT Presentation


  • 472 Views
  • Updated On :

Database Systems Research on Data Mining. Reference: Ordonez, C, Garcia-Garcia, J, Database Systems Research on Data Mining, Proc. ACM SIGMOD 2010, p.000-999 (tutorial). Global Outline. 1. Data mining models and algorithms (JG,15 min) 1.1 Preprocess to get Data set 1.2 Data set

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'database systems research on data mining' - salena


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Database systems research on data mining

Database Systems Research on Data Mining

Reference:

Ordonez, C, Garcia-Garcia, J, Database Systems Research on Data Mining,Proc. ACM SIGMOD 2010, p.000-999 (tutorial).


Global outline
Global Outline

1. Data mining models and algorithms (JG,15 min)

1.1 Preprocess to get Data set

1.2 Data set

1.3 Data Mining Models

1.4 Data Mining Algorithms

2.Processing alternatives (JG, 35 min)

2.1 Inside DBMS: SQL

2.2 Outside DBMS: MapReduce

2.3 Example

3. Storage and Optimizations (CO, 35 min)

3.1 Layouts: Horizontal and Vertical

3.2 Optimizations: Algorithmic and Systems


1 data mining models algorithms
1. Data Mining Models & Algorithms

Data set preparation

Data set

Data mining models and patterns

Algorithms


1 1 data set preparation cdhhl1999 kdd gcblrvpp1997 jdmkd o2004 dmkd
1.1 Data set preparation [CDHHL1999,KDD] [GCBLRVPP1997,JDMKD] [O2004,DMKD]

In practice, 80% of project time

significant SQL code writing; some tools help query writing

iterative: between modeling and data set prep

Little attention in research literature

query optimization mostly in OLAP context

new operators: PIVOT, horizontal aggregations

research issue: can algorithms directly analyze 3NF tables?


Data set preparation go2010 dke og2008 dss
Data set preparation [GO2010,DKE] [OG2008,DSS]

Overall goal: getting data set for analysis

Database processing generally required

normalized (many 3NF) databases cannot be directly analyzed

joins, aggregations and pivoting (transposing)

Data cleaning

remove outliers

null replacement

repair referential integrity

Data transformation: categorical columns; rescale; code



1 2 data set
1.2 Data set

Data set with n records

Each has attributes: numeric, discrete or both (mixed)

Focus of the tutorial, d dimensions

Generally,

High d makes problem mathematically more difficult

Extra column G/Y


Example of data set horizontal layout
Example of data sethorizontal layout

n=5, d=3 and G/Y


1 3 data mining models sta1998 sigmod
1.3 Data Mining Models [STA1998,SIGMOD]

Models, coming mostly from statistics and machine learning

based on matrix computations, probability and calculus

time dimension not considered

Patterns, mostly combinatorial

association rules, cubes, sequences and graphs

important, but not considered in tutorial

quite different algorithms from models

no strong statistical foundation


Common data mining models dlr1977 rss
Common data mining models [DLR1977,RSS]

Unsupervised:

math: simpler

task: clustering, dimensionality reduction

models: KM, EM, PCA/SVD, FA

statistical tests overlap both

Supervised

math: tuning and validation than unsupervised

task: classification, regression

models: decision trees, Naïve Bayes, Bayes, linear/logistic regression, SVM, neural nets


Data mining models characteristics
Data mining models characteristics

Multidimensional

tens, hundreds of dimensions

feature selection and dimensionality reduction

Represented & computed with matrices & vectors

data set: set of vectors or set of records; all numeric, mixed attributes

model: numeric=matrices, discrete: histograms

intermediate computations: matrices and histograms



Data mining major tasks
Data mining Major tasks

Model computation

focus of most research

generally requires matrix computations

complex and slow algorithms (iterative)

large n makes it slower

Scoring data set

assumes model exists

useful for tuning, testing and model exchange

fast: generally requires only one pass over X

research issue: not studied enough in literature


1 4 data mining algorithms input and output
1.4 Data Mining Algorithmsinput and output

Input: data set X with n records, d dimensions

Output: model, quality

Parameters (representative; vary a lot):

k (clusters, principal components, discrete states)

epsilon for stopping (accuracy, convergence, local optima)

feature/variable selection (algorithm dependent, step-wise or now bayesian statistics)


Data mining algorithms zrl1996 sigmod
Data Mining Algorithms [ZRL1996,SIGMOD]

Behavior with respect to data set X:

one pass, few passes

multiple passes, convergence, bigger issue (most algorithms)

Time complexity:

Research issues:

preserve time complexity in SQL/MapReduce

incremental learning


2 processing alternatives
2. Processing alternatives

2.1 Inside DBMS (SQL)

2.2 Outside DBMS (MapReduce)

2.3 Example


2 1 inside dbms
2.1 Inside DBMS

  • Assumption:

    • data records are in the DBMS; exporting slow

    • row-based storage (not column-based)

  • Programming alternatives:

    • SQL and UDFs: SQL code generation (JDBC), precompiled UDFs. Extra: SP, embedded SQL, cursors

    • Internal C Code (direct access to file system and mem)

  • DBMS advantages:

    • important: storage, queries, security

    • maybe: recovery, concurrency control, integrity, transactions


Inside dbms sql code create select consider layout cddhw2009 vldb
Inside DBMSSQL code:CREATE + SELECT, Consider Layout[CDDHW2009,VLDB]

  • Vertical layout: A(i,j,v), B(i,j,v)

    • A*B: SELECT A.i, B.j

    • , sum(A.v * B.v)

    • FROM A JOIN B ON A.j = B.i

    • GROUP BY A.i, B.j

  • CREATE TABLE

    • Row storage: Clustered (to group rows of pivoted tables), Block size (for large tables)

    • Index: primary (gen. for pk, critical for joins), secondary (may help joins & searches)

  • SELECT

    • Basic mechanism to write queries; standard across DBMSs, arbitrarily complex queries, including arithmetic expressions


Inside dbms user defined function udf
Inside DBMSUser-Defined Function (UDF)

  • Classification:

    • Scalar UDF

    • Aggregate UDF

    • Table UDF

  • Programming:

    • Called in a SELECT statement

    • C code or similar language

    • API provided by DBMS, in C/C++

    • Data type mapping


Inside dbms udf pros and cons
Inside DBMSUDF pros and cons

  • Advantages:

    • arrays and flow control

    • flexibility in code writing and no side effects

    • No need to modify DBMS internal code

    • In general, simple data types

  • Limitations:

    • OS and DBMS architecture dependent, not portable

    • No I/O capability, no side effects

    • Null handling and fixed memory allocation

    • Memory leaks with arrays (matrices): fenced/protected mode


Inside dbms scalar udf dnpt2006 sac
Inside DBMSScalar UDF [DNPT2006,SAC]

Memory allocation in the stack

Returns one value of simple data type

Basic SQL data types (e.g. int, float, char)

May support UDT

Call & return value with every row

Useful for vector operations


Inside dbms aggregate udf jm1998 sigmod
Inside DBMSAggregate UDF [JM1998,SIGMOD]

Table scan

Memory allocation in the heap

GROUP BY extend their power

Also require handling nulls

Advantage: parallel & multithreaded processing

Drawback: returns a single value, not a table

DBMSs: SQL Server, PostgreSQL,Teradata, Oracle, DB2, among others

Useful for model computations


Inside dbms udf aggregate user defined function
Inside DBMSUDF: Aggregate User-Defined Function

1. Initialization: allocates variable storage

2. Accumulate: processes every record, aggregate some value (vector). Bottleneck.

3. Merge: consolidates partial results from multiple threads

4. Terminate: final processing, return result


Inside dbms table udf brkphk2008 sigmod
Inside DBMSTable UDF [BRKPHK2008,SIGMOD]

Main difference with aggregate UDF: returns a table (instead of single value)

Also, it can take several input values

Called in the FROM clause in a SELECT

Stream: no parallel processing, external file

Computation power same as aggregate UDF

Suitable for complex math operations and algorithms

Since result is a table it can be joined

DBMS: SQL Server ,DB2, Oracle,PostgreSQL


Inside dbms internal c code ltwz2005 sigmod myc2005 vldb sd2001 cikm
Inside DBMSInternal C code [LTWZ2005,SIGMOD], [MYC2005,VLDB] [SD2001,CIKM]

Advantages:

access to file system (table record blocks),

physical operators (scan, join, sort, search)

main memory, data structures, libraries

hardware optimizations: multithreading, multicore, caching RAM, caching LI/L2

Disadvantages:

requires careful integration with rest of system

not available to end users and practitioners

may require exposing functionality with DM language or SQL


Inside dbms physical operators dg1992 cacm smahhh2007 vldb wh2009 sigmod
Inside DBMSPhysical Operators [DG1992,CACM] [SMAHHH2007,VLDB] [WH2009,SIGMOD]

  • Serial DBMS (one CPU, maybe RAID):

    • table Scan

    • join: hash join, sort merge join, nested loop

    • external merge sort

  • Parallel DBMS (shared-nothing):

    • even row distribution, hashing

    • parallel table scan

    • parallel joins: large/large (sort-merge, hash); large/short (replicate short)

    • distributed sort


2 2 outside dbms
2.2 Outside DBMS

  • Alternatives:

    • MapReduce

    • Packages, libraries, Java/C++

  • Issue: I/O bottleneck


Outside dbms mapreduce dg2008 cacm
Outside DBMSMapReduce [DG2008,CACM]

  • Parallel processing; simple; shared-nothing

  • Functions are programmed in a high-level programming language (e.g. Java, Python); flexible.

  • <key,value> pairs processed in two phases:

    • map(): computation is distributed and evaluated in parallel; independent mappers

    • reduce(): partial results are combined/summarized

  • Can be categorized as inside/outside DBMS, depending on level of integration with DBMS

  • DBMS integration: Greenplum, Aster Data, Teradata...


Outside dbms mapreduce files and processing
Outside DBMSMapReduce Files and Processing

  • File Types:

    • Text Files: Common storage (e.g. CSV files.)

    • SequenceFiles: Efficient processing

    • Custom InputFormat (rarely used.)

  • Processing:

    • Points are sorted by “key” before sending to reducers

    • Small files should be merged

    • Partial results are stored in file system

    • Intermediate files should be managed in SequenceFiles for efficiency


Outside dbms packages libraries java c zhy2009 cidr zzy2010 icde
Outside DBMSPackages, libraries, Java/C++ [ZHY2009,CIDR][ZZY2010,ICDE]

  • Statistical and data mining packages:

    • exported flat files; proprietary file formats

    • Memory-based (processing data records, models, internal data structures)

  • Programming languages:

    • Arrays

    • flexibility of control statements

  • Limitation: large number of records

  • Packages: R, SAS, SPSS, KXEN,Matlab, WEKA

30/60


2 3 na ve bayes example horizontal layout
2.3 Naïve Bayes ExampleHorizontal layout

NB

one pass

Gaussian, sufficient statistics (NLQ)

Example in:

SQL

UDF

MapReduce

Data Structures

public double N;

public double[] L;

public double[] Q;


Na ve bayes sql 2 passes n l q triangular q
Naïve BayesSQL (2 passes, n L Q, triangular Q )

/*Lower triangular for PCA and LR*/

SELECT sum(X1*X1), null, ... ,null

,sum(X2*X1), sum(X2*X2), ... ,null

...

,sum(Xd*X1), sum(Xd*X2), ... ,sum(Xd*Xd)

FROM X


Na ve bayes aggregate udf 1 pass
Naïve BayesAggregate UDF (1 pass)


Na ve bayes mapreduce text file unoptimized
Naïve Bayes MapReduce (text file, unoptimized)


3 storage and optimizations
3. Storage and Optimizations

  • Storage layouts:

    • horizontal

    • vertical

  • Optimizations:

    • algorithmic: general

    • systems-oriented: SQL and MapReduce


Horizontal layout dbms data set
Horizontal LayoutDBMS Data Set

X(i,X1,…,Xd,G/Y)

  • Most common format in DM

  • Join/Aggregations/arithmetic

    expressions, pivot

36/60


Horizontal layout dbms dm2006 sigmod
Horizontal LayoutDBMS[DM2006,SIGMOD]

  • Physical operator (most common):

    • Table scan (default in SQL query or UDF)

  • External Sort or hash table:

    • SQL Group By query

    • UDF Group By

  • Join algorithm :

    • SQL queries

    • Not required in UDF


Horizontal layout dbms
Horizontal LayoutDBMS

  • Size of table: n rows

  • Limited DDL control in SQL

    • limited number of columns

    • requires assigning point id i (Primary index)

  • Clustered storage by block on i allows processing several rows at the same time


Horizontal layout dbms x1 x2 x3
Horizontal LayoutDBMS: X1+X2+X3

String query = ‘SELECT i’;

String sum =‘’;

for( int h = 1 ; h <= d; h++ ) {

query += ‘, X’+h;

sum += ‘X’+h+’+’;

}

query += ‘,‘ +sum.substring(0,sum.length()-1)

+’ FROM X’;

JAVA

double X[4], SUM=0.0;

int i = 1,h=0, d=3;

while( fscanf(fp,"%lf%lf%lf%d",&X[1],&X[2],&X[3]) != EOF )

{

SUM = 0.0;

for( h = 1; h <= d; h++ ) {

SUM += X[h];

}

printf("%d\t%g\r\n",i,SUM);

i++;

}

SELECT i, x1, x2, x3, X1+X2+X3 FROM X

C++

  • No arrays.

  • Access to dimensions through SQL generation (Java/ C++)

    Example:

39/60


Vertical layout dbms x i h v g
Vertical LayoutDBMS-X(i,h,v,g)

When d exceeds the DBMS limits;d>n

Index by point i and dimension h

Clustered row storage by i (correctness in UDF, efficiency in SQL query)

Queries require: joins & aggregations

Columns as subscripts

Size <= dn rows (sparse)

Two tables with n rows with same PK can be joined in time O(n) using hash join

40/60


Vertical layout dbms x1 x2 x3
Vertical LayoutDBMS: X1+X2+X3

double X, SUM = 0.0;

int h, old_i = 0, i;

while( fscanf(fp,"%d%d%lf",&i,&h,&X) != EOF ) {

if ( old_i != i ) {

if ( old_i != 0 )

printf("%d\t%g\r\n",old_i,SUM);

old_i = i;

SUM = X;

} else {

SUM += X;

}

}

printf("%d\t%g\r\n",i,SUM);

String query = ‘SELECT i, SUM(v) FROM X’;

query += ‘GROUP BY i’

SELECT i, SUM(v) FROM X GROUP BY i

JAVA

C++

Requires using UDF functions

SQL statements using [index] joins are required


Horizontal layout mapreduce
Horizontal LayoutMapReduce

X.csv

Line number represents the point ID (implicit)

No indexes in general

Flat file

42/60



3 2 optimizations algorithmic systems
3.2 OptimizationsAlgorithmic & Systems

Algorithmic

90% research, many efficient algorithms

accelerate/reduce computations or convergence

database systems focus: reduce I/O

approximate solutions

Systems (SQL, MapReduce)

Platform: parallel DBMS server vs cluster of computers

Programming: SQL/C++ versus Java


Algorithmic zrl1996 sigmod
Algorithmic [ZRL1996,SIGMOD]

Implementation: data set available as flat file, binary file required for random access

May require data structures working in main memory and disk

Programming not in SQL: C/C++ are preferred languages, although Java becoming common

MapReduce is becoming popular

Assumption d<<n: n has received more attention

Issue: d>n produces numerical issues and large covariance/correlation matrix (larger than X)


Algorithmic optimizations sta1998 sigmod zrl1996 sigmod o2007 sigmod
Algorithmic Optimizations [STA1998,SIGMOD] [ZRL1996,SIGMOD][O2007,SIGMOD]

Exact model computation:

summaries: sufficient statistics (Gaussian pdf), histograms, discretization

accelerate convergence, reduce iterations

faster matrix operations: * +

Approximate model computation:

Sampling: efficient in time O(s)

Incremental:

math: escape local optima (EM), reseed

database systems: favor table scan


Systems optimizations dbms o2006 tkde ord2010 tkde
Systems OptimizationsDBMS [O2006,TKDE], [ORD2010,TKDE]

SQL query optimization

mathematical equations as queries

Turing-complete: SQL code generation and programming language

UDFs as optimization

substitute key mathematical operations

push processing into RAM memory


Systems optimizations dbms sql query o2004 dmkd
Systems OptimizationsDBMS SQL query [O2004,DMKD]

Denormalization

Issue: Query rewriting (optimizer falls short)

Index depends on layout

Horizontal layout:

indexed by i

d may be an issue, thus vertical partition

Vertical layout:

storage: clustered by point

indexing by subscript

Use specific join algorithm


Systems optimizations dbms sql query o2006 tkde op2010 tkde op2010 dke mc2002 icdm
Systems OptimizationsDBMS SQL query [O2006,TKDE] [OP2010,TKDE],[OP2010,DKE] ,[MC2002,ICDM]

Join:

denormalized storage: model, intermediate tables

favor hash joins over mrg-srt: both tables PI on i

secondary indexing for join: sort-merge join

Aggregation (compression):

push group-by before join: watch out nulls and high cardinality columns like point i

synchronized table scans: several SELECTs on same table; examples: unpivoting; 2+ models

Sampling: O(s), random access, truly random; error


Na ve bayes sql optimized
Naïve BayesSQL (optimized)


Systems optimization dbms udf hls2005 tods o2007 tkde
Systems Optimization DBMS UDF [HLS2005,TODS] [O2007,TKDE]

UDFs can substitute SQL code

UDFs can express complex math computations

Scalar UDFs: vector operations

Aggregate UDFs: compute data set summaries in parallel

Table UDFs: stream model; external temporary file


Na ve bayes aggregate udf optimized 1 pass same as before
Naïve BayesAggregate UDF (optimized, 1 pass, same as before)


Mapreduce abasr2009 vldb cddhw2009 vldb sadmppr2010 cacm
MapReduce[ABASR2009,VLDB] [CDDHW2009,VLDB][SADMPPR2010,CACM]

Data set

keys as input, partition data set

text versus sequential file

loading into file system may be required

Parallel processing

high cardinality keys: i

handle skewed distributions

reduce row redistribution in Map( )

Main memory processing


Mapreduce processing dg2008 cacm fpc2009 pvldb phbb2009 pvldb
MapReduceProcessing[DG2008,CACM][FPC2009,PVLDB] [PHBB2009,PVLDB]

Modify Block Size

Disable Block Replication

Delay reduce()

Tune M and R (memory allocation and number)

Several M use the same R

Avoid full table scans by using subfiles (requires naming convention)

combine() in map() to shrink intermediate files

SequenceFiles as input with custom data types.


Mapreduce issues
MapReduceIssues

Loading, converting to binary may be necessary

Input key generally OK if high cardinality

Skewed map key distribution

Key redistribution (lot of message passing)


Mapreduce optimized
MapReduceOptimized


Sql vs mapreduce processing i o bottleneck bulk load ppradms2009 sigmod o2010 tkde
SQL vs MapReduceProcessing & I/O Bottleneck (bulk load) [PPRADMS2009,SIGMOD] [O2010,TKDE]

Import and Model Computation Times for SQL and MR (times in secs).

*MR times include conversion into a SequenceFile.


Systems optimizations sql vs mr optimized versions run same hardware
Systems optimizationsSQL vs MR (optimized versions, run same hardware)


Research issues both sql and mapreduce bfr1998 kdd cfb1999 icde sadmppr2010 cacm
Research Issues Both: SQL and MapReduce [BFR1998,KDD], [CFB1999,ICDE][SADMPPR2010,CACM]

Fast data mining algorithms solved? Yes, but not considering data sets are stored in a DBMS

SQL and MR have many similarities: shared-nothing

Fast load/unload interfaces between both systems; tighter integration

General tradeoffs in speed and programming: horizontal vs vertical layout

Incremental algorithms

one pass (streams) versus parallel processing

reduce passes/iterations


Research issues on each abasr2009 vldb cddhw2009 vldb cklrss2009 vldb
Research Issues on Each [ABASR2009,VLDB], [CDDHW2009,VLDB [CKLRSS2009,VLDB]

DBMS:

C++/Java libraries generating SQL code, pushing processing: Oracle, Teradata, SAS, KXEN

Internal C code: commercial DBMSs, open-source?

Study aggregate UDFs for complex models; extend Table UDF support: I/O bottleneck, streams

Extend SQL with more DM primitives and constructs or forget extending SQL for DM?

Specialized DBMS, middleware: SciDB, RIOT

MapReduce:

SQL+MapReduce: Greenplum, Aster, Teradata

MapReduce only: Mahout

MapReduce for query processing and data mining: especially joins, aggregations OK


Thank you q a
Thank you… Q&A

Special thanks:

Carlos Garcia-Alvarado

DBMS Group at UH:

Sasi K. Pitchaimalai

Mario Navas

Zhibo Chen


References
References

[ABASR2009,VLDB] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proc. VLDB Endow., pages 922-933, 2009.

[BFR1998,KDD] P. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In ACM KDD Conference, pages 9-15, 1998.

[BRKPHK2008,SIGMOD] J.A Blakeley, V. Rao, I. Kunen, A. Prout, M. Henaire, and C. Kleinerman. .NET database programmability and extensibility in microsoft SQL server. In ACM SIGMOD, pages 1087-1098. 2008.

[CFB1999,ICDE] S. Chaudhuri, U. Fayyad, and J. Bernhardt. Scalable classification over SQL databases. ICDE, 00:470, 1999.

[CDHHL1999,KDD] J. Clear, D. Dunn, B. Harvey, M.L. Heytens, and P. Lohman. Non-stop SQL/MX primitives for knowledge discovery. In ACM KDD Conference, pages 425-429, 1999.

[CDDHW2009,VLDB] J. Cohen, B. Dolan, M. Dunlap, J. Hellerstein, and C. Welton. MAD skills: New analysis practices for big data. In VLDB Conference, pages 1481-1492, 2009.

[CKLRSS2009,VLDB] A demonstration of SciDB: a science-oriented DBMS. In VLDB Conference, pages 1534-1537,2009.

[DG2008,CACM] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107-113,2008.

[DLR1977,RSS] A.P. Dempster, N.M. Laird, and D. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of The Royal Statistical Society, 39(1):1-38, 1977.

[DM2006,SIGMOD] A. Deshpande and S. Madden. MauveDB: supporting model-based user views in database systems. In SIGMOD Conference, pages 73-84, 2006.


References1
References

[DG1992,CACM] D. DeWitt, J. Gray. Parallel database systems: the future of high performance database systems. In Communications of the ACM, 35(6): 85-98, 1992.

[DNPT2006,SAC] A. Dorneich, R. Natarajan, E.P.D. Pednault, and F. Tipu. Embedded predictive modeling in a parallel relational database. In SAC, pages 569-574, 2006.

[FPC2009,PVLDB] E. Friedman, P. Pawlowski, and J. Cieslewicz. SQL/MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions. PVLDB, 2(2):1402-1413, 2009.

[GO2010,DKE] J. García-García, C. Ordonez: Extended aggregations for databases with referential integrity issues. Data Knowl. Eng. 69(1): 73-95 (2010).

[GCBLRVPP1997,JDMKD] J. Gray and S. Chaudhuri and A. Bosworth and A. Layman and D. Reichart and M. Venkatrao and F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. J. Data Mining and Knowledge Discovery., 1(1):29-53,1997.

[HLS2005,TODS] Z. He, B. S. Lee, and R. Snapp. Self-tuning cost modeling of user-defined functions in an object-relational DBMS. ACM Trans. Database Syst., 30(3):812-853, 2005.

[JM1998,SIGMOD] M. Jaedicke and B. Mitschang. On parallel processing of aggregate and scalar functions in object-relational DBMS. In ACM SIGMOD Conference, pages 379-389, 1998.

[LTWZ2005,SIGMOD] C. Luo, H. Thakkar, H. Wang, and C. Zaniolo. A native extension of SQL for mining data streams. In ACM SIGMOD, pages 873-875, New York, NY, USA, 2005.

[MC2002,ICDM] B.L. Milenova and M.M. Campos. O-cluster: Scalable clustering of large high dimensional data sets. In Proc. IEEE ICDM Conference, page 290, Washington, DC, USA, 2002.

[MYC2005,VLDB] B.L. Milenova, J. Yarmus, and M.M. Campos. SVM in Oracle database 10g: Removing the barriers to widespread adoption of support vector machines. In VLDB Conference, 2005.

[NCFB2001,ICDE] A. Netz, S. Chaudhuri, U. Fayyad, and J. Berhardt. Integrating data mining with SQL databases: OLE DB for data mining. In IEEE ICDE Conference, pages 379-387, 2001.


References2
References

[O2004,DMKD] C. Ordonez. Horizontal aggregations for building tabular data sets. In ACM SIGMOD Data Mining & Knowledge Discovery Workshop (DMKD), pages 35-42, 2004.

[O2006,TKDE] C. Ordonez. Integrating K-means clustering with a relational DBMS using SQL. IEEE Transactions on Knowledge and Data Engineering (TKDE), 18(2):188-201, 2006.

[O2007,SIGMOD] C.Ordonez. Building Statistical Models and Scoring with UDFs. In SIGMOD Conference, pages 1005-1016, 2007.

[O2010,TKDE] C. Ordonez. Statistical Model Computation with UDFs. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2010

[OP2010,TKDE] C. Ordonez, S.K. Pitchaimalai. Bayesian Classifiers Programmed in SQL. IEEE Transactions on Knowledge and Data Engineering (TKDE), 22(1):139-144, 2010.

[OP2010,DKE] C. Ordonez, S.K. Pitchaimalai. Fast UDFs to Compute Sufficient Statistics on Large Data Sets exploiting Caching and Sampling, Data and Knowledge Engineering Journal (DKE), 2010.

[OG2008,DSS] C. Ordonez, J. García-García: Referential integrity quality metrics. Decision Support Systems 44 (2): 495-508 (2008)

[O2003,JLINUX] M. Owens. Embedding an SQL database with SQLite. Linux J., 2003(110):2, 2003.

[PHBB2009,PVLDB] B. Panda, J. Herbach, S. Basu, and R.J. Bayardo. PLANET: Massively parallel learning of tree ensembles with MapReduce. PVLDB, 2(2):1426-1437, 2009.

[PPRADMS2009,SIGMOD] A. Pavlo, E. Paulson, A. Rasin, D. Abadi, D.J. DeWitt, S. Madden, and Stonebraker, M. A comparison of approaches to large-scale data analysis. In SIGMOD Conference, pages 165-178, 2009.

[STA1998,SIGMOD] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: alternatives and implications. In ACM SIGMOD, pages 343-354, 1998.

[SD2001,CIKM] K. Sattler and O. Dunemann. SQL database primitives for decision tree classifiers. In ACM CIKM Conference, pages 379-386, 2001.


References3
References

[NCFB2001,ICDE] A. Netz, S. Chaudhuri, U. Fayyad, and J. Berhardt. Integrating data mining with SQL databases: OLE DB for data mining. In IEEE ICDE Conference, pages 379-387, 2001.

[SMAHHH2007,VLDB] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. The end of an architectural era: (it's time for a complete rewrite). In VLDB, pages 1150-1160, 2007.

[WH2009,SIGMOD] F. M. Waas and J. M. Hellerstein. Parallelizing extensible query optimizers. In SIGMOD Conference, pages 871-878, 2009.

[ZHY2009,CIDR] Y. Zhang, H. Herodotou, and J. Yang. Riot: I/O-efficient numerical computing without SQL. In CIDR, 2009.

[ZZY2010,ICDE] Y. Zhang, W. Zhang, J. Yang. I/O-Efficient Statistical Computing with RIOT. In ICDE, 2010.

[ZRL1996,SIGMOD] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In ACM SIGMOD Conference, pages 103-114, 1996.


ad