Vik singh w jim gray mark manasse barc escience msr svc mountain view 8 23 2006
1 / 55

SQL Text Mining - PowerPoint PPT Presentation

  • Uploaded on

Vik Singh w/ Jim Gray, Mark Manasse (BARC - eScience) MSR SVC Mountain View 8/23/2006. SQL Text Mining. Road Map. Motivations Methodology & Algorithms Experiments & Analysis Application Future Work. 1. Motivations (1): SDSS SkyServer. Multi-TB Astronomy Archive

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about ' SQL Text Mining' - robert

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Vik singh w jim gray mark manasse barc escience msr svc mountain view 8 23 2006

Vik Singh

w/ Jim Gray, Mark Manasse (BARC - eScience)

MSR SVC Mountain View


SQL Text Mining

Road map
Road Map

  • Motivations

  • Methodology & Algorithms

  • Experiments & Analysis

  • Application

  • Future Work


Motivations 1 sdss skyserver
Motivations (1): SDSS SkyServer

  • Multi-TB Astronomy Archive

  • 5 Yrs, 178M Hits, 77M Views, 1M Unique IP's20M SQL Queries (9.8M Unique)

  • SQL access to telescope data

  • Wish to categorize user’s SQL queries

    • Schema redesign, caching, query recommendation, popular templates, segment users


2 not much prior work
(2): Not much prior work

  • Could not find research showing how to characterize SQL

  • But many sites and databases maintain query logs

  • Fortunately there is related work

    • NLP, IR, Machine Learning



  • Use unsupervised learning (K-Means)

    • Cluster centers give us query templates

    • Cluster sizes tell us popularity

  • Do term analysis over these segments

    • More interesting than total aggregate term stats

    • Can isolate types of users (Bots v. Mortals)


K means algorithm
K-Means Algorithm

defKMeans(k, docs):

clusters = InitialClusterCenters(k, docs)

while true:

change = AssignToClusters(docs, clusters)

if change:

clusters = RecomputeClusterCenters(k, docs)



return docs, clusters


3 key factors to clustering
3 Key Factors to Clustering

  • Distance function

  • Choosing K

    • We’ll undershoot K

    • Then break these clusters into ‘tighter’ ones

  • Choosing the initial centroids

    • Not covered in this talk, but we have some cool tricks

    • For now assume traditional approaches (BuckShot, Random)


1 st step distance function
1st Step: Distance Function

  • How syntactically similar are two queries?

    • Not necessarily functionally similar

  • SQL term order

    • Column Fields and WHERE conditionals can be reordered & represent the same query

    • Solution: Compare token combinations

      • N-Grams (or Shingles)


We also need to clean the sql
We also need to clean the SQL

  • ‘Templatize’ SQL queries

    • Remove unnecessary, uncommon features

      • 24 REGEX cleaners and substitutions

      • Ex. Substitute ‘STRING’, ‘NUM’, ‘COMPARE’

  • Goal here is to MAXIMIZE similarity


Before after
Before & After

SELECTs.ra, s.dec FROM #upload u, SpecObj s WHERE u.up_plate=s.plate and u.up_mjd=s.mjd and u.up_fiber=s.fiberid select p.objID, as rc3_name, as stetson_name, p.ra, p.dec, as type, p.u, p.g, p.r, p.i, p.z, o.distance from(((PhotoPrimary p inner join PhotoType ph on p.type = ph.value) left join RC3 rc on p.objid = rc.objid) left join Stetson s on p.objid = s.objid), dbo.fGetNearbyObjEq(180,-0.5,3) o where o.objid = p.objid and p.type = ph.value order by o.distance

select ra dec from temp specobj where up_plate compare plate logic up_mjd compare mjd logic up_fiber compare fiberid select objid name name ra dec name u g r i z distance from photoprimary inner join phototype on type compare value left join rc3 on objid compare objid left join stetson on objid compare objid fgetnearbyobjeq where objid compare objid logic type compare value orderby distance


Feature vector
Feature Vector

  • What sized N-Grams should we use?

    • Unigrams & Bigrams most common

      • Any more than Tri-Grams usually results in worse clusters

  • But this assumes unstructured text

    • We have a highly constrained language

  • And we want to capture Joins

    select objid name name ra dec name u g r i z distance fromphotoprimary inner join phototype on type compare value

    • (Need at least size 8-grams here)

  • At the same time we want good results too - consistent with the literature

    • So bias smaller grams since they more likely to occur


Feature strategy use em all
Feature Strategy – ‘Use ‘Em All’

  • Generate all 1 … 8 sized N-grams for a query

  • Sort the tokens within an N-gram

  • Why?

    • Increases similarity matches, decreases # N-grams (better for memory)

    • SQL is highly constrained - unlikely the terms reordered represent a different style query

    • Larger N-gram matches are unlikely, should be rewarded similarity if they are within the same N-distance neighborhood of terms

  • Jaccard’s Similarity Measure

  • |Intersection(Q1_n, Q2_n)| / |Union(Q1_n, Q2_n)|

  • Compute the Jaccard for each N-gram set separately, then take a weighted Fibonacci mean favoring smaller grams

  • Since there can be duplicate terms, we append each N-gram with a rolling index

    • ‘Multi-Set Jaccard’


Ex distance between 2 queries
Ex. Distance between 2 Queries

  • A: ‘select objid ra dec r z from galaxy specobj’

  • B: ‘select objid ra dec from galaxy fgetnearbyobjeq where objid compare objid’

  • For simplicity, just up to size 3

  • Not Shown

    • Sort tokens with N-Gram

      • “ra_dec_from” => “dec_from_ra”

    • Index repeated N-grams within same set size

    • Ex. objid_1, objid_2

Distance = (3*(0.43)+2*(0.31)+1*(0.17)) / 6 = 0.35


But this still won t scale for 20m
But .. this still won’t scale for 20M

  • We’re producing 1 … 8 grams each

  • Producing close to 1000+ grams per query

    • This is like full-scale document clustering!

  • 8 Jaccard’s over strings in each comparison step

  • Piping results in and out of SQL

  • O(N^2) clustering algorithm

  • Only have 3 months & a single machine …


But there are only 80k templates
But there are only 80K templates!

  • First, we only need to cluster the distinct queries (9.8M)

  • Second, we found MANY queries reduce down to a common template after cleaning

  • Number of unique query templates ~ 80K

    • Over 99.5% reduction!

  • Filter out the queries which result in errors

    • Brings it down to ~77K

  • Clustering these ~77K templates is equivalent to clustering all 20M!

    • We maintain the query to template mappings and book-keep template counts

    • Just factor the counts at the end to scale the solution for 20M


Let s cluster
Let’s Cluster

  • IronPython + SQL Server 2005

    • Didn't use Data Mining package in SQL Server

      • Its Clustering optimized for Euclidean distances

  • K = 24 (based on previous work at JHU)

  • Then search clusters for tighter groups

    • Within 70% similarity and of size >= 10

  • Total clusters: 194

  • Computation Time: ~22 Hrs

  • 15 Iterations


Ex a popular cluster found
Ex. A Popular Cluster Found

  • Represents 18% of all the SQL queries (3,100,655 hits)

  • Sum of Square Errors/Cluster Size =3.58006287097031E-06

    • 0.061 variance omitting top template from cluster

    • Tiny variance among queries


And since we collect n grams
And since we collect N-grams …

  • … We can find the common phrases within a cluster

  • Ex. Top 15 Most PopularEight-Grams


Two example applications
Two Example Applications

  • Bot Detection

  • Query Recommendation


App 1 bot detection
App 1: Bot Detection

  • Belief:Little or no variance and large cluster sizes correspond to Bot agents

  • Can isolate bot from human SQL traffic

    • Useful since we don’t have the user-agent strings in the SQL logs


App 1 bot detection1
App 1: Bot Detection

  • Significant # of EXACT query template matches

  • Top 10 Query Templates


Check the logs for the top template
Check the logs for the top template

  • Sample of SQL log records matching the template

  • Even have the same # of whitespaces & case

  • Just diff numbers being passed into the TVF’s

  • Steady rate: 2x hour for weeks at a time

  • Product usage pattern (All DR3, then all DR2)

  • All from the same user IP (NASA)

    • 2nd highest # queries on our top org’s list

  • Smells like a bot


App 2 query recommendation
App 2: Query Recommendation

  • Since we have a distance function … we can even return similar queries

  • New users, students, visitors wishing to query the sky

  • But there’s a learning curve

    • SQL (writing 3-way spatial query joins not easy)

    • Schemas

    • Optimizing & Debugging

  • Because queries are quite repetitive, why not suggest known correct ones to the user

    • Spelling Correction!


App 2 example
App 2: Example

Bad User Query:

SELECT TOP 10 ph.ra,ph.dec,str(ph.g - ph.r,11 ?) as color,ISNULL(s.bestObjId, 0) as bestObjId, 'ugri' FROM #x x, #upload up, BESTDR2..PhotoObjAll as ph LEFT OUTER JOIN? SpecObjAll s ON ph.objID = s.bestObjID WHERE (ph.type=3 OR ?) AND up.up_id = x.up_id ? x.objID=p ?.objID ORDER BY x.up_id

=> It’s ‘Cleaned’ representation:

select top num ra dec str isnull bestobjid from temp temp photoobjall left outer join specobjall on objid compare bestobjid where type compare num logic logic up_id compare up_id objid compare objid orderby up_id


App 2 similar query results
App 2: Similar Query Results

* Can computed quickly by comparing only against queries within the same length neighborhood



  • Not discussed

    • Optimizations, use of SQL Server, how we broke down clusters (‘Finesse Algorithm’), initializing centers, defining a center (‘Clustroid’), theoretical properties, weighting scheme

  • Future Work

    • More cleaning, using SQL parse trees, other features representations, better n-grams (wrapping edges), optimizations (min-hashing)

  • Findings

    • We found that queries in cleaned representation fall into a small group of clusters (20M => 77K)

    • Queries follow very repetitive template patterns, enabling us to effectively find bots, query outliers, and do query recommendation



  • SkyServer Site Logs:

  • Sloan Digital Sky Survey SkyServer project’s website:

  • G. Abdulla, “Analysis of SDSS SQL server log files”, UCRL-MI-215756-DRAFT. Lawrence Livermore National Laboratory, 2005

  • T. Malik, R. Burns, A. Chaudhary. Bypass Caching: Making Scientific Databases Good Network Citizens. In Proceedings of the 21st International Conference on Data Engineering, 2005.

  • Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, Geoffrey Zweig: Syntactic Clustering of the Web. Computer Networks 29(8-13): 1157-1166 (1997)

  • Fast and effective text mining using linear-time document clusteringB Larsen, C Aone - Proceedings of the fifth ACM SIGKDD international conference …, 1999 -

  • Web mining research: a survey R Kosala, H Blockeel - ACM SIGKDD Explorations Newsletter, 2000 -

  • Mehran Sahami, Timothy D. Heilman: A web-based kernel function for measuring the similarity of short text snippets. WWW 2006: 377-386

  • Plan to publish our web and SQL traffic research in a MSR TR as well as make the database, docs, slides & code available at this week




Gives us another optimization
Gives us another optimization

  • Current Design Inefficient

    • Many Jaccard’s over very large sets of strings

  • Let’s use hashing to speed up Jaccard and enable us to store more features in memory

    • Hash(Template) => [Hashes(1-Grams), Hashes(2-Grams) …]

  • We can use Python’s 32-bit String hash function

  • But what about collisions?

    • Not many unique N-grams

      • SQL & Schema vocabulary small

    • What about hashing query templates?

    • Birthday Problem

      • We can expect collisions with probability greater than a 50% after hashing:

      • ½ + sqrt(1/4 + 2*2^32 * ln(2)) = ~77K


  • Primarily two flavors of clustering

    • Expectation-Maximization (EM)

    • K-Means

  • Don’t want EM

    • Model-based clustering for unobservable data

    • K-Means is actually just a special case of EM

      • Clusters modelled by spherical Gaussian distributions

      • Each data item is assigned to one cluster

      • The mixture weights are equal

  • Now how do we compare two sets of n grams todo
    Now how do we compare two sets of N-grams? -- TODO

    • Jaccard’s Measure

    • Cardinality of the intersection of the two sets divided by the cardinality of the union of the two sets

    • Traditional, established metric for similarity

    • Captures Euclidean and Cosine properties

    • Since sets remove duplicates, we index n-grams with numbers to denote repetition

      • Multi-Set Jaccard’s Measure

    3 how well can we classify sql
    (3): How well can we classify SQL?

    • Motivation for converting unstructured to structured text

    • Can optimize a common layer for SQL style representation

      • Reusable ML development framework

        • Although translating to SQL is not trivial

    • Has nice properties

      • Highly constrained, clean vocabulary, relational, propositional, parseable

    But we have 8 sets of n grams todo
    But we have 8 sets of n-grams … -- TODO

    • We compute the Jaccard separately for each size n-gram

      • Not a good idea to mix sizes (lose independence)

    • Wish to prioritize smaller grams for better results and to maximize similarity

      • Since the Jaccard of smaller grams will result in a higher similarity value than larger grams

      • Prioritize the first 3 grams

    • Reverse Fibonacci sequence enables us to grow the weights proportionally and exponentially

    • 78% goes to first 3 grams

    • Comforting to know we’re taking into account larger grams (22% sounds reasonable to us)

    • Then take the weighted mean of the Jaccard’s scaled by their respective Fibonacci number

    • This is arbitrary, but we liked the results

    • Compared to exponential (2 based) and just plain ol’ unigrams with no scaling

    • Gives us something in between these two

    Finally an example distance between 2 sql queries todo
    Finally, an example: Distance Between 2 SQL Queries -- TODO

    • [include two queries, their n-grams, jaccard of each, multiplied by the fibonacci element, then the mean]

    • For now, say n-grams from 1-3 (so it can fit on a slide)

    2 nd step choosing initial centers
    2nd Step: Choosing Initial Centers

    • First, what’s a center?

      • Common clustering done over numerical vectors

        • Euclidean distance and midpoint (‘Centroid’)

      • But we’re dealing with sets of n-grams

        • No clear average midpoint formula

    • So instead we use ‘Clustroids’

      • SQL queries represent centers

      • Since we can’t do average midpoint, we find the query minimizes square error with everybody else within the cluster


    • Hill-Climbing

      • No really good method that escapes local minima

      • Simulated Annealing

      • Genetic K-Means

        • Difficult to apply to our features and distance metric

    • Buckshot (1992) [include cite])

      • Hierarchical Agglomerative Clustering (HAC)

      • Dendrogram Graph 

      • Start w/ SQRT(N) samples, Greedily combine clusters (All-Pairs) until left with K Clustroids

      • O(N^2–N*K)

    But this just does not scale for 20m todo
    But this just does not scale for 20M -- TODO

    • How many Jaccards for just BuckShot [include]

    • Plus how many Jaccards for KMeans

    • (probably more, stops when the clusters are stables, which is unknown number of steps)

    • One can set a max number of iterations

    • So how about sample more

    • Just recomputing clustroids is n^2

      • Need to do all-pairs to find center with min square error

      • Sample sqrt(n) to make it linear

    Can we initialize centers faster todo
    Can we initialize centers faster? -- TODO

    • BuckShot is slow

    • Can we do this step faster and produce better clusters?

    • Idea: Cuckoo Clustering

    • Inspired by Cuckoo Hashing and Genetic KMeans [include cites]

    Side excursion cuckoo clustering todo should be extra slides
    Side Excursion: Cuckoo Clustering – TODO (Should be extra slides)

    • [describe algorithm]

    • K buckets

    • Elect better leaders

    • Spill over to next best bucket

    • [Analyze run time]

    • [include pseudo code]

    Cuckoo clustering todo extra slides
    Cuckoo Clustering – TODO (extra slides) slides)

    • Nests (capacity 2)

    • K = 4

    • Sample size = 8

    Clustering 116 425 sql queries
    Clustering 116,425 SQL Queries slides)

    • SQL Queries chosen uniformly at random

    • K = 56

    • Max K-Means Iterations = 20

    • MSEWC = Mean Square Error Within Cluster

    Plus todo
    Plus … -- TODO slides)

    • Each query template and the number of times it occurred

    • The most popular ones should represent initial centers

    • And, cleaned queries grouped by the same length should be assigned to the same cluster initially

    • Get one iteration of AssignClusters for free

    • Now we can do KMeans

    Bootstrap example todo
    Bootstrap Example -- TODO slides)

    • [include example, before and after group bys]

    Experiment todo
    Experiment -- TODO slides)

    • Setup

    • SQL Server 2005

    • IronPython <== I love this language

    • Did not use Data Mining package in SQL Server

    • 'Microsoft Clustering' designed around and optimized for Euclidean vectors

    • Not easy to write a SQL Server C++ plug-in for my style feature sets + Jaccard + sampling logic

    • Computation Time: [?]

    • KMeans Iterations: [?]

    Some optimizations todo extra slides
    Some Optimizations – TODO (extra slides) slides)

    • Pruning when we find similarity = 1

    • Hashing

    • Speeds up Jaccard and reduces memory

    • Hash(cleanedQuery) => List of Hash(N-gram)’s

    • Use Python’s 32-bit string hash function

    • Collisions?

    • Not many unique N-grams

      • Number of SQL keywords + SDSS schema fields very small

    • Birthday Problem: We can expect collisions with probability greater than a ½ after hashing:

    • ½ + sqrt(1/4 + 2*2^32 * ln(2)) = ~77k queries

    • We have ~77k unique cleaned queries

    • Good enough

    We can do better finesse todo
    We can do better: ‘Finesse’ -- TODO slides)

    • Idea: KMeans does the hard work of grouping similar queries together

    • Intuition: We can find 'tighter' query groups within a cluster fast

    • We choose a threshold (say within 85% similarity)

    • Value of K not really important now

    • [include algorithm]

    Finesse algorithm todo
    Finesse Algorithm -- TODO slides)

    • [include algorithm]

    • Going backwards from 1.0 to threshold in decremental step values key to finding optimal clusters and for speed

    A popular cluster found
    A Popular Cluster Found slides)

    • Represents 18% of all the SQL queries (3,100,655 hits)

    • Sum of Square Errors/Cluster Size =3.58006287097031E-06

      • Meaning tiny variation within cluster

    And since we collect n grams1
    And since we collect N-grams … slides)

    • … We can find the common phrases

    • Ex. Top 15 Most PopularEight-Grams

    Bot detection
    Bot Detection slides)

    • Belief:Little or no variance and large cluster sizes correspond to bot agents

    • Can isolate bot from mortal SQL traffic

      • No user-agent strings in the SQL logs

    Bot detection1
    Bot Detection slides)

    • Significant # of EXACT query template matches

    • Top 10 Query Templates

    App query recommendation
    App: Query Recommendation slides)

    • Since we have a distance function …we can even find similar queries

    • New users, students, visitors wishing to query the sky

    • But there’s a learning curve

      • SQL (writing 3-way spatial query joins may not be intuitive)

      • Schemas

      • Optimizing & Debugging

    • And since we a found queries to be quite repetitive …

    • Why not return the most similar (and correct) ones as suggestions to the user?

      • Improving user experience and reach

    Example todo better query this isn t a common one
    Example slides)(TODO Better Query – this isn’t a common one)

    Find Top 3 Most Similar Queries to:

    SELECTtop 100 p.psfMag_g - s.Mag_g FROM photoprimary p JOIN specObj s on p.specObjID=s.specObjID where s.specClass=1

    It’s ‘Cleaned’ representation:

    select top num psfmag_g mag_g from photoprimary join specobj on specobjid compare specobjid where specclass compare num