
Aggregate features for relational data
Claudia Perlich, Foster Provost

Pat Tressel

16-May-2005

Overview
  • Perlich and Provost provide...
    • Hierarchy of aggregation methods
    • Survey of existing aggregation methods
    • New aggregation methods
  • Concerned w/ supervised learning only
    • But much seems applicable to clustering
The issues…
  • Most classifiers use feature vectors
    • Individual features have fixed arity
    • No links to other objects
  • How do we get feature vectors from relational data?
    • Flatten it:
      • Joins
      • Aggregation
  • (Are feature vectors all there are?)
Joins
  • Why consider them?
    • Yield flat feature vectors
    • Preserve all the data
  • Why not use them?
    • They emphasize data with many references
      • Ok if that’s what we want
      • Not ok if sampling was skewed
      • Cascaded or transitive joins blow up
Joins
  • They emphasize data with many references:
    • Lots more Joes than there were before...
Joins
  • Why not use them?
    • What if we don’t know the references?
      • Try out everything with everything else
      • Cross product yields all combinations
      • Adds fictitious relationships
      • Combinatorial blowup
Joins
  • What if we don’t know the references?
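A minimal sketch (hypothetical `people` and `calls` tables, not from the paper) of why joins alone are awkward: a 1:n join repeats each target tuple once per related row, and an unconstrained cross product enumerates every fictitious pairing.

```python
import pandas as pd

people = pd.DataFrame({"person_id": [1, 2], "name": ["Joe", "Ann"]})
calls = pd.DataFrame({"person_id": [1, 1, 1, 2], "minutes": [3, 10, 7, 2]})

# 1:n join: "Joe" now appears three times, so anything computed from the
# joined table over-weights people with many related rows.
joined = people.merge(calls, on="person_id")
print(joined)

# Cross product (no known reference): every person paired with every call.
cross = people.merge(calls.rename(columns={"person_id": "caller_id"}), how="cross")
print(len(people), "x", len(calls), "=", len(cross), "rows")
```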
Aggregates
  • Why use them?
    • Yield flat feature vectors
    • No blowup in number of tuples
      • Can group tuples in all related tables
    • Can keep as detailed stats as desired
      • Not just max, mean, etc.
      • Parametric dists from sufficient stats
      • Can apply tests for grouping
    • Choice of aggregates can be model-based
      • Better generalization
      • Include domain knowledge in model choice
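A minimal sketch (assumed tables, not the paper's code) of aggregating the 1:n side down to one row per target tuple while keeping sufficient statistics (count, sum, sum of squares), from which a parametric Gaussian summary (mean, variance) can be recovered or updated later.

```python
import pandas as pd

calls = pd.DataFrame({"person_id": [1, 1, 1, 2], "minutes": [3.0, 10.0, 7.0, 2.0]})

# One flat row of aggregate features per person_id, via sufficient statistics.
stats = calls.groupby("person_id")["minutes"].agg(
    n="count",
    s="sum",
    ss=lambda x: (x ** 2).sum(),
)
stats["mean"] = stats["s"] / stats["n"]
stats["var"] = stats["ss"] / stats["n"] - stats["mean"] ** 2
print(stats)
```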
Aggregates
  • Anything wrong with them?
    • Data is lost
    • Relational structure is lost
    • Influential individuals are lumped in with the rest
      • Doesn’t discover the critical individuals
      • A few heavy individuals can dominate the other data
    • Any choice of aggregates assumes a model
      • What if it’s wrong?
    • Adding new data can require recomputing the aggregates
      • But can avoid the issue by keeping sufficient statistics
Taxonomy of aggregates
  • Why is this useful?
    • Promote deliberate use of aggregates
    • Point out gaps in current use of aggregates
    • Find appropriate techniques for each class
  • Based on “complexity” due to:
    • Relational structure
      • Cardinality of the relations (1:1, 1:n, m:n)
    • Feature extraction
      • Computing the aggregates
    • Class prediction
Taxonomy of aggregates
  • Formal statement of the task: roughly, learn Ψ and Φ such that y ≈ Φ(t, Ψ(σ(t, Ω)))
  • Notation (here and on following slides):
    • Caution! Simplified from what’s in the paper!
    • t, tuple (from “target” table T, with main features)
    • y, class (known per t if training)
    • Ψ, aggregation function
    • Φ, classification function
    • σ, select operation (where joins preserve t)
    • Ω, all tables; B, any other table, b a tuple in B
    • u, fields to be added to t from joined tables
    • f, a field in u
    • More, that doesn’t fit on this slide
Aggregation complexity
  • Simple
    • One field from one object type
  • Denoted by:
Aggregation complexity
  • Multi-dimensional
    • Multiple fields, one object type
  • Denoted by:
Aggregation complexity
  • Multi-type
    • Multiple object types
  • Denoted by:
Relational “concept” complexity
  • Propositional
    • No aggregation
    • Single tuple, 1-1 or n-1 joins
      • n-1 is just a shared object
    • Not relational per se – already flat
Relational “concept” complexity
  • Independent fields
    • Separate aggregation per field
    • Separate 1-n joins with T
Relational “concept” complexity
  • Dependent fields in same table
    • Multi-dimensional aggregation
    • Separate 1-n joins with T
Relational “concept” complexity
  • Dependent fields over multiple tables
    • Multi-type aggregation
    • Separate 1-n joins, still only with T
Relational “concept” complexity
  • Global
    • Any joins or combinations of fields
      • Multi-type aggregation
      • Multi-way joins
      • Joins among tables other than T
Current relational aggregation
  • First-order logic
    • Find clauses that directly predict the class
      • Φ is OR
    • Form binary features from tests
      • Logical and arithmetic tests
      • These go in the feature vector
      • Φ is any ordinary classifier
Current relational aggregation
  • The usual database aggregates
    • For numerical values:
      • mean, min, max, count, sum, etc.
    • For categorical values:
      • Most common value
      • Count per value
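A minimal illustration (hypothetical transaction data) of these usual database aggregates: mean/min/max/count/sum for a numeric field, plus the most common value and per-value counts for a categorical field.

```python
import pandas as pd

tx = pd.DataFrame({
    "cust_id":  [1, 1, 1, 2, 2],
    "amount":   [5.0, 20.0, 10.0, 3.0, 4.0],
    "category": ["food", "gas", "food", "food", "books"],
})

numeric = tx.groupby("cust_id")["amount"].agg(["mean", "min", "max", "count", "sum"])
mode = tx.groupby("cust_id")["category"].agg(lambda s: s.mode().iloc[0])
counts = tx.groupby("cust_id")["category"].value_counts().unstack(fill_value=0)

# One flat feature row per customer.
print(numeric.join(mode.rename("top_category")).join(counts))
```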
Current relational aggregation
  • Set distance
    • Two tuples, each with a set of related tuples
    • Distance metric between related fields
      • Euclidean for numerical data
      • Edit distance for categorical
    • Distance between sets is distance of closest pair
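A minimal sketch of the closest-pair set distance for the numeric case (assumed representation: each target tuple has a bag of related numeric vectors).

```python
import numpy as np

def set_distance(bag_a, bag_b):
    """Smallest Euclidean distance between any vector in bag_a and any in bag_b."""
    a = np.asarray(bag_a, dtype=float)              # shape (n_a, d)
    b = np.asarray(bag_b, dtype=float)              # shape (n_b, d)
    diffs = a[:, None, :] - b[None, :, :]           # all pairwise differences
    return np.sqrt((diffs ** 2).sum(axis=-1)).min()

print(set_distance([[0, 0], [5, 5]], [[4, 4], [9, 9]]))  # closest pair: (5,5)-(4,4)
```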
Proposed relational aggregation
  • Recall the point of this work:
    • Tuple t from table T is part of a feature vector
    • Want to augment w/ info from other tables
    • Info added to t must be consistent w/ values in t
    • Need to flatten the added info to yield one vector per tuple t
    • Use that to:
      • Train classifier given class y for t
      • Predict class y for t
Proposed relational aggregation
  • Outline of steps:
    • Do query to get more info u from other tables
    • Partition the results based on:
      • Main features t
      • Class y
      • Predicates on t
    • Extract distributions over results for fields in u
      • Get distribution for each partition
      • For now, limit to categorical fields
      • Suggest extension to numerical fields
    • Derive features from distributions
Do query to get info from other tables
  • Select
    • Based on the target table T
    • If training, known class y is included in T
    • Joins must preserve distinct values from T
      • Join on as much of T’s key as is present in other table
      • Maybe need to constrain other fields?
      • Not a problem for correctly normalized tables
  • Project
    • Include all of t
    • Append additional fields u from joined tables
      • Anything up to all fields from joins
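A rough pandas analogue (invented IPO-like schema, not the paper's query) of this step: join the target table T with another table on T's key so every result row keeps its originating t, then project t plus the appended fields u.

```python
import pandas as pd

T = pd.DataFrame({"firm_id": [10, 11], "sector": ["tech", "bio"], "y": [1, 0]})
underwriters = pd.DataFrame({"firm_id": [10, 10, 11], "bank": ["A", "B", "A"]})

# Join on T's key; a left join preserves every t even with no related rows.
result = T.merge(underwriters, on="firm_id", how="left")
u_fields = ["bank"]                                   # the appended fields u
print(result[["firm_id", "sector", "y"] + u_fields])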
Extract distributions
  • Partition query results various ways, e.g.:
    • Into cases per each t
      • For training, include the (known) class y in t
    • Also (if training) split per each class
      • Want this for class priors
    • Split per some (unspecified) predicate c(t)
  • For each partition:
    • There is a bag of associated u tuples
      • Ignore the t part – already a flat vector
    • Split vertically to get bags of individual values per each field f in u
      • Note this breaks association between fields!
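A minimal sketch (continuing the hypothetical joined `result` above) of the partitioning: one bag of joined rows per case t, per class y, and per an arbitrary predicate c(t), then a vertical split into one bag of values per appended field f.

```python
import pandas as pd

result = pd.DataFrame({
    "firm_id": [10, 10, 11],
    "y":       [1, 1, 0],
    "bank":    ["A", "B", "A"],
})

per_case  = {t: g["bank"].tolist() for t, g in result.groupby("firm_id")}
per_class = {k: g["bank"].tolist() for k, g in result.groupby("y")}

c = result["firm_id"] > 10                 # some predicate c(t) on target fields
per_predicate = result.loc[c, "bank"].tolist()

print(per_case, per_class, per_predicate, sep="\n")
```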
Distributions for categorical fields
  • Let categorical field be f with values fi
  • Form histogram for each partition
    • Count instances of each value fi of f in a bag
    • These are sufficient statistics for:
      • Distribution over fi values
      • Probability of each bag in the partition
  • Start with one per each tuple t and field f
    • Cft, (per-) case vector
    • Component Cft[i], count for fi
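A minimal sketch of the case vectors Cft (hypothetical data): for each target tuple t, count how often each value fi of the categorical field f appears in t's bag of joined rows.

```python
import pandas as pd

result = pd.DataFrame({
    "firm_id": [10, 10, 10, 11],
    "bank":    ["A", "B", "A", "A"],
})

# Rows are cases t, columns are the values fi, entries are the counts Cft[i].
case_vectors = (
    result.groupby("firm_id")["bank"].value_counts().unstack(fill_value=0)
)
print(case_vectors)
```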
Distributions for categorical fields
  • Distribution of histograms per predicate c(t) and field f
    • Treat histogram counts as random variables
      • Regard c(t) true partition as a collection of histogram “samples”
      • Regard histograms as vectors of random variables, one per field value fi
    • Extract moments of these histogram count distributions
      • mean (sort of) – reference vector
      • variance (sort of) – variance vector
Distributions for categorical fields
  • Net histogram per predicate c(t), field f
    • c(t) partitions tuples t into two groups
      • Only histogram the c(t) true group
      • Could include ~c as a predicate if we want
    • Don’t re-count!
      • Already have histograms for each t and f – the case vectors Cft
      • Sum the case vectors columnwise
    • Call this a “reference vector”, Rfc
      • Proportional to average histogram over t for c(t) true (weighted by # samples per t)
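A minimal sketch of a reference vector Rfc (continuing the hypothetical `case_vectors` above): restrict to the cases where c(t) holds and sum their case vectors columnwise.

```python
import pandas as pd

case_vectors = pd.DataFrame(
    {"A": [2, 1], "B": [1, 0]}, index=pd.Index([10, 11], name="firm_id")
)
c_true = pd.Series({10: True, 11: False})      # c(t) evaluated per case t

# Columnwise sum over the c(t)-true cases: one count per value fi.
R_fc = case_vectors[c_true.reindex(case_vectors.index)].sum(axis=0)
print(R_fc)
```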
Distributions for categorical fields
  • Variance of case histograms per predicate c(t) and field f
    • Define “variance vector”, Vfc
      • Columnwise sum of squares of the case vectors, divided by the number of samples with c(t) true
      • Not an actual variance
        • Squared means not subtracted
      • Don’t care:
        • It’s indicative of the variance...
        • Throw in means-based features as well to give classifier full variance info
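A minimal sketch of the variance vector Vfc as described above (columnwise mean of the squared case vectors over the c(t)-true cases); as the slide notes, this is a second moment, not a true variance.

```python
import pandas as pd

case_vectors = pd.DataFrame(
    {"A": [2, 1, 0], "B": [1, 0, 3]}, index=pd.Index([10, 11, 12], name="firm_id")
)
c_true = pd.Series({10: True, 11: True, 12: False})

selected = case_vectors[c_true.reindex(case_vectors.index)]
V_fc = (selected ** 2).sum(axis=0) / len(selected)   # columnwise second moment
print(V_fc)
```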
Distributions for categorical fields
  • What predicates might we use?
    • Unconditionally true, c(t) = true
      • Result is net distribution independent of t
      • Unconditional reference vector, R
    • Per class k, ck(t) = (t.y == k)
      • Class priors
      • Recall for training data, y is a field in t
      • Per class reference vector, Rft.y=k
Distributions for categorical fields
  • Summary of notation
    • c(t), a predicate based on values in a tuple t
    • f, a categorical field from a join with T
    • fi, values of f
    • Rfc, reference vector
      • histogram over fi values in bag for c(t) true
    • Cft, case vector
      • histogram over fi values for t’s bag
    • R, unconditional reference vector
    • Vfc, variance vector
      • Columnwise average of the squared case vectors
    • X[i], the i-th entry of a vector X
Distributions for numerical data
  • Same general idea – representative distributions per various partitions
  • Can use categorical techniques if we:
    • Bin the numerical values
    • Treat each bin as a categorical value
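A minimal sketch of the binning step (hypothetical field and bin edges): discretize a numeric field into a few bins and treat each bin label as a categorical value, so the histogram machinery above applies unchanged.

```python
import pandas as pd

amounts = pd.Series([3.0, 10.0, 7.0, 2.0, 55.0], name="amount")
binned = pd.cut(amounts, bins=[0, 5, 20, float("inf")], labels=["low", "mid", "high"])
print(binned.value_counts())   # a histogram over the bin "values"
```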
Feature extraction
  • Base features on ref. and variance vectors
  • Two kinds:
    • “Interesting” values
      • one value from case reference vector per t
      • same column in vector for all t
      • assorted options for choosing column
      • choices depend on predicate ref. vectors
    • Vector distances
      • distance between case ref. vector and predicate ref. vector
      • various distance metrics
  • More notation: acronym for each feature type
Feature extraction: “interesting” values
  • For a given c, f, select that fi which is...
    • MOC: Most common overall
      • argmax_i R[i]
    • Most common in each class
      • For binary class y
        • Positive is y = 1, negative is y = 0
      • MOP: argmax_i Rft.y=1[i]
      • MON: argmax_i Rft.y=0[i]
    • Most distinctive per class
      • Common in one class but not in other(s)
      • MOD: argmax_i |Rft.y=1[i] - Rft.y=0[i]|
      • MOM: argmax_i |Rft.y=1[i] - Rft.y=0[i]| / |Vft.y=1[i] - Vft.y=0[i]|
        • Normalizes for variance (sort of)
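A minimal sketch of these "interesting value" features as argmax over hypothetical reference and variance vectors; the MOM line follows the reading of the formula above, which may differ in detail from the paper.

```python
import numpy as np

values = np.array(["A", "B", "C"])     # the categorical values fi
R      = np.array([10.0, 30.0, 5.0])   # unconditional reference vector
R_pos  = np.array([2.0, 25.0, 1.0])    # Rft.y=1
R_neg  = np.array([8.0, 5.0, 4.0])     # Rft.y=0
V_pos  = np.array([1.0, 9.0, 1.0])     # Vft.y=1
V_neg  = np.array([2.0, 1.0, 2.0])     # Vft.y=0

MOC = values[np.argmax(R)]
MOP = values[np.argmax(R_pos)]
MON = values[np.argmax(R_neg)]
MOD = values[np.argmax(np.abs(R_pos - R_neg))]
MOM = values[np.argmax(np.abs(R_pos - R_neg) / np.abs(V_pos - V_neg))]
print(MOC, MOP, MON, MOD, MOM)
```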
Feature extraction: vector distance
  • Distance btw a given ref. vector & each case vector
  • Distance metrics
    • ED: Edit – not defined
      • Sum of abs. diffs, a.k.a. Manhattan dist?
      • Σ_i |C[i] - R[i]|
    • EU: Euclidean
      • √((C - R)^T (C - R)), omit √ for speed
    • MA: Mahalanobis
      • √((C - R)^T Σ^-1 (C - R)), omit √ for speed
      • Σ should be covariance... of what?
    • CO: Cosine, 1 - cos(angle btw vectors)
      • 1 - C^T R / (‖C‖ ‖R‖)
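A minimal sketch of these distance features between a case vector C and a reference vector R (hypothetical numbers); for Mahalanobis, a diagonal covariance is assumed here, since the slide itself questions what Σ should be.

```python
import numpy as np

C = np.array([3.0, 1.0, 0.0])
R = np.array([10.0, 30.0, 5.0])

ED = np.abs(C - R).sum()                                    # Manhattan / "edit"
EU = np.sqrt(((C - R) ** 2).sum())                          # Euclidean
CO = 1.0 - C @ R / (np.linalg.norm(C) * np.linalg.norm(R))  # cosine distance

Sigma_diag = np.array([4.0, 25.0, 1.0])                     # assumed diagonal Σ
MA = np.sqrt(((C - R) ** 2 / Sigma_diag).sum())             # Mahalanobis, diagonal case

print(ED, EU, CO, MA)
```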
Feature extraction: vector distance
  • Apply each metric w/ various ref. vectors
    • Acronym is metric w/ suffix for ref. vector
    • (No suffix): Unconditional ref. vector
    • P: per-class positive ref. vector, Rft.y=1
    • N: per-class negative ref. vector, Rft.y=0
    • D: difference between the P and N distances
  • Alphabet soup, e.g. EUP, MAD,...
Feature extraction
  • Other features added for tests
    • Not part of their aggregation proposal
    • AH: “abstraction hierarchy” (?)
      • Pull into T all fields that are just “shared records” via n:1 references
    • AC: “autocorrelation” aggregation
      • For joins back into T, get other cases “linked to” each t
      • Fraction of positive cases among others
Learning
  • Find linked tables
    • Starting from T, do breadth-first walk of schema graph
      • Up to some max depth
      • Cap number of paths followed
    • For each path, know T is linked to last table in path
  • Extract aggregate fields
    • Pull in all fields of last table in path
    • Aggregate them (using new aggregates) per t
    • Append aggregates to t
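A minimal sketch (invented schema, not the paper's) of the bounded breadth-first walk: enumerate join paths starting at T, stopping at a maximum depth and capping the total number of paths kept.

```python
from collections import deque

schema = {          # adjacency list: table -> tables reachable via a key link
    "T": ["offers", "underwriters"],
    "offers": ["banks"],
    "underwriters": ["banks"],
    "banks": [],
}

def join_paths(schema, start="T", max_depth=2, max_paths=10):
    paths, queue = [], deque([[start]])
    while queue and len(paths) < max_paths:
        path = queue.popleft()
        if len(path) > 1:
            paths.append(path)          # T is linked to the last table in path
        if len(path) - 1 < max_depth:
            for nxt in schema.get(path[-1], []):
                queue.append(path + [nxt])
    return paths

print(join_paths(schema))
```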
Learning
  • Classifier
    • Pick 10 subsets each w/ 10 features
      • Random choice, weighted by “performance”
      • But there’s no classifier yet...so how do features predict class?
    • Build a decision tree for each feature set
      • Have class frequencies at leaves
        • Features might not completely distinguish classes
      • Class prediction:
        • Select class with higher frequency
      • Class probability estimation:
        • Average frequencies over trees
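A minimal sketch of this classifier stage using scikit-learn on toy data (uniform random feature subsets here, whereas the paper weights the choice by feature "performance"): fit one shallow decision tree per subset and average the leaf class frequencies (predict_proba) across trees.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))                   # 40 constructed features
y = (X[:, 0] + X[:, 7] > 0).astype(int)          # toy binary class

trees, subsets = [], []
for _ in range(10):
    cols = rng.choice(X.shape[1], size=10, replace=False)
    trees.append(DecisionTreeClassifier(max_depth=3).fit(X[:, cols], y))
    subsets.append(cols)

# Class probability estimate: average the per-tree leaf frequencies.
proba = np.mean([t.predict_proba(X[:, c]) for t, c in zip(trees, subsets)], axis=0)
pred = proba.argmax(axis=1)                      # class prediction
print(proba[:3], pred[:3])
```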
Tests
  • IPO data
    • 5 tables w/ small, simple schema
      • Majority of fields were in the “main” table, i.e. T
        • The only numeric fields were in main table, so no aggregation of numeric features needed
      • Other tables had key & one data field
      • Max path length 2 to reach all tables, no recursion
      • Predicate on one field in T used as the class
  • Tested against:
    • First-order logic aggregation
      • Extract clauses using an ILP system
      • Append evaluated clauses to each t
    • Various ILP systems
      • Using just data in T (or T and AH features?)
Test results
  • See paper for numbers
  • Accuracy with aggregate features:
    • Up to 10% increase over only features from T
    • Depends on which and how many extra features used
    • Most predictive feature was in a separate table
    • Expect accuracy increase as more info available
    • Shows info was not destroyed by aggregation
    • Vector distance features better
  • Generalization
Interesting ideas (“I”) & benefits (“B”)
  • Taxonomy
    • I: Division into stages of aggregation
      • Slot in any procedure per stage
      • Estimate complexity per stage
    • B: Might get the discussion going
  • Aggregate features
    • I: Identifying a “main” table
      • Others get aggregated
    • I: Forming partitions to aggregate over
      • Using queries with joins to pull in other tables
      • Abstract partitioning based on predicate
    • I: Comparing case against reference histograms
    • I: Separate comparison method and reference
Interesting ideas (“I”) & benefits (“B”)
  • Learning
    • I: Decision tree tricks
      • Cut DT induction off short to get class freqs
      • Starve DT of features to improve generalization
Issues
  • Some worrying lapses...
    • Lacked standard terms for common concepts
      • “position i [of vector has] the number of instances of [ith value]”... -> histogram
      • “abstraction hierarchy” -> schema
      • “value order” -> enumeration
      • Defined (and emphasized) terms for trivial and commonly used things
    • Imprecise use of terms
      • “variance” for (something like) second moment
      • I’m not confident they know what Mahalanobis distance is
      • They say “left outer join” and show inner join symbol
Issues
  • Some worrying lapses...
    • Did not connect “reference vector” and “variance vector” to underlying statistics
      • Should relate to bag prior and field value conditional probability, not just “weighted”
    • Did not acknowledge loss of correlation info from splitting up joined u tuples in their features
      • Assumes fields are independent
      • Dependency was mentioned in the taxonomy
    • Fig 1 schema cannot support § 2 example query
      • Missing a necessary foreign key reference
Issues
  • Some worrying lapses...
    • Their formal statement of the task did not show aggregation as dependent on t
      • Needed for c(t) partitioning
    • Did not clearly distinguish when t did or did not contain class
      • No need to put it in there at all
    • No, the higher Gaussian moments are not all zero!
      • Only the odd ones are. Yeesh.
      • Correct reason we don’t need them is: all can be computed from mean and variance
    • Uuugly notation
Issues
  • Some worrying lapses...
    • Did not cite other uses of histograms or distributions extracted as features
      • “Spike-triggered average” / covariance / etc.
        • Used by: all neurobiology, neurocomputation
        • E.g.: de Ruyter van Steveninck & Bialek
      • “Response-conditional ensemble”
        • Used by: Our own Adrienne Fairhall & colleagues
        • E.g.: Aguera & Arcas, Fairhall, Bialek
      • “Event-triggered distribution”
        • Used by: me ☺
        • E.g.: CSE528 project
Issues
  • Some worrying lapses...
    • Did not cite other uses of histograms or distributions extracted as features...
    • So, did not use “standard” tricks
      • Dimension reduction (see the sketch below):
        • Treat histogram as a vector
        • Do PCA, keep top few eigenmodes, new features are projections
    • Nor “special” tricks:
      • Subtract prior covariance before PCA
    • Likewise, pitting the classes against each other is not new
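A minimal sketch of the dimension-reduction trick mentioned above (scikit-learn, made-up histogram data): treat each case histogram as a vector, fit PCA, keep a few top components, and use the projections as features.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
histograms = rng.poisson(lam=3.0, size=(100, 20)).astype(float)  # 100 cases x 20 values

pca = PCA(n_components=3)
features = pca.fit_transform(histograms)   # projections onto the top 3 eigenmodes
print(features.shape, pca.explained_variance_ratio_)
```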
Issues
  • Non-goof issues
    • Would need bookkeeping to maintain the variance vector for online learning (see the sketch below)
      • Don’t have sufficient statistics
      • Histograms are the actual “samples”
      • Adding new data doesn’t add new “samples”: it changes existing ones
      • Could subtract the old contribution, add the new one
      • Use a triggered query
    • Don’t bin those nice numerical variables!
      • Binning makes vectors out of scalars
      • Scalar fields can be ganged into a vector across fields!
      • Do (e.g.) clustering on the bag of vectors
  • That’s enough of that
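A minimal sketch of the bookkeeping suggested above (invented numbers): keep running columnwise sums for the reference and variance vectors, and when a case's bag of joined rows changes, subtract its old case vector contribution and add the new one.

```python
import numpy as np

ref_sum = np.array([12.0, 7.0, 3.0])    # running columnwise sum of case vectors Cft
sq_sum  = np.array([50.0, 19.0, 5.0])   # running columnwise sum of squared Cft
n_cases = 4                             # cases with c(t) true

def replace_case(ref_sum, sq_sum, old_cv, new_cv):
    """Swap one case's contribution after its bag of joined rows changes."""
    return ref_sum - old_cv + new_cv, sq_sum - old_cv ** 2 + new_cv ** 2

old_cv = np.array([2.0, 1.0, 0.0])
new_cv = np.array([3.0, 1.0, 1.0])
ref_sum, sq_sum = replace_case(ref_sum, sq_sum, old_cv, new_cv)

R_fc = ref_sum                 # reference vector
V_fc = sq_sum / n_cases        # "variance" vector (second moment, as discussed)
print(R_fc, V_fc)
```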