
Aggregate features for relational data
Claudia Perlich, Foster Provost

Pat Tressel

16-May-2005


Overview

  • Perlich and Provost provide...

    • Hierarchy of aggregation methods

    • Survey of existing aggregation methods

    • New aggregation methods

  • Concerned w/ supervised learning only

    • But much seems applicable to clustering


The issues…

  • Most classifiers use feature vectors

    • Individual features have fixed arity

    • No links to other objects

  • How do we get feature vectors from relational data?

    • Flatten it:

      • Joins

      • Aggregation

  • (Are feature vectors all there are?)


Joins

  • Why consider them?

    • Yield flat feature vectors

    • Preserve all the data

  • Why not use them?

    • They emphasize data with many references

      • Ok if that’s what we want

      • Not ok if sampling was skewed

      • Cascaded or transitive joins blow up


Joins

  • They emphasize data with many references:

    • Lots more Joes than there were before...
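A toy illustration of the blowup (plain Python, hypothetical data): a naive nested-loop inner join repeats each person once per matching transaction, so many-reference rows dominate the joined table.

```python
# Hypothetical toy tables: one person with many transactions.
people = [{"pid": 1, "name": "Joe"}, {"pid": 2, "name": "Ann"}]
transactions = [
    {"pid": 1, "amount": 10},
    {"pid": 1, "amount": 20},
    {"pid": 1, "amount": 30},
    {"pid": 2, "amount": 5},
]

def inner_join(left, right, key):
    """Naive nested-loop inner join on a shared key field."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

joined = inner_join(people, transactions, "pid")

# Joe now appears 3 times, Ann once -- the join weights Joe 3x.
counts = {}
for row in joined:
    counts[row["name"]] = counts.get(row["name"], 0) + 1
print(counts)  # {'Joe': 3, 'Ann': 1}
```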


Joins

  • Why not use them?

    • What if we don’t know the references?

      • Try out everything with everything else

      • Cross product yields all combinations

      • Adds fictitious relationships

      • Combinatorial blowup


Joins

  • What if we don’t know the references?


Aggregates

  • Why use them?

    • Yield flat feature vectors

    • No blowup in number of tuples

      • Can group tuples in all related tables

    • Can keep as detailed stats as desired

      • Not just max, mean, etc.

      • Parametric dists from sufficient stats

      • Can apply tests for grouping

    • Choice of aggregates can be model-based

      • Better generalization

      • Include domain knowledge in model choice
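The basic move -- group the 1:n related tuples per target key and keep whatever statistics are desired -- in a minimal Python sketch (toy transaction data; field names are illustrative):

```python
from collections import defaultdict
from statistics import mean

# Toy 1:n related table keyed back to the target table by "pid".
transactions = [
    {"pid": 1, "amount": 10.0}, {"pid": 1, "amount": 30.0},
    {"pid": 2, "amount": 5.0},
]

def aggregate(rows, key, field):
    """One flat feature row per target key -- no blowup in tuple count."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r[field])
    # Keep as detailed stats as desired -- here count, mean, max.
    return {k: {"count": len(v), "mean": mean(v), "max": max(v)}
            for k, v in groups.items()}

feats = aggregate(transactions, "pid", "amount")
print(feats[1])  # {'count': 2, 'mean': 20.0, 'max': 30.0}
```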


Aggregates

  • Anything wrong with them?

    • Data is lost

    • Relational structure is lost

    • Influential individuals are lumped in

      • Doesn’t discover critical individuals

      • Dominates other data

    • Any choice of aggregates assumes a model

      • What if it’s wrong?

    • Adding new data can require calculations

      • But can avoid issue by keeping sufficient statistics


Taxonomy of aggregates

  • Why is this useful?

    • Promote deliberate use of aggregates

    • Point out gaps in current use of aggregates

    • Find appropriate techniques for each class

  • Based on “complexity” due to:

    • Relational structure

      • Cardinality of the relations (1:1, 1:n, m:n)

    • Feature extraction

      • Computing the aggregates

    • Class prediction



Taxonomy of aggregates

  • Formal statement of the task:

  • Notation (here and on following slides):

    • Caution! Simplified from what’s in the paper!

    • t, tuple (from “target” table T, with main features)

    • y, class (known per t if training)

    • Ψ, aggregation function

    • Φ, classification function

    • σ, select operation (where joins preserve t)

    • Ω, all tables; B, any other table, b a tuple in B

    • u, fields to be added to t from joined tables

    • f, a field in u

    • More, that doesn’t fit on this slide


Aggregation complexity

  • Simple

    • One field from one object type

  • Denoted by:


Aggregation complexity

  • Multi-dimensional

    • Multiple fields, one object type

  • Denoted by:


Aggregation complexity

  • Multi-type

    • Multiple object types

  • Denoted by:


Relational “concept” complexity

  • Propositional

    • No aggregation

    • Single tuple, 1-1 or n-1 joins

      • n-1 is just a shared object

    • Not relational per se – already flat


Relational “concept” complexity

  • Independent fields

    • Separate aggregation per field

    • Separate 1-n joins with T


Relational “concept” complexity

  • Dependent fields in same table

    • Multi-dimensional aggregation

    • Separate 1-n joins with T


Relational “concept” complexity

  • Dependent fields over multiple tables

    • Multi-type aggregation

    • Separate 1-n joins, still only with T


Relational “concept” complexity

  • Global

    • Any joins or combinations of fields

      • Multi-type aggregation

      • Multi-way joins

      • Joins among tables other than T


Current relational aggregation

  • First-order logic

    • Find clauses that directly predict the class

      • Ф is OR

    • Form binary features from tests

      • Logical and arithmetic tests

      • These go in the feature vector

      • Ф is any ordinary classifier


Current relational aggregation

  • The usual database aggregates

    • For numerical values:

      • mean, min, max, count, sum, etc.

    • For categorical values:

      • Most common value

      • Count per value


Current relational aggregation

  • Set distance

    • Two tuples, each with a set of related tuples

    • Distance metric between related fields

      • Euclidean for numerical data

      • Edit distance for categorical

    • Distance between sets is distance of closest pair
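A sketch of the set-distance aggregate for numeric fields (Euclidean between tuples; toy points below): the distance between two bags of related tuples is the distance of their closest pair.

```python
import math

def euclid(a, b):
    """Euclidean distance between two numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def set_distance(A, B, metric=euclid):
    """Distance between two sets = distance of the closest pair."""
    return min(metric(a, b) for a in A for b in B)

# Toy bags of related tuples for two cases.
A = [(0.0, 0.0), (5.0, 5.0)]
B = [(1.0, 0.0), (9.0, 9.0)]
print(set_distance(A, B))  # 1.0
```

For categorical fields the slides suggest edit distance in place of `euclid`.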


Proposed relational aggregation

  • Recall the point of this work:

    • Tuple t from table T is part of a feature vector

    • Want to augment w/ info from other tables

    • Info added to t must be consistent w/ values in t

    • Need to flatten the added info to yield one vector per tuple t

    • Use that to:

      • Train classifier given class y for t

      • Predict class y for t


Proposed relational aggregation

  • Outline of steps:

    • Do query to get more info u from other tables

    • Partition the results based on:

      • Main features t

      • Class y

      • Predicates on t

    • Extract distributions over results for fields in u

      • Get distribution for each partition

      • For now, limit to categorical fields

      • Suggest extension to numerical fields

    • Derive features from distributions


Do query to get info from other tables

  • Select

    • Based on the target table T

    • If training, known class y is included in T

    • Joins must preserve distinct values from T

      • Join on as much of T’s key as is present in other table

      • Maybe need to constrain other fields?

      • Not a problem for correctly normalized tables

  • Project

    • Include all of t

    • Append additional fields u from joined tables

      • Anything up to all fields from joins


Extract distributions

  • Partition query results various ways, e.g.:

    • Into cases per each t

      • For training, include the (known) class y in t

    • Also (if training) split per each class

      • Want this for class priors

    • Split per some (unspecified) predicate c(t)

  • For each partition:

    • There is a bag of associated u tuples

      • Ignore the t part – already a flat vector

    • Split vertically to get bags of individual values per each field f in u

      • Note this breaks association between fields!


Distributions for categorical fields

  • Let categorical field be f with values fi

  • Form histogram for each partition

    • Count instances of each value fi of f in a bag

    • These are sufficient statistics for:

      • Distribution over fi values

      • Probability of each bag in the partition

  • Start with one per each tuple t and field f

    • Cft, (per-) case vector

    • Component Cft[i], count for fi
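The per-case vectors can be sketched in a few lines (toy bags; a fixed enumeration of f's values turns each count table into a vector):

```python
from collections import Counter

# Toy data: t -> bag of values of one categorical field f from the join.
bags = {
    "t1": ["red", "red", "blue"],
    "t2": ["blue", "green"],
}
values = ["red", "blue", "green"]  # fixed value order (enumeration of f)

def case_vector(bag, values):
    """Histogram of f's values over t's bag, as a vector Cft."""
    counts = Counter(bag)
    return [counts[v] for v in values]

C = {t: case_vector(bag, values) for t, bag in bags.items()}
print(C["t1"])  # [2, 1, 0]
```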


Distributions for categorical fields

  • Distribution of histograms per predicate c(t) and field f

    • Treat histogram counts as random variables

      • Regard c(t) true partition as a collection of histogram “samples”

      • Regard histograms as vectors of random variables, one per field value fi

    • Extract moments of these histogram count distributions

      • mean (sort of) – reference vector

      • variance (sort of) – variance vector


Distributions for categorical fields

  • Net histogram per predicate c(t), field f

    • c(t) partitions tuples t into two groups

      • Only histogram the c(t) true group

      • Could include ~c as a predicate if we want

    • Don’t re-count!

      • Already have histograms for each t and f – case reference vectors

      • Sum the case reference vectors columnwise

    • Call this a “reference vector”, Rfc

      • Proportional to average histogram over t for c(t) true (weighted by # samples per t)
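A sketch of building Rfc by summing case vectors columnwise over the tuples with c(t) true (toy case vectors; the predicate is given as a set of tuple ids):

```python
# Toy per-case histograms Cft (one vector per target tuple t).
C = {"t1": [2, 1, 0], "t2": [0, 1, 1], "t3": [1, 0, 2]}
c_true = {"t1", "t3"}  # tuples satisfying some predicate c(t)

def reference_vector(case_vectors, selected):
    """Columnwise sum of the case vectors with c(t) true -- no re-counting."""
    cols = zip(*(case_vectors[t] for t in selected))
    return [sum(col) for col in cols]

R = reference_vector(C, sorted(c_true))
print(R)  # [3, 1, 2]
```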


Distributions for categorical fields

  • Variance of case histograms per predicate c(t) and field f

    • Define “variance vector”, Vfc

      • Columnwise sum of squares of case reference vectors / number of samples with c(t) true

      • Not an actual variance

        • Squared means not subtracted

      • Don’t care:

        • It’s indicative of the variance...

        • Throw in means-based features as well to give classifier full variance info
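The "variance vector" as the slides describe it -- columnwise sum of squared case-vector entries over tuples with c(t) true, divided by the number of such tuples, with squared means deliberately not subtracted (toy data again):

```python
# Toy per-case histograms Cft.
C = {"t1": [2, 1, 0], "t2": [0, 1, 1], "t3": [1, 0, 2]}
c_true = ["t1", "t3"]  # tuples satisfying c(t)

def variance_vector(case_vectors, selected):
    """Not a true variance: mean of squared counts per column."""
    n = len(selected)
    cols = zip(*(case_vectors[t] for t in selected))
    return [sum(x * x for x in col) / n for col in cols]

V = variance_vector(C, c_true)
print(V)  # [2.5, 0.5, 2.0]
```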


Distributions for categorical fields

  • What predicates might we use?

    • Unconditionally true, c(t) = true

      • Result is net distribution independent of t

      • Unconditional reference vector, R

    • Per class k, ck(t) = (t.y == k)

      • Class priors

      • Recall for training data, y is a field in t

      • Per class reference vector, Rft.y=k


Distributions for categorical fields

  • Summary of notation

    • c(t), a predicate based on values in a tuple t

    • f, a categorical field from a join with T

    • fi, values of f

    • Rfc, reference vector

      • histogram over fi values in bag for c(t) true

    • Cft, case vector

      • histogram over fi values for t’s bag

    • R, unconditional reference vector

    • Vfc, variance vector

      • Columnwise average squared ref. vector

    • X[i], the ith value in some ref. vector X


Distributions for numerical data

  • Same general idea – representative distributions per various partitions

  • Can use categorical techniques if we:

    • Bin the numerical values

    • Treat each bin as a categorical value
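A minimal equal-width binning sketch (bin edges and count are illustrative); each bin index then acts as a categorical value for the machinery above:

```python
def bin_index(x, lo, hi, nbins):
    """Map x into one of nbins equal-width bins over [lo, hi]."""
    if x <= lo:
        return 0
    if x >= hi:
        return nbins - 1
    return int((x - lo) / (hi - lo) * nbins)

vals = [0.1, 4.9, 5.0, 9.9]
bins = [bin_index(v, 0.0, 10.0, 5) for v in vals]
print(bins)  # [0, 2, 2, 4]
```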


Feature extraction

  • Base features on ref. and variance vectors

  • Two kinds:

    • “Interesting” values

      • one value from case reference vector per t

      • same column in vector for all t

      • assorted options for choosing column

      • choices depend on predicate ref. vectors

    • Vector distances

      • distance between case ref. vector and predicate ref. vector

      • various distance metrics

  • More notation: acronym for each feature type


Feature extraction: “interesting” values

  • For a given c, f, select that fi which is...

    • MOC: Most common overall

      • argmax_i R[i]

    • Most common in each class

      • For binary class y

        • Positive is y = 1, Negative is y = 0

      • MOP: argmax_i Rft.y=1[i]

      • MON: argmax_i Rft.y=0[i]

    • Most distinctive per class

      • Common in one class but not in other(s)

      • MOD: argmax_i |Rft.y=1[i] - Rft.y=0[i]|

      • MOM: argmax_i |Rft.y=1[i] - Rft.y=0[i]| / |Vft.y=1[i] - Vft.y=0[i]|

        • Normalizes for variance (sort of)
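A sketch of the MOC/MOP/MON/MOD selections, assuming a fixed value order and toy reference vectors (MOM is omitted since it only adds the variance normalization):

```python
# Toy reference vectors over three field values.
R  = [5, 3, 2]   # unconditional reference vector
Rp = [4, 1, 0]   # reference vector for class y = 1 (positive)
Rn = [1, 2, 2]   # reference vector for class y = 0 (negative)

def argmax(xs):
    """Index of the (first) largest entry."""
    return max(range(len(xs)), key=lambda i: xs[i])

MOC = argmax(R)                                     # most common overall
MOP = argmax(Rp)                                    # most common in positives
MON = argmax(Rn)                                    # most common in negatives
MOD = argmax([abs(p - n) for p, n in zip(Rp, Rn)])  # most distinctive
print(MOC, MOP, MON, MOD)  # 0 0 1 0
```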


Feature extraction: vector distance

  • Distance btw given ref. vector & each case vector

  • Distance metrics

    • ED: Edit – not defined

      • Sum of abs. diffs, a.k.a. Manhattan dist?

      • Σ_i |C[i] - R[i]|

    • EU: Euclidean

      • √((C - R)ᵀ(C - R)), omit √ for speed

    • MA: Mahalanobis

      • √((C - R)ᵀ Σ⁻¹ (C - R)), omit √ for speed

      • Σ should be a covariance... of what?

    • CO: Cosine, 1 - cos(angle btw vectors)

      • 1 - CᵀR / (√(CᵀC) √(RᵀR))
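A sketch (toy vectors) of the distance features between a case vector C and a reference vector R: ED treated as Manhattan distance, EU as Euclidean, CO as cosine distance. Mahalanobis is omitted since the slides question which covariance Σ to use.

```python
import math

def manhattan(C, R):
    """ED-style distance: sum of absolute differences."""
    return sum(abs(c - r) for c, r in zip(C, R))

def euclidean(C, R):
    """EU: root of sum of squared differences (root omittable for speed)."""
    return math.sqrt(sum((c - r) ** 2 for c, r in zip(C, R)))

def cosine_dist(C, R):
    """CO: 1 - cos(angle between the vectors)."""
    dot = sum(c * r for c, r in zip(C, R))
    norm = math.sqrt(sum(c * c for c in C)) * math.sqrt(sum(r * r for r in R))
    return 1.0 - dot / norm

C, R = [2, 1, 0], [3, 1, 2]
print(manhattan(C, R))  # 3
```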


Feature extraction: vector distance

  • Apply each metric w/ various ref. vectors

    • Acronym is metric w/ suffix for ref. vector

    • (No suffix): Unconditional ref. vector

    • P: per-class positive ref. vector, Rft.y=1

    • N: per-class negative ref. vector, Rft.y=0

    • D: difference between the P and N distances

  • Alphabet soup, e.g. EUP, MAD,...


Feature extraction

  • Other features added for tests

    • Not part of their aggregation proposal

    • AH: “abstraction hierarchy” (?)

      • Pull into T all fields that are just “shared records” via n:1 references

    • AC: “autocorrelation” aggregation

      • For joins back into T, get other cases “linked to” each t

      • Fraction of positive cases among others


Learning

  • Find linked tables

    • Starting from T, do breadth-first walk of schema graph

      • Up to some max depth

      • Cap number of paths followed

    • For each path, know T is linked to last table in path

  • Extract aggregate fields

    • Pull in all fields of last table in path

    • Aggregate them (using new aggregates) per t

    • Append aggregates to t
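The breadth-first schema walk can be sketched as follows (toy schema graph; edges are illustrative foreign-key links; the slides also cap the number of paths, which is left out here):

```python
from collections import deque

# Toy schema graph: table -> linked tables.
schema = {"T": ["A", "B"], "A": ["C"], "B": [], "C": []}

def join_paths(schema, start, max_depth):
    """Breadth-first walk from the target table, up to max_depth links."""
    paths, queue = [], deque([[start]])
    while queue:
        path = queue.popleft()
        if len(path) - 1 >= max_depth:
            continue  # depth cap reached, stop extending this path
        for nxt in schema[path[-1]]:
            paths.append(path + [nxt])
            queue.append(path + [nxt])
    return paths

print(join_paths(schema, "T", 2))
# [['T', 'A'], ['T', 'B'], ['T', 'A', 'C']]
```

The last table of each path is then aggregated back to one row per t.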


Learning

  • Classifier

    • Pick 10 subsets each w/ 10 features

      • Random choice, weighted by “performance”

      • But there’s no classifier yet...so how do features predict class?

    • Build a decision tree for each feature set

      • Have class frequencies at leaves

        • Features might not completely distinguish classes

      • Class prediction:

        • Select class with higher frequency

      • Class probability estimation:

        • Average frequencies over trees
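The ensemble's class-probability step might look like this (toy leaf frequencies standing in for the ten trees; real tree induction omitted):

```python
from statistics import mean

# P(y=1) at the matched leaf of each tree, for one case (toy values).
leaf_freqs = [0.9, 0.7, 0.8]

def predict_proba(freqs):
    """Class probability estimate: average leaf frequencies over trees."""
    return mean(freqs)

def predict(freqs, threshold=0.5):
    """Class prediction: pick the class with the higher frequency."""
    return 1 if predict_proba(freqs) >= threshold else 0

print(predict(leaf_freqs))  # 1
```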



Tests

  • IPO data

    • 5 tables w/ small, simple schema

      • Majority of fields were in the “main” table, i.e. T

        • The only numeric fields were in main table, so no aggregation of numeric features needed

      • Other tables had key & one data field

      • Max path length 2 to reach all tables, no recursion

      • Predicate on one field in T used as the class

  • Tested against:

    • First-order logic aggregation

      • Extract clauses using an ILP system

      • Append evaluated clauses to each t

    • Various ILP systems

      • Using just data in T (or T and AH features?)


Test results

  • See paper for numbers

  • Accuracy with aggregate features:

    • Up to 10% increase over only features from T

    • Depends on which and how many extra features used

    • Most predictive feature was in a separate table

    • Expect accuracy increase as more info available

    • Shows info was not destroyed by aggregation

    • Vector distance features better

  • Generalization


Interesting ideas (“I”) & benefits (“B”)

  • Taxonomy

    • I: Division into stages of aggregation

      • Slot in any procedure per stage

      • Estimate complexity per stage

    • B: Might get the discussion going

  • Aggregate features

    • I: Identifying a “main” table

      • Others get aggregated

    • I: Forming partitions to aggregate over

      • Using queries with joins to pull in other tables

      • Abstract partitioning based on predicate

    • I: Comparing case against reference histograms

    • I: Separate comparison method and reference


Interesting ideas (“I”) & benefits (“B”)

  • Learning

    • I: Decision tree tricks

      • Cut DT induction off short to get class freqs

      • Starve DT of features to improve generalization


Issues

  • Some worrying lapses...

    • Lacked standard terms for common concepts

      • “position i [of vector has] the number of instances of [ith value]”... -> histogram

      • “abstraction hierarchy” -> schema

      • “value order” -> enumeration

      • Defined (and emphasized) terms for trivial and commonly used things

    • Imprecise use of terms

      • “variance” for (something like) second moment

      • I’m not confident they know what Mahalanobis distance is

      • They say “left outer join” and show inner join symbol


Issues

  • Some worrying lapses...

    • Did not connect “reference vector” and “variance vector” to underlying statistics

      • Should relate to bag prior and field value conditional probability, not just “weighted”

    • Did not acknowledge loss of correlation info from splitting up joined u tuples in their features

      • Assumes fields are independent

      • Dependency was mentioned in the taxonomy

    • Fig 1 schema cannot support § 2 example query

      • Missing a necessary foreign key reference


Issues

  • Some worrying lapses...

    • Their formal statement of the task did not show aggregation as dependent on t

      • Needed for c(t) partitioning

    • Did not clearly distinguish when t did or did not contain class

      • No need to put it in there at all

    • No, the higher Gaussian moments are not all zero!

      • Only the odd ones are. Yeesh.

      • Correct reason we don’t need them is: all can be computed from mean and variance

    • Uuugly notation


Issues

  • Some worrying lapses...

    • Did not cite other uses of histograms or distributions extracted as features

      • “Spike-triggered average” / covariance / etc.

        • Used by: all neurobiology, neurocomputation

        • E.g.: de Ruyter van Steveninck & Bialek

      • “Response-conditional ensemble”

        • Used by: Our own Adrienne Fairhall & colleagues

        • E.g.: Aguera & Arcas, Fairhall, Bialek

      • “Event-triggered distribution”

        • Used by: me ☺

        • E.g.: CSE528 project


Issues

  • Some worrying lapses...

    • Did not cite other uses of histograms or distributions extracted as features...

    • So, did not use “standard” tricks

      • Dimension reduction:

        • Treat histogram as a vector

        • Do PCA, keep top few eigenmodes, new features are projections

    • Nor “special” tricks:

      • Subtract prior covariance before PCA

    • Likewise competing the classes is not new


Issues

  • Non-goof issues

    • Would need bookkeeping to maintain variance vector for online learning

      • Don’t have sufficient statistics

      • Histograms are actual “samples”

      • Adding new data doesn’t add new “samples”: it changes existing ones

      • Could subtract old contribution, add new one

      • Use a triggered query

    • Don’t bin those nice numerical variables!

      • Binning makes vectors out of scalars

      • Scalar fields can be ganged into a vector across fields!

      • Do (e.g.) clustering on the bag of vectors

  • That’s enough of that