


Aggregate features for relational data
Claudia Perlich, Foster Provost

Pat Tressel

16-May-2005



Overview

  • Perlich and Provost provide...

    • Hierarchy of aggregation methods

    • Survey of existing aggregation methods

    • New aggregation methods

  • Concerned w/ supervised learning only

    • But much seems applicable to clustering



The issues…

  • Most classifiers use feature vectors

    • Individual features have fixed arity

    • No links to other objects

  • How do we get feature vectors from relational data?

    • Flatten it:

      • Joins

      • Aggregation

  • (Are feature vectors all there are?)



Joins

  • Why consider them?

    • Yield flat feature vectors

    • Preserve all the data

  • Why not use them?

    • They emphasize data with many references

      • Ok if that’s what we want

      • Not ok if sampling was skewed

      • Cascaded or transitive joins blow up



Joins

  • They emphasize data with many references:

    • Lots more Joes than there were before...
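A minimal pandas sketch of this effect, using hypothetical tables (not from the paper): a 1:n join repeats each target row once per related row, so heavily referenced individuals get over-represented.

```python
import pandas as pd

# Hypothetical target table (one row per person) and a 1:n detail table.
people = pd.DataFrame({"person_id": [1, 2], "name": ["Joe", "Ann"]})
calls = pd.DataFrame({"person_id": [1, 1, 1, 2], "minutes": [3, 10, 7, 2]})

# The join yields one output row per matching detail row, so "Joe"
# (three calls) now appears three times as often as "Ann" (one call).
joined = people.merge(calls, on="person_id", how="left")
print(joined)
```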



Joins

  • Why not use them?

    • What if we don’t know the references?

      • Try out everything with everything else

      • Cross product yields all combinations

      • Adds fictitious relationships

      • Combinatorial blowup



Joins

  • What if we don’t know the references?



Aggregates

  • Why use them?

    • Yield flat feature vectors

    • No blowup in number of tuples

      • Can group tuples in all related tables

    • Can keep as detailed stats as desired

      • Not just max, mean, etc.

      • Parametric dists from sufficient stats

      • Can apply tests for grouping

    • Choice of aggregates can be model-based

      • Better generalization

      • Include domain knowledge in model choice



Aggregates

  • Anything wrong with them?

    • Data is lost

    • Relational structure is lost

    • Influential individuals are lumped in

      • Doesn’t discover critical individuals

      • Dominates other data

    • Any choice of aggregates assumes a model

      • What if it’s wrong?

    • Adding new data can require recomputing aggregates

      • But can avoid issue by keeping sufficient statistics



Taxonomy of aggregates

  • Why is this useful?

    • Promote deliberate use of aggregates

    • Point out gaps in current use of aggregates

    • Find appropriate techniques for each class

  • Based on “complexity” due to:

    • Relational structure

      • Cardinality of the relations (1:1, 1:n, m:n)

    • Feature extraction

      • Computing the aggregates

    • Class prediction


Taxonomy of aggregates

  • Formal statement of the task:

  • Notation (here and on following slides):

    • Caution! Simplified from what’s in the paper!

    • t, tuple (from “target” table T, with main features)

    • y, class (known per t if training)

    • Ψ, aggregation function

    • Φ, classification function

    • σ, select operation (where joins preserve t)

    • Ω, all tables; B, any other table, b a tuple in B

    • u, fields to be added to t from joined tables

    • f, a field in u

    • More, that doesn’t fit on this slide



Aggregation complexity

  • Simple

    • One field from one object type

  • Denoted by:



Aggregation complexity

  • Multi-dimensional

    • Multiple fields, one object type

  • Denoted by:



Aggregation complexity

  • Multi-type

    • Multiple object types

  • Denoted by:



Relational “concept” complexity

  • Propositional

    • No aggregation

    • Single tuple, 1-1 or n-1 joins

      • n-1 is just a shared object

    • Not relational per se – already flat



Relational “concept” complexity

  • Independent fields

    • Separate aggregation per field

    • Separate 1-n joins with T



Relational “concept” complexity

  • Dependent fields in same table

    • Multi-dimensional aggregation

    • Separate 1-n joins with T



Relational “concept” complexity

  • Dependent fields over multiple tables

    • Multi-type aggregation

    • Separate 1-n joins, still only with T



Relational “concept” complexity

  • Global

    • Any joins or combinations of fields

      • Multi-type aggregation

      • Multi-way joins

      • Joins among tables other than T



Current relational aggregation

  • First-order logic

    • Find clauses that directly predict the class

      • Φ is OR

    • Form binary features from tests

      • Logical and arithmetic tests

      • These go in the feature vector

      • Φ is any ordinary classifier



Current relational aggregation

  • The usual database aggregates

    • For numerical values:

      • mean, min, max, count, sum, etc.

    • For categorical values:

      • Most common value

      • Count per value
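A small pandas sketch of these standard aggregates over a hypothetical 1:n detail table (table and field names are illustrative only):

```python
import pandas as pd

# Hypothetical detail table keyed by the target table's id.
calls = pd.DataFrame({
    "person_id": [1, 1, 1, 2],
    "minutes": [3, 10, 7, 2],
    "dest": ["home", "work", "home", "work"],
})

# Numeric field: the usual aggregates, one row per target tuple.
numeric = calls.groupby("person_id")["minutes"].agg(["mean", "min", "max", "count", "sum"])

# Categorical field: most common value, and a count per value.
most_common = calls.groupby("person_id")["dest"].agg(lambda s: s.mode().iloc[0])
per_value_counts = calls.groupby("person_id")["dest"].value_counts().unstack(fill_value=0)
print(numeric, most_common, per_value_counts, sep="\n\n")
```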



Current relational aggregation

  • Set distance

    • Two tuples, each with a set of related tuples

    • Distance metric between related fields

      • Euclidean for numerical data

      • Edit distance for categorical

    • Distance between sets is distance of closest pair
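A sketch of the closest-pair set distance for numeric fields, assuming Euclidean distance between members (the categorical case would swap in edit distance as the member metric):

```python
import numpy as np

def set_distance(bag_a, bag_b):
    """Distance between two bags of numeric tuples: the smallest
    pairwise Euclidean distance between their members."""
    a = np.asarray(bag_a, dtype=float)
    b = np.asarray(bag_b, dtype=float)
    pairwise = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return pairwise.min()

# Two hypothetical bags of related tuples (2-d feature rows each).
print(set_distance([[0, 0], [5, 5]], [[4, 4], [9, 9]]))  # closest pair: (5,5)-(4,4)
```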



Proposed relational aggregation

  • Recall the point of this work:

    • Tuple t from table T is part of a feature vector

    • Want to augment w/ info from other tables

    • Info added to t must be consistent w/ values in t

    • Need to flatten the added info to yield one vector per tuple t

    • Use that to:

      • Train classifier given class y for t

      • Predict class y for t



Proposed relational aggregation

  • Outline of steps:

    • Do query to get more info u from other tables

    • Partition the results based on:

      • Main features t

      • Class y

      • Predicates on t

    • Extract distributions over results for fields in u

      • Get distribution for each partition

      • For now, limit to categorical fields

      • Suggest extension to numerical fields

    • Derive features from distributions



Do query to get info from other tables

  • Select

    • Based on the target table T

    • If training, known class y is included in T

    • Joins must preserve distinct values from T

      • Join on as much of T’s key as is present in other table

      • Maybe need to constrain other fields?

      • Not a problem for correctly normalized tables

  • Project

    • Include all of t

    • Append additional fields u from joined tables

      • Anything up to all fields from joins
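A rough pandas equivalent of this select/project step, on hypothetical tables T and B; in SQL it would be a key-preserving join that projects all of t plus the extra fields u:

```python
import pandas as pd

# Hypothetical target table T (class y present when training) and another table B.
T = pd.DataFrame({"tid": [1, 2], "x1": [0.3, 0.7], "y": [1, 0]})
B = pd.DataFrame({"tid": [1, 1, 2, 2, 2], "f": ["a", "b", "a", "a", "c"]})

# Select: join on T's key so every result row stays tied to a distinct t.
# Project: keep all of t and append the additional fields u from the join.
tu = T.merge(B, on="tid", how="left")   # one row per (t, related b) pair
print(tu)
```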



Extract distributions

  • Partition query results various ways, e.g.:

    • Into cases per each t

      • For training, include the (known) class y in t

    • Also (if training) split per each class

      • Want this for class priors

    • Split per some (unspecified) predicate c(t)

  • For each partition:

    • There is a bag of associated u tuples

      • Ignore the t part – already a flat vector

    • Split vertically to get bags of individual values per each field f in u

      • Note this breaks association between fields!
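A sketch of this partitioning step on the joined result from the previous sketch (the table and predicate are hypothetical); note, as above, that taking each field separately drops cross-field associations:

```python
import pandas as pd

# Joined result: all of t (here tid, y) plus the extra field u = {"f"}.
tu = pd.DataFrame({"tid": [1, 1, 2, 2, 2],
                   "y":   [1, 1, 0, 0, 0],
                   "f":   ["a", "b", "a", "a", "c"]})

# One bag of f-values per case t.
bags_per_t = tu.groupby("tid")["f"].apply(list)

# One bag per class (training only), e.g. for class priors.
bags_per_class = tu.groupby("y")["f"].apply(list)

# One bag for an arbitrary predicate c(t) over the t part of each row.
c = tu["y"] == 1                       # hypothetical c(t)
bag_c_true = tu.loc[c, "f"].tolist()
print(bags_per_t, bags_per_class, bag_c_true, sep="\n\n")
```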



Distributions for categorical fields

  • Let categorical field be f with values fi

  • Form histogram for each partition

    • Count instances of each value fi of f in a bag

    • These are sufficient statistics for:

      • Distribution over fi values

      • Probability of each bag in the partition

  • Start with one per each tuple t and field f

    • Cft, (per-) case vector

    • Component Cft[i], count for fi
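A sketch of these per-case histograms on hypothetical data: rows are cases t, columns are the values fi, and each row is a case vector Cft.

```python
import pandas as pd

tu = pd.DataFrame({"tid": [1, 1, 2, 2, 2], "f": ["a", "b", "a", "a", "c"]})

# Case vectors Cft: count of each value f_i in the bag for each t.
C = tu.groupby("tid")["f"].value_counts().unstack(fill_value=0)
print(C)
#      a  b  c
# tid
# 1    1  1  0
# 2    2  0  1
```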



Distributions for categorical fields

  • Distribution of histograms per predicate c(t) and field f

    • Treat histogram counts as random variables

      • Regard c(t) true partition as a collection of histogram “samples”

      • Regard histograms as vectors of random variables, one per field value fi

    • Extract moments of these histogram count distributions

      • mean (sort of) – reference vector

      • variance (sort of) – variance vector



Distributions for categorical fields

  • Net histogram per predicate c(t), field f

    • c(t) partitions tuples t into two groups

      • Only histogram the c(t) true group

      • Could include ~c as a predicate if we want

    • Don’t re-count!

      • Already have histograms for each t and f – case reference vectors

      • Sum the case reference vectors columnwise

    • Call this a “reference vector”, Rfc

      • Proportional to average histogram over t for c(t) true (weighted by # samples per t)
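A sketch of forming a reference vector from existing case vectors (the case-vector table and the predicate mask are hypothetical):

```python
import pandas as pd

# Case vectors Cft: rows are cases t, columns are the values f_i.
C = pd.DataFrame({"a": [1, 2, 0], "b": [1, 0, 2], "c": [0, 1, 1]}, index=[1, 2, 3])

# c(t) evaluated per case, e.g. "t is a positive training case".
c_true = pd.Series({1: True, 2: False, 3: True})

# Reference vector Rfc: columnwise sum of the case vectors with c(t) true.
R = C[c_true].sum(axis=0)
print(R)
```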



Distributions for categorical fields

  • Variance of case histograms per predicate c(t) and field f

    • Define “variance vector”, Vfc

      • Columnwise sum of squares of case reference vectors / number of samples with c(t) true

      • Not an actual variance

        • Squared means not subtracted

      • Don’t care:

        • It’s indicative of the variance...

        • Throw in means-based features as well to give classifier full variance info
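Continuing the same hypothetical setup, the "variance vector" as described here (no subtraction of the squared mean):

```python
import pandas as pd

# Case vectors for the cases with c(t) true (hypothetical values).
C_true = pd.DataFrame({"a": [1, 0], "b": [1, 2], "c": [0, 1]}, index=[1, 3])

# Variance vector Vfc: columnwise sum of squared counts divided by the
# number of c(t)-true cases. Not a true variance (means are not subtracted).
V = (C_true ** 2).sum(axis=0) / len(C_true)
print(V)
```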



Distributions for categorical fields

  • What predicates might we use?

    • Unconditionally true, c(t) = true

      • Result is net distribution independent of t

      • Unconditional reference vector, R

    • Per class k, c_k(t) = (t.y == k)

      • Class priors

      • Recall for training data, y is a field in t

      • Per-class reference vector, R_{f, t.y=k}



Distributions for categorical fields

  • Summary of notation

    • c(t), a predicate based on values in a tuple t

    • f, a categorical field from a join with T

    • fi, values of f

    • Rfc, reference vector

      • histogram over fi values in bag for c(t) true

    • Cft, case vector

      • histogram over fi values for t’s bag

    • R, unconditional reference vector

    • Vfc, variance vector

      • Columnwise average squared ref. vector

    • X[i], i-th value in some ref. vector X



Distributions for numerical data

  • Same general idea – representative distributions per various partitions

  • Can use categorical techniques if we:

    • Bin the numerical values

    • Treat each bin as a categorical value
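A sketch of that binning route with hypothetical bin edges; after binning, the categorical machinery (case, reference, and variance vectors) applies unchanged.

```python
import pandas as pd

tu = pd.DataFrame({"tid": [1, 1, 2, 2, 2],
                   "amount": [3.0, 11.5, 0.5, 7.2, 9.9]})

# Bin the numeric field and treat each bin label as a categorical value.
tu["amount_bin"] = pd.cut(tu["amount"], bins=[0, 5, 10, 15],
                          labels=["low", "mid", "high"])
C = tu.groupby("tid")["amount_bin"].value_counts().unstack(fill_value=0)
print(C)
```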



Feature extraction

  • Base features on ref. and variance vectors

  • Two kinds:

    • “Interesting” values

      • one value from case reference vector per t

      • same column in vector for all t

      • assorted options for choosing column

      • choices depend on predicate ref. vectors

    • Vector distances

      • distance between case ref. vector and predicate ref. vector

      • various distance metrics

  • More notation: acronym for each feature type



Feature extraction: “interesting” values

  • For a given c, f, select that fi which is...

    • MOC: Most common overall

      • argmax_i R[i]

    • Most common in each class

      • For binary class y

        • Positive is y = 1, Negative is y = 0

      • MOP: argmax_i R_{f, t.y=1}[i]

      • MON: argmax_i R_{f, t.y=0}[i]

    • Most distinctive per class

      • Common in one class but not in other(s)

      • MOD: argmax_i |R_{f, t.y=1}[i] - R_{f, t.y=0}[i]|

      • MOM: argmax_i |R_{f, t.y=1}[i] - R_{f, t.y=0}[i]| / (V_{f, t.y=1}[i] - V_{f, t.y=0}[i])

        • Normalizes for variance (sort of)
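A numpy sketch of picking these "interesting" columns from hypothetical reference and variance vectors; the MOM normalization follows the reading of the slide above, which may not match the paper exactly. The per-case feature is then the chosen column of each case vector.

```python
import numpy as np

# Hypothetical reference / variance vectors over the values f_i of one field f.
R     = np.array([10.0, 4.0, 6.0])   # unconditional
R_pos = np.array([6.0, 1.0, 5.0])    # R_{f, y=1}
R_neg = np.array([4.0, 3.0, 1.0])    # R_{f, y=0}
V_pos = np.array([9.0, 3.0, 4.0])    # V_{f, y=1}
V_neg = np.array([4.0, 2.0, 1.0])    # V_{f, y=0}

MOC = np.argmax(R)                        # most common value overall
MOP = np.argmax(R_pos)                    # most common in the positive class
MON = np.argmax(R_neg)                    # most common in the negative class
MOD = np.argmax(np.abs(R_pos - R_neg))    # most class-distinctive value
MOM = np.argmax(np.abs(R_pos - R_neg) /   # distinctive value, roughly
                (V_pos - V_neg))          # normalized by the "variance" gap
print(MOC, MOP, MON, MOD, MOM)

# Per-case feature for a given choice, e.g. MOD: the same column of every
# case vector Cft.
C_t = np.array([1.0, 1.0, 0.0])           # one hypothetical case vector
feature_MOD = C_t[MOD]
```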



Feature extraction: vector distance

  • Distance btw given ref. vector & each case vector

  • Distance metrics

    • ED: Edit – not defined

      • Sum of abs. diffs, a.k.a. Manhattan dist?

      • Σ_i |C[i] - R[i]|

    • EU: Euclidean

      • √((C - R)^T (C - R)), omit √ for speed

    • MA: Mahalanobis

      • √((C - R)^T Σ^-1 (C - R)), omit √ for speed

      • Σ should be covariance... of what?

    • CO: Cosine, 1 - cos(angle btw vectors)

      • 1 - C^T R / (|C| |R|)
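A sketch of these distances between one case vector and a reference vector, using the standard definitions (the paper's exact formulas, especially the Mahalanobis covariance, are unclear from the slide):

```python
import numpy as np

def distance_features(C, R, cov=None):
    """Distances between a case vector C and a reference vector R."""
    diff = C - R
    ed = np.abs(diff).sum()                      # "edit" / Manhattan distance
    eu = diff @ diff                             # squared Euclidean (sqrt omitted)
    co = 1.0 - (C @ R) / (np.linalg.norm(C) * np.linalg.norm(R))  # cosine distance
    ma = diff @ np.linalg.inv(cov) @ diff if cov is not None else None
    return ed, eu, co, ma                        # ma: squared Mahalanobis

C = np.array([1.0, 1.0, 0.0])
R = np.array([10.0, 4.0, 6.0])
print(distance_features(C, R, cov=np.eye(3)))
```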



Feature extraction: vector distance

  • Apply each metric w/ various ref. vectors

    • Acronym is metric w/ suffix for ref. vector

    • (No suffix): Unconditional ref. vector

    • P: per-class positive ref. vector, R_{f, t.y=1}

    • N: per-class negative ref. vector, R_{f, t.y=0}

    • D: difference between the P and N distances

  • Alphabet soup, e.g. EUP, MAD,...



Feature extraction

  • Other features added for tests

    • Not part of their aggregation proposal

    • AH: “abstraction hierarchy” (?)

      • Pull into T all fields that are just “shared records” via n:1 references

    • AC: “autocorrelation” aggregation

      • For joins back into T, get other cases “linked to” each t

      • Fraction of positive cases among others



Learning

  • Find linked tables

    • Starting from T, do breadth-first walk of schema graph

      • Up to some max depth

      • Cap number of paths followed

    • For each path, know T is linked to last table in path

  • Extract aggregate fields

    • Pull in all fields of last table in path

    • Aggregate them (using new aggregates) per t

    • Append aggregates to t
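A small sketch of the breadth-first schema walk over a hypothetical graph of foreign-key links; the depth and path caps are the ones mentioned above.

```python
from collections import deque

# Hypothetical schema graph: table -> tables reachable via a key reference.
schema = {"T": ["B1", "B2"], "B1": ["B3"], "B2": [], "B3": []}

def linked_paths(start, max_depth=2, max_paths=100):
    """Breadth-first walk from the target table; each returned path links
    the target table to the path's last table."""
    paths, queue = [], deque([[start]])
    while queue and len(paths) < max_paths:
        path = queue.popleft()
        if len(path) > 1:
            paths.append(path)
        if len(path) - 1 < max_depth:
            for nxt in schema.get(path[-1], []):
                queue.append(path + [nxt])
    return paths

print(linked_paths("T"))  # [['T', 'B1'], ['T', 'B2'], ['T', 'B1', 'B3']]
```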



Learning

  • Classifier

    • Pick 10 subsets each w/ 10 features

      • Random choice, weighted by “performance”

      • But there’s no classifier yet...so how do features predict class?

    • Build a decision tree for each feature set

      • Have class frequencies at leaves

        • Features might not completely distinguish classes

      • Class prediction:

        • Select class with higher frequency

      • Class probability estimation:

        • Average frequencies over trees
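A minimal scikit-learn sketch of this ensemble (my simplification: feature subsets are drawn uniformly at random rather than weighted by "performance", and the data are synthetic):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))            # hypothetical aggregate feature matrix
y = (X[:, 0] + X[:, 7] > 0).astype(int)   # hypothetical binary class

# 10 subsets of 10 features, one shallow decision tree per subset.
subsets = [rng.choice(X.shape[1], size=10, replace=False) for _ in range(10)]
trees = [DecisionTreeClassifier(max_depth=3).fit(X[:, cols], y) for cols in subsets]

# Class-probability estimate: average the leaf class frequencies over trees;
# class prediction: take the class with the higher average frequency.
proba = np.mean([t.predict_proba(X[:, cols]) for t, cols in zip(trees, subsets)], axis=0)
pred = proba.argmax(axis=1)
print(proba[:3], pred[:3], sep="\n")
```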


Tests

  • IPO data

    • 5 tables w/ small, simple schema

      • Majority of fields were in the “main” table, i.e. T

        • The only numeric fields were in main table, so no aggregation of numeric features needed

      • Other tables had key & one data field

      • Max path length 2 to reach all tables, no recursion

      • Predicate on one field in T used as the class

  • Tested against:

    • First-order logic aggregation

      • Extract clauses using an ILP system

      • Append evaluated clauses to each t

    • Various ILP systems

      • Using just data in T (or T and AH features?)



Test results

  • See paper for numbers

  • Accuracy with aggregate features:

    • Up to 10% increase over only features from T

    • Depends on which and how many extra features used

    • Most predictive feature was in a separate table

    • Expect accuracy increase as more info available

    • Shows info was not destroyed by aggregation

    • Vector-distance features performed better

  • Generalization



Interesting ideas (“I”) & benefits (“B”)

  • Taxonomy

    • I: Division into stages of aggregation

      • Slot in any procedure per stage

      • Estmate complexity per stage

    • B: Might get the discussion going

  • Aggregate features

    • I: Identifying a “main” table

      • Others get aggregated

    • I: Forming partitions to aggregate over

      • Using queries with joins to pull in other tables

      • Abstract partitioning based on predicate

    • I: Comparing case against reference histograms

    • I: Separate comparison method and reference



Interesting ideas (“I”) & benefits (“B”)

  • Learning

    • I: Decision tree tricks

      • Cut DT induction off short to get class freqs

      • Starve DT of features to improve generalization



Issues

  • Some worrying lapses...

    • Lacked standard terms for common concepts

      • “position i [of vector has] the number of instances of [ith value]”... -> histogram

      • “abstraction hierarchy” -> schema

      • “value order” -> enumeration

      • Defined (and emphasized) terms for trivial and commonly used things

    • Imprecise use of terms

      • “variance” for (something like) second moment

      • I’m not confident they know what Mahalanobis distance is

      • They say “left outer join” and show inner join symbol



Issues

  • Some worrying lapses...

    • Did not connect “reference vector” and “variance vector” to underlying statistics

      • Should relate to bag prior and field value conditional probability, not just “weighted”

    • Did not acknowledge loss of correlation info from splitting up joined u tuples in their features

      • Assumes fields are independent

      • Dependency was mentioned in the taxonomy

    • Fig 1 schema cannot support § 2 example query

      • Missing a necessary foreign key reference



Issues

  • Some worrying lapses...

    • Their formal statement of the task did not show aggregation as dependent on t

      • Needed for c(t) partitioning

    • Did not clearly distinguish when t did or did not contain class

      • No need to put it in there at all

    • No, the higher Gaussian moments are not all zero!

      • Only the odd ones are. Yeesh.

      • Correct reason we don’t need them is: all can be computed from mean and variance

    • Uuugly notation



Issues

  • Some worrying lapses...

    • Did not cite other uses of histograms or distributions extracted as features

      • “Spike-triggered average” / covariance / etc.

        • Used by: all neurobiology, neurocomputation

        • E.g.: de Ruyter van Steveninck & Bialek

      • “Response-conditional ensemble”

        • Used by: Our own Adrienne Fairhall & colleagues

        • E.g.: Aguera & Arcas, Fairhall, Bialek

      • “Event-triggered distribution”

        • Used by: me ☺

        • E.g.: CSE528 project



Issues

  • Some worrying lapses...

    • Did not cite other uses of histograms or distributions extracted as features...

    • So, did not use “standard” tricks

      • Dimension reduction:

        • Treat histogram as a vector

        • Do PCA, keep top few eigenmodes, new features are projections

    • Nor “special” tricks:

      • Subtract prior covariance before PCA

    • Likewise competing the classes is not new



Issues

  • Non-goof issues

    • Would need bookkeeping to maintain variance vector for online learning

      • Don’t have sufficient statistics

      • Histograms are actual “samples”

      • Adding new data doesn’t add new “samples”; it changes existing ones

      • Could subtract old contribution, add new one

      • Use a triggered query

    • Don’t bin those nice numerical variables!

      • Binning makes vectors out of scalars

      • Scalar fields can be ganged into a vector across fields!

      • Do (e.g.) clustering on the bag of vectors

  • That’s enough of that

