
Aggregate features for relational data
Claudia Perlich, Foster Provost

Pat Tressel

16-May-2005



Overview

  • Perlich and Provost provide...

    • Hierarchy of aggregation methods

    • Survey of existing aggregation methods

    • New aggregation methods

  • Concerned w/ supervised learning only

    • But much seems applicable to clustering



The issues…

  • Most classifiers use feature vectors

    • Individual features have fixed arity

    • No links to other objects

  • How do we get feature vectors from relational data?

    • Flatten it:

      • Joins

      • Aggregation

  • (Are feature vectors all there are?)



Joins

  • Why consider them?

    • Yield flat feature vectors

    • Preserve all the data

  • Why not use them?

    • They emphasize data with many references

      • Ok if that’s what we want

      • Not ok if sampling was skewed

      • Cascaded or transitive joins blow up
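
As a concrete illustration of the blow-up, here is a minimal pandas sketch (the people and transactions tables are invented, not from the paper):

```python
import pandas as pd

# One row per person in the "target" table...
people = pd.DataFrame({"name": ["Joe", "Ann"], "age": [30, 40]})

# ...but Joe has many related rows (a 1:n reference).
transactions = pd.DataFrame({"name": ["Joe"] * 5 + ["Ann"],
                             "amount": [10, 20, 30, 40, 50, 5]})

flat = people.merge(transactions, on="name")
print(flat)   # Joe now appears 5 times, Ann once
# A learner trained on `flat` effectively sees five copies of Joe,
# so heavily-referenced objects dominate unless we reweight or aggregate.
```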



Joins

  • They emphasize data with many references:

    • Lots more Joes than there were before...



Joins

  • Why not use them?

    • What if we don’t know the references?

      • Try out everything with everything else

      • Cross product yields all combinations

      • Adds fictitious relationships

      • Combinatorial blowup
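
And a tiny sketch of the cross-product case (invented tables; pandas >= 1.2 for how="cross"): every row of one table is paired with every row of the other, so fictitious pairs appear and the row count multiplies.

```python
import pandas as pd

people = pd.DataFrame({"person": ["Joe", "Ann", "Bob"]})
accounts = pd.DataFrame({"account": [101, 102, 103, 104]})

# With no known reference between the tables, try everything with everything else.
pairs = people.merge(accounts, how="cross")
print(len(pairs))   # 3 * 4 = 12 rows, most of them fictitious relationships
```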





Aggregates

  • Why use them?

    • Yield flat feature vectors

    • No blowup in number of tuples

      • Can group tuples in all related tables

    • Can keep as detailed stats as desired

      • Not just max, mean, etc.

      • Parametric dists from sufficient stats

      • Can apply tests for grouping

    • Choice of aggregates can be model-based

      • Better generalization

      • Include domain knowledge in model choice
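
A minimal sketch of the aggregation route, reusing the invented transactions table from the join sketch above: group the 1:n side down to one row of summary statistics per target tuple.

```python
import pandas as pd

transactions = pd.DataFrame({"name": ["Joe"] * 5 + ["Ann"],
                             "amount": [10, 20, 30, 40, 50, 5]})

# One row per person, no matter how many related tuples each has.
agg = transactions.groupby("name")["amount"].agg(["count", "mean", "max", "sum"])
print(agg)
# These columns can be appended to the target table as flat features; richer
# summaries (e.g. sufficient statistics of a parametric distribution) work the same way.
```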



Aggregates

  • Anything wrong with them?

    • Data is lost

    • Relational structure is lost

    • Influential individuals are lumped in

      • Doesn’t discover critical individuals

      • Dominates other data

    • Any choice of aggregates assumes a model

      • What if it’s wrong?

    • Adding new data can require recomputing aggregates

      • But can avoid issue by keeping sufficient statistics



Taxonomy of aggregates

  • Why is this useful?

    • Promote deliberate use of aggregates

    • Point out gaps in current use of aggregates

    • Find appropriate techniques for each class

  • Based on “complexity” due to:

    • Relational structure

      • Cardinality of the relations (1:1, 1:n, m:n)

    • Feature extraction

      • Computing the aggregates

    • Class prediction



Taxonomy of aggregates

  • Formal statement of the task:

  • Notation (here and on following slides):

    • Caution! Simplified from what’s in the paper!

    • t, tuple (from “target” table T, with main features)

    • y, class (known per t if training)

    • Ψ, aggregation function

    • Φ, classification function

    • σ, select operation (where joins preserve t)

    • Ω, all tables; B, any other table, b a tuple in B

    • u, fields to be added to t from joined tables

    • f, a field in u

    • More, that doesn’t fit on this slide



Aggregation complexity

  • Simple

    • One field from one object type

  • Denoted by:



Aggregation complexity

  • Multi-dimensional

    • Multiple fields, one object type

  • Denoted by:



Aggregation complexity

  • Multi-type

    • Multiple object types

  • Denoted by:



Relational “concept” complexity

  • Propositional

    • No aggregation

    • Single tuple, 1-1 or n-1 joins

      • n-1 is just a shared object

    • Not relational per se – already flat



Relational “concept” complexity

  • Independent fields

    • Separate aggregation per field

    • Separate 1-n joins with T



Relational “concept” complexity

  • Dependent fields in same table

    • Multi-dimensional aggregation

    • Separate 1-n joins with T



Relational “concept” complexity

  • Dependent fields over multiple tables

    • Multi-type aggregation

    • Separate 1-n joins, still only with T



Relational “concept” complexity

  • Global

    • Any joins or combinations of fields

      • Multi-type aggregation

      • Multi-way joins

      • Joins among tables other than T



Current relational aggregation

  • First-order logic

    • Find clauses that directly predict the class

      • Φ is OR

    • Form binary features from tests

      • Logical and arithmetic tests

      • These go in the feature vector

      • Φ is any ordinary classifier



Current relational aggregation

  • The usual database aggregates

    • For numerical values:

      • mean, min, max, count, sum, etc.

    • For categorical values:

      • Most common value

      • Count per value
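
For categorical values the same machinery looks like this (a sketch; the color field is invented):

```python
import pandas as pd

related = pd.DataFrame({"name": ["Joe"] * 4 + ["Ann"] * 2,
                        "color": ["red", "red", "blue", "green", "blue", "blue"]})

# Most common value per target tuple...
most_common = related.groupby("name")["color"].agg(lambda s: s.mode().iat[0])

# ...and a count per value (one column per category).
counts = pd.crosstab(related["name"], related["color"])
print(most_common, counts, sep="\n\n")
```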



Current relational aggregation

  • Set distance

    • Two tuples, each with a set of related tuples

    • Distance metric between related fields

      • Euclidean for numerical data

      • Edit distance for categorical

    • Distance between sets is distance of closest pair
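
A small sketch of the set-distance idea for numeric fields (closest-pair, i.e. single-linkage, distance between the two bags; the function name is mine):

```python
import numpy as np

def set_distance(bag_a, bag_b):
    """Distance between two sets of numeric vectors = Euclidean distance of the closest pair."""
    a = np.asarray(bag_a, dtype=float)          # shape (n_a, d)
    b = np.asarray(bag_b, dtype=float)          # shape (n_b, d)
    diffs = a[:, None, :] - b[None, :, :]       # all pairwise differences
    return np.sqrt((diffs ** 2).sum(axis=-1)).min()

print(set_distance([[0, 0], [5, 5]], [[4, 4], [9, 9]]))   # ~1.41, from the pair (5,5)-(4,4)
```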



Proposed relational aggregation

  • Recall the point of this work:

    • Tuple t from table T is part of a feature vector

    • Want to augment w/ info from other tables

    • Info added to t must be consistent w/ values in t

    • Need to flatten the added info to yield one vector per tuple t

    • Use that to:

      • Train classifier given class y for t

      • Predict class y for t



Proposed relational aggregation

  • Outline of steps:

    • Do query to get more info u from other tables

    • Partition the results based on:

      • Main features t

      • Class y

      • Predicates on t

    • Extract distributions over results for fields in u

      • Get distribution for each partition

      • For now, limit to categorical fields

      • Suggest extension to numerical fields

    • Derive features from distributions



Do query to get info from other tables

  • Select

    • Based on the target table T

    • If training, known class y is included in T

    • Joins must preserve distinct values from T

      • Join on as much of T’s key as is present in the other table

      • Maybe need to constrain other fields?

      • Not a problem for correctly normalized tables

  • Project

    • Include all of t

    • Append additional fields u from joined tables

      • Anything up to all fields from joins
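
A minimal sketch of this select/project step with pandas standing in for SQL (tid, x, f are invented names): a left join on T's key preserves every target tuple t and appends the extra field(s) u.

```python
import pandas as pd

T = pd.DataFrame({"tid": [1, 2], "x": [0.3, 0.7], "y": [1, 0]})      # target table; class y included for training
B = pd.DataFrame({"tid": [1, 1, 1, 2], "f": ["a", "b", "a", "c"]})   # related table with one field f

# Join on T's key so distinct values of t are preserved; keep all of t plus the added field(s) u.
joined = T.merge(B, on="tid", how="left")
print(joined)   # one row per (t, related tuple); still needs flattening by aggregation
```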



Extract distributions

  • Partition query results various ways, e.g.:

    • Into cases per each t

      • For training, include the (known) class y in t

    • Also (if training) split per each class

      • Want this for class priors

    • Split per some (unspecified) predicate c(t)

  • For each partition:

    • There is a bag of associated u tuples

      • Ignore the t part – already a flat vector

    • Split vertically to get bags of individual values per each field f in u

      • Note this breaks association between fields!



Distributions for categorical fields

  • Let categorical field be f with values fi

  • Form histogram for each partition

    • Count instances of each value fi of f in a bag

    • These are sufficient statistics for:

      • Distribution over fi values

      • Probability of each bag in the partition

  • Start with one per each tuple t and field f

    • Cft, (per-) case vector

    • Component Cft[i], count for fi
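
Continuing the invented joined table from the query sketch above, the case vectors Cft can be built with a cross-tabulation: one row per target tuple t, one column per value fi of f.

```python
import pandas as pd

joined = pd.DataFrame({"tid": [1, 1, 1, 2, 2],
                       "f":   ["a", "b", "a", "c", "a"]})

# Case vectors: counts of each value f_i of f within each tuple's bag.
C = pd.crosstab(joined["tid"], joined["f"])
print(C)
# f    a  b  c
# tid
# 1    2  1  0
# 2    1  0  1
```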



Distributions for categorical fields

  • Distribution of histograms per predicate c(t) and field f

    • Treat histogram counts as random variables

      • Regard c(t) true partition as a collection of histogram “samples”

      • Regard histograms as vectors of random variables, one per field value fi

    • Extract moments of these histogram count distributions

      • mean (sort of) – reference vector

      • variance (sort of) – variance vector



Distributions for categorical fields

  • Net histogram per predicate c(t), field f

    • c(t) partitions tuples t into two groups

      • Only histogram the c(t) true group

      • Could include ~c as a predicate if we want

    • Don’t re-count!

      • Already have histograms for each t and f – case reference vectors

      • Sum the case reference vectors columnwise

    • Call this a “reference vector”, Rfc

      • Proportional to average histogram over t for c(t) true (weighted by # samples per t)



Distributions for categorical fields

  • Variance of case histograms per predicate c(t) and field f

    • Define “variance vector”, Vfc

      • Columnwise sum of squares of case reference vectors / number of samples with c(t) true

      • Not an actual variance

        • Squared means not subtracted

      • Don’t care:

        • It’s indicative of the variance...

        • Throw in means-based features as well to give classifier full variance info
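
Putting the last two slides together: given the case vectors, the reference vector and the "variance" vector for a predicate c(t) are just columnwise sums over the tuples where c(t) holds, as in this small sketch (the numbers are invented).

```python
import pandas as pd

# Case vectors: rows are tuples t, columns are the values f_i of one field f.
C = pd.DataFrame({"a": [2, 1, 0], "b": [1, 0, 3], "c": [0, 1, 1]}, index=[1, 2, 3])
c_true = [1, 3]                                   # tuples t for which the predicate c(t) holds

R_c = C.loc[c_true].sum()                         # reference vector: columnwise sum of case vectors
V_c = (C.loc[c_true] ** 2).sum() / len(c_true)    # "variance" vector: mean of squared counts (means not subtracted)
print(R_c, V_c, sep="\n\n")
```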



Distributions for categorical fields

  • What predicates might we use?

    • Unconditionally true, c(t) = true

      • Result is net distribution independent of t

      • Unconditional reference vector, R

    • Per class k, ck(t) = (t.y == k)

      • Class priors

      • Recall for training data, y is a field in t

      • Per-class reference vector, Rft.y=k



Distributions for categorical fields

  • Summary of notation

    • c(t), a predicate based on values in a tuple t

    • f, a categorical field from a join with T

    • fi, values of f

    • Rfc, reference vector

      • histogram over fi values in bag for c(t) true

    • Cft, case vector

      • histogram over fi values for t’s bag

    • R, unconditional reference vector

    • Vfc, variance vector

      • Columnwise average of the squared case vectors

    • X[i], the ith value in some vector X



Distributions for numerical data

  • Same general idea – representative distributions per various partitions

  • Can use categorical techniques if we:

    • Bin the numerical values

    • Treat each bin as a categorical value



Feature extraction

  • Base features on ref. and variance vectors

  • Two kinds:

    • “Interesting” values

      • one value from case reference vector per t

      • same column in vector for all t

      • assorted options for choosing column

      • choices depend on predicate ref. vectors

    • Vector distances

      • distance between case ref. vector and predicate ref. vector

      • various distance metrics

  • More notation: acronym for each feature type



Feature extraction: “interesting” values

  • For a given c, f, select that fi which is...

    • MOC: Most common overall

      • argmax_i R[i]

    • Most common in each class

      • For binary class y

        • Positive is y = 1, Negative is y = 0

      • MOP: argmax_i Rft.y=1[i]

      • MON: argmax_i Rft.y=0[i]

    • Most distinctive per class

      • Common in one class but not in other(s)

      • MOD: argmax_i |Rft.y=1[i] - Rft.y=0[i]|

      • MOM: argmax_i |Rft.y=1[i] - Rft.y=0[i]| / (Vft.y=1[i] - Vft.y=0[i])

        • Normalizes for variance (sort of)
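
A small sketch of these argmax-style features over invented reference vectors for one field f (R_pos/R_neg and V_pos/V_neg stand in for Rft.y=1/Rft.y=0 and Vft.y=1/Vft.y=0; the MOM line follows the slide's formula as written):

```python
import numpy as np

values = np.array(["a", "b", "c"])                   # the categories f_i
R      = np.array([10.0, 6.0, 4.0])                  # unconditional reference vector
R_pos  = np.array([2.0, 5.0, 1.0]); R_neg = np.array([8.0, 1.0, 3.0])
V_pos  = np.array([1.5, 4.0, 0.5]); V_neg = np.array([6.0, 0.8, 2.0])

MOC = values[np.argmax(R)]                           # most common overall
MOP = values[np.argmax(R_pos)]                       # most common among positives
MON = values[np.argmax(R_neg)]                       # most common among negatives
MOD = values[np.argmax(np.abs(R_pos - R_neg))]       # most distinctive between classes
MOM = values[np.argmax(np.abs(R_pos - R_neg) / (V_pos - V_neg))]   # variance-normalized, per the slide
print(MOC, MOP, MON, MOD, MOM)
```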



Feature extraction: vector distance

  • Distance btw given ref. vector & each case vector

  • Distance metrics

    • ED: Edit – not defined

      • Sum of abs. diffs, a.k.a. Manhattan dist?

      • Σ_i |C[i] - R[i]|

    • EU: Euclidean

      • √((C - R)ᵀ(C - R)), omit √ for speed

    • MA: Mahalanobis

      • √((C - R)ᵀ Σ⁻¹ (C - R)), omit √ for speed

      • Σ should be a covariance... of what?

    • CO: Cosine, 1- cos(angle btw vectors)

      • 1 - CᵀR / √((CᵀC)(RᵀR))
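
A sketch of the four distances between one case vector C and a reference vector R; the Mahalanobis covariance is taken over the stacked case vectors, which is only a guess at the "covariance of what?" question above.

```python
import numpy as np

C = np.array([2.0, 1.0, 0.0])                  # one case vector
R = np.array([10.0, 6.0, 4.0])                 # a reference vector
cases = np.array([[2.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 3.0, 1.0], [4.0, 2.0, 1.0]])

d = C - R
ED = np.abs(d).sum()                                        # "edit" / Manhattan distance
EU = d @ d                                                  # squared Euclidean (sqrt omitted for speed)
Sigma = np.cov(cases, rowvar=False) + 1e-6 * np.eye(3)      # covariance of case vectors, lightly regularized
MA = d @ np.linalg.solve(Sigma, d)                          # squared Mahalanobis distance
CO = 1 - (C @ R) / (np.linalg.norm(C) * np.linalg.norm(R))  # cosine distance
print(ED, EU, MA, CO)
```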



Feature extraction: vector distance

  • Apply each metric w/ various ref. vectors

    • Acronym is metric w/ suffix for ref. vector

    • (No suffix): Unconditional ref. vector

    • P: per-class positive ref. vector, Rft.y=1

    • N: per-class negative ref. vector, Rft.y=0

    • D: difference between the P and N distances

  • Alphabet soup, e.g. EUP, MAD,...



Feature extraction

  • Other features added for tests

    • Not part of their aggregation proposal

    • AH: “abstraction hierarchy” (?)

      • Pull into T all fields that are just “shared records” via n:1 references

    • AC: “autocorrelation” aggregation

      • For joins back into T, get other cases “linked to” each t

      • Fraction of positive cases among others



Learning

  • Find linked tables

    • Starting from T, do breadth-first walk of schema graph

      • Up to some max depth

      • Cap number of paths followed

    • For each path, know T is linked to last table in path

  • Extract aggregate fields

    • Pull in all fields of last table in path

    • Aggregate them (using new aggregates) per t

    • Append aggregates to t
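
A sketch of the breadth-first schema walk with a depth cap (the schema graph here is an invented adjacency dict mapping each table to the tables it shares keys with):

```python
from collections import deque

schema = {"T": ["B1", "B2"], "B1": ["T", "B3"], "B2": ["T"], "B3": ["B1"]}

def linked_paths(start="T", max_depth=2, max_paths=100):
    """Breadth-first enumeration of join paths out of the target table."""
    paths, queue = [], deque([[start]])
    while queue and len(paths) < max_paths:
        path = queue.popleft()
        if len(path) > 1:
            paths.append(path)                 # T is linked to the last table in each path
        if len(path) - 1 < max_depth:
            for nxt in schema[path[-1]]:
                if nxt not in path:            # skip trivial back-and-forth loops
                    queue.append(path + [nxt])
    return paths

print(linked_paths())   # [['T', 'B1'], ['T', 'B2'], ['T', 'B1', 'B3']]
```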



Learning

  • Classifier

    • Pick 10 subsets each w/ 10 features

      • Random choice, weighted by “performance”

      • But there’s no classifier yet...so how do features predict class?

    • Build a decision tree for each feature set

      • Have class frequencies at leaves

        • Features might not completely distinguish classes

      • Class prediction:

        • Select class with higher frequency

      • Class probability estimation:

        • Average frequencies over trees
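
A sketch of this classifier stage with scikit-learn standing in for their learner: ten shallow decision trees, each on a random 10-feature subset, with class probabilities averaged over trees (the uniform random choice of subsets here ignores their performance weighting, and the data are synthetic).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))                   # 200 cases, 40 aggregate features
y = (X[:, 3] + X[:, 17] > 0).astype(int)         # toy binary class

trees, subsets = [], []
for _ in range(10):
    cols = rng.choice(X.shape[1], size=10, replace=False)      # a random 10-feature subset
    trees.append(DecisionTreeClassifier(max_depth=3).fit(X[:, cols], y))
    subsets.append(cols)

# Class probability estimate: average the leaf class frequencies over the 10 trees.
proba = np.mean([t.predict_proba(X[:, c]) for t, c in zip(trees, subsets)], axis=0)
pred = proba.argmax(axis=1)                      # class prediction: the higher-frequency class
print(proba[:3], pred[:3])
```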



Tests

  • IPO data

    • 5 tables w/ small, simple schema

      • Majority of fields were in the “main” table, i.e. T

        • The only numeric fields were in main table, so no aggregation of numeric features needed

      • Other tables had key & one data field

      • Max path length 2 to reach all tables, no recursion

      • Predicate on one field in T used as the class

  • Tested against:

    • First-order logic aggregation

      • Extract clauses using an ILP system

      • Append evaluated clauses to each t

    • Various ILP systems

      • Using just data in T (or T and AH features?)



Test results

  • See paper for numbers

  • Accuracy with aggregate features:

    • Up to 10% increase over only features from T

    • Depends on which and how many extra features used

    • Most predictive feature was in a separate table

    • Expect accuracy increase as more info available

    • Shows info was not destroyed by aggregation

    • Vector distance features better

  • Generalization



Interesting ideas (“I”) & benefits (“B”)

  • Taxonomy

    • I: Division into stages of aggregation

      • Slot in any procedure per stage

      • Estimate complexity per stage

    • B: Might get the discussion going

  • Aggregate features

    • I: Identifying a “main” table

      • Others get aggregated

    • I: Forming partitions to aggregate over

      • Using queries with joins to pull in other tables

      • Abstract partitioning based on predicate

    • I: Comparing case against reference histograms

    • I: Separate comparison method and reference



Interesting ideas (“I”) & benefits (“B”)

  • Learning

    • I: Decision tree tricks

      • Cut DT induction off short to get class freqs

      • Starve DT of features to improve generalization



Issues

  • Some worrying lapses...

    • Lacked standard terms for common concepts

      • “position i [of vector has] the number of instances of [ith value]”... -> histogram

      • “abstraction hierarchy” -> schema

      • “value order” -> enumeration

      • Defined (and emphasized) terms for trivial and commonly used things

    • Imprecise use of terms

      • “variance” for (something like) second moment

      • I’m not confident they know what Mahalanobis distance is

      • They say “left outer join” and show inner join symbol



Issues

  • Some worrying lapses...

    • Did not connect “reference vector” and “variance vector” to underlying statistics

      • Should relate to bag prior and field value conditional probability, not just “weighted”

    • Did not acknowledge loss of correlation info from splitting up joined u tuples in their features

      • Assumes fields are independent

      • Dependency was mentioned in the taxonomy

    • Fig 1 schema cannot support § 2 example query

      • Missing a necessary foreign key reference



Issues

  • Some worrying lapses...

    • Their formal statement of the task did not show aggregation as dependent on t

      • Needed for c(t) partitioning

    • Did not clearly distinguish when t did or did not contain class

      • No need to put it in there at all

    • No, the higher Gaussian moments are not all zero!

      • Only the odd ones are. Yeesh.

      • Correct reason we don’t need them is: all can be computed from mean and variance

    • Uuugly notation



Issues

  • Some worrying lapses...

    • Did not cite other uses of histograms or distributions extracted as features

      • “Spike-triggered average” / covariance / etc.

        • Used by: all neurobiology, neurocomputation

        • E.g.: de Ruyter van Steveninck & Bialek

      • “Response-conditional ensemble”

        • Used by: Our own Adrienne Fairhall & colleagues

        • E.g.: Agüera y Arcas, Fairhall & Bialek

      • “Event-triggered distribution”

        • Used by: me ☺

        • E.g.: CSE528 project



Issues

  • Some worrying lapses...

    • Did not cite other uses of histograms or distributions extracted as features...

    • So, did not use “standard” tricks

      • Dimension reduction:

        • Treat histogram as a vector

        • Do PCA, keep top few eigenmodes, new features are projections

    • Nor “special” tricks:

      • Subtract prior covariance before PCA

    • Likewise, pitting the classes against each other is not new
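
For concreteness, the dimension-reduction trick above might look like this (PCA on the case-vector histograms via scikit-learn; the data are random placeholders):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
case_vectors = rng.poisson(lam=3.0, size=(100, 20)).astype(float)   # 100 histograms over 20 values

pca = PCA(n_components=3)                        # keep only the top few eigenmodes
features = pca.fit_transform(case_vectors)       # the projections become the new, lower-dimensional features
print(features.shape)                            # (100, 3)
```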



Issues

  • Non-goof issues

    • Would need bookkeeping to maintain variance vector for online learning

      • Don’t have sufficient statistics

      • Histograms are actual “samples”

      • Adding new data doesn’t add new “samples”: it changes existing ones

      • Could subtract old contribution, add new one

      • Use a triggered query

    • Don’t bin those nice numerical variables!

      • Binning makes vectors out of scalars

      • Scalar fields can be ganged into a vector across fields!

      • Do (e.g.) clustering on the bag of vectors

  • That’s enough of that

