Aggregate features for relational data Claudia Perlich, Foster Provost

Pat Tressel

16-May-2005

- Perlich and Provost provide...
- Hierarchy of aggregation methods
- Survey of existing aggregation methods
- New aggregation methods

- Concerned w/ supervised learning only
- But much seems applicable to clustering

- Most classifiers use feature vectors
- Individual features have fixed arity
- No links to other objects

- How do we get feature vectors from relational data?
- Flatten it:
- Joins
- Aggregation

- (Are feature vectors all there are?)

- Why consider them?
- Yield flat feature vectors
- Preserve all the data

- Why not use them?
- They emphasize data with many references
- Ok if that’s what we want
- Not ok if sampling was skewed
- Cascaded or transitive joins blow up

- They emphasize data with many references:
- Lots more Joes than there were before...
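To make the "lots more Joes" effect concrete, here is a minimal sketch in plain Python. The `customers` and `orders` tables and their fields are hypothetical, invented only for illustration; they are not from the paper.

```python
# Hypothetical toy tables: customers play the role of the target table T.
customers = [
    {"cust_id": 1, "name": "Joe", "age": 40},
    {"cust_id": 2, "name": "Ann", "age": 35},
]
orders = [
    {"cust_id": 1, "item": "book"},
    {"cust_id": 1, "item": "pen"},
    {"cust_id": 1, "item": "lamp"},
    {"cust_id": 2, "item": "book"},
]

# Inner join on cust_id: one output row per matching (customer, order) pair.
joined = [
    {**c, **o}
    for c in customers
    for o in orders
    if c["cust_id"] == o["cust_id"]
]

# Count how often each customer appears in the flattened data.
rows_per_name = {}
for row in joined:
    rows_per_name[row["name"]] = rows_per_name.get(row["name"], 0) + 1

print(rows_per_name)  # {'Joe': 3, 'Ann': 1}
```

After the join, Joe contributes three training rows and Ann only one, so a learner trained on the joined rows weights Joe three times as heavily.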

- Why not use them?
- What if we don’t know the references?
- Try out everything with everything else
- Cross product yields all combinations
- Adds fictitious relationships
- Combinatorial blowup

- Why use them?
- Yield flat feature vectors
- No blowup in number of tuples
- Can group tuples in all related tables

- Can keep as detailed stats as desired
- Not just max, mean, etc.
- Parametric dists from sufficient stats
- Can apply tests for grouping

- Choice of aggregates can be model-based
- Better generalization
- Include domain knowledge in model choice

- Anything wrong with them?
- Data is lost
- Relational structure is lost
- Influential individuals are lumped in
- Doesn’t discover critical individuals
- Dominates other data

- Any choice of aggregates assumes a model
- What if it’s wrong?

- Adding new data can require calculations
- But can avoid issue by keeping sufficient statistics
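The point about sufficient statistics can be sketched as follows. This assumes the aggregate of interest is a mean and a variance over one numeric field; the class name `RunningStats` is hypothetical. The variance formula used here is the textbook one and can lose precision on large values, so treat this as an illustration only.

```python
class RunningStats:
    """Sufficient statistics (n, sum, sum of squares) for mean and variance."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, x):
        # Adding a new value only updates three numbers; no rescan of old data.
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def mean(self):
        return self.total / self.n

    def variance(self):
        # Population variance: E[x^2] - (E[x])^2
        m = self.mean()
        return self.total_sq / self.n - m * m


stats = RunningStats()
for amount in [10.0, 20.0, 30.0]:
    stats.add(amount)

# New data arrives later: update the statistics in place.
stats.add(40.0)
print(stats.mean(), stats.variance())  # 25.0 125.0
```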

- Why is this useful?
- Promote deliberate use of aggregates
- Point out gaps in current use of aggregates
- Find appropriate techniques for each class

- Based on “complexity” due to:
- Relational structure
- Cardinality of the relations (1:1, 1:n, m:n)

- Feature extraction
- Computing the aggregates

- Class prediction

- Formal statement of the task:
- Notation (here and on following slides):
- Caution! Simplified from what’s in the paper!
- t, tuple (from “target” table T, with main features)
- y, class (known per t if training)
- Ψ, aggregation function
- Φ, classification function
- σ, select operation (where joins preserve t)
- Ω, all tables; B, any other table, b a tuple in B
- u, fields to be added to t from joined tables
- f, a field in u
- More, that doesn’t fit on this slide

- Simple
- One field from one object type

- Denoted by:

- Multi-dimensional
- Multiple fields, one object type

- Denoted by:

- Multi-type
- Multiple object types

- Denoted by:

- Propositional
- No aggregation
- Single tuple, 1-1 or n-1 joins
- n-1 is just a shared object

- Not relational per se – already flat

- Independent fields
- Separate aggregation per field
- Separate 1-n joins with T

- Dependent fields in same table
- Multi-dimensional aggregation
- Separate 1-n joins with T

- Dependent fields over multiple tables
- Multi-type aggregation
- Separate 1-n joins, still only with T

- Global
- Any joins or combinations of fields
- Multi-type aggregation
- Multi-way joins
- Joins among tables other than T

- First-order logic
- Find clauses that directly predict the class
- Φ is OR

- Form binary features from tests
- Logical and arithmetic tests
- These go in the feature vector
- Φ is any ordinary classifier

- The usual database aggregates
- For numerical values:
- mean, min, max, count, sum, etc.

- For categorical values:
- Most common value
- Count per value
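As a small illustration of these standard aggregates, the sketch below computes them over one bag of related values using only the Python standard library. The field values (`amounts`, `colors`) are made up for the example.

```python
from collections import Counter
from statistics import mean

# One bag of related tuples for a single target tuple t (hypothetical values).
amounts = [5.0, 7.5, 2.5]          # numerical field
colors = ["red", "blue", "red"]    # categorical field

# Usual numerical aggregates.
numeric_aggregates = {
    "count": len(amounts),
    "sum": sum(amounts),
    "mean": mean(amounts),
    "min": min(amounts),
    "max": max(amounts),
}

# Usual categorical aggregates: count per value and most common value.
value_counts = Counter(colors)
most_common_value = value_counts.most_common(1)[0][0]

print(numeric_aggregates)
print(dict(value_counts), most_common_value)
```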

- Set distance
- Two tuples, each with a set of related tuples
- Distance metric between related fields
- Euclidean for numerical data
- Edit distance for categorical

- Distance between sets is distance of closest pair
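A minimal sketch of the closest-pair set distance, assuming numerical fields and Euclidean distance between individual related tuples. The function names and the two bags are hypothetical.

```python
import math
from itertools import product


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def set_distance(bag_a, bag_b, dist=euclidean):
    """Distance between two bags = distance of their closest pair."""
    return min(dist(a, b) for a, b in product(bag_a, bag_b))


# Related tuples for two different target tuples (made-up data).
bag_1 = [(0.0, 0.0), (5.0, 5.0)]
bag_2 = [(3.0, 4.0), (9.0, 9.0)]

d = set_distance(bag_1, bag_2)
print(d)  # closest pair is (5, 5) and (3, 4)
```

Note that this compares every pair, so the cost grows with the product of the two bag sizes.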

- Recall the point of this work:
- Tuple t from table T is part of a feature vector
- Want to augment w/ info from other tables
- Info added to t must be consistent w/ values in t
- Need to flatten the added info to yield one vector per tuple t
- Use that to:
- Train classifier given class y for t
- Predict class y for t

- Outline of steps:
- Do query to get more info u from other tables
- Partition the results based on:
- Main features t
- Class y
- Predicates on t

- Extract distributions over results for fields in u
- Get distribution for each partition
- For now, limit to categorical fields
- Suggest extension to numerical fields

- Derive features from distributions

- Select
- Based on the target table T
- If training, known class y is included in T
- Joins must preserve distinct values from T
- Join on as much of T’s key as is present in other table
- Maybe need to constrain other fields?
- Not a problem for correctly normalized tables

- Project
- Include all of t
- Append additional fields u from joined tables
- Anything up to all fields from joins

- Partition query results various ways, e.g.:
- Into cases per each t
- For training, include the (known) class y in t

- Also (if training) split per each class
- Want this for class priors

- Split per some (unspecified) predicate c(t)

- For each partition:
- There is a bag of associated u tuples
- Ignore the t part – already a flat vector

- Split vertically to get bags of individual values per each field f in u
- Note this breaks association between fields!

- Let categorical field be f with values fi
- Form histogram for each partition
- Count instances of each value fi of f in a bag
- These are sufficient statistics for:
- Distribution over fi values
- Probability of each bag in the partition

- Start with one per each tuple t and field f
- Cft, (per-) case vector
- Component Cft[i], count for fi
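The case vectors can be sketched as below: group the joined rows by target tuple, then count each value of the categorical field. The row data, the `values` enumeration, and the helper name `case_vector` are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Fixed enumeration of the values f_i of categorical field f (hypothetical).
values = ["book", "pen", "lamp"]

# Query result: (id of target tuple t, value of field f) pairs.
joined_rows = [
    (1, "book"), (1, "pen"), (1, "book"),
    (2, "lamp"),
    (3, "book"), (3, "book"),
]

# Partition per t: each t gets a bag of f values.
bags = defaultdict(list)
for t_id, value in joined_rows:
    bags[t_id].append(value)


def case_vector(bag):
    """Histogram of the bag, laid out in the fixed order of `values`."""
    counts = Counter(bag)
    return [counts[v] for v in values]


case_vectors = {t_id: case_vector(bag) for t_id, bag in bags.items()}
print(case_vectors)  # {1: [2, 1, 0], 2: [0, 0, 1], 3: [2, 0, 0]}
```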

- Distribution of histograms per predicate c(t) and field f
- Treat histogram counts as random variables
- Regard c(t) true partition as a collection of histogram “samples”
- Regard histograms as vectors of random variables, one per field value fi

- Extract moments of these histogram count distributions
- mean (sort of) – reference vector
- variance (sort of) – variance vector

- Net histogram per predicate c(t), field f
- c(t) partitions tuples t into two groups
- Only histogram the c(t) true group
- Could include ~c as a predicate if we want

- Don’t re-count!
- Already have histograms for each t and f – the case vectors
- Sum the case vectors columnwise

- Call this a “reference vector”, Rfc
- Proportional to average histogram over t for c(t) true (weighted by # samples per t)
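A sketch of the reference vector as a columnwise sum of case vectors, restricted to the tuples where the predicate holds. The case vectors, labels, and the `reference_vector` helper are illustrative assumptions; here the predicate takes the id of t rather than the tuple itself.

```python
# Case vectors per target tuple id, plus known class labels (made-up data).
case_vectors = {1: [2, 1, 0], 2: [0, 0, 1], 3: [2, 0, 0]}
labels = {1: 1, 2: 0, 3: 1}


def reference_vector(case_vectors, predicate):
    """Columnwise sum of the case vectors whose tuple satisfies predicate."""
    selected = [vec for t_id, vec in case_vectors.items() if predicate(t_id)]
    if not selected:
        return []
    return [sum(column) for column in zip(*selected)]


# Unconditional reference vector: c(t) = true.
r_all = reference_vector(case_vectors, lambda t_id: True)
# Per-class reference vectors: c(t) = (t.y == k).
r_pos = reference_vector(case_vectors, lambda t_id: labels[t_id] == 1)
r_neg = reference_vector(case_vectors, lambda t_id: labels[t_id] == 0)

print(r_all, r_pos, r_neg)  # [4, 1, 1] [4, 1, 0] [0, 0, 1]
```

No raw rows are re-counted: only the already-built histograms are summed.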

- Variance of case histograms per predicate c(t) and field f
- Define “variance vector”, Vfc
- Columnwise sum of squares of case vectors / number of samples with c(t) true
- Not an actual variance
- Squared means not subtracted

- Don’t care:
- It’s indicative of the variance...
- Throw in means-based features as well to give classifier full variance info
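The sketch below contrasts the "variance vector" as described on this slide with an actual per-column variance. It assumes "number of samples" means the number of case vectors with c(t) true; that reading, the data, and the function names are assumptions.

```python
def variance_vector(selected):
    """Columnwise mean of squared counts (a second moment, not a variance)."""
    n = len(selected)
    return [sum(x * x for x in column) / n for column in zip(*selected)]


def true_variance(selected):
    """Columnwise population variance, for comparison."""
    n = len(selected)
    means = [sum(column) / n for column in zip(*selected)]
    second = variance_vector(selected)
    return [s - m * m for s, m in zip(second, means)]


# Case vectors of the tuples with c(t) true (made-up data).
positives = [[2, 1, 0], [2, 0, 0]]

v_pos = variance_vector(positives)
var_pos = true_variance(positives)
print(v_pos)    # [4.0, 0.5, 0.0]
print(var_pos)  # [0.0, 0.25, 0.0]
```

The first column shows the difference clearly: every case has count 2, so the true variance is 0 while the "variance vector" entry is 4.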

- What predicates might we use?
- Unconditionally true, c(t) = true
- Result is net distribution independent of t
- Unconditional reference vector, R

- Per class k, ck(t) = (t.y == k)
- Class priors
- Recall for training data, y is a field in t
- Per class reference vector, $R_{f,t.y=k}$

- Summary of notation
- c(t), a predicate based on values in a tuple t
- f, a categorical field from a join with T
- fi, values of f
- Rfc, reference vector
- histogram over fi values in bag for c(t) true

- Cft, case vector
- histogram over fi values for t’s bag

- R, unconditional reference vector
- Vfc, variance vector
- Columnwise average of squared case vectors

- X[i], i th value in some ref. vector X

- Same general idea – representative distributions per various partitions
- Can use categorical techniques if we:
- Bin the numerical values
- Treat each bin as a categorical value
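One way to bin a numerical field so the categorical machinery applies is sketched here. The bin edges and labels are arbitrary choices for illustration; the slides do not say how bins should be chosen.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical bin edges: low is x < 10, mid is 10 <= x < 100, high is x >= 100.
bin_edges = [10.0, 100.0]
bin_labels = ["low", "mid", "high"]


def to_bin(x):
    """Map a numeric value to the label of the bin that contains it."""
    return bin_labels[bisect_right(bin_edges, x)]


amounts = [3.0, 10.0, 55.0, 250.0, 99.9]
binned = [to_bin(x) for x in amounts]

# Each bin label is now treated as a categorical value and histogrammed.
histogram = Counter(binned)
print(binned)           # ['low', 'mid', 'mid', 'high', 'mid']
print(dict(histogram))  # {'low': 1, 'mid': 3, 'high': 1}
```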

- Base features on ref. and variance vectors
- Two kinds:
- “Interesting” values
- one value from case reference vector per t
- same column in vector for all t
- assorted options for choosing column
- choices depend on predicate ref. vectors

- Vector distances
- distance between case ref. vector and predicate ref. vector
- various distance metrics

- More notation: acronym for each feature type

- For a given c, f, select that fi which is...
- MOC: Most common overall
- $\arg\max_i R[i]$

- Most common in each class
- For binary class y
- Positive is y = 1, Negative is y = 0

- MOP: $\arg\max_i R_{f,t.y=1}[i]$
- MON: $\arg\max_i R_{f,t.y=0}[i]$

- Most distinctive per class
- Common in one class but not in other(s)
- MOD: $\arg\max_i \left| R_{f,t.y=1}[i] - R_{f,t.y=0}[i] \right|$
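- MOM: $\arg\max_i \left| R_{f,t.y=1}[i] - R_{f,t.y=0}[i] \right| / \left( V_{f,t.y=1}[i] - V_{f,t.y=0}[i] \right)$, i.e. the MOD quantity divided by the difference of the variance vectors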
- Normalizes for variance (sort of)
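The column choices MOC, MOP, MON and MOD can be sketched as below. This works on raw count vectors, which is an assumption (the paper may normalize them first), and MOM is left out because its exact form is unclear from the slide. All data values are made up.

```python
values = ["book", "pen", "lamp"]   # enumeration of the field values f_i

# Reference vectors (hypothetical counts).
r_all = [5, 3, 4]   # unconditional
r_pos = [4, 1, 0]   # class y = 1
r_neg = [1, 2, 4]   # class y = 0


def argmax(xs):
    """Index of the largest element (first one on ties)."""
    return max(range(len(xs)), key=lambda i: xs[i])


moc = argmax(r_all)                                        # most common overall
mop = argmax(r_pos)                                        # most common in positives
mon = argmax(r_neg)                                        # most common in negatives
mod = argmax([abs(p - n) for p, n in zip(r_pos, r_neg)])   # most distinctive

# The feature for a given t is its own case-vector count in the chosen column.
case = [2, 1, 0]
features = {"MOC": case[moc], "MOP": case[mop], "MON": case[mon], "MOD": case[mod]}
print(values[moc], values[mop], values[mon], values[mod])  # book book lamp lamp
print(features)  # {'MOC': 2, 'MOP': 2, 'MON': 0, 'MOD': 0}
```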

- Distance btw given ref. vector & each case vector
- Distance metrics
- ED: Edit – not defined
- Sum of abs. diffs, a.k.a. Manhattan dist?
- $\sum_i |C[i] - R[i]|$

- EU: Euclidean
- $\sqrt{(C - R)^T (C - R)}$, omit √ for speed

- MA: Mahalanobis
- $\sqrt{(C - R)^T \Sigma^{-1} (C - R)}$, omit √ for speed
- Σ should be covariance...of what?

- CO: Cosine, 1 - cos(angle btw vectors)
- $1 - C^T R / (\|C\| \, \|R\|)$
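A minimal sketch of three of these distances between a case vector C and a reference vector R, using the standard textbook definitions. Mahalanobis is omitted because the slides leave open which covariance Σ is meant. The cosine version assumes neither vector is all zeros.

```python
import math


def manhattan(c, r):
    """Sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(c, r))


def squared_euclidean(c, r):
    """Euclidean distance with the square root omitted for speed."""
    return sum((a - b) ** 2 for a, b in zip(c, r))


def cosine_distance(c, r):
    """1 - cos(angle between c and r); assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(c, r))
    norm_c = math.sqrt(sum(a * a for a in c))
    norm_r = math.sqrt(sum(b * b for b in r))
    return 1.0 - dot / (norm_c * norm_r)


case = [2, 1, 0]   # case vector for one t (made-up)
ref = [4, 1, 1]    # some reference vector (made-up)

print(manhattan(case, ref))          # 3
print(squared_euclidean(case, ref))  # 5
print(cosine_distance(case, ref))
```

Cosine distance ignores vector length, which matters here because a reference vector is a sum over many cases and is therefore much longer than any single case vector.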

- Apply each metric w/ various ref. vectors
- Acronym is metric w/ suffix for ref. vector
- (No suffix): Unconditional ref. vector
- P: per-class positive ref. vector, $R_{f,t.y=1}$
- N: per-class negative ref. vector, $R_{f,t.y=0}$
- D: difference between P and N distances

- Alphabet soup, e.g. EUP, MAD,...
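To unpack the naming scheme, the sketch below builds the EU, EUP, EUN and EUD features for one case vector. It uses squared Euclidean distance, and takes D as the P distance minus the N distance; the sign convention and all data values are assumptions.

```python
def squared_euclidean(c, r):
    """Euclidean distance with the square root omitted."""
    return sum((a - b) ** 2 for a, b in zip(c, r))


# Hypothetical reference vectors and one case vector.
r_all = [4, 1, 1]   # unconditional
r_pos = [4, 1, 0]   # class y = 1
r_neg = [0, 0, 1]   # class y = 0
case = [2, 1, 0]

features = {
    "EU": squared_euclidean(case, r_all),    # no suffix: unconditional
    "EUP": squared_euclidean(case, r_pos),   # P: positive-class reference
    "EUN": squared_euclidean(case, r_neg),   # N: negative-class reference
}
# D: difference between the P and N distances (sign convention assumed).
features["EUD"] = features["EUP"] - features["EUN"]

print(features)  # {'EU': 5, 'EUP': 4, 'EUN': 6, 'EUD': -2}
```

A negative EUD here means the case sits closer to the positive reference vector than to the negative one.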

- Other features added for tests
- Not part of their aggregation proposal
- AH: “abstraction hierarchy” (?)
- Pull into T all fields that are just “shared records” via n:1 references

- AC: “autocorrelation” aggregation
- For joins back into T, get other cases “linked to” each t
- Fraction of positive cases among others
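The AC feature can be sketched as the fraction of positive labels among the cases linked to t. The `links` structure and the choice to return `None` when t has no linked cases are assumptions for illustration.

```python
# Known class labels for training cases and links between cases (made-up).
labels = {1: 1, 2: 0, 3: 1, 4: 1}
links = {1: [2, 3], 2: [1], 3: [1, 4], 4: []}


def ac_feature(t_id):
    """Fraction of positive cases among the other cases linked to t_id."""
    others = [o for o in links[t_id] if o != t_id]
    if not others:
        return None  # no linked cases: feature is undefined (assumed handling)
    return sum(labels[o] for o in others) / len(others)


ac = {t_id: ac_feature(t_id) for t_id in labels}
print(ac)  # {1: 0.5, 2: 1.0, 3: 1.0, 4: None}
```

The labels of the linked cases must be known, so at prediction time this only works for links into already-labeled data.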

- Find linked tables
- Starting from T, do breadth-first walk of schema graph
- Up to some max depth
- Cap number of paths followed

- For each path, know T is linked to last table in path

- Extract aggregate fields
- Pull in all fields of last table in path
- Aggregate them (using new aggregates) per t
- Append aggregates to t
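The table-discovery step can be sketched as a bounded breadth-first search over the schema graph. The `schema` dictionary is hypothetical, and refusing to revisit a table already on the path is a simplifying assumption; the paper's handling of cycles may differ.

```python
from collections import deque

# Hypothetical schema graph: table -> tables reachable via a key reference.
schema = {
    "T": ["A", "B"],
    "A": ["C"],
    "B": [],
    "C": ["D"],
    "D": [],
}


def find_paths(schema, start, max_depth, max_paths):
    """Breadth-first walk from `start`, returning join paths in BFS order."""
    paths = []
    queue = deque([[start]])
    while queue and len(paths) < max_paths:
        path = queue.popleft()
        if len(path) > 1:
            paths.append(path)          # start table is linked to path[-1]
        if len(path) - 1 < max_depth:   # number of joins so far
            for nxt in schema[path[-1]]:
                if nxt not in path:     # assumed: do not revisit tables
                    queue.append(path + [nxt])
    return paths


paths = find_paths(schema, "T", max_depth=2, max_paths=10)
print(paths)  # [['T', 'A'], ['T', 'B'], ['T', 'A', 'C']]
```

For each returned path, the fields of the last table would then be aggregated per t and appended to t.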

- Classifier
- Pick 10 subsets each w/ 10 features
- Random choice, weighted by “performance”
- But there’s no classifier yet...so how do features predict class?

- Build a decision tree for each feature set
- Have class frequencies at leaves
- Features might not completely distinguish classes

- Class prediction:
- Select class with higher frequency

- Class probability estimation:
- Average frequencies over trees
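The final combination step can be sketched as follows, assuming each tree has already routed the case to a leaf and we only have that leaf's class counts. Tree induction and feature-subset sampling are not shown; the counts are invented.

```python
# Class counts at the leaf each tree assigns to one test case (made-up).
tree_leaf_counts = [
    {"pos": 8, "neg": 2},
    {"pos": 6, "neg": 4},
    {"pos": 1, "neg": 4},
]


def frequencies(counts):
    """Turn leaf class counts into class frequencies."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


per_tree = [frequencies(counts) for counts in tree_leaf_counts]

# Class probability estimate: average the per-tree frequencies.
classes = ["pos", "neg"]
averaged = {
    label: sum(freq[label] for freq in per_tree) / len(per_tree)
    for label in classes
}

# Class prediction: pick the class with the higher averaged frequency.
predicted = max(classes, key=lambda label: averaged[label])
print(averaged, predicted)
```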

- IPO data
- 5 tables w/ small, simple schema
- Majority of fields were in the “main” table, i.e. T
- The only numeric fields were in main table, so no aggregation of numeric features needed

- Other tables had key & one data field
- Max path length 2 to reach all tables, no recursion
- Predicate on one field in T used as the class

- Tested against:
- First-order logic aggregation
- Extract clauses using an ILP system
- Append evaluated clauses to each t

- Various ILP systems
- Using just data in T (or T and AH features?)

- See paper for numbers
- Accuracy with aggregate features:
- Up to 10% increase over only features from T
- Depends on which and how many extra features used
- Most predictive feature was in a separate table
- Expect accuracy increase as more info available
- Shows info was not destroyed by aggregation
- Vector distance features better

- Generalization

- Taxonomy
- I: Division into stages of aggregation
- Slot in any procedure per stage
- Estimate complexity per stage

- B: Might get the discussion going

- Aggregate features
- I: Identifying a “main” table
- Others get aggregated

- I: Forming partitions to aggregate over
- Using queries with joins to pull in other tables
- Abstract partitioning based on predicate

- I: Comparing case against reference histograms
- I: Separate comparison method and reference

- Learning
- I: Decision tree tricks
- Cut DT induction off short to get class freqs
- Starve DT of features to improve generalization

- Some worrying lapses...
- Lacked standard terms for common concepts
- “position i [of vector has] the number of instances of [ith value]”... -> histogram
- “abstraction hierarchy” -> schema
- “value order” -> enumeration
- Defined (and emphasized) terms for trivial and commonly used things

- Imprecise use of terms
- “variance” for (something like) second moment
- I’m not confident they know what Mahalanobis distance is
- They say “left outer join” and show inner join symbol

- Some worrying lapses...
- Did not connect “reference vector” and “variance vector” to underlying statistics
- Should relate to bag prior and field value conditional probability, not just “weighted”

- Did not acknowledge loss of correlation info from splitting up joined u tuples in their features
- Assumes fields are independent
- Dependency was mentioned in the taxonomy

- Fig 1 schema cannot support § 2 example query
- Missing a necessary foreign key reference

- Some worrying lapses...
- Their formal statement of the task did not show aggregation as dependent on t
- Needed for c(t) partitioning

- Did not clearly distinguish when t did or did not contain class
- No need to put it in there at all

- No, the higher Gaussian moments are not all zero!
- Only the odd central moments are. Yeesh.
- Correct reason we don’t need them is: all can be computed from mean and variance
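For reference, the standard closed form for the central moments of a Gaussian makes both points explicit: the odd ones vanish, and the even ones are functions of the variance alone.

```latex
% Central moments of X ~ N(\mu, \sigma^2)
\mathbb{E}\left[(X - \mu)^n\right] =
\begin{cases}
  0 & n \text{ odd} \\
  \sigma^{n} \, (n - 1)!! & n \text{ even}
\end{cases}
\qquad \text{e.g. } \mathbb{E}\left[(X-\mu)^4\right] = 3\sigma^4, \quad
\mathbb{E}\left[(X-\mu)^6\right] = 15\sigma^6 .
```

Here $(n-1)!!$ is the double factorial, $(n-1)(n-3)\cdots 3 \cdot 1$.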

- Uuugly notation

- Some worrying lapses...
- Did not cite other uses of histograms or distributions extracted as features
- “Spike-triggered average” / covariance / etc.
- Used by: all neurobiology, neurocomputation
- E.g.: de Ruyter van Steveninck & Bialek

- “Response-conditional ensemble”
- Used by: Our own Adrienne Fairhall & colleagues
- E.g.: Agüera y Arcas, Fairhall & Bialek

- “Event-triggered distribution”
- Used by: me ☺
- E.g.: CSE528 project

- Some worrying lapses...
- Did not cite other uses of histograms or distributions extracted as features...
- So, did not use “standard” tricks
- Dimension reduction:
- Treat histogram as a vector
- Do PCA, keep top few eigenmodes, new features are projections

- Nor “special” tricks:
- Subtract prior covariance before PCA

- Likewise competing the classes is not new

- Non-goof issues
- Would need bookkeeping to maintain variance vector for online learning
- Don’t have sufficient statistics
- Histograms are actual “samples”
- Adding new data doesn’t add new “samples”: it changes existing ones
- Could subtract old contribution, add new one
- Use a triggered query
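The "subtract old contribution, add new one" bookkeeping might look like the sketch below. It keeps a running columnwise sum and sum of squares over case vectors; the function name and the data are hypothetical, and how the update would be wired to a database trigger is not shown.

```python
def update_case(ref_sum, sq_sum, old_case, new_case):
    """Replace one case vector's contribution in the running totals, in place."""
    for i, (old, new) in enumerate(zip(old_case, new_case)):
        ref_sum[i] += new - old               # reference vector (sum of counts)
        sq_sum[i] += new * new - old * old    # numerator of the variance vector


# Running totals for case vectors [2, 1, 0], [0, 0, 1], [2, 0, 0] (made-up).
ref_sum = [4, 1, 1]
sq_sum = [8, 1, 1]

# A new related tuple arrives for the first case: its histogram changes.
update_case(ref_sum, sq_sum, old_case=[2, 1, 0], new_case=[2, 1, 1])

print(ref_sum)  # [4, 1, 2]
print(sq_sum)   # [8, 1, 2]
```

The catch is that the old case vector must be kept around (or recomputed) for every t, which is the extra bookkeeping the slide refers to.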

- Don’t bin those nice numerical variables!
- Binning makes vectors out of scalars
- Scalar fields can be ganged into a vector across fields!
- Do (e.g.) clustering on the bag of vectors

- That’s enough of that