
Aggregate features for relational data, by Claudia Perlich and Foster Provost

Pat Tressel

16-May-2005

- Perlich and Provost provide...
- Hierarchy of aggregation methods
- Survey of existing aggregation methods
- New aggregation methods

- Concerned w/ supervised learning only
- But much seems applicable to clustering

- Most classifiers use feature vectors
- Individual features have fixed arity
- No links to other objects

- How do we get feature vectors from relational data?
- Flatten it:
- Joins
- Aggregation

- Flatten it:
- (Are feature vectors all there are?)

- Why consider them?
- Yield flat feature vectors
- Preserve all the data

- Why not use them?
- They emphasize data with many references
- Ok if that’s what we want
- Not ok if sampling was skewed
- Cascaded or transitive joins blow up


- They emphasize data with many references:
- Lots more Joes than there were before...
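The "lots more Joes" effect can be seen in a minimal sketch (pure Python, with hypothetical `people` and `transactions` tables):

```python
# Minimal sketch of why 1:n joins skew the data: each "Joe" row in the
# target table is repeated once per matching transaction, so entities
# with many references dominate the joined (flattened) result.
people = [
    {"name": "Joe", "age": 30},
    {"name": "Ann", "age": 40},
]
transactions = [
    {"name": "Joe", "amount": 10},
    {"name": "Joe", "amount": 20},
    {"name": "Joe", "amount": 30},
    {"name": "Ann", "amount": 5},
]

# Inner join on "name": one output row per matching pair.
joined = [
    {**p, **t}
    for p in people
    for t in transactions
    if p["name"] == t["name"]
]

joe_rows = [r for r in joined if r["name"] == "Joe"]
# Joe now appears 3 times, Ann once: any learner trained on `joined`
# weights Joe's attributes three times as heavily.
```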

- Why not use them?
- What if we don’t know the references?
- Try out everything with everything else
- Cross product yields all combinations
- Adds fictitious relationships
- Combinatorial blowup


- Why use them?
- Yield flat feature vectors
- No blowup in number of tuples
- Can group tuples in all related tables

- Can keep as detailed stats as desired
- Not just max, mean, etc.
- Parametric dists from sufficient stats
- Can apply tests for grouping

- Choice of aggregates can be model-based
- Better generalization
- Include domain knowledge in model choice

- Anything wrong with them?
- Data is lost
- Relational structure is lost
- Influential individuals are lumped in
- Doesn’t discover critical individuals
- Dominates other data

- Any choice of aggregates assumes a model
- What if it’s wrong?

- Adding new data can require calculations
- But can avoid issue by keeping sufficient statistics

- Why is this useful?
- Promote deliberate use of aggregates
- Point out gaps in current use of aggregates
- Find appropriate techniques for each class

- Based on “complexity” due to:
- Relational structure
- Cardinality of the relations (1:1, 1:n, m:n)

- Feature extraction
- Computing the aggregates

- Class prediction



- Formal statement of the task:
- Notation (here and on following slides):
- Caution! Simplified from what’s in the paper!
- t, tuple (from “target” table T, with main features)
- y, class (known per t if training)
- Ψ, aggregation function
- Φ, classification function
- σ, select operation (where joins preserve t)
- Ω, all tables; B, any other table, b a tuple in B
- u, fields to be added to t from joined tables
- f, a field in u
- More, that doesn’t fit on this slide

- Simple
- One field from one object type


- Multi-dimensional
- Multiple fields, one object type


- Multi-type
- Multiple object types


- Propositional
- No aggregation
- Single tuple, 1-1 or n-1 joins
- n-1 is just a shared object

- Not relational per se – already flat

- Independent fields
- Separate aggregation per field
- Separate 1-n joins with T

- Dependent fields in same table
- Multi-dimensional aggregation
- Separate 1-n joins with T

- Dependent fields over multiple tables
- Multi-type aggregation
- Separate 1-n joins, still only with T

- Global
- Any joins or combinations of fields
- Multi-type aggregation
- Multi-way joins
- Joins among tables other than T


- First-order logic
- Find clauses that directly predict the class
- Φ is OR

- Form binary features from tests
- Logical and arithmetic tests
- These go in the feature vector
- Φ is any ordinary classifier


- The usual database aggregates
- For numerical values:
- mean, min, max, count, sum, etc.

- For categorical values:
- Most common value
- Count per value
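These standard database aggregates can be sketched in a few lines (stdlib only; the bag of related tuples is hypothetical):

```python
from collections import Counter
from statistics import mean

# Hypothetical bag of tuples related to one target tuple t,
# with one numeric field and one categorical field.
bag = [
    {"amount": 10.0, "type": "wire"},
    {"amount": 20.0, "type": "cash"},
    {"amount": 30.0, "type": "wire"},
]

# Numeric aggregates: mean, min, max, count, sum.
amounts = [r["amount"] for r in bag]
numeric_aggs = {
    "mean": mean(amounts),
    "min": min(amounts),
    "max": max(amounts),
    "count": len(amounts),
    "sum": sum(amounts),
}

# Categorical aggregates: count per value, most common value.
type_counts = Counter(r["type"] for r in bag)
most_common_type = type_counts.most_common(1)[0][0]
```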


- Set distance
- Two tuples, each with a set of related tuples
- Distance metric between related fields
- Euclidean for numerical data
- Edit distance for categorical

- Distance between sets is distance of closest pair
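A minimal sketch of this closest-pair (single-linkage style) set distance, assuming numeric tuples represented as coordinate pairs:

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def set_distance(set_a, set_b):
    """Distance between two sets of tuples = distance of the closest pair."""
    return min(dist(a, b) for a in set_a for b in set_b)

# Hypothetical bags of related (numeric) tuples for two cases.
bag1 = [(0.0, 0.0), (5.0, 5.0)]
bag2 = [(3.0, 4.0), (10.0, 10.0)]

# Closest pair is (5, 5) vs (3, 4): sqrt(2^2 + 1^2) = sqrt(5)
d = set_distance(bag1, bag2)
```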

- Recall the point of this work:
- Tuple t from table T is part of a feature vector
- Want to augment w/ info from other tables
- Info added to t must be consistent w/ values in t
- Need to flatten the added info to yield one vector per tuple t
- Use that to:
- Train classifier given class y for t
- Predict class y for t

- Outline of steps:
- Do query to get more info u from other tables
- Partition the results based on:
- Main features t
- Class y
- Predicates on t

- Extract distributions over results for fields in u
- Get distribution for each partition
- For now, limit to categorical fields
- Suggest extension to numerical fields

- Derive features from distributions

- Select
- Based on the target table T
- If training, known class y is included in T
- Joins must preserve distinct values from T
- Join on as much of T’s key as is present in other table
- Maybe need to constrain other fields?
- Not a problem for correctly normalized tables

- Project
- Include all of t
- Append additional fields u from joined tables
- Anything up to all fields from joins

- Partition query results various ways, e.g.:
- Into cases per each t
- For training, include the (known) class y in t

- Also (if training) split per each class
- Want this for class priors

- Split per some (unspecified) predicate c(t)

- For each partition:
- There is a bag of associated u tuples
- Ignore the t part – already a flat vector

- Split vertically to get bags of individual values per each field f in u
- Note this breaks association between fields!


- Let categorical field be f with values fi
- Form histogram for each partition
- Count instances of each value fi of f in a bag
- These are sufficient statistics for:
- Distribution over fi values
- Probability of each bag in the partition

- Start with one per each tuple t and field f
- Cft, (per-) case vector
- Component Cft[i], count for fi
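The case vectors Cft above are just per-tuple histograms; a minimal sketch (field values and bags are hypothetical):

```python
from collections import Counter

# Build per-case histograms ("case vectors" Cft) over a categorical
# field f, from each target tuple's bag of joined values.
values_of_f = ["red", "green", "blue"]   # fixed value order for f

bags = {                                  # bag of f-values per tuple t
    "t1": ["red", "red", "blue"],
    "t2": ["green", "blue", "blue", "blue"],
}

def case_vector(bag):
    counts = Counter(bag)
    return [counts[v] for v in values_of_f]  # component i = count of f_i

C = {t: case_vector(bag) for t, bag in bags.items()}
# C["t1"] -> [2, 0, 1]; C["t2"] -> [0, 1, 3]
```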

- Distribution of histograms per predicate c(t) and field f
- Treat histogram counts as random variables
- Regard c(t) true partition as a collection of histogram “samples”
- Regard histograms as vectors of random variables, one per field value fi

- Extract moments of these histogram count distributions
- mean (sort of) – reference vector
- variance (sort of) – variance vector


- Net histogram per predicate c(t), field f
- c(t) partitions tuples t into two groups
- Only histogram the c(t) true group
- Could include ~c as a predicate if we want

- Don’t re-count!
- Already have histograms for each t and f – case reference vectors
- Sum the case reference vectors columnwise

- Call this a “reference vector”, Rfc
- Proportional to average histogram over t for c(t) true (weighted by # samples per t)


- Variance of case histograms per predicate c(t) and field f
- Define “variance vector”, Vfc
- Columnwise sum of squares of case reference vectors / number of samples with c(t) true
- Not an actual variance
- Squared means not subtracted

- Don’t care:
- It’s indicative of the variance...
- Throw in means-based features as well to give classifier full variance info
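Both the reference vector Rfc and the "variance vector" Vfc reduce to columnwise sums over the case vectors; a minimal sketch with hypothetical case histograms:

```python
# Reference vector Rfc: columnwise sum of case vectors for tuples where
# c(t) is true. "Variance vector" Vfc: columnwise sum of squares divided
# by the number of such tuples (not a true variance, as noted above).
case_vectors = {
    "t1": [2, 0, 1],
    "t2": [0, 1, 3],
    "t3": [1, 1, 1],
}
c_true = {"t1", "t3"}          # tuples where the predicate c(t) holds

selected = [v for t, v in case_vectors.items() if t in c_true]
n = len(selected)

R = [sum(col) for col in zip(*selected)]                      # Rfc
V = [sum(x * x for x in col) / n for col in zip(*selected)]   # Vfc
# R -> [3, 1, 2]; V -> [(4+1)/2, (0+1)/2, (1+1)/2] = [2.5, 0.5, 1.0]
```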


- What predicates might we use?
- Unconditionally true, c(t) = true
- Result is net distribution independent of t
- Unconditional reference vector, R

- Per class k, ck(t) = (t.y == k)
- Class priors
- Recall for training data, y is a field in t
- Per class reference vector,


- Summary of notation
- c(t), a predicate based on values in a tuple t
- f, a categorical field from a join with T
- fi, values of f
- Rfc, reference vector
- histogram over fi values in bag for c(t) true

- Cft, case vector
- histogram over fi values for t’s bag

- R, unconditional reference vector
- Vfc, variance vector
- Columnwise average squared ref. vector

- X[i], i th value in some ref. vector X

- Same general idea – representative distributions per various partitions
- Can use categorical techniques if we:
- Bin the numerical values
- Treat each bin as a categorical value
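A minimal fixed-width binning sketch (bin edges are hypothetical) that turns a numeric field into categorical values the histogram machinery can handle:

```python
# Map a numeric value to a bin index so each bin acts as a categorical
# value. Range and bin count are illustrative assumptions.
def to_bin(x, lo=0.0, hi=100.0, n_bins=4):
    """Return a bin index in [0, n_bins-1], clamping out-of-range values."""
    if x <= lo:
        return 0
    if x >= hi:
        return n_bins - 1
    width = (hi - lo) / n_bins
    return int((x - lo) // width)

amounts = [5.0, 30.0, 55.0, 99.0, 250.0]
binned = [to_bin(a) for a in amounts]   # now categorical values 0..3
```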

- Base features on ref. and variance vectors
- Two kinds:
- “Interesting” values
- one value from case reference vector per t
- same column in vector for all t
- assorted options for choosing column
- choices depend on predicate ref. vectors

- Vector distances
- distance between case ref. vector and predicate ref. vector
- various distance metrics

- More notation: acronym for each feature type

- For a given c, f, select that fi which is...
- MOC: Most common overall
- argmax_i R[i]

- Most common in each class
- For binary class y
- Positive is y = 1, Negative is y = 0

- MOP: argmax_i Rf,y=1[i]
- MON: argmax_i Rf,y=0[i]

- Most distinctive per class
- Common in one class but not in other(s)
- MOD: argmax_i | Rf,y=1[i] - Rf,y=0[i] |
- MOM: argmax_i | Rf,y=1[i] - Rf,y=0[i] | / ( Vf,y=1[i] - Vf,y=0[i] )
- Normalizes for variance (sort of)
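The column choices above are simple argmax computations; a minimal sketch with hypothetical per-class reference vectors over 3 values of a field f:

```python
# "Interesting value" column selection over per-class reference vectors.
R_pos = [5, 1, 2]       # Rf,y=1 (positive class)
R_neg = [1, 4, 2]       # Rf,y=0 (negative class)
R_all = [p + n for p, n in zip(R_pos, R_neg)]   # unconditional R

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

MOC = argmax(R_all)                                        # most common overall
MOP = argmax(R_pos)                                        # most common, positives
MON = argmax(R_neg)                                        # most common, negatives
MOD = argmax([abs(p - n) for p, n in zip(R_pos, R_neg)])   # most distinctive
```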


- Distance btw given ref. vector & each case vector
- Distance metrics
- ED: Edit – not defined
- Sum of abs. diffs, a.k.a. Manhattan dist?
- Σ_i | C[i] - R[i] |

- EU: Euclidean
- √( (C - R)ᵀ (C - R) ), omit √ for speed

- MA: Mahalanobis
- √( (C - R)ᵀ Σ⁻¹ (C - R) ), omit √ for speed
- Σ should be covariance... of what?

- CO: Cosine, 1 - cos(angle btw vectors)
- 1 - CᵀR / √( (CᵀC)(RᵀR) )
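The Manhattan, Euclidean, and cosine distances between a case vector C and a reference vector R can be sketched directly (Mahalanobis is omitted since it needs a covariance estimate):

```python
from math import sqrt

def manhattan(C, R):
    """ED-style: sum of absolute componentwise differences."""
    return sum(abs(c - r) for c, r in zip(C, R))

def euclidean(C, R):
    """EU: sqrt((C-R)^T (C-R))."""
    return sqrt(sum((c - r) ** 2 for c, r in zip(C, R)))

def cosine_dist(C, R):
    """CO: 1 - cos(angle between C and R)."""
    dot = sum(c * r for c, r in zip(C, R))
    norm = sqrt(sum(c * c for c in C)) * sqrt(sum(r * r for r in R))
    return 1.0 - dot / norm

# Hypothetical case and reference vectors over 3 field values.
C = [2, 0, 1]
R = [1, 1, 1]
```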


- Apply each metric w/ various ref. vectors
- Acronym is metric w/ suffix for ref. vector
- (No suffix): Unconditional ref. vector
- P: per-class positive ref. vector, Rf,y=1
- N: per-class negative ref. vector, Rf,y=0
- D: difference between the P and N distances

- Alphabet soup, e.g. EUP, MAD,...

- Other features added for tests
- Not part of their aggregation proposal
- AH: “abstraction hierarchy” (?)
- Pull into T all fields that are just “shared records” via n:1 references

- AC: “autocorrelation” aggregation
- For joins back into T, get other cases “linked to” each t
- Fraction of positive cases among others

- Find linked tables
- Starting from T, do breadth-first walk of schema graph
- Up to some max depth
- Cap number of paths followed

- For each path, know T is linked to last table in path
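The breadth-first schema walk can be sketched as follows (the schema graph is hypothetical; the real system also caps the number of paths followed):

```python
from collections import deque

# Adjacency list for a hypothetical schema graph: table -> linked tables.
schema = {
    "T": ["A", "B"],
    "A": ["C"],
    "B": ["C"],
    "C": [],
}

def join_paths(start, max_depth):
    """Breadth-first walk from the target table, collecting join paths
    up to max_depth hops. T is linked to the last table in each path."""
    paths, queue = [], deque([[start]])
    while queue:
        path = queue.popleft()
        if len(path) - 1 >= max_depth:
            continue
        for nxt in schema[path[-1]]:
            new_path = path + [nxt]
            paths.append(new_path)
            queue.append(new_path)
    return paths

paths = join_paths("T", 2)
# [['T','A'], ['T','B'], ['T','A','C'], ['T','B','C']]
```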

- Extract aggregate fields
- Pull in all fields of last table in path
- Aggregate them (using new aggregates) per t
- Append aggregates to t

- Classifier
- Pick 10 subsets each w/ 10 features
- Random choice, weighted by “performance”
- But there’s no classifier yet...so how do features predict class?

- Build a decision tree for each feature set
- Have class frequencies at leaves
- Features might not completely distinguish classes

- Class prediction:
- Select class with higher frequency

- Class probability estimation:
- Average frequencies over trees
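The prediction and probability-estimation steps above can be sketched without building the trees themselves; the leaf frequencies each tree assigns to a case are hypothetical numbers:

```python
# Each tree routes the case to a leaf; we record that leaf's class
# frequencies. Values here are illustrative, not from the paper.
leaf_freqs = [
    {"pos": 0.8, "neg": 0.2},
    {"pos": 0.6, "neg": 0.4},
    {"pos": 0.3, "neg": 0.7},
]

# Class probability estimate: average frequencies over trees.
p_pos = sum(f["pos"] for f in leaf_freqs) / len(leaf_freqs)

# Class prediction: class with the higher averaged frequency.
prediction = "pos" if p_pos >= 0.5 else "neg"
```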




- IPO data
- 5 tables w/ small, simple schema
- Majority of fields were in the “main” table, i.e. T
- The only numeric fields were in main table, so no aggregation of numeric features needed

- Other tables had key & one data field
- Max path length 2 to reach all tables, no recursion
- Predicate on one field in T used as the class


- Tested against:
- First-order logic aggregation
- Extract clauses using an ILP system
- Append evaluated clauses to each t

- Various ILP systems
- Using just data in T (or T and AH features?)


- See paper for numbers
- Accuracy with aggregate features:
- Up to 10% increase over only features from T
- Depends on which and how many extra features used
- Most predictive feature was in a separate table
- Expect accuracy increase as more info available
- Shows info was not destroyed by aggregation
- Vector distance features performed better

- Generalization

- Taxonomy
- I: Division into stages of aggregation
- Slot in any procedure per stage
- Estimate complexity per stage

- B: Might get the discussion going

- Aggregate features
- I: Identifying a “main” table
- Others get aggregated

- I: Forming partitions to aggregate over
- Using queries with joins to pull in other tables
- Abstract partitioning based on predicate

- I: Comparing case against reference histograms
- I: Separate comparison method and reference


- Learning
- I: Decision tree tricks
- Cut DT induction off short to get class freqs
- Starve DT of features to improve generalization


- Some worrying lapses...
- Lacked standard terms for common concepts
- “position i [of vector has] the number of instances of [ith value]”... -> histogram
- “abstraction hierarchy” -> schema
- “value order” -> enumeration
- Defined (and emphasized) terms for trivial and commonly used things

- Imprecise use of terms
- “variance” for (something like) second moment
- I’m not confident they know what Mahalanobis distance is
- They say “left outer join” and show inner join symbol


- Some worrying lapses...
- Did not connect “reference vector” and “variance vector” to underlying statistics
- Should relate to bag prior and field value conditional probability, not just “weighted”

- Did not acknowledge loss of correlation info from splitting up joined u tuples in their features
- Assumes fields are independent
- Dependency was mentioned in the taxonomy

- Fig 1 schema cannot support § 2 example query
- Missing a necessary foreign key reference


- Some worrying lapses...
- Their formal statement of the task did not show aggregation as dependent on t
- Needed for c(t) partitioning

- Did not clearly distinguish when t did or did not contain class
- No need to put it in there at all

- No, the higher Gaussian moments are not all zero!
- Only the odd ones are. Yeesh.
- Correct reason we don’t need them is: all can be computed from mean and variance

- Uuugly notation


- Some worrying lapses...
- Did not cite other uses of histograms or distributions extracted as features
- “Spike-triggered average” / covariance / etc.
- Used by: all neurobiology, neurocomputation
- E.g.: de Ruyter van Steveninck & Bialek

- “Response-conditional ensemble”
- Used by: Our own Adrienne Fairhall & colleagues
- E.g.: Aguera & Arcas, Fairhall, Bialek

- “Event-triggered distribution”
- Used by: me ☺
- E.g.: CSE528 project



- Some worrying lapses...
- Did not cite other uses of histograms or distributions extracted as features...
- So, did not use “standard” tricks
- Dimension reduction:
- Treat histogram as a vector
- Do PCA, keep top few eigenmodes, new features are projections

- Nor “special” tricks:
- Subtract prior covariance before PCA
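The PCA trick can be sketched with numpy (the 3-bin histograms are hypothetical; the "subtract prior covariance" refinement is omitted):

```python
import numpy as np

# Treat each histogram as a vector, run PCA, and keep projections onto
# the top eigenmodes as the new features.
H = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 1.0, 3.0],
    [1.0, 1.0, 1.0],
    [3.0, 0.0, 0.0],
])

Hc = H - H.mean(axis=0)            # center the histogram vectors
cov = np.cov(Hc, rowvar=False)     # 3x3 sample covariance
vals, vecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
top2 = vecs[:, -2:]                # keep the top two eigenmodes
features = Hc @ top2               # projections: new 2-d features per case
```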

- Likewise, contrasting the classes against each other is not new

- Non-goof issues
- Would need bookkeeping to maintain variance vector for online learning
- Don’t have sufficient statistics
- Histograms are actual “samples”
- Adding new data doesn’t add new “samples”: it changes existing ones
- Could subtract old contribution, add new one
- Use a triggered query
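The subtract-old, add-new bookkeeping can be sketched directly (case histograms are hypothetical; R is the columnwise sum and S the columnwise sum of squares):

```python
# Keep per-tuple case histograms so the reference and variance vectors
# can be maintained online instead of recomputed from scratch.
case = {"t1": [2, 0, 1], "t2": [0, 1, 3]}
R = [sum(col) for col in zip(*case.values())]                 # [2, 1, 4]
S = [sum(x * x for x in col) for col in zip(*case.values())]  # [4, 1, 10]

def update_case(t, new_vec):
    """Replace tuple t's histogram, adjusting R and S incrementally:
    subtract the old contribution, add the new one."""
    old = case[t]
    for i, (o, n) in enumerate(zip(old, new_vec)):
        R[i] += n - o
        S[i] += n * n - o * o
    case[t] = list(new_vec)

# One more observation of value 0 arrives for tuple t1.
update_case("t1", [3, 0, 1])
```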

- Don’t bin those nice numerical variables!
- Binning makes vectors out of scalars
- Scalar fields can be ganged into a vector across fields!
- Do (e.g.) clustering on the bag of vectors

- That’s enough of that