Aggregate features for relational data

Claudia Perlich, Foster Provost

Overview

- Perlich and Provost provide...
- Hierarchy of aggregation methods
- Survey of existing aggregation methods
- New aggregation methods
- Concerned w/ supervised learning only
- But much seems applicable to clustering

The issues…

- Most classifiers use feature vectors
- Individual features have fixed arity
- No links to other objects
- How do we get feature vectors from relational data?
- Flatten it:
- Joins
- Aggregation
- (Are feature vectors all there are?)

Joins

- Why consider them?
- Yield flat feature vectors
- Preserve all the data
- Why not use them?
- They emphasize data with many references
- Ok if that’s what we want
- Not ok if sampling was skewed
- Cascaded or transitive joins blow up

Joins

- They emphasize data with many references:
- Lots more Joes than there were before...
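A minimal sketch of this effect, assuming two made-up tables (`people` and `orders` are hypothetical, not from the paper). After an inner join, each person appears once per matching order, so heavily referenced rows dominate the flattened data.

```python
from collections import Counter

# Hypothetical target table: one row per person.
people = [
    {"pid": 1, "name": "Joe"},
    {"pid": 2, "name": "Ann"},
]

# Hypothetical related table: many rows may reference one person.
orders = [
    {"oid": 10, "pid": 1, "item": "book"},
    {"oid": 11, "pid": 1, "item": "pen"},
    {"oid": 12, "pid": 1, "item": "lamp"},
    {"oid": 13, "pid": 2, "item": "book"},
]

# Inner join on pid: one output row per (person, order) match.
joined = [
    {**p, **o}
    for p in people
    for o in orders
    if p["pid"] == o["pid"]
]

# Count how often each name occurs before and after the join.
before = Counter(p["name"] for p in people)
after = Counter(row["name"] for row in joined)
print(before)
print(after)
```

Joe goes from one row to three while Ann stays at one, which is the skew the slide warns about.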

Joins

- Why not use them?
- What if we don’t know the references?
- Try out everything with everything else
- Cross product yields all combinations
- Adds fictitious relationships
- Combinatorial blowup
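To make the blowup concrete, here is a small sketch with invented data (`people`, `items`, and `true_links` are assumptions for illustration). It pairs everything with everything and counts how many pairs correspond to no real relationship.

```python
from itertools import product

people = ["Joe", "Ann", "Raj"]
items = ["book", "pen", "lamp", "desk"]

# Hypothetical ground truth that we pretend not to know.
true_links = {("Joe", "book"), ("Ann", "pen")}

# The cross product pairs every person with every item.
all_pairs = list(product(people, items))
fictitious = [pair for pair in all_pairs if pair not in true_links]


def cross_product_size(*table_sizes):
    """Number of tuples in the cross product of tables with these sizes."""
    size = 1
    for n in table_sizes:
        size *= n
    return size


print(len(all_pairs), len(fictitious))
print(cross_product_size(1000, 1000, 1000))
```

Even three tables of a thousand rows each produce a billion combinations, nearly all of them fictitious.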

Joins

- What if we don’t know the references?

Aggregates

- Why use them?
- Yield flat feature vectors
- No blowup in number of tuples
- Can group tuples in all related tables
- Can keep as detailed stats as desired
- Not just max, mean, etc.
- Parametric dists from sufficient stats
- Can apply tests for grouping
- Choice of aggregates can be model-based
- Better generalization
- Include domain knowledge in model choice

Aggregates

- Anything wrong with them?
- Data is lost
- Relational structure is lost
- Influential individuals are lumped in
- Doesn’t discover critical individuals
- Dominates other data
- Any choice of aggregates assumes a model
- What if it’s wrong?
- Adding new data can require calculations
- But can avoid issue by keeping sufficient statistics
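A sketch of the sufficient-statistics idea for a numeric field, assuming we only need mean and variance. The `RunningStats` class is a hypothetical helper, not something from the paper. New data is folded in by updating three numbers, with no pass over the old data.

```python
class RunningStats:
    """Sufficient statistics (n, sum, sum of squares) for mean and variance."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, x):
        """Fold one new value into the statistics."""
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def merge(self, other):
        """Combine with statistics computed on another batch of data."""
        self.n += other.n
        self.total += other.total
        self.total_sq += other.total_sq

    @property
    def mean(self):
        return self.total / self.n

    @property
    def variance(self):
        # Population variance: E[x^2] - (E[x])^2.
        return self.total_sq / self.n - self.mean ** 2


first = RunningStats()
for x in [1.0, 2.0]:
    first.add(x)

second = RunningStats()
for x in [3.0, 4.0]:
    second.add(x)

first.merge(second)
print(first.mean, first.variance)
```

For long streams or large values, a numerically stabler update (for example Welford's method) would be a better choice than raw sums of squares.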

Taxonomy of aggregates

- Why is this useful?
- Promote deliberate use of aggregates
- Point out gaps in current use of aggregates
- Find appropriate techniques for each class
- Based on “complexity” due to:
- Relational structure
- Cardinality of the relations (1:1, 1:n, m:n)
- Feature extraction
- Computing the aggregates
- Class prediction

Taxonomy of aggregates

- Formal statement of the task:
- Notation (here and on following slides):
- Caution! Simplified from what’s in the paper!
- t, tuple (from “target” table T, with main features)
- y, class (known per t if training)
- Ψ, aggregation function
- Φ, classification function
- σ, select operation (where joins preserve t)
- Ω, all tables; B, any other table, b a tuple in B
- u, fields to be added to t from joined tables
- f, a field in u
- More, that doesn’t fit on this slide

Relational “concept” complexity

- Propositional
- No aggregation
- Single tuple, 1-1 or n-1 joins
- n-1 is just a shared object
- Not relational per se – already flat

Relational “concept” complexity

- Independent fields
- Separate aggregation per field
- Separate 1-n joins with T

Relational “concept” complexity

- Dependent fields in same table
- Multi-dimensional aggregation
- Separate 1-n joins with T

Relational “concept” complexity

- Dependent fields over multiple tables
- Multi-type aggregation
- Separate 1-n joins, still only with T

Relational “concept” complexity

- Global
- Any joins or combinations of fields
- Multi-type aggregation
- Multi-way joins
- Joins among tables other than T

Current relational aggregation

- First-order logic
- Find clauses that directly predict the class
- Φ is OR
- Form binary features from tests
- Logical and arithmetic tests
- These go in the feature vector
- Φ is any ordinary classifier

Current relational aggregation

- The usual database aggregates
- For numerical values:
- mean, min, max, count, sum, etc.
- For categorical values:
- Most common value
- Count per value

Current relational aggregation

- Set distance
- Two tuples, each with a set of related tuples
- Distance metric between related fields
- Euclidean for numerical data
- Edit distance for categorical
- Distance between sets is distance of closest pair
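A minimal sketch of the closest-pair set distance for numeric data, assuming each related tuple is a numeric vector. Edit distance for categorical fields is left out, and `set_distance` is a hypothetical name.

```python
import math


def set_distance(set_a, set_b):
    """Distance between two sets of numeric vectors: that of the closest pair."""
    return min(math.dist(a, b) for a in set_a for b in set_b)


related_to_t1 = [(0.0, 0.0), (5.0, 5.0)]
related_to_t2 = [(3.0, 4.0), (10.0, 10.0)]
d = set_distance(related_to_t1, related_to_t2)
print(d)
```

The closest pair here is (5, 5) and (3, 4), so a single near match decides the distance no matter how different the remaining tuples are.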

Proposed relational aggregation

- Recall the point of this work:
- Tuple t from table T is part of a feature vector
- Want to augment w/ info from other tables
- Info added to t must be consistent w/ values in t
- Need to flatten the added info to yield one vector per tuple t
- Use that to:
- Train classifier given class y for t
- Predict class y for t

Proposed relational aggregation

- Outline of steps:
- Do query to get more info u from other tables
- Partition the results based on:
- Main features t
- Class y
- Predicates on t
- Extract distributions over results for fields in u
- Get distribution for each partition
- For now, limit to categorical fields
- Suggest extension to numerical fields
- Derive features from distributions

Do query to get info from other tables

- Select
- Based on the target table T
- If training, known class y is included in T
- Joins must preserve distinct values from T
- Join on as much of T’s key as is present in other table
- Maybe need to constrain other fields?
- Not a problem for correctly normalized tables
- Project
- Include all of t
- Append additional fields u from joined tables
- Anything up to all fields from joins

Extract distributions

- Partition query results various ways, e.g.:
- Into cases per each t
- For training, include the (known) class y in t
- Also (if training) split per each class
- Want this for class priors
- Split per some (unspecified) predicate c(t)
- For each partition:
- There is a bag of associated u tuples
- Ignore the t part – already a flat vector
- Split vertically to get bags of individual values per each field f in u
- Note this breaks association between fields!

Distributions for categorical fields

- Let categorical field be f with values fi
- Form histogram for each partition
- Count instances of each value fi of f in a bag
- These are sufficient statistics for:
- Distribution over fi values
- Probability of each bag in the partition
- Start with one per each tuple t and field f
- Cft, (per-) case vector
- Component Cft[i], count for fi
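A case vector is just a histogram with a fixed value order. A minimal sketch, where `value_order` (the enumeration of possible values of f) is assumed to be known up front:

```python
from collections import Counter


def case_vector(values, value_order):
    """Histogram of one bag of field values, in a fixed order of values fi."""
    counts = Counter(values)
    return [counts[v] for v in value_order]


value_order = ["blue", "green", "red"]
bag = ["red", "blue", "red"]
c_ft = case_vector(bag, value_order)
print(c_ft)
```

Values that never occur in the bag still get a zero entry, so all case vectors for a field have the same length.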

Distributions for categorical fields

- Distribution of histograms per predicate c(t) and field f
- Treat histogram counts as random variables
- Regard c(t) true partition as a collection of histogram “samples”
- Regard histograms as vectors of random variables, one per field value fi
- Extract moments of these histogram count distributions
- mean (sort of) – reference vector
- variance (sort of) – variance vector

Distributions for categorical fields

- Net histogram per predicate c(t), field f
- c(t) partitions tuples t into two groups
- Only histogram the c(t) true group
- Could include ~c as a predicate if we want
- Don’t re-count!
- Already have histograms for each t and f – case reference vectors
- Sum the case reference vectors columnwise
- Call this a “reference vector”, Rfc
- Proportional to average histogram over t for c(t) true (weighted by # samples per t)
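A sketch of the columnwise sum, assuming each case is stored as a dict with its class `y` and its case vector `C` (a hypothetical layout):

```python
# Hypothetical cases: class label y and case vector C for one field f.
cases = [
    {"t": 1, "y": 1, "C": [1, 0, 2]},
    {"t": 2, "y": 1, "C": [0, 1, 1]},
    {"t": 3, "y": 0, "C": [3, 0, 0]},
]


def reference_vector(cases, predicate):
    """Columnwise sum of the case vectors of all cases where predicate holds."""
    selected = [case["C"] for case in cases if predicate(case)]
    return [sum(column) for column in zip(*selected)]


r_all = reference_vector(cases, lambda case: True)
r_pos = reference_vector(cases, lambda case: case["y"] == 1)
r_neg = reference_vector(cases, lambda case: case["y"] == 0)
print(r_all, r_pos, r_neg)
```

No related tuple is re-counted: the reference vectors are built purely from the case vectors already in hand. If no case satisfies the predicate, this sketch returns an empty list.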

Distributions for categorical fields

- Variance of case histograms per predicate c(t) and field f
- Define “variance vector”, Vfc
- Columnwise sum of squares of case reference vectors / number of samples with c(t) true
- Not an actual variance
- Squared means not subtracted
- Don’t care:
- It’s indicative of the variance...
- Throw in means-based features as well to give classifier full variance info
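A sketch of the variance vector next to a true per-column variance. It assumes "number of samples" means the number of cases with c(t) true, which is one reading of the slide; the function names are hypothetical.

```python
def variance_vector(case_vectors):
    """Columnwise sum of squared counts divided by the number of cases."""
    n = len(case_vectors)
    return [sum(c * c for c in column) / n for column in zip(*case_vectors)]


def mean_vector(case_vectors):
    """Columnwise mean count."""
    n = len(case_vectors)
    return [sum(column) / n for column in zip(*case_vectors)]


def true_variance(case_vectors):
    """Actual population variance per column: E[c^2] - (E[c])^2."""
    v = variance_vector(case_vectors)
    m = mean_vector(case_vectors)
    return [vi - mi * mi for vi, mi in zip(v, m)]


positive_cases = [[1, 0, 2], [0, 1, 1]]
print(variance_vector(positive_cases))
print(true_variance(positive_cases))
```

The gap between the two outputs is exactly the squared mean, so a classifier that also sees mean-based features can in principle recover the real variance.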

Distributions for categorical fields

- What predicates might we use?
- Unconditionally true, c(t) = true
- Result is net distribution independent of t
- Unconditional reference vector, R
- Per class k, ck(t) = (t.y == k)
- Class priors
- Recall for training data, y is a field in t
- Per-class reference vector, Rfc with c = ck

Distributions for categorical fields

- Summary of notation
- c(t), a predicate based on values in a tuple t
- f, a categorical field from a join with T
- fi, values of f
- Rfc, reference vector
- histogram over fi values in bag for c(t) true
- Cft, case vector
- histogram over fi values for t’s bag
- R, unconditional reference vector
- Vfc, variance vector
- Columnwise average squared ref. vector
- X[i], i th value in some ref. vector X

Distributions for numerical data

- Same general idea – representative distributions per various partitions
- Can use categorical techniques if we:
- Bin the numerical values
- Treat each bin as a categorical value

Feature extraction

- Base features on ref. and variance vectors
- Two kinds:
- “Interesting” values
- one value from case reference vector per t
- same column in vector for all t
- assorted options for choosing column
- choices depend on predicate ref. vectors
- Vector distances
- distance between case ref. vector and predicate ref. vector
- various distance metrics
- More notation: acronym for each feature type

Feature extraction: “interesting” values

- For a given c, f, select that fi which is...
- MOC: Most common overall
- $\arg\max_i R[i]$
- Most common in each class
- For binary class y
- Positive is y = 1, Negative is y = 0
- MOP: $\arg\max_i R_{f,\,t.y=1}[i]$
- MON: $\arg\max_i R_{f,\,t.y=0}[i]$
- Most distinctive per class
- Common in one class but not in other(s)
- MOD: $\arg\max_i \lvert R_{f,\,t.y=1}[i] - R_{f,\,t.y=0}[i] \rvert$
- MOM: $\arg\max_i \mathrm{MOD} \,/\, V_{f,\,t.y=1}[i] - V_{f,\,t.y=0}[i]$
- Normalizes for variance (sort of)
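A sketch of picking the "interesting" column and reading it off a case vector, with toy reference vectors. MOM is left out because the slide's formula does not pin down its grouping. All names here are hypothetical.

```python
def argmax(vector):
    """Index of the largest entry (first one on ties)."""
    return max(range(len(vector)), key=vector.__getitem__)


def interesting_columns(r_all, r_pos, r_neg):
    """Choose one column per feature type from the reference vectors."""
    diff = [abs(p - n) for p, n in zip(r_pos, r_neg)]
    return {
        "MOC": argmax(r_all),
        "MOP": argmax(r_pos),
        "MON": argmax(r_neg),
        "MOD": argmax(diff),
    }


def interesting_features(case_vector, columns):
    """The feature for a case is its own count in the chosen column."""
    return {name: case_vector[i] for name, i in columns.items()}


r_all, r_pos, r_neg = [4, 1, 3], [1, 1, 3], [3, 0, 0]
columns = interesting_columns(r_all, r_pos, r_neg)
features = interesting_features([1, 0, 2], columns)
print(columns)
print(features)
```

The column is chosen once from the reference vectors and then reused for every case, so each feature type adds a single scalar per t.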

Feature extraction: vector distance

- Distance btw given ref. vector & each case vector
- Distance metrics
- ED: Edit – not defined
- Sum of abs. diffs, a.k.a. Manhattan dist?
- $\sum_i \lvert C[i] - R[i] \rvert$
- EU: Euclidean
- $\sqrt{(C - R)^T (C - R)}$, omit √ for speed
- MA: Mahalanobis
- $\sqrt{(C - R)^T \Sigma^{-1} (C - R)}$, omit √ for speed
- Σ should be covariance...of what?
- CO: Cosine, 1 - cos(angle btw vectors)
- $1 - C^T R / (\lVert C \rVert \, \lVert R \rVert)$
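A sketch of three of these distances on plain Python lists. Mahalanobis is omitted because the choice of Σ is left open above, and returning 1.0 for a zero-length vector in the cosine case is an assumption of this sketch.

```python
import math


def manhattan(c, r):
    """Sum of absolute differences (the slide's reading of 'ED')."""
    return sum(abs(ci - ri) for ci, ri in zip(c, r))


def squared_euclidean(c, r):
    """Euclidean distance with the square root omitted for speed."""
    return sum((ci - ri) ** 2 for ci, ri in zip(c, r))


def cosine_distance(c, r):
    """1 - cos(angle between c and r)."""
    dot = sum(ci * ri for ci, ri in zip(c, r))
    norms = math.sqrt(sum(ci * ci for ci in c)) * math.sqrt(sum(ri * ri for ri in r))
    if norms == 0:
        return 1.0  # assumption: treat an empty histogram as maximally distant
    return 1.0 - dot / norms


case = [1, 0, 2]
ref = [4, 1, 3]
print(manhattan(case, ref), squared_euclidean(case, ref), cosine_distance(case, ref))
```

Cosine distance ignores the overall scale of the two vectors, which matters here because a reference vector sums many case vectors.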

Feature extraction: vector distance

- Apply each metric w/ various ref. vectors
- Acronym is metric w/ suffix for ref. vector
- (No suffix): Unconditional ref. vector
- P: per-class positive ref. vector, Rft.y=1
- N: per-class negative ref. vector, Rft.y=0
- D: difference between P and N distances
- Alphabet soup, e.g. EUP, MAD,...
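The naming scheme can be sketched as a loop over metrics and reference vectors. The `METRICS` table holds just two metrics for brevity, and taking D as the P distance minus the N distance is this sketch's reading of "difference".

```python
def manhattan(c, r):
    return sum(abs(ci - ri) for ci, ri in zip(c, r))


def squared_euclidean(c, r):
    return sum((ci - ri) ** 2 for ci, ri in zip(c, r))


# Metric acronym -> distance function (only two metrics, for brevity).
METRICS = {"ED": manhattan, "EU": squared_euclidean}


def distance_features(case_vector, r_all, r_pos, r_neg):
    """One feature per (metric, reference vector) pair, named by acronym."""
    features = {}
    for name, metric in METRICS.items():
        d_pos = metric(case_vector, r_pos)
        d_neg = metric(case_vector, r_neg)
        features[name] = metric(case_vector, r_all)  # no suffix: unconditional
        features[name + "P"] = d_pos
        features[name + "N"] = d_neg
        features[name + "D"] = d_pos - d_neg
    return features


features = distance_features([1, 0, 2], [4, 1, 3], [1, 1, 3], [3, 0, 0])
print(features)
```

A negative D feature means the case sits closer to the positive reference vector than to the negative one.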

Feature extraction

- Other features added for tests
- Not part of their aggregation proposal
- AH: “abstraction hierarchy” (?)
- Pull into T all fields that are just “shared records” via n:1 references
- AC: “autocorrelation” aggregation
- For joins back into T, get other cases “linked to” each t
- Fraction of positive cases among others
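A sketch of the AC feature, assuming `links` maps each case to the other cases it is linked to and `labels` holds the known binary classes (both hypothetical structures). Returning `None` when a case has no labelled neighbours is a choice made here, not something the slides specify.

```python
def autocorrelation_feature(t_id, links, labels):
    """Fraction of positive cases among the labelled cases linked to t_id."""
    others = [o for o in links.get(t_id, []) if o != t_id and o in labels]
    if not others:
        return None  # assumption: no labelled neighbours, so no value
    return sum(labels[o] for o in others) / len(others)


labels = {1: 1, 2: 0, 3: 1, 4: 1}
links = {1: [2, 3, 4], 2: [1], 5: []}

print(autocorrelation_feature(1, links, labels))
print(autocorrelation_feature(2, links, labels))
print(autocorrelation_feature(5, links, labels))
```

Only labels that are already known can be used, so linked test cases contribute nothing at prediction time.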

Learning

- Find linked tables
- Starting from T, do breadth-first walk of schema graph
- Up to some max depth
- Cap number of paths followed
- For each path, know T is linked to last table in path
- Extract aggregate fields
- Pull in all fields of last table in path
- Aggregate them (using new aggregates) per t
- Append aggregates to t
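The schema walk can be sketched as a breadth-first enumeration of paths. The `schema` adjacency dict is made up, and refusing to revisit a table within a path is a simplification (it rules out the joins back into T used for AC).

```python
from collections import deque


def schema_paths(schema, start, max_depth, max_paths):
    """Breadth-first list of join paths from start, shortest paths first."""
    paths = []
    queue = deque([[start]])
    while queue and len(paths) < max_paths:
        path = queue.popleft()
        if len(path) > 1:
            paths.append(path)
        if len(path) - 1 < max_depth:
            for neighbour in schema.get(path[-1], []):
                if neighbour not in path:  # simplification: no revisits
                    queue.append(path + [neighbour])
    return paths


# Hypothetical schema graph: table -> tables it can be joined with.
schema = {"T": ["A", "B"], "A": ["T", "C"], "B": ["T"], "C": ["A"]}

print(schema_paths(schema, "T", max_depth=2, max_paths=10))
print(schema_paths(schema, "T", max_depth=2, max_paths=2))
```

Because the walk is breadth-first, the cap on paths drops the longest joins first.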

Learning

- Classifier
- Pick 10 subsets each w/ 10 features
- Random choice, weighted by “performance”
- But there’s no classifier yet...so how do features predict class?
- Build a decision tree for each feature set
- Have class frequencies at leaves
- Features might not completely distinguish classes
- Class prediction:
- Select class with higher frequency
- Class probability estimation:
- Average frequencies over trees
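A partial sketch of the ensemble logic only: weighted feature subsets and averaging of leaf frequencies. Tree induction is omitted, the per-tree leaf frequencies are assumed to be given, and the "performance" weights are placeholders since the slides do not say how they are obtained.

```python
import random


def weighted_subset(features, weights, k, rng):
    """Draw k distinct features, favouring those with larger weights."""
    pool = list(zip(features, weights))
    chosen = []
    for _ in range(min(k, len(pool))):
        current_weights = [w for _, w in pool]
        pick = rng.choices(range(len(pool)), weights=current_weights, k=1)[0]
        chosen.append(pool.pop(pick)[0])
    return chosen


def average_frequency(leaf_frequencies):
    """Class probability estimate: mean positive frequency over the trees."""
    return sum(leaf_frequencies) / len(leaf_frequencies)


def predict(leaf_frequencies, threshold=0.5):
    """Predict positive if the averaged frequency exceeds the threshold."""
    return 1 if average_frequency(leaf_frequencies) > threshold else 0


rng = random.Random(0)
names = [f"feat{i}" for i in range(30)]
weights = [1.0 + i for i in range(30)]  # placeholder "performance" scores
subsets = [weighted_subset(names, weights, 10, rng) for _ in range(10)]

# Positive-class frequency at the leaf one case reaches in each of 3 trees.
print(average_frequency([0.9, 0.7, 0.2]), predict([0.9, 0.7, 0.2]))
```

Each subset would feed one decision tree, and a case's final estimate averages the leaf frequencies it reaches across the trees.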

Tests

- IPO data
- 5 tables w/ small, simple schema
- Majority of fields were in the “main” table, i.e. T
- The only numeric fields were in main table, so no aggregation of numeric features needed
- Other tables had key & one data field
- Max path length 2 to reach all tables, no recursion
- Predicate on one field in T used as the class
- Tested against:
- First-order logic aggregation
- Extract clauses using an ILP system
- Append evaluated clauses to each t
- Various ILP systems
- Using just data in T (or T and AH features?)

Test results

- See paper for numbers
- Accuracy with aggregate features:
- Up to 10% increase over only features from T
- Depends on which and how many extra features used
- Most predictive feature was in a separate table
- Expect accuracy increase as more info available
- Shows info was not destroyed by aggregation
- Vector distance features better
- Generalization

Interesting ideas (“I”) & benefits (“B”)

- Taxonomy
- I: Division into stages of aggregation
- Slot in any procedure per stage
- Estimate complexity per stage
- B: Might get the discussion going
- Aggregate features
- I: Identifying a “main” table
- Others get aggregated
- I: Forming partitions to aggregate over
- Using queries with joins to pull in other tables
- Abstract partitioning based on predicate
- I: Comparing case against reference histograms
- I: Separate comparison method and reference

Interesting ideas (“I”) & benefits (“B”)

- Learning
- I: Decision tree tricks
- Cut DT induction off short to get class freqs
- Starve DT of features to improve generalization

Issues

- Some worrying lapses...
- Lacked standard terms for common concepts
- “position i [of vector has] the number of instances of [ith value]”... -> histogram
- “abstraction hierarchy” -> schema
- “value order” -> enumeration
- Defined (and emphasized) terms for trivial and commonly used things
- Imprecise use of terms
- “variance” for (something like) second moment
- I’m not confident they know what Mahalanobis distance is
- They say “left outer join” and show inner join symbol

Issues

- Some worrying lapses...
- Did not connect “reference vector” and “variance vector” to underlying statistics
- Should relate to bag prior and field value conditional probability, not just “weighted”
- Did not acknowledge loss of correlation info from splitting up joined u tuples in their features
- Assumes fields are independent
- Dependency was mentioned in the taxonomy
- Fig 1 schema cannot support § 2 example query
- Missing a necessary foreign key reference

Issues

- Some worrying lapses...
- Their formal statement of the task did not show aggregation as dependent on t
- Needed for c(t) partitioning
- Did not clearly distinguish when t did or did not contain class
- No need to put it in there at all
- No, the higher Gaussian moments are not all zero!
- Only the odd central moments are. Yeesh.
- Correct reason we don’t need them is: all can be computed from mean and variance
- Uuugly notation
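As background for the point about Gaussian moments (a standard result, stated here for reference rather than taken from the paper), the central moments of $X \sim \mathcal{N}(\mu, \sigma^2)$ are:

$$
\mathbb{E}\big[(X-\mu)^{2k+1}\big] = 0, \qquad
\mathbb{E}\big[(X-\mu)^{2k}\big] = (2k-1)!!\,\sigma^{2k}, \quad k = 1, 2, \dots
$$

So the fourth central moment is $3\sigma^4$: non-zero, but fully determined by the variance.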

Issues

- Some worrying lapses...
- Did not cite other uses of histograms or distributions extracted as features
- “Spike-triggered average” / covariance / etc.
- Used by: all neurobiology, neurocomputation
- E.g.: de Ruyter van Steveninck & Bialek
- “Response-conditional ensemble”
- Used by: Our own Adrienne Fairhall & colleagues
- E.g.: Agüera y Arcas, Fairhall, Bialek
- “Event-triggered distribution”
- Used by: me ☺
- E.g.: CSE528 project

Issues

- Some worrying lapses...
- Did not cite other uses of histograms or distributions extracted as features...
- So, did not use “standard” tricks
- Dimension reduction:
- Treat histogram as a vector
- Do PCA, keep top few eigenmodes, new features are projections
- Nor “special” tricks:
- Subtract prior covariance before PCA
- Likewise competing the classes is not new

Issues

- Non-goof issues
- Would need bookkeeping to maintain variance vector for online learning
- Don’t have sufficient statistics
- Histograms are actual “samples”
- Adding new data doesn’t add new “samples”: it changes existing ones
- Could subtract old contribution, add new one
- Use a triggered query
- Don’t bin those nice numerical variables!
- Binning makes vectors out of scalars
- Scalar fields can be ganged into a vector across fields!
- Do (e.g.) clustering on the bag of vectors
- That’s enough of that
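The bookkeeping for online updates could look like this sketch: keep running sums per partition, and when a case's histogram changes, subtract its old contribution before adding the new one. `PartitionStats` is a hypothetical helper, and it stores every current case vector, which is the extra state the slide alludes to.

```python
class PartitionStats:
    """Running sums behind the reference and variance vectors of a partition."""

    def __init__(self, size):
        self.n_cases = 0
        self.sum_counts = [0] * size   # reference vector R
        self.sum_squares = [0] * size  # numerator of variance vector V
        self.cases = {}                # t_id -> current case vector

    def update_case(self, t_id, new_vector):
        """Replace the case vector of t_id, adjusting the sums in place."""
        old = self.cases.get(t_id)
        if old is None:
            self.n_cases += 1
        else:
            # Subtract the old contribution of this case.
            for i, c in enumerate(old):
                self.sum_counts[i] -= c
                self.sum_squares[i] -= c * c
        # Add the new contribution.
        for i, c in enumerate(new_vector):
            self.sum_counts[i] += c
            self.sum_squares[i] += c * c
        self.cases[t_id] = list(new_vector)

    def variance_vector(self):
        return [s / self.n_cases for s in self.sum_squares]


stats = PartitionStats(size=2)
stats.update_case(1, [1, 0])
stats.update_case(2, [0, 2])
stats.update_case(1, [2, 0])  # new data arrived for case 1
print(stats.sum_counts, stats.variance_vector())
```

In a database setting the same update could run from a trigger on the related tables, as the slide suggests.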
