Using synthetic data safely in classification



USING SYNTHETIC DATA SAFELY IN CLASSIFICATION

Jean Nonnemaker

10 January 2009

Motivation

  • Trainable classifier technologies require large representative training sets

  • Acquiring such training sets is difficult and costly

    • Time and labor to gather existing training data

    • Time and labor to label new training data with ground truth

  • Training sets may not be representative, or may be imbalanced

One Solution

Amplify the training data: that is, increase it artificially by generating more samples.

We will call such generated data synthetic, in contrast to real data collected in the field.

Sample Space

  • The set of all samples that exist, e.g. images of the letter ‘e’, just as they are found in nature (here, for example, as found in a real book).

Feature Space

  • Features are measurable characteristics of the sample, e.g. the width and height of an image

[Figure: feature space with axes width and height]

Data can be thought of as points in a multidimensional vector space.
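As a rough illustration of mapping a sample to a point in feature space, the sketch below measures the width and height of the inked region of a small binary bitmap. The function name `bounding_box_features` and the toy glyph are hypothetical and are not taken from the presentation.

```python
def bounding_box_features(bitmap):
    """Return (width, height) of the inked region of a binary bitmap.

    bitmap is a list of rows; each row is a list of 0 (background) or 1 (ink).
    """
    # Indices of rows and columns that contain at least one inked pixel.
    rows = [r for r, row in enumerate(bitmap) if any(row)]
    cols = [c for row in bitmap for c, px in enumerate(row) if px]
    if not rows:
        return (0, 0)  # blank image: nothing to measure
    width = max(cols) - min(cols) + 1
    height = max(rows) - min(rows) + 1
    return (width, height)


# Made-up glyph, loosely shaped like a 'c'.
glyph = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

# One sample becomes one point in a two-dimensional feature space.
feature_vector = bounding_box_features(glyph)
```

A real system would use many more features; two are used here only because they match the axes named on the slide.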

Parameter Space

[Figure labels: typesetting parameters, noise parameters]

  • Parameters may be used to generate the data.

Ways of Using Sample, Parameter and Feature Spaces

We can create synthetic data in

  • Parameter space – e.g. change the generating parameters and generate new samples

  • Sample space – e.g. add noise to the sample

  • Feature space – e.g. adjust feature values
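The three options can be sketched with a toy pipeline in which parameters are rendered into a sample and the sample is reduced to features. Everything in the sketch is a hypothetical stand-in: `render` draws a plain rectangle rather than a glyph, and the noise and jitter models are assumptions chosen for brevity, not the methods used in the experiments.

```python
import random


def render(params):
    """Hypothetical generator: draw a filled rectangle of the given size."""
    return [[1] * params["width"] for _ in range(params["height"])]


def extract_features(sample):
    """Features of a bitmap: (width, height)."""
    return (len(sample[0]), len(sample))


def vary_parameters(params, rng):
    """Parameter space: change a generating parameter, then render again."""
    new_params = dict(params)
    new_params["width"] = params["width"] + rng.choice([-1, 1])
    return render(new_params)


def add_pixel_noise(sample, flip_prob, rng):
    """Sample space: flip each pixel independently with probability flip_prob."""
    return [
        [1 - px if rng.random() < flip_prob else px for px in row]
        for row in sample
    ]


def jitter_features(features, scale, rng):
    """Feature space: perturb the measured feature values directly."""
    return tuple(f + rng.uniform(-scale, scale) for f in features)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
params = {"width": 6, "height": 8}
real = render(params)

from_params = vary_parameters(params, rng)
from_sample = add_pixel_noise(real, 0.05, rng)
from_features = jitter_features(extract_features(real), 0.5, rng)
```

The rest of the presentation concentrates on the first option, generating new samples by moving in parameter space.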

Supporting Technology - Knuth

  • Knuth's Metafont system, the font-design companion to TeX, generates typefaces from parameters.

    • 62 parameters are sufficient to define a typeface

    • Examples: width, height, darkness and slant.

Synthesizing Typefaces

The letters ‘e’ and ‘c’ were generated using Knuth's Metafont in:

  • CMR (Computer Modern Roman)

  • CMFF (Computer Modern Funny Font)

  • Nine interpolations between CMR and CMFF

Interpolation is by convex combinations in the 62-dimensional parameter space.
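A convex combination of two parameter vectors p and q is (1 - t) * p + t * q with 0 <= t <= 1. The sketch below produces nine interior points between two vectors. Even spacing of t is an assumption (the slides do not say how the nine interpolations were spaced), and the two 62-element vectors are made-up placeholders, not real Metafont parameter values.

```python
def interpolate(p, q, t):
    """Convex combination (1 - t) * p + t * q of two parameter vectors."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1] for a convex combination")
    if len(p) != len(q):
        raise ValueError("parameter vectors must have the same length")
    return [(1.0 - t) * a + t * b for a, b in zip(p, q)]


def interpolated_typefaces(p, q, count=9):
    """Return `count` interior points between p and q (even spacing assumed)."""
    return [interpolate(p, q, k / (count + 1)) for k in range(1, count + 1)]


# Placeholder vectors standing in for the 62 parameters of two pure typefaces.
cmr_params = [0.0] * 62
cmff_params = [10.0] * 62

steps = interpolated_typefaces(cmr_params, cmff_params)
```

Each element of `steps` would then be handed to the typeface generator to render an interpolated typeface.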

Pure Typefaces

CMR and CMFF are well-known standard typefaces that are widely used. We refer to CMR and CMFF as pure typefaces.

These are real samples of pure fonts that can be collected.

[Figure labels: pure, interpolated]

Interpolated Typefaces

Synthesized typefaces are created by interpolating between the parameters that define the pure typefaces. They may never have been used, but they are legible and should be recognized.

These are interpolated samples, and so must be synthesized.

[Figure label: interpolated]

Description of Experiment

  • Train two classifiers

    • First on pure data only

    • Second on mixture of pure and interpolated data (synthetically amplified)

  • Ask

    • Is this safe: Does the amplified classifier continue to work well on pure test data?

    • Is it better: Does the amplified classifier work better on interpolated data?

Details of Experiment

  • A = pure data (CMR and CMFF fonts)

  • B = interpolated data (interpolated fonts); as a training set, B is pure data amplified with interpolated data

Each cell is named train/test:

               Test on A   Test on B
  Train on A   A/A         A/B
  Train on B   B/A         B/B

Hypothesis 2: Error rates on AA and BA are the same. We hope not to reject the null hypothesis.

Hypothesis 1: The error rate on AB is at least as good as on BB. We hope to reject the null hypothesis.

Two Hypotheses

Hypothesis 2.

  • AA is trained and tested on pure data. BA is trained on mixed pure and interpolated data and tested on pure data.

  • Our null hypothesis is that BA and AA perform equally.

  • If the experiment does not reject the null hypothesis, then synthetic data is safe.

Hypothesis 1.

  • AB is trained on pure data and tested on interpolated data. BB is trained on pure and interpolated data and tested on interpolated data.

  • Our null hypothesis is that AB performs at least as well as BB.

  • If the experiment rejects the null hypothesis, then synthetic data is better.

Details of Experiment

  • A kNN classifier was trained on 800 samples each of the letters ‘e’ and ‘c’ in CMR, and 800 samples each of ‘e’ and ‘c’ in CMFF

  • A second kNN classifier was trained using 800 samples of the letter ‘e’ and 800 of the letter ‘c’ created by interpolating between CMR and CMFF

  • Each classifier was tested on the same 400 samples of CMR and CMFF ‘e’s and ‘c’s

  • Each classifier was tested on 400 samples of ‘e’s and ‘c’s generated by interpolating between CMR and CMFF

Note that we tested on the frequently confused letter pairs e/c and i/j.
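The train/test design can be sketched as follows. This is a toy under stated assumptions, not the experimental code: the two-dimensional feature points are made up, k = 1 is assumed because the slides do not give k, and training set B is taken to be the pure data plus the interpolated data, following the description of the experiment.

```python
from collections import Counter


def knn_predict(train, x, k=1):
    """Label x by majority vote among its k nearest training points."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], x)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def error_count(train, test, k=1):
    """Number of test points whose predicted label differs from the truth."""
    return sum(knn_predict(train, x, k) != y for x, y in test)


def run_design(train_sets, test_sets, k=1):
    """Error counts for every pairing; 'AB' means train on A, test on B."""
    return {
        tr + te: error_count(train_sets[tr], test_sets[te], k)
        for tr in train_sets
        for te in test_sets
    }


# Made-up feature points: two pure variants per letter, one interpolated.
pure_train = [((0, 0), "e"), ((10, 10), "e"), ((4, 6), "c"), ((14, 16), "c")]
interp_train = [((5, 5), "e"), ((9, 11), "c")]
pure_test = [((0.5, 0.2), "e"), ((9.8, 10.1), "e"),
             ((4.2, 5.8), "c"), ((14.1, 15.8), "c")]
interp_test = [((5.2, 4.8), "e"), ((9.2, 10.8), "c")]

train_sets = {"A": pure_train, "B": pure_train + interp_train}
test_sets = {"A": pure_test, "B": interp_test}

results = run_design(train_sets, test_sets)
```

Comparing `results["AA"]` with `results["BA"]` addresses the safety question, and comparing `results["AB"]` with `results["BB"]` addresses the improvement question.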

CMR/CMFF/CMSSI Experiments (Computer Modern Roman/Computer Modern Funny Font/Computer Modern Sans Serif Italics)

Test Samples CMR/CMFF/CMSSI

Error Counts CMR/CMFF/CMSSI

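Since χ² = 17.20 exceeds 3.84 (the critical value at the 5% significance level with one degree of freedom), we can reject the null hypothesis and conclude that the amplified classifier is better on interpolated data with confidence ≥ 95%.

Since χ² = 2.19 is below 3.84, we cannot reject the null hypothesis at the same level, so the experiment gives no evidence that training on interpolated data harms accuracy on pure data; in the sense defined above, interpolated data is safe.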
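The comparison against 3.84 can be sketched with a Pearson chi-square statistic on a 2x2 table of errors versus correct classifications for two classifiers. The slides do not state which chi-square variant was used, so the uncorrected Pearson form here is an assumption, and the error counts in the example are hypothetical, not the experimental results.

```python
import math

CRITICAL_5_PERCENT_1_DOF = 3.841  # chi-square critical value, 1 degree of freedom


def chi_square_2x2(errors_1, errors_2, n_1, n_2):
    """Pearson chi-square statistic (no continuity correction) for two error counts."""
    table = [[errors_1, n_1 - errors_1], [errors_2, n_2 - errors_2]]
    row_totals = [n_1, n_2]
    col_totals = [errors_1 + errors_2, (n_1 - errors_1) + (n_2 - errors_2)]
    total = n_1 + n_2
    if 0 in col_totals:
        return 0.0  # both classifiers all right or all wrong: no difference to test
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2


def p_value_1_dof(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical counts: 40 vs 15 errors, each on 400 test samples.
chi2 = chi_square_2x2(40, 15, 400, 400)
reject_null = chi2 > CRITICAL_5_PERCENT_1_DOF
```

Because both classifiers are scored on the same test samples, a paired test such as McNemar's would be another reasonable choice; the unpaired form is shown only because it is the simplest one that yields a value comparable to 3.84.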

Summary of Many Experiments

Summary of Many Experiments

I and J

Conclusions

  • Systematic family of experiments:

    • A wide range of image qualities

  • Experiments show that amplifying training sets with synthetic data generated by interpolation in parameter space:

    • Never worsened accuracy on pure data

    • Often improved accuracy on interpolated data.

Conclusions

  • Improvement seems to be greatest when the pure fonts are most dissimilar

  • Improvement is greater when fonts are more blurred but with little variance

  • 3-way interpolation showed the most significant results

  • These results hold

    • When image quality is normal

    • When image quality is poor

Typeface Interpolation

This seems to be the first time that typeface generation has been used together with image quality generation to produce synthetic training data.

Legibility seems to be convex in typeface and image quality parameter space; that is, any font interpolated between two legible fonts is still legible.

Directions of Future Research

  • Can we devise a method for training on synthetic data that is guaranteed never to increase confusion between any two categories?

  • What are the conditions for the generation of synthetic data that improve classification? When is no more improvement possible and worsening likely?

  • Can we generate exactly as many new samples as are needed to force a certain reduction in error rate?

Directions of Future Research

  • Can we consistently generate data that is misclassified? We might throw such data into a boosting algorithm so it attempts to accommodate the failure and thus adapt the decision boundary.

  • Which methods are best suited for operating in the three spaces: parameter space, sample space, and feature space?

  • Can we generalize convex combinations to allow non-convex combinations which are bounded and controlled, e.g. extrapolation? Can these also be made safe?

Questions?
