A Routine Approach to Quality Control Peter Haberl 19. 11. 2001
Content
The GDE Controller
• Workflow
• Gradients
• Distortions
• Local defects
• Condensing
Playing with negative AvgDiff values
GDE Controller Workflow
The Controller is part of the GD Expressionist™ system.
[Diagram; components shown: feature data (.CEL files), upload server, .CEL DB, GD Expressionist™ Controller, GD CoBi™ Database (.ABS, .REL), GD Expressionist™ Analyst]
GDE Controller Workflow
The Controller extends the conventional data flow.
[Diagram contrasting the Affymetrix and GeneData data flows; labels shown: .DAT, .CEL, .CDF, Condensation, .CHP, Quality Control (intensity values, flagging of outliers), ok, Outliers, .INS, .ABS, DB, GD Expressionist™ Analyst]
GDE Controller Workflow
[Screenshot; annotated elements:]
• login
• options and thresholds
• available chip layouts (.CDF files)
• available experiments (.CEL files)
GDE Controller Workflow
The Controller is about ...
• ... detection
  • of location-dependent systematic effects (gradients)
  • of intensity-dependent systematic effects (distortions)
  • of local defects
• ... correction
  • of global gradients
  • of global distortions
• ... condensing
  • constructing expression values using different algorithms
GDE Controller Gradients Gradients: incomplete washing? thermal effects? ... ?
GDE Controller Gradients
Idea*) (single-chip version):
• divide the chip into 4 x 4 sectors (as for the background determination)
• look at the feature distribution in each sector, in particular at the mode (maximum position) and the width
[Plot: sector histogram, ln(counts) versus ln(intensity)]
*) developed in discussions with H. Seidel (Schering, Berlin)
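The sector idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Controller's implementation: the function name `sector_modes`, the bin width of 0.25 in ln(intensity) and the representation of the chip as a list of rows are all hypothetical. It computes only the mode of each sector histogram, not the width.

```python
import math
from collections import Counter

def sector_modes(chip, n_sectors=4, bin_width=0.25):
    """Return an n_sectors x n_sectors grid holding, for each sector,
    the centre of the most populated ln(intensity) histogram bin."""
    rows, cols = len(chip), len(chip[0])
    modes = [[None] * n_sectors for _ in range(n_sectors)]
    for si in range(n_sectors):
        for sj in range(n_sectors):
            r0, r1 = si * rows // n_sectors, (si + 1) * rows // n_sectors
            c0, c1 = sj * cols // n_sectors, (sj + 1) * cols // n_sectors
            bins = Counter()
            for r in range(r0, r1):
                for c in range(c0, c1):
                    if chip[r][c] > 0:  # ln is undefined for non-positive values
                        bins[math.floor(math.log(chip[r][c]) / bin_width)] += 1
            if bins:
                # Highest count wins; ties go to the lower bin.
                best = max(bins, key=lambda b: (bins[b], -b))
                modes[si][sj] = (best + 0.5) * bin_width
    return modes

# Toy 8 x 8 chip whose right half is twice as bright as the left half.
chip = [[100.0] * 4 + [200.0] * 4 for _ in range(8)]
modes = sector_modes(chip)
```

On the toy chip, the sectors in the left and right halves end up with different modes, which is the kind of location dependence a gradient would produce.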
GDE Controller Gradients
In an iterative process, transform the intensities
I(x,y) → I’(x,y) = a(x,y) I(x,y) + b(x,y)
such that the sector histograms become aligned.
[Figure panels: all sector histograms after first step; all sector histograms after third step; scale factor a(x,y) in first step; offset b(x,y) in first step]
GDE Controller Gradients
It was later decided to perform only a multiplicative correction,
I(x,y) → I’(x,y) = a(x,y) I(x,y)
for two reasons:
• practical application showed that the scale factor is the dominant effect;
• the observable AvgDiff is insensitive to the offset b(x,y).
A basic assumption of the ‘single-chip’ version is that the distribution of bright and dark features is random. If this assumption is violated (e.g. for the yeast chip), the ‘single-chip’ version encounters problems.
The ‘multi-chip’ version compares the sector histograms not among themselves, but to the sector histograms of a ‘reference chip’. (This is of course only possible if enough ‘similar’ chips are available.)
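The second reason can be checked with a toy calculation. The sketch below assumes that the PM and MM feature of a pair are close enough on the chip to receive the same a and b, and it uses a plain mean of PM − MM as a simplified stand-in for AvgDiff; the function names and the numbers are hypothetical.

```python
def avg_diff(pm, mm):
    """Simplified AvgDiff: plain mean of the PM - MM differences."""
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)

def transform(values, a=1.0, b=0.0):
    """Apply I' = a * I + b with the same a and b for every feature."""
    return [a * v + b for v in values]

pm = [520.0, 810.0, 1300.0, 400.0]
mm = [310.0, 450.0, 900.0, 380.0]

base = avg_diff(pm, mm)
# An offset shifts PM and MM alike, so it cancels in every difference.
with_offset = avg_diff(transform(pm, b=50.0), transform(mm, b=50.0))
# A scale factor multiplies every difference, so it does not cancel.
with_scale = avg_diff(transform(pm, a=1.2), transform(mm, a=1.2))
```

Under these assumptions `with_offset` equals `base`, while `with_scale` is 1.2 times larger, so only the scale factor changes the observable.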
GDE Controller Gradients
Result of Gradient Correction:
[Figure panels: original; corrected; ‘heat map’ of the scale factor a(x,y)]
GDE Controller Gradients
Further example of Gradient Correction:
[Figure panels: original; corrected; ‘heat map’ of the scale factor]
GDE Controller Distortions
Distortions: A log-log plot of coding (i.e. PM and MM) features can show a nonlinear relationship when compared to the features of a ‘reference chip’. One possible reason is that chips from different chip lots are combined into one series.
(Again, the reference chip can only be constructed if enough ‘similar’ chips are available.)
GDE Controller Distortions
Idea:
• divide the reference signal region into stripes containing the same number of points (red lines)
• in each stripe, determine the median of the experiment signals (or, equivalently, the point of maximum density)
• force this median line to be the diagonal of the new point cloud; this determines the (intensity-dependent) transformation
[Plot: experiment versus reference]
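The stripe construction can be sketched as follows. This is a minimal illustration, not the Controller's algorithm: it assumes the stripes are formed on ln(reference), uses the per-stripe medians as knots, and maps experiment values onto the reference scale by piecewise-linear interpolation between the knots (an assumption, since the slide does not say how the median line is interpolated). The names `stripe_knots` and `correct` are hypothetical.

```python
import math
from statistics import median

def stripe_knots(reference, experiment, n_stripes=4):
    """Sort feature pairs by ln(reference), cut them into stripes holding
    the same number of points, and return for each stripe the pair
    (median ln reference, median ln experiment)."""
    pairs = sorted(
        (math.log(r), math.log(e))
        for r, e in zip(reference, experiment)
        if r > 0 and e > 0
    )
    size = len(pairs) // n_stripes
    knots = []
    for k in range(n_stripes):
        end = (k + 1) * size if k < n_stripes - 1 else len(pairs)
        stripe = pairs[k * size:end]
        knots.append((median(p[0] for p in stripe),
                      median(p[1] for p in stripe)))
    return knots

def correct(intensity, knots):
    """Map an experiment intensity onto the reference scale so that the
    median line becomes the diagonal. Needs at least two knots with
    distinct experiment medians; values outside the knot range are
    extrapolated from the first or last segment."""
    y = math.log(intensity)
    pts = sorted((e, r) for r, e in knots)
    for (e0, r0), (e1, r1) in zip(pts, pts[1:]):
        if y <= e1:
            break
    return math.exp(r0 + (y - e0) * (r1 - r0) / (e1 - e0))

# Toy data: the experiment is distorted by a power law relative to the reference.
reference = [10.0 * 1.5 ** i for i in range(16)]
experiment = [r ** 1.2 for r in reference]
knots = stripe_knots(reference, experiment)
corrected = correct(100.0 ** 1.2, knots)  # close to 100.0
```

A power-law distortion is a straight line in the log-log plot, so the piecewise-linear map recovers it almost exactly; real distortions would only be approximated.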
GDE Controller Distortions Result of Distortion Correction: impossible to correct
GDE Controller Reference Chip
Both gradient and distortion detection/correction require the concept of a reference chip:
• serves as a ‘virtual standard’ for a given experiment set
• the experiment set should be homogeneous:
  • chips from the same production lot
  • probes from the same tissue
  • a small number of differentially expressed genes doesn’t change the characteristic pattern
• the chips have to be made comparable, for instance with a global logarithmic-mean normalization
• the reference chip is computed featurewise (as mean or median)
[Diagram labels: normalized set; reference chip]
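A reference chip along these lines might be computed as in the sketch below. It assumes that ‘global logarithmic-mean normalization’ means rescaling each chip so that the mean of ln(intensity) is the same (here zero) on every chip, and it uses the featurewise median; chips are flat lists of positive intensities in the same feature order. The function names are hypothetical.

```python
import math
from statistics import median

def log_mean_normalize(chip):
    """Rescale a chip so that the mean of ln(intensity) becomes 0,
    i.e. the geometric mean of the intensities becomes 1."""
    mean_log = sum(math.log(v) for v in chip) / len(chip)
    factor = math.exp(-mean_log)
    return [v * factor for v in chip]

def reference_chip(chips):
    """Featurewise median over the normalized chips."""
    normalized = [log_mean_normalize(c) for c in chips]
    return [median(feature) for feature in zip(*normalized)]

# Three toy chips showing the same pattern at different overall brightness.
chips = [
    [1.0, 2.0, 4.0],
    [2.0, 4.0, 8.0],
    [10.0, 20.0, 40.0],
]
reference = reference_chip(chips)  # approximately [0.5, 1.0, 2.0]
```

The median is used here because it is less sensitive than the mean to a single defective chip; the slide allows either choice.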
GDE Controller Local Defects
Local defects: Some local defects are already visible in a global chip view.
[Figure: view of outlier locations]
Aim: Can we reliably detect smaller local defects, if possible automatically?
GDE Controller Local Defects
Idea:
• construct a ‘ratio chip’ by dividing each feature by its counterpart on the reference chip
• for visualisation purposes, show in
  • green: features which are brighter
  • red: features which are darker
  • black: features that don’t change
• local defects should show up as speckles of homogeneous color, with diameters of at least several features
[Diagram: experiment features x_ij and reference features y_ij (rows 0..1, columns 0..2 shown) give the ratio chip with entries x_ij / y_ij: x00/y00, x01/y01, x02/y02, x10/y10, x11/y11, x12/y12]
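The ratio chip and its colour coding can be sketched as follows. The fold threshold of 1.5 that separates ‘brighter’ and ‘darker’ from ‘unchanged’ is a hypothetical value chosen for illustration, as are the function names; the slides do not state the threshold actually used.

```python
def ratio_chip(experiment, reference):
    """Divide each feature by its counterpart on the reference chip."""
    return [
        [e / r for e, r in zip(exp_row, ref_row)]
        for exp_row, ref_row in zip(experiment, reference)
    ]

def colour(ratio, fold=1.5):
    """Green for brighter, red for darker, black for unchanged features."""
    if ratio > fold:
        return "green"
    if ratio < 1.0 / fold:
        return "red"
    return "black"

experiment = [[100.0, 300.0], [50.0, 100.0]]
reference = [[100.0, 100.0], [100.0, 100.0]]

ratios = ratio_chip(experiment, reference)
colours = [[colour(r) for r in row] for row in ratios]
```

Detecting a defect would then mean looking for connected patches of one colour several features across, which this sketch does not attempt.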
GDE Controller Local Defects
[Figure labels: actual defects; differential regulation]
GDE Controller Local Defects This method can identify defects which would be hard to find ...
GDE Controller Local Defects ... or invisible, even in a zoomed view:
GDE Controller Local Defects
For old (row-wise spotted) chips, there is a danger that differentially expressed genes are detected as chip artefacts.
Application of pattern search algorithms can solve this problem.
[Figure label: differential regulation]
GDE Controller Local Defects Further example of a local defect:
GDE Controller Local Defects Defects can have a certain spatial extension:
GDE Controller Local Defects Most frequent structures:
GDE Controller Local Defects ... and others:
GDE Controller Local Defects
An interactive chip viewer lets the user
• view identified mask areas
• zoom and find out which genes are affected by masking
• manually edit the masked areas
GDE Controller Workflow
• reporting
• export to database, into analysis software or as .CEL files
• choose between different condensing algorithms: MAS4, MAS5, GeneData ( = trimmed mean of log(PM) )
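As an illustration of the ‘trimmed mean of log(PM)’ option, the sketch below condenses the PM intensities of one probe set. The trim fraction of 0.2 per tail is a hypothetical choice, the function names are made up for this example, and whether the result is transformed back to the linear scale is not stated on the slide, so the value is left on the log scale here.

```python
import math

def trimmed_mean(values, trim_fraction=0.2):
    """Mean after discarding the lowest and highest trim_fraction of the values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def condense_log_pm(pm_intensities, trim_fraction=0.2):
    """Expression value of one probe set as the trimmed mean of ln(PM)."""
    return trimmed_mean([math.log(pm) for pm in pm_intensities], trim_fraction)

# Five PM features of one probe set; the last one is a gross outlier.
pm = [math.exp(1), math.exp(2), math.exp(3), math.exp(4), math.exp(100)]
value = condense_log_pm(pm)  # the outlier and the lowest value are trimmed away
```

With five features and a trim fraction of 0.2, one value is dropped at each end, so the outlier has no influence on the result.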
Playing with negative AvgDiff values
log-log plot:
[Figure panels: replicates; large differential expression]
• correlation of large values is visible
• only positive values can be displayed
Playing with negative AvgDiff values
linear-linear plot:
[Figure panels: replicates; large differential expression]
• negative values can be displayed
• poor resolution for small values
• large values appear scattered
Playing with negative AvgDiff values
‘cube-root’ plot: $y = \sqrt[3]{\mathrm{AvgDiff}}$
[Figure panel: replicates]
• display of positive and negative values
• damping at large values
• ‘zero density regions’ (artefact)
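Playing with negative AvgDiff values
‘lin-log’ transformation: $y = \mathrm{sign}(x)\,\ln(1 + |x|)$
$$
\mathrm{sign}(x)\,\ln(1 + |x|) =
\begin{cases}
x \mp \dfrac{x^2}{2} + O(x^3), & |x| \ll 1 \\[1ex]
\pm \ln|x| + \dfrac{1}{x} + O\!\left(\dfrac{1}{x^2}\right), & |x| \gg 1
\end{cases}
$$
(upper sign for x > 0, lower sign for x < 0)
• interpolates smoothly between linear (for small values) and logarithmic (for large values) behaviour
• damping of high values
[Plot: the transformation together with the limiting curves y = x and y = ln(x)]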
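The transformation is easy to try out numerically. The sketch below implements y = sign(x)·ln(1 + |x|) with the standard-library functions `math.log1p` and `math.copysign`; the inverse function is an addition for convenience and is not part of the slides.

```python
import math

def lin_log(x):
    """y = sign(x) * ln(1 + |x|): linear near zero, logarithmic far from it."""
    return math.copysign(math.log1p(abs(x)), x)

def lin_log_inverse(y):
    """x = sign(y) * (exp(|y|) - 1), the inverse of lin_log."""
    return math.copysign(math.expm1(abs(y)), y)

small = lin_log(1e-6)   # close to 1e-6 (linear regime)
large = lin_log(1e6)    # close to ln(1e6) (logarithmic regime)
negative = lin_log(-250.0)
```

Because the function is odd, negative AvgDiff values are mapped to negative y, so both signs can be shown on the same axis.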
Playing with negative AvgDiff values
‘lin-log’ plot:
[Figure panels: replicates; large differential expression]
• A good choice is x = AvgDiff / Target, i.e. the target intensity sets the scale.
• Lines of constant factors are shown in blue (2), red (5) and green (10).
Playing with negative AvgDiff values
The ‘lin-log’ plots make it possible to look at positive and negative AvgDiff values simultaneously. But why would we want to look at the negatives at all?
Consider the following ‘experiment’: Construct faked .CEL files in which the PM and MM values of all pairs are interchanged, and condense them with the old Affymetrix algorithm (ignoring AbsCall).
Amusing observation: If one ignores that the scale factor
SF = Target / TrimmedMean(AvgDiff)
becomes negative (MAS doesn’t: “Failed to analyze due to invalid Scale Factor”), the old (MAS4) algorithm would be invariant under PM ↔ MM!
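The invariance can be reproduced with a toy model. The sketch below is a strong simplification and not the MAS4 algorithm: AvgDiff is taken as the plain mean of PM − MM per probe set, the trimmed mean cuts the same fraction from both tails, and the trim fraction, target value and data are hypothetical. Swapping PM and MM flips the sign of every AvgDiff and of the scale factor, so their product is unchanged.

```python
def trimmed_mean(values, trim_fraction):
    """Mean after cutting the same fraction from both tails."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def scaled_avg_diffs(pm_sets, mm_sets, target=100.0, trim_fraction=0.25):
    """Return (scale factor, scaled AvgDiff per probe set)."""
    avg_diffs = [
        sum(p - m for p, m in zip(pm, mm)) / len(pm)
        for pm, mm in zip(pm_sets, mm_sets)
    ]
    sf = target / trimmed_mean(avg_diffs, trim_fraction)
    return sf, [sf * a for a in avg_diffs]

pm_sets = [[520.0, 810.0], [1300.0, 400.0], [200.0, 260.0], [900.0, 950.0]]
mm_sets = [[310.0, 450.0], [900.0, 380.0], [240.0, 300.0], [300.0, 350.0]]

sf, scaled = scaled_avg_diffs(pm_sets, mm_sets)
# 'Faked' data: PM and MM interchanged in every pair.
sf_swapped, scaled_swapped = scaled_avg_diffs(mm_sets, pm_sets)
```

In this model `sf_swapped` is the negative of `sf`, while `scaled_swapped` matches `scaled` up to rounding, which is the invariance described above.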
Playing with negative AvgDiff values
Original data: the ‘three-tissue-dataset’: 3 groups with 6 replicates each
[Figure panels: within replicate groups; across replicate groups]
perfect group separation
Playing with negative AvgDiff values
PM ↔ MM data:
These are log-log plots of negative AvgDiffs. The good correlation at high values indicates that these numbers are reproducible. The difference between replicate groups is not so obvious, but ...
Playing with negative AvgDiff values
... clustering again results in a complete group separation.
Take-home message: The mismatches carry information which can be measured reproducibly and can be used (at least) for pattern comparisons.