Closed Loop Automatic Optimisation of GC/MS Parameters

Steve O’Hagan, Warwick Dunn, Marie Brown, Joshua Knowles, Douglas B. Kell. School of Chemistry, The University of Manchester. Email: SOHagan@manchester.ac.uk

The problem? • Setting up the GC/MS involves selecting numerous parameters which affect the ‘fitness’ of the chromatograms produced. • The effect on the chromatogram of different parameters is not generally predictable for unknown compounds. • Parameters may be correlated or anti-correlated with each other in terms of the ‘fitness’ of the final chromatogram. Using genetic terminology, the parameters often exhibit ‘epistasis’. • What constitutes a set of ‘good’ parameters for one type of sample may be completely different for another type of sample.

The problem – continued… • In many analytical scenarios, only a few compounds are being analysed, or only specific targeted compounds are of interest – in these situations it is entirely practical for the analyst to manually optimise GC/MS operation for each experiment. • In metabolomic studies – many hundreds of compounds may be of interest, and many of these will be unknown or unassigned compounds – thus it is not practical for each experiment to be manually optimised.

The problem? Setting up the GC/MS involves selecting numerous parameters: • Sample volume injected • Inlet temperature • Split ratio • Helium flow rate • Acquisition rate • Detector voltage • Start temp • Start temp hold time • Ramp speed • Final temp • Hold final temp

GC/MS parameters

Why not use classical optimisation methods? • Non-linearity – effectively expands search space – many classical techniques rely on linear assumptions. • The presence of discontinuous functions rather than smooth functions will invalidate many classical methods. • Multiple optima – traditional techniques are more likely to get trapped in local optima. • More difficult to extend into a multi-objective scenario. • Many traditional optimisation techniques require a theoretical model of the process being optimised.

An Introduction to the GA Approach • Process settings (the ‘phenotype’) are encoded using a text or binary representation (the ‘genotype’). • Evolutionary ‘fitness’ is a predefined function of the process output. • Only those genes with the highest fitness are selected for reproduction. • ‘Genes’ reproduce by a combination of ‘mutation’ and ‘cross-over’.

Mutation (figure) • Phenotype: Sample volume = 5 µL • Phenotype: Sample volume = 2 µL • Genome: ?

Cross-over (figure)
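The two operators can be sketched as follows. This is a minimal illustration, not the encoding used in this work: the slides describe a text or binary genotype, whereas the sketch stores each setting directly in a dictionary, and the parameter names and allowed values in `ALLOWED` are hypothetical.

```python
import random

# Hypothetical allowed values per instrument parameter (illustrative only).
ALLOWED = {
    "sample_volume_ul": [1, 2, 3, 4, 5],
    "split_ratio": [3, 15, 30, 45, 57],
    "ramp_speed_c_per_min": [12, 16, 20, 24, 28],
}

def mutate(genome, rate, rng):
    """Return a copy in which each gene is re-drawn with probability `rate`."""
    child = dict(genome)
    for name, values in ALLOWED.items():
        if rng.random() < rate:
            child[name] = rng.choice(values)
    return child

def crossover(parent_a, parent_b, rng):
    """Uniform cross-over: each gene is copied from one parent or the other."""
    return {
        name: (parent_a[name] if rng.random() < 0.5 else parent_b[name])
        for name in ALLOWED
    }

rng = random.Random(0)
parent_a = {"sample_volume_ul": 5, "split_ratio": 57, "ramp_speed_c_per_min": 24}
parent_b = {"sample_volume_ul": 2, "split_ratio": 3, "ramp_speed_c_per_min": 20}
child = mutate(crossover(parent_a, parent_b, rng), rate=0.2, rng=rng)
print(child)
```

Uniform cross-over is only one possible choice; one-point or two-point cross-over on a binary string would serve the same purpose.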

GA Applied to Instrument Optimisation

Single Objective GA for Multi-Objective Problems? • Multi-objective optimisation based on a simple GA has often used some function of the objectives to reduce the dimensionality of the problem: • F1=F(f1,f2,f3,...) • Problem with this approach: can the weighting of objectives f1,f2,f3,... be done in a non-biased way? Probably not.
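A small worked example of the weighting problem, assuming a weighted sum as the combining function F and two made-up candidates scored on two objectives that are both minimised. The candidate values and weights are illustrative only.

```python
def weighted_sum(objectives, weights):
    """Collapse several objectives into one score (lower is better)."""
    return sum(w * f for w, f in zip(weights, objectives))

def pick_best(candidates, weights):
    """Return the candidate with the lowest weighted-sum score."""
    return min(candidates, key=lambda c: weighted_sum(c, weights))

# Two hypothetical candidates, each scored as (f1, f2).
a = (1.0, 4.0)
b = (3.0, 1.0)

for weights in [(0.8, 0.2), (0.2, 0.8)]:
    print(weights, "->", pick_best([a, b], weights))
```

With weights (0.8, 0.2) candidate `a` wins; with (0.2, 0.8) candidate `b` wins. Neither candidate is better on both objectives, so the "winner" is decided entirely by the analyst's choice of weights.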

A ‘true’ Multi-Objective GA • Based on the concept of Pareto Dominance: • (x,y) dominates (p,q) IF F1(x,y) < F1(p,q) and F2(x,y) <= F2(p,q), OR F1(x,y) <= F1(p,q) and F2(x,y) < F2(p,q). • In words: at least one objective is better and the other objectives are not worse (as written, lower values are better).
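The dominance test above can be written for any number of objectives. A minimal sketch, assuming every objective is to be minimised and each solution is represented by a tuple of its objective values:

```python
def dominates(u, v):
    """True if objective vector u Pareto-dominates v (all objectives minimised).

    u must be no worse than v on every objective and strictly better on one.
    """
    no_worse = all(a <= b for a, b in zip(u, v))
    strictly_better = any(a < b for a, b in zip(u, v))
    return no_worse and strictly_better

print(dominates((1.0, 2.0), (1.0, 3.0)))  # better on F2, equal on F1
print(dominates((1.0, 3.0), (2.0, 1.0)))  # a trade-off: neither dominates
print(dominates((1.0, 2.0), (1.0, 2.0)))  # identical vectors do not dominate
```

An objective that should be maximised can be handled by negating it before the comparison.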

Pareto Set / Pareto Front • The GA collects a population of non-dominated results - in terms of the decision variables or parameters, this is termed the “Pareto Set”. • Selection tends to retain “Fit individuals” - only non-fit (dominated) results are discarded. • Transformation of variables to the space of objective functions gives rise to a curve or surface termed the “Pareto Front”.
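Collecting the non-dominated results can be sketched as a simple filter. The sketch assumes all objectives are minimised and uses made-up objective vectors; it is a quadratic-time illustration rather than the archive update used by any particular GA.

```python
def dominates(u, v):
    """True if u is no worse than v everywhere and strictly better somewhere."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def non_dominated(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (F1, F2) results for five trial parameter sets.
results = [(1, 5), (2, 2), (5, 1), (3, 3), (6, 6)]
print(non_dominated(results))  # [(1, 5), (2, 2), (5, 1)]
```

Here (3, 3) and (6, 6) are discarded because (2, 2) dominates both; the three survivors are mutual trade-offs and together trace out the front.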

Pareto Set / Pareto Front Transformation of variables (x,y)-->(F1,F2)

Choosing GA Parameters • The PESA II GA requires several parameters to be set up for each optimisation experiment: • Mutation and cross-over rates. • Internal Population (number of experiments performed per generation). • External Population (maximum number of non-dominated solutions to keep). • To choose these settings, we used synthetic objective functions which exhibit characteristics similar to those found in real experiments, aiming to minimise the number of generations required to converge to the Pareto Front.

PESA II Algorithm (flowchart) • START. • Generate random internal population, IP, of size ipsize, and evaluate these solutions. • Add the members of IP to the external population, EP. • Filter out all dominated solutions from EP. If more than epsize remain, remove duplicate/crowded solutions until EP is of size epsize. • Delete old IP. Select ipsize solutions from EP using a bin-based selection policy. Copy these “parent” solutions to IP. • Reproduce: apply recombination and mutation to the parent solutions to produce ipsize offspring. Let the offspring replace the parents in IP, and evaluate. • End optimisation? No: return to the step that adds the members of IP to EP. Yes: STOP.
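The loop in the flowchart can be sketched roughly as below. This is a simplified illustration under several assumptions, not the PESA II implementation used here: solutions are tuples of numbers in [0, 1], all objectives are minimised, the "bins" are cells of a fixed grid laid over the current objective ranges, parents are drawn by picking an occupied cell uniformly at random, and the stopping test is a fixed generation count. The names `pesa2_sketch`, `group_by_cell`, `divisions` and `pm` are hypothetical.

```python
import random
from collections import defaultdict

def dominates(u, v):
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def group_by_cell(ep, divisions=8):
    """Group (solution, objectives) pairs by the grid cell of their objectives."""
    n_obj = len(ep[0][1])
    lo = [min(f[i] for _, f in ep) for i in range(n_obj)]
    hi = [max(f[i] for _, f in ep) for i in range(n_obj)]
    grid = defaultdict(list)
    for x, f in ep:
        key = tuple(
            min(divisions - 1,
                int(divisions * (f[i] - lo[i]) / ((hi[i] - lo[i]) or 1.0)))
            for i in range(n_obj)
        )
        grid[key].append((x, f))
    return grid

def pesa2_sketch(evaluate, n_vars, ipsize=4, epsize=20, generations=30,
                 pm=0.2, seed=0):
    rng = random.Random(seed)
    # Random internal population (IP).
    ip = [tuple(rng.random() for _ in range(n_vars)) for _ in range(ipsize)]
    ep = []
    for gen in range(generations):
        # Evaluate IP and add its members to the external population (EP).
        for x in ip:
            candidate = (x, evaluate(x))
            if candidate not in ep:
                ep.append(candidate)
        # Filter out dominated solutions.
        ep = [s for s in ep if not any(dominates(t[1], s[1]) for t in ep)]
        # If EP is too large, remove members of the most crowded cell.
        while len(ep) > epsize:
            crowded = max(group_by_cell(ep).values(), key=len)
            ep.remove(rng.choice(crowded))
        if gen == generations - 1:
            break  # stop before breeding offspring that would not be evaluated
        # Bin-based selection: pick a cell, then a member of that cell.
        cells = list(group_by_cell(ep).values())
        parents = [rng.choice(rng.choice(cells))[0] for _ in range(ipsize)]
        # Reproduce: uniform recombination followed by per-gene mutation.
        ip = []
        for p in parents:
            q = rng.choice(parents)
            child = [a if rng.random() < 0.5 else b for a, b in zip(p, q)]
            child = [rng.random() if rng.random() < pm else g for g in child]
            ip.append(tuple(child))
    return ep

def toy_objectives(x):
    """Made-up two-objective test function (both minimised)."""
    return (x[0], (1.0 - x[0]) ** 2 + x[1])

front = pesa2_sketch(toy_objectives, n_vars=2, epsize=10, generations=20)
print(len(front), "non-dominated solutions kept")
```

In the closed-loop setting each call to `evaluate` would correspond to one GC/MS run, which is why ipsize is kept small and the number of generations matters.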

Objective (Fitness) Functions • Probably the most difficult aspect of a GA optimisation procedure is the choice of the fitness functions to use. • S/N: the mean signal-to-noise ratio for the 15% of peaks with the worst signal-to-noise found in the sample. • PEAKS: The number of peaks in sample minus the number of peaks flagged as noise peaks. • RUN TIME: The time between the injection of sample and the end of data collection.
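The three objectives can be computed from a peak table along the following lines. The record layout (`snr`, `is_noise`) is hypothetical, and the sketch assumes that the worst 15% is taken over all reported peaks and rounded up to a whole number of peaks; the slides do not specify either detail.

```python
import math

def objectives(peaks, run_time_s, worst_fraction=0.15):
    """Return (S/N, PEAKS, RUN TIME) for one chromatogram.

    peaks: list of dicts with keys "snr" (float) and "is_noise" (bool).
    run_time_s: time from injection to the end of data collection.
    """
    # S/N: mean signal-to-noise of the worst `worst_fraction` of peaks.
    n_worst = max(1, math.ceil(worst_fraction * len(peaks)))
    worst = sorted(p["snr"] for p in peaks)[:n_worst]
    sn = sum(worst) / n_worst
    # PEAKS: all peaks minus those flagged as noise.
    n_peaks = sum(1 for p in peaks if not p["is_noise"])
    return sn, n_peaks, run_time_s

# Ten made-up peaks with S/N 1..10; the two weakest are flagged as noise.
peaks = [{"snr": float(i), "is_noise": i <= 2} for i in range(1, 11)]
print(objectives(peaks, run_time_s=600.0))  # (1.5, 8, 600.0)
```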

Closing the Loop • Link directly to manufacturer’s software – e.g. ActiveX, API calls. • Import / export of instrument data files. • Well defined format? E.g. text, MS Access database files, documented binary? • Mimic the human user – i.e. interact with software via the Windows GUI – ‘Eventcorder’. • Communicate directly with the instrument interface via PC port / serial port / parallel port etc. • Robotise physical manipulation of samples – for GC/MS injection – already implemented via auto-sampler. • (But what about automated sample treatment / clean-up – something for the future for our lab!?)

Eventcorder: Simulating User Input • Eventcorder is a suite of programs that enables user input to any Windows program to be recorded and played back. • ‘User’ input can be controlled programmatically. • Playback can be achieved via the recorder software, the (included) script editor, or via any ActiveX-aware programming environment (e.g. Visual Basic, Visual C / C++, Delphi, Excel VBA etc.). • Eventcorder tries to match mouse positions on playback of ‘click events’ with those originally recorded by comparing small images of the region of the screen around the mouse pointer with images stored during the original recording. If the images fail to match, the playback can be aborted.

Problems Associated with Automating Windows GUIs • Position of GUI elements may not be fixed (screen resolution may differ; windows and dialogues may be moved / re-sized). • Hierarchy of GUI structures (menus, trees etc.) not fixed. • Not possible to retrieve information from the application GUI. • Appearance of GUI elements may not be fixed from run to run, and may be ambiguous.

Examples of Windows GUI Problem

Retrieving Experimental Data • Instrument software often allows data export as standard. • May be limited as to what it is possible to export. • Often not able to export raw data – thus must rely on the built-in data processing.

Using a Simple GA to Filter Out ‘Noise’ Peaks • LECO GC/MS software ‘de-convolutes’ chromatograms based on the mass spectra. • A quirk of the software introduces artefact peaks – we attribute these to ‘noise’ and to ‘duplicates’. • A simple single objective GA was used to find a filter function which would discriminate between ‘noise’ and ‘genuine’ peaks. • A human expert assigned peaks found in a data ‘training set’. • Gmax-bio software and PCA were used to determine which output parameters correlated with the presence of ‘noise’ peaks: only purity and FWHH were found to model the noise peak data. • Inspection of the training data set using a plot of Purity, P, versus FWHH suggested rather simple criteria, viz:

‘Noise’ Peaks & ‘True’ Peaks

Results on Yeast • Run time vs Generation; Size=S/N; Colour=Peaks (red=low, blue=high)

Evolution of GC-tof conditions for the optimal separation of typical yeast supernatant (‘metabolic footprint’) metabolites. The diagram shows the 2 main outputs (peak number and run time) for each trial separation in 114 generations (228 examples). The generation number is encoded in the size of the symbol (larger = later) and the S/N via the colour (bluer = higher). The peak number is the ‘raw’ peak number including duplicates provided by the LECO software after correction as described in the text for noise peaks.

Histogram of ‘Ramp Speed’

‘Ramp Speed’ vs Generation Number

Yeast supernatant TIC chromatogram for generation 1 experiment 2. Sample volume 4 µl, injection temperature 270 °C, split ratio 1:57, flow rate 2.0 ml min⁻¹, acquisition rate 15 Hz, initial hold time 5.0 minutes, GC temperature ramp 24 °C min⁻¹, final GC temperature 290 °C, final hold time 0.0 minutes.

Yeast supernatant TIC chromatogram, optimised (gen. 113 expt. 2). Sample volume 5 µl, injection temperature 270 °C, split ratio 1:45, flow rate 1.0 ml min⁻¹, acquisition rate 10 Hz, initial hold time 4.0 minutes, GC temperature ramp 28 °C min⁻¹, final GC temperature 290 °C, final hold time 1.0 minutes.

Human serum TIC chromatogram for generation 1 experiment 2. Sample volume 4 µl, injection temperature 270 °C, split ratio 1:57, flow rate 2.0 ml min⁻¹, acquisition rate 15 Hz, initial hold time 5 minutes, GC temperature ramp 24 °C min⁻¹, final GC temperature 290 °C, final hold time 0 minutes.

Human serum TIC chromatogram, optimised (gen. 113 expt. 2). Sample volume 2 µl, injection temperature 270 °C, split ratio 1:3, flow rate 0.8 ml min⁻¹, acquisition rate 15 Hz, initial hold time 4.5 minutes, GC temperature ramp 20 °C min⁻¹, final GC temperature 300 °C, final hold time 4.5 minutes.

Evolution of GC-tof conditions for separation of typical serum metabolites. The diagram shows the 2 main outputs (peak number and run time) for each trial separation in 120 generations (240 examples). The generation number is encoded in the size of the symbol (larger = later) and the signal:noise via the colour (bluer = higher). The peak number is the ‘raw’ peak number including duplicates provided by the LECO software after correction for noise peaks.

Contributors: • Douglas Kell: Project leader; data analysis (G-max). • Steve O’Hagan: MS Windows – GC/MS automation; Visual Basic & Eventcorder programming. • Warwick Dunn: GC/MS sample preparation. • Joshua Knowles: PESA II algorithm development. • Marie Brown: Data analysis (PCA etc) • BBSRC: Financial support.
