
Seeing LHC/SPS beam dumps with Deep Convolutional Neural Networks ( ConvNets )


Presentation Transcript


  1. Seeing LHC/SPS beam dumps with Deep Convolutional Neural Networks (ConvNets) Bren, Francesco, Elli

  2. Overview • Computer Vision and ConvNets • Digital images and classification problems • How computers see with deep Convolutional Neural Networks: deep learning • Some 'industry standard' ConvNet architectures for feature recognition • Using Keras for ConvNet construction, training and testing • Our test use-cases • LHC/SPS beam dump datasets: simulated and measured • Preparing measured image datasets with unsupervised learning • First comparison of 'VGG-like' and homebrew ConvNet performance • Some questions • How to overcome the paucity of measured training data for anomalous dumps? • Can we train on simulated data and apply to measured dumps? • Can transfer learning help? • Can we use a CNN to produce accurate training data for anomalies? • How do explicit feature extraction and Random Forest classification compare?

  3. Image recognition problems • Classic problem in computer vision, image processing and machine vision: determine whether an image contains a specific object, feature or activity • Object recognition (classification into types of object like cat, dog, car, …) • Identification (an individual instance of an object is recognized, like a face, voice, fingerprint, written digits) • Detection (seek specific conditions, like abnormalities in medical images) • Very active field, fundamental to many AI applications, with huge progress made in the past 6-7 years • ConvNets are widely used (with GPUs) • Check out the annual ImageNet Large Scale Visual Recognition Challenge

  4. Digital image formats • Each pixel in each channel is usually 8-bit resolution, i.e. values 0-255 • The A (alpha) channel defines pixel transparency; we can ignore it • Conversion of RGB to grayscale can be simple or very complicated • A simple method is typically Y = 0.2125(R) + 0.7154(G) + 0.0721(B) • We looked only at grayscale images, but in general RGB is used and the logic of ConvNets applies to one channel or to a stack of 3 channels • Image normalization: we scaled by 1/255 to give pixel values in the range 0-1
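As an illustration of this preprocessing (not code from the presentation; the function name is ours), a grayscale conversion with the weights quoted above plus 0-1 normalization could look like this in NumPy:

import numpy as np

def to_gray_01(rgb):
    # rgb: uint8 array of shape (H, W, 3); any alpha channel is assumed already dropped
    gray = 0.2125 * rgb[..., 0] + 0.7154 * rgb[..., 1] + 0.0721 * rgb[..., 2]
    return gray / 255.0   # scale 0-255 pixel values into the range 0-1

img = np.random.randint(0, 256, size=(280, 320, 3), dtype=np.uint8)  # stand-in RGB image
x = to_gray_01(img)       # grayscale array of shape (280, 320) with values in 0-1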

  5.–7. Convolutional Neural Networks • ANN for analysing images (or any data with local 'spatial' patterns…) • Inspired by biology: the connectivity pattern between neurons resembles the animal visual cortex - individual cortical neurons respond to stimuli in restricted, partially overlapping receptive fields • ConvNets consist of stacked layers with well-defined functions: • Input (the sensor, or retina) • Convolution (feature mapping) • Normalisation (non-linear activation) • Pooling (dimension reduction) • Fully connected layers (the classifier or decider) • Output layer • The convolution/pooling stack plays the role of feature extraction, or visual cortex • Key aspect: the convolution layers learn to find the features of interest, while in traditional algorithms this feature extraction is hand-engineered • Basically no prior image processing or feature engineering: a more biological approach which does not need prior knowledge and human feature extraction

  8. Arrangement of ConvNet layers https://www.mathworks.com/videos/introduction-to-deep-learning-what-are-convolutional-neural-networks--1489512765771.html

  9. Input layer • An (m x n) image is mapped to an (m' x n') array and input to the first convolution layer • In many examples we found, large image sizes (e.g. 1280x720 HD) are downsampled to something manageable, like 224x224 • We used 320x280 grayscale SPS/LHC BTV images, downsampled to 224x224 or 256x256
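One possible way to do this downsampling (illustrative only; the slides do not say which resizing routine was actually used) is with Pillow:

import numpy as np
from PIL import Image

btv = np.random.rand(280, 320)                        # stand-in for a 320x280 grayscale BTV image
img = Image.fromarray((btv * 255).astype('uint8'))    # wrap as an 8-bit grayscale image
small = np.asarray(img.resize((256, 256)), dtype='float32') / 255.0   # downsample and renormalise to 0-1
x = small[np.newaxis, :, :, np.newaxis]               # add batch and channel axes for Keras: (1, 256, 256, 1)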

  10. Image filters or convolutions • Filters are small (3x3, 5x5, 7x7, …) arrays • Here we show the 2D case, but in general filters are 3D, to process image stacks • Convolve by scanning the filter across the image, summing the product of overlapping pixels and recording the result in a feature map • The 3D case is the same principle - the feature map is generated by the sum of pixel-wise products over the volume • 'Padding' on preserves the image dimension; off reduces it by (n-1) pixels per plane for an [n,n] filter
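The convolution step itself can be written out directly. This is a small NumPy sketch of our own (for illustration), using the kind of vertical-edge filter shown on the next slide, with 'valid' padding and the rectification described on slide 18:

import numpy as np

def conv2d_valid(image, kernel):
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1   # no padding: output shrinks by (n-1)
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)        # sum of overlapping pixel products
    return out

vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]])     # simple vertical-edge filter

image = np.zeros((6, 6))
image[:, :3] = 1.0                         # bright left half, dark right half -> a vertical edge
fmap = conv2d_valid(image, vertical_edge)  # 4x4 feature map, strong response along the edge
fmap = np.maximum(0, fmap)                 # ReLU rectification (see slide 18)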

  11. What do filters do? • Different filters extract different features from the image • Larger filters have a larger receptive area • (Figures: 1-D convolution, as used in audio processing; 2-D convolution, as used in image processing)

  12.–17. Example of a vertical edge filter (without padding), scanned step by step across the image • Strong response to the vertical edge feature

  18. Normalisation (rectification) • Need non-linear response, while convolution is linear • Activation applied per pixel after each convolution – e.g. ReLU: x ← max(0,x) • Other non-linear activations are possible, like tanh, sigmoid…

  19. Example: 4-filter conv layer applied to a grayscale image • Convolution depth is the number of filters, and of the resulting feature map array • Here a 256x256 grayscale image convolved with 4 filters of dimension (m x m) gives a [256x256] x 4 array of rectified feature maps (depth 4)

  20.–21. Example: 16-filter conv layer applied to the depth-4 feature map array • Each filter is a 3D array, with a depth equal to the depth of the previous layer • It's still called a 2D convolution, as each filter produces a 2D output array (!) • Convolving the 256x256, depth-4 feature map array with 16 filters of dimension (m x m x 4) gives a [256x256] x 16 array of rectified feature maps • The crucial aspect is that a 'deep' convolution layer like this allows features from all the preceding maps to be combined into (more complex) feature maps of the next layer…

  22. Visualising this… (from https://arxiv.org/pdf/1311.2901.pdf) • Feature maps after convolution layer 1 • We see very basic geometric shapes

  23. In the 2nd layer the network already starts to pick out combinations of shapes

  24. In the 3rd layer complexity emerges from combinations of compound shapes…

  25. Filters, weights, biases and total parameters • Filters are trainable - this is also what makes ConvNets so powerful • A flat [5,5] filter has 26 weights - 25 individual pixel multiplication values, plus an overall bias • Similarly, a [3,3,16] filter has (3*3*16 + 1) = 145 trainable weights • A conv layer with 64 filters of [3,3,32] has 64*(3*3*32 + 1) = 18,496 weights • This number does not depend on the actual image dimension - shared weights (and many fewer than if connecting weighted neurons to each image pixel) • If the filter weights are initially random, the network trains from scratch • With pre-trained filter weights we have transfer learning
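These counts are easy to reproduce. A throwaway sketch (layer sizes chosen only to match the numbers above, assuming a TensorFlow/Keras setup like the one shown later on slide 41):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D

model = Sequential()
model.add(Conv2D(filters=64, kernel_size=(3, 3), input_shape=(256, 256, 32)))  # 64 filters of [3,3,32]
model.summary()                      # reports 64 * (3*3*32 + 1) = 18,496 trainable parameters

# the same arithmetic by hand - independent of the 256x256 image size
print(64 * (3 * 3 * 32 + 1))         # -> 18496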

  26. Spatial pooling (sub- or down-sampling) • Features (i.e. filtered images) are pooled by aggregating the pixels in a region (2x2, 3x3, …), taking either the 'max' or the 'average', and sliding this window across the image • If the stride equals the window size, the image size is reduced by that factor • The pooling layer reduces the spatial size of the representation • No trainable weights in a pooling layer

  27. 2x2 spatial pooling with stride 2 • Pooling our (256 x 256 x 16) feature map array with a [2,2] window and stride 2 gives a (128 x 128 x 16) pooled feature map array
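Written out in NumPy for illustration (Keras' MaxPool2D, used in the model on slide 41, does the equivalent per feature map):

import numpy as np

def max_pool_2x2(fmap):
    # fmap: (H, W) with H, W even; group pixels into non-overlapping 2x2 blocks and keep the max of each
    H, W = fmap.shape
    return fmap.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

fmaps = np.random.rand(16, 256, 256)                  # 16 feature maps of 256x256
pooled = np.stack([max_pool_2x2(f) for f in fmaps])   # -> shape (16, 128, 128), as on the slide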

  28. Why pooling? Why not just connect to a dense layer to classify features? • Dimensionality. Connecting 64 feature maps of 256x256 pixels to a 1024-neuron dense layer would mean ≈ 4 x 10^9 weights • Pooling reduces sensitivity to the position and size of a feature, i.e. introduces spatial and scale invariance - good for some applications, but maybe not for ours…

  29. Effect of repeated convolution and pooling • Hidden layers are composed of different arrangements of alternating convolution+rectification and pooling layers • The first layer of filters identifies very basic features, like edges (vertical, horizontal, tilted), or curves, or dark/light regions. • Repeated pooling and filtering lets ConvNet build and identify more complex groups of features from low-level geometric building blocks. • Vastly improves discrimination and recognition of complex patterns.

  30. Fully connected (dense) layers • Output of the last conv layer is flattened and connected to a dense layer • There may be several dense layers with full neural interconnection • These layers recognize patterns coming from the feature maps • (Diagram: feature map array → flatten → fully connected layer(s), connected by trainable weights)

  31. Almost there - the output layer • Final layer of output neurons • Each corresponds to one category in the training/validation and (hopefully) test data • (Diagram: feature map array → flatten → fully connected layer(s) → softmax output layer with categories 0-3)
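For reference, the softmax step turns the raw category scores into probabilities. A standalone NumPy example, with scores invented purely for illustration:

import numpy as np

scores = np.array([2.1, 0.3, -1.0, 0.5])   # raw outputs for categories 0..3 (invented values)
p = np.exp(scores - scores.max())          # subtract the max for numerical stability
p /= p.sum()                               # -> roughly [0.71, 0.12, 0.03, 0.14], summing to 1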

  32. Training the Network • Imagine we have 2 convolution/activation layers, each followed by pooling • The 1st convolution layer is the input layer • 1 flattening layer • 2 dense (fully connected) layers • 1 softmax output layer • Here 5 layers have trainable weights: the convolution filters and the dense layers • (Diagram: image of category A or B → convolute/ReLU/pool → convolute/ReLU/pool → flatten → dense → dense → softmax over A and B, with trainable weights w at each stage)

  33. Network Training - Backpropagation • Supervised learning: labelled sets of images sorted into categories • Network trained using backpropagation and gradient descent (stochastic or mini-batch) • Error signal E generated by comparing the output values with the correct answer • Example for an image of category A: output A = 0.7 (E = |1.0 - 0.7| = 0.3), output B = 0.3 (E = |0.0 - 0.3| = 0.3)

  34. Network Training - Backpropagation • The error E is used to update the trainable weights wi, via calculation of the partial derivative dE/dwi for each weight • As an aside, training and fitting CNNs is very well adapted to GPUs… much faster than CPU • (Diagram: total error = 0.6 propagated back through the network weights)

  35. One last thing - regularization • With many trainable parameters, over-fitting is an issue • It hampers generalization - recognition of similar-but-not-identical images • Dropout layers help, by randomly switching off a fraction (0.4, 0.5, …) of neurons per layer for each training iteration. This forces other pathways to recognise image features and reduces overfitting - effectively creating a set of networks that combine to give the result • Dropout is used between dense layers. For conv layers, spatial relationships are encoded in the feature maps, with activations highly correlated in adjacent neurons • Another method is image augmentation, which increases the training data • Yet another way is to add noise, either to the input or to hidden layers

  36. ConvNet application domains • The ConvNet approach is valid for 2D or 3D datasets where "things closer together are more closely related than things far away" (i.e. the data can be represented in an image-like array where location in the array is important) • An example is a time-series of frequency-sampled data… • ConvNets were used with (deep) reinforcement learning by DeepMind to play (well) a range of Atari computer games using just the video screen as input, with one architecture and one set of hyperparameters

  37. ConvNet architectures and hyperparameters • Find many different architectures and variations on basic layers, tailored to specific applications • In addition, hyper-parameter space is quite large, and related both to CNN structure (filter size, number of filters, pooling size and stride, dropout rate(s), weight initialization, activation function choice) and to optimizer (learning rate, batch size, iterations) • No formula (but many specific examples) - design of ConvNets seems to be an art • One ‘almost hyperparameter’ is image augmentation – this is applying some random shifts, scale factors, rotations, noise, skew to training images to increase the size of training dataset.
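A possible augmentation setup with Keras' ImageDataGenerator (the ranges here are invented for illustration, not the values used for the BTV datasets):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=5,           # small random rotations (degrees)
    width_shift_range=0.05,     # random horizontal shifts (fraction of width)
    height_shift_range=0.05,    # random vertical shifts
    zoom_range=0.1,             # random scale factors
    fill_mode='nearest')

# augmenter.flow(x_train, y_train, batch_size=50) then yields endlessly augmented batches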

  38. Real ConvNet architectures: VGG16 (2014) • 16 weight layers, 138 million trainable parameters, 1000 output categories • (Diagram: layers 1-16 with channels, filters and pixel dimensions at each stage)
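For comparison, a pre-trained VGG16 can be loaded directly from keras.applications, e.g. as a frozen feature extractor for transfer learning (illustrative; this is not the network used in the following slides):

from tensorflow.keras.applications import VGG16

base = VGG16(weights='imagenet',          # filter weights pre-trained on ImageNet
             include_top=False,           # drop the 1000-category dense/output layers
             input_shape=(224, 224, 3))   # VGG16 expects 3-channel 224x224 input
base.trainable = False                    # freeze the convolutional feature extractor
base.summary()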

  39. Real ConvNet architecture: GoogLeNet/Inception (2014) • Uses the Inception module, where filters of different dimensions are applied in parallel

  40. Many other ConvNet architectures

  41. Keras implementation

### imports (assuming a TensorFlow/Keras setup)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPool2D, BatchNormalization, Flatten, Dense, Dropout
from tensorflow.keras.optimizers import Adam

### define some CNN configuration parameters
img_size = 256          # H/V input pixels
num_categories = 6      # categories for classification
num_channels = 1        # 3 for RGB, 1 for mono
pad = 'same'            # padding, 'same' to preserve image dimension
drop = 0.5              # dropout rate when training

### Assemble the CNN
model = Sequential()
# input and convolution layers
model.add(Conv2D(filters=32, kernel_size=(3,3), activation='relu',
                 input_shape=(img_size, img_size, num_channels), padding=pad))
model.add(MaxPool2D(pool_size=(2,2), strides=2))
model.add(Conv2D(filters=64, kernel_size=(3,3), activation='relu', padding=pad))
model.add(MaxPool2D(pool_size=(2,2), strides=2))
# batch normalisation and flatten
model.add(BatchNormalization())
model.add(Flatten())
# Dense layers: dropout to avoid overfitting / regularization
model.add(Dense(units=256, activation='relu'))
model.add(Dropout(drop))
model.add(Dense(units=256, activation='relu'))
model.add(Dense(units=num_categories, activation='softmax'))

### Define optimizer and compile
optimizer = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0, amsgrad=False)
model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])

  42. Training • We started with 1000-2000 images per category, either simulated directly or screenshots from simulations, with 4-6 categories • Split into training, validation and test data sets • ~4'000 training images per epoch without augmentation, 40'000 with augmentation, batch size of 50 (i.e. 80 or 800 iterations per epoch) and 10-70 epochs • Initial learning rate of 0.001, with a callback to reduce it by a factor of 3 after a few epochs without improvement (sketched below) • Networks trained quickly (1-10 minutes) using a cloud GPU
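A minimal sketch of this training loop (array names are placeholders; model is the network compiled on slide 41, and ReduceLROnPlateau implements the reduce-by-a-factor-of-3 callback mentioned above):

from tensorflow.keras.callbacks import ReduceLROnPlateau

lr_callback = ReduceLROnPlateau(monitor='val_loss',
                                factor=1/3,      # divide the learning rate by ~3
                                patience=3,      # epochs without improvement before reducing
                                verbose=1)

history = model.fit(x_train, y_train,            # training images and one-hot labels (placeholders)
                    validation_data=(x_val, y_val),
                    batch_size=50,
                    epochs=30,
                    callbacks=[lr_callback])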

  43. V1 - our first ConvNet attempt • V1 with 16 weight layers was adapted from VGG16, with [3,3] filters, 5 layers of x2 maxpooling down to 7x7 feature map size (from 224x224), tapered dense layers and ~5.6 M weights • This sort of approach had worked very well with the 28x28 MNIST handwritten digits dataset (99.6% accuracy) • Initially performed poorly with simulated LHC BTVDD images • Seemed to be strongly overfitting, with a val_acc of 1.000 in training but unable to distinguish between 0MKB and nominal dumps… • Conjectured that it lacked spatial discrimination, so made V2 with less pooling • The main issue was found to be over-enthusiastic use of dropout and not normalizing the images to 0-1 • When 'fixed', tested with failure types not in the training data… • (Plots: no augmentation, dropout 0.3 - 100% test accuracy; x10 augmentation, dropout 0.3 - 100% test accuracy. WTF??)

  44. V1 - effect of dropout/normalization/augmentation • This network had dropout layers interleaved with the convolution layers • With the initial dropout value of 0.5, no data augmentation and non-normalised input, the output was random • Turning the dropout rate down to 0.3 made it work, as did normalising the input • Before this became clear, we had already made V2 with less pooling and only 1 dropout (in the FC layer) • Will go back to V1 in future: x10 fewer training weights and more 'elegant' interconnectivity • Lessons: be careful with dropout, normalize the input, augment the data • (Plots, note the different vertical scales: no augmentation, dropout 0.5 - 18% test accuracy (= random); no augmentation, dropout 0.3 - 100% test accuracy)

  45. V1 classification examples • V1 with x10 augmentation and 0.3 dropout • (Plots: two examples of faults the network was not trained on)

  46. V2 - our second ConvNet attempt • V2 with 10 weight layers, [3,3] filters, 2 layers of x2 maxpooling down to 64x64 feature map size (from 256x256), 2 x 256 dense layers and ~67 M weights • 100% accurate on simulated LHC BTVDD test data • Tested the trained V2 with measured BTVDD images from Timber, to see if training on simulated data is sufficient • Systematically misclassified nominal dumps as missing 1 MKDH (an H scale issue) which can be fixed • But other puzzling results like the one below, where the classification is totally wrong (x10 augmentation, dropout 0.5) • We can try transfer learning here - retrain the classification layers with a (smaller) measured dataset, starting from the already-trained weights (see the sketch below)
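A sketch of that transfer-learning step (our own illustration, assuming the model is built as on slide 41): freeze the already-trained convolution filters and retrain only the dense classification layers on the smaller measured dataset.

from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

for layer in model.layers:
    layer.trainable = isinstance(layer, Dense)     # freeze everything except the dense layers

model.compile(optimizer=Adam(lr=1e-4),             # smaller learning rate for fine-tuning
              loss='categorical_crossentropy', metrics=['accuracy'])

model.fit(x_measured, y_measured,                  # measured BTVDD images and labels (placeholders)
          validation_split=0.2, batch_size=50, epochs=20)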

  47. V2 ConvNet for SPS dump • The aim is to try to train a NN using simulations • Produced 1000 simulations of expected BTVDD readings for 4 categories for the training set: • "Nominal", "3 MKDH & 2 MKDV", "2 MKDH & 3 MKDV", "1 MKDH & 3 MKDV" • Produced 1000 simulations for the same categories for the validation set • Data produced for the SPS: • Dump for SFTPRO beam at 400 GeV → 2 batches with realistic length and batch spacing • Random CO in x and y at the BTV location → Gauss(0, 3 mm) (quite generous…) • Random emittance in x and y → Gauss_x(9 mm.mrad, 3 mm.mrad), Gauss_y(7 mm.mrad, 3 mm.mrad) • 1000 particles per 420x2x25 ns slots (simpler to simulate), all the same • (Images: "simple" test data - Nominal, 3 MKDH & 2 MKDV, 2 MKDH & 3 MKDV, 1 MKDH & 3 MKDV)

  48. ConvNet for SPS dump • For the test set, two sets were produced: • One using the same generation settings as for training and validation: "simple" • One using completely unrealistic emittances → x10 larger than for validation and training: "challenging" • The idea was to evaluate the network's capability to recognise dumps at different energies, as long as the kick is constant → true in most cases… only at 14 and 26 GeV is this not true… to be checked what happens then! • (Images: "challenging" test data - Nominal, 3 MKDH & 2 MKDV, 2 MKDH & 3 MKDV, 1 MKDH & 3 MKDV)

  49. ConvNet for SPS dump • Using only the data generated with simulations, i.e. 4000 images in total for the training and validation sets, the results are not great at all… even with the simple test set • (Plots: "simple" test data classification examples for 1 MKDH & 3 MKDV, 2 MKDH & 3 MKDV, 3 MKDH & 2 MKDV and Nominal)

  50. ConvNet for SPS dump • Adding more training data with Keras' ImageDataGenerator augmentation to increase the statistics → 4000 x 10. Much better. • (Plots: "simple" test data classification examples for 1 MKDH & 3 MKDV, 2 MKDH & 3 MKDV, 3 MKDH & 2 MKDV and Nominal)
