
Sieci neuronowe – bezmodelowa analiza danych? (Neural networks – model-free data analysis?)


Presentation Transcript


  1. Sieci neuronowe – bezmodelowa analiza danych? K. M. Graczyk IFT, Uniwersytet Wrocławski Poland

  2. Why Neural Networks? • Inspired by C. Giunti (Torino) • PDFs by Neural Networks • Papers of Forte et al. (JHEP 0205:062,2002; JHEP 0503:080,2005; JHEP 0703:039,2007; Nucl.Phys.B809:1-63,2009) • A kind of model-independent way of fitting data and computing the associated uncertainty • Learn, Implement, Publish (LIP rule) • Cooperation with R. Sulej (IPJ, Warszawa) and P. Płoński (Politechnika Warszawska) • NetMaker • GrANNet ;) my own C++ library

  3. Road map • Artificial Neural Networks (NN) – the idea • Feed-forward NN • PDFs by NN • Bayesian statistics • Bayesian approach to NN • GrANNet

  4. Inspired by Nature The human brain consists of around 10^11 neurons, which are highly interconnected with around 10^15 connections

  5. Applications • Function approximation, or regression analysis, including time series prediction, fitness approximation and modeling. • Classification, including pattern and sequence recognition, novelty detection and sequential decision making. • Data processing, including filtering, clustering, blind source separation and compression. • Robotics, including directing manipulators, Computer numerical control.

  6. Feed Forward Artificial Neural Network – the simplest example: input layer → hidden layer → output (compared with the target). With linear activation functions the whole network reduces to a single matrix.

  7. The i-th perceptron: the inputs are weighted and summed, a threshold is applied, and the activation function produces the output, y_i = f( Σ_j w_ij x_j - θ_i ); see the sketch below.
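A minimal C++ sketch of this computation (not taken from NetMaker or GrANNet; the function name and the choice of a sigmoid activation are illustrative):

```cpp
#include <cmath>
#include <vector>

// One perceptron: weighted sum of the inputs plus a bias (the threshold),
// passed through a sigmoid activation function.
double perceptron(const std::vector<double>& inputs,
                  const std::vector<double>& weights,
                  double bias)
{
    double sum = bias;                        // a bias neuron replaces an explicit threshold
    for (std::size_t i = 0; i < inputs.size(); ++i)
        sum += weights[i] * inputs[i];
    return 1.0 / (1.0 + std::exp(-sum));      // sigmoid activation
}
```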

  8. Activation functions • Heaviside step function θ(x) → a 0 or 1 signal • sigmoid function • tanh(x) • linear. Depending on the function and the size of its argument the signal is either amplified or weakened; see the sketch below.
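A short C++ sketch of the activation functions listed above (names are illustrative):

```cpp
#include <cmath>

// Activation functions from the slide.
double heaviside(double x) { return x >= 0.0 ? 1.0 : 0.0; }       // 0 or 1 signal
double sigmoid  (double x) { return 1.0 / (1.0 + std::exp(-x)); } // saturates at 0 and 1
double tanh_act (double x) { return std::tanh(x); }               // symmetric sigmoid, range (-1, 1)
double linear   (double x) { return x; }                          // signal passed unchanged
```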

  9. Architecture • a 3-layer network with two hidden layers: 1:2:1:1 • 5 weights + 4 bias weights → #par = 9 • bias neurons are used instead of explicit thresholds • the network maps an input x to an output F(x); the hidden units use a symmetric sigmoid, the output unit a linear function (a parameter-counting helper is sketched below)
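A small helper illustrating the parameter count for a fully connected feed-forward network with bias neurons; for the 1:2:1:1 architecture it gives 9, as quoted on the slide (a sketch, not GrANNet code):

```cpp
#include <vector>

// Number of parameters of a fully connected feed-forward network with bias
// neurons: for layer sizes n_0 : n_1 : ... : n_L,
//   #par = sum_l ( n_{l-1} * n_l + n_l )   (weights + biases)
// Example: 1:2:1:1 gives (1*2+2) + (2*1+1) + (1*1+1) = 4 + 3 + 2 = 9.
int count_parameters(const std::vector<int>& layers)
{
    int npar = 0;
    for (std::size_t l = 1; l < layers.size(); ++l)
        npar += layers[l - 1] * layers[l] + layers[l];
    return npar;
}
```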

  10. Neural Networks – Function Approximation • The universal approximation theorem for neural networks states that every continuous function that maps intervals of real numbers to some output interval of real numbers can be approximated arbitrarily closely by a multi-layer perceptron with just one hidden layer. This result holds only for restricted classes of activation functions, e.g. for the sigmoidal functions. (Wikipedia.org)

  11. A map from one vector space to another, e.g. (x, Q²) → F₂(x, Q²).

  12. Supervised Learning • Propose the error function • in principle any continuous function which has a global minimum • motivated by statistics: the standard error function, χ², etc. • Consider a set of data • Train the given NN by showing it the data → minimize the error function • back-propagation algorithms • an iterative procedure which fixes the weights (a χ²-type error function is sketched below)
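A sketch of a standard χ²-type error function of the kind mentioned above, assuming uncorrelated experimental errors σ_n (illustrative, not the NetMaker/GrANNet implementation):

```cpp
#include <vector>

// chi^2-type error over a data set: network predictions y_n versus targets
// t_n with uncertainties sigma_n,  E = 1/2 sum_n (y_n - t_n)^2 / sigma_n^2.
double chi2_error(const std::vector<double>& y,
                  const std::vector<double>& t,
                  const std::vector<double>& sigma)
{
    double E = 0.0;
    for (std::size_t n = 0; n < y.size(); ++n) {
        const double r = (y[n] - t[n]) / sigma[n];
        E += 0.5 * r * r;
    }
    return E;
}
```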

  13. Learning Algorithms • gradient algorithms: • gradient descent • RPROP (Riedmiller & Braun) • conjugate gradients • algorithms that look at the curvature: • QuickProp (Fahlman) • Levenberg-Marquardt (Hessian) • Newton's method (Hessian) • Monte Carlo algorithms (based on Markov chains); the simplest update is sketched below
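For orientation, a sketch of the simplest entry in the list, the plain gradient-descent weight update; the learning-rate value is purely illustrative:

```cpp
#include <vector>

// Plain gradient descent: each weight moves against the gradient of the
// error function with respect to that weight; eta is the learning rate.
void gradient_descent_step(std::vector<double>& weights,
                           const std::vector<double>& gradient,   // dE/dw from back-propagation
                           double eta = 0.01)
{
    for (std::size_t i = 0; i < weights.size(); ++i)
        weights[i] -= eta * gradient[i];
}
```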

  14. Overfitting • More complex models describe the data better but lose generality • bias-variance trade-off • overfitting → large values of the weights • compare with a test set (must be twice as large as the original set) • regularization → an additional penalty term in the error function, E = E_D + α E_W, where α is the decay rate (see the sketch below)
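A sketch of the regularized error with a quadratic weight-decay penalty, E = E_D + α E_W with E_W = ½ Σ_i w_i²; this standard form is an assumption consistent with slide 27, not necessarily the exact choice used in the talk:

```cpp
#include <vector>

// Regularized error: data term plus weight-decay penalty,
//   E = E_D + alpha * E_W,   E_W = 1/2 * sum_i w_i^2,
// where alpha is the decay rate.
double regularized_error(double E_data,
                         const std::vector<double>& weights,
                         double alpha)
{
    double E_W = 0.0;
    for (double w : weights) E_W += 0.5 * w * w;
    return E_data + alpha * E_W;
}
```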

  15. Data still more precise than theory • Nature is probed through observation and measurements: the physics is given directly by the data • most models (QED, nonperturbative QCD, …) contain free parameters, and a fully nonparametric description raises the question of what happens to the physics • idea: a model-independent analysis → build a statistical model, subject only to some general constraints: data → predictions and their uncertainty (e.g. PDFs)

  16. Fitting data with Artificial Neural Networks 'The goal of the network training is not to learn an exact representation of the training data itself, but rather to build a statistical model of the process which generates the data' C. Bishop, 'Neural Networks for Pattern Recognition'

  17. Parton distribution functions with NN: a fit of F₂(x, Q²). Some method is needed, but…

  18. Parton Distribution Functions S. Forte, L. Garrido, J. I. Latorre and A. Piccione, JHEP 0205 (2002) 062 • A kind of model-independent analysis of the data • Construction of the probability density P[G(Q²)] in the space of the structure functions • But in reality Forte et al. did this in practice with only one neural-network architecture → a probability density in the space of parameters of one particular NN

  19. Generating Monte Carlo pseudo-data The idea comes from W. T. Giele and S. Keller. Train Nrep neural networks, one for each set of Ndat pseudo-data points. The Nrep trained neural networks → provide a representation of the probability measure in the space of the structure functions (a replica-generation sketch follows below)
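A sketch of Giele-Keller style replica generation, assuming uncorrelated Gaussian errors (correlated systematics would require the full covariance matrix); function and parameter names are illustrative:

```cpp
#include <random>
#include <vector>

// Generate N_rep Monte Carlo replicas of a data set: each pseudo-data point
// is the measured value smeared with a Gaussian of width equal to its
// experimental error (correlations between points are neglected here).
std::vector<std::vector<double>>
make_replicas(const std::vector<double>& data,
              const std::vector<double>& sigma,
              int n_rep, unsigned seed = 12345)
{
    std::mt19937 gen(seed);
    std::normal_distribution<double> gauss(0.0, 1.0);

    std::vector<std::vector<double>> replicas(n_rep, data);
    for (auto& rep : replicas)
        for (std::size_t n = 0; n < rep.size(); ++n)
            rep[n] += sigma[n] * gauss(gen);
    return replicas;
}
```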

  20. Uncertainty and correlation of the predictions, estimated from the ensemble of trained replicas (see the sketch below).
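A sketch of how the ensemble of Nrep trained networks yields a central value, an uncertainty, and the correlation between predictions at two points (illustrative names, simple moments over the replicas):

```cpp
#include <cmath>
#include <vector>

// Ensemble estimates over N_rep trained networks evaluated at two points:
// means, standard deviations, and the correlation between the two predictions.
struct EnsembleStats { double mean1, mean2, sig1, sig2, corr; };

EnsembleStats ensemble_stats(const std::vector<double>& f1,   // replica predictions at point 1
                             const std::vector<double>& f2)   // replica predictions at point 2
{
    const double N = static_cast<double>(f1.size());
    double m1 = 0.0, m2 = 0.0, v1 = 0.0, v2 = 0.0, cov = 0.0;
    for (std::size_t k = 0; k < f1.size(); ++k) { m1 += f1[k]; m2 += f2[k]; }
    m1 /= N;  m2 /= N;
    for (std::size_t k = 0; k < f1.size(); ++k) {
        v1  += (f1[k] - m1) * (f1[k] - m1);
        v2  += (f2[k] - m2) * (f2[k] - m2);
        cov += (f1[k] - m1) * (f2[k] - m2);
    }
    v1 /= N;  v2 /= N;  cov /= N;
    return { m1, m2, std::sqrt(v1), std::sqrt(v2), cov / (std::sqrt(v1) * std::sqrt(v2)) };
}
```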

  21. 10, 100 and 1000 replicas

  22. Training length on 30 data points: too short, long enough, too long → overfitting.

  23. My criticism • Does the simultaneous use of artificial data and the χ² error function overestimate the uncertainty? • Other NN architectures are not discussed • Problems with overfitting (a test set is needed) • A relatively simple approach compared with present techniques in NN computing • The uncertainty of the model predictions should be generated by the probability distribution obtained for the model rather than by the data itself

  24. GrANNet – Why? • I stole some ideas from FANN • a C++ library, easy to use • user-defined error function (any you wish) • easy access to units and their weights • several ways of initializing a network of a given architecture • Bayesian learning • Main objects: • classes: NeuralNetwork, Unit • learning algorithms: so far QuickProp, Rprop+, Rprop-, iRprop-, iRprop+, … • network response uncertainty (based on the Hessian) • some simple restarting and stopping solutions

  25. Structure of GrANNet • Libraries: • Unit class • Neural_Network class • Activation (activation and error function structures) • learning algorithms: RProp+, RProp-, iRProp+, iRProp-, QuickProp, BackProp • generatormt • TNT inverse matrix package

  26. Bayesian Approach ‘common sense reduced to calculations’

  27. Bayesian Framework for BackProp NN (MacKay, Bishop, …) • objective criteria for comparing alternative network solutions, in particular with different architectures • objective criteria for setting the decay rate α • objective choice of the regularizing function E_W • comparison with test data is not required.

  28. Notation and Conventions • a data point: input vector x, target vector t • network response y(x, w) • data set D • number of data points N • number of weights W

  29. Model Classification • A collection of models, H1, H2, …, Hk • We believe the models are ranked by prior probabilities P(H1), P(H2), …, P(Hk) (summing to 1) • After observing the data D → Bayes' rule: P(Hi|D) = P(D|Hi) P(Hi) / P(D), where P(D|Hi) is the probability of D given Hi and P(D) is the normalizing constant • Usually at the beginning P(H1) = P(H2) = … = P(Hk) (a small numerical sketch follows below)
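A small sketch of this classification step: normalize evidence times prior over the collection of models (illustrative names):

```cpp
#include <vector>

// Bayes' rule for model classification:
//   P(H_i | D) = P(D | H_i) * P(H_i) / sum_k P(D | H_k) * P(H_k)
std::vector<double> posterior_models(const std::vector<double>& evidence,  // P(D | H_i)
                                     const std::vector<double>& prior)     // P(H_i), summing to 1
{
    double norm = 0.0;                                    // the normalizing constant P(D)
    for (std::size_t i = 0; i < evidence.size(); ++i)
        norm += evidence[i] * prior[i];

    std::vector<double> posterior(evidence.size());
    for (std::size_t i = 0; i < evidence.size(); ++i)
        posterior[i] = evidence[i] * prior[i] / norm;
    return posterior;
}
```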

  30. Single Model Statistics • Assume that model Hi is the correct one • The neural network A with weights w is considered • Task 1: assuming some prior probability of w, construct the posterior after including the data • Task 2: consider the space of hypotheses and construct the evidence for them

  31. Hierarchy

  32. Constructing the prior and posterior functions: a probability distribution over the weights!!! Posterior probability ∝ likelihood × prior; the prior is centred at w0 and the posterior peaks at wMP.

  33. Computing the posterior: a Gaussian approximation around wMP with the Hessian of the error function; the covariance matrix of the weights is the inverse Hessian (see the sketch below).
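A sketch of how the inverse Hessian propagates to an uncertainty on the network response, σ_y² = gᵀ A⁻¹ g with g_i = ∂y/∂w_i at wMP; it assumes the inverse Hessian is already available (e.g. from the TNT package listed on slide 25):

```cpp
#include <cmath>
#include <vector>

// Gaussian (Laplace) approximation around w_MP: the covariance of the weights
// is the inverse Hessian A^{-1} of the error function.  The uncertainty of the
// network response y(x) is then  sigma_y^2 = g^T A^{-1} g,  with g_i = dy/dw_i.
double output_sigma(const std::vector<double>& g,                   // gradient dy/dw at w_MP
                    const std::vector<std::vector<double>>& A_inv)  // inverse Hessian
{
    double var = 0.0;
    for (std::size_t i = 0; i < g.size(); ++i)
        for (std::size_t j = 0; j < g.size(); ++j)
            var += g[i] * A_inv[i][j] * g[j];
    return std::sqrt(var);
}
```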

  34. How to fix the proper α? • Two ideas: • evidence approximation (MacKay): find wMP, then find αMP • hierarchical approach: perform the integrals over α analytically • the two agree if the evidence is sharply peaked!!!

  35. Getting αMP: γ = Σ_i λ_i/(λ_i + α) is the effective number of well-determined parameters (λ_i are the eigenvalues of the data-term Hessian), and α is re-estimated as αMP = γ/(2 E_W) in an iterative procedure during training (see the sketch below).
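A sketch of one iteration of this evidence-approximation update, assuming the eigenvalues λ_i of the data-term Hessian at wMP are available (illustrative, not the GrANNet implementation):

```cpp
#include <vector>

// One step of the evidence-approximation update for the decay rate alpha:
//   gamma     = sum_i lambda_i / (lambda_i + alpha)   (well-determined parameters)
//   alpha_new = gamma / (2 * E_W)
// where lambda_i are the eigenvalues of the Hessian of the data term at w_MP
// and E_W = 1/2 * sum_i w_i^2 is evaluated at w_MP.
double update_alpha(const std::vector<double>& lambda,
                    double alpha, double E_W)
{
    double gamma = 0.0;
    for (double l : lambda) gamma += l / (l + alpha);
    return gamma / (2.0 * E_W);
}
```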

  36. Bayesian Model Comparison – Occam Factor • evidence ≈ best-fit likelihood × Occam factor • the log of the Occam factor → the amount of information we gain after the data have arrived • large Occam factor → complex models: larger accessible phase space (large prior range relative to the posterior) • small Occam factor → simple models: small accessible phase space

  37. Occam Factor – Penalty Term: the evidence combines the misfit of the interpolant to the data (here a fit of F2(x, Q²)) with the Occam factor and a symmetry factor (tanh(·) networks are unchanged under flipping the sign of the weights); see the sketch below.
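A schematic sketch of the log evidence with the pieces named on this slide: misfit term, Occam factor, and a symmetry factor, taken here as 2^h·h! for h tanh hidden units (the permutation part h! is an assumption beyond the sign flips mentioned on the slide; the overall form follows MacKay's Gaussian approximation):

```cpp
#include <cmath>

// Schematic log evidence in the Gaussian approximation:
//   ln Ev ~ -beta*E_D(w_MP)                                      (misfit of the interpolant)
//           - alpha*E_W(w_MP) - 0.5*ln det A
//           + (W/2) ln alpha + (N/2) ln beta - (N/2) ln(2*pi)    (Occam factor)
//           + ln(2^h * h!)                                       (symmetry factor)
double log_evidence(double E_D, double E_W, double log_det_A,
                    double alpha, double beta,
                    int W, int N, int h)
{
    const double PI = 3.141592653589793;
    double sym = h * std::log(2.0);                              // sign flips: 2^h ...
    for (int k = 2; k <= h; ++k) sym += std::log(double(k));     // ... times h! permutations
    return -beta * E_D - alpha * E_W - 0.5 * log_det_A
           + 0.5 * W * std::log(alpha) + 0.5 * N * std::log(beta)
           - 0.5 * N * std::log(2.0 * PI)
           + sym;
}
```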

  38. The 1-2-1 network is preferred by the data (the 'Occam hill').

  39. The 1-3-1 network is preferred by the data.

  40. The 1-3-1 network seems to be preferred by the data.
