ml4

Neural Networks! (MLPs) CS 445/545

What can I do with a NN?
A short list of applications:
(*) Binary classification, 1-in-K classification, regression
(*) General pattern recognition / statistical learning
(*) Character recognition, facial recognition
(*) Computer vision: image classification, localization, scene recognition, captioning
(*) Signal processing: noise suppression, signal analysis
(*) Data compression
(*) NLP: machine translation, sentiment analysis
(*) Finance: statistical arbitrage, risk analysis
(*) AI: Q-learning (reinforcement learning)
(*) Medicine: diagnosis, imaging, genomics
(*) Law: information retrieval
(*) Computational creativity applications

A bit of history
• 1960s: Rosenblatt proved that the perceptron learning rule converges to correct weights in a finite number of steps, provided the training examples are linearly separable.
• 1969: Minsky and Papert proved that perceptrons cannot represent non-linearly separable target functions.
• However, they showed that adding a fully connected hidden layer makes the network more powerful.
  – I.e., multi-layer neural networks can represent non-linear decision surfaces.
• Later it was shown that by using continuous activation functions (rather than thresholds), a fully connected network with a single hidden layer can in principle represent any function.
• 1986: "rediscovery" of the backprop algorithm: Hinton et al.

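Linear separability
Hyperplane in 2D: $w_1 x_1 + w_2 x_2 + w_0 = 0$, or equivalently $x_2 = -\frac{w_1}{w_2} x_1 - \frac{w_0}{w_2}$
[Figure: separating line drawn in the plane whose axes are Feature 1 and Feature 2]
A perceptron can separate data that is linearly separable.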
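As a small illustration of the hyperplane test, the sketch below classifies 2D points by the sign of w1·x1 + w2·x2 + w0 and runs the perceptron learning rule on a linearly separable toy problem. The data set (logical AND), the learning rate, and all function names are assumptions chosen for illustration; none of them come from the slides.

```python
def predict(w, x):
    """Return 1 if x lies on the positive side of the hyperplane, else 0.

    w = (w0, w1, w2), where w0 is the bias weight; x = (x1, x2).
    """
    return 1 if w[0] + w[1] * x[0] + w[2] * x[1] > 0 else 0


def train_perceptron(data, eta=0.1, max_epochs=100):
    """Perceptron learning rule; stops after an epoch with no mistakes."""
    w = [0.0, 0.0, 0.0]
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in data:
            err = t - predict(w, x)
            if err != 0:
                mistakes += 1
                w[0] += eta * err          # bias sees a constant input of 1
                w[1] += eta * err * x[0]
                w[2] += eta * err * x[1]
        if mistakes == 0:
            break
    return w


# Hypothetical toy data: logical AND, which is linearly separable.
AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights = train_perceptron(AND_DATA)
```

On data that is not linearly separable (XOR, for example) the same loop would simply run until `max_epochs` without reaching zero mistakes.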

Multi-layer neural network example
Decision regions of a multilayer feedforward network. (From T. M. Mitchell, Machine Learning)
The network was trained to recognize 1 of 10 vowel sounds occurring in the context "h_d" (e.g., "had", "hid").
The network input consists of two parameters, F1 and F2, obtained from a spectral analysis of the sound. The 10 network outputs correspond to the 10 possible vowel sounds.

• Good news: Adding a hidden layer allows more target functions to be represented.
• Bad news: No algorithm for learning in multi-layered networks, and no convergence theorem!
• Quote from Minsky and Papert's book, Perceptrons (1969):
  "[The perceptron] has many features to attract attention: its linearity; its intriguing learning theorem; its clear paradigmatic simplicity as a kind of parallel computation. There is no reason to suppose that any of these virtues carry over to the many-layered version. Nevertheless, we consider it to be an important research problem to elucidate (or reject) our intuitive judgment that the extension is sterile."

Two major problems they saw were:
• How can the learning algorithm apportion credit (or blame) to individual weights for incorrect classifications depending on a (sometimes) large number of weights?
• How can such a network learn useful higher-order features?
• Good news: Successful credit-apportionment learning algorithms were developed soon afterwards (e.g., back-propagation).
• Bad news: However, in multi-layer networks, there is no guarantee of convergence to a minimal-error weight vector.
• But in practice, multi-layer networks often work very well.

Summary
• Perceptrons can be 100% accurate only on linearly separable problems.
• Multi-layer networks (often called multi-layer perceptrons, or MLPs) can in principle represent any target function.
• However, in multi-layer networks, there is no guarantee of convergence to a minimal-error weight vector.
• One can show, mathematically, that one hidden layer is sufficient to approximate any continuous function (on a bounded input domain) to arbitrary accuracy with a NN. This is known as the Universal Approximation Theorem (1989) (we say: "NNs are universal function approximators"); RNNs are Turing complete.

A "two"-layer neural network
[Figure: inputs (activations represent the feature vector for one training example), hidden layer (internal representation), output layer (activation represents classification)]
• Input layer: contains the units (artificial neurons) that receive input from the outside world, which the network will learn from, recognize, or otherwise process.
• Output layer: contains the units that report the network's response to the input, i.e., what it has learned about the task.
• Hidden layer: these units sit between the input and output layers. The job of the hidden layer is to transform the input into something the output units can use.
Most neural networks are fully connected, meaning that each hidden neuron is connected to every neuron in the previous (input) layer and to every neuron in the next (output) layer.

Classification Pipeline

Different Types of Neural Networks
• Perceptron: a neural network whose input units connect directly to the output unit(s), with no hidden layers (e.g., two input units and one output unit). These are also known as "single-layer perceptrons".
• Radial Basis Function Network: similar to the feedforward neural network, except that a radial basis function is used as the activation function of the neurons.
• Multilayer Perceptron: these networks use one or more hidden layers of neurons, unlike the single-layer perceptron. They are also known as deep feedforward neural networks.
• Recurrent Neural Network: a type of neural network in which hidden layer neurons have self-connections. Recurrent neural networks possess memory. At any instant, a hidden layer neuron receives activation from the lower layer as well as its own previous activation value.
• Long/Short Term Memory Network (LSTM): a type of neural network in which a memory cell is incorporated inside the hidden layer neurons.
• Convolutional Neural Network: for an overview, see the blog post "Log Analytics with Machine Learning and Deep Learning".

Example: ALVINN
• (Pomerleau, 1993)
• ALVINN learns to drive an autonomous vehicle at normal speeds on public highways.
• Input: 30 x 32 grid of pixel intensities from camera

(Note: bias units and weights not shown.)
Each output unit corresponds to a particular steering direction. The most highly activated one gives the direction to steer.

Example: DeepMind (Deep Q-learning for Atari, 2014)

Activation functions
• Advantages of the sigmoid function: nonlinear, differentiable, has real-valued outputs, and approximates a threshold function.

Sigmoid activation function: $o = \sigma(\mathbf{w} \cdot \mathbf{x})$, where $\sigma(z) = \frac{1}{1 + e^{-z}}$

The derivative of the sigmoid activation function is easily expressed in terms of the function itself:
$\frac{d\sigma(z)}{dz} = \sigma(z)\,(1 - \sigma(z))$
This is useful in deriving the back-propagation algorithm.

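Derivation. Write $\sigma(z) = \frac{1}{1 + e^{-z}} = (1 + e^{-z})^{-1}$. Then

$\frac{d\sigma(z)}{dz} = \frac{d}{dz}(1 + e^{-z})^{-1} = -(1 + e^{-z})^{-2} \cdot \frac{d}{dz}(1 + e^{-z}) = -\frac{1}{(1 + e^{-z})^{2}} \cdot (-e^{-z}) = \frac{e^{-z}}{(1 + e^{-z})^{2}}$

On the other hand,

$\sigma(z)\,(1 - \sigma(z)) = \frac{1}{1 + e^{-z}} \left( 1 - \frac{1}{1 + e^{-z}} \right) = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \frac{e^{-z}}{(1 + e^{-z})^{2}}$

And thus the math Gods said… $\frac{d\sigma(z)}{dz} = \sigma(z)\,(1 - \sigma(z))$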
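The identity can also be checked numerically. The following is a minimal sketch, assuming nothing beyond the standard library: it compares σ(z)(1 − σ(z)) against a central finite-difference estimate of the derivative at a few arbitrarily chosen points. The function names and the step size are illustrative choices.

```python
import math


def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Derivative written in terms of the sigmoid itself."""
    s = sigmoid(z)
    return s * (1.0 - s)


def numerical_derivative(f, z, eps=1e-6):
    """Central finite-difference estimate of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)


# Largest disagreement between the closed form and the numerical estimate.
points = (-3.0, -1.0, 0.0, 0.5, 2.0)
max_gap = max(
    abs(sigmoid_prime(z) - numerical_derivative(sigmoid, z)) for z in points
)
```

The derivative peaks at z = 0, where σ(0) = 0.5 and so σ'(0) = 0.25.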

Neural network notation
[Figure: inputs (activations represent the feature vector for one training example), hidden layer (internal representation), output layer (activation represents classification)]
• $x_i$: activation of input node i.
• $h_j$: activation of hidden node j.
• $o_k$: activation of output node k.
• $w_{ji}$: weight from node i to node j.
• $\sigma$: sigmoid function.
For each node j in the hidden layer:
$h_j = \sigma\left( \sum_{i \in \text{input layer}} w_{ji} x_i + w_{j0} \right)$
For each node k in the output layer:
$o_k = \sigma\left( \sum_{j \in \text{hidden layer}} w_{kj} h_j + w_{k0} \right)$

Classification with a two-layer neural network ("Forward propagation")
Assume two-layer networks (i.e., one hidden layer):
1. Present input to the input layer.
2. Forward propagate the activations times the weights to each node in the hidden layer.
3. Apply the activation function (sigmoid) to the sum of weights times inputs at each hidden unit.
4. Forward propagate the activations times the weights from the hidden layer to the output layer.
5. Apply the activation function (sigmoid) to the sum of weights times inputs at each output unit.
6. Interpret the output layer as a classification.

Simple Example
[Figure: two-layer network with a bias input of 1 and inputs x1 = 0.4, x2 = 0.1. Input-to-hidden weights shown: .1, −.2, −.4, .3, .1, .2. Hidden-to-output weights shown: .1, −.5, −.2, −.1.]
Hidden layer: the two hidden units have activations 0.547 and 0.470.
Output layer: the two output units have activations 0.461 and 0.455.
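The forward pass for this example can be sketched in code. The wiring below is an assumption: the transcript preserves the weight values but not which edge each one belongs to, so the assignment used here is simply one that reproduces the activations on the slide (0.547, 0.470 and 0.461, 0.455). It also assumes the output units have no bias weight. The helper name `layer` is illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(weights, biases, inputs):
    """Activations of one fully connected sigmoid layer.

    weights[j][i] is the weight from input i to node j; biases[j] is w_j0.
    """
    return [
        sigmoid(b + sum(w * x for w, x in zip(row, inputs)))
        for row, b in zip(weights, biases)
    ]


x = [0.4, 0.1]  # inputs x1, x2 from the slide

# ASSUMED assignment of the slide's weight values to edges.
W_hidden = [[0.2, 0.1], [0.3, -0.4]]
b_hidden = [0.1, -0.2]
W_output = [[-0.2, -0.1], [0.1, -0.5]]
b_output = [0.0, 0.0]  # assumption: no bias into the output units

h = layer(W_hidden, b_hidden, x)   # rounds to 0.547 and 0.470
o = layer(W_output, b_output, h)   # rounds to 0.461 and 0.455
```

Taking the most highly activated output unit as the class is then a one-liner such as `max(range(len(o)), key=o.__getitem__)`.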

"Softmax" operation
Often used to turn output values into a probability distribution:
$y_{sm}(o_i) = \frac{e^{o_i}}{\sum_{k=1}^{K} e^{o_k}}$, where K is the number of output units.
[Figure: the same network as in the simple example; for outputs 0.461 and 0.455, $y_{sm}$ = .501 and $y_{sm}$ = .499]
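A minimal sketch of the softmax operation, applied to the two output activations from the slide. Subtracting the maximum before exponentiating is a common numerical-stability step that is not mentioned in the slides; it does not change the result.

```python
import math


def softmax(values):
    """Turn a list of real values into a probability distribution."""
    m = max(values)                       # stability shift (an added detail)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Output activations from the slide; the result is roughly 0.501 and 0.499.
probs = softmax([0.461, 0.455])
```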

What kinds of problems are suitable for neural networks?
• Have sufficient training data
• Long training times are acceptable
• Not necessary for humans to understand the learned target function or hypothesis

Advantages of neural networks
• Designed to be parallelized (e.g., split minibatches, use GPUs)
• Robust on noisy training data
• Fast to evaluate new examples

Training a multi-layer neural network
Repeat for a given number of epochs or until accuracy on training data is acceptable:
For each training example:
1. Present input to the input layer.
2. Forward propagate the activations times the weights to each node in the hidden layer.
3. Forward propagate the activations times the weights from the hidden layer to the output layer.
4. At each output unit, determine the error.
5. Run the back-propagation algorithm one layer at a time to update all weights in the network.

Training a multilayer neural network with back-propagation (stochastic gradient descent)
• Suppose a training example has the form (x, t) (i.e., both input and target are vectors).
• Error (or "loss") E is the sum-squared error over all output units:
  $E(\mathbf{w}) = \frac{1}{2} \sum_{k \in \text{output layer}} (t_k - o_k)^2$
• The goal of learning is to minimize the mean sum-squared error over the training set.
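For concreteness, here is a small sketch of the loss for one example and its mean over a set of examples. The function names and the numbers in the usage lines are made up for illustration.

```python
def sum_squared_error(targets, outputs):
    """E = 1/2 * sum over output units of (t_k - o_k)^2, for one example."""
    return 0.5 * sum((t - o) ** 2 for t, o in zip(targets, outputs))


def mean_sum_squared_error(all_targets, all_outputs):
    """Average of the per-example error over a set of examples."""
    errors = [sum_squared_error(t, o) for t, o in zip(all_targets, all_outputs)]
    return sum(errors) / len(errors)


# Hypothetical targets and network outputs for two examples.
e_single = sum_squared_error([1.0, 0.0], [0.8, 0.3])
e_mean = mean_sum_squared_error([[1.0, 0.0], [0.0, 1.0]],
                                [[0.8, 0.3], [0.0, 1.0]])
```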

Training a multilayer neural network with back-propagation (stochastic gradient descent)
• Idea: minimize the sum-of-squares error
  $E(\mathbf{w}) = \frac{1}{2} \sum_{k \in \text{output layer}} (t_k - o_k)^2$
  over the entire training data set.
• Note that we "tune" the parameters of the NN (the weights) during training.
The weights of the network are trained so that the error goes downhill until it reaches a local minimum, just like a ball rolling under gravity.

Geoffrey Hinton: NN training with MNIST

Aiva: AI Composed Music (2017)

Later in the slides we will derive the back-propagation equations (you can also find a derivation in the text).
The derivation can be somewhat challenging; however, you only need one basic tool to derive them: multivariate differentiation (e.g., chain rule, partial derivatives).
For now, let's just walk through the basic algorithm.

Backpropagation algorithm (Stochastic Gradient Descent)
• Initialize the network weights w to small random numbers (e.g., between −0.05 and 0.05).
• Until the termination condition is met, Do:
  – For each (x, t) in the training set, Do:
    1. Propagate the input forward:
       – Input x to the network and compute the activation $h_j$ of each hidden unit j.
       – Compute the activation $o_k$ of each output unit k.

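2. Calculate error terms
For each output unit k, calculate error term $\delta_k$:
$\delta_k = o_k (1 - o_k)(t_k - o_k)$
For each hidden unit j, calculate error term $\delta_j$:
$\delta_j = h_j (1 - h_j) \left( \sum_{k \in \text{output units}} w_{kj} \delta_k \right)$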

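3. Update weights
Hidden to output layer: for each weight $w_{kj}$,
$w_{kj} \leftarrow w_{kj} + \Delta w_{kj}$, where $\Delta w_{kj} = \eta \, \delta_k h_j$
Input to hidden layer: for each weight $w_{ji}$,
$w_{ji} \leftarrow w_{ji} + \Delta w_{ji}$, where $\Delta w_{ji} = \eta \, \delta_j x_i$
Here $\eta$ is the learning rate.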
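The three steps (forward pass, error terms, weight update) can be sketched as a single stochastic-gradient step for a network with one hidden layer. This is an illustrative sketch, not the course's reference implementation: the 2-2-1 network size, the learning rate of 0.5, the random seed, and the single training example are all assumptions, and the bias weight is stored at index 0 of each weight row.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(W_h, W_o, x):
    """Step 1. Index 0 of every weight row is the bias weight (w_j0, w_k0)."""
    h = [sigmoid(r[0] + sum(w * xi for w, xi in zip(r[1:], x))) for r in W_h]
    o = [sigmoid(r[0] + sum(w * hj for w, hj in zip(r[1:], h))) for r in W_o]
    return h, o


def sgd_step(W_h, W_o, x, t, eta):
    """One in-place weight update for a single training example (x, t)."""
    h, o = forward(W_h, W_o, x)
    # Step 2: error terms, output layer first, then hidden layer.
    delta_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    delta_h = [
        hj * (1 - hj) * sum(W_o[k][j + 1] * delta_o[k] for k in range(len(W_o)))
        for j, hj in enumerate(h)
    ]
    # Step 3: weight updates (a bias weight sees a constant input of 1).
    for k, row in enumerate(W_o):
        row[0] += eta * delta_o[k]
        for j, hj in enumerate(h):
            row[j + 1] += eta * delta_o[k] * hj
    for j, row in enumerate(W_h):
        row[0] += eta * delta_h[j]
        for i, xi in enumerate(x):
            row[i + 1] += eta * delta_h[j] * xi


def error(W_h, W_o, x, t):
    _, o = forward(W_h, W_o, x)
    return 0.5 * sum((tk - ok) ** 2 for tk, ok in zip(t, o))


random.seed(0)
# Hypothetical 2-2-1 network with small random initial weights.
W_h = [[random.uniform(-0.05, 0.05) for _ in range(3)] for _ in range(2)]
W_o = [[random.uniform(-0.05, 0.05) for _ in range(3)] for _ in range(1)]
x, t = [1.0, 0.0], [0.9]
before = error(W_h, W_o, x, t)
for _ in range(50):
    sgd_step(W_h, W_o, x, t, eta=0.5)
after = error(W_h, W_o, x, t)
```

Note that the hidden-layer error terms are computed from the output weights before those weights are modified, which is why both sets of error terms are calculated ahead of any update.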

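Backpropagation Algorithm (BP)
• Forwards phase: compute the activation of each neuron in the hidden layer(s) and outputs using:
  $h_j = \sigma\left( \sum_{i \in \text{input layer}} w_{ji} x_i + w_{j0} \right)$, $o_k = \sigma\left( \sum_{j \in \text{hidden layer}} w_{kj} h_j + w_{k0} \right)$
• Backwards pass:
  – Compute the error at the output using: $\delta_k = o_k (1 - o_k)(t_k - o_k)$
  – Compute the error at the hidden layer(s) using: $\delta_j = h_j (1 - h_j) \left( \sum_{k \in \text{output units}} w_{kj} \delta_k \right)$
  – Update the output layer weights using: $w_{kj} \leftarrow w_{kj} + \Delta w_{kj}$, where $\Delta w_{kj} = \eta \, \delta_k h_j$
  – Update the hidden layer weights using: $w_{ji} \leftarrow w_{ji} + \Delta w_{ji}$, where $\Delta w_{ji} = \eta \, \delta_j x_i$
• (If using sequential updating) randomize the order of the input vectors so that you don't train in exactly the same order each iteration.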

Training Time
• The aim is to balance generalization and memorization (minimizing the cost function on the training data is not necessarily a good idea).
• Use two (or three) disjoint sets:
  – Training and testing sets
  – Training, testing, and validation sets
• As long as the error on the held-out testing set decreases, training continues (unless the maximum number of iterations is reached).
• When that error begins to increase, the net is starting to memorize.
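The stopping rule described here is usually called early stopping. Below is a minimal sketch that, given the held-out error recorded after each epoch, returns the epoch whose weights should be kept. The `patience` parameter (how many non-improving epochs to tolerate) is an added assumption rather than something from the slides, and the error values are made up.

```python
def early_stopping_epoch(held_out_errors, patience=1):
    """Return the index of the epoch with the best held-out error seen
    before training would stop.

    Training stops once `patience` epochs have passed without improvement.
    """
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(held_out_errors):
        if err < best_error:
            best_epoch, best_error = epoch, err
        elif epoch - best_epoch >= patience:
            break
    return best_epoch


# Hypothetical per-epoch errors on the held-out set.
held_out_errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.39]
keep = early_stopping_epoch(held_out_errors, patience=2)
```

With a larger `patience` the loop would run on to the final epoch in this made-up series, which shows the trade-off: stopping at the first uptick is cheap but can miss a later improvement.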

Some Pros and Cons of BP
• Connectionism
  – Biological issues: no excitatory/inhibitory distinction as in real neurons; no global connection in an MLP; no backward propagation in real neurons
  – Useful in parallel hardware implementation
• Computational efficiency
  – A learning algorithm is said to be computationally efficient when its complexity is polynomial.
  – The BP algorithm is computationally efficient.
  – In an MLP with a total of W weights, its complexity is linear in W.
• Local minima
  – The presence of local minima is a significant issue, particularly for high-dimensional data.

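Batch (or "True") Gradient Descent: change weights only after averaging gradients from all training examples.
Weights from hidden units to output units:
$\Delta w_{kj} = \eta \, \frac{1}{M} \sum_{m=1}^{M} \delta_k^{m} h_j^{m}$
Weights from input units to hidden units:
$\Delta w_{ji} = \eta \, \frac{1}{M} \sum_{m=1}^{M} \delta_j^{m} x_i^{m}$
where M is the number of training examples and the superscript m marks quantities computed for the m-th example.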

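Mini-Batch Gradient Descent: change weights only after averaging gradients from a subset of B training examples.
At each iteration t: get the next subset of B training examples, $B_t$, until all examples have been processed.
Weights from hidden units to output units:
$\Delta w_{kj} = \eta \, \frac{1}{B} \sum_{m \in B_t} \delta_k^{m} h_j^{m}$
Weights from input units to hidden units:
$\Delta w_{ji} = \eta \, \frac{1}{B} \sum_{m \in B_t} \delta_j^{m} x_i^{m}$
where the superscript m marks quantities computed for the m-th example.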
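The bookkeeping for mini-batches can be sketched separately from the network itself. The two helpers below are illustrative (their names and the flat-list representation of a gradient are assumptions): one splits the data into disjoint subsets of size B, the other averages per-example gradients before a single weight update is applied.

```python
import random


def minibatches(examples, batch_size, rng):
    """Yield shuffled, disjoint subsets that cover every example once."""
    order = list(examples)
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]


def average_gradient(per_example_gradients):
    """Element-wise mean of equally sized gradient vectors (flat lists)."""
    n = len(per_example_gradients)
    size = len(per_example_gradients[0])
    return [sum(g[i] for g in per_example_gradients) / n for i in range(size)]


rng = random.Random(0)
# Ten hypothetical example indices split into batches of B = 4.
batches = list(minibatches(range(10), batch_size=4, rng=rng))
mean_grad = average_gradient([[1.0, 2.0], [3.0, 4.0]])
```

The last batch is smaller when the number of examples is not a multiple of B; setting B to the full data set size recovers batch gradient descent, and B = 1 recovers the stochastic version.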

Local Minima, Momentum, etc.
• Recall that BP is an instance of "hill climbing" (e.g., gradient descent). With non-convex problems we are not guaranteed to settle into a global minimum.
• If we think of the analogy of a ball rolling down a hill, we can consider giving the ball some "weight" by implementing a momentum term.
• The purpose of the momentum term is to reduce the chance of getting "stuck" in a local minimum (i.e., a "valley") and to avoid performance oscillations during training.

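Momentum
Introduce a momentum term, in which the change in weight depends on the past weight change:
(hidden-to-output) $\Delta w_{kj}^{t} = \eta \, \delta_k h_j + \alpha \, \Delta w_{kj}^{t-1}$
(input-to-hidden) $\Delta w_{ji}^{t} = \eta \, \delta_j x_i + \alpha \, \Delta w_{ji}^{t-1}$
where t is the iteration through the main loop of back-propagation. α is a parameter between 0 and 1; α determines the "strength" of the momentum term.
The idea is to keep weight changes moving in the same direction.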

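Update weights, with momentum
Hidden to output layer: for each weight $w_{kj}$,
$w_{kj} \leftarrow w_{kj} + \Delta w_{kj}^{t}$, where $\Delta w_{kj}^{t} = \eta \, \delta_k h_j + \alpha \, \Delta w_{kj}^{t-1}$
Input to hidden layer: for each weight $w_{ji}$,
$w_{ji} \leftarrow w_{ji} + \Delta w_{ji}^{t}$, where $\Delta w_{ji}^{t} = \eta \, \delta_j x_i + \alpha \, \Delta w_{ji}^{t-1}$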
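A minimal sketch of the momentum update for a single weight. The value α = 0.9 and the constant gradient step of 0.01 are assumptions picked to make the effect visible; the function name is illustrative.

```python
def momentum_update(weight, gradient_step, previous_change, alpha=0.9):
    """Apply one momentum update to a single weight.

    gradient_step is the plain update eta * delta * input;
    previous_change is the change applied at iteration t - 1.
    Returns the new weight and the change just applied.
    """
    change = gradient_step + alpha * previous_change
    return weight + change, change


# Three iterations with the same (hypothetical) gradient step.
w, change = 0.1, 0.0
history = []
for _ in range(3):
    w, change = momentum_update(w, gradient_step=0.01, previous_change=change)
    history.append(change)
```

Because the gradient step keeps pointing the same way, each applied change is larger than the last (0.01, then 0.019, then 0.0271), which is the "keep moving in the same direction" behaviour described above.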

Backprop Example

Training set: input (x1, x2) = (1, 0), label 0.9
Test set: input (1, 1), label 0.8; input (0, 1), label −0.3
[Figure: network with inputs x1, x2 and a bias unit, hidden units h1, h2 and a bias unit, and one output unit o1; every weight is initialized to .1. The training example x1 = 1, x2 = 0 is presented with target .9.]
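One pass of the algorithm on this training example can be written out as plain arithmetic. The learning rate is not given in the transcript, so η = 0.2 below is an assumption, as is the choice to omit momentum; the variable names are illustrative. Only a few representative weights are updated to keep the sketch short.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


x1, x2, target = 1.0, 0.0, 0.9   # training example from the slide
w = 0.1                          # every weight, biases included, starts at .1
eta = 0.2                        # ASSUMED learning rate (not on the slide)

# Forward pass. Both hidden units see identical weights, so h1 == h2.
h1 = sigmoid(w * 1.0 + w * x1 + w * x2)   # about 0.5498
h2 = sigmoid(w * 1.0 + w * x1 + w * x2)
o1 = sigmoid(w * 1.0 + w * h1 + w * h2)   # about 0.5523

# Error terms.
delta_o = o1 * (1.0 - o1) * (target - o1)
delta_h1 = h1 * (1.0 - h1) * (w * delta_o)
delta_h2 = h2 * (1.0 - h2) * (w * delta_o)

# Representative weight updates.
new_w_o_bias = w + eta * delta_o * 1.0
new_w_o_h1 = w + eta * delta_o * h1
new_w_h1_x1 = w + eta * delta_h1 * x1
new_w_h1_x2 = w + eta * delta_h1 * x2     # unchanged, because x2 = 0
```

Since the output (about 0.55) is below the target of 0.9, the output error term is positive and the weights feeding active units increase, while weights attached to the zero input x2 do not move.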
