ml4
ppt on machine learning
Neural Networks! (MLPs) CS 445/545
What can I do with a NN? A short list of applications:
• Binary classification, 1-in-K classification, regression
• General pattern recognition / statistical learning
• Character recognition, facial recognition
• Computer vision: image classification, localization, scene recognition, captioning
• Signal processing: noise suppression, signal analysis
• Data compression
• NLP: machine translation, sentiment analysis
• Finance: statistical arbitrage, risk analysis
• AI: Q-learning (reinforcement learning)
• Medicine: diagnosis, imaging, genomics
• Law: information retrieval
• Computational creativity applications
A bit of history
• 1960s: Rosenblatt proved that the perceptron learning rule converges to correct weights in a finite number of steps, provided the training examples are linearly separable.
• 1969: Minsky and Papert proved that perceptrons cannot represent non-linearly separable target functions.
• However, they showed that adding a fully connected hidden layer makes the network more powerful.
  – I.e., multi-layer neural networks can represent non-linear decision surfaces.
• Later it was shown that by using continuous activation functions (rather than thresholds), a fully connected network with a single hidden layer can in principle represent any function.
• 1986: “Rediscovery” of the backprop algorithm: Hinton et al.
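Linear separability
Hyperplane in 2D: $w_1 x_1 + w_2 x_2 + w_0 = 0$, i.e. $x_2 = -\frac{w_1}{w_2} x_1 - \frac{w_0}{w_2}$
[Figure: 2D plot with axes Feature 1 and Feature 2 and a separating line.]
A perceptron can separate data that is linearly separable.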
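As a rough illustration of the 2D hyperplane, the sketch below thresholds $w_1 x_1 + w_2 x_2 + w_0$ and evaluates the boundary line. The function names and the AND weights ($w_0 = -1.5$, $w_1 = w_2 = 1$) are assumptions chosen for the example, not values from the slides.

```python
def perceptron_output(w0, w1, w2, x1, x2):
    """Threshold unit: 1 on the positive side of the line, else 0."""
    return 1 if w1 * x1 + w2 * x2 + w0 > 0 else 0

def boundary_x2(w0, w1, w2, x1):
    """x2 coordinate of the decision line at a given x1 (requires w2 != 0)."""
    return -(w1 / w2) * x1 - w0 / w2

# Hypothetical weights that separate logical AND, a linearly separable problem.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_outputs = [perceptron_output(-1.5, 1.0, 1.0, a, b) for a, b in points]
line_at_half = boundary_x2(-1.5, 1.0, 1.0, 0.5)
```

No single choice of weights does the same for XOR, which is the kind of non-linearly separable target the history slide refers to.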
Multi-layer neural network example
Decision regions of a multilayer feedforward network. (From T. M. Mitchell, Machine Learning.)
The network was trained to recognize 1 of 10 vowel sounds occurring in the context “h_d” (e.g., “had”, “hid”).
The network input consists of two parameters, F1 and F2, obtained from a spectral analysis of the sound. The 10 network outputs correspond to the 10 possible vowel sounds.
• Good news: Adding a hidden layer allows more target functions to be represented.
• Bad news: No algorithm for learning in multi-layered networks, and no convergence theorem!
• Quote from Minsky and Papert’s book, Perceptrons (1969):
“[The perceptron] has many features to attract attention: its linearity; its intriguing learning theorem; its clear paradigmatic simplicity as a kind of parallel computation. There is no reason to suppose that any of these virtues carry over to the many-layered version. Nevertheless, we consider it to be an important research problem to elucidate (or reject) our intuitive judgment that the extension is sterile.”
Two major problems they saw were:
• How can the learning algorithm apportion credit (or blame) to individual weights for incorrect classifications depending on a (sometimes) large number of weights?
• How can such a network learn useful higher-order features?
• Good news: Successful credit-apportionment learning algorithms were developed soon afterwards (e.g., back-propagation).
• Bad news: However, in multi-layer networks, there is no guarantee of convergence to a minimal-error weight vector.
• But in practice, multi-layer networks often work very well.
Summary
• Perceptrons can be 100% accurate only on linearly separable problems.
• Multi-layer networks (often called multi-layer perceptrons, or MLPs) can represent any target function.
• However, in multi-layer networks, there is no guarantee of convergence to a minimal-error weight vector.
• One can show, mathematically, that one hidden layer is sufficient to approximate any continuous function to arbitrary accuracy with a NN, given enough hidden units. This is known as the Universal Approximation Theorem (1989) (we say: “NNs are universal function approximators”); RNNs are Turing complete.
A “two”-layer neural network
[Figure: network diagram with three rows of nodes. Inputs: activations represent the feature vector for one training example. Hidden layer: internal representation. Output layer: activation represents classification.]
• Input layer: contains the units (artificial neurons) that receive input from the outside world, which the network will learn from, recognize, or otherwise process.
• Output layer: contains the units that report the network's response, i.e., what it has learned about the task.
• Hidden layer: these units sit between the input and output layers. The job of the hidden layer is to transform the input into something the output units can use.
Most neural networks are fully connected: each hidden neuron is connected to every neuron in the previous layer (input) and to every neuron in the next layer (output).
Different Types of Neural Networks
• Perceptron: a neural network whose input units connect directly to the output unit (e.g., two input units and one output unit), with no hidden layers. These are also known as “single-layer perceptrons”.
• Radial Basis Function Network: similar to the feedforward neural network, except that a radial basis function is used as the activation function of the neurons.
• Multilayer Perceptron: these networks use one or more hidden layers of neurons, unlike the single-layer perceptron. With several hidden layers they are also known as deep feedforward neural networks.
• Recurrent Neural Network: a type of neural network in which hidden-layer neurons have self-connections. Recurrent neural networks possess memory. At any instant, a hidden-layer neuron receives activation from the lower layer as well as its own previous activation value.
• Long Short-Term Memory Network (LSTM): a type of neural network in which a memory cell is incorporated inside the hidden-layer neurons.
• Convolutional Neural Network: a type of neural network whose layers apply learned convolution filters across the input; commonly used for images.
Example: ALVINN (Pomerleau, 1993)
• ALVINN learns to drive an autonomous vehicle at normal speeds on public highways.
• Input: 30 x 32 grid of pixel intensities from camera
(Note: bias units and weights not shown.) Each output unit corresponds to a particular steering direction. The most highly activated one gives the direction to steer.
Example: DeepMind (Deep Q-learning for Atari, 2014)
Activation functions
• Advantages of the sigmoid function: nonlinear, differentiable, has real-valued outputs, and approximates a threshold function.
Sigmoid activation function: $o = \sigma(\mathbf{w} \cdot \mathbf{x})$, where $\sigma(z) = \frac{1}{1 + e^{-z}}$
The derivative of the sigmoid activation function is easily expressed in terms of the function itself:
$\frac{d\sigma(z)}{dz} = \sigma(z)(1 - \sigma(z))$
This is useful in deriving the back-propagation algorithm.
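Derivation:
$\sigma(z) = \frac{1}{1 + e^{-z}} = (1 + e^{-z})^{-1}$
$\frac{d\sigma(z)}{dz} = -1 \cdot (1 + e^{-z})^{-2} \cdot \frac{d}{dz}(1 + e^{-z}) = -\frac{1}{(1 + e^{-z})^{2}} \cdot (-e^{-z}) = \frac{e^{-z}}{(1 + e^{-z})^{2}}$
$\sigma(z)(1 - \sigma(z)) = \frac{1}{1 + e^{-z}} \left(1 - \frac{1}{1 + e^{-z}}\right) = \frac{1 + e^{-z}}{(1 + e^{-z})^{2}} - \frac{1}{(1 + e^{-z})^{2}} = \frac{e^{-z}}{(1 + e^{-z})^{2}}$
And thus the math Gods said… $\frac{d\sigma(z)}{dz} = \sigma(z)(1 - \sigma(z))$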
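A quick numerical sanity check of the identity, as a minimal sketch: it compares $\sigma(z)(1 - \sigma(z))$ with a central finite difference at a few arbitrarily chosen points. The helper names and the step size `eps` are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    """Closed form: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def numerical_derivative(f, z, eps=1e-6):
    """Central finite-difference approximation of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

# Largest disagreement between the two over a handful of test points.
max_gap = max(
    abs(sigmoid_derivative(z) - numerical_derivative(sigmoid, z))
    for z in [-3.0, -1.0, 0.0, 0.5, 2.0]
)
```

The derivative peaks at $z = 0$, where it equals 0.25.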
Neural network notation
[Figure: network diagram. Inputs: activations represent the feature vector for one training example. Hidden layer: internal representation. Output layer: activation represents classification.]
• $x_i$: activation of input node i.
• $h_j$: activation of hidden node j.
• $o_k$: activation of output node k.
• $w_{ji}$: weight from node i to node j.
• $\sigma$: sigmoid function.
For each node j in the hidden layer:
$h_j = \sigma\left(\sum_{i \in \text{input layer}} w_{ji} x_i + w_{j0}\right)$
For each node k in the output layer:
$o_k = \sigma\left(\sum_{j \in \text{hidden layer}} w_{kj} h_j + w_{k0}\right)$
Classification with a two-layer neural network (“Forward propagation”)
Assume two-layer networks (i.e., one hidden layer):
1. Present input to the input layer.
2. Forward propagate the activations times the weights to each node in the hidden layer.
3. Apply the activation function (sigmoid) to the sum of weights times inputs to each hidden unit.
4. Forward propagate the activations times weights from the hidden layer to the output layer.
5. Apply the activation function (sigmoid) to the sum of weights times inputs to each output unit.
6. Interpret the output layer as a classification.
Simple Example
Input: $x_1 = 0.4$, $x_2 = 0.1$, plus a bias input of 1.
Hidden layer: the two hidden units (h1, h2) take activations 0.547 and 0.470.
Output layer: the two output units take activations 0.461 and 0.455.
[Figure: network diagram with weight labels on the edges: .1, −.2, −.4, .3, .1, .2 (input to hidden) and .1, −.5, −.2, −.1 (hidden to output).]
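The forward pass for this example can be sketched as below. The assignment of weights to individual edges is an assumption: it was inferred by matching the weight labels to the activations shown on the slide, and the output units are assumed to have no bias (bias weight 0), since only four hidden-to-output weights appear.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_output):
    """Each weight row is [bias weight, w_1, w_2, ...]; the bias input is 1."""
    h = [sigmoid(row[0] + sum(w * xi for w, xi in zip(row[1:], x)))
         for row in w_hidden]
    o = [sigmoid(row[0] + sum(w * hj for w, hj in zip(row[1:], h)))
         for row in w_output]
    return h, o

x = [0.4, 0.1]
# Assumed edge assignment (inferred, not stated on the slide).
w_hidden = [[0.1, 0.2, 0.1],     # hidden unit with activation ~0.547
            [-0.2, 0.3, -0.4]]   # hidden unit with activation ~0.470
w_output = [[0.0, -0.2, -0.1],   # output unit with activation ~0.461
            [0.0, 0.1, -0.5]]    # output unit with activation ~0.455

h, o = forward(x, w_hidden, w_output)
```

Under these assumptions the rounded activations agree with the slide: about 0.547 and 0.470 in the hidden layer, and about 0.461 and 0.455 at the output.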
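“Softmax” operation
Often used to turn output values into a probability distribution:
$y_{sm}(o_i) = \frac{e^{o_i}}{\sum_{k=1}^{K} e^{o_k}}$, where K is the number of output units.
For the example network, with outputs 0.461 and 0.455: $y_{sm} = .501$ and $y_{sm} = .499$.
[Figure: the example network diagram with inputs $x_1 = 0.4$, $x_2 = 0.1$.]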
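A minimal sketch of the softmax operation applied to the two output activations from the example. Subtracting the maximum before exponentiating is a common numerical-stability choice added here; it is not part of the slide's formula and does not change the result.

```python
import math

def softmax(values):
    """Turn a list of real values into a probability distribution."""
    shift = max(values)                      # for numerical stability only
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

y_sm = softmax([0.461, 0.455])
```

Because the two activations are nearly equal, the resulting distribution is close to uniform (roughly 0.501 and 0.499).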
What kinds of problems are suitable for neural networks?
• Have sufficient training data
• Long training times are acceptable
• Not necessary for humans to understand the learned target function or hypothesis
Advantages of neural networks
• Designed to be parallelized (e.g. split minibatches, use GPUs)
• Robust on noisy training data
• Fast to evaluate new examples
Training a multi-layer neural network
Repeat for a given number of epochs or until accuracy on training data is acceptable:
For each training example:
1. Present input to the input layer.
2. Forward propagate the activations times the weights to each node in the hidden layer.
3. Forward propagate the activations times weights from the hidden layer to the output layer.
4. At each output unit, determine the error.
5. Run the back-propagation algorithm one layer at a time to update all weights in the network.
Training a multilayer neural network with back-propagation (stochastic gradient descent)
• Suppose a training example has the form (x, t) (i.e., both input and target are vectors).
• Error (or “loss”) E is the sum-squared error over all output units:
$E(\mathbf{w}) = \frac{1}{2} \sum_{k \in \text{output layer}} (t_k - o_k)^2$
• The goal of learning is to minimize the mean sum-squared error over the training set.
Training a multilayer neural network with back-propagation (stochastic gradient descent)
• Idea: minimize the sum-of-squares error
$E(\mathbf{w}) = \frac{1}{2} \sum_{k \in \text{output layer}} (t_k - o_k)^2$
over the entire training dataset.
• Note that we “tune” the parameters of the NN (the weights) during training.
The weights of the network are trained so that the error goes downhill until it reaches a local minimum, just like a ball rolling under gravity.
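The loss for a single training example can be sketched in a few lines. The target vector `[1.0, 0.0]` is a made-up value for illustration; the outputs are the two activations from the earlier simple example.

```python
def sum_squared_error(targets, outputs):
    """E = 1/2 * sum over output units of (t_k - o_k)^2, for one example."""
    return 0.5 * sum((t - o) ** 2 for t, o in zip(targets, outputs))

# Hypothetical target for the outputs 0.461 and 0.455.
example_loss = sum_squared_error([1.0, 0.0], [0.461, 0.455])
```

Averaging this quantity over all training examples gives the mean sum-squared error that training tries to reduce.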
Later in the slides we will derive the back-propagation equations (you can also find a derivation in the text). The derivation can be somewhat challenging; however, you only need one basic tool to derive them: multi-variate differentiation (e.g. chain rule, partial derivatives). For now, let’s just walk through the basic algorithm.
Backpropagation algorithm (Stochastic Gradient Descent)
• Initialize the network weights w to small random numbers (e.g., between −0.05 and 0.05).
• Until the termination condition is met, Do:
  • For each (x, t) ∈ training set, Do:
    1. Propagate the input forward:
      – Input x to the network and compute the activation $h_j$ of each hidden unit j.
      – Compute the activation $o_k$ of each output unit k.
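    2. Calculate error terms
      – For each output unit k, calculate error term $\delta_k$:
        $\delta_k = o_k (1 - o_k)(t_k - o_k)$
      – For each hidden unit j, calculate error term $\delta_j$:
        $\delta_j = h_j (1 - h_j) \sum_{k \in \text{output units}} w_{kj} \delta_k$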
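    3. Update weights
      – Hidden to output layer: for each weight $w_{kj}$,
        $w_{kj} \leftarrow w_{kj} + \Delta w_{kj}$, where $\Delta w_{kj} = \eta \, \delta_k h_j$
      – Input to hidden layer: for each weight $w_{ji}$,
        $w_{ji} \leftarrow w_{ji} + \Delta w_{ji}$, where $\Delta w_{ji} = \eta \, \delta_j x_i$
      ($\eta$ is the learning rate.)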
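Backpropagation Algorithm (BP)
• Forwards phase: compute the activation of each neuron in the hidden layers and outputs using:
  $h_j = \sigma\left(\sum_{i \in \text{input layer}} w_{ji} x_i + w_{j0}\right)$, $o_k = \sigma\left(\sum_{j \in \text{hidden layer}} w_{kj} h_j + w_{k0}\right)$
• Backwards pass:
  – Compute the error at the output using: $\delta_k = o_k (1 - o_k)(t_k - o_k)$
  – Compute the error at the hidden layer(s) using: $\delta_j = h_j (1 - h_j) \sum_{k \in \text{output units}} w_{kj} \delta_k$
  – Update the output layer weights using: $w_{kj} \leftarrow w_{kj} + \Delta w_{kj}$, where $\Delta w_{kj} = \eta \, \delta_k h_j$
  – Update the hidden layer weights using: $w_{ji} \leftarrow w_{ji} + \Delta w_{ji}$, where $\Delta w_{ji} = \eta \, \delta_j x_i$
• (If using sequential updating) randomize the order of the input vectors so that you don’t train in exactly the same order each iteration.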
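The forward and backward phases can be sketched for a single training example as follows. This is an illustrative implementation under assumptions, not the course's reference code: the weight layout (`[bias, w_1, w_2, ...]` per unit), the starting weights, the target 0.9, and the learning rate `eta=0.5` are all chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_output):
    """Each weight row is [bias weight, w_1, w_2, ...]; the bias input is 1."""
    h = [sigmoid(r[0] + sum(w * xi for w, xi in zip(r[1:], x))) for r in w_hidden]
    o = [sigmoid(r[0] + sum(w * hj for w, hj in zip(r[1:], h))) for r in w_output]
    return h, o

def backprop_step(x, t, w_hidden, w_output, eta):
    """One stochastic-gradient update for example (x, t); modifies weights in place."""
    h, o = forward(x, w_hidden, w_output)
    # Output error terms: delta_k = o_k (1 - o_k)(t_k - o_k)
    delta_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    # Hidden error terms, computed with the weights before they are updated.
    delta_h = [hj * (1 - hj) * sum(w_output[k][j + 1] * delta_o[k]
                                   for k in range(len(o)))
               for j, hj in enumerate(h)]
    for k, row in enumerate(w_output):
        row[0] += eta * delta_o[k]               # bias input is fixed at 1
        for j, hj in enumerate(h):
            row[j + 1] += eta * delta_o[k] * hj
    for j, row in enumerate(w_hidden):
        row[0] += eta * delta_h[j]
        for i, xi in enumerate(x):
            row[i + 1] += eta * delta_h[j] * xi
    # Sum-squared error measured before this update.
    return 0.5 * sum((tk - ok) ** 2 for tk, ok in zip(t, o))

# Hypothetical 2-2-1 network and training example.
w_hidden = [[0.1, 0.2, 0.1], [-0.2, 0.3, -0.4]]
w_output = [[0.1, -0.2, -0.1]]
losses = [backprop_step([0.4, 0.1], [0.9], w_hidden, w_output, eta=0.5)
          for _ in range(50)]
```

Repeating the step on the same example drives its error down, which is the "ball rolling downhill" picture; with a real training set the examples would be visited in a shuffled order.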
Training Time
• The aim is to balance generalization and memorization (minimizing the cost function on the training data is not necessarily a good idea).
• Use two (or three) disjoint sets:
  – Training and testing sets
  – Training, testing, and validation sets
• As long as the error on the held-out (testing or validation) set decreases, training continues (unless the maximum number of iterations is reached).
• When that error begins to increase, the net is starting to memorize.
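The stopping rule can be sketched as a small function over a sequence of held-out errors. The function name, the `patience` parameter, and the error values are assumptions made for illustration.

```python
def early_stopping_epoch(held_out_errors, patience=1):
    """Return the epoch with the lowest held-out error seen before stopping.

    Training stops once the error has failed to improve for `patience`
    consecutive epochs.
    """
    best_epoch, best_error, bad_epochs = 0, float("inf"), 0
    for epoch, error in enumerate(held_out_errors):
        if error < best_error:
            best_epoch, best_error, bad_epochs = epoch, error, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_epoch

# Made-up held-out errors: they fall, then rise as the net starts to memorize.
errors = [0.40, 0.31, 0.25, 0.22, 0.24, 0.27]
stop = early_stopping_epoch(errors)
```

In practice one would keep a copy of the weights from the returned epoch; a larger `patience` tolerates brief upticks in a noisy error curve.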
Some Pros and Cons of BP
• Connectionism
  – Biological issues:
    – No excitatory/inhibitory distinction, unlike real neurons
    – No global connections in an MLP
    – No backward propagation in real neurons
  – Useful in parallel hardware implementation
• Computational efficiency
  – A learning algorithm is said to be computationally efficient when its complexity is polynomial.
  – The BP algorithm is computationally efficient.
  – In an MLP with a total of W weights, its complexity is linear in W.
• Local minima
  – The presence of local minima is a significant issue, particularly for high-dimensional data.
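Batch (or “True”) Gradient Descent: change weights only after averaging gradients from all M training examples:
• Weights from hidden units to output units: $\Delta w_{kj} = \eta \, \frac{1}{M} \sum_{m=1}^{M} \delta_k^{(m)} h_j^{(m)}$
• Weights from input units to hidden units: $\Delta w_{ji} = \eta \, \frac{1}{M} \sum_{m=1}^{M} \delta_j^{(m)} x_i^{(m)}$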
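Mini-Batch Gradient Descent: change weights only after averaging gradients from a subset of B training examples.
At each iteration t: get the next subset of B training examples, $B_t$, until all examples have been processed.
• Weights from hidden units to output units: $\Delta w_{kj} = \eta \, \frac{1}{B} \sum_{m \in B_t} \delta_k^{(m)} h_j^{(m)}$
• Weights from input units to hidden units: $\Delta w_{ji} = \eta \, \frac{1}{B} \sum_{m \in B_t} \delta_j^{(m)} x_i^{(m)}$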
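A rough sketch of the mini-batch bookkeeping: splitting the examples into subsets of size B and averaging per-example update terms before applying them. The helper names and numeric values are hypothetical; the per-example terms stand in for products such as $\delta_k h_j$ computed by back-propagation.

```python
def minibatches(examples, batch_size):
    """Yield consecutive subsets of at most `batch_size` examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

def averaged_update(per_example_terms, eta):
    """Average per-example update terms and scale by the learning rate."""
    return eta * sum(per_example_terms) / len(per_example_terms)

# Ten placeholder examples split into batches of four.
batches = list(minibatches(list(range(10)), batch_size=4))
# Made-up update terms for one weight over a batch of three examples.
step = averaged_update([0.2, 0.4, -0.3], eta=0.1)
```

Setting the batch size to the full training set recovers batch gradient descent, and a batch size of 1 recovers the stochastic version.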
Local Minima, Momentum, etc.
• Recall that BP is an instance of “hill climbing” (e.g. gradient descent). With non-convex problems we are not guaranteed to settle into a global minimum.
• If we think of the analogy of a ball rolling down a hill, we can consider giving the ball some “weight” by implementing a momentum term.
• The purpose of the momentum term is to reduce the chance of getting “stuck” in a local minimum (i.e. a “valley”) and to avoid performance oscillations during training.
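Momentum
Introduce a momentum term, in which the change in weight depends on the past weight change:
• (hidden-to-output) $\Delta w_{kj}^{t} = \eta \, \delta_k h_j + \alpha \, \Delta w_{kj}^{t-1}$
• (input-to-hidden) $\Delta w_{ji}^{t} = \eta \, \delta_j x_i + \alpha \, \Delta w_{ji}^{t-1}$
where t is the iteration through the main loop of back-propagation. α is a parameter between 0 and 1; α determines the “strength” of the momentum term. The idea is to keep weight changes moving in the same direction.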
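Update weights, with momentum
• Hidden to output layer: for each weight $w_{kj}$,
  $w_{kj} \leftarrow w_{kj} + \Delta w_{kj}^{t}$, where $\Delta w_{kj}^{t} = \eta \, \delta_k h_j + \alpha \, \Delta w_{kj}^{t-1}$
• Input to hidden layer: for each weight $w_{ji}$,
  $w_{ji} \leftarrow w_{ji} + \Delta w_{ji}^{t}$, where $\Delta w_{ji}^{t} = \eta \, \delta_j x_i + \alpha \, \Delta w_{ji}^{t-1}$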
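A minimal sketch of a momentum update for a single weight, assuming the usual form in which the new change is the gradient step plus α times the previous change. The function name and the values of `eta`, `alpha`, and the constant gradient term are illustrative assumptions.

```python
def momentum_update(weight, gradient_term, previous_delta, eta, alpha):
    """Return (new weight, weight change) for one momentum step.

    `gradient_term` stands for delta_k * h_j (or delta_j * x_i).
    """
    delta = eta * gradient_term + alpha * previous_delta
    return weight + delta, delta

w, prev = 0.1, 0.0
history = []
for _ in range(3):
    # A constant gradient term shows how successive steps build up.
    w, prev = momentum_update(w, gradient_term=1.0, previous_delta=prev,
                              eta=0.1, alpha=0.9)
    history.append(prev)
```

With a gradient that keeps pointing the same way, the step sizes grow (0.1, 0.19, 0.271 here), which is the "keep moving in the same direction" effect; with α = 0 the update reduces to plain gradient descent.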
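Worked example
Training set:
• Input (1, 0), Label: .9
• Input (0, 1), Label: −.3
Test set:
• Input (1, 1), Label: .8
[Figure: network with inputs x1, x2 and a bias input of 1; hidden units h1, h2 and a bias unit of 1; one output unit o1. All nine weights are initialized to .1. The first training example is presented: x1 = 1, x2 = 0, Target: .9.]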
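The first forward pass and the error terms for this example can be sketched as follows, assuming the first training example is x = (1, 0) with target .9 and every weight (including bias weights) is .1, as the figure suggests. No learning rate is used here because the slide does not show one; the sketch stops before the weight update.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [1.0, 0.0]   # first training example (assumed reading of the slide)
t = 0.9          # its target
w = 0.1          # every weight, including bias weights, starts at .1

# Both hidden units see the same weighted sum: bias + w*x1 + w*x2.
h = [sigmoid(w * 1.0 + w * x[0] + w * x[1]) for _ in range(2)]
# Single output unit: bias + w*h1 + w*h2.
o = sigmoid(w * 1.0 + w * h[0] + w * h[1])

# Error terms from the back-propagation slides.
delta_o = o * (1 - o) * (t - o)
delta_h = [hj * (1 - hj) * w * delta_o for hj in h]
```

Under these assumptions both hidden activations come out near 0.55 and the output near 0.552, so the output error term is positive: the update would push the output up towards the target of .9.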