Learn how to design systems with error recovery and mitigation strategies through fault tolerance patterns such as units of mitigation, escalation, and quarantine. Explore techniques such as redundancy, minimizing human intervention, and maximizing human participation. Understand fault correlation, error containment, system monitoring, and routine maintenance practices.
Design Patterns for Fault-Tolerant Systems • Supplementary slides for the course Autonomous and Fault-Tolerant IT Systems • Kocsis Imre (ikocsis@mit.bme.hu) • 2010.09.20.
Units of Mitigation • How can you keep the whole system from being unavailable when an error occurs? • "Design the system into parts that will contain both any errors and the error recovery. Choose the divisions that make sense for your system. Design the rest of the system around these parts that represent the basic units of error mitigation."
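A minimal sketch of a unit of mitigation, assuming each part of the system is wrapped so that an error and its recovery stay inside that part. The `MitigationUnit` class, its `fallback` value, and the example units are hypothetical names introduced only for illustration.

```python
from typing import Callable, Dict


class MitigationUnit:
    """Hypothetical wrapper: one part of the system that contains
    both its own errors and its own recovery."""

    def __init__(self, name: str, work: Callable[[int], int], fallback: int) -> None:
        self.name = name
        self.work = work
        self.fallback = fallback  # assumed recovery: return a safe default
        self.errors = 0

    def run(self, value: int) -> int:
        try:
            return self.work(value)
        except Exception:
            # The error stops here; the rest of the system keeps running.
            self.errors += 1
            return self.fallback


def risky_division(x: int) -> int:
    return 100 // x  # raises ZeroDivisionError for x == 0


units: Dict[str, MitigationUnit] = {
    "division": MitigationUnit("division", risky_division, fallback=-1),
    "doubling": MitigationUnit("doubling", lambda x: x * 2, fallback=-1),
}

# One unit fails on this input, the other still delivers its result.
results = {name: unit.run(0) for name, unit in units.items()}
```

Where to draw the unit boundaries (function, process, host) is the design decision the pattern leaves to the system at hand.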
Correcting Audits • Faulty data causes errors. • "Detect and correct data errors as soon as possible. Check related data for errors, correct and record the occurrence of the error."
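A correcting audit can be sketched as follows, assuming the data carries a redundant field that can be recomputed from related data. The `Order` record, its `total` field, and the `AuditLog` are hypothetical and exist only to show the detect, correct, and record steps.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Order:
    items: List[float]
    total: float  # redundant value that should equal sum(items)


@dataclass
class AuditLog:
    entries: List[str] = field(default_factory=list)


def audit_order(order_id: str, order: Order, log: AuditLog) -> bool:
    """Check the stored total against the related item data.

    Returns True if an error was found and corrected."""
    expected = round(sum(order.items), 2)
    if order.total == expected:
        return False
    # Record the occurrence of the error, then correct the data.
    log.entries.append(f"{order_id}: total {order.total} corrected to {expected}")
    order.total = expected
    return True


log = AuditLog()
order = Order(items=[1.5, 2.5], total=5.0)  # the stored total is wrong
first_pass = audit_order("order-1", order, log)
second_pass = audit_order("order-1", order, log)
```

Running the audit a second time finds nothing to correct, which is a cheap way to confirm the correction itself was sound.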
Redundancy • How can we reduce the amount of time between error detection and the resumption of normal operation after error recovery? • "Provide redundant capabilities that support quick activation to enable error processing to continue in parallel with normal execution."
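One way to picture quick activation is an active/standby pair, sketched below under the assumption that a failed replica raises `RuntimeError`. The `Replica` and `RedundantService` classes are hypothetical; the `under_repair` list stands in for the error processing that would proceed in parallel with normal execution.

```python
from typing import List, Optional


class Replica:
    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class RedundantService:
    """Hypothetical active/standby pair with immediate switchover."""

    def __init__(self, active: Replica, standby: Optional[Replica]) -> None:
        self.active = active
        self.standby = standby
        self.under_repair: List[Replica] = []

    def handle(self, request: str) -> str:
        try:
            return self.active.handle(request)
        except RuntimeError:
            if self.standby is None:
                raise  # no redundancy left
            # Activate the standby at once; the failed replica is set
            # aside so it can be repaired while service continues.
            self.under_repair.append(self.active)
            self.active, self.standby = self.standby, None
            return self.active.handle(request)


service = RedundantService(Replica("A", healthy=False), Replica("B"))
reply = service.handle("ping")
```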
Minimize Human Intervention • How can we prevent people from doing the wrong things and causing errors? • "Design the system in a way that it is able to process and resolve errors automatically, before they become failures. This speeds error recovery and reduces the risk of procedural errors."
Maximize Human Participation • Should the system ignore people totally? That will reduce procedural errors. • "Know the user and their availability. Design the system to enable knowledgeable operating personnel to participate. […] Provide appropriate Maintenance Interfaces and Fault Observer capabilities […]"
Maintenance Interface • Should maintenance and application requests be intermingled on the application input and output channels? • "Provide a separate interface to the system for the (almost) exclusive use of maintenance interactions."
Someone in Charge • Anything can go wrong, even during error processing. When this happens the system might stop doing the error processing in addition to not doing the normal processing. • "All fault tolerance related activities have some component of the system that is clearly in charge and has the ability to determine correct completion and the responsibility to take action if it does not complete correctly."
Escalation • What does the system do when its attempt to process an error in a component is not achieving the correct effect? • "When recovery or mitigation is failing, escalate the action to the next more drastic action."
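Escalation can be sketched as an ordered ladder of recovery actions, tried from mildest to most drastic. The `escalate` function and the action names below are hypothetical; each action is assumed to return `True` when it restored correct operation.

```python
from typing import Callable, List, Tuple

RecoveryAction = Tuple[str, Callable[[], bool]]


def escalate(actions: List[RecoveryAction]) -> List[str]:
    """Try each action in order until one succeeds.

    Returns the names of the actions that were attempted."""
    attempted: List[str] = []
    for name, action in actions:
        attempted.append(name)
        if action():
            return attempted
    raise RuntimeError("all recovery actions failed: " + ", ".join(attempted))


# Assumed ladder for illustration: the retry fails, the restart works,
# so the most drastic step is never reached.
ladder: List[RecoveryAction] = [
    ("retry request", lambda: False),
    ("restart component", lambda: True),
    ("reboot host", lambda: True),
]
attempted = escalate(ladder)
```

Raising when the ladder is exhausted hands the problem to whatever component is in charge of the recovery as a whole.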
Fault Correlation • What fault is activating? • "Look at the unique signature of the error to sort it into the fault category for which error processing steps are known."
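A small sketch of sorting errors by signature, assuming the signature is visible in the error message text. The patterns and category names in `SIGNATURES` are invented for illustration; a real system would derive them from its own fault history.

```python
import re
from typing import List, Pattern, Tuple

# Hypothetical signature table: (pattern in the error text, fault category).
SIGNATURES: List[Tuple[Pattern[str], str]] = [
    (re.compile(r"timeout|timed out", re.IGNORECASE), "network"),
    (re.compile(r"checksum|corrupt", re.IGNORECASE), "data"),
    (re.compile(r"out of memory", re.IGNORECASE), "resource"),
]


def correlate(error_message: str) -> str:
    """Return the fault category whose signature matches the message."""
    for pattern, category in SIGNATURES:
        if pattern.search(error_message):
            return category
    # No known signature: no prepared error processing steps apply.
    return "unknown"


categories = [
    correlate("Connection timed out after 30s"),
    correlate("CRC checksum mismatch in block 7"),
    correlate("disk on fire"),
]
```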
Error Containment Barrier • What is the first thing that the system must do when it detects an error? • "Isolate the error to a unit of mitigation. Stop the error flow with a barrier, quarantine and initiate either error recovery or error mitigation."
System Monitor • How does one part of a system keep track that another part is alive and functioning? • "Create a Monitor to study system behavior, or the behavior of specific parts of the system to make sure that they continue operating correctly. When the watched components stop, the monitor should report the occurrence to the Fault Observer and initiate corrective actions."
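One common way to realize a monitor is with heartbeats, sketched below. This is an assumption, since the slide does not prescribe a mechanism. The `HeartbeatMonitor` class is hypothetical, the `report` callback stands in for the Fault Observer, and timestamps are passed in explicitly so the example is deterministic.

```python
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Hypothetical monitor: a component counts as stopped when its
    last heartbeat is older than `timeout` seconds."""

    def __init__(self, timeout: float, report: Callable[[str], None]) -> None:
        self.timeout = timeout
        self.report = report  # stand-in for the Fault Observer
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, component: str, now: float) -> None:
        self.last_seen[component] = now

    def check(self, now: float) -> List[str]:
        silent = [
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout
        ]
        for name in silent:
            self.report(name)
        return silent


reported: List[str] = []
monitor = HeartbeatMonitor(timeout=5.0, report=reported.append)
monitor.heartbeat("db", now=0.0)
monitor.heartbeat("web", now=8.0)
silent = monitor.check(now=10.0)  # "db" has been quiet for 10 seconds
```

In a running system the caller would pass `time.monotonic()` as `now` and trigger corrective action for each name returned by `check`.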
Existing Metrics • How to measure the severity of an overload without contributing to the overload? • "Use pre-existing indicators already tied to the resource as an indicator of the system's overload condition." • Note: this holds not only for performance!
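As an illustration, the depth of a request queue is an indicator the system already maintains, so reading it adds no extra load. The `RequestQueue` class and its `overload_level` method are hypothetical names for this sketch.

```python
from collections import deque
from typing import Deque


class RequestQueue:
    """Hypothetical bounded queue whose existing length doubles as
    the overload indicator."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.pending: Deque[str] = deque()

    def submit(self, request: str) -> bool:
        if len(self.pending) >= self.capacity:
            return False  # shed load instead of queueing more work
        self.pending.append(request)
        return True

    def overload_level(self) -> float:
        # len() on a deque is O(1): no probing or sampling is needed.
        return len(self.pending) / self.capacity


queue = RequestQueue(capacity=4)
accepted = [queue.submit(f"req-{i}") for i in range(3)]
level = queue.overload_level()
```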
Routine Maintenance • How can we keep preventable errors from occurring? • "Perform routine, preventive maintenance on the system."
Routine Exercises • How do you know that Redundant elements that will be called into service by a Failover in case of an error or failure will actually work? • "Routinely exercise, or execute the system components that will be required in an error situation. This will identify latent faults."
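A routine exercise can be sketched as a periodic self-test of the standby elements, assuming each one exposes a check that returns `True` when it works. The `exercise` function and the standby names are hypothetical.

```python
from typing import Callable, Dict, List


def exercise(standbys: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every standby's self-test and return the names of those
    with a latent fault (test failed or raised)."""
    latent: List[str] = []
    for name, self_test in standbys.items():
        try:
            ok = self_test()
        except Exception:
            ok = False  # a crashing self-test also counts as a fault
        if not ok:
            latent.append(name)
    return latent


def broken_self_test() -> bool:
    raise ConnectionError("standby link never came up")


standbys: Dict[str, Callable[[], bool]] = {
    "standby-db": lambda: True,
    "standby-link": broken_self_test,
    "standby-power": lambda: False,
}
latent_faults = exercise(standbys)
```

Scheduling the exercise during normal operation means the fault is found before a failover depends on the broken element.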
Quarantine • How can the system prevent errors from spreading? • "Establish a barrier around the element that prevents it from both contributing to the useful work and also prevents it from propagating its error into other parts of the system."
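The barrier can be sketched as a dispatcher that simply stops calling a quarantined element, so it neither receives work nor gets its output passed on. The `WorkerPool` class and the worker names are hypothetical.

```python
from typing import Callable, Dict, Set


class WorkerPool:
    """Hypothetical pool that skips quarantined workers entirely."""

    def __init__(self, workers: Dict[str, Callable[[int], int]]) -> None:
        self.workers = workers
        self.quarantined: Set[str] = set()

    def quarantine(self, name: str) -> None:
        self.quarantined.add(name)

    def release(self, name: str) -> None:
        self.quarantined.discard(name)

    def dispatch(self, value: int) -> Dict[str, int]:
        # Quarantined workers are never invoked, so they contribute no
        # work and cannot propagate erroneous results.
        return {
            name: worker(value)
            for name, worker in self.workers.items()
            if name not in self.quarantined
        }


pool = WorkerPool({
    "good": lambda x: x + 1,
    "faulty": lambda x: -x,  # assumed to produce wrong results
})
pool.quarantine("faulty")
outputs = pool.dispatch(1)
```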