Design Patterns for Fault-Tolerant Systems (Hibatűrő rendszerek tervezési mintái)



Design patterns for fault-tolerant systems. Supplementary slides for the course Autonomous and Fault-Tolerant Information Systems (Autonóm és hibatűrő inf. rsz.). Kocsis Imre (ikocsis@mit.bme.hu), 2012.09.12. Recap: Singleton. Recap: Facade. Recap: Observer. Architectural pattern language. Units of Mitigation.

Presentation Transcript

Design Patterns for Fault-Tolerant Systems

Supplementary slides for the course Autonomous and Fault-Tolerant Information Systems (Autonóm és hibatűrő inf. rsz.)

Kocsis Imre (ikocsis@mit.bme.hu)

2012.09.12.

units of mitigation
Units of Mitigation
 • How can you keep the whole system from being unavailable when an error occurs?
 • „Design the system into parts that will contain both any errors and the error recovery. Choose the divisions that make sense for your system. Design the rest of the system around these parts that represent the basic units of error mitigation.”
units of mitigation1
Units of Mitigation
 • Division:
  • Architecture
  • Available recovery/mitigation techniques
 • Desirable: fail-silent
 • Example: three-tiered system, tier as unit
  • In-tier redundancy schemes…
  • … or request queuing
 • Some further problems:
  • How to process errors inside?
  • What should the blocks be?
 • Note: coarse-grained pattern (HW and SW blocks)
correcting audits
Correcting Audits
 • Faulty data causes errors.
 • „Detect and correct data errors as soon as possible. Check related data for errors, correct and record the occurrence of the error.”
 • Leads to a host of other patterns
redundancy
Redundancy
 • Assumption: error processing usually stops normal execution
 • How can we reduce the amount of time between error detection and the resumption of normal operation after error recovery?
 • „Provide redundant capabilities that support quick activation to enable error processing to continue in parallel with normal execution.”
someone in charge
Someone in Charge
 • Anything can go wrong, even during error processing. When this happens the system might stop doing the error processing in addition to not doing the normal processing.
 • „All fault tolerance related activities have some component of the system that is clearly in charge and has the ability to determine correct completion and the responsibility to take action if it does not complete correctly.”
someone in charge1
Someone in Charge
 • N.B. does not promote a global SPOF
  • On the contrary, see Escalation
 • Example (Action / In Charge):
  • Checkpoint / each task
  • Rollback and roll forward / component R
  • Load shedding / component S
 • Also: voting / leader selection techniques
minimize human intervention
Minimize Human Intervention
 • How can we prevent people from doing the wrong things and causing errors?
 • „Design the system in a way that it is able to process and resolve errors automatically, before they become failures. This speeds error recovery and reduces the risk of procedural errors.”
minimize human intervention1
Minimize Human Intervention
 • How?
  • Make sure all errors are reported to the Fault Observer
  • Indiv. components do not talk to the outside world
  • Concentrate output and input
  • Design automatic F/E/F, detection, processing, treatment
maximize human participation
Maximize Human Participation
 • Should the system ignore people totally? That will reduce procedural errors.
 • „Know the user and their availability. Design the system to enable knowledgeable operating personnel to participate. […] Provide appropriate Maintenance Interfaces and Fault Observer capabilities […]”
escalation
Escalation
 • What does the system do when its attempt to process an error in a component is not achieving the correct effect?
 • „When recovery or mitigation is failing, escalate the action to the next more drastic action.”
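
The escalation rule above can be sketched as a ladder of recovery actions that is climbed one rung at a time. This is only an illustration: the action names (`retry_request`, `restart_task`, `restart_server`) and the convention that an action returns `True` on success are assumptions, not part of the pattern text.

```python
from typing import Callable, List, Sequence, Tuple

Action = Tuple[str, Callable[[], bool]]


def escalate(actions: Sequence[Action]) -> str:
    """Try actions from least to most drastic; return the first that works."""
    for name, action in actions:
        try:
            if action():
                return name
        except Exception:
            # A recovery action that itself fails is a reason to escalate.
            pass
    raise RuntimeError("all recovery actions failed; operator needed")


attempts: List[str] = []


def retry_request() -> bool:      # hypothetical mildest action
    attempts.append("retry")
    return False                  # pretend the retry did not help


def restart_task() -> bool:       # hypothetical intermediate action
    attempts.append("restart task")
    raise OSError("task refuses to restart")


def restart_server() -> bool:     # hypothetical most drastic action
    attempts.append("restart server")
    return True


ladder = [("retry", retry_request),
          ("restart task", restart_task),
          ("restart server", restart_server)]
result = escalate(ladder)
```

When even the last rung fails, the sketch raises an exception, which is where Maximize Human Participation would take over.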
fault observer
Fault Observer
 • The system does not stop when errors are detected; it automatically corrects them.
maintenance interface
Maintenance Interface
 • Should maintenance and application requests be intermingled on the application input and output channels?
 • „Provide a separate interface to the system for the (almost) exclusive use of maintenance interactions.”
fault correlation
Fault Correlation
 • Where have we met this before?
 • What did correlation mean?
 • Examples?
 • Note: this is a large area; the pattern only says that it is needed (~ „diagnostics are required”)
  • Topology-based approaches
  • Dynamic modeling: automata, languages
  • Static modeling: propagation relations
  • Learning methods
 • Diagnostic theory: later
fault correlation1
Fault Correlation
 • What fault is activating?
 • „Look at the unique signature of the error to sort it into the fault category for which error processing steps are known.”
fault correlation2
Fault Correlation
 • An exercise of sorts: given a topology (a DAG) and the locations of the service-level failure effects. What is the set of possible fault locations…?
 • Algorithm?
 • Complexity?
 • A finite automaton with fault locations and a sentence „with faulty output”. Which (faulty) states may we have passed through?
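
One possible answer to the DAG exercise, as a sketch under explicit assumptions: a single fault is active, an edge u -> v means an error in u can propagate to v, and every observed failure must be explained. Under these assumptions the candidates are the nodes from which all observed failure locations are reachable. The topology and node names below are made up for illustration.

```python
from collections import deque


def ancestors_or_self(reverse_edges, node):
    """All nodes from which `node` is reachable, including `node` itself."""
    seen = {node}
    pending = deque([node])
    while pending:
        current = pending.popleft()
        for parent in reverse_edges.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                pending.append(parent)
    return seen


def candidate_fault_locations(edges, failed):
    """Intersect the ancestor sets of all observed failure locations."""
    reverse_edges = {}
    for source, targets in edges.items():
        for target in targets:
            reverse_edges.setdefault(target, []).append(source)
    candidates = None
    for node in failed:
        reach = ancestors_or_self(reverse_edges, node)
        candidates = reach if candidates is None else candidates & reach
    return candidates if candidates is not None else set()


# Hypothetical topology: u -> v means an error in u can propagate to v.
edges = {
    "disk": ["db"],
    "db": ["app1", "app2"],
    "net": ["app2"],
    "app1": ["web"],
    "app2": ["web", "batch"],
}
suspects = candidate_fault_locations(edges, failed={"web", "batch"})
```

Each observed failure costs one breadth-first search over the reversed graph, so the sketch runs in O(|F| · (|V| + |E|)) for |F| observed failures.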
slide30

What do they have in common?

  • „Bus Guardian” in TT architectures
  • Try/catch block
  • Desktop antivirus protection
error containment barrier
Error Containment Barrier
 • What is the first thing that the system must do when it detects an error?
 • „Isolate the error to a unit of mitigation. Stop the error flow with a barrier, quarantine and initiate either error recovery or error mitigation.”
complete parameter checking
Complete Parameter Checking
 • How can the time from fault activation to error detection be minimized?
 • „Perform frequent checks on data and operations to detect errors quickly and prevent errors from propagating to the rest of the system.”
 • More specifically, check all the inputs and parameters rigorously.
 • Level/granularity: design decision
complete parameter checking1
Complete Parameter Checking
 • How do we check this?
  • A = B / C ;
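
A minimal sketch of what rigorous checking around `A = B / C` might look like. Which checks to make (type, finiteness, zero divisor, overflow of the result) is an assumption here; as the previous slide notes, the level of checking is a design decision.

```python
import math


def checked_divide(b: float, c: float) -> float:
    """Compute b / c, rejecting inputs and results that would hide an error."""
    for name, value in (("b", b), ("c", c)):
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"{name} must be a number, got {type(value).__name__}")
        if not math.isfinite(value):
            raise ValueError(f"{name} must be finite, got {value!r}")
    if c == 0:
        raise ValueError("c must be non-zero")
    a = b / c
    if not math.isfinite(a):
        # Two finite floats can still produce an infinite quotient.
        raise OverflowError("b / c is not representable as a finite float")
    return a
```

Checking the result as well as the inputs keeps an overflowed value from propagating silently to later computations.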
system monitor
System Monitor
 • How does one part of a system keep track that another part is alive and functioning?
 • „Create a Monitor to study system behavior, or the behavior of specific parts of the system to make sure that they continue operating correctly. When the watched components stop, the monitor should report the occurrence to the Fault Observer and initiate corrective actions.”
system monitor1
System Monitor
 • Its purpose is essentially to detect erroneous states that manifest at the system level.
 • The „how” remains open, of course, for both detection and correction.
slide40

How do we check whether a service is working?

 • Potential problems, e.g.:
  • „sparse” work
  • normal operation takes place over a different channel
heartbeat
Heartbeat
 • How does the System Monitor know that a particular monitored task is still working?
 • „The System Monitor should see a periodic heartbeat from the monitored task. If the monitored task does not supply a heartbeat response within the required time then recovery action should be taken.”
 • Variants: autonomous / request-response
 • After all, this too is a heartbeat in this sense:
acknowledgement
Acknowledgement
 • When there is a dialog between two tasks, what’s the easiest way for one task to determine that the other task is alive and functioning?
 • „Send an acknowledgement for all requests. All requests should require a reply to acknowledge receipt and to indicate that the monitored system is alive and able to adhere to the protocol. […]”
acknowledgement1
Acknowledgement
 • A useful pattern, but it must be applied with judgment. When is it contraindicated?
realistic threshold
Realistic Threshold
 • How much time should elapse before the System Monitor takes action when an error is detected?
 • Why can this be a problem?
 • Terminology
  • Messaging latency
  • Detection latency
realistic treshold
Realistic Threshold
 • „Set the messaging latency based upon the worst case communications time combined with the time required to process one Heartbeat message.
 • Set the detection latency based upon the criticality of the functionality. Make it a multiple of the messaging latency.
 • Set them so that the availability requirement is met, yet false triggers do not occur.” (restart time!)
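
The two latencies can be tied to a heartbeat check roughly as follows. The class and field names are hypothetical, the numeric values are invented, and time is passed in explicitly so that the sketch stays deterministic; a real monitor would read a monotonic clock.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    worst_case_comm_time: float       # seconds, assumed to be measured
    heartbeat_processing_time: float  # seconds to process one heartbeat
    missed_beats_allowed: int         # multiplier, chosen from criticality
    last_beat: float = 0.0            # time of the most recent heartbeat

    @property
    def messaging_latency(self) -> float:
        return self.worst_case_comm_time + self.heartbeat_processing_time

    @property
    def detection_latency(self) -> float:
        # A multiple of the messaging latency, as the pattern asks.
        return self.missed_beats_allowed * self.messaging_latency

    def beat(self, now: float) -> None:
        self.last_beat = now

    def is_alive(self, now: float) -> bool:
        return (now - self.last_beat) <= self.detection_latency


monitor = HeartbeatMonitor(worst_case_comm_time=0.75,
                           heartbeat_processing_time=0.25,
                           missed_beats_allowed=3)
monitor.beat(now=10.0)
```

A larger multiplier reduces false triggers but lengthens the time to detection, which together with the restart time has to fit within the availability requirement.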
existing metrics
Existing Metrics
 • How to measure the severity of an overload without contributing to the overload?
 • „Use pre-existing indicators already tied to the resource as an indicator of the system’s overload condition.”
 • Example?
routine maintenance
Routine Maintenance
 • How can we keep preventable errors from occurring?
 • „Perform routine, preventive maintenance on the system.”
routine maintenance1
Routine Maintenance
 • Some applicable patterns:
  • Routine Audits, Routine Exercises
  • Deferrable Work
 • From the practical point of view:
  • HW
  • Software: garbage collection, rejuvenation, and in a certain sense patching/updates too
  • Data
  • Maintenance windows!
routine exercises
Routine Exercises
 • How do you know that Redundant elements that will be called into service by a Failover in case of an error or failure will actually work?
 • „Routinely exercise, or execute the system components that will be required in an error situation. This will identify latent faults.”
riding over transients
Riding over Transients
 • How can the system avoid wasting resources processing transient errors that won’t have a long term effect on the system?
 • „Monitor the system and conduct Fault Correlation. If the correlation indicates a transient fault, monitor the frequency of occurrence, but take no action (unless threshold reached).”
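
A sketch of the „take no action unless threshold reached” rule using a sliding time window. The window length, the threshold and the class name are assumptions; other counting schemes, such as a leaky bucket, would serve the same purpose.

```python
from collections import deque


class TransientFilter:
    """Count transient errors in a sliding window; act only at the threshold."""

    def __init__(self, threshold: int, window: float):
        self.threshold = threshold   # occurrences that trigger action
        self.window = window         # seconds of history to keep
        self.events = deque()        # timestamps of recent occurrences

    def record(self, now: float) -> bool:
        """Record one transient error; return True once action is required."""
        self.events.append(now)
        # Forget occurrences that have fallen out of the window.
        while self.events and now - self.events[0] > self.window:
            self.events.popleft()
        return len(self.events) >= self.threshold


ride_over = TransientFilter(threshold=3, window=10.0)
decisions = [ride_over.record(t) for t in (0.0, 20.0, 21.0, 22.0)]
```

The isolated error at time 0 is forgotten, while three errors within ten seconds are treated as more than a transient.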
detection patterns
Detection patterns
 • We have skipped the explicit discussion of a few patterns.
quarantine
Quarantine
 • How can the system prevent errors from spreading?
 • „Establish a barrier around the element that prevents it from both contributing to the useful work and also prevents it from propagating its error into other parts of the system.”
 • Examples?
  • Desktop virus scanners, Intel vPro, BME intranet, spacecraft
concentrated recovery
Concentrated Recovery
 • When processing an error, how should the system minimize unavailability?
 • „Focus all necessary resources on the recovery task so that the recovery time can be minimized.”
 • Also reduces risk
 • Needs proper selection of Units of Mitigation
error handler
Error Handler
 • Developing and maintaining application code is complicated by the need to process errors.
 • „Separate error processing code in special handling blocks for easier maintenance.”
 • (Java) exceptions!
restart
Restart
 • How can execution resume when recovery from the error is not possible?
 • „Restart the system. Suffer the loss of time and state to reinitialize and restart the application from the beginning.”
restart1
Restart
 • Usually several levels can be distinguished, e.g.
  • cold/warm
  • server / task (container)
 • Accompanying patterns:
  • after Escalation,
  • performed by Someone in Charge,
  • as part of Concentrated Recovery,
  • reported to the Fault Observer,
  • watched by the System Monitor
 • May be followed by a return to a Checkpoint
rollback
Rollback
 • Where should processing resume after error recovery?
 • „Return to a point where processing can be synchronized that is before the point of error.”
 • Needs checkpoint + stable request log
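
How a checkpoint and a request log combine into a rollback can be sketched on a toy service. The `Counter` class is hypothetical, and both the saved state and the log are kept in memory here, whereas the slide asks for a stable log, i.e. one that survives the error.

```python
class Counter:
    """Toy service whose whole state is a running total."""

    def __init__(self):
        self.total = 0
        self._saved_total = 0     # state captured at the last checkpoint
        self._request_log = []    # requests accepted since that checkpoint

    def handle(self, amount: int) -> None:
        self._request_log.append(amount)   # log first, then process
        self.total += amount

    def checkpoint(self) -> None:
        self._saved_total = self.total
        self._request_log.clear()

    def rollback(self) -> None:
        """Restore the checkpoint, then replay the logged requests."""
        self.total = self._saved_total
        for amount in self._request_log:
            self.total += amount


service = Counter()
service.handle(10)
service.checkpoint()
service.handle(5)
service.handle(7)
service.total = -999      # simulate an error corrupting the state
service.rollback()
```

Replaying the log is what keeps the work done since the checkpoint from being lost; without it the service would resume at the checkpointed total.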
roll forward
Roll-forward
 • Where should processing resume after error recovery?
 • „Advance to the next point where the processing across the system can be synchronized. Do not resume execution from the point of error; continue as though the erroneous actions did not complete (or did complete successfully).”
 • System state: one that would have been encountered without the error.
return to reference point
Return to Reference Point
 • Where can execution resume when an error occurs that can be recovered, but for which the recovery does not provide appropriate Rollback/Roll-forward places?
 • „Resume execution at a point that is known to be safe.”
 • „Safe place” created at design time
failover
Failover
 • The active part of a group of redundant elements has a fault; how can error-free execution continue?
 • „Switch system execution from the current active element to a redundant element.”
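
A minimal failover sketch, assuming that each redundant element is a callable and that any exception it raises counts as a detected error. Real systems also need state transfer and error detection beyond exceptions, which are left out here; the `primary` and `standby` functions are invented for the demonstration.

```python
class FailoverGroup:
    """Route requests to the active element; switch over when it fails."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.active = 0                      # index of the active element

    def call(self, request):
        for _ in range(len(self.replicas)):
            try:
                return self.replicas[self.active](request)
            except Exception:
                # Switch execution to the next redundant element.
                self.active = (self.active + 1) % len(self.replicas)
        raise RuntimeError("no working redundant element left")


def primary(request):
    raise ConnectionError("primary is down")


def standby(request):
    return f"standby handled {request}"


group = FailoverGroup([primary, standby])
reply = group.call("req-1")
```

After the switch the standby stays active, so later requests do not pay for the failed attempt on the primary again.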
checkpoint
Checkpoint
 • Work in progress might be lost during the recovery from an error.
 • „Save state periodically. Build in the capability to restore the system to the same state that was saved, without having to recreate the entire execution from startup to the point of the saved state.”
 • Frequency/timing: tradeoff
remote storage
Remote Storage
 • What storage location should be used for checkpoints to reduce the time before execution can be resumed after error recovery?
 • „Store the saved checkpoints in a centrally accessible location. This enables a new processor to access the saved checkpoint which minimizes the period of unavailability.”
 • Examples?
error mitigation
Error Mitigation
 • Masking the erroneous state
 • Effect compensation
 • Mostly about handling overload
  • External requests
  • Resources lost as a result of faults!
  • Virtualization/cloud?
overload toolboxes
Overload Toolboxes
 • How should the system handle situations of overload?
 • „Have multiple toolboxes with which to mitigate overloads. Avoid grouping of all the possible techniques together.”
 • Remember what we said about detection!
deferrable work
Deferrable Work
 • What work should the system shed when the choices are handling most of the incoming work or the routine maintenance workload?
 • „Make the routine work deferrable.”
 • Presumably the system and its periphery are working; otherwise, where would the workload come from...
reassess overload decision
Reassess Overload Decision
 • What should the system do when the workload mitigation techniques are not working?
 • „Provide the system with a feedback loop which provides information to enable the system to reexamine fault correlation decisions.”
 • If the system is overloaded even with a working „Shed Load”/..., the fault is (also) somewhere else
equitable resource allocation
Equitable resource allocation
 • How should requests for scarce resources be handled?
 • „Pool all similar requests and allocate resources to the pools based upon their availability and priority.”
queue for resources
Queue for resources
 • What should be done with requests for resources that cannot be handled immediately when they arrive?
 • „Store requests for service that cannot be handled immediately in a queue.”
 • The queue should preferably be of finite length
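
A finite-length request queue can be sketched with Python's standard `queue.Queue`, whose `maxsize` bound makes `put_nowait` raise `queue.Full` instead of growing without limit. The `submit` helper and the request names are hypothetical.

```python
import queue


def submit(pending: queue.Queue, request) -> bool:
    """Queue a request for later service; refuse it when the queue is full."""
    try:
        pending.put_nowait(request)
        return True
    except queue.Full:
        # A bounded queue turns overload into an explicit rejection.
        return False


pending = queue.Queue(maxsize=2)
accepted = [submit(pending, r) for r in ("r1", "r2", "r3")]
first_served = pending.get_nowait()
```

Refusing the third request is where the queue hands over to load-shedding patterns such as Shed Load.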
expansive automatic controls
Expansive Automatic Controls
 • How can we avoid both the wasted effort processing the requests that can’t immediately be handled in an overload and at the same time increase overall request completions?
 • „Design some resources into the system that will be used only in case of overload. Provide new ways for the system to do its work that either uses reserved or fewer resources.”
protective automatic controls
Protective Automatic Controls
 • What actions should an overloaded system take to avoid spending all of its time doing overhead work associated with new requests arriving?
 • „Automatically impose restrictions on how much work the system accepts to protect the system’s ability to function.”
 • As a function of the „reserve capacity”
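
One way to read this as an admission rule, offered only as a sketch: new work is accepted while the load stays below the capacity minus a reserve kept for the system's own functioning. The `admit` function, the reserve fraction and the load figures are assumptions.

```python
def admit(current_load: int, capacity: int, reserve_fraction: float) -> bool:
    """Accept new work only while the reserve capacity stays untouched."""
    limit = capacity * (1.0 - reserve_fraction)
    return current_load < limit


# With capacity 100 and a 25% reserve, work is refused from load 75 upwards.
decisions = [admit(load, capacity=100, reserve_fraction=0.25)
             for load in (70, 75, 80)]
```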
shed load
Shed Load
 • How can the system best handle too many requests and keep them from overwhelming the system?
 • „Shed some requests so that the others may receive good service.”
 • Throughput   Availability
final handling
Final Handling
 • How can the system best handle too many requests and keep them from overwhelming the system?
 • „Integrate the release of resources for internally terminated transactions with the usual release of resources done by normal task termination.”
 • At the micro level: „finally” block
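
The „finally” idea can be sketched as follows: the resource is released on the same code path whether the transaction finishes normally or is terminated by an error. `ConnectionPool` and `run_transaction` are hypothetical names used only for this illustration.

```python
class ConnectionPool:
    """Toy resource pool that only counts its free slots."""

    def __init__(self, size: int):
        self.free = size

    def acquire(self) -> None:
        if self.free == 0:
            raise RuntimeError("pool exhausted")
        self.free -= 1

    def release(self) -> None:
        self.free += 1


def run_transaction(pool: ConnectionPool, work):
    pool.acquire()
    try:
        return work()
    finally:
        # Runs for normal completion and for aborted transactions alike.
        pool.release()


def failing_work():
    raise ValueError("transaction terminated internally")


pool = ConnectionPool(size=2)
outcome = run_transaction(pool, lambda: "done")
try:
    run_transaction(pool, failing_work)
except ValueError:
    pass
```

After both calls the pool is back at its full size, so terminated transactions do not leak resources during an overload.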
share the load
Share the Load
 • How can you increase the available processing power?
 • „Move some of the work to other processors. Move work that does not require high levels of synchronization.”
 • Cloud and virtualized systems!
shed work at periphery
Shed Work at Periphery
 • How does the system shed load that is beyond system capacity for the lowest additional effort?
 • „Detect which work is eligible for shedding as close to the edges of the system as possible. Provide this detection mechanism with information about the processing capacity of the most limiting part of the system.”
 • Technically non-trivial
slow it down
Slow it Down
 • How does the system shed load that is beyond system capacity for the lowest additional effort?
 • „Detect which work is eligible for shedding as close to the edges of the system as possible. Provide this detection mechanism with information about the processing capacity of the most limiting part of the system.”
 • Technically non-trivial