
Mars pathfinder failure

Mars Pathfinder failure. Documentation: speech by Dave Wilner, recorded by Mike Jones; comments by Glenn Reeves, Mars Pathfinder Flight Software Cognizant Engineer (hereafter called JPL, Jet Propulsion Lab); talk by Ian A. Mason, University of New England, Australia.


Presentation Transcript


  1. Mars Pathfinder failure

  2. Documentation • Speech by Dave Wilner, recorded by Mike Jones • Comments by Glenn Reeves, Mars Pathfinder Flight Software Cognizant Engineer • Hereafter called JPL (Jet Propulsion Lab) • Talk by Ian A. Mason, University of New England, Australia • http://mcs.une.edu.au/~iam/Data/threads/threads.html

  3. Pathfinder mission • Launch: 4 December 1996 • Mars Pathfinder was originally designed as a technology demonstration of a way to deliver an instrumented lander and a free-ranging robotic rover to the surface of the red planet. • Pathfinder not only accomplished this goal but also returned an unprecedented amount of data and outlived its primary design life.

  4. Budget • Due to limited funds, Pathfinder’s development had to be dramatically different from the way in which previous spacecraft had been developed. • Instead of the traditional 8- to 10-year schedule and $1-billion-plus budget, Pathfinder was developed in three years for less than $150 million, about the cost of some Hollywood movies!

  5. Pathfinder exploration • Landing: 4 July 1997 • Last transmission: 27 September 1997 • Pathfinder & Sojourner

  6. Lander • The lander was controlled by a derivative of the commercially available IBM RAD6000 computer, radiation-hardened to survive the flight. • The computer featured a computing speed of 20 MIPS and 128 MB of DRAM for storage of flight software and engineering and science data, including images and rover information. • 6 MB of ROM stored flight software and time-critical data.

  7. Rover Sojourner • The rover, capable of autonomous navigation and performance of tasks, communicated with Earth via the lander. • Sojourner’s control system was built around an Intel 80C85, with a computing speed of 0.1 MIPS and 500 KB of RAM. • ? ROM

  8. The lander hardware and software

  9. Lander hardware (diagram) • CPU: RS6000 on a VMEbus, which also carries the camera, the radio and the Mil 1553 bus interface • Mil 1553 bus, cruise stage: controls thrusters, valves, a sun sensor, a star scanner • Mil 1553 bus, lander: interface to accelerometers, a radar altimeter, an instrument for meteorological science known as the ASI/MET • Mil 1553 imposes a specific paradigm: the software will schedule activity at an 8 Hz rate. This **feature** dictated the architecture of the software which controls both the 1553 bus and the devices attached to it.

  10. The software • VxWorks 5.x (x = 3 or 4?) • 2 tasks control the 1553 bus and the attached instruments. • bc_sched task (called the bus scheduler): controls the setup of transactions on the 1553 bus. • bc_dist task (for distribution), also referred to as the “communication task”: handles the collection of the transaction results, i.e. the data.

  11. Marsrobot general communication pattern • Mil 1553 transaction time: 125 ms (8 Hz) • Priorities: bc_sched HIGH; bc_dist MEDIUM; spacecraft functions LOW; science functions (ASI/MET, …) LOWEST • t1 - bus hardware starts via hardware control on the 8 Hz boundary. The transactions for this cycle had been set up by the previous execution of the bc_sched task. • t2 - 1553 traffic is complete and the bc_dist task is awakened. • t3 - bc_dist task has completed all of the data distribution. • t4 - bc_sched task is awakened to set up transactions for the next cycle. • t5 - bc_sched activity is complete. • Timeline (diagram): t1, t2, t3, t4, t5, then t1 of the next cycle; annotation between bc_sched and bc_dist: "Check order!"

  12. 1553 communication • Powered 1553 devices deliver data. • Tasks in the system that access the information collected over the 1553 do so via a double-buffered shared memory mechanism into which the bc_dist task places the latest data. • The exception to this is the ASI/MET task, which receives its information via an interprocess communication mechanism (IPC). The IPC mechanism uses the VxWorks pipe() facility.
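
The double-buffered hand-off described on slide 12 can be illustrated with a minimal Python sketch. The slides do not show the flight implementation, so everything here is an assumption: the class name `DoubleBuffer`, its methods, the sample fields, and the restriction to a single writer are all hypothetical.

```python
import threading


class DoubleBuffer:
    """Two slots: the single writer fills the back slot, then flips it to the front."""

    def __init__(self, initial=None):
        self._slots = [initial, initial]
        self._front = 0                    # index of the slot readers may use
        self._lock = threading.Lock()      # guards only the index flip and the read

    def publish(self, data):
        """Writer side (bc_dist on the slides): fill the back slot, then flip."""
        back = 1 - self._front
        self._slots[back] = data           # readers never look at the back slot
        with self._lock:
            self._front = back             # critical section is a single assignment

    def latest(self):
        """Reader side: return the most recently published data."""
        with self._lock:
            return self._slots[self._front]


# Hypothetical sample data: one record per 8 Hz cycle.
buf = DoubleBuffer(initial={"cycle": 0})
buf.publish({"cycle": 1, "altimeter_m": 120.5})
snapshot = buf.latest()
buf.publish({"cycle": 2, "altimeter_m": 118.0})
```

Readers always see a complete record, and the writer never waits for a slow reader to finish with the data, which is the property that makes this pattern attractive for periodic data.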

  13. Lander data flow (diagram) • Hardware: CPU RS6000 and memory (MEM) on the VMEbus, with camera, radio and Mil 1553 • In memory: a packed buffer and three double buffers (D-Buffer) • IPC pipe with a file descriptor list • Cruise stage: controls thrusters, valves, a sun sensor, a star scanner • Lander: interface to accelerometers, a radar altimeter, an instrument for meteorological science known as the ASI/MET

  14. Dedicated Systems’ tasking graphics model - example (diagram) • Symbols shown: thread (P3, P4), started thread, mailbox, message queue, shared data (D, D4), data usage

  15. Marsrobot task model (diagram) • Tasks: bc_sched, bc_dist, spacecraft function tasks (task 1 and others), science function tasks (task 1 and others), ASI/MET task • Mil 1553 transaction setup and Mil 1553 data setup • Four shared data blocks • Pipe to the ASI/MET task • File descriptor table protected by system_mutex

  16. IPC mechanism • Tasks wait on one or more IPC "queues" for messages to arrive, using the VxWorks select() mechanism. • Multiple queues are used when both high and lower priority messages are required. • Most of the IPC traffic in the system is not for the delivery of real-time data. The exception to this is the use of the IPC mechanism with the ASI/MET task. • The cause of the reset on Mars was in the use and configuration of the IPC mechanism.

  17. VxWorks select() • Pending on multiple file descriptors: this routine permits a task to pend until one of a set of file descriptors becomes available • Wait for multiple I/O devices (task level and driver level) • File descriptors: pReadFds, pWriteFds • Bits set in pReadFds will cause select() to pend until data becomes available on any of the corresponding file descriptors. • Bits set in pWriteFds will cause select() to pend until any of the corresponding file descriptors becomes available. • http://www.eelab.usyd.edu.au/tornado/docs/vxworks/ref/selectLib.html
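
Waiting on several IPC queues with select(), as described on slides 16 and 17, can be sketched with Python's standard `select.select` on POSIX pipes. This is an analogue, not the VxWorks API: the two pipes standing in for a high and a low priority queue, and the rule of serving the high one first, are assumptions made for illustration. The sketch needs a POSIX system, since `select` on pipes is not available on Windows.

```python
import os
import select

# Two pipes stand in for a high-priority and a low-priority IPC queue.
high_r, high_w = os.pipe()
low_r, low_w = os.pipe()

os.write(low_w, b"telemetry")      # only the low-priority queue has data so far

# Pend until at least one read descriptor has data (timeout: 1 second).
ready, _, _ = select.select([high_r, low_r], [], [], 1.0)
first_ready = list(ready)

os.write(high_w, b"command")       # now both queues hold a message
ready, _, _ = select.select([high_r, low_r], [], [], 1.0)

# Serve the high-priority queue first when both descriptors are ready.
order = [fd for fd in (high_r, low_r) if fd in ready]
messages = [os.read(fd, 64) for fd in order]

for fd in (high_r, high_w, low_r, low_w):
    os.close(fd)
```

The first call returns only the low-priority descriptor; the second returns both, and the task drains them in priority order.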

  18. Marsrobot design (diagram) • Threads shown: Thread A, Thread B, Thread C • Middle priority, long lasting comm thread: bc_dist • Low priority thread • Lowest priority, sporadic meteo thread: ASI/MET • System_mutex: shared resource for communication using select() • Different I/O channels

  19. The problem • Priority inversion • Bounded • Unbounded

  20. Priority Inversion • Priority inversion occurs when a thread of low priority blocks the execution of threads of higher priority. • Priority inversion comes in two flavours: • bounded priority inversion (common & relatively harmless) • unbounded priority inversion (insidious & potentially disastrous) • Priority inversion is not new • the earliest mention of it that I've found dates back to the Burroughs MCP (Master Control Program) of the early 1970's.

  21. Bounded Priority Inversion • Suppose a high priority thread becomes blocked waiting for an event to happen. A low priority thread then starts to run and in doing so obtains (i.e. locks) a mutex for a shared resource. While the mutex is locked by the low priority thread, the event occurs, waking up the high priority thread. • Inversion takes place when the high priority thread tries to lock the mutex held by the low priority thread. In effect the high priority thread must wait for the low priority thread to finish its critical section. • It is called bounded inversion since the inversion is limited by the duration of the critical section.

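  22. Bounded priority inversion (timing diagram) • Legend: run, blocked, ready; horizontal axis: time • HIGH: TASK A (40), made ready by ISR A, then LockMUTEX (m) • LOW: TASK C (30): LockMUTEX (m), later UnLockMUTEX (m) • Marked interval: bounded inversion time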

  23. Unbounded Priority Inversion • This is a simple elaboration on bounded inversion. Here the high priority thread can be blocked indefinitely by a medium priority thread. The medium priority thread, while running, prevents the low priority thread from releasing the lock. • All that is required for this to happen is that, while the low priority thread holds the mutex, the medium priority thread becomes unblocked and preempts the low priority thread. The medium priority thread then runs indefinitely.
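
The scenario on slide 23 can be replayed with a small tick-based simulation. This is a sketch under stated assumptions, not a model of VxWorks: the scheduler, the `Task` class, the one-tick cost of every step and the release times are all invented for illustration. Priorities follow the numbers on the diagrams (A = 40, B = 35, C = 30, larger is higher).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    name: str
    priority: int                 # larger number = higher priority, as on slides 22 and 24
    release: int                  # tick at which the task becomes ready
    script: list                  # steps: "lock", "work" or "unlock", one tick each
    pc: int = 0                   # index of the next step to execute
    finished_at: Optional[int] = None


def run(tasks, max_ticks=1000):
    """Fixed-priority preemptive scheduling, one shared mutex, no priority inheritance."""
    owner = None                  # task currently holding the mutex
    for tick in range(max_ticks):
        ready = []
        for t in tasks:
            if t.release > tick or t.finished_at is not None:
                continue
            wants_lock = t.script[t.pc] == "lock"
            if wants_lock and owner is not None and owner is not t:
                continue          # blocked on the mutex
            ready.append(t)
        if not ready:
            if all(t.finished_at is not None for t in tasks):
                break
            continue
        current = max(ready, key=lambda t: t.priority)   # highest ready priority runs
        step = current.script[current.pc]
        if step == "lock":
            owner = current
        elif step == "unlock":
            owner = None
        current.pc += 1
        if current.pc == len(current.script):
            current.finished_at = tick + 1
    return {t.name: t.finished_at for t in tasks}


def high_task_finish(medium_ticks):
    """Tick at which high-priority task A completes; medium_ticks must be >= 1."""
    tasks = [
        Task("C_low", 30, 0, ["lock", "work", "work", "unlock"]),
        Task("A_high", 40, 1, ["lock", "work", "unlock"]),
        Task("B_medium", 35, 2, ["work"] * medium_ticks),
    ]
    return run(tasks)["A_high"]


short_case = high_task_finish(5)
long_case = high_task_finish(50)
```

With these made-up timings task A finishes at tick 7 + medium_ticks: its delay is set entirely by task B, which never touches the mutex. That dependence on an unrelated task is what makes the inversion unbounded.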

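  24. Unbounded priority inversion (timing diagram) • Legend: run, blocked, ready; horizontal axis: time • HIGH: TASK A (40), made ready by ISR A, then LockMUTEX (m) • MIDDLE: TASK B (35), made ready by ISR B • LOW: TASK C (30): LockMUTEX (m) • Marked interval: unbounded inversion time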

  25. Mission failure • The failure was identified by the spacecraft as a failure of the bc_dist task to complete its execution before the bc_sched task started. • The reaction to this by the spacecraft was to reset the computer. • This reset reinitializes all of the hardware and software. It also terminates the execution of the current ground-commanded activities. No science or engineering data that has already been collected is lost (the data in RAM is recovered so long as power is not lost). • The remainder of the activities for that day were not accomplished until the next day.
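
The check that triggered the reset (bc_sched finding that bc_dist has not finished) can be sketched as a per-cycle deadline test. The 125 ms cycle comes from slide 11; the wake-up offset of 100 ms, the finish times, the function names and the `ResetRequested` exception are hypothetical values chosen for illustration.

```python
CYCLE_MS = 125  # 8 Hz bus cycle, from slide 11


class ResetRequested(Exception):
    """Stands in for the spacecraft's reaction: a full computer reset."""


def bc_sched_wakeup(cycle, bc_dist_done):
    """Run when bc_sched wakes (t4): verify the hard deadline before scheduling."""
    if not bc_dist_done:
        raise ResetRequested(f"cycle {cycle}: bc_dist did not finish before bc_sched")
    return f"cycle {cycle + 1} transactions set up"


def run_cycles(dist_finish_ms, sched_wake_ms):
    """dist_finish_ms[i] is when bc_dist finished in cycle i, in ms from cycle start."""
    if not 0 < sched_wake_ms < CYCLE_MS:
        raise ValueError("bc_sched must wake inside the cycle")
    log = []
    for cycle, finished in enumerate(dist_finish_ms):
        try:
            log.append(bc_sched_wakeup(cycle, finished <= sched_wake_ms))
        except ResetRequested as err:
            log.append(f"RESET ({err})")
            break                 # nothing else runs in this session after a reset
    return log


# Third cycle: bc_dist is held up (for example by priority inversion) and overruns.
log = run_cycles([40, 45, 130], sched_wake_ms=100)
```

The first two cycles are scheduled normally; the third produces the reset entry and ends the run, mirroring the loss of the remaining activities for the day.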

  26. Marsrobot normal operation (timing diagram) • Legend: run, blocked, ready; horizontal axis: time, up to the end of cycle • HIGH: bus thread bc_sched: OK! • MIDDLE: comm thread bc_dist: LockSystemMUTEX (m), Un-LockSystemMUTEX (m) • LOW: tasks • LOWEST: meteo thread • Annotations: comm thread pre-emption (twice)

  27. Marsrobot priority inversion (timing diagram) • Legend: run, blocked, ready; horizontal axis: time, up to the end of cycle • HIGH: bus thread bc_sched: NOK!, system reset • MIDDLE: comm thread bc_dist • LOW: tasks • LOWEST: meteo thread • Mutex events shown: LockSystemMUTEX (m), Un-LockSystemMUTEX (m) • Annotations: comm thread pre-emption (twice)

  28. Priority inversion • The higher priority bc_dist task was blocked by the much lower priority ASI/MET task, which was holding a shared resource. • The ASI/MET task had acquired this resource and then been preempted by several of the medium priority tasks. • When the bc_sched task was activated to set up the transactions for the next 1553 bus cycle, it detected that the bc_dist task had not completed its execution. • The resource that caused this problem was a mutex (here called system_mutex) used within the select() mechanism to control access to the list of file descriptors that the select() mechanism was to wait on.

  29. The select() mechanism creates a system_mutex to protect the "wait list" of file descriptors for those devices which support select(). • The VxWorks pipe mechanism is such a device, and the IPC mechanism used is based on pipes. • The ASI/MET task had called select(), which had called pipeIoctl(), which had called selNodeAdd(), which was in the process of giving the system_mutex. • The ASI/MET task was preempted and semGive() was not completed. • Several medium priority tasks ran until the bc_dist task was activated. • The bc_dist task attempted to send the newest ASI/MET data via the IPC mechanism, which called pipeWrite(). • pipeWrite() blocked, taking the system_mutex. More of the medium priority tasks ran, still not allowing the ASI/MET task to run, until the bc_sched task was awakened. • At that point, the bc_sched task determined that the bc_dist task had not completed its cycle (a hard deadline in the system) and declared the error that initiated the reset.

  30. Debug the problem • On a replica on Earth • Total tracing on • Context switches • Uses of synchronisation objects • Interrupts • Took time to reproduce the error • Trace analysis => priority inversion problem

  31. Bug Detection • The software that flies on Mars Pathfinder has several debug features within it that are used in the lab but are not used on the flight spacecraft (not used because some of them produce more information than we can send back to Earth). • These features remain in the software by design because JPL strongly believes in the "test what you fly and fly what you test" philosophy.

  32. One of these tools is a trace/log facility which was originally developed to find a bug in an early version of the VxWorks port (Wind River ported VxWorks to the RS6000 processor for us for this mission). • This trace/log facility was built by David Cummings who was one of the software engineers on the task. Lisa Stanley, of Wind River, took this facility and instrumented the pipe services, msgQ services, interrupt handling, select services, and the tExec task. • The facility initializes at startup and continues to collect data (in ring buffers) until told to stop. The facility produces a voluminous dump of information when asked.
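
A trace facility that collects into a ring buffer until told to stop, as slide 32 describes, can be sketched with a bounded `deque`. The slides do not show the actual facility, so the `TraceLog` class, the tuple layout and the event names below are hypothetical.

```python
from collections import deque


class TraceLog:
    """Keeps only the most recent `capacity` events; older ones are overwritten."""

    def __init__(self, capacity):
        self._events = deque(maxlen=capacity)   # oldest entries fall off automatically
        self._enabled = True

    def record(self, tick, task, event):
        if self._enabled:
            self._events.append((tick, task, event))

    def stop(self):
        """Freeze the buffer, e.g. when the deadline error is detected."""
        self._enabled = False

    def dump(self):
        return list(self._events)


trace = TraceLog(capacity=4)
events = [
    ("ASI/MET", "select"),
    ("ASI/MET", "semTake"),
    ("tMedium", "switch_in"),
    ("bc_dist", "pipeWrite"),
    ("bc_dist", "semTake_blocked"),
    ("bc_sched", "deadline_missed"),
]
for tick, (task, event) in enumerate(events):
    trace.record(tick, task, event)
trace.stop()
trace.record(99, "late", "ignored")     # dropped: collection was stopped
dumped = trace.dump()
```

Stopping the collection at the moment the error is declared preserves the events that led up to it, which is what made the lab reproduction analysable.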

  33. System tracing • Traces system call or OS events • Uses circular buffer • Overhead • RT if ...... • Diagram: tasks (ASI/MET, Tx, Ty, bc_dist, bc_sched) on VxWorks 5.x with a TRACE TICKER routine, above physical I/O (BIOS) and hardware

  34. After the problem occurred on Mars, JPL ran the same set of activities over and over again in the lab. • The bc_sched task was already coded so as to stop the trace/log collection and dump the data for this error (even though JPL knew they could not get the dump in flight). • So, when JPL went into the lab to test it, they did not have to change the software. • In less than 18 hours JPL was able to cause the problem to occur. Once they were able to reproduce the failure, the priority inversion problem was obvious. ??

  35. Problem correction (1) • Once JPL understood the problem, the fix appeared obvious: change the creation flags for the semaphore so as to enable priority inheritance. • The Wind River folks, for many of their services, supply global configuration variables for parameters such as the "options" parameter for the semMCreate used by the select service (although this is not documented, and those who do not have VxWorks source code or have not studied the source code might be unaware of this feature).

  36. Problem correction (2) • However, the fix is not so obvious, for several reasons: • (1) The code for this is in selectLib and is common to all device creations. When you change this global variable, all of the select semaphores created after that point will be created with the new options. There was no easy way in our initialization logic to modify only the semaphore associated with the pipe used for bc_dist task to ASI/MET task communications. • (2) If you make this change, and it is applied on a global basis, how will this change the behavior of the rest of the system? • (3) The priority inheritance option was deliberately left out by Wind River in the default selectLib service for optimum performance. How will performance degrade if we turn priority inheritance on? • (4) Was there some intrinsic behavior of the select mechanism itself that would change if priority inheritance was enabled?

  37. Problem correction (3) • JPL did end up modifying the global variable to include priority inheritance. This corrected the problem. • JPL asked Wind River to analyze the potential impacts for (3) and (4). • They concluded that the performance impact would be minimal and that the behavior of select() would not change so long as there was always only one task waiting for any particular file descriptor. This is true in our system. JPL believes that the debate at Wind River still continues on whether the priority inheritance option should be on by default. • For (1) and (2) the change did alter the characteristics of all of the select mutexes. JPL concluded, both by analysis and test, that there was no adverse behavior. JPL tested the system extensively before they changed the software on the spacecraft.

  38. CHANGED THE SOFTWARE ON THE SPACECRAFT • JPL did not use the VxWorks shell to change the software (although the shell is usable on the spacecraft). • The process of "patching" the software on the spacecraft is a specialized process. It involves sending the differences between what you have onboard and what you want (and have on Earth) to the spacecraft. • Custom software on the spacecraft (with a whole bunch of validation) modifies the onboard copy.
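
The idea of sending only the differences and validating them onboard can be sketched as follows. The slides give no detail on the real patch format or validation, so the byte-range chunks, the SHA-256 checks, the function names and the sample images are all assumptions made for illustration.

```python
import hashlib


def make_patch(onboard: bytes, wanted: bytes) -> dict:
    """Ground side: list the byte ranges that differ between two equal-size images."""
    if len(onboard) != len(wanted):
        raise ValueError("this sketch only handles images of equal size")
    chunks = []
    i = 0
    while i < len(onboard):
        if onboard[i] != wanted[i]:
            start = i
            while i < len(onboard) and onboard[i] != wanted[i]:
                i += 1
            chunks.append((start, wanted[start:i]))   # (offset, replacement bytes)
        else:
            i += 1
    return {
        "base_sha256": hashlib.sha256(onboard).hexdigest(),
        "result_sha256": hashlib.sha256(wanted).hexdigest(),
        "chunks": chunks,
    }


def apply_patch(onboard: bytes, patch: dict) -> bytes:
    """Spacecraft side: validate before and after modifying the onboard copy."""
    if hashlib.sha256(onboard).hexdigest() != patch["base_sha256"]:
        raise ValueError("onboard image is not the version the patch was built against")
    image = bytearray(onboard)
    for offset, data in patch["chunks"]:
        image[offset:offset + len(data)] = data
    result = bytes(image)
    if hashlib.sha256(result).hexdigest() != patch["result_sha256"]:
        raise ValueError("patched image failed validation")
    return result


# Hypothetical images: a single option byte changes.
onboard_image = b"options=0x00;other=0x10"
wanted_image = b"options=0x08;other=0x10"
patch = make_patch(onboard_image, wanted_image)
patched_image = apply_patch(onboard_image, patch)
```

Here the whole patch is one chunk of one byte, which shows why sending differences suits a link with very limited bandwidth.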

  39. WHY DIDN’T JPL CATCH IT BEFORE LAUNCH? • The problem would only manifest itself when ASI/MET data was being collected and intermediate tasks were heavily loaded. • Our pre-launch testing was limited to the "best case" high data rates and science activities. • The fact that data rates from the surface were higher than anticipated and the amount of science activities proportionally greater served to aggravate the problem. • We did not expect nor test the "better than we could have ever imagined" case.

  40. Lessons learned • Only detailed traces of actual system behavior enabled the faulty execution sequence to be captured and identified. • Leaving the "debugging" facilities in the system saved the day. Without the ability to modify the system in the field, the problem could not have been corrected. • Finally, the engineer's initial analysis that "the data bus task executes very frequently and is time-critical -- we shouldn't spend the extra time in it to perform priority inheritance" was exactly wrong. • It is precisely in such time-critical and important situations that correctness is essential, even at some additional performance cost.

  41. Lessons learned – human factors • JPL engineers later confessed that one or two system resets had occurred in their months of pre-flight testing. They had never been reproducible or explainable, and so the engineers, in a very human-nature response of denial, decided that they probably weren't important, using the rationale "it was probably caused by a hardware glitch". • Part of it too was the engineers' focus. They were extremely focused on ensuring the quality and flawless operation of the landing software. Should it have failed, the mission would have been lost. It is entirely understandable for the engineers to discount occasional glitches in the less-critical land-mission software, particularly given that a spacecraft reset was a viable recovery strategy at that phase of the mission.

  42. Priority inversion solution implementations • History • Priority Inheritance - pros and cons • Priority Ceiling - pros and cons

  43. History • Theory provides (at least) two simple solutions to priority inversion: • Priority Inheritance Protocol • Priority Ceiling Protocol • The first is the simplest, while the second has nicer theoretical properties. • The theoretical results (about both) date back to about 1987, while the actual protocols date back considerably earlier. • Burroughs MCP implemented a version of Priority Ceiling Protocol in the 1970s. • Lampson & Redell suggested the Priority Ceiling Protocol (in Mesa) in the late 1970s. • Important IEEE paper: L. Sha, R. Rajkumar & P. Lehoczky, "Priority Inheritance Protocols: An Approach to Real-Time Synchronization", IEEE Transactions on Computers, vol. 39, pp. 1175-1185, Sep 1990.

  44. Priority Inheritance Protocol • Priority Inheritance means that when a thread waits on a mutex owned by a lower priority thread, the priority of the owner is increased to that of the waiter. In the priority inheritance protocol, when a thread locks a mutex its priority is not changed. The action takes place when a thread attempts to lock a mutex owned by another thread. • In this situation the priority of the thread owning the mutex is raised to the priority of the blocked thread (if higher). • When the thread releases the mutex, its old priority (i.e. prior to locking this mutex) is restored. • This prevents unbounded priority inversion, since the low priority thread gets a high priority and thus cannot be pre-empted by a medium priority thread.
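
The rules on slide 44 can be written down as a small bookkeeping sketch. It only tracks priorities and ownership and does not schedule anything; the `Thread` and `InheritanceMutex` classes are hypothetical, handle a single mutex only, and use larger numbers for higher priority, as on the timing diagrams.

```python
class Thread:
    def __init__(self, name, base_priority):
        self.name = name
        self.base_priority = base_priority
        self.priority = base_priority       # effective priority seen by the scheduler


class InheritanceMutex:
    def __init__(self):
        self.owner = None
        self.waiters = []
        self._saved_priority = None

    def lock(self, thread):
        """Return True if acquired, False if the caller has to block."""
        if self.owner is None:
            self.owner = thread
            self._saved_priority = thread.priority   # locking alone changes nothing
            return True
        self.waiters.append(thread)
        if thread.priority > self.owner.priority:
            self.owner.priority = thread.priority    # owner inherits the waiter's priority
        return False

    def unlock(self, thread):
        """Restore the owner's old priority and hand the mutex to the top waiter."""
        if thread is not self.owner:
            raise RuntimeError("only the owner may unlock")
        thread.priority = self._saved_priority
        self.owner = None
        if not self.waiters:
            return None
        successor = max(self.waiters, key=lambda t: t.priority)
        self.waiters.remove(successor)
        self.lock(successor)
        return successor


low, medium, high = Thread("C", 30), Thread("B", 35), Thread("A", 40)
mutex = InheritanceMutex()

got_it = mutex.lock(low)            # low priority thread takes the mutex
high_got_it = mutex.lock(high)      # high priority thread blocks; low is boosted
boosted = low.priority
medium_can_preempt = medium.priority > low.priority
successor = mutex.unlock(low)       # priority restored, mutex handed to A
```

While the low priority thread holds the mutex it runs at priority 40, so the medium priority thread (35) cannot preempt it and the wait of the high priority thread is bounded by the critical section.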
