Michael P. Kasick (Carnegie Mellon University), Priya Narasimhan (Carnegie Mellon University), Kevin Atkinson (University of Utah), Jay Lepreau (University of Utah)
In the large-scale Emulab distributed system, the many failure reports make skilled operator time a scarce and costly resource, as shown by statistics on failure frequency and root cause. We describe the lessons learned with error reporting in Emulab, along with the design, initial implementation, and results of a new local error-analysis approach that is running in production. Through structured error reporting, association of context with each error type, and propagation of both error type and context, our new local analysis locates the most prominent failure at the procedure, script, or session level. Evaluation of this local analysis on a targeted set of common Emulab failures suggests that the approach is generally accurate and will facilitate global fingerpointing, which aims to reliably suggest the root cause of a failure at the system level.
In Proceedings of the Third USENIX Workshop on Real, Large Distributed Systems (WORLDS '06), November 2006
The slides from the WORLDS-06 talk: PDF
Kevin Atkinson <firstname.lastname@example.org> | Last Modified: Tue Nov 14 09:39:45 MST 2006