4.2. Identifying sources of resilience: learning from what goes well
One of the aims of Resilience Engineering is to learn from everyday performance and successful operations, rather than only from lessons learned after failures. In line with this, identifying Sources of Resilience means investigating the mechanisms by which organizations successfully handle expected and unexpected conditions. Such mechanisms (e.g., strategies, processes, tools) allow organizations to adapt, perform and deliver required services in spite of the variability and complexity they experience in their operations. This adaptive capacity can be recognized by looking at work-as-done, both in daily operations and in unusual or exceptional scenarios, in order to identify sources of resilience and to learn from what goes well.
- 1 Implementation
- 1.1 Introduction
- 1.2 Before a crisis
- 1.3 During a crisis
- 1.4 After a crisis
- 2 Understanding the context
- 2.1 Detailed objectives
- 2.2 Targeted actors
- 2.3 Expected benefits
- 2.4 Relation to adaptive capacity
- 2.5 Relation to risk management
- 2.6 Illustration
- 2.7 Implementation considerations
- 3 Relevant material
- 3.1 Relevant Practices, Methods and Tools
- 3.2 References
- 3.3 Terminology
- 4 Navigate in the DRMG
Introduction
Organizations need to invest in understanding everyday operations in order to be better prepared for crisis situations. Resources for building up and maintaining this understanding need to be allocated as an investment in retaining, enhancing or amplifying the organization's (or organizations') resilient capabilities. Among other resources, this means that experts need time to share their views on the functioning of the system, and that facilitators or analysts (possibly experts on resilience management) need to be available to compile this knowledge so that the organization can learn from it in a methodical manner.
To identify sources of resilience:
- Build the necessary skills to understand and identify sources of resilience at different levels of the organization.
- Select methods for the identification of possible sources of resilience with the involvement of roles and actors at different levels in the organization, making sure to account for an adequate diversity of perspectives. In order to achieve such diversity, combine individual interviews and workshop-based techniques, taking into account time constraints and availability of resources.
- Plan the methods around triggering questions to be used as a guide for defining and describing margins and couplings in daily operations (triggering questions before) or for looking back at past events to identify successful skills, strategies, and procedures (triggering questions after).
- Use the outcome of your analysis to revise your internal guidelines and training, or to create ad-hoc ones.
Before a crisis
The following triggering questions can be used to guide a discussion aimed at understanding work-as-done, both in daily operations and in situations of crisis.
This can be done through a number of activities, such as dedicated workshops, interviews, group interviews, over-the-shoulder observations, and observational studies that inform analyses. The analyses themselves can be part of other activities, such as safety, security, and change management activities, audits, safety assessments, and concept design sessions.
The discussion should be seen as a way to improve the capability of the organization to react to a crisis, by revising internal guidelines and procedures in light of the existing practices that have been shown to work well.
- Which strategies (e.g. working methods or contingency procedures) can be used to handle a sudden loss of capacity and/or increase in demands?
- For which events is there a response ready?
- How and when can existing roles and tasks be reorganized in response to such events?
- Are personnel exposed to unusual situations as part of their training?
- Which margins are available in everyday operational situations that can be used to handle suddenly increased demands?
- Which margins have been defined and anticipated beforehand?
- How is it possible to increase existing margins?
- When is it necessary to negotiate this increase with other actors? With which actors?
- Are there criteria to establish when it is possible to revert to the original margins?
- How and when can additional resources (human, technical, material) be allocated/called in to supplement existing ones?
- What back-up (incl. legacy) resources and working methods are available? Are personnel (still) familiar enough with these to use them readily?
- What kind of coordination with other actors needs to be established for additional resources?
- Are there criteria to establish when it is possible to revert to the original set of resources?
- Which roles in the organization can monitor the margins/resources available, both during and after an unexpected increase in demands?
- How are margins/resources monitored?
- Which monitoring mechanisms are put in place by the organization to anticipate and assess possible threats that may occur in the future?
- During the management of everyday operations or crises, are there different goals that may come in conflict (e.g. ensuring adequate safety margins vs. minimizing economic losses)?
- How do operators succeed in meeting conflicting goals and finding an appropriate balance among them?
Dependencies and interactions:
- What strategies (could) foster smooth coordination among actors and minimize constraints and bottlenecks?
- Where does more effort need to be spent to understand the potential for small variations in conditions and performance outcomes to combine, propagate, and amplify across organizations (so-called “cascading”, “butterfly” or “snowball” effects)?
- What do operators (need to) know about the other parts of the system that they are interacting with?
- How are the formal and informal networks that are useful in handling crises nurtured?
During a crisis
Observe and document the application of procedures, methods, etc. and their outcomes, not only when they fail but also when they succeed. Take a step back and reflect on whether conflicting goals are balanced appropriately, where more adaptive capacity is needed, and whether complexity is handled appropriately.
- Where do we never experience (this problem/good operation)? Why is that?
- Is the organization flexible, adaptable? To what extent and in what way can the organization change to adapt to demands?
- Do we support colleagues in case of overload?
- Do we have people available with different competences that can take different roles if required?
After a crisis
The following triggering questions can be used after an actual crisis that was successfully managed, in order to understand which of the existing practices have been shown to work well. This can be done through a number of activities, such as dedicated workshops, debriefing sessions, after-action reviews, exercise analyses, interviews, group interviews, incident investigations, and lessons learned analyses. Examples of what can be done in these activities using the triggering questions are:
- Analyzing the differences between the intended use of procedures and their actual use during the crisis (understanding which surprises were experienced and which strategies or working methods turned out to be successful).
- Sharing case studies between organizations (explaining what happened, from the point of view of those involved, and asking the participants how they would have reacted to the same situation).
- Proposing changes and/or adaptations to existing plans, resource allocations, guidelines, and procedures, based on what was learnt from the crisis.
- Which strategies (e.g. working methods or contingency procedures) were used to handle sudden losses of capacity and/or increases in demands?
- Were the existing roles reorganized in response to such events?
- Was the allocation of tasks among different actors modified?
- Were the situations experienced in the context of training activities useful to handle the situation?
- Which margins were actually available to handle sudden losses of capacity and/or increases in demands?
- Which of these margins were defined and anticipated beforehand?
- As the crisis developed, was an adjustment of the margins required?
- Was it necessary to negotiate margin adjustments with other actors?
- If the available margins were changed during the crisis, when was it possible to revert to the original margins?
- Was it necessary to allocate/call in additional resources (human, technical, material) as the crisis developed?
- Was coordination with other actors needed in order to allocate/call in such additional resources?
- If additional resources were called in from other organizations or from other departments, when was it possible to release them back?
- Which roles in the organization monitored the margins/resources available?
- How were margins/resources actually monitored?
- Were the threats experienced during the crisis somehow anticipated by the available monitoring mechanisms?
- In which way did the available monitoring mechanisms help to anticipate the threats?
- During the management of the crisis, did we experience situations of conflicting goals that affected our way of managing it?
- How did the operators succeed in meeting conflicting goals and finding the appropriate balance between them (e.g. ensuring adequate safety margins vs. minimizing economic losses)?
Dependencies and interactions:
- Which strategies worked best to minimize constraints and bottlenecks when coordinating among different actors?
- How did knowledge of other parts of the organization help in handling sudden losses of capacity and/or increases in demands?
- Which strategies worked to minimize the cascading effects of the crisis?
- How can we improve existing training by taking into account successful synergies with different organizations/departments experienced during the handling of the crisis?
Understanding the context
Detailed objectives
One of the aims of Resilience Engineering is to provide a deeper understanding of everyday performance, in order to learn not only from failures but also from successful operations. Resilience management should be based not only on the analysis of risk and of the "brittleness" revealed by failures during incidents and crises, but also on an understanding of all outcomes of everyday operations, including the positive ones. Learning from what goes well, both during normal operations in safety-critical work and when incidents and crises occur, can support better preparedness and learning, thus increasing resilience. The study of everyday operations can reveal how the organization manages normal conditions by adapting to events as they occur, and also how and when procedures are adapted.
Targeted actors
Actors that may benefit from this topic include those involved in safety, security, and change management activities, audits, safety assessments, concept development sessions, debriefing sessions, after-action reviews, exercise analyses, and incident investigations. They may include policy makers, middle and line management, operational management, and a variety of operational roles.
Expected benefits
The expected benefit is an enhanced understanding of everyday situations, focusing on the essential functions that make a critical infrastructure work. The organization can use this understanding to retain, enhance or amplify the organization's (or organizations') resilient capabilities, thereby ensuring that everyday processes go well as often as possible.
Relation to adaptive capacity
This capability card is in essence an elaboration on how to identify and increase adaptive capacity.
Relation to risk management
Identifying sources of resilience supports investments in the ability to maintain operations and their continuity, for different kinds of systems and organizations at different levels.
Illustration
"High Workload at the Maternity Ward
A remarkably large number of births one evening led to chaos at the maternity ward. The ward was understaffed and no beds were available for more patients arriving. Also, patients from the emergency room with gynaecological needs were being directed to the maternity ward as the emergency room was overloaded. To cope with the situation one of the doctors started to free resources by sending all fathers of the new-born babies home. Although not a popular decision among the patients this re-organization freed up beds, allowing the staff to increase their capacity and successfully manage all the patients and births. After this incident an analysis of the situation was performed that resulted in a new procedure for “extreme load at maternity hospital”. The system demonstrated several important abilities contributing to system resilience as it uses its adaptive capacity to respond to and learn from the event" (Rankin et al., 2013).
Implementation considerations
Initial familiarisation with resilience concepts, in particular the understanding of everyday work when nothing goes wrong.
Implementation can vary based on the number of dedicated workshops. Typically, focus groups engage 4-8 experts and 2 facilitators for about a day, but the number of focus groups or workshops (and experts) depends on the scope of the analysis. For example, for small systems/organizations a single workshop or focus group may be sufficient, but for larger systems/organizations natural boundaries between subparts may be defined, for which a number of workshops are run. Note that the integration and interactions between subparts deserve explicit and dedicated attention.
It is also possible to complement existing practices in the organization, for instance by including the proposed triggering questions while planning or reviewing operations, or during audits.
Pre-workshop and follow-up analysis and fact checking may also be expected according to standard workshop, focus group, or interview methodologies.
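The typical workshop sizing given above (4-8 experts and 2 facilitators for about a day) can be turned into a rough effort range when planning several workshops. The following is a minimal sketch only: the function name `workshop_effort` and its parameters are hypothetical, and the assumption that every workshop has the same composition and duration is a simplification. Pre-workshop preparation and follow-up analysis are not included in the estimate.

```python
def workshop_effort(num_workshops, days_per_workshop=1.0):
    """Return a (low, high) range of person-days for a series of workshops.

    Assumes (hypothetically) that each workshop has the typical composition
    quoted in the text: 4-8 experts plus 2 facilitators, for about one day.
    Preparation and follow-up analysis are NOT counted.
    """
    if num_workshops < 1:
        raise ValueError("num_workshops must be at least 1")

    min_experts, max_experts = 4, 8  # range quoted in the text
    facilitators = 2                 # number quoted in the text

    low = num_workshops * (min_experts + facilitators) * days_per_workshop
    high = num_workshops * (max_experts + facilitators) * days_per_workshop
    return low, high


# A small organization covered by a single workshop
print(workshop_effort(1))  # (6.0, 10.0)

# A larger organization split into three subparts, one workshop each
print(workshop_effort(3))  # (18.0, 30.0)
```

For a larger organization, an additional workshop dedicated to the interactions between subparts would increase `num_workshops` by one.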
Relevant Practices, Methods and Tools
Understanding the difference between how work is assumed or expected to be done (Work-as-Imagined) and how it is actually done (Work-as-Done) (see Herrera, et al, 2017):
- Teach the value of open-ended questions and how to ask them (Schein, 2013).
- Implement “Learning Teams” in your inquiry, where Work-as-Imagined and Work-as-Done are investigated (Hollnagel, 2017; Conklin, 2012).
- Patient safety senior executive walk-arounds to understand how the work gets done on the frontlines.
- Prepare to shift people for the “unexpected” such as environmental disasters or threats such as chemical spills or earthquakes, riots, terrorist attacks, and epidemics.
- Overcapacity protocols to manage overcrowding in emergency departments.
- Development of “rapid assessment zones” to reduce overcrowding in emergency departments.
- Do simulations involving surprises as part of a certification program.
- Share case studies between plants that tell the story from the point of view of those involved, up to just before revealing what happened, and then ask: “What would you do? How could this play out? What would you do to avoid/support…?”
Resilience Analysis Grid (RAG), with questions related to the resilience potentials to anticipate, monitor, respond and learn (Hollnagel, 2017, latest version of RAG).
Critical incident investigation work that uses a framework based on resilience perspectives (Health care Canada).
References
- Berkes, F., & Folke, C. (2002). Back to the future: ecosystem dynamics and local knowledge. In L. H. Gunderson & C. S. Holling (Eds.), Panarchy: understanding transformations in human and natural systems (pp. 121-146). Washington, D.C.: Island Press.
- Braithwaite, J. (2015). Re-conceptualising patient safety through innovation and systems change. Seminar, September 29, 2015, Hong Kong. Available at: https://s3-eu-west-1.amazonaws.com/bmj-internationalforum/pdfs/Asia+Forum/A3+-+Jeffrey+Braithwaite+slides.pptx
- Cavallo, A. & Ireland, V. (2014). Preparing for complex interdependent risks: A System of Systems approach to building disaster resilience. International Journal of Disaster Risk Reduction, 9, 181–193.
- Djalante, R., Holley, C., Thomalla, F., & Carnegie, M. (2013). Pathways for adaptive and integrated disaster resilience. Natural Hazards, 69(3), 2105–2135.
- Furniss, D., Back, J., Blandford, A., Hildebrandt, M., & Broberg, H. (2011). A resilience markers framework for small teams. Reliability Engineering & System Safety, 96(1), 2-10.
- Gero, A., Fletcher, S., Rumsey, M., Thiessen, J., Kuruppu, N., Buchan, J., Daly, J., & Willetts, J. (2015). Disasters and climate change in the Pacific: adaptive capacity of humanitarian response organizations. Climate and Development, 7(1), 35–46.
- Hémond, Y. & Robert, B. (2014). Assessment process of the resilience potential of critical infrastructures. International Journal of Critical Infrastructures, 10(3-4), 200-217.
- Hoffman, R. R., & Woods, D. D. (2011). Beyond Simon’s Slice: Five Fundamental Trade-Offs that Bound the Performance of Macrocognitive Work Systems. IEEE Intelligent Systems, 26(6), 67–71.
- Hollnagel, E. (2004). Barriers and accident prevention. Aldershot, UK: Ashgate.
- Hollnagel, E. (2014). Is safety a subject for science? Safety Science, 67, 21-24.
- Hollnagel, E., Woods, D. D., & Leveson, N. (2006). Resilience engineering: concepts and precepts. UK: Ashgate Publishing, Ltd.
- MSB (2014). Gemensamma grunder för samverkan och ledning vid samhällsstörningar (MSB777). Myndigheten för samhällsskydd och beredskap (MSB).
- Rankin, A., Dahlbeck, N., & Lundberg, J. (2013). A case study of factor influencing role improvisation in crisis response teams. Cognition, Technology & Work, 15(1).
- Shirali, G. A., Motamedzade, M., Mohammadfam, I., Ebrahimipour, V., & Moghimbeigi, A. (2016). Assessment of resilience engineering factors based on system properties in a process industry. Cognition, Technology and Work, 18(1), 19–31.
- Van Der Beek, D., & Schraagen, J. M. (2015). ADAPTER: Analysing and developing adaptability and performance in teams to enhance resilience. Reliability Engineering and System Safety, 141, 33-44.
- Woods, D. D. (2003). Creating foresight: how resilience engineering can transform NASA’s approach to risky decision making. Work, 4(2), 137–144.
- Woods, D. D. (2006). Essential characteristics of resilience. In E. Hollnagel, D. D. Woods, & N. Leveson (Eds.), Resilience engineering: Concepts and precepts (pp. 21–34). Aldershot, UK: Ashgate.
Terminology
For resilience, DARWIN adopts the following working definition: "The ability to resist, absorb, accommodate to and recover from the effects of disturbances and changes in a timely and efficient manner, including through adaptation and restoration of basic structures and functions" (Source: DARWIN D1.1, 2015).
Some widely used related definitions that this working definition is based on:
"Adaptive capacity of an organization in a complex and changing environment. Note Resilience is the ability of an organization to manage disruptive related risk" (Source: ISO 22300).
"The ability of a system, community or society exposed to hazards to resist, absorb, accommodate to and recover from the effects of a hazard in a timely and efficient manner, including through the preservation and restoration of its essential basic structures and functions. Comment: Resilience means the ability to "resile from" or "spring back from" a shock. The resilience of a community in respect to potential hazard events is determined by the degree to which the community has the necessary resources and is capable of organizing itself both prior to and during times of need." (Source: UNISDR, 2009).
"Intrinsic ability of a system or organization to adjust its functioning prior to, during, or following changes, disturbances, and opportunities so that it can sustain required operations under both expected and unexpected conditions" (Source: Hollnagel, 2014)
Work as done refers to what people actually do as part of their work (Work-as-Done, WAD), as opposed to the assumptions or expectations of what other people do as part of their work (Work-as-Imagined, WAI) (Source: Hollnagel, 2018, p. 17).
Work as imagined refers to the assumptions or expectations of what other people do as part of their work (Work-as-Imagined, WAI), while that which people actually do is called Work-as-Done (WAD). The term 'imagined' is not used in an uncomplimentary or negative sense but simply recognises that our descriptions of work will never completely correspond to work as it takes place in practice, i.e. as it is actually done (Source: Hollnagel, 2018, p. 17-18). Work-as-Imagined also covers how work is thought of either before it takes place, when it is being planned, or after it has taken place, when the consequences are being evaluated (Source: Wears and Hollnagel, 2015).
- Adaptive capacity
"ability of systems, institutions, humans, and other organisms to adjust to potential damage, to take advantage of opportunities, or to respond to consequences" (Source: ISO 14080:2018(en)). "The adaptive capacity of a system is usually assessed by observing how it responds to disruptions or challenges. Adaptive capacity has limits or boundary conditions, and disruptions provide information about where those boundaries lie and how the system behaves when events push it near or over those boundaries" (Source: Woods and Cook, 2006, p. 69).
Margin refers to "how closely or how precarious the system is currently operating relative to one or another kind of performance boundary" (from Woods, 2006 - Woods, D. D. "Essential Characteristics of Resilience." In Resilience Engineering: Concepts And Precepts, edited by E. Hollnagel, D. D. Woods, and N. Leveson, 19–30. Aldershot, UK: Ashgate, 2006.)
Brittleness describes how rapidly a system's performance declines when it nears and reaches its boundary conditions (Source: Woods, 2015).
Coupling (loose/tight) refers to the time-dependency of a process, the flexibility of action sequences, the number of ways to achieve a goal, and the availability of slack in operational resources (from Perrow, 1984 - Perrow, Charles. Normal Accidents: Living with High-Risk Technologies. New York: Basic Books, 1984.)
Navigate in the DRMG
- Parent theme: Assessing resilience
- Parent card: Assessing resilience (card old)
- Resilience abilities
- Categories: Evaluation, Situation understanding, Learning lessons, Planning, Training, Governance, Procedures
- Functions of crisis management: BEFORE, Preparation, Build knowledge of crisis situations, DURING, Damage control and containment, Assess emergency and response, AFTER, Learning, Assess performance