By Thomas Hornbæk Svendsen, Subject Matter Expert, NNIT
The availability of IT systems is critical to any life science company. Planning, quality assurance and quality control, production and logistics will all be severely affected if the unlikely occurs and one or more IT systems are unavailable for a long period of time. Unlikely events do occur – often at the most inconvenient moments. To prepare for this situation, you need a technical recovery plan that gets the systems back online as fast as possible and ensures that data is restored.
There’s more to getting back to normal operation than simply restoring IT systems and the services they provide – it is also crucial to restore data to a well-defined moment. If no plans for system and data recovery have been established, it will be difficult to determine whether data managed by the systems has been compromised, either as a result of the actual breakdown (e.g., data transfers not being committed or properly rolled back) or as a result of an incorrect restore of databases, security settings, audit trails and more. If the data is GxP critical, this matters even more.
Planning is essential for a successful recovery. Without documented recovery plans in place, luck – in combination with hard work – will decide whether a recovery is successful, or cumbersome and potentially catastrophic. But what does a proper recovery plan look like in terms of scope, sequence of events, level of detail and organisational setup?
To facilitate a successful restoration of an IT system, it is necessary to prepare and maintain a Technical Recovery Plan (TRP). The TRP must include a well-defined scope of the services it covers, the sequence of tasks needed to restore normal operation, dependencies on other systems and services, and a clear definition of when the restore can be considered successful. It is also critical to define the required organisational setup, covering technical capabilities, chain of command and decision-making authority.
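The elements listed above can be captured as a simple checklist structure. The following is a minimal sketch only: the class name `TechnicalRecoveryPlan`, its field names and the example system names are hypothetical and chosen for illustration, and a real TRP would normally live in a controlled document rather than in code.

```python
from dataclasses import dataclass, field


@dataclass
class TechnicalRecoveryPlan:
    """Hypothetical container for the elements a TRP must include."""
    scope: list[str] = field(default_factory=list)             # services covered
    restore_sequence: list[str] = field(default_factory=list)  # ordered restore tasks
    dependencies: list[str] = field(default_factory=list)      # other systems and services
    success_criteria: list[str] = field(default_factory=list)  # when the restore is done
    roles: dict[str, str] = field(default_factory=dict)        # role -> responsible person

    def missing_sections(self) -> list[str]:
        """Return the names of sections that are still empty."""
        return [name for name, value in vars(self).items() if not value]


# Illustrative draft plan with two sections not yet filled in.
draft = TechnicalRecoveryPlan(
    scope=["LIMS application", "LIMS database"],
    restore_sequence=["LIMS database", "LIMS application"],
    success_criteria=["Audit trail verified against last backup"],
)
print(draft.missing_sections())
```

A check like this only shows that a section has been filled in, not that its content is adequate; that judgement remains with the people who review the plan.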
With proper preparation and consolidation, the TRP lets the organisation rely on documented methodology and procedures. In this way, the organisation no longer depends on the knowledge of particular individuals when systems and services must be recovered.
Start out by defining the scope of the TRP – what should be covered, what should not be covered, and are there any dependencies that must be taken into consideration? Defining the scope is critical. A full recovery of services depends on a thorough analysis of both system and information architecture. The relation between the individual components of the IT systems must be described, as well as the data flow and whether data is GxP critical. Without proper scoping, data integrity might be compromised.
When working with the scope, it is useful to lay out the first version of the system drawing. Often, system components and dependencies that were initially overlooked will be uncovered, and the drawing will help you verify the integrity of the Configuration Management Database (CMDB) and whether the system has been subject to adequate configuration and change control. Expect to be surprised and to have to adjust the scope as part of an iterative process.
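Once the components and their dependencies are drawn up, a restore sequence can be derived from them mechanically. The sketch below assumes a hypothetical set of components (the names are invented for illustration) and uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+) to produce an order in which every component is restored after the components it depends on. It also flags dependencies that are referenced but never declared, which is one way overlooked components tend to surface.

```python
from graphlib import TopologicalSorter

# Hypothetical system drawing: each component maps to the components
# it depends on (these must be restored first).
dependencies = {
    "storage": set(),
    "network": set(),
    "database": {"storage", "network"},
    "app_server": {"database"},
    "web_frontend": {"app_server", "network"},
}


def restore_order(deps: dict[str, set[str]]) -> list[str]:
    """Return components ordered so that dependencies come first."""
    declared = set(deps)
    referenced = set().union(*deps.values())
    missing = referenced - declared
    if missing:
        # A dependency that was never declared is outside the documented scope.
        raise ValueError(f"undeclared components: {sorted(missing)}")
    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(deps).static_order())


order = restore_order(dependencies)
print(order)
```

Several valid orders may exist, so the exact output is not fixed; what matters is that no component appears before something it depends on.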
Along with these considerations, it is of paramount importance to define what constitutes a disaster: is it when parts of the system are unavailable, when the system cannot provide adequate support to the organisation or to customers, or is it when there is a complete failure of services? Furthermore, it must be considered how to establish adequate governance procedures ensuring that the TRPs are up to date at all times.
The TRP is a plan that must give the coordinator an overview. It should not detail all steps to restore normal operation – such information should only be found in procedures comprising the detailed steps needed to restore each component within the scope. Defining the level of detail to be contained in a TRP is a balancing act – the list below serves as a guideline of elements to include:
The cornerstones of any TRP are a well-defined Recovery Point Objective (RPO) and Recovery Time Objective (RTO). The RPO defines the acceptable amount of data loss, and the RTO defines the duration of time within which a service must be restored. The figure below illustrates the RPO and RTO concept.
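The two objectives can also be expressed as a simple check against an incident timeline. The sketch below is illustrative only: it assumes that data loss is measured as the time between the last recoverable data point and the failure, and that downtime is measured from the failure until service is restored. The names `RecoveryObjectives` and `evaluate_recovery`, the objective values and the timestamps are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rpo: timedelta  # maximum acceptable data loss, expressed as time
    rto: timedelta  # maximum acceptable time to restore the service


def evaluate_recovery(objectives, last_good_data_point, failure_time,
                      service_restored_time):
    """Compare an incident timeline with the RPO and RTO."""
    if not last_good_data_point <= failure_time <= service_restored_time:
        raise ValueError("timestamps must be in chronological order")
    data_loss = failure_time - last_good_data_point
    downtime = service_restored_time - failure_time
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= objectives.rpo,
        "rto_met": downtime <= objectives.rto,
    }


# Illustrative incident: backup at 02:00, failure at 05:30, restored at 12:00.
objectives = RecoveryObjectives(rpo=timedelta(hours=4), rto=timedelta(hours=8))
result = evaluate_recovery(
    objectives,
    last_good_data_point=datetime(2024, 1, 10, 2, 0),
    failure_time=datetime(2024, 1, 10, 5, 30),
    service_restored_time=datetime(2024, 1, 10, 12, 0),
)
print(result)
```

In this example, 3.5 hours of data are lost and the service is down for 6.5 hours, so both objectives are met. A recovery test that fails either check indicates that the backup frequency or the restore procedure needs to be revisited.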
Besides being an opportunity to scrutinise documentation, procedures and responsibilities, a TRP provides the organisation with institutionalised procedures on how to act in case of a system breakdown. A TRP also enables the organisation to react constructively in case of a breakdown and to restore services to a defined point in time, with a defined maximum loss of data and without the risk of declaring the system restore complete too soon. TRPs are therefore a good investment, which will help you resume business as usual much more easily and quickly should the unlikely happen and your IT systems break down.