7.2 Data Centre Operations
Key Takeaways
- SOPs document routine repeatable tasks, MOPs give step-by-step instructions with risk assessment and back-out plans for a specific change, and EOPs define emergency response; abnormal-operating procedures cover off-normal but non-emergency conditions.
- An SLA is provider-to-customer, an OLA is internal between teams, and an underpinning contract (UC) is with an external third-party supplier such as a generator or carrier vendor.
- Maintenance strategies are preventive (schedule-based), predictive (condition/monitoring-based), and corrective/run-to-failure (after breakdown).
- Capacity management tracks power, cooling, space, and network headroom to avoid stranded capacity; DCIM is the primary planning tool.
- Change management routes non-routine work through an approved MOP, a risk assessment, and a maintenance window; high-risk electrical work adds Lockout-Tagout and arc-flash PPE per NFPA 70E.
Operational Procedures: SOP, MOP, EOP
Disciplined operations depend on written, version-controlled procedures. The three core document types are examined repeatedly, so master the distinction:
| Document | Full name | Purpose |
|---|---|---|
| SOP | Standard Operating Procedure | Routine, repeatable operations: daily walk-throughs, start-up/shutdown, log checks, alarm acknowledgement |
| MOP | Method of Procedure | Step-by-step instructions for one specific change or maintenance task, with prerequisites, risk assessment, and a back-out plan |
| EOP | Emergency Operating Procedure | Response to emergencies: fire, loss of utility, EPO activation, flooding, or cooling failure |
A fourth category, the abnormal-operating procedure (sometimes abbreviated AOP or AOR), sits between the SOP and the EOP: it governs conditions that are off-normal but not yet an emergency, for example running on generator, a UPS on bypass, or a failed cooling unit while the site is still stable. Note that AOR is also used for "Area of Responsibility", the document that clarifies who owns which system, so read the context. Together these procedures form the operations playbook expected by frameworks such as the Uptime Institute Management & Operations (M&O) Stamp of Approval and the EPI Data Centre Operations Standard (DCOS).
Service Agreements: SLA vs OLA vs Underpinning Contract
Operations must deliver against layered commitments drawn from ITIL and ISO/IEC 20000:
- SLA (Service Level Agreement) — the customer-facing contract that quantifies availability (for example 99.99%), response and restoration times, and credits or penalties for breach.
- OLA (Operational Level Agreement) — an internal agreement between teams within the same provider (for example the facilities team committing a two-hour response to the hosting team). OLAs underpin the SLA but never involve the customer.
- Underpinning Contract (UC) — a contract with an external third-party supplier, such as a generator maintenance vendor, chiller service company, or telecom carrier. If a UC promises a four-hour parts response, the SLA to the customer cannot credibly promise faster.
The key exam trap is direction: SLA is external to the customer, OLA is internal, and the UC is external to a supplier. A chain of realistic OLAs and UCs is what makes an SLA achievable.
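The arithmetic behind an availability figure, and the rule that an SLA cannot outrun its supporting commitments, can be sketched in a few lines of Python. This is a minimal illustration, not a contractual method: the function names are hypothetical, a 365-day year is assumed, and the feasibility check simply compares the SLA restoration time with the slowest supporting OLA or UC, ignoring cases where commitments add up in sequence.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored (assumption)


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Annual downtime budget implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def sla_is_credible(sla_restore_hours: float, supporting_hours: list[float]) -> bool:
    """True if the SLA restoration time is no faster than the slowest
    supporting OLA/UC commitment (simplified: commitments run in parallel)."""
    slowest = max(supporting_hours, default=0.0)
    return sla_restore_hours >= slowest


# 99.99% availability leaves under an hour of downtime per year.
print(round(allowed_downtime_minutes(99.99), 2))  # 52.56

# A 2-hour SLA backed by a 4-hour parts-response UC is not credible.
print(sla_is_credible(2.0, [2.0, 4.0]))  # False
```

The second call mirrors the example in the text: the four-hour underpinning contract sets the floor for what the SLA can promise.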
Maintenance Strategies
Data centre reliability is largely a maintenance discipline. Three strategies appear on the exam:
- Preventive (planned) maintenance follows a fixed schedule regardless of current condition — quarterly generator load-bank tests, annual UPS battery capacity tests, filter changes. It reduces failure probability but can replace healthy parts early.
- Predictive (condition-based) maintenance triggers work from monitored condition data such as vibration, thermal imaging, oil analysis, or battery impedance trends. It times intervention just before failure, cutting both downtime and wasted parts.
- Corrective maintenance repairs equipment after a fault; run-to-failure deliberately waits for breakdown on non-critical, low-cost components. Reliability-Centred Maintenance (RCM) blends these strategies by criticality.
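The difference between calendar-driven and condition-driven triggers can be made concrete with a short Python sketch. Everything here is an assumption for illustration: the function names are hypothetical, and the 25% impedance-rise threshold is a placeholder rather than a value taken from any standard or manufacturer.

```python
from datetime import date, timedelta


def preventive_due(last_done: date, interval_days: int, today: date) -> bool:
    """Preventive: work is due once the fixed interval has elapsed,
    regardless of equipment condition."""
    return today >= last_done + timedelta(days=interval_days)


def predictive_due(readings: list[float], baseline: float, max_rise_pct: float) -> bool:
    """Predictive: work is due when the latest monitored reading has risen
    by at least max_rise_pct relative to the commissioning baseline."""
    if not readings:
        return False  # no condition data, so no condition-based trigger
    rise_pct = (readings[-1] - baseline) / baseline * 100
    return rise_pct >= max_rise_pct


# Calendar trigger: a 90-day interval counted from 1 January 2024.
print(preventive_due(date(2024, 1, 1), 90, date(2024, 3, 31)))  # True

# Condition trigger: battery impedance (milliohms) up 30% against baseline.
print(predictive_due([5.0, 5.2, 6.5], baseline=5.0, max_rise_pct=25.0))  # True
```

Run-to-failure needs no trigger function at all, which is the point of that strategy: the breakdown itself raises the work order.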
Capacity Management
Capacity management ensures that none of the facility's four resources (power, cooling, physical space, and network) runs out while the others still have headroom. A cabinet may have rack units free yet be power-capped; a row may have power yet lack cooling headroom. Capacity that exists but cannot be used because a paired resource is exhausted is called stranded capacity. Operators use DCIM (Data Center Infrastructure Management), fed with real-time power and thermal data from the EPMS and BMS, to model headroom, forecast growth, and place new equipment where all four resources are available.
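Stranded capacity is easiest to see as a small calculation. The sketch below is a simplified model, not how any particular DCIM product works: it assumes every new device has the same hypothetical footprint, that all per-device requirements are positive, and the resource keys and figures are invented for illustration.

```python
RESOURCES = ("power_kw", "cooling_kw", "space_u", "network_ports")


def deployable_units(free: dict[str, float], per_unit: dict[str, float]) -> int:
    """Number of identical devices that fit, limited by the scarcest resource."""
    return int(min(free[r] // per_unit[r] for r in RESOURCES))


def stranded_capacity(free: dict[str, float], per_unit: dict[str, float]) -> dict[str, float]:
    """Headroom left in each resource after the binding resource is exhausted."""
    n = deployable_units(free, per_unit)
    return {r: free[r] - n * per_unit[r] for r in RESOURCES}


# Hypothetical cabinet: plenty of space and ports, but little power left.
free = {"power_kw": 2.0, "cooling_kw": 6.0, "space_u": 20, "network_ports": 24}
per_unit = {"power_kw": 0.5, "cooling_kw": 0.5, "space_u": 2, "network_ports": 2}

print(deployable_units(free, per_unit))   # 4 devices: power is the binding limit
print(stranded_capacity(free, per_unit))  # cooling, space and ports are stranded
```

In this example the cabinet is power-capped: 12 U of space and 16 ports remain but cannot be used until more power is provisioned.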
Change Management
Every non-routine action is a risk to uptime, so change management wraps it in controls. A proposed change is documented in a MOP, assessed for risk, reviewed (often by a change advisory board), and scheduled into a maintenance window. High-risk electrical work additionally requires Lockout-Tagout (LOTO) to prevent inadvertent re-energisation and arc-flash-rated PPE selected from an incident-energy analysis per NFPA 70E. A sound change always defines success criteria, abort/back-out triggers, and post-change verification, because human error during change is widely cited as a leading contributor to data centre outages, ahead of spontaneous equipment failure.
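The controls described above amount to a pre-execution checklist, which can be sketched as a simple gate. This is an illustrative model only: the `ChangeRequest` fields and the `blocking_gaps` function are hypothetical names, not part of any change-management tool or standard.

```python
from dataclasses import dataclass


@dataclass
class ChangeRequest:
    """Hypothetical change record; field names are illustrative only."""
    mop_approved: bool
    risk_assessed: bool
    window_scheduled: bool
    success_criteria: str
    backout_triggers: str
    post_change_verification: str
    high_risk_electrical: bool = False
    loto_planned: bool = False
    arc_flash_ppe_selected: bool = False


def blocking_gaps(cr: ChangeRequest) -> list[str]:
    """Return the controls still missing; an empty list means work may proceed."""
    checks = [
        (cr.mop_approved, "approved MOP"),
        (cr.risk_assessed, "risk assessment"),
        (cr.window_scheduled, "maintenance window"),
        (bool(cr.success_criteria.strip()), "success criteria"),
        (bool(cr.backout_triggers.strip()), "abort/back-out triggers"),
        (bool(cr.post_change_verification.strip()), "post-change verification"),
    ]
    if cr.high_risk_electrical:
        # Electrical work adds LOTO and arc-flash PPE on top of the base controls.
        checks += [
            (cr.loto_planned, "LOTO plan"),
            (cr.arc_flash_ppe_selected, "arc-flash PPE selection"),
        ]
    return [name for ok, name in checks if not ok]


cr = ChangeRequest(True, True, True, "load on UPS B", "alarm on transfer",
                   "both feeds live", high_risk_electrical=True)
print(blocking_gaps(cr))  # ['LOTO plan', 'arc-flash PPE selection']
```

Returning the list of gaps, rather than a bare yes/no, gives the reviewer something actionable to send back to the change owner.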
Operational Roles and Incident Management
Sound operations also depend on clear human structure. A staffed site runs on defined shifts with an Area of Responsibility (AOR) matrix or RACI chart stating who is accountable for each system, so no task falls between teams. Incidents are logged, triaged by severity, escalated on a defined path, and closed with a root-cause analysis whose lessons feed back into updated SOPs, MOPs, and EOPs — a continuous-improvement loop that maturity frameworks such as the EPI DCOS and the Uptime M&O Stamp assess directly. Shift handovers use a structured log so that in-progress abnormal conditions — a generator run, a UPS on bypass, a suppressed alarm — are never lost between crews, and every maintenance window closes only after post-work verification confirms the site is back to its normal, fully redundant state.
Common operations traps
- SLA vs OLA direction: the SLA faces the customer, the OLA is internal, and an underpinning contract faces an external supplier.
- SOP vs MOP: an SOP is the routine 'what to do'; a MOP is the change-specific 'how to do it' with a back-out plan.
- Preventive vs predictive: preventive follows the calendar, predictive follows the condition data.
- Human error: mistakes made during change are a leading contributor to outages, so no high-risk electrical work proceeds without an approved MOP, LOTO, and NFPA 70E arc-flash PPE.
Capacity management and change management are two sides of one coin: capacity management decides whether new load can be added safely, while change management controls how it is added. Both lean on DCIM as the shared system of record, and both ultimately exist to protect the availability the SLA promises.
Which agreement governs commitments between two internal teams within the same data centre provider, rather than with the customer or an outside supplier?
A team replaces a UPS battery string only when online battery-impedance and thermal-imaging data show it is beginning to degrade. Which maintenance strategy is this?
Which document provides step-by-step instructions for a specific change, including prerequisites, risk assessment, and a back-out plan?