7.3 Commissioning, Efficiency & Resilience
Key Takeaways
- Commissioning runs in five levels: Cx1 factory witness test, Cx2 site inspection on delivery, Cx3 component/pre-functional verification, Cx4 Integrated Systems Test (IST / pull-the-plug), and Cx5 continuous/performance commissioning.
- PUE = Total Facility Energy / IT Equipment Energy; DCiE = IT Energy / Total Facility Energy = 1/PUE expressed as a percentage; PUE is codified in ISO/IEC 30134-2.
- WUE (ISO/IEC 30134-9) is litres of water per IT kWh and CUE (ISO/IEC 30134-8) is kg CO2e per IT kWh; both are lower-is-better.
- RTO is the maximum tolerable time to restore a service; RPO is the maximum tolerable data loss measured in time since the last good replica.
- Backup sites range from hot (fully provisioned, replicated, recovery in minutes) to warm (hardware but not current data, hours) to cold (space, power, and cooling only, days to weeks).
Commissioning (Cx Levels 1-5)
Commissioning is the structured process of proving that infrastructure was built and performs as designed before it carries live load. The widely used five-level model progresses from factory to fully integrated operation:
| Level | Name | What it proves |
|---|---|---|
| Cx 1 | Factory Witness Test (FWT) | Equipment is tested at the manufacturer before shipment |
| Cx 2 | Site Inspection | Visual inspection on delivery and after installation |
| Cx 3 | Component / Pre-functional Verification | Each installed unit is tested individually (site acceptance) |
| Cx 4 | Integrated Systems Test (IST) | The whole electrical-mechanical chain is tested together, including pull-the-plug fault simulation |
| Cx 5 | Continuous / Performance Commissioning | Ongoing and seasonal verification during operation |
The capstone is Cx 4, the Integrated Systems Test, popularly called pull-the-plug: with load banks emulating the IT load, the utility feed is dropped so the team can verify that the UPS bridges the gap, the generator starts and synchronises, the ATS transfers, cooling re-stabilises, and BMS/EPMS log the event correctly. Only an IST proves the systems work as a system, not just as parts.
Efficiency Metrics: PUE and DCiE
Power Usage Effectiveness (PUE), defined by The Green Grid and standardised as ISO/IEC 30134-2, is the headline efficiency metric:
PUE = Total Facility Energy / IT Equipment Energy
A perfect PUE of 1.0 means every watt entering the facility reaches IT equipment; industry average sits near 1.55, and hyperscale leaders report 1.10-1.20. PUE can never be below 1.0 — a value under 1.0 signals a measurement-boundary error. Its inverse is Data Center Infrastructure Efficiency (DCiE):
DCiE = IT Equipment Energy / Total Facility Energy = 1 / PUE, usually expressed as a percentage.
Worked PUE example
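Suppose a facility draws an average of 1,800 kW in total and the average IT load is 1,200 kW over the same period. Because both figures cover the same period, the ratio of the two powers equals the ratio of the two energies.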
- PUE = 1,800 / 1,200 = 1.5
- DCiE = 1,200 / 1,800 = 0.667 = 66.7%
A PUE of 1.5 means that for every watt of IT load the facility spends an extra 0.5 W on cooling, UPS losses, lighting, and distribution.
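The arithmetic above can be sketched in a few lines of Python. The function names `pue` and `dcie_percent` are illustrative rather than taken from any library, and the sketch assumes both inputs are measured over the same period and boundary.

```python
def pue(total_facility_kw: float, it_kw: float) -> float:
    """Return PUE for a total facility draw and an IT load in the same unit."""
    if it_kw <= 0:
        raise ValueError("IT load must be positive")
    if total_facility_kw < it_kw:
        # A ratio below 1.0 points to a measurement-boundary error.
        raise ValueError("total facility draw cannot be less than the IT load")
    return total_facility_kw / it_kw


def dcie_percent(total_facility_kw: float, it_kw: float) -> float:
    """Return DCiE (the inverse of PUE) as a percentage."""
    return 100.0 / pue(total_facility_kw, it_kw)


ratio = pue(1800, 1200)
print(f"PUE  = {ratio:.2f}")                         # PUE  = 1.50
print(f"DCiE = {dcie_percent(1800, 1200):.1f}%")     # DCiE = 66.7%
print(f"Overhead per IT watt = {ratio - 1:.2f} W")   # Overhead per IT watt = 0.50 W
```

Rejecting a total below the IT load mirrors the rule that a PUE under 1.0 indicates a metering mistake rather than an efficient site.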
WUE, CUE, and Efficiency Best Practices
Two companion metrics extend the picture and are also part of the ISO/IEC 30134 family:
- WUE (Water Usage Effectiveness) = annual water consumed (litres) / IT energy (kWh), standardised as ISO/IEC 30134-9. Source-WUE adds water consumed upstream in power generation.
- CUE (Carbon Usage Effectiveness) = annual carbon emissions (kg CO2e) / IT energy (kWh), standardised as ISO/IEC 30134-8. The Renewable Energy Factor (REF) is ISO/IEC 30134-3.
Lower is better for PUE, WUE, and CUE; REF is the exception, since a higher renewable share is better. Practical efficiency levers include airflow management (blanking panels, brush grommets, hot/cold-aisle containment), raising supply set-points within the ASHRAE recommended 18-27 °C envelope to unlock more free-cooling hours, using air-side or water-side economizers, running UPS in eco-mode, distributing power at higher voltage to cut losses, and consolidating and virtualising IT to raise utilisation.
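The WUE and CUE ratios defined above are simple divisions over a common annual IT energy figure. The sketch below uses made-up annual totals purely for illustration; the function names and every input value are assumptions, not measurements from any real facility.

```python
def wue(annual_water_litres: float, annual_it_kwh: float) -> float:
    """Water Usage Effectiveness in litres per IT kWh."""
    return annual_water_litres / annual_it_kwh


def cue(annual_co2e_kg: float, annual_it_kwh: float) -> float:
    """Carbon Usage Effectiveness in kg CO2e per IT kWh."""
    return annual_co2e_kg / annual_it_kwh


# Hypothetical annual totals, chosen only to make the arithmetic easy to follow.
it_kwh = 10_000_000
water_litres = 18_000_000
co2e_kg = 4_000_000

print(f"WUE = {wue(water_litres, it_kwh):.2f} L/kWh")      # WUE = 1.80 L/kWh
print(f"CUE = {cue(co2e_kg, it_kwh):.2f} kg CO2e/kWh")     # CUE = 0.40 kg CO2e/kWh
```

Both metrics share the PUE denominator (IT energy), so all three should be computed over the same reporting period.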
Business Continuity and Disaster Recovery
Resilience planning starts with a Business Impact Analysis (BIA), as called for by ISO 22301, which sets recovery priorities and two defining targets:
- RTO (Recovery Time Objective) — the maximum acceptable time a service can be down before it must be restored. It drives investment in failover speed and redundancy.
- RPO (Recovery Point Objective) — the maximum acceptable data loss, measured as time since the last good backup or replica. It drives backup and replication frequency.
A one-hour RTO with a fifteen-minute RPO, for example, implies automated failover plus near-continuous replication. Recovery capability is delivered through alternate sites of increasing readiness:
| Site type | Provisioning | Typical recovery |
|---|---|---|
| Hot site | Fully equipped, data continuously replicated (often active-active) | Minutes |
| Warm site | Hardware and connectivity in place, data not fully current | Hours |
| Cold site | Space, power, and cooling only; no active servers | Days to weeks |
Hot sites give the lowest RTO/RPO at the highest cost; cold sites are cheapest but slowest. Choosing among them is a direct trade-off against the RTO and RPO the business is willing to fund.
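The trade-off in the table can be expressed as a small lookup, together with a check of whether a given replica age would satisfy an RPO. This is a minimal sketch: the one-hour and 24-hour cut-offs are assumptions chosen to match the table's "minutes", "hours", and "days to weeks" bands, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed cut-offs; the table only gives qualitative bands.
HOT_MAX_RTO = timedelta(hours=1)
WARM_MAX_RTO = timedelta(hours=24)


def site_tier_for(rto: timedelta) -> str:
    """Pick the cheapest site type whose typical recovery fits the RTO."""
    if rto <= HOT_MAX_RTO:
        return "hot"
    if rto <= WARM_MAX_RTO:
        return "warm"
    return "cold"


def meets_rpo(last_replica: datetime, failure: datetime, rpo: timedelta) -> bool:
    """True if the data lost since the last good replica is within the RPO."""
    return failure - last_replica <= rpo


print(site_tier_for(timedelta(hours=1)))   # hot
print(site_tier_for(timedelta(hours=8)))   # warm
print(site_tier_for(timedelta(days=3)))    # cold

# A replica taken 10 minutes before the failure satisfies a 15-minute RPO.
print(meets_rpo(datetime(2024, 1, 1, 10, 50),
                datetime(2024, 1, 1, 11, 0),
                timedelta(minutes=15)))    # True
```

In practice the thresholds would come from the Business Impact Analysis rather than fixed constants.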
Measuring PUE Correctly
PUE is only meaningful if the measurement boundary is honest. The Green Grid defines three measurement categories (Level 1-3) by where and how often energy is metered: Level 1 (basic) measures IT energy at the UPS output; Level 2 (intermediate) measures at the PDU output; and Level 3 (advanced) measures at the IT equipment inlet with continuous, ideally sub-hourly, sampling. Comparing a Level 1 annual figure against a competitor's Level 3 figure is meaningless. A partial PUE (pPUE) isolates one subsystem — for example the cooling plant — so it can be benchmarked independently.
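As a rough illustration of partial PUE, the sketch below treats pPUE as the energy inside a chosen boundary (IT plus one subsystem) divided by the IT energy inside that boundary. The function name and the 600 kWh cooling figure are assumptions for illustration only.

```python
def partial_pue(subsystem_kwh: float, it_kwh: float) -> float:
    """pPUE for a boundary containing the IT load plus one subsystem."""
    if it_kwh <= 0:
        raise ValueError("IT energy must be positive")
    return (it_kwh + subsystem_kwh) / it_kwh


# Hypothetical figures: 2,000 kWh of IT energy and 600 kWh used by the cooling plant.
print(f"Cooling pPUE = {partial_pue(600, 2000):.2f}")   # Cooling pPUE = 1.30
```

Because it ignores every other overhead, a pPUE is always at or below the whole-facility PUE for the same IT load and should never be reported as if it were the full figure.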
A second worked example
If a site reports an average IT load of 2,000 kW and cooling, UPS, and lighting overhead of 900 kW, total facility power is 2,900 kW, so PUE = 2,900 / 2,000 = 1.45 and DCiE = 2,000 / 2,900 = 69%. Shaving 200 kW of cooling overhead through containment and higher set-points would drop the total to 2,700 kW and PUE to 1.35, exactly the kind of low-cost gain airflow management delivers.
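The before-and-after comparison in this second example can be reproduced as follows. The annual energy figure is an added illustration that assumes the 200 kW reduction holds constantly for all 8,760 hours of a year, which a real site with seasonal cooling load would not match exactly.

```python
HOURS_PER_YEAR = 8760

it_kw = 2000        # IT load from the example
overhead_kw = 900   # cooling, UPS, and lighting overhead
saving_kw = 200     # cooling overhead removed by containment and set-points

pue_before = (it_kw + overhead_kw) / it_kw
pue_after = (it_kw + overhead_kw - saving_kw) / it_kw

# Assumption: the saving is constant year-round.
annual_kwh_saved = saving_kw * HOURS_PER_YEAR

print(f"PUE before: {pue_before:.2f}")              # PUE before: 1.45
print(f"PUE after:  {pue_after:.2f}")               # PUE after:  1.35
print(f"Energy saved: {annual_kwh_saved:,} kWh/yr") # Energy saved: 1,752,000 kWh/yr
```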
Resilience Beyond the Site
Backup strategy also protects the data itself. The widely cited 3-2-1 rule keeps three copies of data on two media types with one copy off-site. Reliability engineers track MTBF (mean time between failures) and MTTR (mean time to repair); availability rises as MTBF grows and MTTR shrinks. The choice of hot, warm, or cold site, the replication method, and the backup regime are all sized to hit the RTO and RPO the Business Impact Analysis established — closing the loop from business requirement back to physical design and operations.
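The relationship between MTBF, MTTR, and availability is commonly written as availability = MTBF / (MTBF + MTTR). A minimal sketch, using hypothetical figures of 10,000 hours MTBF and 4 hours MTTR:

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as a fraction between 0 and 1."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical component: fails on average every 10,000 h, takes 4 h to repair.
a = availability(10_000, 4)
print(f"Availability: {a:.4%}")                                       # Availability: 99.9600%
print(f"Expected downtime: {(1 - a) * HOURS_PER_YEAR:.1f} h/year")    # Expected downtime: 3.5 h/year

# Halving MTTR raises availability without touching MTBF.
print(availability(10_000, 2) > a)                                    # True
```

The last line shows why faster repair (spares on site, rehearsed procedures) improves availability as much as more reliable hardware does.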
A data centre's total facility power draw is 1,800 kW and its IT equipment load is 1,200 kW. What is its PUE?
In business continuity planning, what does the Recovery Point Objective (RPO) define?
Which disaster-recovery site provides only space, power, and cooling with no active servers or current data, giving the longest recovery time?