How to Fly a Data Center – A Pilot’s View
What flying teaches you
As a certified airline transport pilot (ATP) with over 5,000 hours flying jets, aerobatics and experimental aircraft, I have learned much about critical infrastructure design and operations.
I’ve also learned about redundancy, about the importance of establishing and managing processes and procedures, and about human factors.
Between my job leading a data center company and my love of flying, this is a life-long study.
Staying airborne, returning safely to the ground and keeping a data center operating all rely on understanding your aircraft or data center and establishing safe, continuous operations.
Flying has taught me to appreciate both good design and reliable technology. Yet, it has also taught me that design, while very important, must evolve and be smart. Both in the air and on the ground, redundancy has the potential to lull a person into complacency.
There are reasons why four-engine airplanes are rapidly being replaced by two-engine aircraft, and why 2N data center builds are being replaced by N+x concurrently maintainable designs. Both industries learned that, at some point, more systems are less efficient and introduce more components, and therefore a higher potential for failure.
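The component-count argument can be made concrete with a little arithmetic. The sketch below is purely illustrative: it assumes every component fails independently with the same probability over some period, and the 1% figure and the component counts are made-up numbers, not data from any aircraft or facility. It computes the chance that at least one component fails, which is the chance the crew has some event to deal with, not the chance of an outage.

```python
def prob_any_failure(n_components: int, p_fail: float) -> float:
    """Chance that at least one of n independent components fails.

    Assumes identical, independent failure probabilities, which is a
    simplification: real components share causes of failure.
    """
    if n_components < 0 or not 0.0 <= p_fail <= 1.0:
        raise ValueError("need n_components >= 0 and 0 <= p_fail <= 1")
    return 1.0 - (1.0 - p_fail) ** n_components


# Hypothetical comparison: the same 1% per-component failure chance,
# applied to designs with more and more parts.
for n in (10, 20, 40):
    print(f"{n} components: {prob_any_failure(n, 0.01):.1%} chance of at least one failure")
```

Under these assumptions, doubling the number of parts roughly doubles how often something needs attention. Redundancy may still make an outage less likely, but it does not make the operating crew's workload smaller.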
More systems also mean that an operating crew faced with even a noncritical outage works in a degraded way, even if the plane or data center is safe. A degraded operating crew becomes the preeminent failure risk.
If something does go wrong, it is good established procedures, training and a good understanding of the design that inform the proper decision. Risk avoidance, risk management and risk mitigation are achieved by quality maintenance, good training and applying the correct procedures.
Flying has taught me the difference between a crisis and a drama. Experience and training teach you how to tell a routine warning from an alarm that means there is real cause for concern. More importantly, they teach you what to do about it.
For example, a door opening in flight on a nonpressurized aircraft (as has happened to me) is dramatic and will scare the passengers but, as long as the pilot doesn’t panic, it does not lead to a crisis.
There are many more movies about heroic acts in the sky than in data centers. But in reality, in both domains, operational safety is not about becoming the person who saves the day at the last second. When someone is forced to be a hero, it is usually because of a system or process failure. Real heroism is knowing how things work, how they should operate and what to do when things go wrong.
Check and check again
Flying safely starts on the ground, before you get into the pilot seat. The most important routine in any normal flight is the pre-flight check. The logbook, maintenance, process and communication are all vital. There is no substitute for doing the prep work. Not doing it will put you behind the machine, and these machines are fast and hard to catch up with.
I once found a flashlight in the engine compartment of a plane. It had just been in for maintenance. The logbook, the maintenance history, who last flew the plane, any comments on the state of the aircraft and how it handled must all be studied before take-off. Poor pilots fail to log issues. Often this is because they believe it reflects on them. No one likes talking about their near misses and outages. When handling aircraft there are strict rules about reporting when something breaks. Abnormalities must be investigated. Most things degrade over time rather than breaking all at once, and therefore trend monitoring is essential. Luckily, technology is of great assistance here.
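As a rough illustration of what trend monitoring means in practice, here is a minimal sketch. It fits a straight line to recent readings from one sensor and estimates how many more samples it would take to reach an alarm limit if the drift continued. The function names, the readings and the limit of 80 are all hypothetical, and real monitoring tools use far more robust methods than a single straight-line fit.

```python
from typing import Optional, Sequence


def slope_per_sample(readings: Sequence[float]) -> float:
    """Least-squares slope of evenly spaced readings (units per sample)."""
    n = len(readings)
    if n < 2:
        raise ValueError("need at least two readings")
    mean_x = (n - 1) / 2
    mean_y = sum(readings) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(readings))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def samples_until_limit(readings: Sequence[float], alarm_limit: float) -> Optional[float]:
    """Estimate samples left before the alarm limit, or None if not rising."""
    slope = slope_per_sample(readings)
    if slope <= 0:
        return None  # flat or improving: no upward drift to extrapolate
    remaining = alarm_limit - readings[-1]
    return max(remaining, 0.0) / slope


# Hypothetical temperature log: every value is well below the limit of 80,
# so no alarm fires, yet the steady climb is the thing worth investigating.
log = [70.0, 70.5, 71.0, 71.5, 72.0]
print(samples_until_limit(log, alarm_limit=80.0))
```

The point of the sketch is that a threshold alarm only speaks once something has already gone too far, while a trend shows the problem while there is still time to act calmly.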
In my business, I believe it would be great if there were a data center reporting system. Businesses are reluctant to share, which I understand. Therefore, we should establish an anonymous process for reporting incidents and the findings of their investigations. This would be a remarkable advance for the industry. The most fatal attitude in aviation and data centers is the belief that ‘this will never happen to me.’ And the best remedy is to learn from mistakes. The more learning the better.
Dealing with the unexpected
Here are some examples of unexpected things I’ve experienced in the cockpit. The flight mistakes that worried me most were the ones where I made bad decisions. For example, I took off from Aspen, Colorado, in a piston twin aircraft in complete whiteout conditions. My radar fried as I took off, but I couldn’t go back and land due to the weather. Everything else went well. But I could not get out of my mind that if I lost an engine on that flight, with the mountainous terrain all around, I would be in real trouble. My biggest takeaway is that an airplane operator’s job is to make sure you don’t get into a situation that leaves you with narrow survival options.
Once, when descending for landing, I had an engine outage on a twin-engine airplane, and it was a pilot mistake. It was a new plane for me, and the engines were not well set up. It had piston engines (yes, they still exist on aircraft) with the fuel mixture set too rich. As I advanced the mixture during the descent, it choked the engine. Because I was descending, I didn’t need much power, so it took a while until I noticed the outage; the plane didn’t have a central warning system (the airplane equivalent of a building management system). When I did notice, I reversed the last input. The engine restarted and I proceeded to a safe landing.
I’ve also had instrument outages and radio malfunctions where redundancy and training kicked in. I’ve had a few false alarms caused by bad probes and faulty sensors, mostly after maintenance. A checklist and a call to the manufacturer confirmed the malfunction.
The unexpected can and does happen in the air and inside the data center. My approach to flying is reflected in our approach to operating our own data centers and helping manage those of our clients. We do everything possible to avoid heroic actions. We develop procedures with a very precise list of actions to take when something out of the ordinary occurs.
How we learn
The aircraft industry and commercial airline businesses are many decades older than the data center industry. Regulations and standards in the airline industry are well established and yet things still go wrong, sometimes tragically. The most infamous last words on a cockpit voice recorder are ‘what does this do?’ and ‘I know what I’m doing’.
In the data center space, we constantly try to learn from other industries. How are they improving reliability? How are they improving safety? What’s new in their training and operations?
Individual experience matters a lot. So, for example, because I’ve not been able to fly as much as I’d like in 2020, I’ve focused on doubling down on training.
Things can always be improved. No one can say that data centers are the world’s most efficient buildings. We constantly learn of ways to improve reliability and efficiency. The opportunity is to keep striving for better by giving people the tools to view and change how they operate, through daily, continuous advances.
Kicking the tires, lighting the fires and flying by the seat of your pants makes flying fun only when nothing goes wrong. It is not a recipe for a long life and it is not a smart way to operate any critical system.
In my next article I’m going to get into the details of what the data center industry can learn from aircraft makers and operators about redundancy – when it works and when it fails.