Thermal runaway, the unexpected killer of your data centre

By ITWeb
19 Feb 2014

What is the right way to cost-effectively cool a data centre? Is it about increasing the input temperature? Maybe aisle containment is the answer? What about free air cooling? Should high density racks be water cooled?

These are questions that data centre owners should be revisiting regularly. Those who fail to review their cooling could be playing Russian roulette with their facility. Mark Hirst, Cannon T4 Product Manager at Cannon Technologies, looks at what can happen.

The right level of cooling can have a significant impact on power bills and help a company show it is meeting environmental targets. In recent years, data centre owners have embraced the cost savings that moving from an input temperature of 16°C to 23°C has delivered. At the same time, they know that a growing body of 'green' legislation means it is about more than just the money.

Historic cooling approaches

Historically, data centres were cooled by Computer Room Air Conditioning (CRAC) units spread around the perimeter of the room, with cold air forced under the floor. With low power usage in the racks, this was sufficient to cool all the equipment. Since the advent of blade systems and the increase in switches and storage, power usage per square foot has soared, along with the heat.

Modern cooling

The introduction of aisle containment, free air cooling, in-row cooling, water cooling, air flow monitoring and better room design has delivered significant improvements in cooling. Some of these, such as aisle containment, can be retrofitted to a data centre at limited cost and with little disruption to operations. This is critical because it not only extends the life of a data centre, but also makes economic sense.

Many of these technologies, however, are only being deployed in new builds. Free air cooling, a huge subject in its own right, can be done as part of a complete refurbishment, but some options, such as a heat wheel or a large plenum, have to be part of the building fabric. Water cooling has to be carefully designed and implemented to ensure there is no risk of power and water coming into contact.

Another approach that can be used in any data centre is increasing the input temperature. Until the early 2000s, it was not unusual for a large percentage of the computer equipment inside a data centre to be on a three- to five-year lease. At the same time, advances in internal IT system cooling were not high on the agenda of manufacturers. This meant generational replacement of hardware delivered some cooling efficiency, but not a great deal.

In the last decade, however, there have been a number of significant changes. The dot com recession and the current recession have meant systems are being kept much longer. The introduction of blade systems, and the massive heat increases they bring, has ushered in an era of highly efficient cooling inside the systems.

As a result of all this, increasing the input temperature to servers and storage systems can produce appreciable savings in power and cooling. When it is just a single server that needs extra cooling, the electrical cost of running its internal fans harder can be less than the cost of injecting more cold air into the room.
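
As a rough, hypothetical illustration of that trade-off, the short Python sketch below compares the two annual electricity costs. The wattages, running hours and tariff are assumptions made for the sketch, not measured figures.

# Illustrative comparison only: all figures are assumptions, not measurements.
HOURS_PER_YEAR = 8760
TARIFF_PER_KWH = 0.12           # assumed electricity tariff, local currency

server_fan_extra_w = 20.0       # assumed extra draw when one server's fans speed up
crac_extra_w = 3000.0           # assumed extra draw to cool the whole room further

fan_cost = server_fan_extra_w / 1000 * HOURS_PER_YEAR * TARIFF_PER_KWH
crac_cost = crac_extra_w / 1000 * HOURS_PER_YEAR * TARIFF_PER_KWH
print(f"Extra server fan cost per year: {fan_cost:,.2f}")
print(f"Extra room cooling cost per year: {crac_cost:,.2f}")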

With all of this, why the doom and gloom about thermal runaway and data centre meltdown?

Thermal runaway

First, there is no suggestion that any of these technologies is not fit for purpose. Each of them can cool a data centre at a lower cost than simple CRAC units and forced air. The risk arises when a combination of technologies is applied wrongly or without proper failsafe planning.

The starting point here is the input temperature. Depending on the cooling technology used, it can take an hour or more to remove just a couple of degrees of heat from a data centre; heat builds up far faster. A complete failure of cooling could see temperatures rise within minutes, and even after cooling is resumed, temperatures may continue to rise if the cooling system does not have enough excess capacity to cope.

As we increase input temperatures, we shrink the gap between the acceptable input temperature and the level at which failure becomes more likely. The older the equipment, the lower that failure temperature. As temperatures rise, the fans inside equipment work harder, pulling in more air to try to cool it, and that lowers the volume of cool air available to other systems.

Any cooling failure, therefore, has the potential to cause not just a single system failure but a cascade of failures. As other systems begin to overheat, they respond by drawing in more air, increasing the rate at which cool air is replaced by hotter air. This is known as a positive feedback loop.
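
To illustrate the feedback loop, the Python sketch below is a toy model of room air temperature when cooling fails and later resumes, with server fans adding extra load as the room warms. The air mass, IT load, cooling capacity and fan response are assumed figures for illustration only, not guidance for any real facility.

# Toy model of thermal runaway: all figures are assumptions for illustration.
AIR_MASS_KG = 2000.0          # assumed mass of room air (roughly 1,700 cubic metres)
AIR_HEAT_CAPACITY = 1005.0    # specific heat of air, J/(kg*K)
IT_LOAD_W = 200_000.0         # assumed steady IT heat load
COOLING_W = 200_000.0         # assumed cooling capacity, with no excess margin

def fan_extra_load(temp_c):
    # Assumed fan response: extra power drawn as inlet temperature exceeds 23 C.
    return min(150_000.0, max(0.0, temp_c - 23.0) * 5_000.0)

temp_c = 23.0
for minute in range(15):
    cooling_on = not (5 <= minute < 10)   # cooling fails at minute 5, resumes at 10
    heat_in = IT_LOAD_W + fan_extra_load(temp_c)
    heat_out = COOLING_W if cooling_on else 0.0
    temp_c += (heat_in - heat_out) * 60.0 / (AIR_MASS_KG * AIR_HEAT_CAPACITY)
    print(f"minute {minute:2d}: {temp_c:5.1f} C")

In this toy run, the room holds steady until the failure, climbs several degrees a minute while cooling is off, and keeps climbing even after cooling resumes, because the extra fan load now exceeds a cooling system that has no excess capacity.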

Solution

The solution is twofold:

1. Model or test the impact of a complete cooling system failure. Identify the point at which, and how quickly, temperature rises (a rough estimate is sketched after this list).

2. Add failover capacity that can be brought into play the moment a failure occurs, to prevent the overheating process from starting.
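
As a rough, hypothetical version of step one, the sketch below estimates how many minutes a room has before an assumed temperature limit is reached after a total cooling failure. It assumes only the room air acts as a thermal buffer; real rooms also have equipment and structure soaking up heat, so real figures will differ.

# Rough ride-through estimate: minutes until an assumed temperature limit is
# reached after a total cooling failure. All figures are illustrative assumptions.
def ride_through_minutes(air_mass_kg, it_load_w, start_c, limit_c,
                         air_heat_capacity=1005.0):
    joules_to_limit = air_mass_kg * air_heat_capacity * (limit_c - start_c)
    return joules_to_limit / it_load_w / 60.0

# Example: 2,000 kg of room air, 200 kW of IT load, 24 C supply, 35 C limit.
print(f"{ride_through_minutes(2000, 200_000, 24, 35):.1f} minutes")

With these assumed numbers the answer is under two minutes, which is why any failover capacity has to cut in immediately.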

For many data centre owners, this will mean adding some cost back into the data centre. While this may seem unpalatable, the alternative is likely to cost more, both in the short term through equipment replacement and in the long term through loss of trust and business.

Editorial contacts
Debby Freeman
Communicator
(+44) 014 8784 3366
debby.freeman@communicator-marketing.co.uk
