More than 70% of all data center outages are caused by human error and not by a fault in the infrastructure design. Furthermore, “mistakes” that led to an outage can often be traced to a poor decision by senior management.
“These decisions are seemingly disconnected in time and space from the site of the incident. It could be design compromises, budget cuts, staff reduction and vendor selection to name a few,” said Philip Hu, managing director of North Asia at Uptime Institute.
He added: “More importantly, human error speaks to management decision regarding staffing levels, training, maintenance and overall rigor of the data center operations.”
Hu was one of the keynote speakers at the recent Data Center Summit 2016, organized by Computerworld Hong Kong.
Uptime Institute is an advisory organization and is recognized globally for the creation and administration of the tier standards & certifications for data center design, construction, and operational sustainability.
Hu pointed out that while tier certification verifies a data center’s full compliance in terms of design, installed infrastructure and ongoing operations, there is no existing standard to help data center executives assess operations.
“There is a lack of appropriate procedures to address the largest risk to data center availability. Data center operators do not have a means to conduct risk analysis at a portfolio level and provide their senior management with the information needed to make calculated decisions on whether to accept the risk identified in the report or take the corrective actions required to mitigate risks,” Hu said.
Staff training is the biggest oversight in data centers
Uptime Institute has identified five areas of management and operations deficiency in today’s data centers: staffing; maintenance; training; planning, coordination & management; and operating conditions.
The biggest oversights are being committed in training (over 35%) and in operating conditions (over 33%), with data centers exhibiting ineffective behaviors in these areas.
“Many facilities do not have a formal program with lesson plans. Their on-the-job programs are not documented and there is no list of training required by position,” Hu said.
The enterprises’ neglect in providing sufficient training exacerbates the staffing deficiencies in many of these organizations.
Being understaffed and overworked with no plans to add headcount is the least of the enterprise’s problems where data center staffing is concerned.
“Many DC staff have no experience with looking after data center-specific equipment. They are brought onboard without being vetted against a list of required qualifications – because companies do not keep such a list. Roles and responsibilities are not documented,” Hu said.
Slack record-keeping in the data center
Hu also pointed out that senior management is often lax in demanding proper documentation of data center operations and maintenance activities.
“Many enterprises cannot perform failure analysis of their data center because there are no records of outages or near misses,” he said.
The lack of documentation also hampers preventive maintenance. Many data centers have no list of required PM activities and if there is one, the PM activities are not fully scripted.
“There is no quality control in the process,” Hu observed. “Also, the Maintenance Management System (MMS) is missing critical data such as warranty info, maintenance history and performance data, among others. Hence, it is not surprising that the MMS is unable to produce a deferred maintenance report.”
Proper record-keeping is also lost on the operations side, he added.
“Data centers are missing site policies, especially site configuration policies. There is simply no process for keeping reference library documents up-to-date.”
A third-party proof of M&O QoS
Recognizing a gap that needs to be addressed, Uptime Institute has set up its M&O (Management & Operations) Program, which conducts a thorough review to provide data center visibility and accountability for DC operations teams, service vendors and leadership.
“M&O is designed to work across groups and departments, cultures and practices,” Hu said. “It is built around the fundamental notion that facilities management issues are addressable – often without a major resource commitment – and represent the single greatest opportunity to reduce risks to data center availability.”
To date, Uptime Institute has conducted over 100 M&O reviews and has awarded the M&O stamp of approval to companies such as Visa, UBS, Morgan Stanley, Infomart Data Centers and Fujitsu.
Massive-scale cloud adoption will drive data center growth
Meanwhile, in a keynote presentation at the Data Center Summit, Philbert Shih, managing director of Structure Research, said Hong Kong can expect continuous growth in the capacity of data centers across the city.
“I argue that cloud is going to be a big driver of continuous growth of data center capacity in Hong Kong,” he said.
“What is happening here is that dedicated and virtual hosting infrastructure adoptions are somewhat lower in this market. We are seeing a kind of skipping phenomenon where organizations who have been on premise all this time are now moving to the cloud.”
In terms of infrastructure deployment models, Structure Research is seeing hosting and co-location trending upwards.
“What we are seeing from the providers’ side, they tend to come out with mixed services. They are offering many different deployment models.”
As these trends play themselves out over time, it is not going to be a black or white world, Shih noted.
“The data center landscape will be much more integrated – with direct connections, with workloads residing in different infrastructure environments. I do not think we are close to a tipping point. It will take time to play out. I do not think we will ever get to a place where everything is in the cloud and not on-premise or a private data center.
“What I think will happen is that we will have primary, mission-critical production workloads in your on-premise data center. You will probably scale out to the cloud or scale back. You may want to shut down certain facilities and you will use a co-location facility for that. You will disperse and diversify your infrastructure. And the challenge is to manage that through as few relationships as possible through a single pane of glass.”
According to Shih, companies are looking for as much efficiency and flexibility as possible in these deployment models.
“Ultimately, enterprises want to save money by being more efficient, which will allow you to divert resources from spending in infrastructure to spending for your primary source of revenue.”