Authorize.net explains the Independence Weekend fiasco

vickeryhill
3rd Party Integrations, eCommerce, Web Development, Web Marketing

Below is Authorize.net’s explanation of its outage, which lasted from 3 a.m. Eastern on July 3, 2009, to nearly 3 p.m. Eastern. Hundreds of thousands of websites and millions of transactions were affected, including some of VickeryHill’s customers. Because we use AlertSite to monitor transactions on a customer’s site, www.verilux.com, we were among the first to learn of the problem and, dare we say, instrumental in bringing Authorize.net’s customer service to Twitter (their website, telephones and email were down).

Unanticipated Downtime Summary: Authorize.net

The security and reliability of the Authorize.Net Payment Gateway are of the utmost importance to everyone at Authorize.Net and our parent company, CyberSource Corporation. Each year we invest significant financial resources and energy to assure that we can accommodate our growth, both in terms of the number of merchants we serve and the number of transactions we process. Our mission is simply to provide the industry’s most secure and reliable payment solutions.

As part of our efforts to assure that we are meeting our goals, we continuously look at our infrastructure and processes to confirm that they meet the needs of our merchant base. Unfortunately, late in the evening of July 2, 2009, we encountered a service disruption that adversely impacted our merchant base. The purpose of this communication is to explain what happened and, perhaps more importantly, the steps we are taking to ensure that it will not happen again.

Fisher Plaza Fire
Authorize.Net’s primary servers are housed in the Seattle-based Fisher Plaza, a state-of-the-art “class A” data center with full redundancy and diversity in all systems, including the building structure, power, cooling and connectivity. However, even with state-of-the-art technologies, things can go wrong, and on July 2, 2009, at approximately 11:10 PM Pacific time, Fisher Plaza suffered a fire that knocked out its primary power supply.

Upon losing power, Fisher Plaza immediately moved to backup generators, but the Seattle Fire Department forced it to shut the generators down because of their proximity to the fire location. With this loss of both primary and backup power, at approximately 11:40 PM, Authorize.Net (along with many other Web sites) was forced offline.

Backup Data Center
As already mentioned, Authorize.Net continuously reviews its infrastructure to assure that it is adequate to meet the demands of the business. As part of the 2009 planning process, it was determined that we were approaching the maximum capabilities of our current backup data center, and the decision was made to open a new, state-of-the-art backup data center in San Jose, CA. This new backup facility would not only act as a backup, but as a true “hot” site (in other words, real-time synchronization), so that the Authorize.Net platform could be switched from one data center to the other “on the fly.” At the time the fire occurred at Fisher Plaza we were just completing the final stages of transitioning to the new San Jose backup data center.

Perfect Storm
It’s fair to characterize the events beginning with the fire at Fisher Plaza as a perfect storm. Not only did the fire occur late at night, but the long July 4th holiday weekend had begun and the majority of our engineers, operations team and Customer Support representatives were on holiday. While many of these people were called back to work shortly after the fire started, it took longer than we would have liked to gather the full team. The fire knocked out our primary data center, which also provided our primary communication infrastructure, including e-mail and phone lines. We quickly organized our teams to fail over to the new San Jose data center. But while near completion, the backup data center was still undergoing final testing and configuration, and when we attempted to fail over, a number of unanticipated errors occurred. These errors, combined with our inability to access the Seattle data center (the fire department had deemed the building unsafe and no one was allowed in), caused the outage to last much longer than we would have expected.

Most transaction services resumed at approximately 11:00 AM on July 3rd, and other services were restored over the following days. These included Automated Recurring Billing, Batch Uploads, the Verified Merchant Seal and others.

Important: At no time whatsoever was any data that is handled or stored by Authorize.Net compromised.

Communication
Unfortunately, the fire also crippled our ability to communicate to our merchants and partners. While we would normally post messages on the Merchant Interface, Reseller Interface, public Web site, Customer Support line, etc., none of these tools were available due to the fire.

At approximately 7:30 AM on July 3rd, we began posting messages to our Twitter account and we continued to provide updates via this channel. As service was restored, messages were also posted on the Merchant Interface, Reseller Interface and phone lines. In addition, our Customer Support department opened on Sunday, July 5th to assist merchants and partners.

Current Situation
Authorize.Net resumed processing transactions Friday morning and continues to operate normally from our backup data center. Fisher Plaza is being powered by external backup generators and is now operational as our backup center to San Jose. While it may take several weeks to restore normal power to Fisher Plaza, the external generators allow this facility to be fully operational. However, until primary power is resumed we will continue to operate out of the San Jose data center. We do not anticipate any further interruptions in service.

Lessons
Even as our engineering and operations teams continue to ensure normal operations, the postmortem process is already under way. We are examining all aspects of this outage and implementing steps to mitigate future risks. Over the next weeks, we will be completing the work to ensure that we have two fully functional, synchronized hot sites. Failing over from one to the other will occur in a matter of seconds. Steps are also being taken to ensure that we have the ability to implement emergency communication by distributing our voice, e-mail and Web capabilities across multiple sites.

Over the next days and weeks the postmortem will continue. Processes will be refined and further protections put into place.

Conclusion
We would like to reiterate that the stability, security and reliability of the Authorize.Net Payment Gateway are of the utmost importance to everyone at Authorize.Net and CyberSource. We are deeply sorry for the inconvenience the downtime may have caused and want to assure you that we are taking the necessary steps to avoid any similar situations in the future.

Thank you for your understanding and patience and thank you for being an Authorize.Net merchant. We will work hard to live up to the confidence you have placed in us.