5 common data center outage problems

It may be good to learn from our own mistakes, but it’s certainly cheaper and less stressful to learn from other people’s blunders, bungles and gaffes. With this in mind, here are 5 of the most common outage problems to affect companies in recent years. Use them to inform your own SaaS operations.

1) DNS configuration problem

The Domain Name System (DNS) is similar to a phone book, except that it puts users in touch with websites by translating a domain name into an IP address. Sometimes, however, the IP address doesn’t connect the user to the right site. This can be the result of a DNS hack, in which traffic is rerouted to another site, or of a DNS configuration fault on the service provider’s side.


Recently, a major business social networking site was hit by a configuration error that took it and 5,000 other sites down for hours; during the outage, the site’s domain pointed to a domain-for-sale landing page. You can watch for this kind of fault using DNS Monitoring.
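As a rough illustration of what a DNS check does, here is a minimal Python sketch that resolves a hostname and flags any address that is not on a list you expect. The function names, the hostname and the addresses are hypothetical and chosen only for this example; a real monitoring service would also query from several locations and inspect other record types.

```python
import socket


def resolve_ipv4(hostname):
    """Return the set of IPv4 addresses the system resolver gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is an (address, port) pair.
    return {info[4][0] for info in infos}


def check_dns(hostname, expected_ips, resolver=resolve_ipv4):
    """Compare resolved addresses with the ones we expect.

    Returns (ok, unexpected), where unexpected is the set of resolved
    addresses that are not on the expected list.
    """
    try:
        resolved = resolver(hostname)
    except socket.gaierror:
        # The name did not resolve at all, which is also a failure.
        return False, set()
    unexpected = resolved - set(expected_ips)
    return bool(resolved) and not unexpected, unexpected


if __name__ == "__main__":
    # Hypothetical values: replace with your own domain and known-good addresses.
    ok, unexpected = check_dns("www.example.com", {"192.0.2.10"})
    if not ok:
        print("DNS check failed; unexpected addresses:", sorted(unexpected))
```

Run on a schedule, a check like this would raise an alert as soon as your domain starts resolving somewhere it shouldn’t, such as a parked-domain page.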

2) Storage service fault

Many companies recently experienced degraded performance following critical problems with storage services at a major public cloud provider. The service allows users to store large amounts of data in the cloud and run workloads across multiple zones to prevent downtime, yet the fault left some customers with connectivity issues and others unable to get online. While the particular problem behind this incident wasn’t well publicized, the case perhaps highlights the risk with SaaS, in which the provider is the keeper of the entire application stack.

3) HTTP load balancer blip

A popular software analytics company in the US recently went down due to an unexpected network “blip”. Caused by an HTTP load balancer (a component that spreads service loads across multiple resources in order to optimize performance and resource use), the blip brought critical services down for 8 minutes. Using a cloud-based service with advanced node health monitoring and failover protection can prevent downtime by picking up on potential problems in advance. You could also use the Site24x7 On-Premise Poller to monitor your internal network and all instances of your application server.
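To make the idea of node health monitoring with failover concrete, here is a minimal Python sketch of a round-robin balancer that skips nodes failing a health check. The class name, the node names and the health-check callback are all hypothetical; production load balancers also handle timeouts, retries and connection draining.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out nodes in turn, skipping any that fail the health check."""

    def __init__(self, nodes, is_healthy):
        self.nodes = list(nodes)
        self.is_healthy = is_healthy  # callable: node -> bool
        self._ring = cycle(self.nodes)

    def pick(self):
        """Return the next healthy node, or None if every node is down."""
        # Try each node at most once per call so we never loop forever.
        for _ in range(len(self.nodes)):
            node = next(self._ring)
            if self.is_healthy(node):
                return node
        return None


if __name__ == "__main__":
    down = {"app-2"}  # pretend this node is failing its health check
    balancer = RoundRobinBalancer(
        ["app-1", "app-2", "app-3"], lambda node: node not in down
    )
    print([balancer.pick() for _ in range(4)])
```

In this sketch, traffic keeps flowing to the remaining nodes while one is down, and `pick()` returning `None` is the signal to alert because no node is left to serve requests.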

4) Errors during data migration

Migrating from an RDBMS to a NoSQL database like MongoDB or Cassandra can create issues if the migration isn’t well planned. Planning for extended downtime is essential because, depending on the size of your database, the migration may take several hours or even days. A detailed migration plan listing all the resources you need once you go offline is also essential. You should ensure the work can be done in batches and that the migration can resume from the last successful batch if any part of it fails. You don’t want to spend five hours migrating data only to have the system crash and force you to start over.
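The batch-and-resume approach can be sketched in Python as follows. This is a simplified illustration, not a migration tool: the function name, the checkpoint file format and the `write_batch` callback are assumptions made for the example, and the source rows are held in a plain list rather than read from a database.

```python
import json
import os


def migrate_in_batches(rows, write_batch, checkpoint_path, batch_size=1000):
    """Copy rows to the target in batches, recording progress after each one.

    Returns the number of rows written during this run.
    """
    # Resume from the last successful batch if a checkpoint file exists.
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_offset"]

    migrated = 0
    for offset in range(start, len(rows), batch_size):
        batch = rows[offset:offset + batch_size]
        # If this raises, the checkpoint still points at the failed batch,
        # so the next run retries it instead of starting from zero.
        write_batch(batch)
        with open(checkpoint_path, "w") as f:
            json.dump({"next_offset": offset + len(batch)}, f)
        migrated += len(batch)
    return migrated
```

Because a crash could happen after a batch is written but before the checkpoint is saved, the target write should be safe to repeat (for example, an upsert keyed on the record ID).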

5) DRP testing error

Making sure your disaster recovery plan works is critical for any business dependent on online services. But sometimes testing can lead to disaster in itself. A small business, personal finance and tax software company recently had its knuckles rapped by customers when an error during an exercise of the company’s disaster recovery capabilities led to a synchronization gap. This mistake left some customers unable to access their data or missing transactions they’d input.
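One simple safeguard before switching over during a disaster recovery exercise is to confirm that the standby holds everything the primary does. The Python sketch below assumes, purely for illustration, that each system can report the IDs of its committed transactions; the function names and sample IDs are hypothetical, and this is not a description of what the company in question did.

```python
def sync_gap(primary_txn_ids, standby_txn_ids):
    """Return transaction IDs present on the primary but missing on the standby."""
    return sorted(set(primary_txn_ids) - set(standby_txn_ids))


def safe_to_fail_over(primary_txn_ids, standby_txn_ids):
    """Only allow the switch when the standby is missing nothing."""
    return not sync_gap(primary_txn_ids, standby_txn_ids)


if __name__ == "__main__":
    primary = [101, 102, 103, 104]  # hypothetical transaction IDs
    standby = [101, 102]
    if not safe_to_fail_over(primary, standby):
        print("Abort the drill; standby is missing:", sync_gap(primary, standby))
```

A check like this turns a silent synchronization gap into an explicit stop condition, so a drill is aborted before customers see missing data.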

