IT Best Practices - Troubleshooting infrastructure constraints

Slow Web applications, timeouts and errors are sometimes caused by problems in your back-end infrastructure (e.g., an overloaded Web server, a strained database or a connectivity issue in a mail farm).

With hundreds of performance metrics that IT can track and oversee, it is no surprise that industry studies show that 80% to 90% of downtime is spent locating the root cause. To aid your troubleshooting efforts, we have compiled a list of common datacenter issues and constraints, along with key performance metrics that IT should monitor and analyze on a regular basis. If you already use Site24x7, make sure that you have installed the Site24x7 agent on each server in your Web application delivery chain (Web servers, application servers, database servers, load balancers, mail servers) so you are alerted at the first sign of infrastructure trouble. If you use other monitoring tools, you can still use the examples below to tune your monitoring strategy.

  • Overloaded servers: Baseline CPU usage, memory and disk I/O metrics during normal and peak operations to identify problematic trends that indicate capacity overload. For example, if CPU usage is high, available memory is low and disk I/O deviates from your baseline, you have a capacity constraint that needs to be addressed.
  • Disk full: You would be surprised how often an application malfunctions because a key server, such as a Web server or load balancer, is running out of disk space. With Site24x7, you can be alerted as soon as disk utilization reaches a set percentage, so you can react well before it is too late. A good rule of thumb is to set up a warning threshold of 80% for disk utilization.

  • Mail farm problems: Site24x7 also gives you real-time visibility into how your Exchange mail servers are working. For example, a decrease in the "SMTP Out" metrics for your Hub Transport server points to a connectivity problem with no email going out. An increase in the "delayed call count" metric for your Unified Messaging system indicates bottlenecks that should be addressed. A rising "RPC response time to Client Access" is a problematic trend that will impact user experience. You can track all of these with Site24x7.
  • Failing services: When key Windows/Linux server services (e.g., HTTP(S), FTP(S), DNS, PING, TCP, SSL, SMTP, POP) are down or not responding properly, websites may not load at all. With Site24x7, you can start and stop services from a single console as needed and view all the information you need in an integrated dashboard.
  • Strained databases (DB): With a Site24x7 agent, you can track key performance indicators for your databases, such as CPU, disk, memory, process, service and network utilization. However, since Website performance problems are traced back to databases in over 50% of cases, you might want deeper visibility into your database operations with Site24x7 APM Insight. It is a handy IT/DevOps tool for visualizing Web transactions end to end, with performance metrics for every component, from URLs to SQL queries. Its detailed metrics and graphs help you identify slow database calls and track database usage and overall database performance. For example, if INSERT, UPDATE and DELETE statements take longer to run, you might have incorrect, too few or too many indexes on your DB. A DBA needs to correct that problem.
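If you are not yet using an agent, a few of the checks above can be approximated with standard-library Python. This is a minimal sketch and does not show how the Site24x7 agent works. The function names are hypothetical. The 80% disk threshold is the rule of thumb above. The three-standard-deviation baseline rule, the two-second TCP timeout and the sample CPU values are assumptions that you should tune for your environment.

```python
import shutil
import socket
import statistics

# Rule-of-thumb warning threshold for disk utilization (from the list above).
DISK_WARN_PERCENT = 80.0


def disk_utilization_percent(path: str = "/") -> float:
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def disk_warning(percent_used: float, warn_at: float = DISK_WARN_PERCENT) -> bool:
    """True when utilization has reached the warning threshold."""
    return percent_used >= warn_at


def deviates_from_baseline(baseline: list[float], current: float, k: float = 3.0) -> bool:
    """Assumed rule: flag a reading more than k standard deviations away
    from the mean of the baseline samples (needs at least 2 samples)."""
    mean = statistics.mean(baseline)
    spread = statistics.stdev(baseline)
    return abs(current - mean) > k * spread


def tcp_service_up(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    used = disk_utilization_percent("/")
    print(f"disk: {used:.1f}% used, warning={disk_warning(used)}")

    # Illustrative CPU samples (percent) from normal operation; not real data.
    cpu_baseline = [22.0, 25.0, 24.0, 23.0, 26.0]
    print("cpu anomaly:", deviates_from_baseline(cpu_baseline, 91.0))

    print("web server up:", tcp_service_up("localhost", 80))
```

You can apply the same baseline comparison to memory, disk I/O or the Exchange counters mentioned above. In practice, the baseline would be collected over time during normal and peak operations rather than hard-coded.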

In addition, don't forget to monitor your DNS services in real time. DNS is a frequent bottleneck, and Site24x7 can help here too: just sign up for a free Site24x7 trial.
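A rough DNS check can be sketched by timing a name lookup through the system resolver. This is a minimal sketch with a few assumptions. `socket.getaddrinfo` goes through the operating system's resolver and its cache, so it measures what your applications experience rather than querying a specific DNS server. The function names are hypothetical, and the 200 ms "slow" threshold is an arbitrary example value.

```python
import socket
import time

# Hypothetical threshold: lookups slower than this are treated as a bottleneck.
SLOW_LOOKUP_MS = 200.0


def dns_lookup_ms(hostname: str, port: int = 443) -> float:
    """Resolve `hostname` via the system resolver and return the elapsed
    time in milliseconds. Raises socket.gaierror if resolution fails."""
    start = time.perf_counter()
    socket.getaddrinfo(hostname, port)
    return (time.perf_counter() - start) * 1000.0


def check_dns(hostname: str, slow_ms: float = SLOW_LOOKUP_MS) -> str:
    """Classify a lookup as 'ok', 'slow' or 'failed'."""
    try:
        elapsed = dns_lookup_ms(hostname)
    except socket.gaierror:
        return "failed"
    return "slow" if elapsed > slow_ms else "ok"


if __name__ == "__main__":
    for name in ["localhost", "www.example.com"]:
        print(name, check_dns(name))
```

Running a check like this on a schedule from several locations gives a simple trend line. A dedicated monitor would also query your authoritative servers directly.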

Are you looking for additional IT troubleshooting tips? Check out our blog posts: Understanding Web Performance Waterfall charts, Troubleshooting problems with SQL queries, or Troubleshooting issues in the cloud.

Good luck with your troubleshooting efforts!