Case: Cloud Monitoring
Cloud Monitoring Case Overview
When operating in a cloud environment such as Amazon EC2, there is no longer just one or two servers to set up. It is easy to deploy automatically scaling groups of instances, some of which may be short-lived. This changes the requirements on your monitoring tools, and Observu has taken these requirements into consideration from the ground up.
The Observu cloud monitoring agent submits data from the server to the Observu monitoring platform over HTTPS. This means you do not have to open a port on your server or firewall to remotely view the collected server statistics, nor do you have to provide your server login credentials to our service.
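The outbound-only pattern described above can be sketched as follows. This is a minimal illustration, not the agent's actual protocol: the endpoint URL and payload field names are assumptions made for the example.

```python
import json
import urllib.request

# Hypothetical endpoint and payload layout -- the real agent's protocol
# is not documented here, so these names are illustrative only.
OBSERVU_ENDPOINT = "https://api.observu.example/v1/metrics"

def build_submission(api_key, hostname, metrics):
    """Package collected server statistics for an outbound HTTPS POST."""
    payload = {
        "api_key": api_key,
        "hostname": hostname,
        "metrics": metrics,  # e.g. {"load_1m": 0.42, "mem_used_pct": 61.3}
    }
    body = json.dumps(payload).encode("utf-8")
    # The connection is initiated by the server itself, so no inbound
    # port, firewall rule, or shared login credential is required.
    return urllib.request.Request(
        OBSERVU_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_submission("shared-key", "web-01", {"load_1m": 0.42})
```

Because all traffic flows outward over port 443, the same sketch works unchanged behind NAT or a restrictive firewall.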
Multiple servers can share the same API key, so the Observu Cloud Monitoring Agent can easily be deployed automatically in situations where new machines are booted up without manual intervention (e.g. Auto Scaling on Amazon EC2). In this case a Chef recipe, available as open source, is used to facilitate automated installation and configuration of the monitoring agent. Read more about this in the cluster monitoring documentation.
As servers may be booted up for only a few hours to perform a specific task or handle extra load, Observu can also automatically archive a monitor once data has stopped coming in for a configurable amount of time. This prevents your cloud monitoring dashboard from getting cluttered by the 20 monitors that were booted up during a traffic peak and are no longer relevant.
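The auto-archive behaviour can be sketched like this. The field names and the one-hour threshold are assumptions for illustration, not Observu's actual data model; the threshold stands in for the configurable setting mentioned above.

```python
import time

ARCHIVE_AFTER_SECONDS = 3600  # configurable in the real product

def archive_stale(monitors, now=None):
    """Archive any monitor whose last report is older than the threshold,
    so short-lived instances stop cluttering the dashboard."""
    now = now if now is not None else time.time()
    for m in monitors:
        if now - m["last_seen"] > ARCHIVE_AFTER_SECONDS:
            m["archived"] = True
    return monitors

# A peak-time instance that went silent, and one still reporting:
fleet = [
    {"name": "web-peak-17", "last_seen": 0, "archived": False},
    {"name": "web-01", "last_seen": 7000, "archived": False},
]
archive_stale(fleet, now=7200)
```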
Monitoring A High-traffic Website In The Cloud
Besides the normal monitoring of a group of servers, in this case special attention was paid to monitoring MySQL. As a fully redundant installation was set up using MySQL/Galera, it can tolerate the failure of one MySQL server without any disruption of database service. By properly detecting this state and sending notifications, Observu allows an administrator to handle the failure before anyone notices anything is wrong.
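Detecting the "still serving, but degraded" state can be sketched from Galera's wsrep status variables (as returned by `SHOW GLOBAL STATUS`). The three-node expectation and the state labels are assumptions for this example, not Observu's actual check.

```python
EXPECTED_CLUSTER_SIZE = 3  # assumed fleet size for this sketch

def galera_health(status):
    """Classify cluster state so a notification can fire before users
    notice anything. `status` maps wsrep variable names to values."""
    if status.get("wsrep_cluster_status") != "Primary":
        return "critical"   # non-primary component: stop serving writes
    size = int(status.get("wsrep_cluster_size", 0))
    if size < EXPECTED_CLUSTER_SIZE:
        return "degraded"   # one node down, service still uninterrupted
    return "ok"

# One of three nodes has failed; the cluster still serves queries:
state = galera_health({"wsrep_cluster_status": "Primary",
                       "wsrep_cluster_size": "2"})
```

The "degraded" state is exactly the window described above: the database keeps working, and an administrator can restore redundancy before a second failure causes downtime.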
Another particularly useful feature in this case is the aggregation of log files. As multiple instances were involved in serving traffic, Observu made it possible to check logs on all servers with a few clicks, instead of having to log in to each one individually. In this case Observu monitors the system log, the Apache/PHP error log, as well as the application-specific logs that contain, for example, slow pages and queries. By selecting a time range, logs can be browsed at exactly the times that are relevant, and by developers who would normally not have access to the servers.
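The core of that cross-server, time-range view can be sketched in a few lines. The `(timestamp, server, message)` tuple layout is an assumption chosen for illustration.

```python
from datetime import datetime

def logs_in_range(per_server_logs, start, end):
    """Merge log lines from several servers and keep only those inside
    the selected time window, in chronological order."""
    merged = [
        (ts, server, msg)
        for server, lines in per_server_logs.items()
        for ts, msg in lines
        if start <= ts <= end
    ]
    return sorted(merged)  # one chronological view across the fleet

logs = {
    "web-01": [(datetime(2013, 5, 1, 12, 0), "slow page /checkout")],
    "web-02": [(datetime(2013, 5, 1, 12, 5), "slow query SELECT ...")],
    "web-03": [(datetime(2013, 5, 1, 9, 0), "cron started")],
}
window = logs_in_range(logs,
                       datetime(2013, 5, 1, 11, 0),
                       datetime(2013, 5, 1, 13, 0))
```

Narrowing the window to an incident's timeframe is what lets a developer inspect exactly the relevant lines without shell access to any server.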
By monitoring server load metrics on web, Memcache, Redis and MySQL database servers, you can easily identify which parts of the application are under stress and need further optimization.
As a serious operation involves multiple environments such as development, testing, staging and production, there are also different requirements and response policies. Because Observu lets you set up notification rules at the global, group and monitor level, you can fine-tune who gets notified and when, without the need to maintain rules on each and every server and monitor. In this case all monitors share a base-level notification configuration based on e-mail and SMS. The production website group has additional, stricter rules that do not allow for any downtime and also notify multiple people by phone call.
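The layered rules described above can be sketched as a simple override chain: monitor-level settings win over group-level, which win over global defaults. The channel names and merge semantics are assumptions for this example, not Observu's actual rule engine.

```python
def effective_rules(global_rules, group_rules=None, monitor_rules=None):
    """Resolve the notification rules that apply to one monitor by
    layering group and monitor overrides on top of the global defaults."""
    rules = dict(global_rules)
    rules.update(group_rules or {})
    rules.update(monitor_rules or {})
    return rules

# Base level shared by every monitor, as in this case:
base = {"channels": ["email", "sms"], "allowed_downtime_min": 5}
# Stricter overrides for the production website group:
production = {"channels": ["email", "sms", "phone"],
              "allowed_downtime_min": 0}

rules = effective_rules(base, group_rules=production)
```

With this structure, adding a server to the production group is enough to give it the stricter policy; nothing has to be configured per server.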
Examples Of Real Results
Using Observu we discovered an issue with our Elastic Load Balancer configuration: it would time out on 0.5% of requests, even though the instances themselves responded fine and were not under load. This would have gone unnoticed without the high-frequency availability monitoring provided by Observu. It shows that even when you have set up everything redundantly, there may still be unexpected failure paths that can only be discovered by obsessively monitoring your real performance metrics.
Furthermore, by overlaying response time statistics with server load graphs, the impact of traffic surges and the effectiveness of auto-scaling can be shown. This allows well-informed decisions about scaling policies. For example, the first traffic peak revealed a scaling approach that was too conservative. Scaling up aggressively and letting instances back off after an hour proved much more effective at getting response times back to acceptable levels as quickly as possible.