Taming MongoDB with server monitoring

This post was originally posted on the Server Density blog here

People like to write bad things about MongoDB. On a few occasions I’ve been up to my elbows in troubleshooting and have been tempted to do so myself, but in general that wouldn’t be fair – my overall experience with MongoDB has been good.

There are lots of excellent guides showing you how to troubleshoot most issues you might experience with MongoDB, so let’s not go round in circles. Instead I want to spend some time outlining how we used our monitoring tools to identify and fix some tricky issues with our database(s), along with some practical advice on diagnosing problems with your own infrastructure.

Recently a customer of Contrail Innovations experienced a sudden growth in use and users of their service, and their MongoDB cluster was straining to cope with the new load. They were running a single shard as a three-node replica set: the primary did all reads and writes, the secondary existed solely for failover, and the third node took snapshot backups to S3 and was never able to become primary.

Mongo Socket Exceptions

The first problem that came to our attention was that the primary role started to flip between two of the nodes. One node was set to be the preferred primary, but it kept losing connectivity with the other nodes in the replica set. It would then step down, which triggered an election and took MongoDB offline for a short while until connections were restored and a new primary was elected.
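
A quick way to watch this from the shell is rs.status(), which shows each member’s state and health, and rs.conf(), which shows the priority that makes one node the preferred primary. This is a generic check rather than output from the incident itself:

production1:PRIMARY> rs.status().members.forEach(function (m) { print(m.name, m.stateStr, "health:", m.health); })
production1:PRIMARY> printjson(rs.conf())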

Apart from a SocketException there was nothing obvious in the MongoDB logs. I had recently implemented the ELK stack, so viewing the logs via Logstash and Kibana made it very easy to get an overview of what was going on. We could see that the app and processing nodes were losing connections to the primary MongoDB.

Checking the MongoDB replication logs gave a lot more information. You can do this from the MongoDB shell:

production1:PRIMARY> show log rs

This showed that some entries started like this:

‘couldn't connect to server’

In this log I could see rsHealthPoll reporting that the other nodes in the set were down. Sometimes it would report just one node as down, but as soon as the primary lost contact with both of them it stepped down, so that MongoDB did not enter a split-brain state as a result of the apparent network partition. To get a handle on what was happening I went through all the graphs and stats gathered through Server Density and their MongoDB monitoring, which highlighted lots of areas that needed our attention.

Firstly, the primary node had very high CPU usage and CPU steal – on EC2 this means you’re almost certainly suffering from a ‘noisy neighbour’ problem, in which another instance running on the same hardware is stealing resources from your instance. Simply stopping your instance and starting it again should force it onto new hardware, where you can hope for quieter neighbours. After doing so CPU usage improved, but we were still seeing connection problems.
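
If you don’t have a monitoring graph to hand, steal time is also visible directly on the instance. This is a generic check, nothing Server Density specific:

# %st / %steal is time stolen by the hypervisor; anything persistently
# above a few percent on EC2 points at a noisy neighbour
top -bn1 | grep "Cpu(s)"
mpstat 5 3    # %steal column; mpstat is part of the sysstat package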

Next, we noticed that the Linux kernel ulimit settings did not match the values recommended in the docs for our MongoDB version (worth re-checking every time you upgrade MongoDB). By changing them to the recommended settings and doing a rolling restart of the cluster, we resolved the connection problems.
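
Here is a sketch of how to check and raise those limits. The 64000 values below are illustrative, so take the real numbers from the ulimit page of the MongoDB docs for your version, and adjust the user name to whatever mongod runs as:

# What limits is the running mongod actually operating under?
cat /proc/$(pgrep -n mongod)/limits | grep -Ei 'open files|processes'

# Raise them persistently. Note that services started by upstart ignore
# limits.conf; in that case set a "limit nofile" stanza in the job file instead.
cat <<'EOF' | sudo tee /etc/security/limits.d/mongodb.conf
mongodb soft nofile 64000
mongodb hard nofile 64000
mongodb soft nproc  64000
mongodb hard nproc  64000
EOF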

We also changed the TCP keepalive settings, as recommended by my friends from Bottlenose who work in our Hackers & Founders co-working space here in Amsterdam. You can find more info on these settings here.
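
On Linux this comes down to a single sysctl. The 120 second value is the one commonly recommended for MongoDB deployments, well below the 7200 second default:

# Check the current keepalive time
sysctl net.ipv4.tcp_keepalive_time

# Lower it, and add the same setting to /etc/sysctl.conf so it survives a reboot
sudo sysctl -w net.ipv4.tcp_keepalive_time=120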

High IOWait and low CPU

Server Density Dashboard - Primary MongoDB

In the days following these changes our customer continued to grow, as did the amount of data in their MongoDB instances. Server Density alerted us to the fact that the primary MongoDB was experiencing high IOWaits, so it was time to go deeper into the world of MongoDB stats. For this I used the excellent MMS monitoring service to dive into the specifics.

The first thing we looked at was page faults. These occur when MongoDB needs to read from disk, and the sudden jump in users and data meant we were starting to see a lot of them. Checking them against the opcounters gave a very small ratio, which only told us that MongoDB was busy with disk IO (which we already knew from the IOWaits). I then checked the working set – there was plenty of space for more pages, so it couldn’t have been a shortage of RAM.

production1:PRIMARY> db.runCommand({serverStatus: 1, workingSet: 1})
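
The workingSet section is only an estimate, but a rough comparison against available RAM looks something like this, assuming the usual 4 KB page size:

production1:PRIMARY> var ws = db.runCommand({serverStatus: 1, workingSet: 1}).workingSet
production1:PRIMARY> print("~" + Math.round(ws.pagesInMemory * 4096 / 1024 / 1024) + " MB touched over the last " + ws.overSeconds + " seconds")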

Nothing from the graphs stuck out, except for the high number of page faults. However, running db.currentOp() several times indicated that there were a couple of queries that were running a lot. Time to start profiling MongoDB.

production1:PRIMARY> db.setProfilingLevel(1, 700)

With profiling set to only log queries taking more than 700ms, I immediately saw a query that was taking three minutes to complete. On checking the Scala code for the indexes in MongoDB I could see that one of the fields in the query was not indexed, so MongoDB was doing a full collection scan and reading all the data from disk. I kept profiling and found three others – wow! After adding the missing indexes the page faults and IOWaits went away. Problem solved.
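
The slow operations end up in the system.profile collection, where a collection scan shows up as nscanned being far higher than nreturned. The collection and field names below are made up for illustration; the actual fix was the equivalent ensureIndex call for the fields each slow query was filtering on:

production1:PRIMARY> db.system.profile.find({ millis: { $gt: 700 } }).sort({ ts: -1 }).limit(5).pretty()
production1:PRIMARY> db.events.ensureIndex({ accountId: 1, createdAt: -1 }, { background: true })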

Look Ma, no index!

To cut a long story short: if you are experiencing high IOWaits, low CPU and high page faults in MongoDB, your best bet is to profile your database and see if you are missing any indexes. As your code base grows it’s very easy to miss an index on a query, and if the results of the query are not user-facing then missing indexes are really hard to track down until they’re already causing you problems.

There’s no need to be afraid of MongoDB; if it suits your use case it’s very reliable. However, to scale out you need to use monitoring tools like Server Density and MMS to keep on top of any problems that may be going on behind the scenes.

Posted on January 14, 2015 and filed under devops.

HAL - the devops EC2 bot

Last year I built a prize-winning EC2 deployment bot at a hack day in NYC. HAL now does all of our EC2 deployments for us and has been vital to our daily continuous deployment. Here are a couple of decks on HAL. It seems that some of the embedded YouTube videos have been removed since I gave the presentations.

Posted on April 23, 2013 .

Ubuntu RAID Pain Relief

Recently, while working on a client project, we upgraded the Ubuntu version of our base Amazon EC2 AMI to 12.04.1 LTS (Precise Pangolin). Some time afterwards we hit a problem when replacing a failed ElasticSearch node using this image: when the instance rebooted, the md device changed from /dev/md0 to /dev/md127, so our /raiddrive mount did not come back up because fstab expected it to be on md0. Why this happens is the subject of much debate if you search the interwebs, and it’s a problem that has bitten many.

At the time we really wanted to get the node going quickly, so the workaround was to just mount it on device md127 instead. This is a bad smell that would probably come back and bite us later if left too long.

We soon had to replace our entire ElasticSearch clusters anyway, to upgrade from 0.19.9 to 0.20.6, so it was time to solve the problem properly. First an upgrade of Ubuntu to 12.04.2, and then some extra config to fix the mounting problem.

This is the bash script to mount 4 x EBS volumes as a RAID 0 array.

#!/bin/bash
set -e -x

sudo apt-get install -y mdadm xfsprogs

# Give the device a name - this is important.
sudo mdadm --create /dev/md0 --name esdata --level 0 --chunk=256 --metadata=1.1 --raid-devices=4 /dev/xvdh /dev/xvdi /dev/xvdj /dev/xvdk

echo DEVICE /dev/xvdh /dev/xvdi /dev/xvdj /dev/xvdk | sudo tee /etc/mdadm/mdadm.conf

sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf

# Update the initramfs - very important after mdadm change
sudo update-initramfs -u

sudo mkfs.xfs /dev/md0

# nobootwait: don't hang the EC2 instance boot if the array isn't ready yet
echo "/dev/md0 /raiddrive xfs nodiratime,nobootwait 0 0" | sudo tee -a /etc/fstab

sudo mkdir -p /raiddrive
sudo mount /raiddrive

Voilà, your RAID will now survive a restart and stay on the same device number.
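
To double-check after a reboot, a couple of quick commands (not part of the original script) confirm the array and mount came back as expected:

# The array should reassemble as md0 with all four EBS volumes present
cat /proc/mdstat
sudo mdadm --detail /dev/md0

# And the fstab entry should be mounted where ElasticSearch expects it
mount | grep raiddrive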

Posted on April 22, 2013 .