MongoDB on VMWare

We’ve been running MongoDB on the VMWare vSphere virtualisation platform in production for 18 months across MongoDB v1.6, 1.8 and recently 2.0. Originally deployed on CentOS, we’ve almost finishing migrating all our VMs to Ubuntu 10.04 LTS. We run on ESXi v4.0.0 on the Terremark Enterprise Cloud.
This is different from the VMWare CloudFoundry support of MongoDB which is a Platform as a Service product. VMWare vSphere is a full data centre tool in the same way that Amazon EC2 (and others) run on top of Xen for virtualisation.
There is nothing specific that you need to do get MongoDB running on VMWare but there are a few things we’ve learnt over time that are worth understanding:
- Overcommitting resources is very bad, particularly memory. This is because VMWare will swap them around guest VMs on the host and your guest VMs will suddenly not have memory available.
- CPU performance can be difficult to predict if the host is shared by other VMs you have no control over. There’s no concept of guaranteed resources so you may find sudden load spikes where other users are requesting resources which have to be re-allocated from your VMs. The symptom of this is increased load average because processes are having to wait longer to access the CPU. On the host level, this is represented by CPU ready time.
- VMWare manages memory using ballooning. Make sure you read the VMWare guide on this.
We have also implemented some tweaks recommended by Terremark support and VMWare for VMs generally:
- Time sync managed by VMWare is disabled because the guest OS can do it better. We use ntp locally to handle this because it allows sync forwards and backwards whereas the VMWare sync only goes forwards. This is done by executing
/usr/bin/vmware-toolbox-cmd timesync disable - Ensure the VMWare Tools are always up to date. We use the Open Virtual Machine Tools packages to allow them to stay up to date via the OS package management rather than the way VMWare recommends installing them via a manual script accessible through a mounted CD.
Handling Memcached failover
Unlike a database with built in failover (master/slave model), you can usually only connect a client to a single Memcached server. If you specify multiple servers then these are used as part of the hashing to determine where the data gets stored, but there’s no concept of replication. This means if one Memcached node goes down, you lose the keys on that node. If you’re only connecting to a single node then you lose all Memcached.
The commercial product, Membase, handles this by providing replicated Memcached and failover functionality so if one node goes down, you can still access the other node(s) without any impact to the application. However, the clients are not built to handle this (it’s not standard Memcached functionality) so you still have to connect to a single node. You can’t use a pool of servers because Membase handles the distribution of data.
Instead, you can use the Moxi Memcached proxy. This allows your application servers to connect to what looks like a single Memcached host but Moxi handles sending the queries to the correct Membase (or Memcached) node. It also communicates with Membase to determine the health of a node for failover purposes.
We have recently deployed Moxi to elimiate Memcached as a single point of failure. Our web nodes now connect to one of several local Moxi instances (one for each Memcached bucket) which proxy the connections out to the cluster. If one of the Memcached cluster nodes fails, our application never needs to know because Moxi will silently handle the failover.
Alternatively, with Couchbase 1.8 (which is what Membase has been renamed to), you can use their client libraries to connect directly to your Couchbase instances with the failover support built into the libraries.
Building our London office – part 2
At the end of last year, I introduced the construction of our new office in London. Over a month later and significant progress has been made. The first few weeks always seem like not much is being done but that is when the core foundations are built – the structure, flooring, electrics, plumbing and everything else that gets hidden away. With that now almost completed, we’re almost ready for perhaps the most exciting bit – kitting everything out!
I visited the office last week – the first and second floors are mostly completed with flooring just being finished and the whole interior plastered. The electrics were completed towards the end of last week and this week we’ll see the internal dividers finished, decorating completed with the walls and ceiling painted, the spiral staircase fitted, internet connected, kitchen plumbed in and the initial arrival of furniture.
We’ve ordered some extremely nice custom desks, Herman Miller chairs, a great meeting table and more Herman Miller chairs. Plus some massive beanbags already sitting at my home, ready to be transported. The next, and final post in this series will go into more detail about the choices behind these. We’ve spent a lot of time deciding upon the right work environment. If you’re spending a significant portion of your time in one place, it has to be comfortable and enjoyable!
We’re expecting to move in properly in 2 weeks!

Looking in. Floor being put down, storage cupboard on the left and spiral staircase divider to the right.
Sysadmin Sunday 63
This is Sysadmin Sunday, a post of interesting links from throughout the previous week.
- FreeBSD now on all EC2 instance types
- Building the next generation file system for Windows: ReFS
- JSON as a core type in Postgres
- Java packages in Partner archive to be removed on 2012-02-16
- World IPv6 Launch on June 6, 2012
- The New MongoDB Aggregation Framework
- Linux 3.1 end of life
- bash autocomplete for SSH
Subscribe to our RSS feed and follow us on Twitter for interesting links throughout the week.
Sysadmin Sunday 62
This is Sysadmin Sunday, a post of interesting links from throughout the previous week.
Subscribe to our RSS feed and follow us on Twitter for interesting links throughout the week.
Improved graphing and API 1.4
Our longest running engineering project has just come to a close with the release of our new graphing backend. This provides much greater granularity, longer data retention and improved performance. The headline is: data plotted every 1 minute and stored forever! This is available on all paid/trial accounts now, for no extra cost.
To be more specific:
Graphs Before: Data was plotted every 5 minutes for 72 hours then summarised into the daily mean average.
Graphs Now: Data is plotted every minute for the last hour, every 5 minutes for the last 72 hours, every hour for the last 6 months and daily after that. The 5 minute, hourly and daily values are mean averages for that period e.g. the 5 minute value is the average for the last 5 minutes, the hourly value is the average for the last hour.
Snapshots Before: Snapshots were only available for 72 hours.
Snapshots Now: Snapshots are now available forever. This applies to clickable points on the graphs and alert snapshots.
Services Graphs Before: Data was only available for the last 24 hours.
Services Graphs Now: Data is available on the same granularity and retention as all other graphs.
You will also find that the performance is much better – graphs will load significantly faster. We’re now storing all data in memory for the last 24 hours so you should find the default view (last hour) extremely fast to load. Older data has to be read from disk but will still be faster than before.
API 1.4
The new metrics are also available via the API in an updated metrics/getRange method. We have improved how this method is implemented to be more useful and simple – you now call it with a metric name and it’ll return all the fields for that metric, removing the need for metric groups and names.
We also took the opportunity to fix servers/getByName and users/delete so they are POST instead of GET methods. Otherwise, the 1.4 API is a drop in replacement for the existing 1.3 API, which has been deprecated and will be disabled on 6th Feb 2012.
Sysadmin Sunday 61
This is Sysadmin Sunday – Christmas/New Year Edition. After a long, 2 week break, there’s quite a lot to be reading!
- How to build a 40TB file server
- HipHop for PHP at Hyves
- Ops meta metrics
- Graylog2 released replaces the MongoDB message store with Elastic Search.
- Merry Christmas from the FreeBSD Security Team
- Global Netflix Platform
- PlentyOfFish Update – 6 Billion Pageviews And 32 Billion Images A Month
- How Twitter Stores 250 Million Tweets A Day Using MySQL
- The lost Van Jacobson paper that could save the Internet
- MongoDB’s Write Lock – benchmarks of MongoDB 1.8 vs 2.0 with pagefaulting and non-pagefaulting reads/writes
- New Year’s Resolution: Full Disk Encryption on Every Computer You Own
- node.js DoS
- Scaling High-Availability Infrastructure in the Cloud
- The Future of CouchDB
- What’s new in Linux 3.2
- Happy Birthday High Replication Datastore: 1 year, 100,000 apps, 0% downtime, 3bn requests per day
- Clint, Command Line Library for Python
- Cost of mapreduce was $6,500 to update a ListProperty on 14.1 million entities
Subscribe to our RSS feed and follow us on Twitter for interesting links throughout the week.
CPU load becomes more relevant with virtualisation
On Linux, the decimal CPU load you can see with tools like top can generally be thought of as a queue indicator. If you have a single CPU[1] then if CPU load is below 1, nothing is having to wait for CPU time. Above 1 indicates that processes are having to wait for CPU time. This manifests itself as a slower overall response time or even requests backing up and timing out as the queue gets longer.
When we were using physical hardware, CPU load was generally ignored. This was probably a combination of fairly low load on the servers in those days anyway, but also the high spec, multi-socket, multi-core CPUs we had installed. On physical hardware those CPUs were 100% dedicated to serving our requests.
The same principle still applies with virtualisation, however there is another factor that comes into play – the host workload. As a VM, you rely on the host virtualisation layer to share out the physical CPU resources amongst all the guests, including yourself. If you don’t have control over the host such as with a VPS or on a cloud provider like Amazon EC2, this means you may be affected by their usage in unpredictable ways.
This is nothing new and one of the caveats of public virtualised environments but it means CPU load becomes relevant again. You might see low % utilisation but high CPU load because your requests for CPU time are being queued up.
Linux also has a metric called CPU steal – the st section in the top output. This indicates how much time is spent by the hypervisor servicing requests other than to your VM. It’s generally associated with the Xen hypervisor (which Amazon EC2 uses) but is also valid on VMWare platforms (where the equivalent metric in VMWare terms is CPU ready time). You can therefore see your usage (e.g. User %) as low but a high CPU steal %, which results in a high CPU load value.

The combination of these 2 metrics allows you to see if VM performance problems are related to your host, which may require you to upgrade your instance type or get dedicated hardware.
[1] As a very simple explanation, the ratio changes based on the number of CPUs. See this wikipedia article for more details.
Sysadmin Sunday 60
This is Sysadmin Sunday, a post of interesting links from throughout the previous week.
- A console built into Chrome?
- The false promises of dedicated IPs
- GNU/Linux distro timeline
- Tips for Remote Unix Work (SSH, screen, and VNC)
Subscribe to our RSS feed and follow us on Twitter for interesting links throughout the week.
Sysadmin Sunday 59
This is Sysadmin Sunday, a post of interesting links from throughout the previous week.
- Real-Time Log Collection with Fluentd and MongoDB
- Network Link Conditioner in Xcode 4.1, Lion
- Fear and Loathing in Debian/Ubuntu (or: who needs /etc/motd)
Subscribe to our RSS feed and follow us on Twitter for interesting links throughout the week.
















