How we handle real time alerting on large server log datasets
Our server monitoring service, Server Density, is now processing over 10GB of incoming statistical data every day. Server Density is used to help troubleshoot problems by providing detailed historical data, but also to send notifications when things start to go wrong. As such, the data we collect has two primary uses:
- Graphing the data over specific date ranges
- Triggering custom user alerts based on the current, “real time” values
These two use cases are quite different in terms of how they need to be implemented. The first can exist as a large data store with a focus on quickly retrieving a relatively large result set over a specific time range. The second needs to return the very latest values as quickly as possible. Although this might sound like a difficult problem, it becomes quite simple once you split the two use cases into separate stores.
Latest value cache
Our monitoring agent reports back server statistics every 60 seconds. It’s not “true real time” but is sufficient for monitoring purposes. We receive this data and store it in two locations.
The first location is what can essentially be called the “latest value cache”, which stores the most up to date values for each server. MongoDB uses memory mapped files, so this collection is likely to stay in memory because it is frequently used. As it only ever contains one document per server, lookups are extremely fast, taking just a few milliseconds. This is the collection we run our alert threshold checks against, and it is also where we grab the latest values for the dashboard and the API (and as a result, the stats view in the iPhone app).
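The idea can be sketched in a few lines. This is a minimal illustration that uses a plain Python dictionary as a stand-in for the MongoDB collection. The `store_latest` and `check_threshold` helpers, the `load_avg` field name and the threshold value are all hypothetical and are not taken from the production schema.

```python
from typing import Dict, Optional

# Stand-in for the "latest value" collection: exactly one entry per server.
# In MongoDB this would plausibly be an upsert keyed on the server ID
# (an assumption; the real write path is not shown in this post).
latest_values: Dict[str, dict] = {}


def store_latest(server_id: str, stats: dict) -> None:
    """Overwrite the previous payload so only the newest values remain."""
    latest_values[server_id] = stats


def check_threshold(server_id: str, metric: str, limit: float) -> Optional[bool]:
    """Return True if the latest value exceeds the limit, None if unknown."""
    stats = latest_values.get(server_id)
    if stats is None or metric not in stats:
        return None
    return stats[metric] > limit


store_latest("web1", {"load_avg": 0.7})
store_latest("web1", {"load_avg": 4.2})  # a later report replaces the earlier one
alert = check_threshold("web1", "load_avg", 2.0)
```

Because each report replaces the previous one, the store never grows beyond one entry per server, which is what keeps the alert check cheap.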
Larger, historical collections
In addition to storing the latest value in the cache, the data is also stored in the historical store. However, we currently only store this data at 5 minute intervals. We originally stored every dataset posted back to us, one every 60 seconds, but found that this was generally unnecessary. It caused problems with the amount of data we stored, and plotting a point for every 60 seconds resulted in huge data sets on the graphs. We also found that looking at historical trends does not require such granular sample rates. The alerting is where we need the latest values, and that is already handled by the cache.
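One possible way to thin 60 second reports down to 5 minute samples is sketched below. The post does not say how the 5 minute samples are chosen, so the "keep the first report in each window" rule, the `should_store_history` helper and the `last_stored` bookkeeping are all assumptions made for illustration.

```python
from typing import Dict

HISTORY_INTERVAL_SECONDS = 300  # 5 minutes, as described above

# Hypothetical bookkeeping: timestamp of the last sample kept per server.
last_stored: Dict[str, int] = {}


def should_store_history(server_id: str, timestamp: int) -> bool:
    """Keep a sample only if 5 minutes have passed since the last kept one."""
    previous = last_stored.get(server_id)
    if previous is not None and timestamp - previous < HISTORY_INTERVAL_SECONDS:
        return False
    last_stored[server_id] = timestamp
    return True


# Ten agent reports, one every 60 seconds starting at t=0.
kept = [t for t in range(0, 600, 60) if should_store_history("web1", t)]
```

Every report would still go to the latest value cache; only the historical write is skipped for the reports in between.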
For a single metric (e.g. load average):
- 1 hour = 12 documents
- 24 hours = 288 documents
- 30 days = 8,640 documents
- 365 days = 105,120 documents
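These counts follow directly from the 5 minute interval. A small sketch of the arithmetic, where `documents_per_metric` is a hypothetical helper name:

```python
def documents_per_metric(hours: int, interval_minutes: int = 5) -> int:
    """Number of documents stored for one metric over the given period."""
    return hours * 60 // interval_minutes


periods = [("1 hour", 1), ("24 hours", 24), ("30 days", 30 * 24), ("365 days", 365 * 24)]
counts = {label: documents_per_metric(hours) for label, hours in periods}
```

Passing `interval_minutes=1` to the same function gives 525,600 documents per metric per year, five times the current figure, which shows the saving from dropping the original 60 second interval.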
Many collections run into the millions of documents, especially where users are monitoring multiple servers. The above document counts don’t include the extensive process list information we store, along with data for each network interface and disk volume, all of which are stored as separate documents. This data is indexed on the date and server ID, and queries are still surprisingly quick. A query for the last 24 hours of data (the default graph time range) will typically take around 50ms to return when the data is still in memory. This increases to around 150 – 250ms when the data is not in memory.
These are the collections that are being queried when the graphs are generated through the web UI and our API. They are much less frequently accessed for reads when compared to the latest value cache.
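A graph query against these collections might look roughly like the following sketch. The `server_id` and `timestamp` field names are hypothetical, and whether the real indexes are compound or separate is not stated here, so the compound index is an assumption. The code only builds the filter and index specification as plain Python data, so it runs without a database.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical compound index: server ID first, then the sample timestamp.
# With pymongo this list could be passed to Collection.create_index().
HISTORY_INDEX = [("server_id", 1), ("timestamp", 1)]


def history_filter(server_id: str, end: datetime, hours: int = 24) -> dict:
    """Build a MongoDB filter document for one server over a time window."""
    start = end - timedelta(hours=hours)
    return {"server_id": server_id, "timestamp": {"$gte": start, "$lte": end}}


end = datetime(2010, 1, 2, tzinfo=timezone.utc)
query = history_filter("web1", end)  # default window: the last 24 hours
```

The resulting dictionary is in the form accepted by pymongo's `Collection.find()`. With 5 minute samples, the default 24 hour window would match 288 documents per metric.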



