Intranet Computing: A metrics monitoring system
• Mark Eschbach
I will admit my home servers are not well instrumented. None of the services report statistics. I am completely blind within that realm. So this is my search for a way to record these data points and analyze them.
Going through the pipeline there are several requirements. Many of the applications I have deployed at home are written in either Node.js or Ruby. For home applications I am concerned about productivity of production, not ideological purity of implementation; if I were, they would probably all be implemented in Erlang. The applications should be able to offload arbitrary metrics into a data store. The monitoring system should also be able to pick up metrics from the host systems and aggregate them.
I do not have strong feelings on ways to visualize or alert on the data; however, this should be fairly simple. For the initial implementation I would like to use Grafana since it is the suite du jour in the industry. Grafana’s architecture is built to consume multiple data sources, even for a single graph. Unfortunately this does not narrow the set of choices for the storage layer.
There has been a lot of buzz surrounding InfluxDB, and it was implemented as the default store for application metrics at work. InfluxDB has worked well there; however, I do not have experience with it in an operational context. The deployment scenario seems fairly simple, with either a Docker image or a simple package to install on the host. For now I will consider Docker for simplicity.
Backup and restore scenarios are fairly simple. According to the official InfluxDB documentation, one just runs a command for either scenario. It looks like clustering is only available with commercial support, which is a bit disappointing but not a deal breaker. High availability creates complex use cases anyway.
After a bit of searching I was unable to locate any information regarding limits on the number of rows or on data store sizes for InfluxDB. I tend to be a bit neurotic about retaining details and logs for a while. The retention policies appear to be configurable at two levels: shard groups and databases. A database is the recommended level for each retention policy. The retention policy documentation does not provide a great description of how one would disable automated destruction of data. CREATE RETENTION POLICY accepts INF as a possible duration value, which is promising.
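To make the INF option concrete, here is a minimal sketch that assembles the statement, assuming the InfluxDB 1.x InfluxQL form CREATE RETENTION POLICY ... ON ... DURATION ... REPLICATION ... [DEFAULT]. The helper, the policy name "forever", and the database name "home_metrics" are hypothetical and only there for illustration; the resulting string would still have to be sent to the server by whatever client is in use.

```typescript
// Hypothetical helper: builds an InfluxQL (1.x) statement as a plain string.
// The policy and database names used at the bottom are made up for illustration.
interface RetentionPolicy {
  name: string;        // policy name
  database: string;    // database the policy applies to
  duration: string;    // InfluxQL duration literal such as "52w", or "INF" to never expire data
  replication: number; // replication factor; 1 on a single node
  isDefault?: boolean; // make this the database's default policy
}

// InfluxQL identifiers are double-quoted; embedded double quotes are backslash-escaped.
function quoteIdent(id: string): string {
  return `"${id.replace(/"/g, '\\"')}"`;
}

function createRetentionPolicy(p: RetentionPolicy): string {
  const parts = [
    `CREATE RETENTION POLICY ${quoteIdent(p.name)}`,
    `ON ${quoteIdent(p.database)}`,
    `DURATION ${p.duration}`,
    `REPLICATION ${p.replication}`,
  ];
  if (p.isDefault) parts.push("DEFAULT");
  return parts.join(" ");
}

console.log(
  createRetentionPolicy({
    name: "forever",
    database: "home_metrics",
    duration: "INF",
    replication: 1,
    isDefault: true,
  }),
);
// prints: CREATE RETENTION POLICY "forever" ON "home_metrics" DURATION INF REPLICATION 1 DEFAULT
```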
node-influx has a fairly simple API. Connecting to the database is straightforward; however, according to the tutorial one should provide the schema one will write with. The actual API doc examples appear to hold a different opinion. The other options, such as user name and shared secret, are fairly straightforward. Connections are handled out of band from the constructor. Under the hood connections are pooled and the client will intelligently handle errors with back-offs. Writing data points returns a promise for the result, which is pretty straightforward. The querying documentation unfortunately does not provide an example of the resulting data structures, but that is irrelevant when InfluxDB is only a data sink.
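Whatever the client library, a written point comes down to a measurement, optional tags, one or more fields, and an optional timestamp. The sketch below does not use node-influx at all; it is a hypothetical, simplified encoder for the InfluxDB 1.x line protocol, shown only to illustrate what a schema of tags and fields describes. The type and function names are my own assumptions, and integer fields and non-finite numbers are deliberately left out.

```typescript
// Illustrative types; these names are not part of node-influx.
type FieldValue = number | string | boolean;

interface Point {
  measurement: string;
  tags?: Record<string, string>;
  fields: Record<string, FieldValue>;
  // Integer epoch time in the precision the write request declares
  // (nanoseconds by default in InfluxDB 1.x).
  timestamp?: number;
}

// Tag keys, tag values and field keys escape commas, equals signs and spaces.
const escapeKey = (s: string): string => s.replace(/[,= ]/g, "\\$&");
// Measurement names escape commas and spaces.
const escapeMeasurement = (s: string): string => s.replace(/[, ]/g, "\\$&");

function formatField(v: FieldValue): string {
  if (typeof v === "string") {
    // String field values are double-quoted; escape backslashes and quotes.
    return `"${v.replace(/[\\"]/g, "\\$&")}"`;
  }
  // Numbers are written as floats here; booleans as true/false.
  return String(v);
}

function toLineProtocol(p: Point): string {
  const tags = Object.entries(p.tags ?? {})
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0)) // sorted tag keys are recommended, not required
    .map(([k, v]) => `,${escapeKey(k)}=${escapeKey(v)}`)
    .join("");
  const fieldEntries = Object.entries(p.fields);
  if (fieldEntries.length === 0) {
    throw new Error("a point needs at least one field");
  }
  const fields = fieldEntries
    .map(([k, v]) => `${escapeKey(k)}=${formatField(v)}`)
    .join(",");
  const ts = p.timestamp === undefined ? "" : ` ${p.timestamp}`;
  return `${escapeMeasurement(p.measurement)}${tags} ${fields}${ts}`;
}

console.log(
  toLineProtocol({
    measurement: "cpu_load",
    tags: { region: "home", host: "nas" },
    fields: { value: 0.64 },
    timestamp: 1500000000,
  }),
);
// prints: cpu_load,host=nas,region=home value=0.64 1500000000
```

The timestamp in this example is in seconds, so the write would need to declare second precision for the server to interpret it correctly.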
Ruby’s client may operate in two modes: asynchronous and synchronous. In sync mode, writing data points is a blocking call. In async mode, the client enqueues the outgoing values to be written and optionally blocks if the queue is full. Connections are fairly straightforward, just requiring network coordinates. The operational mode defaults to synchronous; however, the asynchronous options may be passed. By default values are written with timestamps in seconds, and it appears there is a dance required to get sub-second times.
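The queueing idea behind the asynchronous mode can be sketched independently of the Ruby client. The following is a hypothetical illustration in TypeScript, not the Ruby client's implementation or API: a bounded buffer that accepts points until it is full and hands them out in batches. Since this sketch is single-threaded, a full buffer is reported to the caller instead of blocking; the caller can then wait, drop the point, or fall back to a synchronous write.

```typescript
// Hypothetical sketch; class and method names are illustrative only.
class BoundedBuffer<T> {
  private items: T[] = [];
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  // Try to enqueue an item. Returns false when the buffer is full so the
  // caller can decide whether to wait, drop, or write synchronously.
  offer(item: T): boolean {
    if (this.items.length >= this.capacity) return false;
    this.items.push(item);
    return true;
  }

  // Remove and return up to `max` items, oldest first, for one batched write.
  drain(max: number): T[] {
    return this.items.splice(0, max);
  }

  get size(): number {
    return this.items.length;
  }
}

const buffer = new BoundedBuffer<string>(2);
console.log(buffer.offer("cpu value=0.5")); // true
console.log(buffer.offer("mem value=0.7")); // true
console.log(buffer.offer("disk value=0.2")); // false, the buffer is full
console.log(buffer.drain(10)); // the two accepted points, oldest first
console.log(buffer.size); // 0
```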
So overall it looks like I will use InfluxDB for storage and aggregation of values and Grafana as the analysis layer. I have some reservations regarding the default retention policy. It is also reasonable to remove the data at some point, I suppose. Initial implementations look reasonable to proceed with, though.
From an architectural perspective I would love to build a service which aggregates the data and forwards it to various sinks. This would resolve concerns surrounding data retention and remove the need to push out details to various applications. This could easily be set up as a WebSocket service or straight HTTP(S).
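The core of such a service is a fan-out: take one incoming point and hand it to every configured sink, without letting one failing sink stop the others. The sketch below is only an assumption of how that core could look; the Sink interface, the MetricPoint shape, and the sink names are hypothetical, and a real service would sit behind a WebSocket or HTTP(S) endpoint and write asynchronously.

```typescript
// Hypothetical shapes for illustration; nothing here is an existing API.
interface MetricPoint {
  name: string;
  value: number;
  timestamp: number; // epoch seconds
}

interface Sink {
  name: string;
  write(point: MetricPoint): void;
}

// Deliver a point to every sink. A sink that throws is recorded and skipped,
// so the remaining sinks still receive the point.
function fanOut(point: MetricPoint, sinks: Sink[]): string[] {
  const failed: string[] = [];
  for (const sink of sinks) {
    try {
      sink.write(point);
    } catch (_err) {
      failed.push(sink.name);
    }
  }
  return failed;
}

// Two toy sinks: one keeps points in memory, one always fails.
const archive: MetricPoint[] = [];
const memorySink: Sink = { name: "memory", write: (p) => { archive.push(p); } };
const brokenSink: Sink = {
  name: "broken",
  write: () => { throw new Error("sink unavailable"); },
};

const failures = fanOut(
  { name: "cpu_load", value: 0.64, timestamp: 1500000000 },
  [brokenSink, memorySink],
);
console.log(failures);       // [ 'broken' ]
console.log(archive.length); // 1
```

Keeping the list of failed sinks as the return value leaves the retry policy to the caller, which seems like the simplest starting point.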