Monitoring Ephemeral Hosts/Tests

While I like the TICK architecture in general, Chronograf and Kapacitor deal poorly with measurements from ephemeral hosts. Because of this, one is forced to delete trend data just to keep configuration through the GUI manageable. As an example, let me describe one of our main use cases.

We are an AI company. We use TensorFlow and GPU cards to create neural network models. We do this in Kubernetes on AWS, Google and Azure. The Kubernetes clusters are created on a per project-run basis and are then torn down. A project can run for several days or weeks. While it is running, we want to know that the hosts are working, and we want to monitor the trend data for performance analysis. After it has finished, we want to keep the trend data to compare with new runs: to see the impact of changes like a new version of TensorFlow or a different GPU card, or to compare the performance of servers from one cloud vendor to another.

The problem is that the old trend data pollutes the Chronograf GUI, which is centered on monitoring active systems. We don’t want to see old hosts listed in the host view. We don’t want to hand-pick the active hosts (and keep updating the list) in Kapacitor alerts every time we monitor a new run. There needs to be a mechanism, usable by the GUI, to differentiate active data from archived data. I have not seen one documented.

One possible solution would be to put an end-cap on a measurement: a last record that signals that this particular stream is no longer live and should not be displayed in the Hosts List or in the Alert Rule Builder's Measurements and Tags. I’m sure there are better solutions. Whatever is used, Chronograf needs to support both active and inactive systems.
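To make the end-cap idea concrete, here is a rough Python sketch. It only illustrates the behaviour I am asking for; it is not something Chronograf or Kapacitor does today. The `stream_closed` field name, the `Point` type and the `active_hosts` helper are all invented for this example.

```python
from dataclasses import dataclass, field

# Hypothetical marker field; nothing in the TICK stack defines this name.
END_CAP_FIELD = "stream_closed"


@dataclass
class Point:
    """A simplified stand-in for one record in a host's measurement stream."""
    host: str
    time: int  # timestamp, any monotonically increasing unit
    fields: dict = field(default_factory=dict)


def end_cap(host: str, time: int) -> Point:
    """Build the final record that marks a host's stream as no longer live."""
    return Point(host=host, time=time, fields={END_CAP_FIELD: True})


def active_hosts(points: list) -> list:
    """Return hosts whose most recent record is NOT an end-cap."""
    latest = {}
    for p in points:
        # Keep the newest record per host; on a tie, the later list entry wins.
        if p.host not in latest or p.time >= latest[p.host].time:
            latest[p.host] = p
    return sorted(
        host for host, p in latest.items()
        if not p.fields.get(END_CAP_FIELD, False)
    )
```

With something like this, a host list or alert rule builder would only need to look at the newest record per host to decide whether to show it, and the archived trend data would stay in the database untouched.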

Note that we also need an extension to Telegraf to tag measurements with a project/run value that can be selected to view old trend data. (But that is a different Category.)
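As an illustration of what such tagging could produce, here is a small Python sketch that formats one record in InfluxDB line protocol with two extra tags. The tag keys `project` and `run`, the helper names and the sample values are assumptions made up for this example, and the sketch only handles numeric (float) field values.

```python
def escape_tag(value: str) -> str:
    """Escape the characters line protocol treats specially in tag keys/values."""
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value


def tagged_line(measurement: str, tags: dict, fields: dict,
                timestamp_ns: int, project: str, run: str) -> str:
    """Format one line-protocol record with hypothetical project/run tags added."""
    all_tags = dict(tags, project=project, run=run)
    tag_part = ",".join(
        f"{escape_tag(k)}={escape_tag(v)}" for k, v in sorted(all_tags.items())
    )
    # Assumes every field value is a float, which needs no quoting or suffix.
    field_part = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_part} {field_part} {timestamp_ns}"


line = tagged_line(
    "gpu", {"host": "gpu-node-1"}, {"utilization": 87.5},
    1500000000000000000, project="demo-project", run="run-42",
)
print(line)
```

Old trend data could then be pulled up by filtering on those two tags rather than by remembering which hosts belonged to which run.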