Need Advice: InfluxDB - Telegraf Design

Hi, I’m planning to monitor 500+ servers with Telegraf; some run Linux and some run Windows. I also have Telegraf container instances for plugins such as ping, HTTP GET requests, and a SQL collector. I’ll then display the data in Grafana. Here are my questions:

  • Should I create separate databases for Linux and Windows servers?
  • Should I create separate databases for the SQL, ping, and HTTP collectors?

I think I should separate them, because otherwise a host query from Grafana will return every host, and some metrics will be empty for some hosts.

Also, any good documentation for beginners would be appreciated.

Hi @Mert, thanks for posting your questions!

Overall I think your goal here is pretty typical of the workflow for a number of users. The beauty of the Telegraf binary is that each collection plugin outputs the same line protocol when sending to InfluxDB, regardless of the host’s OS. You can also specify custom tags on your plugins and use them as template variables in Grafana/Chronograf. In this case, if you know whether you’re deploying a Telegraf agent on a Windows or a Linux host, you might add a custom tag such as os=<windows|linux> and use it in your queries.

How to store this in InfluxDB usually comes down to personal preference: you can split the data into different databases based on your particular needs, or keep it all in one database.

While there isn’t any prescription on how best to store this data, some things to keep in mind include:

  1. Retention policy – think about how long you want to keep data in your InfluxDB instance
  2. Cardinality – a rough measure of the uniqueness of your data; make sure you don’t define so many measurement + tag + field combinations that cardinality climbs too high, as this would hurt query performance
  3. Enabling Windows-specific collection plugins – two plugins can only collect on Windows machines: win_perf_counters and win_services. Following the guide in each plugin’s README might give better insight into how to configure your nodes
  4. Use Grafana templates – the folks at Grafana have a public repo of dashboard templates that might help you get started
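
As a rough sketch of how the retention-policy point can surface on the Telegraf side: the influxdb output plugin accepts a target database and a retention policy. The URL, database name, and retention policy name below are hypothetical placeholders, and the retention policy is assumed to have been created in InfluxDB beforehand.

```toml
# telegraf.conf (output section); all values are illustrative placeholders
[[outputs.influxdb]]
  urls = ["http://127.0.0.1:8086"]   # assumed InfluxDB address
  database = "telegraf"              # hypothetical single database for all hosts
  retention_policy = "thirty_days"   # hypothetical RP; an empty string writes to the database default
```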

Hope that helps get you going. Definitely post back with any questions as you go, and let us know what you build!

Hi @sebito91,

Yes, it should be pretty typical. I’m currently testing on a test server using different databases. I think keeping everything in one database should be better; please correct me if any disadvantages come to mind.

I believe, I should configure the .conf file like this:

[[inputs.system]]
   [inputs.system.tags]
     os = "linux"

I’ve read about RPs and downsampling before, but I haven’t configured any on my test build yet. Again, thanks for the answer and the links you shared. I’ll definitely write back in the next few days.

That certainly looks good to me! You can define tags within each plugin as you’ve done in your example, or in the global tags section at the top of the config file. Either way, I think those tags will be useful in your queries if you truly want to distinguish hosts by OS.
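
For illustration, the two placements mentioned above could look like this in telegraf.conf. This is a minimal sketch; the tag key os and its values are just the ones discussed in this thread, and you would normally pick one placement rather than both.

```toml
# Option A: global tags, added to every metric this agent emits.
[global_tags]
  os = "linux"   # use "windows" in the config deployed to Windows hosts

# Option B: per-plugin tags, added only to metrics from this one input.
[[inputs.system]]
  [inputs.system.tags]
    os = "linux"
```

With a tag like this in place, a Grafana template variable could be backed by a query along the lines of `SHOW TAG VALUES FROM "system" WITH KEY = "host" WHERE "os" = 'linux'`, so a Linux dashboard lists only Linux hosts even when everything lives in one database.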

I’ve set up the win_perf_counters plugin to collect Windows process metrics. For only 4 servers I already have around 900 series (server x tag x field) in the win_proc measurement. I’m afraid the number will escalate quickly in production and lead to bad performance. I don’t really need all of that information for each process, but I do need to know the top processes that cause, or may cause, a bottleneck.

For example, a server was unable to respond today. Metrics showed me that its memory and CPU usage went to 100%, and then services stopped responding. I wasn’t able to identify the cause. Therefore, keeping process information is necessary.

How should we handle this kind of trade-off?
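
One possible way to trade detail for cardinality, sketched under assumptions: instead of collecting every process instance, the win_perf_counters object can list named instances and a short set of counters. The process names below are hypothetical, and the obvious cost is that a process not on the list (possibly the one causing the outage) is not recorded.

```toml
# telegraf.conf on a Windows host; instance names are hypothetical examples
[[inputs.win_perf_counters]]
  [[inputs.win_perf_counters.object]]
    ObjectName = "Process"
    # Named instances instead of ["*"]: only these processes produce series.
    Instances = ["sqlservr", "w3wp"]
    # A short counter list keeps the number of fields per process small.
    Counters = ["% Processor Time", "Working Set"]
    Measurement = "win_proc"
```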

I would just be careful as your data grows. Running a periodic check on the cardinality will help, but make sure to use the tsi1 index instead of the inmem index for InfluxDB! With inmem you’re limited to roughly 1M series before you may start seeing query degradation; with tsi1 that number jumps up dramatically, but it is not quite unbounded.
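
As a minimal sketch, assuming an InfluxDB 1.x server, the index type is selected in the [data] section of influxdb.conf:

```toml
# influxdb.conf (InfluxDB 1.x); a server restart is needed for the change to take effect
[data]
  index-version = "tsi1"   # the default is "inmem"
```

Shards created before the change keep their in-memory index until they are converted (the influx_inspect buildtsi tool exists for that). For the periodic cardinality check, the InfluxQL statement `SHOW SERIES CARDINALITY` run against the database gives an estimate of the series count.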