Hi
We are using a Kubernetes-based platform. etcd is a key component of the Kubernetes cluster. I would like to collect metrics about etcd health, read and write speeds, latency, throughput, and so on. Is there a recommended way to collect this information using Telegraf?
Srinivas Kotaru
@Srinivas_Kotaru The way I’ve done it in the past is to use the Prometheus plugin to export the metrics in that format. You can also use Kapacitor’s service discovery and scraping to do this exact same thing.
If you are running TICK in Kubernetes, I would suggest you check out tick-charts!
Hope that helps,
Jack
@jackzampolin Thanks as usual.
Can we pull the Kube API server metrics directly instead of running a Prometheus server? I mean, every Kubernetes component exposes metrics under the /metrics URL.
Srinivas Kotaru
@Srinivas_Kotaru Yup! But for kubelet metrics, Telegraf has a plugin that pulls those. The telegraf-ds is configured with that plugin by default.
Reading the etcd FAQ, they recommend monitoring the p99 of backend_commit_duration_seconds and wal_fsync_duration_seconds.
The data exported by the metrics endpoint looks like this:
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 3.522449e+06
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.002"} 1.0488103e+07
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.004"} 1.3733184e+07
Have you created a graph for those values? How?
Thanks!
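A side note on reading that output: Prometheus histogram buckets are cumulative, so each le line counts every observation at or below that bound. Here is a minimal Python sketch that turns the three sample lines quoted above into per-range counts. The names parse_buckets and per_range_counts are made up for illustration, and the regex only handles this simple single-label form.

```python
import re

# The three sample lines quoted in this thread.
SAMPLE = """\
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 3.522449e+06
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.002"} 1.0488103e+07
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.004"} 1.3733184e+07
"""

# Matches lines of the form: name_bucket{le="<bound>"} <cumulative count>
BUCKET_RE = re.compile(r'^(\w+)_bucket\{le="([^"]+)"\}\s+(\S+)$')


def parse_buckets(text):
    """Return sorted (upper_bound_seconds, cumulative_count) pairs."""
    pairs = []
    for line in text.splitlines():
        m = BUCKET_RE.match(line.strip())
        if m:
            pairs.append((float(m.group(2)), float(m.group(3))))
    return sorted(pairs)


def per_range_counts(pairs):
    """Convert cumulative counts into counts per (lower, upper] range."""
    out = []
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in pairs:
        out.append((prev_bound, bound, count - prev_count))
        prev_bound, prev_count = bound, count
    return out


buckets = parse_buckets(SAMPLE)
ranges = per_range_counts(buckets)
for lo, hi, n in ranges:
    print(f"{lo * 1000:g}-{hi * 1000:g} ms: {n:.0f} fsyncs")
```

Because the counters only ever grow, a graph normally needs the difference between two scrapes rather than the raw totals.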
My approximation is to draw three lines with queries like these:
SELECT mean("0.016")/mean("count") FROM "etcd_disk_backend_commit_duration_seconds" WHERE $timeFilter GROUP BY time($__interval) fill(null)
SELECT mean("0.032")/mean("count") FROM "etcd_disk_backend_commit_duration_seconds" WHERE $timeFilter GROUP BY time($__interval) fill(null)
And draw a threshold line at 0.99.
According to the etcd docs, if the disk is fast enough, the 32 ms line should be above the 99% threshold line.
The same applies to wal_fsync_duration_seconds.
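The same check can be done outside the dashboard. Below is a rough Python sketch, assuming you already have the cumulative bucket counts and the total count for one time window. The numbers are invented purely for illustration (they are not real etcd data), and fraction_at_or_below and estimate_quantile are hypothetical helper names. The quantile estimate interpolates linearly inside a bucket, which is the same idea as Prometheus's histogram_quantile, so it is an approximation and not an exact p99.

```python
def fraction_at_or_below(buckets, total, bound):
    """Share of observations <= bound.

    buckets maps an upper bound in seconds to its cumulative count.
    """
    return buckets[bound] / total


def estimate_quantile(buckets, total, q):
    """Estimate quantile q by linear interpolation within its bucket."""
    target = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound in sorted(buckets):
        count = buckets[bound]
        if count >= target:
            if count == prev_count:
                return bound
            span = bound - prev_bound
            return prev_bound + span * (target - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    # The quantile lies above the largest bucket that was supplied.
    return float("inf")


# Invented example counts (NOT real etcd data): 1000 commits in the window.
example = {0.008: 900.0, 0.016: 980.0, 0.032: 996.0, 0.064: 1000.0}
total = 1000.0

share_32ms = fraction_at_or_below(example, total, 0.032)
p99 = estimate_quantile(example, total, 0.99)
print(f"share <= 32 ms: {share_32ms:.3f}")
print(f"estimated p99: {p99 * 1000:.1f} ms")
```

With these made-up numbers the share at 32 ms is above 0.99, which is the same condition as the 32 ms line sitting above the threshold line in the graph.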