Telegraf - vSphere

I am using Telegraf to collect vSphere metrics. It looks like the data is collected every 300s.

Is there a way to decrease the default sampling time of 300s to get more granular data? I was unable to find it in the documentation.

The reason I ask is that the data collected via Telegraf's vSphere plugin looks totally different from similar data collected via https://github.com/Oxalide/vsphere-influxdb-go:

Here is the data for the past 1 hour for the same set of servers via Telegraf's vSphere plugin (sampling interval left at the default 300s):

Here is the data for the past 1 hour for the same set of servers using the Oxalide/vsphere-influxdb-go plugin (sampling time 60s):

There are two options you can modify. The first is object_discovery_interval (default "300s"), which controls how often new objects are discovered; lowering it would allow the plugin to notice a new VM more quickly. The other option is interval, which can be added to any input plugin to control how frequently metrics are collected. By default the Telegraf agent interval is used, which is normally 10s.
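For example, a minimal sketch of setting both options (the vCenter URL and credentials here are placeholders, and the 60s value is only for illustration, not a recommendation):

[agent]
## global collection interval, used by any input that does not override it
interval = "10s"

[[inputs.vsphere]]
vcenters = [ "https://vcenter.local/sdk" ]
username = "user"
password = "secret"
## per-plugin override: gather vSphere metrics every 60s instead of the agent default
interval = "60s"
## re-discover inventory objects (new VMs, hosts, datastores) every 300s
object_discovery_interval = "300s"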

That said, it looks to me like you have more problems than just the interval being wrong.

Thank you, I will try the interval option and let you know how it goes.

Here is my complete Telegraf configuration; do you see anything wrong with it?

telegraf.txt (8.0 KB)

Why do you suspect something more is wrong? Please elaborate.

The config file looks okay to me but the image you posted above from Grafana is completely flat.

I was able to fix the flat graph issue. It was a query issue in grafana.

However, after fixing the graphing issue I still see that the collection times are at least 6 min apart, even after lowering my collection intervals. Here is my configuration:

[[inputs.vsphere]]
## List of vCenter URLs to be monitored. These three lines must be uncommented
## and edited for the plugin to work.
# vcenters = [ "https://vcenter.local/sdk" ]
vcenters = [ "xxxx" ]
username = "xxxx"
password = "xxx"
interval = "30s"

## number of objects to retrieve per query for realtime resources (vms and hosts)
## set to 64 for vCenter 5.5 and 6.0 (default: 256)
# max_query_objects = 256
max_query_objects = 64

## number of metrics to retrieve per query for non-realtime resources (clusters and datastores)
## set to 64 for vCenter 5.5 and 6.0 (default: 256)
# max_query_metrics = 256
max_query_metrics = 64

## number of go routines to use for collection and discovery of objects and metrics
collect_concurrency = 1
discover_concurrency = 1

## whether or not to force discovery of new objects on initial gather call before collecting metrics
## when true for large environments this may cause errors for time elapsed while collecting metrics
## when false (default) the first collection cycle may result in no or limited metrics while objects are discovered
force_discover_on_init = true

## the interval before (re)discovering objects subject to metrics collection (default: 300s)
object_discovery_interval = "30s"

## timeout applies to any of the api request made to vcenter
timeout = "20s"

## Optional SSL Config
# ssl_ca = "/path/to/cafile"
# ssl_cert = "/path/to/certfile"
# ssl_key = "/path/to/keyfile"
## Use SSL but skip chain & host verification
insecure_skip_verify = true

Here is the graph I get when using Telegraf's vSphere plugin:

Here is the graph using the vsphere-influxdb plugin listed earlier:

Why is it collecting only every 6 min? How can I lower it to, say, 2 min? I didn't have this issue with the plugin I mentioned earlier in my comments.

Please help.

Are there any messages in the Telegraf log?

I see a lot of these messages in the logs:

2018-11-21T16:16:50Z E! Error in plugin [inputs.vsphere]: took longer to collect than collection interval (30s)

So it is just taking 7 minutes to do a full collection. There are two ways to reduce the amount of time this takes; try one change at a time and watch how it affects the time between points after each change.

  1. You can increase collect_concurrency; you may want to set it to ~4. This allows the plugin to make multiple requests to vCenter at the same time.
  2. You can decrease the number of metrics gathered. This can be done either by setting the per-instance options (vm_instances, host_instances, cluster_instances, datastore_instances, etc.) to false, or by restricting the metrics gathered with the include/exclude options such as host_metric_include, host_metric_exclude, and vm_metric_include. Use the telegraf --input-filter vsphere --test command to see what is being collected and, if some of it is unneeded, trim it out. A rough sketch of both changes follows this list.
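For example, a sketch combining both suggestions (the vCenter URL and credentials are placeholders, and the include list is only an illustrative subset, not a recommended set of metrics):

[[inputs.vsphere]]
vcenters = [ "https://vcenter.local/sdk" ]
username = "xxxx"
password = "xxx"
interval = "60s"
## 1. make up to 4 requests to vCenter in parallel
collect_concurrency = 4
## 2. gather fewer metrics: restrict VM metrics to a small include list
vm_metric_include = [
"cpu.usage.average",
"mem.usage.average",
"net.usage.average"
]
## and skip per-instance (per-core, per-NIC) host series
host_instances = false

To see what would actually be collected with a given config (assuming it lives at /etc/telegraf/telegraf.conf), you can run:

telegraf --config /etc/telegraf/telegraf.conf --input-filter vsphere --test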

After setting collect_concurrency = 4 the graph looks a little better. However, there are still 2 min gaps in between; see below:

Glad it is getting better. You will have to reduce the amount of data using the options I listed above until vCenter can respond in time, or find a way to speed up your vCenter/vSphere's responses. Keep reducing until you no longer see errors in the Telegraf logs.

Since there is a new Telegraf version available, I upgraded from 1.8.3 to 1.9.0. We are on vCenter version 6.5.0.

After the upgrade I don't see any data, and the graphs are empty.

The only thing I see in the logs is:

2018-11-28T21:13:00Z W! [agent] input "inputs.vsphere" did not complete within its interval

I saw the above error on 1.8.3 too, but there it didn't stop data from being generated and the graphs looked good. For some reason, on the new 1.9.0 version, once we see the above error the graphs become empty.

The configuration is still the same:

[[inputs.vsphere]]
vcenters = [ "rzzz" ]
username = "zzzz"
password = "zzz"
interval = "30s"

vm_metric_include = [
"sys.uptime.latest",
"cpu.usage.average",
"cpu.ready.summation",
"cpu.readiness.average",
"cpu.usagemhz.average",
"cpu.wait.summation",
"cpu.system.summation",
"cpu.used.summation",
"mem.usage.average",
"mem.consumed.average",
"mem.active.average",
"mem.vmmemctl.average",
"mem.swapused.average",
"mem.swapIn.average",
"mem.swapOut.average",
"disk.maxTotalLatency.latest",
"net.usage.average",
"net.bytesRx.average",
"net.bytesTx.average",
"net.packetsRx.summation",
"net.packetsTx.summation",
"net.received.average",
"net.transmitted.average",
"virtualDisk.read.average",
"virtualDisk.write.average",
"virtualDisk.totalWriteLatency.average",
"virtualDisk.totalReadLatency.average",
"virtualDisk.numberReadAveraged.average",
"virtualDisk.numberWriteAveraged.average",
"virtualDisk.readOIO.latest",
"virtualDisk.writeOIO.latest"
]
vm_metric_exclude = []
vm_instances = true ## true by default

host_metric_include = [
"cpu.usagemhz.average",
"cpu.usage.average",
"cpu.corecount.provisioned.average",
"mem.capacity.provisioned.average",
"mem.active.average",
"net.throughput.usage.average",
"net.throughput.contention.summation",
"vmop.numSVMotion.latest",
"vmop.numVMotion.latest",
"vmop.numXVMotion.latest",
"storageAdapter.numberReadAveraged.average",
"storageAdapter.numberWriteAveraged.average",
"storageAdapter.read.average",
"storageAdapter.write.average",
"storageAdapter.totalReadLatency.average",
"storageAdapter.totalWriteLatency.average",
"cpu.utilization.average",
"cpu.readiness.average",
"cpu.ready.summation",
"net.bytesRx.average",
"net.bytesTx.average",
"virtualDisk.totalWriteLatency.average",
"virtualDisk.totalReadLatency.average",
"net.received.average",
"net.transmitted.average",
"net.packetsRx.summation",
"net.packetsTx.summation",
"mem.consumed.average",
"mem.totalmb.average"
]
host_metric_exclude = []
host_instances = true ## true by default

datastore_metric_include = [
"datastore.numberReadAveraged.average",
"datastore.numberWriteAveraged.average",
"datastore.read.average",
"datastore.write.average",
"datastore.totalReadLatency.average",
"datastore.totalWriteLatency.average",
"datastore.datastoreVMObservedLatency.latest",
"disk.capacity.latest",
"disk.used.latest",
"disk.numberReadAveraged.average",
"disk.numberWriteAveraged.average"
] ## if omitted or empty, all metrics are collected
datastore_metric_exclude = []
datastore_instances = true ## false by default for Datastores only

datacenter_metric_include = [] ## if omitted or empty, all metrics are collected
datacenter_metric_exclude = [] ## Datacenters are not collected by default.
datacenter_instances = false

cluster_metric_include = [] ## if omitted or empty, all metrics are collected
cluster_metric_exclude = [] ## Nothing excluded by default
cluster_instances = false ## true by default

separator = "_"
max_query_objects = 70
max_query_metrics = 70
collect_concurrency = 4
discover_concurrency = 1
force_discover_on_init = true
object_discovery_interval = "30s"
timeout = "20s"
insecure_skip_verify = true

Can you open a new issue on GitHub?


Is it safe to upgrade to 1.9.3? I am still at 1.8.3 for now. Please advise.

-Naresh