Telegraf-Influxdb-Missing Vsphere Data

Data is missing when collecting vSphere data via Telegraf's vsphere input plugin.

  • Telegraf Version: 1.21.4
  • InfluxDB Version: 1.8.10
  • Vsphere Version: 7.0

There are no errors in the Telegraf logs indicating a problem connecting to vSphere. A dedicated user, “telegraf”, was created in vSphere itself to grant access to the vSphere data. Data is collected for most of the VMs but is missing for some.

Telegraf Configuration:

#Global tags can be specified here in key="value" format.
[global_tags]

#Configuration for telegraf agent
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
 #metric_buffer_limit = 10000##old setting
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  #debug = true
  debug = false
  #quiet = false
  quiet = true
  logfile = "/sites/tigstack/telegraf/logs/telegraf.log"
  hostname = ""
  omit_hostname = false

###############################################################################
####                          OUTPUT PLUGINS                                   #
###############################################################################

##Configuration for sending metrics to InfluxDB
[[outputs.influxdb]]
  urls = ["http://localhost:8086"]
  database = "telegraf"
  retention_policy = ""
  write_consistency = "any"
  timeout = "5s"

###############################################################################
####                  INPUT PLUGINS                                    #
###############################################################################

[[inputs.vsphere]]
 vcenters = [ "https://server/sdk" ]
 username = "telegraf"
 password = "xxxxxxx"
 interval = "60s" ## should never be less than 20s

 vm_metric_include = [
   "sys.uptime.latest" ,
   "cpu.usage.average" ,
   "cpu.ready.summation" ,
   "cpu.readiness.average" ,
   "cpu.costop.summation" ,
   "mem.usage.average" ,
   "net.usage.average" ,
   "net.received.average" ,
   "net.transmitted.average" ,
   "virtualDisk.read.average" ,
   "virtualDisk.write.average" ,
   "virtualDisk.totalWriteLatency.average" ,
   "virtualDisk.totalReadLatency.average" ,
   "virtualDisk.numberReadAveraged.average" ,
   "virtualDisk.numberWriteAveraged.average" ,
   "virtualDisk.readOIO.latest" ,
   "virtualDisk.writeOIO.latest"
]

  host_metric_include = [
  "cpu.usage.average" ,
  "cpu.costop.summation" ,
  "cpu.readiness.average" ,
  "cpu.ready.summation" ,
  "storageAdapter.numberReadAveraged.average" ,
  "storageAdapter.numberWriteAveraged.average" ,
  "storageAdapter.read.average" ,
  "storageAdapter.write.average" ,
  "storageAdapter.totalReadLatency.average" ,
  "storageAdapter.totalWriteLatency.average" ,
  "virtualDisk.totalWriteLatency.average" ,
  "virtualDisk.totalReadLatency.average" ,
  "net.received.average" ,
  "net.transmitted.average" ,
  "net.packetsRx.summation" ,
  "net.packetsTx.summation" ,
  "mem.consumed.average" ,
  "mem.totalmb.average"
]

########################## Exclude all historical metrics###########################

 datastore_metric_exclude = ["*"]
 cluster_metric_exclude = ["*"]
 datacenter_metric_exclude = ["*"]

####################################################################################

 separator = "_"
#max_query_objects = 150
#max_query_metrics = 150

############following parameter values should always match each other, they offer performance boost. Max value for each is 8 ###############

#collect_concurrency = 6
#discover_concurrency = 6

############################################################################################################

 #force_discover_on_init = false
 #object_discovery_interval = "60s"
 #timeout = "60s"
 insecure_skip_verify = true

#############################################################END-OF-VSPHERE-INPUT-CONFIGURATION################################################################################

Can you please run telegraf with --debug --test to see if you receive any data? It would also be helpful to see the log messages of that run if no metric is shown on the console…

Here is the debug log generated with the same configuration listed above, by enabling “debug”:

telegraf.log_debug_20221205.txt (41.8 KB)
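
For anyone reproducing this, a debug log like the attached one can be produced with agent settings along these lines. This is only a minimal sketch based on the [agent] section posted above: just debug and quiet change, and quiet is switched off on the assumption that it would otherwise limit the log to error-level messages.

```toml
[agent]
  interval = "10s"
  ## Log at debug level so discovery and collection details are written out.
  debug = true
  ## Assumption: quiet = true restricts logging to errors, so disable it
  ## for the duration of the debugging session.
  quiet = false
  ## Same log file as in the configuration posted above.
  logfile = "/sites/tigstack/telegraf/logs/telegraf.log"
```

Remember to revert both settings afterwards, since debug logging on a large vCenter can grow the log file quickly.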

As noted earlier, we can see the vSphere data for most of the servers, but the data is missing for some of them. There are no errors in the logs related to the missing data.

Are all your servers the same version?

In the plugin there is a metric_lookback option that may help with this node:

 ## The number of vSphere 5 minute metric collection cycles to look back for non-realtime metrics. In
  ## some versions (6.7, 7.0 and possible more), certain metrics, such as cluster metrics, may be reported
  ## with a significant delay (>30min). If this happens, try increasing this number. Please note that increasing
  ## it too much may cause performance issues.

Can you please try setting that to a higher number.
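
As a rough illustration of that suggestion, the option sits inside the existing [[inputs.vsphere]] block. The value 5 below is purely an assumption for the sake of the example (as far as I know the plugin's sample configuration uses 3), so treat it as a starting point rather than a recommendation.

```toml
[[inputs.vsphere]]
  vcenters = [ "https://server/sdk" ]
  username = "telegraf"
  password = "xxxxxxx"
  interval = "60s"

  ## Number of vSphere 5 minute collection cycles to look back.
  ## 5 is an illustrative assumption; raising it too far may cause
  ## performance issues, as the plugin documentation warns.
  metric_lookback = 5

  insecure_skip_verify = true
```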

@jpowers appreciate your feedback on this issue.

Are all your servers the same version? Can you please elaborate? I am not sure I understand this question.

Our Telegraf agent is configured to collect only host- and VM-specific data via the vsphere plugin. We are excluding all the cluster metrics via cluster_metric_exclude = ["*"].

The metric_lookback option looks specific to cluster data collection, am I wrong?

Do you still want me to try this option? I ask because I am hesitant to make changes to our prod environment without being sure.

Sorry, I was wondering what your vSphere version is across all servers.

The metric_lookback option looks specific to cluster data collection, am I wrong?

My understanding is that it applies across all metrics: cluster, host, VMs, etc.

Do you still want me to try this option? I ask because I am hesitant to make changes to our prod environment without being sure.

Please do, or if possible run a second Telegraf instance with just this change.
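
One way to try this without touching the production agent is a separate, throwaway configuration for a second Telegraf process. The sketch below is an assumption about how that could look, not a tested setup: the file name, the single VM metric, and the metric_lookback value are all hypothetical, and metrics are printed to stdout instead of being written to InfluxDB so the production database stays untouched.

```toml
## telegraf-vsphere-test.conf (hypothetical file name)
[agent]
  interval = "60s"
  debug = true
  quiet = false

## Print metrics to the console instead of writing to InfluxDB.
[[outputs.file]]
  files = ["stdout"]
  data_format = "influx"

[[inputs.vsphere]]
  vcenters = [ "https://server/sdk" ]
  username = "telegraf"
  password = "xxxxxxx"
  interval = "60s"

  ## Keep the test small: one VM metric, everything else excluded.
  vm_metric_include = [ "cpu.usage.average" ]
  host_metric_exclude = ["*"]
  datastore_metric_exclude = ["*"]
  cluster_metric_exclude = ["*"]
  datacenter_metric_exclude = ["*"]

  ## The only functional change under test; 5 is an illustrative value.
  metric_lookback = 5

  insecure_skip_verify = true
```

Started with telegraf --config pointing at this file, the second instance should show on the console whether the missing VMs appear, independently of the production agent.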

@jpowers our missing data issue is resolved. It was a user permission issue in vSphere.

In order for the telegraf agent to collect data from our vsphere (v7), a “telegraf” user was created with “Read-Only” access. This user was then granted access at the cluster level which consisted of various hosts and VMs. In this particular case the VMs XX and YY were part of a cluster “Hosting”. Although the “telegraf” user had read-only access to these servers (XX & YY), data was not collected by the agent. In order to fix the issue our admin had to manually grant the “telegraf” user read-only access to these servers individually (instead of at cluster level). After a minute data started showing up.

It looks like the vSphere user access restrictions are complicated and get corrupted (for lack of a better word) at times.

I just wanted to share this information with others who may have a similar issue.

Thank you again for your time and feedback regarding this issue. I have left the configuration as is and not implemented your requested changes.
