Telegraf SNMP counters missing on Prometheus endpoint

Hello,
I’m facing an issue I haven’t been able to figure out yet. This is my first time using Telegraf. I’m using the SNMP input plugin to collect various counters from some network elements that have custom vendor MIBs. I only collect the fields and tables I want and expose them on a Prometheus endpoint with the Prometheus output plugin, which is then scraped by our internal Thanos/Prometheus cluster. I’m not doing any transformations on the data on the Telegraf side. Everything works as expected except for these specific counters:

Telegraf version: Telegraf 1.22.4
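For context, these field and table definitions sit inside an [[inputs.snmp]] block roughly like the one below; the agent address and community here are placeholders, not my real values:

[[inputs.snmp]]
  agents = ["udp://[hidden]:161"]
  version = 2
  community = "[hidden]"
  timeout = "5s"

The relevant field and table definitions are: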

  [[inputs.snmp.field]]
    oid = "JINNY-JAR-MIB::mc-fe-alive.0"
    name = "mc_fe_alive"

  [[inputs.snmp.field]]
    oid = "JINNY-JAR-MIB::mc-fe-total.0"
    name = "mc_fe_total"
  
  [[inputs.snmp.field]]
    oid = "JINNY-JAR-MIB::mc-fe-total-a.0"
    name = "mc_fe_total_a"
  
  [[inputs.snmp.field]]
    oid = "JINNY-JAR-MIB::mc-fe-delay.0"
    name = "mc_fe_delay"

  [[inputs.snmp.table]]
    oid = "JINNY-JAR-MIB::mcTable"
    name = "jinny_smsc_mcTable"
    inherit_tags = ["sysName"]
    index_as_tag = true
    [[inputs.snmp.table.field]]
      oid = "JINNY-JAR-MIB::mc-name"
      name = "mc_name"

These are defined inside an inputs.snmp scope as usual, and I have many other counters in the same scope, from the same MIB, that are collected and exposed in Prometheus as expected. However, for some reason these specific mc* counters are collected when I run Telegraf in test mode but are not exposed on the Prometheus endpoint. See the test output below:

> jinny_smsc_jmg,host=telegraf-jinny-smsc-5cc67878df-gbbpk,source=[hidden],sysName=[hidden],in-alive=2347498068i,in-charge=0i,in-charge-a=0i,in-charge-delay=0i,in-commit=0i,in-commit-a=0i,in-commit-delay=0i,in-reserve=135i,in-reserve-a=22i,in-reserve-delay=1188984i,ismppsai-alive=0i,ismppsai-total-dlvmo=0i,ismppsai-total-dlvmo-a=0i,ismppsai-total-dr=0i,ismppsai-total-dr-a=0i,ismppsai-total-sub=0i,ismppsai-total-sub-a=0i,ismppsi-alive=0i,ismppsi-total-data=0i,ismppsi-total-data-a=0i,ismppsi-total-dlvmo=0i,ismppsi-total-dlvmo-a=0i,ismppsi-total-dr=0i,ismppsi-total-dr-a=0i,ismppsi-total-sub=0i,ismppsi-total-sub-a=0i,mc_fe_alive=96337362i,mc_fe_delay=3143086i,mc_fe_total=21224i,mc_fe_total_a=1265i,smppsai-alive=921725286i,smppsai-spam-b=0i,smppsai-total-dlvmo=3326i,smppsai-total-dlvmo-a=3326i,smppsai-total-dr=68165i,smppsai-total-dr-a=68165i,smppsai-total-sub=84617i,smppsai-total-sub-a=84166i,sysUptime=3522008.9 1729069363000000000
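For reference, that output is from running Telegraf in test mode against the same config, something along these lines (the config path is just an example, not my actual layout):

telegraf --config /etc/telegraf/telegraf.conf --input-filter snmp --test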

When I check the Prometheus endpoint, either via the Thanos/Prometheus UI or by curling the endpoint I have exposed, I see every other counter except those mc* ones.
Prometheus output config:

[[outputs.prometheus_client]]
  listen = ":9126"
  metric_version = 2
  export_timestamp  = true
  path = "/metrics"
  expiration_interval = 0
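For completeness, the curl check I mentioned is basically the following, with the port and path taken from the config above (the grep is just to filter for the missing counters):

curl -s http://localhost:9126/metrics | grep -i mc_fe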

The MIB definitions all follow the same pattern:

mc-fe-alive OBJECT-TYPE
SYNTAX      Gauge
ACCESS      read-only
STATUS      mandatory
DESCRIPTION "Message Control alive counter" 
::= { jar 254 }

mc-fe-total OBJECT-TYPE
SYNTAX      Counter
ACCESS      read-only
STATUS      mandatory
DESCRIPTION "Message Control total submitted messages" 
::= { jar 255 }

mc-fe-total-a   OBJECT-TYPE
SYNTAX      Counter
ACCESS      read-only
STATUS      mandatory
DESCRIPTION "Message Control total submitted (ack) messages" 
::= { jar 256 }

mc-fe-delay OBJECT-TYPE
SYNTAX      Counter
ACCESS      read-only
STATUS      mandatory
DESCRIPTION "Message Control delay when connection to jfe and/or jade" 
::= { jar 257 }

I have tried putting these fields in a different inputs.snmp scope and they still don’t appear on the Prometheus side, even though I can add another counter to that new scope, like sysName from RFC1213-MIB, and that one works fine, so it doesn’t seem to be a scope problem. I also tried increasing metric_batch_size and metric_buffer_limit from the default values, but neither helped. I’m not sure how to troubleshoot this issue further, any suggestions?

Thank you :slight_smile:

The fact that you see them in test mode tells you the issue is not with the inputs.snmp config.

I would look into the Prometheus output. Try the file output with the Prometheus serializer, for example, or try different settings.
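Something like this (the file path is just an example) would let you see exactly which metrics reach the output side:

[[outputs.file]]
  files = ["/tmp/metrics.out"]
  data_format = "prometheus"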

After testing your suggestion of outputting to a file with the Prometheus serializer, I found that if I put these inputs in a new .conf file and run Telegraf with only that conf file, they do show up in the file output. I also tested outputting them to prometheus_client on a different port. However, when I run my service as a whole with all the other counters (5 .conf files, one per network element), those mc* counters still don’t appear on the Prometheus endpoint. Could this be some sort of buffer limit issue? It doesn’t seem to be on the input side, since I can see them being collected in test mode, but I also don’t see much I can configure in the outputs.prometheus_client plugin that could be related to this.

I found the root cause of the problem. It was the vendor MIB: the mc-* counters were declared as scalar objects, and then there was an mcTable with dynamic counters whose entries used the same name declarations (e.g. an mc-fe-alive object and an mc-fe-alive table entry), so gosmi was probably having trouble translating this, even though net-snmp translates them to the correct OIDs. Even on the latest Telegraf version it wouldn’t work. I just removed that mcTable and its entries from the MIB and it works fine with gosmi now.
Thanks to @Hipska, who helped me troubleshoot via Slack.
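For anyone hitting the same thing, the conflict looked roughly like this (a simplified reconstruction; the table entry’s index and description below are illustrative, not copied from the vendor file):

mc-fe-alive OBJECT-TYPE
SYNTAX      Gauge
ACCESS      read-only
STATUS      mandatory
DESCRIPTION "Message Control alive counter"
::= { jar 254 }

-- later in the same MIB, a column of mcEntry reuses the exact same descriptor
mc-fe-alive OBJECT-TYPE
SYNTAX      Gauge
ACCESS      read-only
STATUS      mandatory
DESCRIPTION "Message Control alive counter (per mcTable row)"
::= { mcEntry 2 }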


Yeah, sometimes the MIBs need some fixes to be correct according to the specs. GoSMI follows the specs more strictly, but it might be good to report that to their repo, as it should normally give errors or warnings on something like that. If it had done that, we would have identified this problem much faster.