Missing fields from some Telegraf metrics

Hi Community!
I am using the gnmi input plugin for some Juniper devices, and am noticing missing fields for a particular measurement (bgp_neighbors_enabled).

Consider the following output, which contains all expected tags and fields:

bgp_neighbors_enabled{auth_password="(null)", description="(null)", instance="telegraf:9126", instance_name="xxx", job="telegraf", messages_received_last_notification_error_code="CEASE", messages_sent_last_notification_error_subcode="UNSPECIFIC", neighbor_address="x.x.x.x", path="/network-instances/network-instance/protocols/protocol/bgp/neighbors/neighbor", peer_group="group1", peer_type="EXTERNAL", session_admin_status="START", session_state="ESTABLISHED", session_status="RUNNING", source="xx-br1"}

In contrast, the same measurement for another BGP peer is missing the fields session_state and peer_type:

bgp_neighbors_enabled{instance="telegraf:9126", instance_name="xxx", job="telegraf", messages_received_last_notification_error_code="NONE", messages_received_last_notification_error_subcode="UNSPECIFIC", neighbor_address="x.x.x.x", path="/network-instances/network-instance/protocols/protocol/bgp/neighbors/neighbor", peer_group="group2", session_admin_status="START", session_status="RUNNING", source="xx-br1"}

A few other observations:

  • I can confirm the polled device is returning the complete set of fields, as verified by running gnmic on the Telegraf server. The issue appears to be on the Telegraf end.
  • I have tried bumping up/reducing the various intervals within the Telegraf config.
  • The issue appears to be intermittent. There will be different BGP peers with no session_state & peer_type at different times.

My telegraf config is as follows:

[agent]
interval = "10s"
round_interval = true
collection_jitter = "0s"
flush_interval = "10s"
flush_jitter = "0s"
precision = "0s"
omit_hostname = true
snmp_translator = "gosmi"
debug = true

[[outputs.prometheus_client]]
listen = ":9126"
metric_version = 2
path = "/metrics"
string_as_label = true
export_timestamp = true

[[inputs.gnmi]]
addresses = [
    "xxxx",
    ".....",
    ".....",
]
encoding = "proto"

[[inputs.gnmi.subscription]]
name = "bgp_neighbors"
origin = "openconfig-network-instance"
path = "/network-instances/network-instance/protocols/protocol/bgp/neighbors/neighbor/state/"
subscription_mode = "sample"
sample_interval = "30s"

@ssumsam,
have you tried:

  • increasing sample_interval
  • or subscription_mode = "on_change"

That might help.
Otherwise @srebhan might be able to offer some better suggestions.

Please add an outputs.file output and record the metrics in line-protocol format so we can see if the fields are missing on the input or output side… Furthermore, this helps to reproduce the issue…
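
Something along these lines should do, as a minimal sketch. The file path is only a placeholder; "stdout" can be dropped if you only want the file:

```toml
# Write every metric in line-protocol format, alongside the existing
# prometheus_client output, so input-side and output-side can be compared.
[[outputs.file]]
  # "/tmp/telegraf-gnmi.out" is a placeholder path, adjust as needed.
  files = ["stdout", "/tmp/telegraf-gnmi.out"]
  data_format = "influx"
```

If session_state and peer_type are already missing in this file, the problem is on the input side rather than in the Prometheus output.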

Please also check the Telegraf logs for anything special around the lost fields…

Thanks for getting back on this!

@Anaisdg I tested both the options you mentioned, but still seeing the same issue.

  • increasing sample_interval
  • using subscription_mode = on_change

@srebhan I did some further digging on this using dump_responses = true, and can now see the likely cause. The Juniper device sometimes splits the update into two notifications, each with a different timestamp and a different set of labels, so Telegraf does not correlate them. These split notifications seem to cause Telegraf to drop one of the two, which in turn results in the missing labels.

Is there any config needed on the Telegraf end to help remediate this?

I guess the only chance then is to use the starlark processor and merge the two metrics manually.
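
A rough, untested sketch of what that could look like is below. It does not merge two metrics in the strict sense; it remembers the last known fields per neighbor and fills in whatever a later notification leaves out. It assumes the measurement is named bgp_neighbors (from the subscription name), that source and neighbor_address are tags that identify a neighbor, and that your Telegraf version provides the shared `state` dictionary in the Starlark processor:

```toml
[[processors.starlark]]
  # Assumption: the gnmi subscription name is used as the measurement name.
  namepass = ["bgp_neighbors"]
  source = '''
def apply(metric):
    # Assumption: these two tags uniquely identify one BGP neighbor.
    key = metric.tags.get("source", "") + "|" + metric.tags.get("neighbor_address", "")

    # Fields seen for this neighbor in earlier notifications (empty on first sight).
    cached = state.get(key, {})

    # Fill in any field that this notification did not carry.
    for name, value in cached.items():
        if metric.fields.get(name) == None:
            metric.fields[name] = value

    # Remember the combined field set for the next notification.
    state[key] = dict(metric.fields.items())
    return metric
'''
```

Note that this keeps stale values around if a field legitimately disappears, and the cache is lost on restart, so treat it as a workaround rather than a fix.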