Help tracking down a performance issue in telegraf + opcua input plugin (1s interval not keeping up)

I am using telegraf with the opcua input plugin. I am reading around 10,000 tags from a Kepware server at a 1 second interval (contractually obligated for warranty reasons).

I started by reading around 100 tags and all went well. When I tried scaling up to the full 10,000 tags, Kepware would reject my connection. To work around this I configured multiple instances of the plugin with 500 tags each. This worked and I was able to read in all 10,000 tags. However, my 1 second intervals are not keeping up. I am storing the data in VictoriaMetrics, and when I query for raw samples most are 1 second apart, but some are 2 seconds apart. The number that are 2 seconds apart is growing over time: when I first started I was averaging around 1.1 seconds per scrape, but around 24 hours later the average is 1.7 seconds per scrape. Note that the gap between samples is always either exactly 1000ms or 2000ms, nothing else, which I find a little suspicious.

I am looking for suggestions on how to track this down. How do I know whether it is telegraf struggling to keep up, VictoriaMetrics, or Kepware struggling to respond? I've used VictoriaMetrics with much, much higher write loads, so I don't think the problem is there. I am also using this telegraf internal stats dashboard to dig into telegraf:

On there I can see:

  • 0 dropped metrics
  • 719 “gather errors”
  • 10 gather errors per second
  • gather time is hovering around 1.2 seconds.
  • buffer limit is 100k
  • buffer size is hovering just under 10k.
  • metric write time is hovering under 30ms.

All of this leads me to believe that it's Kepware response times that are the issue, specifically the gather errors and the gather time of 1.2 seconds.

Is there anything else I can do to dig into this? I'd love something more concrete if I need to bring it up with Kepware support.
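One thing I'm considering (an assumption on my part from reading the inputs.internal docs, so verify it) is giving each opcua instance an alias; I believe the alias shows up as a tag on the internal_gather measurement, which would let me compare gather time and errors per instance instead of only in aggregate:

[[inputs.opcua]]
  # Hypothetical addition: a per-plugin alias so internal_gather stats can be
  # split per instance (the "input" tag alone is just "opcua" for all 20 of them).
  alias = "kepware_0"
  name = "kepware_0"
  interval = "1s"
  # ... rest of the block unchanged ...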

Here is my config:

[agent]
    collection_jitter = "0s"
    debug = true
    flush_interval = "10s"
    flush_jitter = "0s"
    hostname = "$HOSTNAME"
    interval = "10s"
    logfile = ""
    metric_batch_size = 10000
    metric_buffer_limit = 100000
    omit_hostname = false
    precision = ""
    quiet = false
    round_interval = true
[[processors.enum]]
    [[processors.enum.mapping]]
    dest = "status_code"
    field = "status"
    [processors.enum.mapping.value_mappings]
        critical = 3
        healthy = 1
        problem = 2
[[outputs.influxdb]]
    database = "telegraf"
    urls = [
    "http://insights-vmagent-cluster.insights-preproduction.svc.cluster.local:8429"
    ]

[[inputs.statsd]]
    allowed_pending_messages = 10000
    metric_separator = "_"
    percentile_limit = 1000
    percentiles = [
    50.0,
    95.0,
    99.0
    ]
    service_address = ":8125"

[[inputs.internal]]
    collect_memstats = true
    collect_gostats = true

And here is my plugin configuration (truncated, but repeated around 20x):

[[inputs.opcua]]
  name = "kepware_0"
  endpoint = "opc.tcp://ts2-kep-01.test.<>.com:49320"
  connect_timeout = "10s"
  request_timeout = "5s"
  session_timeout = "20m"
  interval = "1s"
  security_policy = "Basic256Sha256"
  security_mode = "SignAndEncrypt"
  auth_method = "UserName"
  username = "redacted"
  password = "redacted"
  timestamp = "server"
  client_trace = true

  nodes = [
    {name="PB_BATT_CMD_KW[1]", namespace="2", identifier_type="s", identifier="PB_BCI[1].BCI[1].PB_BATT_CMD_KW[1]"},
    {name="PB_BATT_SP_KW[1]", namespace="2", identifier_type="s", identifier="PB_BCI[1].BCI[1].PB_BATT_SP_KW[1]"},
...

Have you tried setting use_unregistered_reads to true? I believe that’s been mentioned on a couple of other threads re: Kepware.
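I believe it's just a plugin-level flag in each [[inputs.opcua]] block, something like this (a sketch only; I'm not certain which value is the default in your Telegraf version):

[[inputs.opcua]]
  # ... existing endpoint/auth/node settings unchanged ...
  use_unregistered_reads = true   # true = unregistered reads, false = registered reads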

I tried both ways and didn’t see an effect.

From what I can read online it seems like registered reads are 10x faster than unregistered reads, so I would expect setting it to false to be the best option.

I am beginning to suspect that the culprit is Kepware in this case. I see the Kepware server's CPU pegged at 100%, so it's possible it just can't keep up. I have a support request out to them to see if they can assist.

Is the 1s interval a hard requirement? Otherwise you could try the opcua_listener plugin (telegraf/plugins/inputs/opcua_listener at master · influxdata/telegraf · GitHub) instead. Telegraf is able to repeat non-frequent data changes if needed, see: Replication of Non frequent data change - #6 by JeroenVH

Is the 1s interval a hard requirement?

Yes, as I said above, it is contractually obligated for warranty reasons.

I don't really like replicating data automatically with a separate process; I feel there are too many ways that could end up putting incorrect entries into the DB.

I’ve been reading about “Poll” update mode subscriptions that still send all tags on a set interval, not just those that have changed:

In Poll Mode, an asynchronous read is performed on all subscription tags at the rate of the publishing interval.

I am wondering if the telegraf opcua listener plugin supports setting up these “Poll” mode subscriptions. Has anyone done this with telegraf before?

Apologies, I somehow missed the 1s requirement in the topic start. I have no experience with the polling subscription mode. Maybe you can delay each request by 10 ms relative to the previous one with collection_offset, to avoid flooding the server with all requests at the exact same instant? See: telegraf/docs/CONFIGURATION.md at master · influxdata/telegraf · GitHub
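Something like this, with only the relevant lines shown (a sketch assuming a Telegraf version where the per-plugin collection_offset option from CONFIGURATION.md is available; I have not tried it against Kepware myself):

[[inputs.opcua]]
  name = "kepware_0"
  interval = "1s"
  collection_offset = "0ms"    # first instance fires right on the second
  # ... first batch of 500 nodes ...

[[inputs.opcua]]
  name = "kepware_1"
  interval = "1s"
  collection_offset = "10ms"   # each further instance delayed by another 10 ms
  # ... next batch of 500 nodes ...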

Since you keep mentioning the contractual obligation, wouldn't it make more sense to get a paid, reliable, and supported SCADA system?

That being said, OPCUA offers two modes per client connection: you can either poll values on an interval or subscribe to changes. The first one in telegraf is just called opcua and the second one opcua_listener.

If you need to be able to capture every significant change without loss of data, I'd strongly recommend using the OPCUA listener with sensible deadbands (e.g., a 0.1 change might be irrelevant on a 0-1000 scale). This approach offloads the sampling load to the OPCUA server, allowing telegraf to focus on processing changes instead of constantly sampling data.
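As a very rough sketch of what that could look like (the monitoring parameter key names below are from my memory of the opcua_listener README and may differ in your Telegraf version, so verify them before use):

[[inputs.opcua_listener]]
  endpoint = "opc.tcp://ts2-kep-01.test.<>.com:49320"
  # ... same security/auth settings as the opcua blocks ...
  subscription_interval = "100ms"       # how often the server publishes queued changes

  [[inputs.opcua_listener.group]]
    namespace = "2"
    identifier_type = "s"

    [[inputs.opcua_listener.group.nodes]]
      name = "PB_BATT_CMD_KW[1]"
      identifier = "PB_BCI[1].BCI[1].PB_BATT_CMD_KW[1]"
      [inputs.opcua_listener.group.nodes.monitoring_params]
        sampling_interval = "0ms"       # let the server sample as fast as it can
        queue_size = 10
        [inputs.opcua_listener.group.nodes.monitoring_params.data_change_filter]
          trigger = "StatusValue"
          deadband_type = "Absolute"
          deadband_value = 0.1          # ignore changes smaller than this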

I have used both and found the listener the most useful, as it also logs spikes in the measured values. I sometimes get more than 10 updates of a value per second when something happens in our process, which is great for diagnosis when a threshold is triggered for only a very short amount of time. After the event the data rate slows down again, just logging when changes occur. I set my publishing interval to the fastest the server can provide because it does not have the ability to timestamp the data in the queue.

You might run into this problem when using encryption:
[opcua_listener] 1000 Nodes will get EOF[failed to start monitoring items: EOF]

Replicating the datapoints when no changes are reported is just to make the graphs look pretty and to have a datapoint in every aggregated window. This is also what the major SCADA vendors do for their historians, by the way; they don't want to use more client connections, as those are often limited.

My solution with the watchdog is pretty solid: just create a single metric on the server side that, in your case, changes every second, and link all metrics that use the same connection. This way you get every change as normal with the plugin, as well as additional datapoints when there is no change, and it stops when communication is interrupted, giving you no data and a reason for an alert.

I am in the process of creating a pull request to add my replication script to the starlark examples.