Help tracking down a performance issue in telegraf + opcua input plugin (1s interval not keeping up)

I am using telegraf with the opcua input plugin. I am reading around 10,000 tags from a Kepware server at a 1 second interval (contractually obligated for warranty reasons).

I started by reading around 100 tags and all went well. When I tried scaling up to the full 10,000 tags, Kepware would reject my connection. To work around this I configured multiple instances of the plugin with 500 tags each. This worked and I was able to read in all 10,000 tags. However, my 1-second intervals are not keeping up. I am storing the data in VictoriaMetrics, and when I query for raw samples, most are 1 second apart but some are 2 seconds apart. The number that are 2 seconds apart is growing over time. When I first started I was averaging around 1.1 seconds per scrape, but around 24 hours later the average is 1.7 seconds per scrape. Note that the interval is always either exactly 1000 ms or 2000 ms, nothing else, which I find a little suspicious.

I am looking for suggestions on tracking this down. How do I know whether it is Telegraf struggling to keep up, VictoriaMetrics, or Kepware struggling to respond? I’ve used VictoriaMetrics with much, much higher write loads, so I don’t think the problem is there. I am also using the Telegraf internal stats dashboard to dig into Telegraf:

On there I can see:

  • 0 dropped metrics
  • 719 “gather errors”
  • 10 gather errors per second
  • gather time hovering around 1.2 seconds
  • buffer limit of 100k
  • buffer size hovering just under 10k
  • metric write time hovering under 30 ms

All of this leads me to believe that it’s the Kepware response times that are the issue, specifically the gather errors and the gather time of 1.2 seconds.

Is there anything else I can do to dig into this? I’d love to have something more concrete if I need to bring it up with Kepware support.
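
One idea that might help narrow it down: giving each opcua instance an alias. The alias shows up in Telegraf’s log lines for that instance, and as far as I can tell from the docs it is also attached to the per-plugin internal stats, which would show whether one Kepware connection is consistently slow or all of them are. A minimal sketch, with the alias value just an example:

[[inputs.opcua]]
  ## Label this instance so its log messages (and, I believe, its
  ## internal gather stats) can be told apart from the other ~20 instances.
  alias = "kepware_0"
  name = "kepware_0"
  endpoint = "opc.tcp://ts2-kep-01.test.<>.com:49320"
  interval = "1s"
  ## ... rest of the instance config unchanged ...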

Here is my config:

[agent]
    collection_jitter = "0s"
    debug = true
    flush_interval = "10s"
    flush_jitter = "0s"
    hostname = "$HOSTNAME"
    interval = "10s"
    logfile = ""
    metric_batch_size = 10000
    metric_buffer_limit = 100000
    omit_hostname = false
    precision = ""
    quiet = false
    round_interval = true
[[processors.enum]]
    [[processors.enum.mapping]]
    dest = "status_code"
    field = "status"
    [processors.enum.mapping.value_mappings]
        critical = 3
        healthy = 1
        problem = 2
[[outputs.influxdb]]
    database = "telegraf"
    urls = [
    "http://insights-vmagent-cluster.insights-preproduction.svc.cluster.local:8429"
    ]

[[inputs.statsd]]
    allowed_pending_messages = 10000
    metric_separator = "_"
    percentile_limit = 1000
    percentiles = [
    50.0,
    95.0,
    99.0
    ]
    service_address = ":8125"

[[inputs.internal]]
    collect_memstats = true
    collect_gostats = true

And here is my plugin configuration (truncated, but repeated around 20x):

[[inputs.opcua]]
  name = "kepware_0"
  endpoint = "opc.tcp://ts2-kep-01.test.<>.com:49320"
  connect_timeout = "10s"
  request_timeout = "5s"
  session_timeout = "20m"
  interval = "1s"
  security_policy = "Basic256Sha256"
  security_mode = "SignAndEncrypt"
  auth_method = "UserName"
  username = "redacted"
  password = "redacted"
  timestamp = "server"
  client_trace = true

  nodes = [
    {name="PB_BATT_CMD_KW[1]", namespace="2", identifier_type="s", identifier="PB_BCI[1].BCI[1].PB_BATT_CMD_KW[1]"},
    {name="PB_BATT_SP_KW[1]", namespace="2", identifier_type="s", identifier="PB_BCI[1].BCI[1].PB_BATT_SP_KW[1]"},
...

Have you tried setting use_unregistered_reads to true? I believe that’s been mentioned on a couple of other threads re: Kepware.
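
Something like this in each of your [[inputs.opcua]] blocks, assuming your Telegraf version has the option (if I remember the README right, it makes the plugin read the nodes without registering them on the server first):

[[inputs.opcua]]
  ## ... existing connection and node settings unchanged ...
  ## Use plain (unregistered) reads instead of registered reads.
  use_unregistered_reads = true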

I tried both ways and didn’t see an effect.

From what I can read online it seems like registered reads are 10x faster than unregistered reads, so I would expect leaving that set to false to be the better option anyway.

I am beginning to suspect that the culprit is Kepware in this case. I see the CPU pegged at 100%. It’s possible it just can’t keep up. I have a support request out to them to see if they can assist.

Is the 1s interval a hard requirement? Otherwise you could try the opcua_listener plugin (telegraf/plugins/inputs/opcua_listener at master · influxdata/telegraf · GitHub) instead. Telegraf is able to repeat non-frequent data changes if needed; see: Replication of Non frequent data change - #6 by JeroenVH
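
A rough sketch of what one instance could look like, adapted from your opcua config (untested; subscription_interval is the publishing interval the client requests from the server):

[[inputs.opcua_listener]]
  name = "kepware_0"
  endpoint = "opc.tcp://ts2-kep-01.test.<>.com:49320"
  connect_timeout = "10s"
  request_timeout = "5s"
  security_policy = "Basic256Sha256"
  security_mode = "SignAndEncrypt"
  auth_method = "UserName"
  username = "redacted"
  password = "redacted"
  ## Requested publishing interval for the subscription
  subscription_interval = "1s"

  nodes = [
    {name="PB_BATT_CMD_KW[1]", namespace="2", identifier_type="s", identifier="PB_BCI[1].BCI[1].PB_BATT_CMD_KW[1]"},
  ]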

Is the 1s interval a hard requirement?

Yes, like I said above, it is contractually obligated for warranty reasons.

I don’t really like replicating data automatically with a separate process, I feel there are too many ways that could end up putting incorrect entries into the DB.

I’ve been reading about “Poll” update mode subscriptions that still send all tags on a set interval, not just those that have changed:

In Poll Mode, an asynchronous read is performed on all subscription tags at the rate of the publishing interval.

I am wondering if the telegraf opcua listener plugin supports setting up these “Poll” mode subscriptions. Has anyone done this with telegraf before?

Apologies, I somehow missed the 1 s statement in the topic start. I have no experience with the polling subscription mode. Maybe you can delay each request by 10 ms compared to the previous one with collection_offset, to avoid flooding the server with all requests at the exact same second? See: telegraf/docs/CONFIGURATION.md at master · influxdata/telegraf · GitHub
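
For example, staggering the ~20 instances by 10 ms each (the offsets here are just illustrative):

[[inputs.opcua]]
  name = "kepware_0"
  interval = "1s"
  collection_offset = "0ms"
  ## ... rest of instance 0 ...

[[inputs.opcua]]
  name = "kepware_1"
  interval = "1s"
  collection_offset = "10ms"
  ## ... rest of instance 1 ...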

Since you keep mentioning the contractual obligation, wouldn’t it make more sense to get a paid, reliable, and supported SCADA system?

That being said, OPC UA offers one of two modes per client connection: you can either poll values with reads or subscribe to changes. In Telegraf, the first one is the plugin just called opcua and the second one is opcua_listener.

If you need to be able to capture every significant change without loss of data, I’d strongly recommend using the OPC UA listener with sensible deadbands (e.g., a 0.1 change might be irrelevant on a 0-1000 scale). This approach offloads the sampling load to the OPC UA server, allowing Telegraf to focus on processing changes instead of constantly sampling data.
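
If I remember the plugin README correctly, recent Telegraf versions let you attach monitoring parameters, including a deadband filter, to each node in opcua_listener; the exact field names below are from memory, so check them against the documentation for your version:

[[inputs.opcua_listener]]
  ## ... connection settings as above ...
  subscription_interval = "1s"

  ## With an "Absolute" deadband, deadband_value is in engineering units;
  ## here only changes larger than 1.0 kW would be reported for this node.
  nodes = [
    {name="PB_BATT_CMD_KW[1]", namespace="2", identifier_type="s", identifier="PB_BCI[1].BCI[1].PB_BATT_CMD_KW[1]", monitoring_params={data_change_filter={trigger="StatusValue", deadband_type="Absolute", deadband_value=1.0}}},
  ]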

I have used both and found the listener the most useful, as it also logs spikes in the measured values. I sometimes get more than 10 updates of a value per second when something happens in our process, which is great for diagnosis when a threshold is triggered for only a very short amount of time. After the event the data rate slows down again, only logging when changes occur. I set my publishing interval to the fastest the server can provide because it does not have the ability to timestamp the data in the queue.

You might run into this problem when using encryption
[opcua_listener] 1000 Nodes will get EOF[failed to start monitoring items: EOF]

Replicating the datapoints when no changes are reported is just to make the graphs look pretty and have a datapoint in every aggregated window. This is also what the major SCADA vendors do for their historians, by the way; they don’t want to use more client connections, as those are often limited.

My solution with the watchdog is pretty solid: just create a single metric on the server side that, in your case, changes every second, and link it to all metrics that use the same connection. This way you get every change as normal with the plugin, plus additional datapoints when there is no change, and it stops when communication is interrupted, giving you no data and a reason for an alert.

I am in the process of creating a pull request to add my replication script to the starlark examples.

Thank you for the detailed reply. I am working with my management to see if the once-per-second contractual obligation can be switched to on-change only. One issue I am worried about is that if a datapoint doesn’t change in 5 minutes, it will be marked stale in VictoriaMetrics. I think I can solve that by configuring Telegraf to also scan and read all tags every minute while subscribing to changes on all tags at the same time. Any thoughts on this hybrid approach compared to your replication approach?
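
Roughly what I have in mind, in case it helps to clarify (connection settings and the full node list omitted; both blocks would point at the same nodes):

## Subscription connection: reports every change as it happens
[[inputs.opcua_listener]]
  name = "kepware_0"
  endpoint = "opc.tcp://ts2-kep-01.test.<>.com:49320"
  subscription_interval = "1s"
  ## nodes = [ ... same node list as today ... ]

## Polling connection: re-reads everything once a minute so nothing goes
## stale in VictoriaMetrics even if it never changes
[[inputs.opcua]]
  name = "kepware_0"
  endpoint = "opc.tcp://ts2-kep-01.test.<>.com:49320"
  interval = "1m"
  ## nodes = [ ... same node list as today ... ]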

You might run into this problem when using encryption
[opcua_listener] 1000 Nodes will get EOF[failed to start monitoring items: EOF]

I definitely have run into this problem already (my limit was closer to 2000 nodes). I figured it was a limitation in Kepware, so I am surprised to see it’s on the client side. My workaround was to configure multiple instances of the opcua plugin, each below the limit. This worked for my needs.

The link you posted didn’t have a definitive solution. What have you done to work around it?

One issue I am worried about is if a datapoint doesn’t change in 5 minutes then it will be marked stale in Victoria Metrics.

This is exactly what my replication solution is for. Just set the replication period to 1m and you are good to go.

I currently have the hybrid approach running in my production environment at work. So yes, having a polling connection at a 1m interval and a subscription connection at the same time works fine. However, it is not maintainable in my opinion: having to teach other people that they have to add the same metric to multiple config files will result in unintended deviations from the original design. So I think it is better to have a single aggregator function in the main config that takes care of the shortcomings of the subscription model.

I am in the process of creating a PR for my example (replication.star), but today I had a new idea that reduces the complexity a lot. Currently I monitor the connection with a dedicated watchdog metric, but any metric will suffice to indicate the connection state; you just need one tag that defines some kind of group, so you can keep track of the last update within that group. For your case, if you can’t add a watchdog or periodic metric in the data source, you could define just one with the regular OPC UA polling connection and give it the same group identifier so it is passed into the replication function.
So expect an update on that next week.

Thank you, I will keep an eye out for your replication PR.

Do you know anything more about the 1000 Nodes EOF issue? How did you work around it in your case?

I haven’t turned on encryption yet. I hope it gets fixed in the meantime, although it seems nobody knows the exact problem. I have about 2000 nodes per subscription.

Here is the issue on the gopcua repo tracking the problem:

I am not sure it’s an issue with gopcua and not the opcua server though.