Help tracking down a performance issue in telegraf + opcua input plugin (1s interval not keeping up)

I am using telegraf with the opcua input plugin. I am reading around 10,000 tags from a Kepware server at a 1 second interval (contractually obligated for warranty reasons).

I started by reading around 100 tags and all went well. When I tried scaling up to the full 10,000 tags, Kepware would reject my connection. To work around this I configured multiple instances of the plugin with 500 tags each. This worked and I was able to read in all 10,000 tags. However, my 1 second intervals are not keeping up. I am storing the data in VictoriaMetrics, and when I query for raw samples most are 1 second apart, but some are 2 seconds apart. The number that are 2 seconds apart is growing over time: when I first started I was averaging around 1.1 seconds per scrape, but around 24 hours later the average is 1.7 seconds per scrape. Note that the gap between samples is always either exactly 1000ms or 2000ms, nothing else, which I find a little suspicious.

I am looking for suggestions on how to track this down. How do I know whether it is telegraf struggling to keep up, VictoriaMetrics, or Kepware struggling to respond? I've used VictoriaMetrics with much, much higher write loads, so I don't think the problem is there. I am also using this telegraf internal stats dashboard to dig into telegraf:

On there I can see:

  • 0 dropped metrics
  • 719 “gather errors”
  • 10 gather errors per second
  • gather time is hovering around 1.2 seconds.
  • buffer limit is 100k
  • buffer size is hovering just under 10k.
  • metric write time is hovering under 30ms.

All of this leads me to believe that it's Kepware response times that are the issue, specifically the gather errors and the gather time of 1.2 seconds.

Is there anything else I can do to dig into this? I'd love something more concrete if I need to bring it up with Kepware support.
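One thing I'm considering (an assumption on my part from reading the inputs.internal docs, so verify it) is giving each opcua instance an alias; I believe the alias shows up as a tag on the internal_gather measurement, which would let me compare gather time and errors per instance instead of only in aggregate:

[[inputs.opcua]]
  # Hypothetical addition: a per-plugin alias so internal_gather stats can be
  # split per instance (the "input" tag alone is just "opcua" for all 20 of them).
  alias = "kepware_0"
  name = "kepware_0"
  interval = "1s"
  # ... rest of the block unchanged ...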

Here is my config:

[agent]
    collection_jitter = "0s"
    debug = true
    flush_interval = "10s"
    flush_jitter = "0s"
    hostname = "$HOSTNAME"
    interval = "10s"
    logfile = ""
    metric_batch_size = 10000
    metric_buffer_limit = 100000
    omit_hostname = false
    precision = ""
    quiet = false
    round_interval = true
[[processors.enum]]
    [[processors.enum.mapping]]
    dest = "status_code"
    field = "status"
    [processors.enum.mapping.value_mappings]
        critical = 3
        healthy = 1
        problem = 2
[[outputs.influxdb]]
    database = "telegraf"
    urls = [
    "http://insights-vmagent-cluster.insights-preproduction.svc.cluster.local:8429"
    ]

[[inputs.statsd]]
    allowed_pending_messages = 10000
    metric_separator = "_"
    percentile_limit = 1000
    percentiles = [
    50.0,
    95.0,
    99.0
    ]
    service_address = ":8125"

[[inputs.internal]]
    collect_memstats = true
    collect_gostats = true

And here is my plugin configuration (truncated, but repeated around 20x):

[[inputs.opcua]]
  name = "kepware_0"
  endpoint = "opc.tcp://ts2-kep-01.test.<>.com:49320"
  connect_timeout = "10s"
  request_timeout = "5s"
  session_timeout = "20m"
  interval = "1s"
  security_policy = "Basic256Sha256"
  security_mode = "SignAndEncrypt"
  auth_method = "UserName"
  username = "redacted"
  password = "redacted"
  timestamp = "server"
  client_trace = true

  nodes = [
    {name="PB_BATT_CMD_KW[1]", namespace="2", identifier_type="s", identifier="PB_BCI[1].BCI[1].PB_BATT_CMD_KW[1]"},
    {name="PB_BATT_SP_KW[1]", namespace="2", identifier_type="s", identifier="PB_BCI[1].BCI[1].PB_BATT_SP_KW[1]"},
...

Have you tried setting use_unregistered_reads to true? I believe that’s been mentioned on a couple of other threads re: Kepware.
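I believe it's just a plugin-level flag in each [[inputs.opcua]] block, something like this (a sketch only; I'm not certain which value is the default in your Telegraf version):

[[inputs.opcua]]
  # ... existing endpoint/auth/node settings unchanged ...
  use_unregistered_reads = true   # true = unregistered reads, false = registered reads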

I tried both ways and didn’t see an effect.

From what I can read online it seems like registered reads are 10x faster than unregistered reads, so I would expect setting it to false to be the best option.

I am beginning to suspect that the culprit is Kepware in this case. I see the Kepware server's CPU pegged at 100%, so it's possible it just can't keep up. I have a support request out to them to see if they can assist.

Is the 1s interval a hard requirement? Otherwise you could try the opcua_listener plugin (telegraf/plugins/inputs/opcua_listener at master · influxdata/telegraf · GitHub) instead. Telegraf is able to repeat non-frequent data changes if needed, see: Replication of Non frequent data change - #6 by JeroenVH

Is the 1s interval a hard requirement?

Yes, as I said above, it is contractually obligated for warranty reasons.

I don't really like replicating data automatically with a separate process; I feel there are too many ways that could end up putting incorrect entries into the DB.

I’ve been reading about “Poll” update mode subscriptions that still send all tags on a set interval, not just those that have changed:

In Poll Mode, an asynchronous read is performed on all subscription tags at the rate of the publishing interval.

I am wondering if the telegraf opcua listener plugin supports setting up these “Poll” mode subscriptions. Has anyone done this with telegraf before?

Apologies, I somehow missed the 1s requirement in the topic start. I have no experience with the polling subscription mode. Maybe you can delay each request by 10 ms relative to the previous one with collection_offset, to avoid flooding the server with all requests at the exact same instant? See: telegraf/docs/CONFIGURATION.md at master · influxdata/telegraf · GitHub
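Something like this, with only the relevant lines shown (a sketch assuming a Telegraf version where the per-plugin collection_offset option from CONFIGURATION.md is available; I have not tried it against Kepware myself):

[[inputs.opcua]]
  name = "kepware_0"
  interval = "1s"
  collection_offset = "0ms"    # first instance fires right on the second
  # ... first batch of 500 nodes ...

[[inputs.opcua]]
  name = "kepware_1"
  interval = "1s"
  collection_offset = "10ms"   # each further instance delayed by another 10 ms
  # ... next batch of 500 nodes ...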

Since you keep mentioning the contractual obligation, wouldn't it make more sense to get a paid, reliable, and supported SCADA system?

That being said, OPCUA offers two modes per client connection: you can either poll values on an interval or subscribe to changes. The first one in telegraf is just called opcua and the second one opcua_listener.

If you need to be able to capture every significant change without loss of data, I'd strongly recommend using the OPCUA listener with sensible deadbands (e.g., a 0.1 change might be irrelevant on a 0-1000 scale). This approach offloads the sampling load to the OPCUA server, allowing telegraf to focus on processing changes instead of constantly sampling data.
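As a very rough sketch of what that could look like (the monitoring parameter key names below are from my memory of the opcua_listener README and may differ in your Telegraf version, so verify them before use):

[[inputs.opcua_listener]]
  endpoint = "opc.tcp://ts2-kep-01.test.<>.com:49320"
  # ... same security/auth settings as the opcua blocks ...
  subscription_interval = "100ms"       # how often the server publishes queued changes

  [[inputs.opcua_listener.group]]
    namespace = "2"
    identifier_type = "s"

    [[inputs.opcua_listener.group.nodes]]
      name = "PB_BATT_CMD_KW[1]"
      identifier = "PB_BCI[1].BCI[1].PB_BATT_CMD_KW[1]"
      [inputs.opcua_listener.group.nodes.monitoring_params]
        sampling_interval = "0ms"       # let the server sample as fast as it can
        queue_size = 10
        [inputs.opcua_listener.group.nodes.monitoring_params.data_change_filter]
          trigger = "StatusValue"
          deadband_type = "Absolute"
          deadband_value = 0.1          # ignore changes smaller than this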

I have used both and found the listener the most useful, as it also logs spikes in the measured values. I sometimes get more than 10 updates of a value per second when something happens in our process, which is great for diagnosis when a threshold is triggered for only a very short amount of time. After the event the data rate slows down again, just logging when changes occur. I set my publishing interval to the fastest the server can provide because it does not have the ability to timestamp the data in the queue.

You might run into this problem when using encryption:
[opcua_listener] 1000 Nodes will get EOF[failed to start monitoring items: EOF]

Replicating the datapoints when no changes are reported is just to make the graphs look pretty and to have a datapoint in every aggregated window. This is also what the major SCADA vendors do for their historians, by the way; they don't want to use more client connections, as those are often limited.

My solution with the watchdog is pretty solid: just create a single metric on the server side that, in your case, changes every second, and link all metrics that use the same connection. This way you get every change as normal with the plugin, as well as additional datapoints when there is no change, and it stops when communication is interrupted, giving you no data and a reason for an alert.

I am in the process of creating a pull request to add my replication script to the starlark examples.