I am using Telegraf with the OPC UA input plugin, reading around 10,000 tags from a Kepware server at a 1-second interval (contractually obligated for warranty reasons).
I started by reading around 100 tags and all went well. When I tried scaling up to the full 10,000 tags, Kepware would reject my connection. To work around this I configured multiple instances of the plugin with 500 tags each, as sketched below. This worked and I was able to read in all 10,000 tags. However, my 1-second intervals are not keeping up. I am storing the data in VictoriaMetrics, and when I query for raw samples, most are 1 second apart but some are 2 seconds apart, and the share of 2-second gaps is growing over time. When I first started I was averaging around 1.1 seconds per scrape; about 24 hours later the average is 1.7 seconds per scrape. Note that the gap is always exactly 1000ms or 2000ms, never anything in between, which I find a little suspicious (perhaps because round_interval = true aligns collections to the clock, so an overrunning gather just skips to the next whole-second tick?).
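To illustrate the split (the endpoint, names, and node identifiers here are placeholders; my real config is at the bottom of this post), each chunk of ~500 nodes gets its own plugin instance pointed at the same server:
[[inputs.opcua]]
name = "kepware_0"
endpoint = "opc.tcp://kepserver.example.com:49320"
interval = "1s"
nodes = [
{name="tag_0000", namespace="2", identifier_type="s", identifier="Channel.Device.tag_0000"},
# ... ~500 nodes per instance ...
]
[[inputs.opcua]]
name = "kepware_1"
endpoint = "opc.tcp://kepserver.example.com:49320"
interval = "1s"
nodes = [
{name="tag_0500", namespace="2", identifier_type="s", identifier="Channel.Device.tag_0500"},
# ... and so on, ~20 instances in total ...
]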
I am looking for suggestions on tracking this down. How do I know whether this is Telegraf struggling to keep up, VictoriaMetrics, or Kepware struggling to respond? I've used VictoriaMetrics with much, much higher write loads, so I don't think the problem is there. I am also using a Telegraf internal stats dashboard (fed by the inputs.internal plugin in my config below) to dig into Telegraf. On it I can see:
- 0 dropped metrics
- 719 "gather errors" total
- ~10 gather errors per second
- gather time hovering around 1.2 seconds
- buffer limit of 100k
- buffer size hovering just under 10k
- metric write time hovering under 30ms
All of this leads me to believe that it's Kepware response times that are the issue, specifically the gather errors and the gather time of ~1.2 seconds, which is longer than my 1-second collection interval and would account for the skipped ticks.
Is there anything else I can do to dig into this? I'd love to have something more concrete if I need to bring it up with Kepware support.
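One thing I may try, so I can tell the ~20 instances apart in the logs and internal stats, is Telegraf's per-plugin alias option (a standard option since 1.12; the values below are placeholders). The alias shows up in log lines like [inputs.opcua::kepware_7], and I believe it is added as a tag on the internal metrics as well:
[[inputs.opcua]]
# alias is a generic per-plugin Telegraf option; it labels this instance's
# log messages so a slow or error-prone instance can be singled out
alias = "kepware_7"
name = "kepware_7"
endpoint = "opc.tcp://kepserver.example.com:49320"
interval = "1s"
nodes = [
{name="tag_3500", namespace="2", identifier_type="s", identifier="Channel.Device.tag_3500"},
]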
Here is my config:
[agent]
collection_jitter = "0s"
debug = true
flush_interval = "10s"
flush_jitter = "0s"
hostname = "$HOSTNAME"
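## note: each [[inputs.opcua]] block below overrides this agent-level interval with its own interval = "1s"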
interval = "10s"
logfile = ""
metric_batch_size = 10000
metric_buffer_limit = 100000
omit_hostname = false
precision = ""
quiet = false
round_interval = true
[[processors.enum]]
[[processors.enum.mapping]]
dest = "status_code"
field = "status"
[processors.enum.mapping.value_mappings]
critical = 3
healthy = 1
problem = 2
[[outputs.influxdb]]
database = "telegraf"
urls = [
"http://insights-vmagent-cluster.insights-preproduction.svc.cluster.local:8429"
]
[[inputs.statsd]]
allowed_pending_messages = 10000
metric_separator = "_"
percentile_limit = 1000
percentiles = [
50.0,
95.0,
99.0
]
service_address = ":8125"
[[inputs.internal]]
collect_memstats = true
collect_gostats = true
And here is my opcua plugin configuration (truncated; this block is repeated around 20x with different node lists):
[[inputs.opcua]]
name = "kepware_0"
endpoint = "opc.tcp://ts2-kep-01.test.<>.com:49320"
connect_timeout = "10s"
request_timeout = "5s"
session_timeout = "20m"
interval = "1s"
security_policy = "Basic256Sha256"
security_mode = "SignAndEncrypt"
auth_method = "UserName"
username = "redacted"
password = "redacted"
timestamp = "server"
client_trace = true
nodes = [
{name="PB_BATT_CMD_KW[1]", namespace="2", identifier_type="s", identifier="PB_BCI[1].BCI[1].PB_BATT_CMD_KW[1]"},
{name="PB_BATT_SP_KW[1]", namespace="2", identifier_type="s", identifier="PB_BCI[1].BCI[1].PB_BATT_SP_KW[1]"},
...