I first asked this on Stack Overflow: https://stackoverflow.com/questions/65143072/telegraf-1-16-inputs-modbus-plugin-timeout-problem
I am using Telegraf 1.16 to read several Janitza devices through the inputs.modbus plugin.
Telegraf is started manually, not as a service, to make testing and debugging easier.
Unit1 is a UMG604 that acts as a gateway: it receives Modbus/TCP messages and, if they don't match its own Modbus address, relays them to the downstream units, which are linked through an RS485 line. This means the communication is half-duplex, and the line is quite busy because we are trying to read 350+ registers at every tick (50 registers per device).
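A rough wire-time budget helps judge how busy the RS485 line really is. The sketch below is only an estimate under stated assumptions: RTU framing on the serial side, 11 bits per character, a baud rate of 38400 (the real rate is not given above), and one separate "Read Holding Registers" request per non-consecutive FLOAT32 field. Device turnaround times are ignored, so real figures will be higher. The function name `rtuReadHoldingTime` is hypothetical.

```go
package main

import "fmt"

// Assumed serial framing: start bit + 8 data bits + parity + stop bit.
const bitsPerChar = 11.0

// rtuReadHoldingTime estimates the wire time in seconds of one Modbus RTU
// "Read Holding Registers" exchange for n registers at the given baud rate.
// Device turnaround (processing) time is NOT included.
func rtuReadHoldingTime(n int, baud float64) float64 {
	request := 8.0                   // slave, function, address(2), quantity(2), CRC(2)
	response := 5.0 + 2.0*float64(n) // slave, function, byte count, data, CRC(2)
	silence := 2 * 3.5               // 3.5-character silent gap after each frame
	return (request + response + silence) * bitsPerChar / baud
}

func main() {
	const baud = 38400.0 // assumption: the actual RS485 baud rate is not stated
	perRequest := rtuReadHoldingTime(2, baud) // one FLOAT32 field = 2 registers
	// Assumption: 7 downstream units x 25 separate requests each.
	total := 7 * 25 * perRequest
	fmt.Printf("per request: %.2f ms, per tick: %.2f s\n", perRequest*1000, total)
}
```

Under these assumptions the 7 units behind the gateway keep the line busy for about 1.2 s of every 5 s tick before any device latency is added, so there is margin, but not a lot once response delays are counted.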
These units are read without any problem by two loggers I wrote, one in C and one in Python/pymodbus, so I can rule out a hardware issue. Both loggers read the units serially, one after the other, so Go concurrency could be the issue.
The settings are straightforward; here is a skeleton of the Telegraf configuration file:
[agent]
interval="5s" # sample time
round_interval=true # sample at rounded intervals :00, :05, :10, etc
metric_batch_size=1000
metric_buffer_limit=10000
[[inputs.modbus]]
name = "UMG604_Gateway_unit1"
slave_id = 1
timeout = "2s"
busy_retries = 10
busy_retries_wait = "200ms"
controller = "tcp://192.168.2.100:502"
holding_registers = [
{ measurement="Unit1", name="Strom-1", byte_order="ABCD", data_type="FLOAT32-IEEE", scale=1.0, address=[1325,1326]},
# <another 24 non consecutive registers here>
]
[[inputs.modbus]]
name = "UMG103_unit2"
slave_id = 2
timeout = "2s"
busy_retries = 10
busy_retries_wait = "200ms"
controller = "tcp://192.168.2.100:502"
holding_registers = [
{ measurement="Unit2", name="", byte_order="ABCD", data_type="FLOAT32-IEEE", scale=1.0, address=[1325,1326]},
# <another 24 non consecutive registers here>
]
[[inputs.modbus]]
name = "UMG103_unit3"
slave_id = 3
timeout = "2s"
busy_retries = 10
busy_retries_wait = "200ms"
controller = "tcp://192.168.2.100:502"
holding_registers = [
{ measurement="Unit3", name="", byte_order="ABCD", data_type="FLOAT32-IEEE", scale=1.0, address=[1325,1326]},
# <another 24 non consecutive registers here>
]
[[inputs.modbus]]
name = "UMG103_unit4"
slave_id = 4
timeout = "2s"
busy_retries = 10
busy_retries_wait = "200ms"
controller = "tcp://192.168.2.100:502"
holding_registers = [
{ measurement="Unit4", name="", byte_order="ABCD", data_type="FLOAT32-IEEE", scale=1.0, address=[1325,1326]},
# <another 24 non consecutive registers here>
]
[[inputs.modbus]]
name = "UMG103_unit5"
slave_id = 5
timeout = "2s"
busy_retries = 10
busy_retries_wait = "200ms"
controller = "tcp://192.168.2.100:502"
holding_registers = [
{ measurement="Unit5", name="", byte_order="ABCD", data_type="FLOAT32-IEEE", scale=1.0, address=[1325,1326]},
# <another 24 non consecutive registers here>
]
[[inputs.modbus]]
name = "UMG103_unit6"
slave_id = 6
timeout = "2s"
busy_retries = 10
busy_retries_wait = "200ms"
controller = "tcp://192.168.2.100:502"
holding_registers = [
{ measurement="Unit6", name="", byte_order="ABCD", data_type="FLOAT32-IEEE", scale=1.0, address=[1325,1326]},
# <another 24 non consecutive registers here>
]
[[inputs.modbus]]
name = "UMG103_unit7"
slave_id = 7
timeout = "2s"
busy_retries = 10
busy_retries_wait = "200ms"
controller = "tcp://192.168.2.100:502"
holding_registers = [
{ measurement="Unit7", name="", byte_order="ABCD", data_type="FLOAT32-IEEE", scale=1.0, address=[1325,1326]},
# <another 24 non consecutive registers here>
]
[[inputs.modbus]]
name = "UMG103_unit8"
slave_id = 8
timeout = "2s"
busy_retries = 10
busy_retries_wait = "200ms"
controller = "tcp://192.168.2.100:502"
holding_registers = [
{ measurement="Unit8", name="", byte_order="ABCD", data_type="FLOAT32-IEEE", scale=1.0, address=[1325,1326]},
# <another 24 non consecutive registers here>
]
[[outputs.influxdb_v2]]
urls = ["http://localhost:8086"]
token = "XXXXXXX"
organization = "demo_org"
bucket = "demo_bucket"
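For reference, each field above spans two 16-bit registers decoded with byte_order="ABCD" and data_type="FLOAT32-IEEE". A minimal sketch of that decoding, assuming the first address (e.g. 1325) holds the high word; `decodeFloat32ABCD` is a hypothetical helper and the raw register values are made up for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// decodeFloat32ABCD combines two 16-bit holding registers, high word first
// with big-endian bytes ("ABCD"), into an IEEE-754 float32.
func decodeFloat32ABCD(hi, lo uint16) float32 {
	return math.Float32frombits(uint32(hi)<<16 | uint32(lo))
}

func main() {
	// Hypothetical raw values for address=[1325,1326].
	fmt.Println(decodeFloat32ABCD(0x4366, 0x0000)) // prints 230
}
```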
The problem
The first units in the config file are read quite regularly, but units 5 to 8 almost always time out:
read tcp 192.168.2.XX:XXXX->192.168.2.100:502: i/o timeout
There are not many parameters to tweak (timeout, busy_retries and busy_retries_wait have already been increased), so I don't know whether what I am seeing is a wrong setting or a problem in the modbus plugin.
At first I suspected the UMG604, which accepts only 4 Modbus connections.
As a test I launched 3 Telegraf instances at the same time, so I was effectively trying to read 24 devices (8 devices, 3 times each) concurrently. I did not see the dramatic increase in timeouts I expected: for each Telegraf instance the first 4 units were always read and the last 4 were not. So I would rule out a TCP stack problem in the UMG604.
Second test: I added a delay parameter before each connection and read, thinking the RS485 line might be overloaded. No change.
Digging into the modbus.go code (this is the first time I have worked with Go, so my knowledge is quite limited), I see that for the faulty units getFields returns an error that is not an *mb.ModbusError, so it carries no ExceptionCode (the type assertion gives ok=false). This means Gather() in the modbus.go plugin just exits without even retrying the read:
// after getFields(), err is not nil
if err != nil {
	mberr, ok := err.(*mb.ModbusError) // <-- ok is false!
	// only one type of error is handled here and the read retried; in every other case the attempts stop and there is no retry
	if ok && mberr.ExceptionCode == mb.ExceptionCodeServerDeviceBusy && retry < m.Retries {
		...
		time.Sleep(m.RetriesWaitTime.Duration)
		continue
	}
	// ok is false, so execution ends up here!
	disconnect(m)
	m.isConnected = false
	return err
}
For testing purposes, I removed the check on that specific ExceptionCode so that the read is retried whenever err != nil. No change at all: every retry fails with the same error and no ExceptionCode.
As a last attempt I closed and reopened the connection before each retry: no change. After the first error, all further reads fail.
Any idea what I could try?
(As a workaround I wrote a minimal inputs.exec script that reads the devices and prints JSON that is fed to Telegraf, but if possible I would like a standard solution based only on the inputs.modbus plugin.)
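For the inputs.exec workaround, a minimal Go sketch of the output side only: it prints readings as a JSON array which, to my understanding, Telegraf's json data format turns into one metric per object (string values such as the unit name need tag_keys = ["unit"] to be kept, as a tag). The `reading` type, `encode` helper, field names and values are placeholders; the actual Modbus reads are omitted.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// reading holds one unit's values; field names are hypothetical.
type reading struct {
	Unit   string  `json:"unit"`
	Strom1 float64 `json:"Strom-1"`
}

// encode serializes the readings as a single JSON array.
func encode(rs []reading) (string, error) {
	b, err := json.Marshal(rs)
	return string(b), err
}

func main() {
	// Placeholder values; a real script would fill them from Modbus reads.
	out, err := encode([]reading{{Unit: "Unit2", Strom1: 1.5}})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(out) // [{"unit":"Unit2","Strom-1":1.5}]
}
```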