[inputs.modbus] Multiple slaves fail after reading 4 units

I asked first on stackoverflow at https://stackoverflow.com/questions/65143072/telegraf-1-16-inputs-modbus-plugin-timeout-problem

I am reading some Janitza devices with Telegraf 1.16 through the inputs.modbus plugin.
Telegraf is started manually rather than as a service, to make testing and debugging easier.

This is the configuration:
[Image: system diagram]

Unit1 is a UMG604 that acts as a gateway: it receives Modbus/TCP messages and, if they don’t match its own Modbus address, relays them to the downstream units. These are connected over an RS485 line, which means the communication is half-duplex and the line is quite busy, because we are trying to read 350+ registers at every tick (50 registers per device).

These units are read without any problem by two loggers I wrote, one in C and the other in Python/pymodbus, so I can rule out a hardware issue. Both loggers read the units sequentially, one after the other, so Go concurrency could be the issue.

The settings are straightforward. Here is a skeleton of the Telegraf configuration file:


[agent]
  interval="5s"   # sample time
  round_interval=true  # sample at rounded intervals :00, :05, :10, etc
  metric_batch_size=1000  
  metric_buffer_limit=10000  


[[inputs.modbus]]
  name = "UMG604_Gateway_unit1"
  slave_id = 1
  timeout = "2s"
  busy_retries = 10
  busy_retries_wait = "200ms"
  controller = "tcp://192.168.2.100:502"

  holding_registers = [
    { measurement="Unit1", name="Strom-1", byte_order="ABCD", data_type="FLOAT32-IEEE", scale=1.0, address=[1325,1326]},
    # <another 24 non consecutive registers here>
  ]

[[inputs.modbus]]
  name = "UMG103_unit2"
  slave_id = 2
  timeout = "2s"
  busy_retries = 10
  busy_retries_wait = "200ms"
  controller = "tcp://192.168.2.100:502"

  holding_registers = [
    { measurement="Unit2", name="", byte_order="ABCD", data_type="FLOAT32-IEEE", scale=1.0, address=[1325,1326]},
    # <another 24 non consecutive registers here>
  ]

[[inputs.modbus]]
  name = "UMG103_unit3"
  slave_id = 3
  timeout = "2s"
  busy_retries = 10
  busy_retries_wait = "200ms"
  controller = "tcp://192.168.2.100:502"

  holding_registers = [
    { measurement="Unit3", name="", byte_order="ABCD", data_type="FLOAT32-IEEE", scale=1.0, address=[1325,1326]},
    # <another 24 non consecutive registers here>
  ]

[[inputs.modbus]]
  name = "UMG103_unit4"
  slave_id = 4
  timeout = "2s"
  busy_retries = 10
  busy_retries_wait = "200ms"
  controller = "tcp://192.168.2.100:502"

  holding_registers = [
    { measurement="Unit4", name="", byte_order="ABCD", data_type="FLOAT32-IEEE", scale=1.0, address=[1325,1326]},
    # <another 24 non consecutive registers here>
  ]

[[inputs.modbus]]
  name = "UMG103_unit5"
  slave_id = 5
  timeout = "2s"
  busy_retries = 10
  busy_retries_wait = "200ms"
  controller = "tcp://192.168.2.100:502"

  holding_registers = [
    { measurement="Unit5", name="", byte_order="ABCD", data_type="FLOAT32-IEEE", scale=1.0, address=[1325,1326]},
    # <another 24 non consecutive registers here>
  ]

[[inputs.modbus]]
  name = "UMG103_unit6"
  slave_id = 6
  timeout = "2s"
  busy_retries = 10
  busy_retries_wait = "200ms"
  controller = "tcp://192.168.2.100:502"

  holding_registers = [
    { measurement="Unit6", name="", byte_order="ABCD", data_type="FLOAT32-IEEE", scale=1.0, address=[1325,1326]},
    # <another 24 non consecutive registers here>
  ]

[[inputs.modbus]]
  name = "UMG103_unit7"
  slave_id = 7
  timeout = "2s"
  busy_retries = 10
  busy_retries_wait = "200ms"
  controller = "tcp://192.168.2.100:502"

  holding_registers = [
    { measurement="Unit7", name="", byte_order="ABCD", data_type="FLOAT32-IEEE", scale=1.0, address=[1325,1326]},
    # <another 24 non consecutive registers here>
  ]

[[inputs.modbus]]
  name = "UMG103_unit8"
  slave_id = 8
  timeout = "2s"
  busy_retries = 10
  busy_retries_wait = "200ms"
  controller = "tcp://192.168.2.100:502"

  holding_registers = [
    { measurement="Unit8", name="", byte_order="ABCD", data_type="FLOAT32-IEEE", scale=1.0, address=[1325,1326]},
    # <another 24 non consecutive registers here>
  ]

[[outputs.influxdb_v2]]
    urls = ["http://localhost:8086"]
    token = "XXXXXXX"
    organization = "demo_org"
    bucket = "demo_bucket"

The problem

The first units in the config file are read quite reliably, but units 5…8 almost always time out:
read tcp 192.168.2.XX:XXXX->192.168.2.100:502: i/o timeout

There are not many parameters to tweak (timeout, busy_retries and busy_retries_wait have already been increased), so I don’t know whether what I am seeing is a wrong setting or a problem in the modbus plugin.

I suspected the culprit was the UMG604, which accepts only 4 Modbus connections.
As a test I launched 3 Telegraf instances at the same time, so in effect I was trying to read 24 devices (8 devices, each read 3 times) at once. I didn’t see the dramatic increase in timeouts that I expected: for each Telegraf instance, the first 4 units were always read and the last 4 were not. So I would rule out a TCP stack problem in the UMG604.

Second test: I added a delay parameter before each connection and read, suspecting some kind of overload on the RS485 line. No change.

Digging into the modbus.go code (this is the first time I have played with Go code, so my knowledge is quite limited), I see that for the faulty units there is an error without an ExceptionCode after getFields (ok=false).

This means Gather() in the modbus.go plugin just exits without even trying to read again:

// after getFields(), err is not nil
if err != nil {
    mberr, ok := err.(*mb.ModbusError) // <-- ok is false!

    // Only one type of error is handled here by retrying the read; in any other case the attempts stop and there is no retry.
    if ok && mberr.ExceptionCode == mb.ExceptionCodeServerDeviceBusy && retry < m.Retries {
        ...
        time.Sleep(m.RetriesWaitTime.Duration)
        continue
    }

    // ok is false, so we jump here!
    disconnect(m)
    m.isConnected = false
    return err
}

For testing purposes, I removed the check on that specific ExceptionCode, so that the read is retried whenever err != nil. No change at all: always an error with an unknown ExceptionCode.

As a last attempt I tried closing and reopening the connection before each subsequent retry: no change. After the first error, all further reads fail.

Any ideas on what I could try?

(As a workaround I wrote a minimal inputs.exec script that reads the units and prints JSON that is fed to Telegraf, but if possible I would like to use a standard solution based only on the inputs.modbus plugin.)

Hello @Nemecsek,
I’m not sure. I’m forwarding your question to the Telegraf team. Thanks in advance for your patience.

Is it possible to run multiple instances of Telegraf, staggered so that they sample at different times, with each instance acquiring and reporting the data for a single slave? That way the concurrency issue could be addressed.

I am thinking of a similar setup for multiple RS485 slaves (32 of them), with 32 instances running:
Sampling interval: 40 seconds
Reporting interval: 40 seconds
Start 32 instances of Telegraf, with a separate config file for each slave, each start spaced 1 second apart.

@Samy, it is a possibility, but quite clumsy.
I wrote my own Python Modbus logger that saves directly to InfluxDB to avoid the issue.

It would be better if there were a delay parameter in the standard configuration to avoid concurrent reads.
Thanks for your answer.
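The idea of avoiding concurrent reads on one gateway can be sketched outside Telegraf with a mutex shared by every poller that uses the same controller address, plus a short idle gap after each read. This is only an illustration of the technique: busLock, pollSerialized and maxInFlight are hypothetical names, the read callback stands in for a real Modbus request, and nothing here is an existing plugin option.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

var (
	registryMu sync.Mutex
	busLocks   = map[string]*sync.Mutex{}
)

// busLock returns the mutex shared by every poller that talks to controller.
func busLock(controller string) *sync.Mutex {
	registryMu.Lock()
	defer registryMu.Unlock()
	l, ok := busLocks[controller]
	if !ok {
		l = &sync.Mutex{}
		busLocks[controller] = l
	}
	return l
}

// pollSerialized runs read while holding the controller's lock, then keeps
// the bus idle for gap before the next poller may start.
func pollSerialized(controller string, gap time.Duration, read func() error) error {
	l := busLock(controller)
	l.Lock()
	defer l.Unlock()
	err := read()
	time.Sleep(gap)
	return err
}

// maxInFlight starts one goroutine per unit and reports the highest number of
// reads that overlapped; with the lock in place it should be 1.
func maxInFlight(controller string, units int) int32 {
	var inFlight, peak int32
	var wg sync.WaitGroup
	for i := 0; i < units; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = pollSerialized(controller, time.Millisecond, func() error {
				n := atomic.AddInt32(&inFlight, 1)
				if n > atomic.LoadInt32(&peak) {
					atomic.StoreInt32(&peak, n)
				}
				time.Sleep(2 * time.Millisecond) // stands in for a real Modbus read
				atomic.AddInt32(&inFlight, -1)
				return nil
			})
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	fmt.Println("peak concurrent reads:", maxInFlight("tcp://192.168.2.100:502", 8))
}
```

Keying the lock by controller address means units behind different gateways would still be polled in parallel, while units sharing one RS485 line take turns.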

Thanks for the input. One other option that we tried was enabling multiple instances of the modbus plugin in a single config file and reusing the existing plugin as much as possible.

Increase the sampling frequency and have the single Telegraf instance sample only one of the slaves per sampling interval.

Config file changes:
[[inputs.modbus]]
  ## Introduce one more configuration parameter:
  ## Highest slave ID in the link, required to sample one slave at a time. THIS SHOULD BE THE SAME ACROSS ALL INSTANCES, e.g. slave ID 14 is the highest number for my application.
  ## Range: 1 - 247 [0 = broadcast; 248 - 255 = reserved]

  highest_slave_id_in_link = 14

Changes to the modbus.go plugin:
// Modbus holds all data relevant to the plugin
type Modbus struct {
    // … add the following two variables …
    slotCounter          uint32
    HighestSlaveIDinLink uint32 `toml:"highest_slave_id_in_link"`
}

func (m *Modbus) Gather(acc telegraf.Accumulator) error {
    m.slotCounter++
    currentSlot := (m.slotCounter % m.HighestSlaveIDinLink) + 1
    if currentSlot != uint32(m.SlaveID) {
        return nil // not this slave's turn
    }
    // … original code as-is
}

// Add this plugin to telegraf. slotCounter is initialized to 0 in the init function.
func init() {
    inputs.Add("modbus", func() telegraf.Input { return &Modbus{slotCounter: 0} })
}
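To see how this slot scheme behaves, the logic can be simulated outside Telegraf. The sketch below is a stand-alone model, not plugin code: poller, due and schedule are hypothetical names, slave IDs are assumed to run from 1 to the highest ID without gaps, and the zero check is an addition of this sketch to avoid a division by zero when the parameter is left unset.

```go
package main

import "fmt"

// poller mirrors the two fields added to the plugin struct above.
type poller struct {
	slaveID     uint32
	highest     uint32 // highest_slave_id_in_link
	slotCounter uint32
}

// due advances the slot counter and reports whether this slave should be
// read on the current tick. With highest == 0 the scheme is disabled and
// every tick is a read.
func (p *poller) due() bool {
	if p.highest == 0 {
		return true
	}
	p.slotCounter++
	return p.slotCounter%p.highest+1 == p.slaveID
}

// schedule simulates the given number of gather ticks for slaves 1..highest
// and returns, per tick, the IDs of the slaves that were read.
func schedule(highest uint32, ticks int) [][]uint32 {
	pollers := make([]*poller, 0, highest)
	for id := uint32(1); id <= highest; id++ {
		pollers = append(pollers, &poller{slaveID: id, highest: highest})
	}
	out := make([][]uint32, ticks)
	for t := range out {
		for _, p := range pollers {
			if p.due() {
				out[t] = append(out[t], p.slaveID)
			}
		}
	}
	return out
}

func main() {
	for tick, ids := range schedule(4, 8) {
		fmt.Printf("tick %d: read slave(s) %v\n", tick+1, ids)
	}
}
```

Each slave is read once every highest_slave_id_in_link ticks, so with eight slaves and a 5 s interval every unit would be sampled once per 40 s, which is why the proposal pairs the scheme with a higher sampling frequency. It also relies on all plugin instances ticking in lockstep, since each one keeps its own counter.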