[input.modbus] Multiple slaves fails after reading 4 units

Nemecsek · December 8, 2020, 9:08am

I asked first on stackoverflow at https://stackoverflow.com/questions/65143072/telegraf-1-16-inputs-modbus-plugin-timeout-problem

I am reading some Janitza devices with Telegraf 1.16 through the inputs.modbus plugin.
Telegraf is started manually and not as a service to ease tests and debugging.

This is the configuration:

Unit1 is a UMG604 that acts as a gateway: it receives Modbus/TCP messages and, if they don’t match its own Modbus address, relays them to the following units. These are linked through an RS485 line, so the communication is half-duplex and the line is quite busy, because we are trying to read 350+ registers at every tick (50 registers per device).

These units are read without any problem by two loggers I wrote, one in C, the other in Python/pymodbus, so I can exclude any hardware issue. Both loggers read the units serially, one after the other, so Go concurrency could be an issue.
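To make the contrast with concurrent reads concrete, here is a minimal Go sketch of the serial pattern those loggers follow: one request in flight at a time, with an optional pause between units. It is not taken from either logger or from Telegraf. `readFunc`, `pollSequentially` and the fake read in `main` are hypothetical names, and the actual Modbus request is left out.

```go
package main

import (
	"fmt"
	"time"
)

// readFunc is a hypothetical stand-in for one Modbus read of a single unit.
type readFunc func(unitID int) error

// pollSequentially reads the units one at a time, in the given order, and
// sleeps for gap between two reads so that only one request is ever in
// flight on the shared RS485 line. It returns the result recorded per unit.
func pollSequentially(unitIDs []int, read readFunc, gap time.Duration) map[int]error {
	results := make(map[int]error, len(unitIDs))
	for i, id := range unitIDs {
		if i > 0 {
			time.Sleep(gap)
		}
		results[id] = read(id)
	}
	return results
}

func main() {
	var order []int
	// Assumption: a fake read that only records the order of the requests.
	fakeRead := func(unitID int) error {
		order = append(order, unitID)
		return nil
	}
	results := pollSequentially([]int{1, 2, 3, 4, 5, 6, 7, 8}, fakeRead, 10*time.Millisecond)
	fmt.Println("read order:", order)
	fmt.Println("units read:", len(results))
}
```

In this pattern the gateway never sees a second request before the previous one has been answered or has timed out, which is the property that several plugin instances sharing one gateway do not guarantee by themselves.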

The settings are straightforward. Here is a skeleton of the Telegraf configuration file:


[agent]
  interval="5s"   # sample time
  round_interval=true  # sample at rounded intervals :00, :05, :10, etc
  metric_batch_size=1000  
  metric_buffer_limit=10000  


[[inputs.modbus]]
  name = "UMG604_Gateway_unit1"
  slave_id = 1
  timeout = "2s"
  busy_retries = 10
  busy_retries_wait = "200ms"
  controller = "tcp://192.168.2.100:502"

  holding_registers = [
    { measurement="Unit1", name="Strom-1", byte_order="ABCD", data_type="FLOAT32-IEEE", scale=1.0, address=[1325,1326]},
    # <another 24 non consecutive registers here>
  ]

[[inputs.modbus]]
  name = "UMG103_unit2"
  slave_id = 2
  timeout = "2s"
  busy_retries = 10
  busy_retries_wait = "200ms"
  controller = "tcp://192.168.2.100:502"

  holding_registers = [
    { measurement="Unit2", name="", byte_order="ABCD", data_type="FLOAT32-IEEE", scale=1.0, address=[1325,1326]},
    # <another 24 non consecutive registers here>
  ]

[[inputs.modbus]]
  name = "UMG103_unit3"
  slave_id = 3
  timeout = "2s"
  busy_retries = 10
  busy_retries_wait = "200ms"
  controller = "tcp://192.168.2.100:502"

  holding_registers = [
    { measurement="Unit3", name="", byte_order="ABCD", data_type="FLOAT32-IEEE", scale=1.0, address=[1325,1326]},
    # <another 24 non consecutive registers here>
  ]

[[inputs.modbus]]
  name = "UMG103_unit4"
  slave_id = 4
  timeout = "2s"
  busy_retries = 10
  busy_retries_wait = "200ms"
  controller = "tcp://192.168.2.100:502"

  holding_registers = [
    { measurement="Unit4", name="", byte_order="ABCD", data_type="FLOAT32-IEEE", scale=1.0, address=[1325,1326]},
    # <another 24 non consecutive registers here>
  ]

[[inputs.modbus]]
  name = "UMG103_unit5"
  slave_id = 5
  timeout = "2s"
  busy_retries = 10
  busy_retries_wait = "200ms"
  controller = "tcp://192.168.2.100:502"

  holding_registers = [
    { measurement="Unit5", name="", byte_order="ABCD", data_type="FLOAT32-IEEE", scale=1.0, address=[1325,1326]},
    # <another 24 non consecutive registers here>
  ]

[[inputs.modbus]]
  name = "UMG103_unit6"
  slave_id = 6
  timeout = "2s"
  busy_retries = 10
  busy_retries_wait = "200ms"
  controller = "tcp://192.168.2.100:502"

  holding_registers = [
    { measurement="Unit6", name="", byte_order="ABCD", data_type="FLOAT32-IEEE", scale=1.0, address=[1325,1326]},
    # <another 24 non consecutive registers here>
  ]

[[inputs.modbus]]
  name = "UMG103_unit7"
  slave_id = 7
  timeout = "2s"
  busy_retries = 10
  busy_retries_wait = "200ms"
  controller = "tcp://192.168.2.100:502"

  holding_registers = [
    { measurement="Unit7", name="", byte_order="ABCD", data_type="FLOAT32-IEEE", scale=1.0, address=[1325,1326]},
    # <another 24 non consecutive registers here>
  ]

[[inputs.modbus]]
  name = "UMG103_unit8"
  slave_id = 8
  timeout = "2s"
  busy_retries = 10
  busy_retries_wait = "200ms"
  controller = "tcp://192.168.2.100:502"

  holding_registers = [
    { measurement="Unit8", name="", byte_order="ABCD", data_type="FLOAT32-IEEE", scale=1.0, address=[1325,1326]},
    # <another 24 non consecutive registers here>
  ]

[[outputs.influxdb_v2]]
    urls = ["http://localhost:8086"]
    token = "XXXXXXX"
    organization = "demo_org"
    bucket = "demo_bucket"

The problem

The first units in the config file are read quite regularly, but units 5…8 almost always time out:
read tcp 192.168.2.XX:XXXX->192.168.2.100:502: i/o timeout

There are not many parameters to tweak (timeout, busy_retries and busy_retries_wait have all been increased), so I don’t know whether what I am seeing is a wrong setting or a problem in the modbus plugin.

I suspected the culprit was the UMG604, which accepts only 4 Modbus connections.
As a test I launched 3 Telegraf services at the same time, so ideally I was trying to read 24 devices (8 devices read 3 times) at once. I didn’t see the dramatic increase in timeouts that I expected: for each Telegraf instance, the first 4 units were always read and the last 4 were not. So I would exclude any TCP stack problem in the UMG604.

Second test: I added a delay parameter before each connection and read, thinking that there might be some kind of overload on the RS485 line. No change.

Stripping down the modbus.go code (it is the first time I have played with Go code, so my knowledge is quite limited), I see that for the faulty units there is an error without an ExceptionCode after getFields (ok=false).

This means that Gather() in the modbus.go plugin just exits without even trying to read again:

// after getFields() err is not nil
if err != nil {
    mberr, ok := err.(*mb.ModbusError) // <-- ok is false!

    // only one type of error is handled here and the read is retried;
    // in any other case the attempts stop and there is no retry
    if ok && mberr.ExceptionCode == mb.ExceptionCodeServerDeviceBusy && retry < m.Retries {
        ...
        time.Sleep(m.RetriesWaitTime.Duration)
        continue
    }

    // ok is false, so we jump here!
    disconnect(m)
    m.isConnected = false
    return err
}
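One likely reason for ok being false is that an "i/o timeout" comes from the TCP read itself rather than from a Modbus exception response, so it would not be a *mb.ModbusError at all. The sketch below shows one way to tell the two cases apart using only the Go standard library. It is an illustration, not code from the plugin: `isTimeout` and `fakeTimeout` are hypothetical names, and `fakeTimeout` only imitates the kind of error a socket read deadline produces.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// fakeTimeout is a hypothetical error used only for this demo. It satisfies
// net.Error and reports itself as a timeout, like an expired read deadline.
type fakeTimeout struct{}

func (fakeTimeout) Error() string   { return "i/o timeout" }
func (fakeTimeout) Timeout() bool   { return true }
func (fakeTimeout) Temporary() bool { return true }

// isTimeout reports whether err, or any error it wraps, is a network timeout.
func isTimeout(err error) bool {
	var netErr net.Error
	return errors.As(err, &netErr) && netErr.Timeout()
}

func main() {
	wrapped := fmt.Errorf("read tcp: %w", fakeTimeout{})
	fmt.Println(isTimeout(wrapped))                        // true
	fmt.Println(isTimeout(errors.New("illegal function"))) // false
}
```

A check like this only classifies the error. It says nothing about why the gateway stops answering, so logging the result next to the slave ID would be a diagnostic step rather than a fix.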

For testing purposes, I removed the check on that specific ExceptionCode, so that a retry is requested any time err != nil. No change at all: I always get an error with an unknown ExceptionCode.

As a last attempt I tried to close and reopen the connection before each retry: no change. After the first error, all further reads are unsuccessful.

Any ideas on what I could try?

(As a workaround I wrote a minimal inputs.exec script that reads the units and prints JSON that is fed to Telegraf, but if possible I would like to use a standard solution based only on the inputs.modbus plugin.)

Anaisdg · December 11, 2020, 9:15pm

Hello @Nemecsek,
I’m not sure. I’m forwarding your question to the Telegraf team. Thanks in advance for your patience.

Samy · January 4, 2021, 2:24pm

Is it possible to run multiple instances of Telegraf, all staggered by different sampling intervals, with each instance acquiring and reporting the data for a single slave? This way, the concurrency issue can be addressed.

I am thinking of a similar setup for multiple RS485 slaves (32 of them), with 32 instances running:
Sampling interval: 40 seconds
Reporting interval: 40 seconds
Start 32 instances of Telegraf, with a separate config file for each slave, each spaced out by 1 second

Nemecsek · January 5, 2021, 10:58am

@Samy, it is a possibility, but quite clumsy.
I wrote my own Python Modbus logger that saves directly to InfluxDB to avoid the issue.

It would be better if the standard configuration offered a delay parameter to avoid concurrency.
Thanks for your answer.

Samy · January 5, 2021, 11:56am

Thanks for the input. One other option that we tried was enabling multiple instances of the modbus plugin in a single config file and reusing the existing plugin as much as possible.

The idea is to increase the sampling frequency and have a single instance of Telegraf sample only one of the slaves per sampling interval.

Config file changes:
[[inputs.modbus]]
  ## Introduce one more configuration parameter:
  ## highest slave ID in the link, required to sample one slave at a time.
  ## THIS SHOULD BE THE SAME ACROSS ALL INSTANCES, e.g. slave ID 14 is the highest number for my application.
  ## Range: 1 - 247 [0 = broadcast; 248 - 255 = reserved]

  highest_slave_id_in_link = 14

Changes to the modbus.go plugin:
// Modbus holds all data relevant to the plugin
type Modbus struct {
    // … add the following two fields …
    slotCounter          uint32
    HighestSlaveIDinLink uint32 `toml:"highest_slave_id_in_link"`
}

func (m *Modbus) Gather(acc telegraf.Accumulator) error {
    m.slotCounter++
    currentSlot := (m.slotCounter % m.HighestSlaveIDinLink) + 1
    if currentSlot != uint32(m.SlaveID) {
        return nil
    }
    // … original code as-is
}

// Add this plugin to telegraf; slotCounter is initialized to 0 in the init function
func init() {
    inputs.Add("modbus", func() telegraf.Input { return &Modbus{slotCounter: 0} })
}
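To see which slave is read at each interval, the slot arithmetic from the patch can be run on its own. This is a standalone sketch rather than plugin code. It assumes 8 units, as in the original question, and the zero check is an addition of the sketch: without it, a missing highest_slave_id_in_link would make the modulo divide by zero.

```go
package main

import "fmt"

// currentSlot mirrors the arithmetic of the patch: the counter has already
// been incremented by the caller and is mapped onto a slave ID in 1..highest.
// The ok result is false when highest is 0 (an addition of this sketch).
func currentSlot(counter, highest uint32) (slot uint32, ok bool) {
	if highest == 0 {
		return 0, false
	}
	return counter%highest + 1, true
}

func main() {
	const highest = 8 // assumption: eight units, as in the original question
	var counter uint32
	for tick := 1; tick <= 9; tick++ {
		counter++ // what the patched Gather() does on every interval
		slot, _ := currentSlot(counter, highest)
		fmt.Printf("interval %d: slave %d is read\n", tick, slot)
	}
}
```

Two things follow from the arithmetic. The first interval after startup reads slave 2, not slave 1, and each slave is read only once every `highest` intervals, so with a 5 s interval and 8 units each unit is sampled every 40 s. The scheme also relies on every plugin instance having its Gather() called exactly once per interval, because each instance keeps its own counter.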
