System goes haywire after a number of Telegraf errors occur

Hello,

First, my setup: I have an Intel NUC running Ubuntu 22.04 with Telegraf, InfluxDB and Node-RED installed. It connects over WiFi to a Modbus RTU-to-TCP converter, a “USR-W610”. Behind the converter, on the Modbus bus, I have 3 “Circutor MK LCD” energy meters, 2 “STEPS Data Loggers”, 1 “Seneca Z-key-mbus” (which converts M-Bus to Modbus) that reads 2 “Sharky 775” energy meters, 1 Seneca Z-D-Out and a heat pump.

I’m having an issue with Telegraf. The data is read correctly, but from time to time the error “Error in plugin: slave 10: modbus: response data size ‘11’ does not match count ‘18’” pops up. That doesn’t affect the data collection, or at least not in any important way. The real issue comes after a random period, anywhere from a few hours to a day, when the system somehow goes haywire and Telegraf only receives “Error in plugin: read tcp i/o timeout”. By disconnecting the devices one by one I found that the misbehaving device was one of the Circutor energy meters. I replaced it, but the issue came back, either with the replacement meter or with one of the remaining two.
I really don’t know whether the cause is something configured incorrectly in telegraf.conf, or even whether the error is related to the energy meter going haywire at all. My telegraf.conf is below; maybe someone can help or has an idea.
Thanks a lot in advance. And sorry, this is my first project using this software.

[agent]
interval = "10s"
round_interval = true
metric_batch_size = 10000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "11s"
flush_jitter = "1s"
precision = "0s"
hostname = ""
omit_hostname = false

[[outputs.influxdb_v2]]
urls = ["http://localhost:8086"]
token = "OwLI7aJ5DB08lbDZdweA4Z0qhEBMr-HFH-9-6YpA8UstkrdPJV55K1q1guidWp8a8TZab-_Xu4dIXzO55232tw=="
organization = "test"
bucket = "test"

[[inputs.modbus]]
name = "DLs"
slave_id = 1
timeout = "2s"
controller = "tcp://10.10.100.254:8899"
configuration_type = "request"

[[inputs.modbus.request]]
slave_id = 9
byte_order = "ABCD"
register = "input"
fields = [
{ address=3, name="T-10", type="INT16", scale=0.1 },
{ address=4, name="T-11", type="INT16", scale=0.1 },
{ address=6, name="Pd-01", type="INT16", scale=0.1 },
{ address=12, name="Int_Hum", type="INT16", scale=0.1 },
{ address=14, name="Int_Temp", type="INT16", scale=0.1 },
]
[[inputs.modbus.request]]
slave_id = 9
byte_order = "ABCD"
register = "holding"
fields = [
{ address=164, name="Exterior_lights", type="INT16", scale=1.0 },
{ address=168, name="PM02", type="INT16", scale=1.0 },
]
[[inputs.modbus.request]]
slave_id = 5
byte_order = "ABCD"
register = "coil"
fields = [
{ address=0, name="FanCoilHighSpeed", type="INT16", scale=1.0 },
]
[[inputs.modbus.request]]
slave_id = 10
byte_order = "ABCD"
register = "input"

fields = [
  { address=3, name="T-01",      type="INT16",   scale=0.1   },
  { address=4, name="T-02",      type="INT16",   scale=0.1   },
  { address=5, name="T-03",      type="INT16",   scale=0.1   },
  { address=6, name="T-04",      type="INT16",   scale=0.1   },
  { address=7, name="T-05",      type="INT16",   scale=0.1   },
  { address=8, name="T-06",      type="INT16",   scale=0.1   },
  { address=9, name="T-07",      type="INT16",   scale=0.1   },
  { address=10, name="T-08",      type="INT16",   scale=0.1   },
  { address=11, name="T-09",      type="INT16",   scale=0.1   },
]  

[[inputs.modbus.request]]
slave_id = 10
byte_order = "ABCD"
register = "holding"
fields = [
{ address=164, name="EV01", type="INT16", scale=1.0 },
{ address=167, name="EV02", type="INT16", scale=1.0 },
]

[[inputs.modbus.request]]
slave_id = 20
byte_order = "ABCD"
register = "holding"
fields = [
{ address=0, name="Grid_Voltage", type="INT32", scale=0.1 },
{ address=2, name="Grid_Current", type="INT32", scale=0.001 },
{ address=4, name="Grid_Energy", type="INT32", scale=1.0 },
{ address=6, name="Grid_Frequency", type="INT32", scale=0.1 },
]
[[inputs.modbus.request]]
slave_id = 2
byte_order = "ABCD"
register = "holding"
fields = [
{ address=0, name="PV_Voltage", type="INT32", scale=0.1 },
{ address=2, name="PV_Current", type="INT32", scale=0.001 },
{ address=4, name="PV_Energy", type="INT32", scale=1.0 },
{ address=6, name="PV_Frequency", type="INT32", scale=0.1 },
]
[[inputs.modbus.request]]
slave_id = 64
byte_order = "ABCD"
register = "holding"
optimization = "shrink"
fields = [
{ address=0, name="Load_Voltage", type="INT32", scale=0.1 },
{ address=2, name="Load_Current", type="INT32", scale=0.001 },
{ address=4, name="Load_Energy", type="INT32", scale=1.0 },
{ address=6, name="Load_Frequency", type="INT32", scale=0.1 },
]
[[inputs.modbus.request]]
slave_id = 11
byte_order = "ABCD"
register = "holding"
optimization = "shrink"
fields = [
{ address=200, name="Pump_mode", type="INT16" },
{ address=444, name="Water_flow", type="INT16", scale=0.1 },
{ address=1001, name="Set_point_1_cooling", type="INT16", scale=0.1 },
{ address=1002, name="Set_point_1_heating", type="INT16", scale=0.1 },
{ address=1004, name="Set_point_2_cooling", type="INT16", scale=0.1 },
{ address=1005, name="Set_point_2_heating", type="INT16", scale=0.1 },
{ address=253, name="Evaporation_temperature", type="INT16" },
{ address=254, name="Condesation_temperature", type="INT16" },
{ address=305, name="Compressor_working_hours", type="INT16" },
{ address=406, name="High_pressure", type="INT16" },
{ address=414, name="Low_pressure", type="INT16" },
{ address=400, name="Water_inlet_temp", type="INT16", scale=0.1 },
{ address=401, name="Water_outlet_temp", type="INT16", scale=0.1 },
{ address=428, name="HP_External_temp", type="INT16", scale=0.1 },
{ address=949, name="Alarms_1", type="INT16" },
{ address=951, name="Alarms_2", type="INT16" },
{ address=952, name="Alarms_3", type="INT16" },
{ address=953, name="Alarms_4", type="INT16" },
{ address=954, name="Alarms_5", type="INT16" },
{ address=955, name="Alarms_6", type="INT16" },
{ address=956, name="Alarms_7", type="INT16" },
]
[[inputs.modbus.request]]
slave_id = 1
byte_order = "ABCD"
register = "holding"
optimization = "shrink"
fields = [
{ address=0, name="HP_Meter_Energy", type="INT16", scale=1.0 },
{ address=2, name="HP_Meter_Volume", type="INT16", scale=0.001 },
{ address=6, name="HP_Meter_Error", type="INT16", scale=1.0 },
{ address=7, name="HP_Meter_Volume_Flow", type="INT16", scale=0.001 },
{ address=8, name="HP_Meter_Power", type="INT16", scale=0.001 },
{ address=9, name="HP_Meter_Flow_Temperature", type="INT16", scale=0.1 },
{ address=10, name="HP_Meter_Return_Temperature", type="INT16", scale=0.1 },
{ address=11, name="HP_Meter_Temperature_Difference", type="INT16", scale=0.1 },
{ address=12, name="FC_Meter_Energy", type="INT16", scale=1.0 },
{ address=14, name="FC_Meter_Volume", type="INT16", scale=0.001 },
{ address=18, name="FC_Meter_Error", type="INT16", scale=1.0 },
{ address=19, name="FC_Meter_Volume_Flow", type="INT16", scale=0.001 },
{ address=20, name="FC_Meter_Power", type="INT16", scale=0.001 },
{ address=21, name="FC_Meter_Flow_Temperature", type="INT16", scale=0.1 },
{ address=22, name="FC_Meter_Return_Temperature", type="INT16", scale=0.1 },
{ address=23, name="FC_Meter_Temperature_Difference", type="INT16", scale=0.1 },
]
[inputs.modbus.workarounds]
pause_between_requests = "50ms"
pause_after_connect = "50ms"
close_connection_after_gather = true

Using a USR-DR302 myself, I found that the “read i/o timeout” errors come from sending requests too quickly. You can try playing with the pause_between_requests option to wait a little before sending the next request. For my system I’ve set

  [inputs.modbus.workarounds]
    ## Pause after connect delays the first request by the specified time.
    ## This might be necessary for (slow) devices.
    # pause_after_connect = "0ms"

    ## Pause between read requests sent to the device.
    ## This might be necessary for (slow) serial devices.
    pause_between_requests = "100ms"

But shorter times might also work. Furthermore, I suggest optimizing the requests, as sending one request and throwing away some registers is usually cheaper than sending multiple requests. In your config every skipped address means a new request (I counted 5 requests in your config).
So setting

  [[inputs.modbus.request]]
    ...
    optimization = "max_insert"
    optimization_max_register_fill = 10
    ...

would reduce it to 1 request…
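
As a rough illustration of that suggestion, here is how it might look on the slave 9 input request from the config in the first post. This is a sketch, not a tested config: the fill value of 10 is simply the one suggested here, and the comment about the resulting read reflects how the plugin documentation describes max_insert, not anything verified against this hardware.

```toml
[[inputs.modbus.request]]
  slave_id = 9
  byte_order = "ABCD"
  register = "input"
  ## Allow the optimizer to bridge gaps of up to 10 unused registers.
  ## The fields below leave gaps at addresses 5, 7..11 and 13, so they
  ## should fit into one read covering addresses 3..14 (12 registers).
  optimization = "max_insert"
  optimization_max_register_fill = 10
  fields = [
    { address=3,  name="T-10",     type="INT16", scale=0.1 },
    { address=4,  name="T-11",     type="INT16", scale=0.1 },
    { address=6,  name="Pd-01",    type="INT16", scale=0.1 },
    { address=12, name="Int_Hum",  type="INT16", scale=0.1 },
    { address=14, name="Int_Temp", type="INT16", scale=0.1 },
  ]
```

The trade-off is fewer round trips in exchange for one longer response, which only helps if the device copes with the longer read.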

Thanks for the reply.
I’m out of town for 2 weeks, but I’ll certainly try this as soon as I get back home and report back on it.
Thanks once again

Srebhan,
I tried your suggestions; unfortunately they didn’t work. Changing pause_between_requests didn’t do anything to the system at all. From time to time there was still the error " [inputs.modbus] Error in plugin: slave 10: modbus: response data size ‘11’ does not match count ‘18’ ", or the same with different numbers, and eventually it led to the Circutor energy meters going haywire once again and all the requests turning into timeouts. I then tried changing the optimization method, but that actually sharply increased the number of data-size-mismatch errors, so the system went haywire even faster. I first added this optimization method to all the requests, and then, in case I had misunderstood what you meant, I tried adding it to the different requests one by one; the issue was still the same. Hopefully you have some other idea of what I can do to fix this. Thanks

Can you please post a link to the datasheet/manual of the meter? Does it say something about the maximum request size or anything?

Hello srebhan,
This is the link to the meter’s manual: https://docs.circutor.com/docs/M98175201-03.pdf. There is no information about a maximum request size for these meters, but maybe you can find some info there that could solve this. Thanks again for helping me with this.

Hmmm, there is no useful info saying that the meter requires special handling. However, your meter is not happy with what we send (even though our messages conform to the spec). Try playing around with the workaround parameters, or ask the manufacturer about the issue.

The

[inputs.modbus] Error in plugin: slave 10: modbus: response data size ‘11’ does not match count ‘18’

error indicates that we expect a response of 18 bytes from the register but it only sends 11. To me this looks like a full buffer or something similar in the meter. Maybe close_connection_after_gather = true helps!?
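
For context, the expected count of 18 presumably comes from the slave 10 input request in the config from the first post: nine consecutive 16-bit registers, and a Modbus register read returns two bytes per register. Annotated as a sketch (the comments are added here for illustration and are not part of the original config):

```toml
[[inputs.modbus.request]]
  slave_id = 10
  byte_order = "ABCD"
  register = "input"
  ## Addresses 3..11 are consecutive, so they can go out as one read:
  ##   9 registers x 2 bytes per register = 18 bytes of response data.
  ## A response carrying 11 bytes is shorter than that, and an odd byte
  ## count cannot be a whole number of 16-bit registers.
  fields = [
    { address=3,  name="T-01", type="INT16", scale=0.1 },
    { address=4,  name="T-02", type="INT16", scale=0.1 },
    { address=5,  name="T-03", type="INT16", scale=0.1 },
    { address=6,  name="T-04", type="INT16", scale=0.1 },
    { address=7,  name="T-05", type="INT16", scale=0.1 },
    { address=8,  name="T-06", type="INT16", scale=0.1 },
    { address=9,  name="T-07", type="INT16", scale=0.1 },
    { address=10, name="T-08", type="INT16", scale=0.1 },
    { address=11, name="T-09", type="INT16", scale=0.1 },
  ]
```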

Already tried with close_connection_after_gather = true, unfortunately no difference in the result.

Did you try to increase the interval for this device? Without a clue about what is wrong with your device, I don’t see any way forward. Do you think you can contact the manufacturer of the device and ask what is wrong? I would love to add a workaround for your device if we know what to do…
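
A minimal sketch of raising the interval for the Modbus input alone, using the per-plugin interval option that Telegraf inputs accept. The value of 20s is only an assumed example; the other settings are copied from the config in the first post.

```toml
[[inputs.modbus]]
  name = "DLs"
  ## Overrides the agent-wide interval for this plugin instance only
  ## (20s is an example value, not a recommendation).
  interval = "20s"
  slave_id = 1
  timeout = "2s"
  controller = "tcp://10.10.100.254:8899"
  configuration_type = "request"
  ## The [[inputs.modbus.request]] sections stay unchanged below this point.
```

Note that this slows polling for every device behind the converter, since they all share this one plugin instance. Slowing down a single device would need its own [[inputs.modbus]] section, which also means a second client connection to the converter.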

I tried setting pause_between_requests to a lot of values, like 50 ms, 100 ms, 250 ms, 500 ms, 50 us, 100 us and 250 us. I even changed the Telegraf gather interval to 20 seconds and set the pause between requests to 1 s. Throughout all this testing I still got the error, but with the microsecond (us) values it takes longer for the system to crash. I will try to reach out to the manufacturer and see if they can provide some assistance. Nevertheless, your comments about the request size and a full buffer gave me an idea for an experiment, to see if I can at least bypass the issue if I can’t solve it. Today I set up a bash script that stops and restarts Telegraf every 3 hours, to see whether this could (and forgive me if I’m assuming things that make no sense) clear some buffer or cache that Telegraf might be building up and that exceeds what the meter can handle.
In any case, I will keep you posted on this.
Thanks

If you find a way to avoid the issue, please let me know! However, I fear that this is not in Telegraf but rather in the device, triggered by Telegraf. But anyhow, if you find a way, we can probably add a workaround for the issue. :wink:

Hello,
I have sent an email to the manufacturer of the meters; so far no response. With the test I mentioned before, the system ran for about 5 days without crashing, with occasional errors but nothing serious, until it eventually crashed with the same type of error. Next I will try restarting Telegraf every hour instead of every 3 hours and see if this gives a further improvement.

Hi again,
So, I don’t know if this helps you figure out a workaround, but yesterday I power-cycled the system and started it again to try out the new Telegraf restart interval. The first error I received was " [inputs.modbus] Error in plugin: slave 10: modbus: response data size ‘1’ does not match count ‘1536’ ", followed by the same error but saying " [inputs.modbus] Error in plugin: slave 1: modbus: response data size ‘1536’ does not match count ‘1537’ ". So the errors didn’t start from low values, and this time the system didn’t even last 6 hours. I think this definitely has something to do with the meter’s buffer, which wasn’t completely cleared when I restarted the system. I don’t really know what else to do.

@Javier9269 could it be that you have multiple clients connecting to the device (e.g. multiple Telegraf instances, multiple [[inputs.modbus]] sections, or another program reading the data in parallel to Telegraf)? This could lead to collisions, and your device might not handle them correctly!?

No, I only run 1 Telegraf instance and, as you can see from the conf file at the beginning, there is only 1 inputs.modbus section, with a number of requests to the different devices.