We are batch uploading a portion of our data from another system into Influx on a weekly basis. The volume is not particularly large, but we seem to be causing some trouble for one of our Kapacitor tasks and Kapacitor itself.
The Kapacitor task in question joins two streams, calculates a new third stream, and writes it back to Influx. The task is reliable and robust when it’s processing incoming data from our continuous measurement system. New values come in multiple times a minute, and it happily calculates and writes the derived values back to Influx.
However, when we run the batch upload, we end up with multiple gaps in the derived third stream, and Kapacitor itself usually crashes one or more times during the upload. Afterwards, we can reliably fill in the gaps with a Kapacitor replay-live job that runs the very same task that appears to fail during the upload.
Ultimately, we will be replacing the batch upload by expanding our continuous measurement system, and this will cease to be a problem. But, in the meantime, we’d sure like to avoid the gaps and rework.
- How might we best go about troubleshooting / debugging this problem?
- Given the behavior, it seems almost certain that we're simply overloading Kapacitor during these upload jobs: the task works fine both in regular use and in replay-live jobs. We can throttle the rate at which the upload job writes points to Influx, but beyond trial and error, is there a known benchmark or threshold (e.g., points per second) to be mindful of?
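In case it helps frame the throttling question: one thing we could do on our side, short of trial and error, is cap the upload's write rate explicitly rather than letting the batch job write as fast as it can. A minimal sketch of what we have in mind, assuming line-protocol points and an injected `write_batch` callable (e.g. wrapping the influxdb-python client's `write_points`; the function names, batch size, and rate here are illustrative, not from any Influx recommendation):

```python
import time
from typing import Callable, Iterator, List, Sequence


def chunked(points: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield successive fixed-size batches from a list of line-protocol points."""
    for i in range(0, len(points), size):
        yield list(points[i:i + size])


def throttled_upload(points: Sequence[str],
                     write_batch: Callable[[List[str]], None],
                     batch_size: int = 5000,
                     max_points_per_sec: float = 10000.0) -> int:
    """Write points in batches, sleeping between batches so the overall
    rate stays at or below max_points_per_sec. Returns batches written."""
    min_interval = batch_size / max_points_per_sec  # seconds per batch
    batches = 0
    for batch in chunked(points, batch_size):
        start = time.monotonic()
        write_batch(batch)  # e.g. lambda pts: client.write_points(pts, protocol='line')
        batches += 1
        elapsed = time.monotonic() - start
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)
    return batches
```

Even with something like this in place, though, we'd still be picking `max_points_per_sec` by guesswork, which is why we're asking whether a known threshold exists.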