Appears there is a Kapacitor memory leak with child process UDF

Hello. I manage an InfluxDB instance which is currently handling about 3 MB per second / 25,000 points per second. I have been experimenting with filtering and re-formatting some of the incoming metrics by inserting Kapacitor between our hundreds of Telegraf instances and our InfluxDB instance.

Currently we have Telegraf posting metrics to InfluxDB, and Kapacitor subscribing over UDP to get some metrics for alerts.

But I was experimenting with having the hundreds of Telegraf instances hit the /write endpoint on Kapacitor first, then defining a task that filters / modifies the data before sending it on to InfluxDB.

I started out with the following task because I needed to be able to filter on the measurement name:

stream
    |from()
        .database('aws')
    @createTagFromMeasurementName()
    |delete()
        .tag('measurementName')
    |influxDBOut()
        .database('aws')
        .retentionPolicy('autogen')

The source code for the UDF is here: createTagFromMeasurementName UDF · GitHub
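In case that link goes stale: the UDF just copies each incoming point and adds its measurement name as a tag. A minimal sketch of that technique, modeled on the stock Python agent "mirror" example from the Kapacitor UDF docs (this is not the exact code from the gist, and the handler name here is my own), looks roughly like this:

# Sketch of a stream UDF that tags each point with its measurement name.
# Based on the stock Kapacitor Python agent examples; not the exact gist code.
from kapacitor.udf.agent import Agent, Handler
from kapacitor.udf import udf_pb2


class CreateTagFromMeasurementNameHandler(Handler):
    def __init__(self, agent):
        self._agent = agent

    def info(self):
        # The UDF consumes a stream and produces a stream.
        response = udf_pb2.Response()
        response.info.wants = udf_pb2.STREAM
        response.info.provides = udf_pb2.STREAM
        return response

    def init(self, init_req):
        # No options to validate.
        response = udf_pb2.Response()
        response.init.success = True
        return response

    def snapshot(self):
        # Stateless, so there is nothing to snapshot.
        response = udf_pb2.Response()
        response.snapshot.snapshot = b''
        return response

    def restore(self, restore_req):
        response = udf_pb2.Response()
        response.restore.success = False
        response.restore.error = 'not implemented'
        return response

    def begin_batch(self, begin_req):
        raise Exception('batches are not supported')

    def point(self, point):
        # Copy the point, add the measurement name as a tag, and emit it.
        response = udf_pb2.Response()
        response.point.CopyFrom(point)
        response.point.tags['measurementName'] = point.name
        self._agent.write_response(response, True)

    def end_batch(self, end_req):
        raise Exception('batches are not supported')


if __name__ == '__main__':
    # The default Agent talks to kapacitord over stdin/stdout,
    # which is how a child-process UDF is wired up.
    agent = Agent()
    agent.handler = CreateTagFromMeasurementNameHandler(agent)
    agent.start()
    agent.wait()

The @createTagFromMeasurementName() call in the task refers to a [udf.functions.createTagFromMeasurementName] section in kapacitor.conf that gives the UDF's prog, args, and timeout, so Kapacitor launches it as a child process and communicates with it over stdin/stdout.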

It worked, but Kapacitor started allocating about 100 MB of RAM per second and never releasing it, eventually resulting in a crash. This was the version 1.2.1 Docker image of Kapacitor with the following patch applied to kapacitord: Comparing influxdata:master...forestjohnsonpeoplenet:patch-inputvalidation · influxdata/kapacitor · GitHub

Also, during the time it was allocating all that RAM, the HTTP API would respond with 404 for an unknown URL, but requests to real endpoints, such as kapacitor list tasks or kapacitor show xyz, would hang forever.

During this time I looked at top, and the kapacitord process was the one consuming the memory, not the UDF child process.

I tried removing the measurementName UDF from the script, yielding:

stream
    |from()
        .database('aws')
    |influxDBOut()
        .database('aws')
        .retentionPolicy('autogen')

That script ran just fine, allocating about 10 MB of RAM per second and topping out at about 400 MB. The HTTP API also remained perfectly available while running this task.

I don’t know if this matters, but we are running Kapacitor in Rancher as a Docker container along with a couple of configuration containers. The same host is running HAProxy and InfluxDB. It’s a 16-core host with 64 GB of RAM (m4.4xlarge in AWS).

Thanks for the detailed write-up. Would it be possible to grab a memory profile of the process while it is over-allocating?

If you start the server with the -memprofile path/to/file argument, it will write out the profile.

You can also use the /kapacitor/v1/debug/pprof endpoints, but those had a bug prior to version 1.3 which may prevent them from working.
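For example, something along these lines should save a heap profile you can attach here (this assumes Kapacitor's HTTP API is on the default port 9092 on the same host, and that you are on 1.3+ where those endpoints work; adjust the URL as needed):

# Sketch: save a heap profile from Kapacitor's debug pprof endpoint.
# Assumes the HTTP API listens on localhost:9092 (the default).
import urllib.request

url = 'http://localhost:9092/kapacitor/v1/debug/pprof/heap'
with urllib.request.urlopen(url) as resp, open('kapacitor-heap.pprof', 'wb') as out:
    out.write(resp.read())

The resulting file can then be inspected with go tool pprof kapacitor-heap.pprof.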
