InfluxDB Python Query Memory Leak?

Hello,

I’m relatively new to InfluxDB and I’m running into some issues querying my database from Python. Some background on my software versions/OS:

InfluxDB 1.6.3
Python 2.7
python-influxdb 5.2.0
Ubuntu 18.04

The measurement I’m working with has ~300M points of robotics data (IMU, GPS, commands, etc.), with ~100 field values in each point. The data is sparse over time, in that we only get a new log maybe once a week, but it’s very dense within a log. There are around 3,000 series in the data, and using the default shard group duration I end up with about 70 shards, each containing ~4–5M points (~100k points/series). If something seems very wrong with this schema, please let me know how I might improve it, but on to the issue.
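
To make the shape concrete, here’s a hypothetical sketch of a single point in the form the Python client writes; the tag and field names are made up for illustration (the real points carry ~100 fields), but the structure is the same:

```python
# Hypothetical example point -- not my real schema. log_name is the tag I
# filter queries on; the real points have ~100 IMU/GPS/command field values
# rather than the handful shown here.
json_body = [
    {
        "measurement": "robot_telemetry",
        "tags": {
            "log_name": "log_2018_10_01",  # roughly one new log per week
        },
        "time": "2018-10-01T12:00:00.123456Z",
        "fields": {
            "gps_lat": 42.123456,
            "gps_lon": -71.123456,
            "imu_accel_x": 0.02,
            "cmd_velocity": 1.5,
            # ... plus ~95 more field values per point
        },
    },
]
```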

I query this measurement with the basic InfluxDBClient query call. For example, I might query for all the GPS data from a specific log name tag. That query might return ~100k points, and the resulting variables in Python take up maybe ~10 MB, but I’ve noticed that after the query call I’m left with hundreds of MB or even GBs of memory used by Python. I can check what variables are present with a whos command, and after systematically deleting every variable and imported module, I’m still left with a massive amount of memory in use until I stop the Python kernel and restart it. I’ve also tried garbage collection calls in Python, but these don’t help either.
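
For reference, here’s a minimal sketch of what I’m doing; the host, database, measurement, and tag values are placeholders:

```python
import gc
from influxdb import InfluxDBClient

# Placeholder connection details and names.
client = InfluxDBClient(host='localhost', port=8086, database='robot_logs')

# e.g. pull all the GPS data for one log.
result = client.query("SELECT * FROM gps WHERE log_name = 'log_2018_10_01'")
points = list(result.get_points())  # ~100k points, roughly 10 MB of objects

# ... work with the points ...

# Attempt to give the memory back afterwards.
del points
del result
gc.collect()  # the process still holds on to hundreds of MB or more
```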

I initially started with a smaller measurement, and queries were fast and didn’t take up too much RAM. When I increased the measurement from my test size up to the full size (~3M points to ~300M points), the RAM required to run even basic queries became huge. In fact, even basic queries over the full measurement that should return, say, 10M points end up exhausting RAM, spilling into swap, and then running out of swap as well (> 20 GB). I suspect this issue and the leftover memory in Python are related, and I strongly suspect it has to do with my schema design. So:

  1. Does this sound like an issue anyone else has seen?
  2. Does my schema design seem like a problem?
  3. Should I be able to query a 300M point measurement without running out of RAM?

Thanks,
-Austin

As an update, I ran some more experiments. It looks like the memory that Python appears to be using is just reserved. I see similar behavior with the influx command line client: when I make a series of large queries, the memory the influx client appears to use accumulates rapidly. It seems to happen, or is at least far more noticeable, with large queries that return GBs of points. So, follow-up questions:

  1. Is this expected behavior?
  2. Is there a way to free up that memory in Python? Closing the client doesn’t do the trick, but is there something else I could try? (See the sketch below for the kind of thing I have in mind.)
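
For example, I noticed the client’s query() call takes chunked and chunk_size arguments. A rough sketch of what I mean is below; the query string and chunk size are placeholders, and I haven’t confirmed whether this client version actually streams the chunks or still assembles the full result in memory, so this is an idea rather than a known fix:

```python
# Sketch only: ask the server for a chunked response. client is the same
# InfluxDBClient as in the sketch above; chunk_size and the query string are
# placeholders, and whether this actually reduces client-side memory use with
# python-influxdb 5.2.0 is exactly what I'm unsure about.
result = client.query(
    "SELECT * FROM gps WHERE log_name = 'log_2018_10_01'",
    chunked=True,
    chunk_size=10000,
)
for point in result.get_points():
    pass  # process one point at a time instead of building one big list
```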

As a further update, I attempted to restructure my shard group duration so all the data would live in a single shard (~300M points in all). This didn’t change the performance at all. Next I’m going to try reducing my shard group duration to get closer to the recommended ~100k points/shard.
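
For reference, this is roughly how I plan to change the shard group duration; the retention policy and database names are placeholders, and I’m assuming this client version exposes a shard_duration argument on alter_retention_policy (otherwise I’ll run the equivalent ALTER RETENTION POLICY ... SHARD DURATION statement from the influx CLI). As I understand it, the new duration only applies to shards created after the change, so existing data has to be re-written to land in the new shards.

```python
# Hedged sketch of the shard group duration change. "autogen" and "robot_logs"
# are placeholders, the duration is just an example value, and shard_duration
# is assumed to be supported by this client version (if not, the same thing
# can be done with ALTER RETENTION POLICY ... SHARD DURATION in the CLI).
# Note: this only affects shards created after the change; existing data has
# to be re-written to pick up the new duration.
client.alter_retention_policy(
    'autogen',
    database='robot_logs',
    shard_duration='1h',  # example value aimed at smaller shards
)
```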

Thanks,
-Austin