Internal data and cache management

influxdb
time-series

#1

Hello everyone. I’m observing irrational InfluxDB performance. My test database is only 36 MB, while the _internal folder is jumping between roughly 600 MB and 1 GB. It is writing cache snapshots to disk every 10 seconds, for example:

2018-12-06T20:42:31.059544Z info Cache snapshot (start) {"log_id": "0CDELfpW000", "engine": "tsm1", "trace_id": "0CDJyN4l000", "op_name": "tsm1_cache_snapshot", "op_event": "start"}
2018-12-06T20:42:32.281944Z info Snapshot for path written {"log_id": "0CDELfpW000", "engine": "tsm1", "trace_id": "0CDJyN4l000", "op_name": "tsm1_cache_snapshot", "path": "C:\Users\JohnDoe\.influxdb\data\_internal\monitor\1", "duration": "1222.400ms"}
2018-12-06T20:42:32.281944Z info Cache snapshot (end) {"log_id": "0CDELfpW000", "engine": "tsm1", "trace_id": "0CDJyN4l000", "op_name": "tsm1_cache_snapshot", "op_event": "end", "op_elapsed": "1222.400ms"}

At the moment:

  1. nothing is being written to the database.
  2. nothing is being read from the database, and no continuous queries are being used.

Nevertheless, there is all this CPU activity and high memory use (1600 MB). After I start the server, it loads all the shards into memory.

I’ve been running InfluxDB for a few months now on my laptop, inserting data into it almost every day for a few hours, and it always kept resource use low. My database grew to a few GB, yet InfluxDB would utilize resources sparingly. I couldn’t even tell it was running in the background for months.
These days, all of a sudden, performance has worsened badly, and I haven’t touched the config for months.
Because of these issues I tried several steps: deleting all measurements and dropping continuous queries, but I saw no gains. Then I completely wiped the database (deleted the .influxdb folder) and started pushing data into a clean database. After 1 hour of inserting data at about 100 points per second per device, 2 devices, no tags, only inserts of basic point-time pairs, resource use goes bad already.
I noticed that if I restart the server, it takes a minute or so to load all shards into memory, in case this info helps.
I also upgraded to the latest version, 1.7.1, but it doesn’t solve the problem.
Thanks for help.


#2

I’m sorry to hear you’ve been having performance issues. Unfortunately, it’s very hard for us to be able to diagnose an issue that we can’t reproduce. Nothing you described points to an obvious issue or easy solution.

Some things to keep in mind: the _internal database is used by InfluxDB to collect statistics about its own behavior. This can be configured in the [monitor] section of the configuration file. In production, we recommend disabling this functionality, since it places additional load on the database. It’s possible this is related to your issues; however, I have a development instance of InfluxDB running with internal monitoring turned on and a workload similar to yours, and I haven’t seen the kinds of issues you describe. Did you change the configuration to write this data to a different database, perhaps with a different retention policy, or change the storage interval for this functionality?
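For reference, here is a minimal sketch of what disabling the internal store looks like in a 1.x influxdb.conf. The option names below are taken from the defaults that ship with InfluxDB 1.x; check them against your own file:

```toml
[monitor]
  # Stop persisting internal stats to the _internal database.
  store-enabled = false
  # These two only take effect while store-enabled = true; shown for context.
  store-database = "_internal"
  store-interval = "10s"
```

Restart influxd after editing the file for the change to take effect.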

Another odd piece of information is that InfluxDB still takes a minute or more to load shards even after you deleted the .influxdb directory, which I’m assuming is where your InfluxDB data is stored. With all the data deleted, the database shouldn’t take long to load the shards.

Some additional information might be helpful to figure out what’s going on:

  1. What operating system are you using? How are you installing InfluxDB?
  2. Are you using an SSD?
  3. Can you share your configuration file?

You can also try the following commands to create some debug information:

curl -o profiles.tar.gz "http://localhost:8086/debug/pprof/all?cpu=true"
curl -o vars.txt "http://localhost:8086/debug/vars"
iostat -xd 1 30 > iostat.txt

Please note: the first cURL command above will take at least 30 seconds to return a response, because it runs a CPU profile as part of its information gathering, and that profile takes 30 seconds to collect.
Ideally, you should run these commands while you’re experiencing the problems, so we can capture the state of the system at that time.
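If it helps, the three commands above can be wrapped in a small helper that bundles whatever it manages to collect into one archive for attaching here. This is just a sketch, not an official tool: the script and archive names are made up, and it skips any command that isn’t available on your machine.

```shell
#!/bin/sh
# Hypothetical helper: run the debug commands from the post above and
# bundle their output into a single archive. Names are illustrative.
set -u

DEBUG_DIR="$(mktemp -d)"
OUT="influxdb-debug-$(date +%Y%m%d%H%M%S).tar.gz"

if command -v curl >/dev/null 2>&1; then
    # CPU + heap profiles; this one blocks for ~30s while the CPU profile runs.
    curl -s -o "$DEBUG_DIR/profiles.tar.gz" \
        "http://localhost:8086/debug/pprof/all?cpu=true" || true
    # Runtime statistics counters.
    curl -s -o "$DEBUG_DIR/vars.txt" "http://localhost:8086/debug/vars" || true
fi

if command -v iostat >/dev/null 2>&1; then
    # Disk stats; the post above samples 30 times, shortened to 3 here.
    iostat -xd 1 3 > "$DEBUG_DIR/iostat.txt" || true
fi

# Bundle whatever was collected (the archive is created even if some
# commands were skipped or failed, e.g. when influxd is not running).
tar -czf "$OUT" -C "$DEBUG_DIR" .
echo "wrote $OUT"
```

The `|| true` guards keep the script going even when influxd isn’t reachable, so you still get a partial bundle.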