Data Size Questions - influx much bigger than raw data

Hi everyone,

My InfluxDB 1.8 has been running well for a year (with Grafana on top). I haven’t set any retention policy (other than the default, autogen = infinite). I inject data from CSV files on a daily basis.

But now my hard drive is getting full. The weird thing is that InfluxDB’s folder has become huge, something like 9x the size of the raw data (the CSV files). Why is that?

The disk is now 88% full and I really need to do something! Sure, I could add some retention policies, but the disk should actually be big enough as long as InfluxDB’s data size stays close to the raw data size.

Why is InfluxDB’s data size so big compared to the raw data files? Is there some kind of versioning or something similar? How can I prevent InfluxDB from growing like this?

Thank you so much for any advice,

all the best,
Vince

Hello @vince,
Can you tell me about your schema design?
What does the CSV data you’re writing look like?

Hello @Anaisdg,

thank you for your reply!
I inject 2 CSV files with InfluxDB’s Python client. Each one has a 1s resolution and covers one day (one file per day). One is ~80 MB and the other ~150 MB, so I inject ~230 MB every night, at 5 AM.

The whole thing is really weird. Here is a graph with the CPU load and the used disk space:

At 5 AM, the CPU is running the injection but the HD size doesn’t change (yellow circles)
At 9 AM (why?), the CPU is running again (why?) and the HD size changes by 55 GB (!) (red circles)

And this happens every day, so I’m running out of space in a couple of days :grimacing:

This is how I inject the data:
client.write_points(df, 'my_measurement', protocol='line', batch_size=3000)

Thank you very much for any advice!
All the best,

Vince

Hello @vince,
Does your DataFrame have timestamps? If it doesn’t, it’s possible you’re writing duplicate points.
Have you loaded data into InfluxDB from sources other than the CSV?
If your DataFrame has timestamps and the only data you’ve loaded into InfluxDB is from that CSV I’d suspect something in InfluxDB isn’t working right, like a cache never being cleared or temporary files not being deleted, maybe even something odd with the tsm engine causing it to use much bigger files than needed.

What does your schema look like from that CSV? How many tags and fields?
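The timestamp question above could be checked on one of the daily CSV files before it is written. This is only a minimal sketch using the standard library: the column name `time` and the function name `timestamp_report` are assumptions, not something taken from Vince's files. It counts rows with an empty timestamp and timestamps that occur more than once.

```python
import csv
from collections import Counter


def timestamp_report(csv_path, time_column="time"):
    """Summarise timestamp problems in a CSV file.

    `time_column` is a hypothetical column name; adjust it to the real header.
    """
    missing = 0
    counts = Counter()
    with open(csv_path, newline="") as fh:
        for row in csv.DictReader(fh):
            # A missing column or a short row yields None, treated as empty.
            ts = (row.get(time_column) or "").strip()
            if ts:
                counts[ts] += 1
            else:
                missing += 1
    # Keep only timestamps that appear on more than one row.
    duplicated = {ts: n for ts, n in counts.items() if n > 1}
    return {
        "rows": sum(counts.values()) + missing,
        "missing": missing,
        "duplicated": duplicated,
    }
```

If `missing` is 0 and `duplicated` is empty for every daily file, the files themselves are unlikely to be the source of duplicate points.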

Hello @Anaisdg ,

yes, my csv files have timestamps.
I do collect some more data with Telegraf, but I write it to a different InfluxDB database, and that database is absolutely fine. Only the CSV_DB is growing like crazy.
So I would guess the same as you: something with the cache or the temporary files. I had a look at InfluxDB’s file structure, and in data/CSV_DB/autogen/number/ I can see A LOT of ‘.tmp’ folders. It looks the same inside each number folder. Here is a picture:

Each .tmp folder’s size is really close to the .tsm file size (~230 MB, which is also really close to the raw data size).

Is this a normal structure? What about these .tmp folders? Should they remain like this, or should they be deleted?
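To put numbers on the .tmp folders, something like the following could be run against the data directory. It is a rough sketch, not an official InfluxDB tool: the function names are made up for illustration, and file sizes are summed naively, so hard-linked files would be counted more than once.

```python
import os


def dir_size(path):
    """Sum the sizes of all regular files below `path` (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total


def tmp_usage(data_dir):
    """Return ([(tmp_dir_path, bytes), ...], bytes_in_all_other_files)."""
    tmp_dirs = []
    other = 0
    for root, dirs, files in os.walk(data_dir):
        for d in list(dirs):
            if d.endswith(".tmp"):
                full = os.path.join(root, d)
                tmp_dirs.append((full, dir_size(full)))
                dirs.remove(d)  # already measured, do not descend into it
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):
                other += os.path.getsize(fp)
    return tmp_dirs, other
```

Calling `tmp_usage("/var/lib/influxdb/data/CSV_DB")` (the path is an assumption about a default Linux package install) returns the list of .tmp folders with their sizes plus the size of everything else, which shows how much of the growth the .tmp folders account for.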

fyi: I’ve got 2 separate series of daily CSV files, which I inject into 2 different measurements:

  1. ~60 fields with 1s resolution
  2. ~150 fields with 1s resolution
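For a sense of scale, the figures from this thread can be turned into a quick back-of-the-envelope calculation. The field counts (~60 and ~150), the ~230 MB daily raw size and the 55 GB daily growth are the approximate values quoted in the thread; the function names are just for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # one row per second at 1s resolution


def values_per_day(field_counts):
    """Number of field values written per day across all measurements."""
    return sum(n * SECONDS_PER_DAY for n in field_counts)


def growth_ratio(disk_growth_bytes, raw_bytes):
    """How many times larger the on-disk growth is than the raw input."""
    return disk_growth_bytes / raw_bytes


daily_values = values_per_day([60, 150])       # about 18 million values
raw_bytes = 230 * 10**6                        # ~230 MB of CSV per day
bytes_per_value = raw_bytes / daily_values     # roughly 12.7 bytes in the CSV
ratio = growth_ratio(55 * 10**9, raw_bytes)    # roughly 239x

print(daily_values, round(bytes_per_value, 1), round(ratio))
```

A daily growth of roughly 240 times the raw input is far beyond what the data itself needs, which points at leftover files rather than at the schema.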

fyi2: I do a portable backup every night. Could this cause any temp/cache problems? I found this, and it looks really similar (except I’m on Linux): During backup, the tmp file is not deleted. · Issue #20732 · influxdata/influxdb · GitHub

fyi3: yesterday I stopped the automatic (nightly) injection of the CSV files. Nothing happened overnight; the database didn’t change its size. I also triggered the injection manually at 1 PM and… 4.5h later, the disk usage had grown by 55 GB (exactly the same behaviour as with the nightly run). So the CSV injection is definitely related to the problem.

All the best,
Vince

Hello @Anaisdg ,

The problem seems to be the portable backup not deleting the .tmp folders.

I found the solution here: Running influxd.exe backup -portable leaves <x>.tmp folders in data directory on Windows · Issue #16289 · influxdata/influxdb · GitHub
Restarting Influx cleaned up all .tmp folders.

I’ll upgrade from 1.8.0 to 1.8.9 (from the Ubuntu repos) and hope it will then work without periodic restarts.

UPDATE: I upgraded to 1.8.9 but ran into a startup script issue, which I was able to solve with the help of this thread: Influxdb 1.8.7 will not start - #38 by vince

I hope the new version will now delete the .tmp folders without InfluxDB having to be restarted. @Anaisdg thank you very much for your help!