What is the easiest way to fill an InfluxDB /data directory to a minimum size (for testing)?

influxdb
#1

I am testing a patch to Influx that I wrote to fix the 2 or 3 GB data limit on 32-bit systems (see GitHub issues below). I’m having trouble testing the fix because I can’t get the data directory up to a sufficient size. I filled a single measurement to around 300 MB and have been copying that data to other measurements using select * into X from Y. I need to do this multiple times to begin to get a good DB size, and it is sloooooow. As in, “I need to run it overnight and it’s still not done” slow. I’m running on an embedded system, so I expect it to be slower than a big ol’ server, but it’s almost as slow as just streaming random data to the DB (which is how I got the 300 MB to begin with). Surely copying data from one measurement to another should be much faster, right?

Another problem is that in some instances Influx appears to be compacting the TSM files almost as fast as I can populate them, even when I set the cold duration to something huge like 40000h. Compaction starts as soon as I start Influx no matter what I set it to, and in top I can see influxd chugging along working hard. I’d like to disable compaction entirely, but I don’t seem to be tweaking the right dial.

So essentially I’m looking for guidance on how to fill up my database as efficiently as possible and keep it at a large size so I can test my fix before doing a PR. Any ideas anyone?


#2

Replying to my own question: there’s a utility called influx_stress that’s included with InfluxDB. I’m using it to put random data into the database, and it is apparently more efficient than what I was doing before, because the growth in MB/sec is much higher than with either the randomized data stream or the select * into method. It appears to be a good fit for my test case.

Edit: heads up that the influx_stress included with InfluxDB is different from influx-stress, which is also written and maintained by InfluxData. The influx-stress tool has more features useful for my test case.

Edit 2: Apparently the data written by influx_stress and influx-stress is not random. Both tools simply write the same values over and over, so when the files are compacted they get squished to almost nothing by run-length encoding. This isn’t a solution after all.
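For anyone hitting the same wall, here is a rough sketch of generating test data that should resist that kind of compression. It is not something from either stress tool; it just emits InfluxDB line protocol with random float values and slightly jittered timestamps, so neither the values nor the timestamp deltas repeat. The measurement name, the host tag, the field name and the value range are all made up for illustration.

```python
import random


def random_points(measurement, count, start_ns, step_ns, seed=None):
    """Yield line-protocol lines with random float values.

    All names here (the "host" tag, the "value" field) are hypothetical;
    adjust them to match the schema you want to test with.
    """
    rng = random.Random(seed)
    for i in range(count):
        # Random floats rarely repeat, so run-length encoding has
        # nothing to collapse.
        value = rng.uniform(-1e6, 1e6)
        # A handful of tag values spreads the points over several series.
        host = "host{}".format(rng.randrange(16))
        # Jitter stays inside each step, so timestamps remain strictly
        # increasing but their deltas are irregular.
        ts = start_ns + i * step_ns + rng.randrange(step_ns)
        yield "{},host={} value={!r} {}".format(measurement, host, value, ts)


def write_lines(path, lines):
    """Write the generated lines to a text file, one point per line."""
    with open(path, "w") as f:
        for line in lines:
            f.write(line + "\n")
```

Something like write_lines("points.txt", random_points("fill_test", 1000000, 0, 1000000000)) produces a file that can then be pushed to the database in batches, for example through the HTTP write endpoint. How much disk space each point ends up taking after compaction is something I have not measured, so treat the point count as a starting guess.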