Newbie Question: Optimization of large data set

Pardon the uninformed question, but I’m new to time-series databases in general and InfluxDB in particular, and I’m looking for advice.

I’m in the process of implementing CHORDS, a meteorology application that uses InfluxDB to store measurements. If you’re interested, you can see the demo at http://ncar.github.io/chords/ or the wiki at the earthcubeprojects-chords/chords repository on GitHub.

In my particular case, I have 8 active towers (6 collect every 15 minutes, all 8 collect every 24 hours). The data coming in ranges from 2 values (day of year and precipitation) to 50 values (temperature at multiple levels, etc.). In addition to the active towers, I have 7 more towers with 15-minute and 24-hour data that I’d like to keep together with the rest.
Since I’m splitting the data so that each measurement from a tower/time period becomes its own row, I would guesstimate that I’ll end up with around 1-2 billion rows. Hardware isn’t an issue, as we’re running on a “cloud” of virtual machines and I can just ask for more memory, drive space, processors, etc.
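Roughly, the back-of-the-envelope math looks something like this (treating every tower as having 50 variables at a 15-minute cadence for ~30 years is an oversimplification, so this is only an order-of-magnitude sketch):

# Order-of-magnitude sketch only; real variable counts and history length vary by tower.
vars_per_tower = 50          # up to 50 values per 15-minute record
points_per_day = 24 * 4      # one record every 15 minutes
towers = 15                  # 8 active + 7 historical
years = 30                   # the oldest records go back to 1992

total_rows = vars_per_tower * points_per_day * 365 * towers * years
print(f"{total_rows:,}")     # ~788,400,000, i.e. on the order of a billion rows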

The write load once the data is loaded will be minimal, but queries could be reasonably heavy depending on the time period and/or values requested. As an example, one of my monthly reports builds a list of precipitation at each of the active towers for those days during the month that had rain.
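To make that concrete, the precipitation report boils down to something like the sketch below (the database name and the var number for precipitation are placeholders, not my real configuration):

# Sketch of the monthly precipitation report; names and var numbers are placeholders.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="chords_ts")

# Daily precipitation totals per site for one month. 'var' and 'site' are tags
# on the tsdata measurement, so their values are compared as strings.
result = client.query(
    'SELECT SUM("value") AS daily_precip FROM "tsdata" '
    "WHERE \"var\" = '140' "
    "AND time >= '2018-06-01T00:00:00Z' AND time < '2018-07-01T00:00:00Z' "
    'GROUP BY time(1d), "site"'
)

# Keep only the days that actually had rain.
for (_, tags), points in result.items():
    for p in points:
        if p["daily_precip"]:
            print(tags["site"], p["time"], p["daily_precip"])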

As it stands now, I have data retention set to infinite (users need access to all of it at any given time). Given that, what can I do, and/or what can I ask the primary developers at NCAR to do, to optimize and allow flexibility for both large and small data sets in a single installation?
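For reference, “infinite” here just means the default retention policy never expires data; conceptually something like this (database and policy names are placeholders):

# Placeholder names; this just illustrates an infinite-duration default retention policy.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086)
client.create_retention_policy(
    "keep_forever", duration="INF", replication="1",
    database="chords_ts", default=True
)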

Thanks,

Dennis

@lanl_dlittle Thanks for the question! I’m not quite sure what you are asking or how your math adds up. It sounds like you need to store relatively low-granularity data for 15 towers, yet you are talking about billions of rows.

The key to optimization in use cases like this is proper schema design; the schema design documentation should help you get started. Do you have an idea/sample of what the data looks like in InfluxDB line protocol?

You are correct that the data is fairly coarse at the 15-minute and 24-hour levels, but there are quite a few measurements per tower (up to 50 per time period per tower). The example rows below are for “TA-54”, whose 15-minute data goes back to 01/29/1992 @ 2:15 AM.

Here are a couple of example rows…

tsdata,inst=18,site=6,test=true,var=133 value=29 696716100000000000
tsdata,inst=18,site=6,test=true,var=133 value=29 696717000000000000
tsdata,inst=18,site=6,test=true,var=133 value=29 696717900000000000
tsdata,inst=18,site=6,test=true,var=133 value=29 696718800000000000
tsdata,inst=18,site=6,test=true,var=133 value=29 696719700000000000

Note that I split each “value” out into its own row, so what would normally be vars 133-155 and 181-185 end up in different files. There are roughly 900K rows per file.
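The splitting itself is nothing fancy; conceptually it is just exploding each wide record into one line-protocol row per variable, along these lines (the var IDs and values below are made up for illustration):

# Illustrative only: explode one wide 15-minute record into per-variable line protocol rows.
def explode_record(site, inst, timestamp_ns, values_by_var):
    lines = []
    for var, value in sorted(values_by_var.items()):
        if value is None:          # skip missing readings entirely
            continue
        lines.append(
            f"tsdata,inst={inst},site={site},test=true,var={var} "
            f"value={value} {timestamp_ns}"
        )
    return lines

# One record with a handful of its variables:
for line in explode_record(6, 18, 696716100000000000, {133: 29, 134: 28.5, 181: 0.0}):
    print(line)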

Unrelated question: How are people dealing with null values? Just as an example of where this is needed, we had a wind direction meter that had a bad bearing from the time it was installed until it was taken down (about 6 months). The meteorologist determined that all values from that instrument for that period should be wiped out. I can’t make them 0 (that’s a valid value AND it would show on a chart). A null is the best option, but from what I’ve read, as of v0.9 InfluxDB doesn’t do nulls.
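The workaround I’ve been considering is to just leave gaps instead, i.e. delete (or never write) the points for the bad window, roughly like the sketch below (tag values and dates are placeholders; I believe recent 1.x releases support DELETE with tag and time predicates, but I haven’t checked older versions):

# Placeholder tags/dates; drops the points for the bad window instead of storing nulls.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="chords_ts")
client.query(
    'DELETE FROM "tsdata" '
    "WHERE \"inst\" = '18' AND \"var\" = '133' "
    "AND time >= '2017-01-01T00:00:00Z' AND time < '2017-07-01T00:00:00Z'",
    method="POST",   # data-modifying statements have to go over POST
)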

Side note: If you’re interested in building a CHORDS portal (it uses Ruby on Rails, MySQL for configuration info, InfluxDB for data storage, and Grafana, all wrapped in a Docker container), I can pass along my configuration so you can see the actual towers/measurements. From there, I can pass on actual data values for at least a few of the measurement fields as .gz files to import. Because they’re text, they compress really nicely, e.g. from 61,000K down to 2,900K.

Dennis