Do shorter field names save database space?

jbbarnes · June 30, 2018, 8:07pm

I plan to save per-second flight data for a large number of planes, with perhaps 50 different fields in each record. Does the actual name of each field get saved for every record written?

In other words, will the database be smaller by using short fields such as “lat”, “lon”, and “alt” instead of “latitude”, “longitude”, and “altitude”, etc.? Or do values get mapped to longer, more meaningful fields without storing the actual field name each time?

gunnar · July 2, 2018, 11:50pm

Field names are stored in the shard file once per shard per measurement.

That means field names will have minimal impact on space. If there are a bunch of measurements with the same fields, then those field names will be stored multiple times and will consume more space (again, one copy of the field names per measurement). However, the space used would still be negligible compared to the amount of data stored as field values.

jbbarnes · July 3, 2018, 4:32am

A thorough and clear answer. Thank you.

gunnar · July 3, 2018, 6:19pm

@jbbarnes I realized I made a big error in my previous answer.

The field key/name is actually recorded once for each individual series (instead of each measurement) in each TSM file (instead of each shard). This means that long field names can have an appreciable impact on database space when there is high series cardinality and when a shard is not fully compacted into a single TSM file.

For example, if there was an InfluxDB database with:

1 million series
one fully compacted shard
each series has field keys named “latitude”, “longitude”, and “altitude”

Then the field names would take up (8 + 9 + 8 = 25 bytes * 1000000 series) = 25MB.

If the field names were “lat”, “lon”, and “alt”, they would require (3 + 3 + 3 = 9 bytes * 1000000 series) = 9MB.

In this case, a long-running InfluxDB instance would see 16MB savings with shorter field names for each shard. Also, any shards actively receiving a high write throughput could contain tens to hundreds of underlying TSM files. Even though those TSM files will eventually be compacted into one single TSM, the field names will be repeated for each series in each one.

In the context of an InfluxDB database with 1 million series, the space used by field names will still be a relatively small piece of the overall disk space used, but it will not be a negligible amount of space.
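The arithmetic above can be reproduced with a short sketch. It assumes the simplified model from this post (each field key is stored once per series per TSM file, and nothing else is counted); the function name `field_key_bytes` and the `tsm_files` parameter are made up for illustration and are not part of InfluxDB.

```python
def field_key_bytes(field_keys, series_count, tsm_files=1):
    """Rough estimate of bytes spent on field names.

    Simplified model (an assumption, not InfluxDB's exact on-disk format):
    every field key is written once per series in every TSM file.
    """
    per_series = sum(len(key.encode("utf-8")) for key in field_keys)
    return per_series * series_count * tsm_files


long_keys = ["latitude", "longitude", "altitude"]
short_keys = ["lat", "lon", "alt"]

long_total = field_key_bytes(long_keys, 1_000_000)
short_total = field_key_bytes(short_keys, 1_000_000)

print(long_total, short_total, long_total - short_total)
```

Passing a larger `tsm_files` value gives a feel for the extra cost while a busy shard is still spread over many TSM files.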

jbbarnes · July 3, 2018, 11:39pm

@gunnar: Thank you very much for the revision. It’s important to know. Pardon me if my InfluxDB vocabulary (series, TSM, measurement, point, shard) is imprecise, as I am brand new to this database and used to speaking in SQL terms.

In short, our database will have only one retention policy and one measurement: “flight_data”. The only tag is “plane_id” (there may be thousands of planes tracked). Once per second a new entry (a “point”?) will be saved with the timestamp, plane_id tag, and a few dozen fields, such as longitude and latitude. There are potentially hundreds of fields that might be used, though most of them change only occasionally and won’t be recorded unless they changed in the last second. So fields like “air_temperature” change only occasionally and ones like “gear_up” or “seatbelt_sign_on” would only happen a couple of times per flight.
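As a rough sketch of what such a point could look like, the snippet below builds an InfluxDB line protocol string containing only the fields that changed since the previous second. The helper names (`changed_fields`, `to_line`, `format_value`) and the plane ID are hypothetical, and the formatting is deliberately simplified: it handles only float and boolean fields and does no escaping of special characters.

```python
def format_value(value):
    """Format a field value for line protocol (floats and booleans only)."""
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, float):
        return repr(value)
    raise TypeError(f"unsupported field type: {type(value).__name__}")


def changed_fields(previous, current):
    """Return only the fields whose value differs from the previous reading."""
    return {k: v for k, v in current.items() if previous.get(k) != v}


def to_line(measurement, tags, fields, timestamp_ns):
    """Build one line protocol point: measurement,tags fields timestamp."""
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={format_value(v)}" for k, v in fields.items())
    return f"{measurement},{tag_part} {field_part} {timestamp_ns}"


previous = {"latitude": 40.0, "longitude": -75.0, "gear_up": False}
current = {"latitude": 40.001, "longitude": -75.0, "gear_up": True}

line = to_line(
    "flight_data",
    {"plane_id": "N12345"},  # hypothetical plane ID
    changed_fields(previous, current),
    1530661200000000000,  # nanoseconds, the default precision
)
print(line)
```

Here the unchanged longitude is left out of the point, so only latitude and gear_up are written for that second.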

In trying to make sense of the terms here (InfluxDB key concepts | InfluxDB OSS 1.5 Documentation), I think that would constitute one series per unique plane_id, if I’m understanding cardinality correctly.

So in this case, it sounds like the more lengthy field names would be recorded only once per individual plane, and in fact only those fields that happen to apply to that particular aircraft. It sounds like, in this case, the more readable field names would cause a little extra overhead once per each plane, and not per each second that it reports in. Am I understanding that correctly? Thanks.

gunnar · July 5, 2018, 7:03pm

jbbarnes wrote: “So in this case, it sounds like the more lengthy field names would be recorded only once per individual plane, and in fact only those fields that happen to apply to that particular aircraft. It sounds like, in this case, the more readable field names would cause a little extra overhead once per each plane, and not per each second that it reports in. Am I understanding that correctly? Thanks.”

Yup, that is correct and sounds reasonable. The field names will have a minimal impact on overall space used in your case.

Overall, field name length will have much less impact on space used than other big factors, such as using the correct timestamp precision, using the correct data types, and setting longer shard group durations on retention policies.

jbbarnes · July 5, 2018, 9:38pm

@gunnar

Thanks for verifying that for me. Regarding your suggestions, the one optimization it looks like we can’t do is use the best precision for the time stamp. We only need it accurate to the second, not the nanosecond, which is the default.

We are utilizing the UDP interface to InfluxDB and have found it to be very fast. We plan on handling a lot of throughput and it doesn’t matter if a few points get dropped. But we have not found a way to either specify the precision in the UDP write command, or by setting the default for the entire database or measurement ahead of time.

I have searched through historical user requests and discussions among the InfluxDB team posted online, and found plenty of discussion about this issue going back several years, but it doesn’t look like it has been addressed yet. Do you have any extra insight on whether we could specify an “s” precision while using UDP now or in an upcoming release?

gunnar · July 6, 2018, 12:53am

I took a look at the code, and the UDP interface does support a configurable precision as a configuration option in the UDP section. Unfortunately, that config option was not documented in the example configuration file or the docs. I’ve opened up PRs to address that discrepancy.

jbbarnes · July 7, 2018, 2:14am

@gunnar

Thank you for going the extra mile. I added a precision entry to our influxdb.conf file and it worked.
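For anyone landing here later, a minimal sketch of the client side might look like the following. It assumes the UDP section of influxdb.conf carries a precision entry set to seconds (shown in the comment; the database name is made up, and port 8089 is assumed to be the listener's bind address), so that timestamps can be sent as whole seconds. The helper names `make_line` and `send_point` are hypothetical.

```python
import socket

# Assumed server-side settings in influxdb.conf (check your own file):
#
#   [[udp]]
#     enabled = true
#     bind-address = ":8089"
#     database = "flights"    # hypothetical database name
#     precision = "s"         # timestamps on received points are in seconds


def make_line(plane_id, latitude, longitude, ts_seconds):
    """Build a line protocol point with a whole-second timestamp."""
    return (
        f"flight_data,plane_id={plane_id} "
        f"latitude={latitude},longitude={longitude} {int(ts_seconds)}"
    )


def send_point(line, host="127.0.0.1", port=8089):
    """Send one point over UDP. Fire and forget: drops are not detected."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))
```

A call such as `send_point(make_line("N12345", 40.001, -75.0, time.time()))` (with `import time`) would then write one point per invocation. If the precision entry is missing, the server would interpret the same number as nanoseconds, so the two sides need to agree.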
