Do shorter field names save database space?

I plan to save per-second flight data for a large number of planes, with perhaps 50 different fields in each record. Is the actual name of each field saved with every record written?

In other words, will the database be smaller by using short fields such as “lat”, “lon”, and “alt” instead of “latitude”, “longitude”, and “altitude”, etc.? Or do values get mapped to longer, more meaningful fields without storing the actual field name each time?

Field names are stored in the shard file once per shard per measurement.

That means field names will have minimal impact on space. If there are a bunch of measurements with the same fields, then those field names will be stored multiple times and will consume more space (again, one copy of the field names per measurement). However, the space used would still be negligible compared to the amount of data stored as field values.

A thorough and clear answer. Thank you.

@jbbarnes I realized I made a big error in my previous answer.

The field key (name) is actually recorded once per series (not once per measurement) in each TSM file (not once per shard). This means that long field names can have an appreciable impact on database space when series cardinality is high and when a shard is not fully compacted into a single TSM file.

For example, if there was an InfluxDB database with:

  • 1 million series
  • one fully compacted shard
  • each series has field keys named “latitude”, “longitude”, and “altitude”

Then the field names would take up (25 bytes * 1,000,000 series) = 25 MB, since the three names are 8 + 9 + 8 = 25 characters long.

If the field names were “lat”, “lon”, and “alt”, they would require (9 bytes * 1,000,000 series) = 9 MB.

In this case, a long-running InfluxDB instance would save 16 MB per shard by using the shorter field names. Also, any shard actively receiving high write throughput could contain tens to hundreds of underlying TSM files. Those TSM files will eventually be compacted into a single TSM file, but until then the field names are repeated for each series in each one.

In the context of an InfluxDB database with 1 million series, the space used by field names will still be a relatively small piece of the overall disk space used, but it will not be a negligible amount of space.
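The arithmetic above can be reproduced with a short script. This is only a back-of-the-envelope sketch of the model described in this post (field keys repeated once per series per TSM file); the function name `field_name_bytes` and the `tsm_files` parameter are made up for illustration, and it ignores everything else InfluxDB stores.

```python
def field_name_bytes(field_names, series_count, tsm_files=1):
    """Estimate bytes spent on field keys, assuming each series repeats
    its field keys once in every TSM file (the model described above)."""
    per_series = sum(len(name.encode("utf-8")) for name in field_names)
    return per_series * series_count * tsm_files


long_names = ["latitude", "longitude", "altitude"]  # 8 + 9 + 8 = 25 bytes
short_names = ["lat", "lon", "alt"]                 # 3 + 3 + 3 = 9 bytes
series = 1_000_000

long_bytes = field_name_bytes(long_names, series)
short_bytes = field_name_bytes(short_names, series)

print(f"long names:  {long_bytes / 1e6:.0f} MB")
print(f"short names: {short_bytes / 1e6:.0f} MB")
print(f"difference:  {(long_bytes - short_bytes) / 1e6:.0f} MB per compacted shard")
```

Passing a larger `tsm_files` value gives a rough idea of the overhead while a busy shard is still spread across many uncompacted TSM files.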

@gunnar: Thank you very much for the revision. It’s important to know. Pardon me if my InfluxDB vocabulary (series, TSM, measurement, point, shard) is imprecise, as I am brand new to this database and used to speaking in SQL terms.

In short, our database will have only one retention policy and one measurement: “flight_data”. The only tag is “plane_id” (there may be thousands of planes tracked). Once per second a new entry (a “point”?) will be saved with the timestamp, plane_id tag, and a few dozen fields, such as longitude and latitude. There are potentially hundreds of fields that might be used, though most of them change only occasionally and won’t be recorded unless they changed in the last second. So fields like “air_temperature” change only occasionally and ones like “gear_up” or “seatbelt_sign_on” would only happen a couple of times per flight.
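To make that write pattern concrete, here is a rough sketch of how one point per second could be built in InfluxDB line protocol while sending only the fields that changed. The helper names (`format_value`, `changed_fields`, `to_line`) and the plane ID `N123AB` are hypothetical, the timestamp is assumed to be in whole seconds, and tag or field-key escaping is left out for brevity.

```python
def format_value(value):
    """Render a Python value as a line-protocol field value."""
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    # Anything else is written as a double-quoted string.
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def changed_fields(previous, current):
    """Return only the fields whose value differs from the last report."""
    return {k: v for k, v in current.items() if previous.get(k) != v}


def to_line(measurement, plane_id, fields, timestamp_s):
    """Build one line-protocol point, or None if nothing changed."""
    if not fields:
        return None
    field_part = ",".join(
        f"{key}={format_value(val)}" for key, val in sorted(fields.items())
    )
    return f"{measurement},plane_id={plane_id} {field_part} {timestamp_s}"


previous = {"latitude": 47.6, "longitude": -122.3, "gear_up": False}
current = {"latitude": 47.7, "longitude": -122.3, "gear_up": True}

print(to_line("flight_data", "N123AB", changed_fields(previous, current), 1700000000))
```

In this example only `latitude` and `gear_up` end up in the point, because `longitude` did not change between the two reports.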

In trying to make sense of the terms here (https://docs.influxdata.com/influxdb/v1.5/concepts/key_concepts/), I think that would constitute one series per unique plane_id, if I’m understanding cardinality.

So in this case, it sounds like the longer field names would be recorded only once per individual plane, and in fact only for those fields that happen to apply to that particular aircraft. In other words, the more readable field names would add a little extra overhead once per plane, not once per second that it reports in. Am I understanding that correctly? Thanks.

Yup, that is correct and sounds reasonable. The field names will have a minimal impact on overall space used in your case.

Overall, field name length will have much less impact on space used than some other big factors, such as using the correct precision, using the correct data types, and setting longer shard group durations on retention policies.

@gunnar

Thanks for verifying that for me. Regarding your suggestions, the one optimization it looks like we can’t do is use the best precision for the timestamp. We only need it accurate to the second, not the nanosecond, which is the default.

We are using the UDP interface to InfluxDB and have found it to be very fast. We plan on handling a lot of throughput, and it doesn’t matter if a few points get dropped. But we have not found a way either to specify the precision in the UDP write command or to set a default for the entire database or measurement ahead of time.

I have searched through historical user requests and discussions among the InfluxDB team posted online, and found plenty of discussion about this issue going back several years, but it doesn’t look like it has been addressed yet. Do you have any extra insight on whether we could specify an “s” precision while using UDP now or in an upcoming release?

I took a look at the code, and the UDP interface does support a configurable precision as a configuration option in the UDP section of the config file. Unfortunately, that option was not documented in the example configuration file or in the docs. I’ve opened PRs to address that discrepancy.

@gunnar

Thank you for going the extra mile. I added a precision entry to our influxdb.conf file and it worked.
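For anyone else landing here, this is roughly what the client side can look like once the UDP listener is set to second precision. It is a sketch under assumptions, not our production code: the host, the port 8089, the plane ID, and the function names are placeholders, and it assumes the precision entry in the UDP section of influxdb.conf is set to seconds so that the integer timestamp below is read as seconds rather than nanoseconds.

```python
import socket
import time


def point_with_second_timestamp(plane_id, latitude, longitude, now=None):
    """Build one line-protocol point with a whole-second Unix timestamp."""
    ts = int(now if now is not None else time.time())  # drop sub-second part
    return f"flight_data,plane_id={plane_id} latitude={latitude},longitude={longitude} {ts}"


def send_points(lines, host="127.0.0.1", port=8089):
    """Send newline-separated points in one UDP datagram (fire and forget).

    host and port are assumptions; use whatever the UDP listener binds to.
    Returns the number of bytes handed to the socket.
    """
    payload = "\n".join(lines).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))


if __name__ == "__main__":
    line = point_with_second_timestamp("N123AB", 47.6, -122.3)
    send_points([line])
```

Because UDP gives no acknowledgement, a dropped datagram is simply lost, which is acceptable for our use case.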