Maintaining field relationships

Hi all, I’m new to InfluxDB and trying to wrap my wetware around the storage model and best practices, and would love some advice.

I have the following scenario: I’m storing a stream of event objects, each of which has the same set of attributes. As a reduced example, let’s think of them as {time, measure, location_group, location, val1, val2, val3}.

These originate from live sources with ms accuracy (timestamped at the source), as well as from batch imports out of systems with much lower time granularity (minutes or tens of minutes - human survey info).

In an ideal world, location_group and location would both be tags; however, location is a string with a potentially large number of distinct values, so making it a tag would severely blow out series cardinality. As a result, it is currently mapped to a field along with val1, val2 and val3.
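For illustration, here’s roughly how one of these events is written at the moment (the values are made up, and I’m using the Python client purely as an example):

from influxdb_client import Point, WritePrecision

# location_group stays a (low-cardinality) tag; location and the numeric
# values become fields, so location does not add to series cardinality.
p = (
    Point("measure")
    .tag("location_group", "survey_area_7")
    .field("location", "bridge north")
    .field("val1", 1.2)
    .field("val2", 3.4)
    .field("val3", 5.6)
    .time(1695029400123000000, WritePrecision.NS)
)
print(p.to_line_protocol())
# measure,location_group=survey_area_7 location="bridge north",val1=1.2,val2=3.4,val3=5.6 1695029400123000000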

When querying with Flux it’s possible to pivot on time and location_group to restore the original events… unless there are two events with the same location_group at the same moment in time. This happens frequently with the low-time-granularity sources (e.g. a survey reading at 09:30 am). When it does, events are overwritten at storage time because they share the same measurement, tag set and timestamp.
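To make the failure concrete, here are two made-up 09:30 survey readings shown as the line protocol the client would send:

# Same measurement, same tag set, same timestamp - only the field values
# differ, so the second record silently replaces the first when written.
colliding = [
    'measure,location_group=survey_area_7 location="bridge north",val1=1.2 1695029400000000000',
    'measure,location_group=survey_area_7 location="bridge south",val1=3.4 1695029400000000000',
]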

To counteract this, one approach that I’ve been experimenting with is to hash the location into 20 bits of noise that is appended to the timestamp prior to storage. This provides unique, deterministic time values for pivoting, at the expense of up to ~1.05 ms of error (2^20 ns), which is fine for this use case.
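For reference, a minimal sketch of the idea in Python (the specific hash is arbitrary here - anything deterministic works, and simply adding the noise rather than masking the low bits first is just the easiest variant):

import hashlib

NOISE_BITS = 20  # 2**20 ns ≈ 1.05 ms of worst-case added error

def noisy_timestamp(ts_ns: int, location: str) -> int:
    """Add a deterministic 20-bit hash of the location to a nanosecond timestamp."""
    digest = hashlib.blake2b(location.encode("utf-8"), digest_size=8).digest()
    noise = int.from_bytes(digest, "big") & ((1 << NOISE_BITS) - 1)
    return ts_ns + noise

# Two 09:30 readings in the same location_group now land on distinct,
# repeatable timestamps instead of overwriting each other.
print(noisy_timestamp(1695029400000000000, "bridge north"))
print(noisy_timestamp(1695029400000000000, "bridge south"))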

This works, but it feels hacky. It also looks as though it may impact compression, based on the info here.

Are there other approaches for handling this situation?

@kim, I’d have to look at some sample data to give you a definitive “yes,” but if I understand the nature of your data and what you’re trying to accomplish, it is possible with Flux. I don’t know that you need to pivot by both time and location_group though. The table groupings should already group by location_group after the pivot.

from(bucket: "example-bucket")
  |> range(start: -1h)
  |> pivot(rowKey: ["_time", "location_group"], columnKey: ["_field"], valueColumn: "_value")

Awesome. Thanks @scott.

The pivot uses both the time and location_group as keys to reduce the probability of a hash collision. It shrinks the collision space from all location identifiers to just those within a location group (there are a few other tags used for keying in the ‘real’ model too). This lets the setup get away with a smaller hash output (less timestamp noise), at the expense of a little more complexity when pivoting.
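For anyone curious, the pivot ends up looking something like this from the Python client (“site” is a stand-in for the extra tags, and the connection details are placeholders):

from influxdb_client import InfluxDBClient

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")

# rowKey lists _time plus every tag that scopes the hash collision space;
# "site" stands in for the additional tags in the real model.
flux = '''
from(bucket: "example-bucket")
  |> range(start: -1h)
  |> pivot(rowKey: ["_time", "location_group", "site"], columnKey: ["_field"], valueColumn: "_value")
'''
tables = client.query_api().query(flux)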