Querying the latest tag values from a long time ago is very slow

Currently I’m trying to plot some sensor locations on a map. This means we need to query the sensor name and lat/long for each device, which are stored as tags in InfluxDB and are updated infrequently (in many cases only once). These tags, amongst others, are stored in a bucket alongside higher-frequency data.

But in order to query sensors that have been up for a long time, we need to query back a long time (right now a year, possibly a few years eventually), and this takes a really long time to return results. Right now this query takes 20-30 seconds, which is really unfortunate.

Here’s our current query

from(bucket: "live")
  |> range(start: -300d)
  |> filter(fn: (r) => r._measurement == "sensor")
  |> keep(columns: ["sensor_name", "lat", "lng"])
  |> distinct(column: "sensor_name")
  |> map(fn: (r) => ({sensor_name: r.sensor_name, lat: float(v: r.lat), lon: float(v: r.lng)}))
  |> group()

All I’m trying to get is the latest value for the sensor and its location.

My guess is that this is happening because it’s having to iterate over all of the high-frequency sensor data until it finds the last tag value from a year ago. But I thought InfluxDB stored its data in a columnar format, so if I could restrict the query to only the specific columns that rarely change, perhaps this could be done much faster because those columns would be much shorter.

My other option is just duplicating the data into another tag-only bucket, or adding a caching layer using either Redis or Postgres, but I’m hoping that InfluxDB can handle this faster than it currently does and I won’t have to get hacky.

This is my first time working with InfluxDB queries, so I’d love some advice, and if I’m approaching this wrong please let me know. Thanks!

Hi @beasteers,
So the issue might be down to runaway cardinality when running a query over that range. If you have stored the lat + lon as tags, you are correct that these are indexed, which in most instances greatly reduces time spent on operations like filtering.

In your case, an index entry is generated for each lat and lon value plus the sensor name. So depending on the number of sensors you have, and if the lat and lon change periodically (even just once a year), you are storing vastly more keys within your index than you need to.

I would try to store your lat and lon values as fields and keep your sensor name as a tag. Then use pivot() within your query.

from(bucket: "live")
  |> range(start: -300d)
  |> filter(fn: (r) => r._measurement == "sensor")
  |> filter(fn: (r) => r._field == "lan" or r._field == "lon")
  |> distinct(column: "sensor_name")
  |> pivot(rowKey:["_time"], columnKey: ["_field"], valueColumn: "_value")
  |> map(fn: (r) => ({sensor_name: r.sensor_name, lat: float(v: r.lat), lon: float(v: r.lng)}))
  |> group()

@Anaisdg what do you think? I might be barking up the wrong tree on this one.


Good thought, that could make sense! I queried the number of unique values per tag, but I’m not sure if these numbers are indicative of the problem you described.

nunique       tag
0             _start
0             _stop
46            _field
5             _measurement
1             acc_unit
1             app_name
1             bw_hz
1             coding_rate
22            deployment_id
72            dev_addr
26            dev_eui
24            dev_id
1             error_flag
1             f_port
3             fw_ver
19            gw_1_id
17            gw_2_id
13            gw_3_id
7             gw_4_id
6             gw_5_id
3             gw_6_id
1             gw_7_id
51            lat
52            lng
1             network
7             sensor_address_borough
2             sensor_address_country
18            sensor_address_house_number
22            sensor_address_neighbourhood
1             sensor_address_state
23            sensor_address_street
23            sensor_address_zip
2             sensor_agg
73            sensor_id
2             sensor_meas_delta_ms
70            sensor_name
2             sensor_reading_count
2             sensor_sleep_s
2             sensor_state
3             sensor_types
1             sf
3             weather_group
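
For reference, counts like the ones above can be pulled per tag key with something along these lines; this is just a sketch using the Flux schema package, with deployment_id as one example tag key from the table:

import "influxdata/influxdb/schema"

// Distinct values for one tag key over the same 300-day window, then count them.
// Repeat per tag key to build up a table like the one above.
schema.tagValues(bucket: "live", tag: "deployment_id", start: -300d)
  |> count()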

An update for this morning: a coworker created a large test bucket and simulated 5 years’ worth of data with 30 sensors (with lat/lng stored as fields). That query took 67 seconds to complete, and there wasn’t a significant speed difference between storing them as fields vs. tags, so it sounds like it’s potentially a different issue. Let me know if there is any other info that would be useful for me to provide.

Eventually, we could have a few hundred sensors with a few years of data so we’re hoping to have something that can scale to that :crossed_fingers:

@Anaisdg do you have any ideas on this one? :slight_smile:

@beasteers
Rather than querying the past 300 days every time, is there any way to execute this query once, write the results to a new bucket, and then periodically run the same query over shorter periods with a task, writing the latest value for each sensor and its location to that new bucket?
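
A rough sketch of the task half of that, assuming lat/lon end up stored as fields; the task name, the 1h interval, and the sensor_locations destination bucket are just placeholders:

option task = {name: "latest-sensor-locations", every: 1h}

from(bucket: "live")
  |> range(start: -task.every)
  |> filter(fn: (r) => r._measurement == "sensor")
  |> filter(fn: (r) => r._field == "lat" or r._field == "lon")
  // latest location point per sensor series in this window
  |> last()
  // append to a small bucket that only ever holds location points
  |> to(bucket: "sensor_locations")

The map query then only has to read the small sensor_locations bucket and take the last point per sensor, which should stay fast no matter how much high-frequency data accumulates in live.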


Thanks both for the help! That is actually the alternative that we’ve been thinking about recently, so I’m glad to have your endorsement on it :pray:

For my own understanding -

Since InfluxDB is a columnar database, I thought that meant the columns would be stored in a compressed format where only value changes are stored. So for any given column, getting the latest value seemed like it should be fast and easy, because it would just mean reading the most recent value off the top (as opposed to a row format).

I was just hoping that if we could restrict the range query to only a few columns that rarely change, then it would only have to perform the grouping over a couple hundred data points instead of millions (which is where I assume the time is spent), but perhaps that’s a misunderstanding?
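
For concreteness, this is roughly the shape of query I was hoping would be cheap, assuming last() can grab the most recent point per series without scanning the whole range (column names are the same ones from our data above):

from(bucket: "live")
  |> range(start: -300d)
  |> filter(fn: (r) => r._measurement == "sensor")
  // most recent point in each series; the sensor_name/lat/lng tags ride along
  |> last()
  |> keep(columns: ["sensor_name", "lat", "lng"])
  |> group()
  // one row per sensor (multiple fields would otherwise give duplicate rows)
  |> unique(column: "sensor_name")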