Storing sensor data correctly

I work with a client that ingests roughly 1 billion events per month from IoT devices.
These can be GPS positions, sensor readings, etc.

Currently everything is streaming data in Kafka, but we are exploring the possibility of doing time-series analytics on this data too.

After our initial spike, there is some confusion about how to properly store all this in InfluxDB.

There are currently ~200 000 devices, each with its own ID, and that number is expected to double over the next year.

To get started, we tried to store just GPS positions.

Measurement: position

Longitude float
Latitude float
DeviceID string TAG
Time timestamp

While trying to ingest this, InfluxDB shuts down after a few hundred thousand records, complaining that the number of values per tag is too high.

warn    max-values-per-tag limit may be exceeded soon   
{
    "log_id": "0PkVo5A0000", 
    "service": "store", 
    "perc": "100%", 
    "n": 100096, 
    "max": 100000, 
    "db_instance": "foo",
    "measurement": "position", 
    "tag": "simicc"
}

Are we going about this in the wrong way?
Are tags supposed to hold a finite, smaller set of values, like colors, device types, etc.?

I also read this on the documentation page:

The measurement acts as a container for tags, fields, and the time column, and the measurement name is the description of the data that are stored in the associated fields. Measurement names are strings, and, for any SQL users out there, a measurement is conceptually similar to a table. The only measurement in the sample data is census. The name census tells us that the field values record the number of butterflies and honeybees - not their size, direction, or some sort of happiness index.

So are we trying to do old-school tabular data where we shouldn’t here?

TL;DR: are we doing it wrong, or is InfluxDB simply the wrong tool for this?

@rogeralsing - I love that you’re evaluating InfluxDB at this scale. You have the right idea that tags are meant for a more finite set of values. Device IDs (or any GUID, really) are not good for tags. The reason is that tag values are indexed so that queries can find matching data quickly. If the cardinality of the tag values is too large, index performance suffers. There isn’t a hard limit here, but max-values-per-tag (which is configurable) gives an upper bound we expect “most average users” to be under. Your throughput is higher than many.

I suspect you aren’t planning on querying for individual devices that often and are more interested in aggregate results across all devices or some group of devices (region, the machine/vehicle the GPS is on, etc.), which would be more finite. If that is the case, you won’t be impacted much by having the device ID as a field instead of a tag. (Regions, vehicle type, etc. would potentially be good tags.)
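
To make the tag/field split concrete, here is a minimal sketch in Python that formats one GPS point as InfluxDB line protocol, with the device ID stored as a string field and a low-cardinality value stored as a tag. The field name dguid, the tag name region, and the sample values are assumptions for illustration only; they are not taken from your schema.

```python
def escape_tag(value: str) -> str:
    # Line protocol requires commas, equals signs and spaces
    # in tag values to be backslash-escaped.
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value


def escape_string_field(value: str) -> str:
    # String field values are double-quoted; backslashes and
    # double quotes inside them must be escaped.
    return value.replace("\\", "\\\\").replace('"', '\\"')


def position_line(device_id: str, lat: float, lon: float,
                  ts_ns: int, region: str) -> str:
    # "region" (hypothetical) is a low-cardinality tag and is indexed.
    tags = f"region={escape_tag(region)}"
    # The device ID goes into a string field ("dguid" is an assumed name),
    # so it does not add to the series cardinality.
    fields = (
        f"latitude={float(lat)},longitude={float(lon)},"
        f'dguid="{escape_string_field(device_id)}"'
    )
    # Timestamp is in nanoseconds, the line protocol default precision.
    return f"position,{tags} {fields} {ts_ns}"


line = position_line("98FF07E7-127A", 59.3293, 18.0686,
                     1600000000000000000, "eu-north")
print(line)
```

One thing to watch with this layout: InfluxDB identifies a point by measurement, tag set, and timestamp, so two devices that share the same tags and report at exactly the same timestamp would overwrite each other. Nanosecond timestamps or an extra low-cardinality tag reduce that risk.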

Consider also what questions (queries) and analytics you want to perform. Deciding on the data schema is critical for leveraging InfluxDB, and the schema needs to be informed by your expected queries (as well as the data itself).

These schema help docs are for InfluxDB 1.8, but I suggest you check out 2.0 OSS (in release candidate right now). Our cloud offering might also be good for you (there’s a free tier and pay as you go, but also cardinality limits for both).

Let us know what you decide.


We actually need to query per device. Devices belong to customers and we have no interest in aggregating data across customers. I don’t think it’s even legal.

@alexeyzimarev @rogeralsing - Ah! I took a guess on your use case. I can’t speak to any legal requirements.

I still recommend that you use a field value for the device ID, along the lines of dguid="98FF07E7-127A". Time-based GUIDs might compress better if you must use strings; numeric IDs would compress even better. Querying for a specific device will require a data scan to find it, but if you segment queries reasonably with filters (on time and/or other tags), I would expect this to meet your needs. If query speed is still an issue, there are tricks like hashing IDs into a small number of buckets (~100 or fewer) and using this hash as a tag, which narrows the search to about 1% of the data for any given device ID. I’m assuming here you’ll always know what device you are looking for. If you don’t have the ID, then at least you aren’t scanning over IDs.
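
As a rough sketch of the bucketing trick, the Python below hashes a device ID into one of 100 buckets and builds an InfluxQL query that filters on the bucket tag first and the device ID field second. The tag name id_bucket, the choice of CRC32 as the hash, and the query shape are my assumptions for illustration; any hash that is stable across writers and readers would do. The same device_bucket function would have to be applied at write time to set the tag.

```python
import zlib

NUM_BUCKETS = 100  # assumed bucket count; keeps the tag cardinality tiny


def device_bucket(device_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    # CRC32 is stable across processes and machines, unlike Python's
    # built-in hash() for strings, which is randomized per process.
    return zlib.crc32(device_id.encode("utf-8")) % num_buckets


def device_query(device_id: str, since: str = "1d") -> str:
    # Filter on the indexed bucket tag first, then on the unindexed
    # device ID field. device_id is assumed to be trusted input here;
    # real code should validate or parameterize it.
    bucket = device_bucket(device_id)
    return (
        'SELECT "latitude", "longitude" FROM "position" '
        f"WHERE \"id_bucket\" = '{bucket}' AND \"dguid\" = '{device_id}' "
        f"AND time > now() - {since}"
    )


print(device_query("98FF07E7-127A"))
```

With evenly distributed IDs, each bucket holds roughly 1/100 of the devices, so the field comparison only has to scan that slice of the data within the time range.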

If you are willing to share what sorts of per-user/device analytics you are thinking about, I could hopefully advise further.

Let us know what you go with.

-p
(If you’re finished with the conversation, please check “solution”.)
