Schema design: peer review request


#1

Hi all,

I’m a total influxdb newbie and I would really appreciate a schema review from experts.

My (academic) project is based on some Green Button data (energy consumption in kWh) stored in CSV files. We have one CVS file per site, each containing the following columns: date time utc, consumption value, ‘estimated’ flag. On the other end, we have some metadata concerning the sites: industrial sector, latitude, longitude, time zone. Reading a previous post I decided to create a single measurement called ‘energy_usage’ containing points with the following format :

energy_usage_point = {
    ‘measurement’: ‘energy_usage’,
    ‘time’: string ,
    ‘fields’: {
        ‘value’: float
    },
    ‘tags’: {
        ‘site’: int,
        ‘industry’: string,
        ‘latitude’: float,
        ‘longitude’: float,
        'geohash': string,
        ‘time_zone’: string
    }
}

This scheme seems to provide acceptable performances but could we do better?

I had a look to the schema design recommandations provided by the (excellent) influxdb doc [1]. The very first recommandation tells us to limit the number of series. My schema generates 205 series [2] for the single measurement we have. Is that to much or acceptable ?

Thanks for your help.
Nicolas.

[1] https://docs.influxdata.com/influxdb/v1.6/concepts/schema_and_data_layout 1

[2] influx -database ‘mydatabase’ -format ‘csv’ -execute ‘SHOW SERIES’ | grep -v “name,_id,host” | wc -l


#2

I’m assuming that:

  • You have one sensor per site.
  • The industry, latitude, longitude, geohash and time_zone rarely or never changer per site.

So I suggest to use two series

energy_usage_point = {
    'measurement': 'energy_usage',
    'time': string,
    'fields': {
        'value': float
    },
    'tags': {
        'site': int
    }
}
energy_usage_site = {
    'measurement': 'energy_usage_site',
    'time': string,
    'fields': {
        'industry': string,
        'latitude': float,
        'longitude': float,
        'geohash': string,
        'time_zone': string
    },
    'tags': {
        'site': int
    }
}

You’ll import all your energy measurements in energy_usage.

You’ll import the site information only when it changes, so maybe just once.

The site tag will allow to group by “site”.

When you want to get the current site metadata, you just do a query such as

SELECT LAST(*) FROM energy_usage_site

#4

Thanks Samust.

I’m a bit confused cause my need is, for instance, to be able to place ‘energy usage’ data on a map. With the schema you propose, I don’t see how I can do that due to the following influxdb limitation. Clearly speaking, I don’t see how to execute an inner join on “energy_usage” and “energy_usage_site” measurements.


#5

The other option is to put everything in the same measurement but make industry, latitude, longitude, geohash and time_zone fields instead of tags, unless you’ve got some good reason to use them as tags.


#6

samaust,
industry, is one the entities I need to “GROUP BY”. Influxdb documentation says that such data should but a tag in this case. I’m a bit lost…


#7

That’s what I meant by “unless you’ve got some good reason to use them as tags”. Sorry if I was not explicit enough. I meant to suggest to place the data you want to group by in tags and everything else in fields.

energy_usage_point = {
    'measurement': 'energy_usage',
    'time': string,
    'fields': {
        'value': float
        'latitude': float,
        'longitude': float,
        'geohash': string,
        'time_zone': string
    },
    'tags': {
        'site': int,
        'industry': string
    }
}

#8

Ok. If we move geohash to the tags section then I think we have the ultimate schema.
Thanks a lot for your help.