Most efficient query to filter a large number of tag values while peforming a group by

I’d like to understand what would be the efficient query to filter a large number of tag values while performing a group by . Here’s an example

Let’s assume we have 5000 hosts which have telegraf running and all of them send their metrics to an influxdb located elsewhere . Lets take the “CPU” telegraf metric for example . Now I’d like to fetch the average CPU consumed for a list of hosts which can be say around 300 . Now “host” is a tag for each metric which is unique for each host.

What would be the most efficient way to query influx to just get the average cpu for the last 1 day for 300 specific hosts?
Lets say the tag values for host are host1,host2,host3 etc

I currently know of only 2 methods
1) Just chain the OR statements
select mean(usage_system) from cpu where time > now() - 1d where host=‘host1’ or host=‘host2’ or … (for 300 hosts) group by host
2)Use regex
select mean(usage_system) from cpu where time > now() - 1d where host ~= /host1|host2|host3|…(300hosts)/ group by host

I find both of them highly inefficient . Is there a better way to do this?

Hi,
I just wanted to post an update based on a few tests I ran. I ran these tests using FLUX queries

  1. If I run a similar query with a filter of around 3000 hosts . It takes approximately 15 seconds to fetch data.
  2. If I run the same query without the host filter (keeping the date range and everything else common) , it takes just about 1 second to fetch data.

It seems like add this where filter of a bunch of hosts is slowing things down.

NOTE : Using FLUX i do it via

host=['host1','host2',...]
contains(value: r.host, set: host)

Couple of more updates. I ran queries using InfluxQL instead of FLUX

1.If I run a similar query with a filter of around 3000 hosts . It takes approximately 5 seconds to fetch data.
2. If I run the same query without the host filter (keeping the date range and everything else common) , it takes just about 2.5 seconds to fetch data.

Looks like Flux is more efficient without a chain of WHERE host filters , but once the WHERE filters come in then InfluxQL seems to be performing better

@scott @Anaisdg would you be able to help me out with this?

Flux, up to this point has been left largely unoptimized, but the current focus of the Flux engineering team is optimization. What’s causing the slowdown in Flux is the compute time necessary to compare the host value for each row to the array of hosts. I’m not certain, but I don’t believe filters that use contains() are pushed down to the storage layer, so they have to pull the data into memory and operate on it there.

What version of InfluxDB/Flux are you using?

@scott Thanks for your reply ! I’m running InfluxDB 1.7 . What would you suggest would be the most optimal? I see 4 options

  1. Use InfluxQL with filters
  2. Use Flux with filters
  3. Use InfluxQL without filters and filter it in software ( software here is the business application which uses say an influx golang client to fetch data )
  4. Use Flux without filters and filter it in software

Note : “with filters” here means :

  1. InfluxQL :
where host='host1' or host='host2' ...
  1. Flux :
host=['host1','host2',...]
contains(value: r.host, set: host)

Any other alternatives?
Thanks,
Prashanth

Do you hostnames follow a pattern? If would be more efficient to match against a regular expression rather than an explicit set.

filter(fn: (r) => r.host =~ /host[0-9]/)

@scott unfortunately they do not. The list of hosts comes from a business application which in almost all cases would never have a common pattern

Unfortunately, since you’re using InfluxDB 1.7, you aren’t able to benefit from many of the Flux optimizations that are currently rolling out. Most depend on updates to the InfluxDB storage layer and these updates are unique to InfluxDB 2.0. Whether you use InfluxQL or Flux is really up to you and your use case. There are trade-offs either way.

Thanks @scott . Do you have any suggestions on whether the filtering of hosts (when you have a large number of hosts to filter ) should be done in software or in the flux query itself? Which would be more optimal?

I can’t answer that conclusively. You’d have to test to see. I do know that contains() performs better with smaller arrays, so if there were someway to limit the number of items in the hosts array, it would help.

Got it! I’ll perform a few tests and take a call. Thanks a lot @scott !

I ran a few tests and I observed that filtering of hosts in software is much more efficient (in the order of a second or two) as opposed to filtering in flux or influxql ( in the order of 15 seconds or so ) for my particular query and dataset . It seems like the “where” chain in InfluxQL or the “contains” in Flux is not that optimized and its best to just run query for all devices and filter in software!