Influx 2.0 performance versus Influx 1.7 on windows

I have been running the Windows release of InfluxDB 2.0.5 and InfluxDB 1.7 on separate but similar servers. Logstash pushes the same data to each using the v1 API in parallel, so both databases have essentially the same data; the 1.7 instance has more only because it has been running longer. Querying the data with Grafana, I am finding the 1.7 version to be somewhat faster on larger queries. In one test pulling 137K records, the 1.7 version took 1.45 seconds and the 2.0.5 version took 3.21 seconds (per Grafana's Stats). The query sent to both is the same InfluxQL, not Flux. When I use Flux, it is much, much slower and cannot handle large result sets.

Has anyone else compared the 1.7 and 2.0 versions for performance? I want to upgrade to stay current but worry about the performance of the 2.0 version.

Hi @mg42561, would you mind sharing the InfluxQL and Flux queries you’re running? We just began working on a suite of performance tests to identify & fix areas that need more attention.

The data contains the duration of various transactions, and the query returns the mean of this, grouped by time interval and scaled from milliseconds to seconds. There are about 100K data points per day. I had not originally stored the time of day or day of week as fields, so I could not filter for, say, business hours or workdays; that is why I wanted to try Flux, which provides richer date/time manipulation. Going forward, I do store the time of day and day of week as fields so I can perform these filters in InfluxQL. I am visualizing with Grafana 7.5.
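For example, the business-hours filter I have in mind looks something like this (the "hour" field name and its 0-23 integer values are just illustrative; $timeFilter and ${time_interval} are Grafana variables):

```sql
-- Sketch: "hour" is assumed to be an integer field (0-23) written with
-- each point, so it can be compared directly in the WHERE clause.
SELECT mean("duration") / 1000
FROM "Portals"."autogen"."usage"
WHERE "portal" = 'SalesPortal'
  AND "result" = 'Success'
  AND "hour" >= 12 AND "hour" <= 22
  AND $timeFilter
GROUP BY time(${time_interval}) fill(null)
```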

Flux Query:

import "date"

from(bucket: "Portals")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "usage")
  |> filter(fn: (r) => r["portal"] == "SalesPortal")
  |> filter(fn: (r) => r["result"] == "Success")
  |> filter(fn: (r) => r["_field"] == "duration")
  |> filter(fn: (r) => date.hour(t: r._time) >= 12 and date.hour(t: r._time) <= 22)
  |> drop(columns: ["app", "action", "portal"])
  |> aggregateWindow(every: 1h, fn: mean, createEmpty: false)
  |> map(fn: (r) => ({r with _value: r._value / 1000.0}))
  |> yield(name: "mean")

The InfluxQL version, where ${time_interval} is a Grafana variable for the desired time interval (5m, 1h, 1d):

SELECT mean("duration") / 1000 FROM "Portals"."autogen"."usage" WHERE "portal" = 'SalesPortal' AND $timeFilter AND "result" = 'Success' GROUP BY time(${time_interval}), app_action fill(null) tz('America/New_York')

I am using the "Stats" feature in Grafana to get the timing information. I was going to call these endpoints directly with curl to take Grafana out of the equation, but have not done so yet.
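Something like the following is what I had in mind for the curl test. The hosts and the 2.0 token are placeholders, and I am timing via curl's built-in write-out rather than Grafana:

```shell
# Rough sketch for timing the same InfluxQL against both servers.
# Host URLs and the 2.0 API token below are placeholders.
V1_HOST="http://influx17.example.com:8086"
V2_HOST="http://influx20.example.com:8086"
TOKEN="my-v2-token"
Q='SELECT mean("duration") / 1000 FROM "Portals"."autogen"."usage" WHERE "portal" = '\''SalesPortal'\'' AND "result" = '\''Success'\'' AND time > now() - 1d GROUP BY time(1h)'

# 1.7: classic /query endpoint
curl -sG --max-time 3 -o /dev/null -w '1.7 total: %{time_total}s\n' \
  "$V1_HOST/query" --data-urlencode "db=Portals" --data-urlencode "q=$Q" || true

# 2.0: same InfluxQL via the v1-compatibility /query endpoint, token auth
curl -sG --max-time 3 -o /dev/null -w '2.0 total: %{time_total}s\n' \
  -H "Authorization: Token $TOKEN" \
  "$V2_HOST/query" --data-urlencode "db=Portals" --data-urlencode "q=$Q" || true
```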

The 2.0 and 1.7 servers are both running on EC2, but I just confirmed they are not the same configuration: the 2.0 is on a t2.xlarge and the 1.7 on a c5.2xlarge, though both have 16 GB RAM. I am not sure whether this contributes to the performance differences I am seeing between 1.7 and 2.0. I tested briefly on Linux (m5.xlarge, 16 GB) and found the 1.7 version to be faster there as well. The 1.7 instance has 8 vCPUs whereas the 2.0 version was tested on 4-vCPU instances, so perhaps that is contributing.

@mg42561 the differences in hardware are likely causing some of the perf diff you’re seeing, though I can’t say exactly how much. Updating your 2.0 deployment to use the same EC2 setup as your 1.7 deployment will help show a more apples-to-apples comparison.

I’ve done some preliminary investigation of your 2.0 query performance, and here’s what I’ve found:

  • In 2.0, InfluxQL queries are "transpiled" into Flux and submitted to the same underlying query engine; the results are then converted a second time into the InfluxQL response format. This could account for the extra overhead you're observing in the 2.0 system.
  • date.hour seems to be the primary cause of the slow performance in your Flux version of the query. We don’t currently support “pushing down” that function into the TSM engine, so the system has to read every row matching the previous filters into the Flux executor and process them one-by-one. This inefficiency cascades to the following nodes.

I’m going to investigate what it’d take to add push-down support to date.hour. In the meantime, I think storing the hour-of-day as a tag (rather than a field) would help you avoid this issue in Flux: you could filter on the tag instead of calling date.hour, allowing the filter to be pushed down into the storage layer. That would also enable the aggregate mean to be pushed down, letting most of the work be done by TSM instead of the Flux executor.
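For illustration, a push-down-friendly version of your query might look something like this, assuming an "hour" tag is written with each point. The tag name is an assumption, and since tag values are strings, I'm assuming zero-padded two-digit values ("00" through "23") so the string comparison behaves like a numeric one:

```flux
// Sketch: "hour" is an assumed tag with zero-padded values "00".."23".
from(bucket: "Portals")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "usage")
  |> filter(fn: (r) => r["portal"] == "SalesPortal")
  |> filter(fn: (r) => r["result"] == "Success")
  |> filter(fn: (r) => r["_field"] == "duration")
  |> filter(fn: (r) => r["hour"] >= "12" and r["hour"] <= "22")  // tag filter, pushes down
  |> aggregateWindow(every: 1h, fn: mean, createEmpty: false)
  |> map(fn: (r) => ({r with _value: r._value / 1000.0}))
  |> yield(name: "mean")
```

With every filter expressed on tags/fields, the filters and the windowed mean can all be handled by the storage engine, and only the final unit-conversion map runs in the Flux executor.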

Thank you for investigating further. I have submitted a request to have the two environments use the same EC2 setup, so I’ll try again when that is finished. The transpile explanation is helpful; is there any way to view the transpiled query? When I removed the date.hour filter, the performance of the Flux version improved dramatically; anecdotally, it looks like the Flux version was then slightly faster than the InfluxQL version on 2.0. I had read about push-downs and suspected date.hour() was impacting things; however, I’m still wrapping my head around Flux versus InfluxQL. I am eager to try some of the Flux features such as the richer data manipulation and joining with SQL data sources (which I have yet to try).

I read somewhere that the 2.0 Windows version is now considered "stable enough". Do you agree that I can safely migrate to it?

-Morris