SELECTing both mean + percentiles produces non-deterministic results

paulo · July 20, 2017, 9:21am

When I’m doing selects like:

SELECT mean("Percent_Processor_Time"), sum("Percent_Processor_Time") FROM "CPU" WHERE …

I’m getting deterministic results as expected, like:

host | mean | sum
serverA | 12 | 34
serverB | 56 | 78

However, when I do selects like:

SELECT mean("Percent_Processor_Time"), percentile("Percent_Processor_Time", 95) FROM "CPU" WHERE …

I’m getting an incorrect «table», and furthermore the behavior is not deterministic:

host | mean | percentile
serverA | 12 | 34
serverB | 56 | __
serverB | __ | 78

^ For some of the GROUP BY groups, all results are aligned in one single row; for others, each row holds only some of the results. If I select mean + 3 percentiles, I seem to get basically every possible combination in which each row has at least one result, seemingly at random.

DJDaveMark · July 26, 2017, 4:23pm

I’m seeing the same bug. Here’s the dummy data to reproduce the issue:

cpu,host=a value=1 1434155562000000000
cpu,host=a value=2 1434265562000000000
cpu,host=b value=3 1434375562000000000
cpu,host=b value=4 1422568543702900257

And the query which doesn’t work:

select sum(value), percentile(value, 75) from cpu group by host

(the sum function can be replaced with mean/median/mode/percentile and the query still won’t work)

which returns

host | sum | percentile
a    | 3   | __
b    | 7   | 4
a    | __  | 2

.

{
    "results": [
        {
            "statement_id": 0,
            "series": [
                {
                    "name": "cpu",
                    "tags": {
                        "host": "a"
                    },
                    "columns": [
                        "time",
                        "sum",
                        "percentile"
                    ],
                    "values": [
                        [
                            "1970-01-01T00:00:00Z",
                            3,
                            null
                        ]
                    ]
                },
                {
                    "name": "cpu",
                    "tags": {
                        "host": "b"
                    },
                    "columns": [
                        "time",
                        "sum",
                        "percentile"
                    ],
                    "values": [
                        [
                            "1970-01-01T00:00:00Z",
                            7,
                            4
                        ]
                    ]
                },
                {
                    "name": "cpu",
                    "tags": {
                        "host": "a"
                    },
                    "columns": [
                        "time",
                        "sum",
                        "percentile"
                    ],
                    "values": [
                        [
                            "1970-01-01T00:00:00Z",
                            null,
                            2
                        ]
                    ]
                }
            ]
        }
    ]
}

However, the query does work if the percentile is changed to any value below 75 (e.g. 74):

    select sum(value), percentile(value, 74) from cpu group by host
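
One possible reading of the 74 vs. 75 difference, offered as a guess rather than a confirmed cause: if percentile() selects a point by nearest rank, then with two points per host the selected point changes exactly at 75. The sketch below assumes the index rule floor(n * p / 100 + 0.5) - 1 over the sorted values; that formula and the helper name `nearest_rank` are assumptions, not something stated in this thread.

```python
import math


def nearest_rank(values, p):
    """Return the value a nearest-rank percentile would select.

    The index rule used here is an assumption about how percentile()
    behaves, not something confirmed in this thread.
    """
    ordered = sorted(values)
    # Assumed rule: index = floor(n * p / 100 + 0.5) - 1
    i = math.floor(len(ordered) * p / 100 + 0.5) - 1
    if i < 0 or i >= len(ordered):
        return None
    return ordered[i]


# Field values per host, taken from the dummy data above.
hosts = {"a": [1, 2], "b": [3, 4]}
for p in (74, 75):
    print(p, {host: nearest_rank(vals, p) for host, vals in hosts.items()})
```

Under that assumption, 75 selects 2 for host a and 4 for host b, which matches the percentile column in the output above, while 74 selects the lower value in each group. In the dummy data, the point selected for host a at 75 is the later of its two points in time, whereas for host b it is the earlier one, which may be related to why only host a is split.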

The bug also doesn’t appear when the points have nearly identical timestamps (similar to what you get when omitting the timestamps while inserting the data):

cpu,host=a value=1 1234567890000000001
cpu,host=a value=2 1234567890000000002
cpu,host=b value=3 1234567890000000003
cpu,host=b value=4 1234567890000000004

Conclusion
So to sum up, the bug appears when the points have widely differing timestamps and the query groups with at least 2 aggregate functions (one being percentile) in the select clause.

My Workaround
I query the percentiles separately (one at a time) and add them to the previous results manually.
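
That workaround can be sketched in Python roughly as follows. It assumes each query's response has already been parsed into the JSON shape shown earlier in this post; the helpers `merge_by_tags` and `make_response` and the sample responses are hypothetical, and no InfluxDB client call is made.

```python
def merge_by_tags(responses):
    """Merge single-row series from several responses into one row per tag set."""
    rows = {}
    for response in responses:
        for result in response.get("results", []):
            for series in result.get("series", []):
                tags = series.get("tags", {})
                key = tuple(sorted(tags.items()))
                row = rows.setdefault(key, dict(tags))
                for values in series["values"]:
                    for column, value in zip(series["columns"], values):
                        # Skip the time column and nulls so that a value
                        # already filled in is never overwritten by null.
                        if column != "time" and value is not None:
                            row[column] = value
    return list(rows.values())


def make_response(column, per_host):
    """Build a hypothetical parsed response with one aggregate column."""
    return {"results": [{"series": [
        {"name": "cpu", "tags": {"host": host},
         "columns": ["time", column],
         "values": [["1970-01-01T00:00:00Z", value]]}
        for host, value in per_host.items()]}]}


# One response per query, as if sum and percentile were queried separately.
sums = make_response("sum", {"a": 3, "b": 7})
p75 = make_response("percentile", {"a": 2, "b": 4})
print(merge_by_tags([sums, p75]))
```

When several percentiles are queried, each one needs a distinct column name (for example via an AS alias in the query) so they don't overwrite each other in the merged row. Because nulls are skipped, the same function should also collapse the split rows when given the single buggy response on its own.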

mark · July 26, 2017, 4:37pm

Thanks for identifying how to reproduce this bug. Would you mind opening an issue on GitHub?

DJDaveMark · July 26, 2017, 10:43pm

Hi Mark,

I actually found a similar issue on GitHub, so I posted a comment over there too:

github.com/influxdata/influxdb

GROUP BY clause not working

opened 10:35AM - 20 Apr 17 UTC

closed 05:29AM - 31 Jul 19 UTC

arjunkadayanthra

1.x wontfix

Bug report

System info: InfluxDB version: 1.2; OS: Windows… 10; Type: Local Instance

Steps to reproduce:
1 - Create 2 measurements having the same schema and write data into one of them through line protocol (say Measurement A) and using the Java API to the other (say Measurement B)
2 - Query using GROUP BY clause on a string field in both the measurements

The queries I used was:
"select * from summary_jm group by batch_no" - Measurement A
"select * from summary_jm group by batch_no" - Measurement B

Behaves as expected in Measurement A whereas the results are not grouped in Measurement B.

Expected behavior: Both results should be grouped based on the specified field

Actual behavior: Results from B are not grouped

Additional info: [Include gist of relevant config, logs, etc.]

Please note, the quickest way to fix a bug is to open a Pull Request.
