Help translating an SQL query to a Flux script

I want to write this query in Flux:

SELECT COUNT(DISTINCT DOCUMENT), COUNT(DISTINCT SUPPLIER) FROM poc GROUP BY CUSTOMER

Basically, I want the count of unique documents and unique suppliers, grouped by customer.

I am able to get the total count of unique DOCUMENT values per customer, but I can't figure out how to count both SUPPLIER and DOCUMENT in the same result when grouping by CUSTOMER.

Below is the Flux script I managed to build:

from(bucket: "poc")
    |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
    |> filter(fn: (r) => r["_measurement"] == "file")
    |> keep(columns: ["CUSTOMER", "DOCUMENT", "SUPPLIER"])
    |> group(columns: ["CUSTOMER"])
    |> unique(column: "DOCUMENT")
    |> count(column: "DOCUMENT")

@Lovish_Mehta I think the following will give you what you're looking for. (This assumes DOCUMENT and SUPPLIER are tags.)

import "join"

data = from(bucket: "poc")
    |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
    |> filter(fn: (r) => r["_measurement"] == "file")
    |> keep(columns: ["CUSTOMER","DOCUMENT","SUPPLIER"])
    |> group(columns: ["CUSTOMER"])

document = data
    |> unique(column: "DOCUMENT")
    |> count(column: "DOCUMENT")

supplier = data
    |> unique(column: "SUPPLIER")
    |> count(column: "SUPPLIER")

join.inner(
    left: document,
    right: supplier,
    on: (l,r) => l.CUSTOMER == r.CUSTOMER,
    as: (l, r) => ({CUSTOMER: l.CUSTOMER, DOCUMENT: l.DOCUMENT, SUPPLIER: r.SUPPLIER}),
)

Hi,
Thank you @scott for the reply. I already tried this one and it gave me the results, but performance-wise the join takes a lot of time for my use case. Is there any other way to count both DOCUMENT and SUPPLIER in a single query, without using a join?
And yes, DOCUMENT and SUPPLIER are tags.

@Lovish_Mehta There is another approach you could try. I don’t know how much it will differ in performance compared to joining the data, but it’s worth a shot.

This approach maps over the data and restructures each row so each row has a tag column and a tagCount column. The tag column is the name of the tag that you’re counting and the tagCount column is the number of unique values of that tag. You then union the streams together and pivot the new columns into rows:

data = from(bucket: "poc")
    |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
    |> filter(fn: (r) => r["_measurement"] == "file")
    |> keep(columns: ["CUSTOMER","DOCUMENT","SUPPLIER"])
    |> group(columns: ["CUSTOMER"])

document = data
    |> unique(column: "DOCUMENT")
    |> count(column: "DOCUMENT")
    |> map(fn: (r) => ({CUSTOMER: r.CUSTOMER, tag: "DOCUMENT", tagCount: r.DOCUMENT}))

supplier = data
    |> unique(column: "SUPPLIER")
    |> count(column: "SUPPLIER")
    |> map(fn: (r) => ({CUSTOMER: r.CUSTOMER, tag: "SUPPLIER", tagCount: r.SUPPLIER}))

union(tables: [document, supplier])
    |> pivot(rowKey: ["CUSTOMER"], columnKey: ["tag"], valueColumn: "tagCount")

Flux scripts are inherently single-threaded and do not support multi-threading directly. However, InfluxDB’s task system can help you parallelize your workloads. You can create multiple tasks, each focusing on a subset of your data.

Here’s an example of how you can partition your query into multiple tasks. Each task will handle a specific subset of customers:

// Task 1: Process customers starting with A-M
data = from(bucket: "poc")
    |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
    |> filter(fn: (r) => r["_measurement"] == "file" and r["CUSTOMER"] >= "A" and r["CUSTOMER"] <= "M")
    |> keep(columns: ["CUSTOMER", "DOCUMENT", "SUPPLIER"])
    |> group(columns: ["CUSTOMER"])

// rest of the script
// Task 2: Process customers starting with N-Z
data = from(bucket: "poc")
    |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
    |> filter(fn: (r) => r["_measurement"] == "file" and r["CUSTOMER"] >= "N" and r["CUSTOMER"] <= "Z")
    |> keep(columns: ["CUSTOMER", "DOCUMENT", "SUPPLIER"])
    |> group(columns: ["CUSTOMER"])

// rest of the script
This is a simple example of partitioning the data by the first letter of the customer's name. In practice, you can partition by any criterion that makes sense for your use case.

However, please note that this strategy will only improve performance if InfluxDB is running on a system with multiple cores, and the tasks can be run in parallel. If the system has only a single core, then running the tasks in parallel will not lead to any performance improvement.
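To make either snippet run on a schedule as an InfluxDB task, it needs a task option and a fixed time range (tasks cannot use dashboard variables like v.timeRangeStart). Below is a minimal sketch for Task 1; the task name, schedule, and destination bucket (poc_counts) are illustrative placeholders, not names from this thread:

```
// Hypothetical task wrapper for the A-M partition.
option task = {name: "count-customers-a-m", every: 1h}

data = from(bucket: "poc")
    |> range(start: -task.every)  // cover the last scheduling interval
    |> filter(fn: (r) => r["_measurement"] == "file" and r["CUSTOMER"] >= "A" and r["CUSTOMER"] <= "M")
    |> keep(columns: ["CUSTOMER", "DOCUMENT", "SUPPLIER"])
    |> group(columns: ["CUSTOMER"])

// ...then apply the unique/count/union/pivot logic from the earlier answer,
// and optionally write the result to another bucket with to() for later querying.
```

Task 2 would be defined identically with its own name and the N-Z filter; InfluxDB schedules each task independently, which is what allows them to run in parallel.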
