Help translating an SQL query to a Flux script

I want to write this query in Flux:

SELECT COUNT(DISTINCT DOCUMENT), COUNT(DISTINCT SUPPLIER) FROM poc GROUP BY CUSTOMER

Basically, I want the count of unique documents and unique suppliers, grouped by customer.

I am able to get the total count of unique DOCUMENT values per customer, but I can't figure out how to count both SUPPLIER and DOCUMENT in the same result when grouping by CUSTOMER.

Below is the Flux script I managed to build:

from(bucket: "poc")
    |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
    |> filter(fn: (r) => r["_measurement"] == "file")
    |> keep(columns: ["CUSTOMER", "DOCUMENT", "SUPPLIER"])
    |> group(columns: ["CUSTOMER"])
    |> unique(column: "DOCUMENT")
    |> count(column: "DOCUMENT")

@Lovish_Mehta I think the following will give you what you're looking for. (This assumes DOCUMENT and SUPPLIER are tags.)

import "join"

data = from(bucket: "poc")
    |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
    |> filter(fn: (r) => r["_measurement"] == "file")
    |> keep(columns: ["CUSTOMER","DOCUMENT","SUPPLIER"])
    |> group(columns: ["CUSTOMER"])

document = data
    |> unique(column: "DOCUMENT")
    |> count(column: "DOCUMENT")

supplier = data
    |> unique(column: "SUPPLIER")
    |> count(column: "SUPPLIER")

join.inner(
    left: document,
    right: supplier,
    on: (l,r) => l.CUSTOMER == r.CUSTOMER,
    as: (l, r) => ({CUSTOMER: l.CUSTOMER, DOCUMENT: l.DOCUMENT, SUPPLIER: r.SUPPLIER}),
)

Hi,
Thank you @scott for the reply. I already tried this one and it gave me the results, but performance-wise the join takes a lot of time for my use case. Is there any other way to count both DOCUMENT and SUPPLIER in a single query, without using a join?
And yes, DOCUMENT and SUPPLIER are tags.

@Lovish_Mehta There is another approach you could try. I don’t know how much it will differ in performance compared to joining the data, but it’s worth a shot.

This approach maps over the data and restructures each row so each row has a tag column and a tagCount column. The tag column is the name of the tag that you’re counting and the tagCount column is the number of unique values of that tag. You then union the streams together and pivot the new columns into rows:

data = from(bucket: "poc")
    |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
    |> filter(fn: (r) => r["_measurement"] == "file")
    |> keep(columns: ["CUSTOMER","DOCUMENT","SUPPLIER"])
    |> group(columns: ["CUSTOMER"])

document = data
    |> unique(column: "DOCUMENT")
    |> count(column: "DOCUMENT")
    |> map(fn: (r) => ({CUSTOMER: r.CUSTOMER, tag: "DOCUMENT", tagCount: r.DOCUMENT}))

supplier = data
    |> unique(column: "SUPPLIER")
    |> count(column: "SUPPLIER")
    |> map(fn: (r) => ({CUSTOMER: r.CUSTOMER, tag: "SUPPLIER", tagCount: r.SUPPLIER}))

union(tables: [document, supplier])
    |> pivot(rowKey: ["CUSTOMER"], columnKey: ["tag"], valueColumn: "tagCount")

Flux scripts are inherently single-threaded and do not support multi-threading directly. However, InfluxDB’s task system can help you parallelize your workloads. You can create multiple tasks, each focusing on a subset of your data.

Here’s an example of how you can partition your query into multiple tasks. Each task will handle a specific subset of customers:

// Task 1: Process customers starting with A-M
data = from(bucket: "poc")
    |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
    |> filter(fn: (r) => r["_measurement"] == "file" and r["CUSTOMER"] >= "A" and r["CUSTOMER"] <= "M")
    |> keep(columns: ["CUSTOMER", "DOCUMENT", "SUPPLIER"])
    |> group(columns: ["CUSTOMER"])

// rest of the script
// Task 2: Process customers starting with N-Z
data = from(bucket: "poc")
    |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
    |> filter(fn: (r) => r["_measurement"] == "file" and r["CUSTOMER"] >= "N" and r["CUSTOMER"] <= "Z")
    |> keep(columns: ["CUSTOMER", "DOCUMENT", "SUPPLIER"])
    |> group(columns: ["CUSTOMER"])

// rest of the script
This is a simple example of partitioning the data by the first letter of the customer's name. In practice, you can partition by any criterion that makes sense for your use case.

However, please note that this strategy will only improve performance if InfluxDB is running on a system with multiple cores, and the tasks can be run in parallel. If the system has only a single core, then running the tasks in parallel will not lead to any performance improvement.
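To make either snippet run on a schedule as an InfluxDB task, it needs a task option and a fixed time range (tasks cannot use dashboard variables like v.timeRangeStart). Below is a minimal sketch for Task 1; the task name, schedule, and destination bucket (poc_counts) are illustrative placeholders, not names from this thread:

```
// Hypothetical task wrapper for the A-M partition.
option task = {name: "count-customers-a-m", every: 1h}

data = from(bucket: "poc")
    |> range(start: -task.every)  // cover the last scheduling interval
    |> filter(fn: (r) => r["_measurement"] == "file" and r["CUSTOMER"] >= "A" and r["CUSTOMER"] <= "M")
    |> keep(columns: ["CUSTOMER", "DOCUMENT", "SUPPLIER"])
    |> group(columns: ["CUSTOMER"])

// ...then apply the unique/count/union/pivot logic from the earlier answer,
// and optionally write the result to another bucket with to() for later querying.
```

Task 2 would be defined identically with its own name and the N-Z filter; InfluxDB schedules each task independently, which is what allows them to run in parallel.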
