Taking too long to get data through query

Hi there!

I am facing an issue getting InfluxDB data using a Flux query. It's taking too long to return data.

I have this simple query:

```flux
import "regexp"
import "strings"

Customer_9571XXXX = from(bucket: "SENSOR_DATA_DB/defaultPolicy")
    |> range(start: 0, stop: 2024-03-22T06:33:34Z)
    |> filter(fn: (r) => r["_measurement"] == "9571XXXX_Customer")
    |> filter(fn: (r) =>
        r["_field"] == "call_type" or
        r["_field"] == "subscription_status" or
        r["_field"] == "email_type" or
        r["_field"] == "app_event_type",
    )
    |> group(columns: ["_field", "_measurement"])

Customer_9571XXXX
```

It takes 52.87 seconds.

My InfluxDB is hosted as a Docker container on AWS with 2 vCPUs and 80 GB of storage. The instance is not dedicated to InfluxDB; two Postgres databases also run on it.

My question is: how can we manage millions of points and reduce the query time above?

Is there any specific hardware setup recommended for InfluxDB?


@debnath Question: how "dense" is your data? Meaning, how often are points written to InfluxDB? You're querying a huge time range (around 54 years, since `start: 0` is the Unix epoch). The first thing I'd suggest is reducing your query time range. Depending on how old and how dense your data is, this could potentially return tens of millions of rows, which obviously takes time.
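For example, here's a quick sketch (plain Python standard library; the function name and 30-day window are just my own illustration) of computing a bounded RFC 3339 range to use in place of `start: 0`:

```python
from datetime import datetime, timedelta, timezone

def bounded_range(days, stop=None):
    """Return (start, stop) RFC 3339 strings covering the last `days` days."""
    stop = stop or datetime.now(timezone.utc)
    start = stop - timedelta(days=days)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return start.strftime(fmt), stop.strftime(fmt)

# Substitute a bounded window into the Flux query instead of start: 0
start, stop = bounded_range(30)
flux_range = f"|> range(start: {start}, stop: {stop})"
```

Even if you eventually need the full history, querying it in bounded windows like this keeps each individual query's result set manageable.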

This seems undersized for the potential amount of data you’re querying. Especially considering that it’s shared with two other databases. You would likely see better performance with an instance with more CPU cores and memory, but I can’t say exactly how much better the performance would be.

Hi @scott ,

Thanks for your reply.

Could I get extra compute, data-management capacity, and features through the InfluxDB Cloud service?

@debnath Yes, you would have access to more computation, but InfluxDB Cloud Serverless is built on a different storage engine and doesn’t support Flux. You’d have to use either InfluxQL or SQL. You are also subject to rate limits that may prevent you from querying the amount of data that you’re trying to query.

About 2 lakh (200,000) points every day. It's test data, but I will be going above half a million points per day.

We are having trouble getting data back quickly. We are mostly querying data and also using built-in Tasks with Flux.

Can you suggest how we can retrieve data more quickly from millions of points? We are using indexing and following query best practices. Also, is there an alternative, such as different hardware or a different deployment/server configuration, that we could move to for better performance?

Please suggest.

Thank You.

I can’t make specific hardware recommendations, but hardware with more resources will improve performance. There are a few places you could be bottlenecking:

  • Disk I/O: The read speed of your disk could be slow. This could also be affected by file system mounts into your container, but these generally don’t affect read speed all that much. SSD storage will give you the best read performance, but that may not be an option for you.

  • CPU: Since you currently have only 2 virtual cores and those cores are shared between three separate databases, this is my first guess for the bottleneck. Making more CPU available should help.

  • RAM: I don’t really see memory being the bottleneck here, but you don’t mention how much memory is available to the InfluxDB process. InfluxDB uses memory primarily in two ways:

    • Storing the index
    • Buffering data as results are collected and returned

    I don’t know that memory is really the culprit here, but more memory is never a bad thing.


@scott Thank you.

I will try out your solution.


Hello @debnath & @scott, I am also facing the same issue: when I try to read data from InfluxDB using the Python InfluxDB client (using Flux), it takes a good amount of time, around 1-2 minutes.
I'm attaching the query below, please check.

        table_robot_info = f"""
                from(bucket: "Robots")
                |> range(start: {timeRangeStart_str}, stop: {timeRangeStop_str})
                |> filter(fn: (r) => r._measurement == "robotInfo")
                |> filter(fn: (r) => r["_field"] == "isEStopPressed" or r["_field"] == "stopAndHoldValue" or r["_field"] == "mapNo" or r["_field"] == "mapName" or r["_field"] == "operationMode" or r["_field"] == "areaCovered")
                |> filter(fn: (r) => r["robotID"] == "{machine_id}")
                |> aggregateWindow(every: 5s, fn: last, createEmpty: false)
                |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
        """
We are usually querying about a week of data at a time, which comes to 100k-500k rows.

Can you help me or guide me as to what I am doing wrong? Is there any way to reduce this time?

@Vijay_Barman Have you tried just running the query against the raw InfluxDB query API? I’d be curious to know if the query takes as long running that way. I suspect there’s some latency that’s rooted in the Python client having to process all the data coming back to a format that Python expects.
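To separate server time from client processing time, here's a minimal sketch (Python standard library only; the URL, org, and token are placeholders you'd replace with your own) that times a Flux query against the raw `/api/v2/query` HTTP endpoint, bypassing the Python client's result deserialization:

```python
import time
import urllib.request

def build_query_request(url, org, token, flux):
    """Build a POST request against InfluxDB's raw /api/v2/query endpoint."""
    return urllib.request.Request(
        f"{url}/api/v2/query?org={org}",
        data=flux.encode("utf-8"),
        headers={
            "Authorization": f"Token {token}",       # API token (placeholder)
            "Content-Type": "application/vnd.flux",  # body is a raw Flux script
            "Accept": "application/csv",             # annotated CSV response
        },
        method="POST",
    )

def time_raw_query(url, org, token, flux):
    """Return (seconds elapsed, bytes received) for one raw query round trip."""
    req = build_query_request(url, org, token, flux)
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
    return time.perf_counter() - start, len(body)

# Example with placeholder credentials:
# elapsed, nbytes = time_raw_query(
#     "http://localhost:8086", "my-org", "my-token",
#     'from(bucket: "Robots") |> range(start: -1h)')
```

If the raw round trip is fast but your Python code is still slow, the time is being spent turning the CSV response into Python objects, and streaming the results rather than materializing them all at once would be the next thing to look at.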