Hi,
I am a new InfluxDB user and I need some advice on how to correctly use Continuous Queries for downsampling. If I understood the documentation correctly, a Continuous Query to group data by an interval i will be executed on data with timestamps in a range [now()-i, now()]. That means that if I insert data with timestamps older than now()-i, it won’t be concerned by the CQ.
I understand that normally CQ are used to save disk space by copying new data in databases with longer Retention Policies and a lower level of detail. I don’t want to use it that way. I am using downsampled databases to make queries on a large time ranges faster and the main database (the one with the non downsampled data) for queries on smaller time ranges. I will also have to insert data with old timestamps from CSV files in the database I am building. That means that I need the CQ to covers the full database.
I have two ideas :
1- run a SELECT … INTO … FROM … GROUP BY … query after importing a CSV file.
2- make a CQ with the EVERY and FOR keywords like :
CREATE CONTINUOUS QUERY <cq_name> ON <database_name>
RESAMPLE EVERY <interval1> FOR <interval2>
BEGIN
<cq_query>
END
This should run the query every EVERY intervals on data with timestamps between now()-FOR and now(). So my idea is to make the FOR interval “infinite”.
What do you guys think about these ideas ? Is there a better way to do this ?
Hopefully this example will help
CREATE CONTINUOUS QUERY "rollup_1h" ON "system_stats"
BEGIN
SELECT mean(*) INTO forever.cpu FROM twoweeks.cpu GROUP BY time(1h)
END
This rolls up 1s resolution into 1h resolution, rolling up data from twoweeks.cpu
into forever.cpu
Thanks for your answer
But if I quote the documentation (InfluxQL Continuous Queries | InfluxDB OSS 1.7 Documentation):
" CQs operate on realtime data. They use the local server’s timestamp, the GROUP BY time()
interval, and InfluxDB’s preset time boundaries to determine when to execute and what time range to cover in the query.
CQs execute at the same interval as the cq_query
’s GROUP BY time()
interval, and they run at the start of InfluxDB’s preset time boundaries. If the GROUP BY time()
interval is one hour, the CQ executes at the start of every hour.
When the CQ executes, it runs a single query for the time range between now()
and now()
minus the GROUP BY time()
interval. If the GROUP BY time()
interval is one hour and the current time is 17:00, the query’s time range is between 16:00 and 16:59.999999999."
Doesn’t this mean that only the data from the last hour will be downsampled into “rollup_1h” ?
From the docs:
Issue 3: Backfilling results for older data
CQs operate on realtime data, that is, data with timestamps that occur relative to now()
. Use a basic INTO
query to backfill results for data with older timestamps.
So the following query, as is my understanding, should handle backfilling older data, as it uses INTO
CREATE CONTINUOUS QUERY "rollup_1h" ON "system_stats"
BEGIN
SELECT mean(*) INTO forever.cpu FROM twoweeks.cpu GROUP BY time(1h)
END
According to InfluxQL Continuous Queries | InfluxDB OSS 1.7 Documentation, the INTO clause is mandatory in a CQ (“The cq_query
requires a function, an INTO
, and a GROUP BY time()
clause”). So my understanding is that the INTO clause does not change the behavior of the CQ regarding the time range on which it performs the query.
I think that the “basic INTO query” you mentioned refers to a regular query (not a continuous one) with an INTO clause, which corresponds to what I called “solution 1” in my original question.
Hi @thebuleon29,
Let me check with my colleagues and I’ll get back to you
Hi @thebuleon29,
My colleagues haven’t gotten back to me yet (timezones ), but I’ve done a lot more reading and I think you’re correct. I’ll submit a PR to the docs to remove any ambiguity in them.
You’d need to execute a one time SELECT INTO
to backfill older data
Ok, thank you very much !