Delete specific data older than 30 days

Hi,
I have an InfluxDB instance running on a Pi 3B+ storing the power consumption of 4 devices I have at home.
Each device gives a reading every 20 seconds, and I have a process that creates hourly and daily aggregate values.

Now I would like to delete the raw values that are older than 30 days, but I cannot get a query working.
Any help please?

I am running v3.3.0 on Home Assistant.

Any help will be appreciated

@mouthpiec Just store your raw data in a 30-day retention policy. InfluxDB will drop it automatically once it ages out.

Thanks Scott for your feedback.
The problem is that I would like to keep the aggregate values for a long period of time (2 years) and delete only the raw values after 30 days. I think if I set the retention policy to 30 days, all the data will be deleted.

You can create an additional retention policy for your aggregated data, keeping it for 2 years:

raw.30days
aggregate.2years

When you process your raw data, output it to aggregate.2years.
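A minimal sketch of that setup in InfluxQL, assuming your database is named mydb (note InfluxQL has no y duration unit, so two years is written as 104w):

CREATE RETENTION POLICY "30days" ON "mydb" DURATION 30d REPLICATION 1 DEFAULT
CREATE RETENTION POLICY "2years" ON "mydb" DURATION 104w REPLICATION 1

The DEFAULT keyword makes the 30-day policy the target for any writes that don't name a retention policy explicitly.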

Hi Phil, I was not aware of that; it looks like it's exactly what I need.
Can you guide me through how this is done, please?

Hi @mouthpiec, sure. Bear in mind though that I'm using an older Influx version (1.5.2 currently).

I have one main assumption here: you are running the stack on a Linux variant.

There are really two ways you can do this: with CQs (continuous queries) or Kapacitor batch tasks. Personally, I prefer the latter (there is less overhead for Kapacitor scripts as opposed to CQs), but if you are running small processing tasks then CQs should work.

This article might help you decide which is best suited to you.

I've also found a topic I responded to a while back which might help:

CQ Docs
Kapacitor as a CQ engine

Now, personally I prefer to use Kapacitor for this. CQs are good but can be intensive on your InfluxDB instance. We process a lot of measurements with our instances, and CQs were causing big issues in memory usage when they were running; using Kapacitor helped alleviate this. It also has more functionality in terms of the functions it can perform. For this to work you will need to install Kapacitor as well.

Now, first of all we need a new RP, if you haven't created one already:

CREATE RETENTION POLICY "2yearhistorical" ON "yourdatabase" DURATION 104w REPLICATION 1

Replace the DB name with your database and name the RP whatever you want (InfluxQL durations have no y unit, so two years is expressed as 104w).
Check the new RP exists:
SHOW RETENTION POLICIES

Once that's done, create a batch TICK script to downsample the data. (This is from a script I currently use, but you should be able to swap the measurements and fields and fiddle with it to get you started.)

// Downsample all metrics from the win_disk measurement.
batch
    |query('SELECT mean("Free_Megabytes") AS "mean_Free_Megabytes", mean("Percent_Free_Space") AS "mean_Percent_Free_Space" FROM "mydatabase"."30days"."win_disk"')
        .period(5m)
        .every(5m)
        .groupBy(time(5m), *)
    |influxDBOut()
        .database('mydatabase')
        .retentionPolicy('2yearhistorical')
        .measurement('win_disk')
        .precision('s')

So the above script runs every 5 minutes and queries my raw data, applying the "mean" function. This returns the mean values and outputs them to the SAME database, but into a different retention policy (you could specify a separate output DB if you wanted to keep them separate).

To define the task you would run:
sudo kapacitor define downsample_task_name -type batch -tick /path_to_script -dbrp database.rawdataRP

Followed by
sudo kapacitor enable downsample_task_name
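
If you want to confirm the task is defined and executing, the standard Kapacitor CLI can show it:

sudo kapacitor list tasks
sudo kapacitor show downsample_task_name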

Once it's running, you should be able to use the data in Chronograf to graph your "historical" data.

The best advice really is to read those articles about when to use CQs or Kapacitor to work out which best suits your needs. If you have a small amount of data to process, then CQs might be the better option (one less service to run).

As mentioned though, I am using an older version of Influx. You have this tagged as influxdb2, so there may be a better way of doing this in InfluxDB 2 - I don't know; it's still in development, so we haven't tested it at work.

Hope that helps, let me know if there’s anything you’re unsure of.

edit: things to consider:

  1. Tag values: if you want to keep/retain all tags in the measurement then you must group by them, otherwise they won't be inserted back into the database. You can group by all of them with (*).
  2. If your default RP (raw data) isn't already set to 30 days, you would need to amend the RP (see the example after the caution below).

Caution: changing this to 30 days will cause any data older than 30 days to be dropped.
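
For reference, amending the default RP would look something like this (this assumes a stock install where the default RP is named autogen):

ALTER RETENTION POLICY "autogen" ON "yourdatabase" DURATION 30d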

Another point: my batch script above renames the fields to mean_field_name. If you want to preserve dashboards and field names, you can change AS "mean_field_name" to just AS "field_name"; that way you could simply duplicate your dashboards and update the retention policy in the query.

Philb

Thanks for the detailed info.
The problem I am seeing is that I am already aggregating the raw data using Home Assistant.

What I did not understand is how I can have two retention policies; I guess I need to do some reading. I was originally looking for a query to delete data older than, for example, 60 days, using the WHERE clause to select which data I want to delete. I don't mind doing it manually every month.

Inside of InfluxDB, you have “databases”. A database has one or more “retention policies”. A retention policy acts as a wrapper for data that should be expired after a given period of time.

This is the general flow of aggregating (downsampling) data over the data’s lifetime.

  1. Full-precision data is written and stored for a brief period of time. What “brief” means depends on your use case, but let’s just say one month.
  2. After a month, a continuous query (or Kapacitor task) downsamples that data into a lower precision. It does this by grouping points into windows of time and aggregating the values in each window. For example, you could window the data into 5-minute intervals and calculate the average of the values within each window. That average then becomes the new downsampled data point. (See the CQ sketch below.)
  3. The downsampled, lower precision data is then stored in another retention policy with a longer retention period.
  4. The high precision data in the shorter retention policy ages out and is dropped from the database, but the lower precision version of the data remains in the other retention policy.

This process of downsampling and storing lower precision data can be repeated as many times as you need to. You just need to design the flow and balance data precision with disk usage over time. The higher precision your data, the more disk it will use.
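
As a rough sketch of step 2 using a continuous query, assuming a database named mydb with a 30days retention policy for raw data, a 2years policy for aggregates, and a hypothetical power measurement:

CREATE CONTINUOUS QUERY "cq_power_hourly" ON "mydb"
BEGIN
  SELECT mean("value") AS "value"
  INTO "mydb"."2years"."power"
  FROM "mydb"."30days"."power"
  GROUP BY time(1h), *
END

This runs automatically every hour, averaging the previous hour of raw points into the longer-lived retention policy.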

The problems I am seeing are the following:
I have only one database, and the downsampled data is being written to the same database.

Can I just delete old entries using a query? For example, in SQL I have the following:

DELETE FROM on_search
WHERE search_date < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 180 DAY))

@mouthpiec In InfluxDB, a database can have multiple retention policies. If you're downsampling data, you shouldn't write it to the same retention policy as your high-resolution data. If you do, you're mixing (and possibly overwriting) high-resolution data with downsampled data.

Allowing retention policies to expire old data is a lot safer and A LOT more performant than running DELETE commands. DELETE commands are very expensive.
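
For completeness, the InfluxQL equivalent of your SQL example would be something like this (the measurement name is a placeholder):

DELETE FROM "your_measurement" WHERE time < now() - 180d

But again, this is the expensive path; retention policies do the same job automatically and far more cheaply.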

You can have one database, but you should create multiple retention policies in that database. Store high resolution data in a short retention policy and create a continuous query to downsample into another retention policy with a longer retention period. InfluxDB is designed to accommodate this specific workflow.