Delete data from DB with autogen retention policy

It seems impossible to delete data with the default autogen retention policy activated.
Is there any suggested workaround? Is it possible to configure an infinite retention policy that doesn’t trigger this error?

delete FROM "autogen"."h" WHERE "entity_id"='tempo_pdc_target_acs_temperature_48' AND time > '2022-12-29T15:11:49'
ERR: error parsing query: retention policy not supported at line 1, char 1

Deleting data with autogen retention policy - General - InfluxData Community Forums

DELETE does not support renention policies · Issue #15180 · influxdata/influxdb (github.com)

Hello @stayorso,
Welcome! This is solved in 2.x if you want to upgrade.
But “autogen” isn’t the name of your db correct?
Can you try a command like:
> DROP SERIES FROM "h2o_feet" WHERE "location" = 'santa_monica'
where h2o_feet is the name of your measurement?
Thank you.

Hi @Anaisdg, thank you!

No, autogen is not the name of the db, it’s the retention policy name.
I’ve also tried to copy all the data to another db with a different retention policy name (but still “infinite”), and the bug persists.

I would love to move to 2.x and have tried many times, but I still haven’t managed to transfer the “old” db.

I simply can’t delete any data.

Using delete from "%" where "entity_id"='h5055_batt'; seems to work (no output, so it apparently fails silently), but doing a select shows/confirms: the data still exists.

The retention policy is open-ended (Duration: , i.e. infinite - for good reasons).

Using delete from "homeassistant"."autogen"."%" where "entity_id"='entity_id=h5055_batt'; gives

ERR: error parsing query: retention policy not supported at line 1, char 1

:interrobang::interrobang::interrobang: HOW CAN I FREAKING GET RID OF DATA (single series) :interrobang::interrobang::interrobang:

When you use DELETE or DROP SERIES, you can’t scope the request to a retention policy. It’s scoped to a measurement and is deleted across all retention policies in a database.
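
For example, using the measurement and tag from the query at the top of this thread (just an illustration, adjust to your own schema), this fails because it includes the retention policy:

DELETE FROM "autogen"."h" WHERE "entity_id"='tempo_pdc_target_acs_temperature_48' AND time > '2022-12-29T15:11:49'

while this form, scoped to the measurement only, is accepted and deletes the matching points across all retention policies:

DELETE FROM "h" WHERE "entity_id"='tempo_pdc_target_acs_temperature_48' AND time > '2022-12-29T15:11:49'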

@bcutter For your specific use case:

  • Only include the measurement name in the DELETE clause. Do not use a fully-qualified measurement that includes the database and retention policy.
  • InfluxQL does not support the % wildcard character. In your DELETE statement, it’s treating the % as an identifier literal and is looking for a measurement named %. If you want to delete data from multiple measurements, you have to list each one.
  • Is entity_id= included in your actual entity_id tag values? Or should it just be "entity_id"='h5055_batt'?
  • I’d also recommend using DROP SERIES over DELETE since DROP SERIES will also update the index. DELETE just modifies the underlying TSM files, but it doesn’t update the index.
DROP SERIES FROM "example_measurement_1", "example_measurement_2" WHERE "entity_id"='h5055_batt'
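
If you want to double-check afterward that the series is actually gone from the index, a quick sanity check (reusing the example names above) is:

SHOW SERIES FROM "example_measurement_1" WHERE "entity_id"='h5055_batt'

An empty result means the index no longer contains that series.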

Makes no difference, deletion not performed.

It’s not a wildcard, it’s the measurement!

That was a copy/paste error in the forum post (typo), no impact on the result.


Interesting. Please advise: what would the deletion command need to look like for this example?

I have a list of over 600 series I need to remove from the database. That’s why I want to make 110% sure the commands are working right.

DROP SERIES FROM "%" WHERE "entity_id"='batt'

Also note that delete operations don’t happen immediately and can take time and a lot of system resources to complete. This has to do with how InfluxDB stores and compacts data over time. All data is stored in “shards”: time-based groupings of data. As shards get older, InfluxDB performs different levels of compaction on them to optimize disk storage. Shards that are currently being written to are considered “hot” and are not compacted. Compacted shards are considered “cold.” You can query cold shards, but it hurts query performance because each shard has to be decompressed before the data can be read out of it.

Deletes are essentially a query that has to scan shards to identify which ones match your delete predicate. The DROP SERIES command shown above is unbounded, meaning it’s not limited to a specific time range, so the request will have to scan all data, compacted or not, that matches the predicate.

All this to say… tread carefully. Careless, unbounded deletes have been known to take a database offline. Also, some delete requests can take a long time depending on how much data matches the predicate and how old/compacted that data is.
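
One more note: if you ever do need to bound a delete to a specific time range, that’s a case for DELETE rather than DROP SERIES, because DROP SERIES doesn’t support time intervals in its WHERE clause. A rough sketch with a hypothetical measurement name:

DELETE FROM "example_measurement" WHERE "entity_id"='h5055_batt' AND time < '2022-01-01T00:00:00Z'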


Thank you so much for your previous post and also this one. Based on your previous one I already managed to achieve what I was looking for: removing series from the database. No clue how I came up with `DELETE FROM` instead of the (successful) `DROP SERIES`.

I (partly auto-)generated (and carefully reviewed twice) a .txt file with one DROP SERIES statement per line and used it with influx -precision rfc3339 -database 'homeassistant' -username <uname> -password <pwd> < "/share/deletion_list.txt". It contains ~600 lines, and while the CPU keeps calm, it actually stresses the disk quite a lot.
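
To give an idea, the statements in that file look like this (just the first couple of entries as an illustration; the real file has one line per series to drop):

DROP SERIES FROM "%" WHERE "entity_id"='h5055_batt'
DROP SERIES FROM "%" WHERE "entity_id"='batt'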

While the bulk deletion is running I can watch the progress indirectly by using
SELECT numSeries FROM "_internal"."monitor"."database" WHERE "database"='homeassistant' ORDER BY time DESC LIMIT 1
which gives the `numSeries`. And that is reaaaally slowly decreasing, because of everything you said: hot and cold shards, it takes some time to find and purge the data. So your post really helped me (better) understand how it works under the hood.


One last question while waiting for the bulk deletion to complete. It might be a little off-topic but is of HUGE interest to me (only asking here as you seem to be very skilled):

Are you aware of a `SELECT` statement (or any other possibility) to get a list of series (and also measurements) with the “most data” (in terms of records or storage used), starting with the largest set and sorting the list in decreasing order?

In classic SQL I would do this; unfortunately I’m not skilled enough to translate this syntax into something that works for InfluxDB:

SELECT entity_id, COUNT(*) as count FROM states GROUP BY entity_id ORDER BY count DESC LIMIT 100

This way I’d like to identify which series are (actually, or very likely based on the record counts) consuming the most disk space. The actual problem: the database grew really REALLY big and I need to sort things out. With that list I could also decide which of those “top scorers” actually need to be stored in InfluxDB at all.

DELETE is a valid InfluxQL statement. It’s just a little different than DROP SERIES. Both are documented here:

This is typically what I’d expect. On older, smaller machines, CPU can become a bottleneck, but it looks like you have plenty of cores to process what needs to be processed.

Happy to help! :smiley:

It’s been a while since I’ve dug around in InfluxDB v1 internal metrics. There’s a lot of data in there, but I don’t think there’s anything that could easily identify the storage used by each series or measurement.

Well, a proper count of events/records per series would already be a good starting point. What do you think?

@bcutter Define “events/records per series.” The number of points or rows that exist in each series?

@bcutter, do you happen to have Flux enabled in your InfluxDB v1 instance? InfluxQL has limitations that make a query like the one you posted above impossible. However, if you were able to use Flux, getting that information would be pretty easy.

I don’t think so, afaik, going by what is provided by the addon (repository/influxdb at master · hassio-addons/repository · GitHub) -.-

Exactly. I think that’s a good base for a rough estimation of the allocated space.

Ok, looking at the docs for that add-on, you can enable Flux in your InfluxDB v1 instance by updating your add-on configuration with the INFLUXDB_HTTP_FLUX_ENABLED environment variable:

envvars:
  - name: INFLUXDB_HTTP_FLUX_ENABLED
    value: "true"

Are you using Grafana for visualization? If so, you’ll have to create a new InfluxDB connector and select Flux for the query language (see the Grafana/InfluxDB instructions). The Flux query you’d use would look something like this:

db = "EXAMPLE_DB" // database name to query
rp = "EXAMPLE_RP" // retention policy to query
field = "EXAMPLE_FIELD" // arbitrary field to query

from(bucket: "${db}/${rp}")
    |> range(start: 0)
    |> filter(fn: (r) => r._measurement == "states")
    |> filter(fn: (r) => r._field == field)
    |> group(columns: ["entity_id"])
    |> count()
    |> group()
    |> map(fn: (r) => ({count: r._value, entity_id: r.entity_id}))
    |> sort(columns: ["count"], desc: true)
    |> limit(n: 100)