It seems impossible to delete data with the default autogen retention policy active.
Any workaround suggested? Is it possible to configure an infinite retention policy that doesn’t trigger this error?
delete FROM "autogen"."h" WHERE "entity_id"='tempo_pdc_target_acs_temperature_48' AND time > '2022-12-29T15:11:49'
ERR: error parsing query: retention policy not supported at line 1, char 1
Hello @stayorso,
Welcome! This is solved in 2.x if you want to upgrade.
But "autogen" isn't the name of your db, correct?
Can you try a command like:
> DROP SERIES FROM "h2o_feet" WHERE "location" = 'santa_monica'
where h2o_feet is the name of your measurement?
Thank you.
No, autogen is not the name of the db, it’s the retention name.
I've also tried to copy all the data to another db with a different retention policy name (still "infinite"), but the bug persists.
I would love to move to 2.x and I've tried many times, but I still haven't managed to transfer the "old" db.
Using delete from "%" where "entity_id"='h5055_batt'; appears to run (there is no output, so it fails silently), but a select shows/confirms: the data still exists.
The retention policy is open-ended (Duration: ∞, so infinite, for good reasons).
Using delete from "homeassistant"."autogen"."%" where "entity_id"='entity_id=h5055_batt'; gives
ERR: error parsing query: retention policy not supported at line 1, char 1
HOW CAN I FREAKING GET RID OF DATA (a single series)?
When you use DELETE or DROP SERIES, you can’t scope the request to a retention policy. It’s scoped to a measurement and is deleted across all retention policies in a database.
Only include the measurement name in the DELETE clause. Do not use a fully-qualified measurement that includes the database and retention policy.
InfluxQL does not support the % wildcard character. In your DELETE statement, it’s treating the % as an identifier literal and is looking for a measurement named %. If you want to delete data from multiple measurements, you have to list each one.
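So a well-formed, unscoped delete would look something like this (the measurement name here is hypothetical, substitute your actual measurement):
DELETE FROM "example_measurement" WHERE "entity_id"='h5055_batt'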
Is entity_id= included in your actual entity_id tag values? Or should it just be "entity_id"='h5055_batt'?
I’d also recommend using DROP SERIES over DELETE since DROP SERIES will also update the index. DELETE just modifies the underlying TSM files, but it doesn’t update the index.
DROP SERIES FROM "example_measurement_1", "example_measurement_2" WHERE "entity_id"='h5055_batt'
Also note that delete operations don't happen immediately and can take time and a lot of system resources to complete. This has to do with how InfluxDB stores and compacts data over time. All data is stored in "shards", time-based groupings of data. As shards get older, InfluxDB performs different levels of compaction on them to optimize disk storage. Shards that are currently being written to are considered "hot" and are not compacted. Compacted shards are considered "cold." You can query cold shards, but it hurts query performance because InfluxDB has to decompress each shard and read the data out of it.
Deletes are essentially a query that has to scan shards to identify which shards match your delete predicate. The DROP SERIES command shown above is unbounded, meaning it’s not limited to a specific time range. So the request will have to scan all data, compacted or not, that matches the predicate.
All this to say… tread carefully. Careless, unbounded deletes have been known to take a database offline. Also, some delete requests can take a long time depending on how much data matches the predicate and how old/compacted that data is.
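If you do want to bound a delete, DELETE (unlike DROP SERIES) accepts a time predicate in the WHERE clause, which limits the scan to the shards covering that range. A sketch, with a hypothetical measurement name and cutoff:
DELETE FROM "example_measurement_1" WHERE "entity_id"='h5055_batt' AND time < '2023-01-01T00:00:00Z'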
Thank you so much for your previous post and also this one. Based on your previous one alone I managed to achieve what I was looking for: removing series from the database. No clue how I came up with `DELETE FROM` instead of the (successful) `DROP SERIES`.
I (partly auto-)generated (and carefully reviewed twice) a .txt file with one DROP SERIES statement per line and used it with `influx -precision rfc3339 -database 'homeassistant' -username <uname> -password <pwd> < "/share/deletion_list.txt"`. It contains ~600 lines, and while the CPU keeps calm, it actually stresses the disk quite a lot.
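In case anyone wants to do the same, the file is just one statement per line, something like (measurement names are made up here):
DROP SERIES FROM "example_measurement_1" WHERE "entity_id"='h5055_batt'
DROP SERIES FROM "example_measurement_2" WHERE "entity_id"='another_entity'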
While the bulk deletion is running, I can watch the progress indirectly by using SELECT numSeries FROM "_internal"."monitor"."database" WHERE "database"='homeassistant' ORDER BY time DESC LIMIT 1
which gives the `numSeries`. And that is decreasing reaaaally slowly, because of everything you said: hot and cold, it takes some time to find and purge. So your post really helped me actually (better) understand how it works under the hood.
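In case it helps anyone, here is a small shell sketch to poll that value (assuming the influx CLI and its -execute flag; add -username/-password as in the bulk command above):
# print numSeries every 10 seconds
while sleep 10; do
  influx -execute "SELECT numSeries FROM \"_internal\".\"monitor\".\"database\" WHERE \"database\"='homeassistant' ORDER BY time DESC LIMIT 1"
done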
One last question while waiting for the bulk deletion to complete. It might be a little off-topic, but it's of HUGE interest to me (only asking here as you seem to be very skilled):
Are you aware of a `SELECT` statement (or any other possibility) to get a list of series (and also measurements) with the "most data" (in terms of records or storage used), starting with the largest set and sorted in descending order?
In classic SQL I would do this; unfortunately I'm not skilled enough to translate the syntax into something that works for InfluxDB:
SELECT entity_id, COUNT(*) as count FROM states GROUP BY entity_id ORDER BY count DESC LIMIT 100
This way I’d like to identify which series are (actually or very likely, based on the storage data) consuming the most disk space. Actual problem: database grew really REALLY big and I need to sort out things. With that list I could also decide which of those “top scorers” actually need to be stored in the InfluxDB at all.
This is typically what I’d expect. On older, smaller machines, CPU can become a bottleneck, but it looks like you have plenty of cores to process what needs to be processed.
Happy to help!
It’s been a while since I’ve dug around in InfluxDB v1 internal metrics. There’s a lot of data in there, but I don’t think there’s data that could easily identify the storage of each series or measurement.
@bcutter, do you happen to have Flux enabled in your InfluxDB v1 instance? InfluxQL has limitations that make a query like you posted above not work. However, if you were able to use Flux, getting that information would be pretty easy.
Ok, looking at the docs for that add-on, you can enable Flux in your InfluxDB v1 instance by updating your add-on configuration with the INFLUXDB_HTTP_FLUX_ENABLED environment variable:
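INFLUXDB_HTTP_FLUX_ENABLED=true
In InfluxDB 1.x, environment variables like this map to settings in influxdb.conf, so if you can edit the config file instead, the equivalent setting would be:
[http]
  flux-enabled = true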
Are you using Grafana for visualization? If so, you’ll have to create a new InfluxDB connector and select Flux for the query language (see the Grafana/InfluxDB instructions). The Flux query you’d use would look something like this:
db = "EXAMPLE_DB" // database name to query
rp = "EXAMPLE_RP" // retention policy to query
field = "EXAMPLE_FIELD" // arbitrary field to query
from(bucket: "${db}/${rp}")
|> range(start: 0)
|> filter(fn: (r) => r._measurement == "states")
|> filter(fn: (r) => r._field == field)
|> group(columns: ["entity_id"])
|> count()
|> group()
|> map(fn: (r) => ({count: r._value, entity_id: r.entity_id}))
|> sort(columns: ["count"], desc: true)
|> limit(n: 100)
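One thing to keep in mind: the count is per field, so field should be one that every series you care about actually writes (for Home Assistant numeric sensors that's typically value, but that's an assumption, so check your schema).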