DROP SHARD and retention policy deletion check cause massive memory spike and OOM

InfluxDB v1.7.6

Seeing huge memory spikes roughly every hour (see image below), which result in OOM and InfluxDB being killed.

[image: memory usage graph showing hourly spikes]

Memory increases from 14 GB to 47 GB, which causes the OOM killer to kick in and kill InfluxDB.

The problem can be reproduced by running DROP SHARD <id>.

The retention policy is below:

show retention policies;
name          duration  shardGroupDuration replicaN default
----          --------  ------------------ -------- -------
cmk_retention 2160h0m0s 24h0m0s            1        true
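
For reference, that policy corresponds to roughly this statement (the database name cmk is taken from the data path below):

influx -execute 'CREATE RETENTION POLICY "cmk_retention" ON "cmk" DURATION 2160h REPLICATION 1 SHARD DURATION 24h DEFAULT'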

I also noticed that I appear to have an orphaned shard: it can be seen in /influxdb/data/cmk/cmk_retention but does not show up in SHOW SHARD GROUPS. If I attempt to drop this shard, memory increases as described above, but the shard fails to be removed:

drop shard 666;
ERR: no data received
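
For reference, a quick way to see the orphan (a sketch using the paths above; assumes the influx CLI is on the PATH):

ls /influxdb/data/cmk/cmk_retention                 # 666 appears as a directory on disk...
influx -execute 'SHOW SHARDS' | grep cmk_retention  # ...but is missing from the meta store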

Can this shard simply be deleted from /influxdb/data/cmk/cmk_retention?

And why does DROP SHARD result in massive memory usage?

Thanks

Mark


I am seeing a similar issue with the same version of InfluxDB (v1.7.6), though I did not drop the shard manually.

@markdollemore Did you try deleting the shard from the filesystem? We are encountering the same problem with InfluxDB 1.7.3 and 1.7.9.

It only happens with some of the shards. For the same database and retention policy, one shard can easily be removed using the DROP SHARD command, and then the next one triggers an OOM.
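
A quick way to watch influxd's memory while dropping a shard and catch the spike (a sketch; assumes a Linux host with the influx CLI on the PATH, and the shard id is just an example):

influx -execute 'DROP SHARD 666' &
DROP_PID=$!
# poll influxd's resident set size while the drop is running
while kill -0 "$DROP_PID" 2>/dev/null; do
    ps -o rss= -C influxd | awk '{printf "influxd RSS: %.1f GB\n", $1/1048576}'
    sleep 5
done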

Hi,

I am having the same problem, and because of this data the "max-values-per-tag" limit keeps being reached. I configured a retention policy to remove data after 1 hour, and I also have a CQ that runs every 10 minutes to downsample the data from one_day -> one_week.
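
The CQ looks roughly like this (a sketch; the field selection and measurement names are guesses, and the database name is redacted as in the shard list below):

influx -execute 'CREATE CONTINUOUS QUERY "cq_downsample" ON "xxxxxxxx"
RESAMPLE EVERY 10m
BEGIN
  SELECT mean(*) INTO "xxxxxxxx"."one_week".:MEASUREMENT
  FROM "xxxxxxxx"."one_day"./.*/
  GROUP BY time(10m), *
END'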

About every 2 hours, graphs in Grafana start showing almost no data. To fix this, I have to log in to the server, remove the shard, and then restart the InfluxDB service to make it work.
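
The manual fix looks roughly like this (a sketch; the data path assumes a default install, and the shard id is just one from the list below):

systemctl stop influxdb
rm -rf /var/lib/influxdb/data/xxxxxxxx/one_day/117   # remove the stuck shard directory
systemctl start influxdb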

Is there any workaround for this?

Below is my configuration for the database in InfluxDB:

name     	duration 	shardGroupDuration 	replicaN 	default
----     	-------- 	------------------ 	-------- 	-------
autogen  	0s       	168h0m0s           	1        	false
one_day  	1h0m0s   	1h0m0s             	1        	true
one_week 	168h0m0s 	24h0m0s            	1        	false


id  	database   retention_policy 	shard_group 	start_time           	end_time             	expiry_time          owners
--  	--------   ---------------- 	----------- 	----------           	--------             	-----------          ------
117 	xxxxxxxx   one_day          	117         	2020-01-26T09:00:00Z 	2020-01-26T10:00:00Z 	2020-01-26T11:00:00Z
118 	xxxxxxxx   one_day          	118         	2020-01-26T10:00:00Z 	2020-01-26T11:00:00Z 	2020-01-26T12:00:00Z
119 	xxxxxxxx   one_day          	119         	2020-01-26T11:00:00Z 	2020-01-26T12:00:00Z 	2020-01-26T13:00:00Z
120 	xxxxxxxx   one_day          	120         	2020-01-26T12:00:00Z 	2020-01-26T13:00:00Z 	2020-01-26T14:00:00Z
114 	xxxxxxxx   one_week         	114         	2020-01-26T00:00:00Z 	2020-01-27T00:00:00Z 	2020-02-03T00:00:00Z


Current system date/time:
Sun Jan 26 11:51:20 UTC 2020

Regards,
Mudasir Mirza.

I have tested this, and deleting the shard folder from the filesystem does indeed work. After restarting InfluxDB, the DROP SHARD command can then be run; it now completes without causing an OOM and successfully removes all traces of the shard.
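
The steps, roughly (a sketch; the service name assumes a systemd install, and the path and shard id are taken from the posts above):

systemctl stop influxdb
rm -rf /influxdb/data/cmk/cmk_retention/666   # delete the orphaned shard directory
systemctl start influxdb
influx -execute 'DROP SHARD 666'              # now completes and cleans up the remaining metadata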

I have done this twice on separate occasions and it worked perfectly every time. However, I would recommend making a backup of the InfluxDB storage filesystem before attempting this, just in case.