This is still a problem in the latest version 1.5.2.
My query “show series limit 1” ran for about 15 minutes and caused massive memory allocation (a few GB) on each server in the cluster, sequentially.
The result was the first metric name (alphabetically), so I assume it does sort internally, but the sort is implemented utterly inefficiently. InfluxData needs to seriously invest in searches over distributed indices: implement the sort as a merge on top of sorted streams pulled from each node, where each stream is produced lazily, without bulk memory allocation, by hitting the appropriate persistent index.
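The merge-over-sorted-streams approach I mean can be sketched in a few lines (a minimal illustration in Python, not InfluxDB's actual code; the per-node streams and series keys here are hypothetical, and I assume each node can already return its series keys in sorted order from its index):

```python
import heapq

def merged_series(node_streams, limit):
    """Lazily merge per-node sorted streams of series keys.

    heapq.merge pulls one item at a time from each iterator, so a
    LIMIT 1 query touches only the head of each stream instead of
    materializing every series key in memory.
    """
    result = []
    last = None
    for key in heapq.merge(*node_streams):
        if key != last:          # drop duplicates across nodes
            result.append(key)
            last = key
            if len(result) == limit:
                break            # short-circuit: stop pulling from streams
    return result

# Hypothetical per-node streams, already sorted by each node's index:
node_a = iter(["cpu,host=a", "disk,host=a"])
node_b = iter(["cpu,host=b", "mem,host=b"])
print(merged_series([node_a, node_b], 1))  # only the stream heads are read
```

The point of the sketch: with LIMIT 1 the coordinator reads one key from the head of each node's stream and stops, so cost is proportional to the limit, not to the total series cardinality.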
The same problem exists for other types of metadata queries, such as finding measurements whose names start with a given prefix for a given tag=value pair.
They run for about an hour, when the user would expect milliseconds, because Grafana makes these requests for type-ahead. Unacceptable! (Interestingly, the queries do eventually succeed and the type-ahead suggestion appears, but I guess users would be a little impatient if finding/typing a measurement takes a few days.)
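For reference, these are the kinds of metadata queries involved (standard InfluxQL; the measurement prefix, tag key, and tag value below are placeholders, not from my actual schema):

```sql
-- first series key only; in my cluster this ran ~15 minutes
SHOW SERIES LIMIT 1

-- measurements matching a prefix (what type-ahead effectively asks for)
SHOW MEASUREMENTS WITH MEASUREMENT =~ /^cpu/

-- tag values for a given key, filtered by another tag
SHOW TAG VALUES WITH KEY = "host" WHERE "region" = 'us-east'
```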
I am not sure whether this is a problem only in the clustered version, because I didn’t previously see it on a single-node dev instance. If that’s the case, why would we pay for clustering? We might as well scale back to a single VM or shard the data manually.
Any advice on how to overcome this issue would be greatly appreciated.
My best ideas right now are: (1) test the same query on a single VM to see whether clustering is the problem, (2) limit the amount of historical data (on the single-node VM) to 1 day, and (3) split the data across multiple retention policies (or databases).
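For idea (3), the split could look like this (standard InfluxQL `CREATE RETENTION POLICY` syntax; the database and policy names are made up for illustration):

```sql
-- short hot window that interactive/type-ahead dashboards hit
CREATE RETENTION POLICY "hot_1d" ON "metrics" DURATION 1d REPLICATION 1 DEFAULT

-- longer history kept under a separate, non-default policy
CREATE RETENTION POLICY "history_90d" ON "metrics" DURATION 90d REPLICATION 1
```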