We are using InfluxDB OSS 2.0 as a time-series database in our company. We deal with large amounts of data coming from stock exchanges: we save about 1 GB of data (10 million data points) daily. Our general experience with Influx is very positive and far better than with any competing solution we have tried.
However, we’ve noticed that the performance of the Influx HTTP GUI (dashboards, Data Explorer, and the whole app in general) has decreased significantly. More concretely, several types of errors occur seemingly at random while we use the GUI or try to download data. I enclose a couple of screenshots of the issues.
We’ve already inspected the memory used by Influx, and it seems that the errors occur mainly when memory usage approaches its limit and (since TSI is enabled) the indices have to be flushed to disk. Increasing the memory limit does not solve the problem: eventually there is always a point at which the indices are flushed and the issues recur. It is also not clear from the documentation how increasing the available memory influences indexing performance. We have additionally tried extending CPU capacity, essentially doubling the number of vCPUs, which did not help either.
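For context, these are the storage knobs we have been experimenting with, set as environment variables in our compose file (the option names come from influxd's storage settings; the values below are illustrative, not our production configuration):

```yaml
services:
  influxdb:
    image: influxdb:2.0
    environment:
      # Max size of a shard's in-memory cache before it rejects writes
      # (default 1 GiB; value below is just an example)
      INFLUXD_STORAGE_CACHE_MAX_MEMORY_SIZE: 536870912
      # Cache size at which a snapshot is written out to a TSM file
      INFLUXD_STORAGE_CACHE_SNAPSHOT_MEMORY_SIZE: 26214400
      # Max size of TSI index log files before compaction (default 1 MiB)
      INFLUXD_STORAGE_MAX_INDEX_LOG_FILE_SIZE: 1048576
    volumes:
      - influxdb-data:/var/lib/influxdb2
```

We were hoping that lowering the index log file size would make the flushes smaller and more frequent rather than one large dump, but we are unsure whether that is the intended use of the setting.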
The database runs as part of a microservices docker-compose app, using the official InfluxDB OSS 2.0 Docker image. The Docker environment is deployed to an AWS EC2 instance and is thus easily scalable. The storage used for persisting the data, mounted to the instance and provided as a Docker volume, is AWS Elastic File System (EFS).
Since we suspect the cause lies in the high cardinality of the data, we have set up an additional testing environment resembling the main one, the only big difference being that it holds data from just a few recent days (the main setup now has about three months of data). There, the issues seemed to be almost completely gone!
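This is how we have been measuring series cardinality on each environment, via Flux's built-in `influxdb.cardinality()` function (the bucket name below is a placeholder for ours):

```flux
import "influxdata/influxdb"

// Series cardinality of the bucket over the last 30 days
influxdb.cardinality(bucket: "stocks", start: -30d)
```

The number on the main environment is dramatically higher than on the test one, which is what led us to the cardinality hypothesis in the first place.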
Can you please guide us towards the best practices for maintaining good performance in our use case? In particular:

- How should we scale the system to achieve the best performance-to-resources ratio?
- Would adding additional tags (e.g. with dates) help?
- Can we somehow check if and how disk throughput becomes a bottleneck?
- Can we control the sizes of the indices?
- Or should we perhaps split the data, e.g. by creating a new bucket for each week (which doesn’t seem like the most elegant solution)?

I’d appreciate any suggestions or help.
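Regarding the disk-throughput question, this is the crude check we have been running ourselves so far: a small Python script that sequentially writes a file and reports MB/s, which we run once against the EFS mount and once against local instance storage to compare (the path and sizes are placeholders):

```python
import os
import tempfile
import time


def measure_write_throughput(path: str, total_mb: int = 64, chunk_mb: int = 4) -> float:
    """Sequentially write total_mb of random data to `path`, fsync it,
    remove the file, and return the observed throughput in MB/s."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(total_mb // chunk_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # force the data out of the page cache
    elapsed = time.perf_counter() - start
    os.remove(path)
    return total_mb / elapsed


if __name__ == "__main__":
    # Point this at the volume under test, e.g. the EFS mount (path is a placeholder)
    target = os.path.join(tempfile.gettempdir(), "throughput-probe.tmp")
    print(f"{measure_write_throughput(target):.1f} MB/s")
```

Of course this only captures sequential writes, not the mixed read/write pattern of index compactions, so we would welcome a pointer to a more representative methodology.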
Thank you in advance!