We are using Spark to write CSV files into InfluxDB, in batches of 5-10k points. The total dataset is about 500 GB and contains 1 billion records. Our InfluxDB is the community edition, installed on an Azure VM with 56 GB of memory. I can confirm it takes 1-2 hours to write all the records, but we have run into problems querying the data.
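For context, this is roughly how we batch the points before each write. It is only a sketch: `chunk_points` is a placeholder name, and the real job hands each batch to an InfluxDB HTTP client inside Spark's foreachPartition.

```python
def chunk_points(points, batch_size=5000):
    """Split an iterable of line-protocol points into write-sized batches.

    Placeholder for the batching step in our Spark job; batch_size is
    in the 5-10k range mentioned above.
    """
    batch = []
    for p in points:
        batch.append(p)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly partial, batch
        yield batch

# e.g. 12,000 points -> batches of 5000, 5000, 2000
sizes = [len(b) for b in chunk_points(range(12_000))]
```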
First, we tried using two tags, named Tarrif_Code and Tarrif_description. The Spark job took two hours and finished with a successful status. I monitored the job throughout the process and can confirm that all data were written to the DB, but when I run select count(*) it reports only 2 million rows.
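To make the schema concrete, this is a hedged sketch of the line protocol we generate per record (measurement and field names here are made up; escaping of special characters and quoting of string field values are omitted for brevity):

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Build one InfluxDB line-protocol string:
    measurement,tag1=v1,tag2=v2 field1=v1 timestamp
    Simplified: no escaping or string-field quoting.
    """
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {timestamp_ns}"

# Hypothetical point using the two tags described above
line = to_line_protocol(
    "consumption",  # assumed measurement name
    {"Tarrif_Code": "T1", "Tarrif_description": "standard"},
    {"usage_kwh": 1.5},  # assumed field
    1609459200000000000,
)
```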
Second, I changed the tag key to a single, different column, the national meter identifier (e.g. "2001007868"). I ran the Spark job again and then tried the same count query, as well as select * from measurement limit 5. The query failed and returned this error: ERR: %!s()
Third, I wrote a smaller volume of data, about 40 million rows, using the same setup as step two. This time I could see the number of rows and my queries worked properly.
The questions are:
1- Why did the number of rows drop so significantly, from 1 billion to 2 million, when I chose Tarrif_Code as a tag?
2- Do you think this is a memory issue, a bad database design, or a limitation of the community edition, given that we cannot run any query against the 500 GB dataset?
Please note we have only about 10k series.