I’m experiencing scaling issues with InfluxDB and I’m trying to understand what the best solution to my problem might be. I store stock trade data in the following schema:
measurement: trades
keys: symbol (28000), exchange (17)
fields: price (float), size (int)
timestamp: ns precision (so duplicate trades don’t need another tag)
This gives an estimated series cardinality of 476k; SHOW SERIES CARDINALITY reports 469k.
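For reference, the 476k estimate is just the product of the tag value counts (a quick sanity check in Python):

```python
# Estimated series cardinality = product of distinct values per tag key.
symbols = 28_000    # distinct values of the "symbol" tag
exchanges = 17      # distinct values of the "exchange" tag

estimated_cardinality = symbols * exchanges
print(estimated_cardinality)  # 476000
```

The actual count (469k) is lower because not every symbol trades on every exchange.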
I have on average 26 million points in a day, for all market days in a year (about 250 days/year). I have 5 years of data. I used the default shard size of one week.
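That works out to roughly 32.5 billion points in total:

```python
# Rough total point count across the whole dataset.
points_per_day = 26_000_000
market_days_per_year = 250
years = 5

total_points = points_per_day * market_days_per_year * years
print(f"{total_points:,}")  # 32,500,000,000
```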
I tried running the following query:
SELECT first(price) as o, max(price) as h, min(price) as l, last(price) as c, sum(size) as v
FROM trades WHERE size >= 100
GROUP BY time(1m), symbol fill(none)
but InfluxDB used all my system resources and crashed with out-of-memory errors. I tried breaking the query down by weeks using a WHERE
clause, but it also crashed. Only breaking it down by days worked, and even then the queries were extremely slow and did not utilize even half my CPU or disk. Based on the hardware sizing guidelines, I had hoped my 6-core, 16 GB RAM workstation would be sufficient for a “moderate load”.
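For reference, the per-day breakdown is just a loop over day-sized time windows, roughly like this sketch (the `day_windows` helper and the dates are illustrative, not my actual code):

```python
from datetime import datetime, timedelta

def day_windows(start, end):
    """Yield (start, end) RFC3339 timestamp pairs covering [start, end) one day at a time."""
    cur = start
    while cur < end:
        nxt = min(cur + timedelta(days=1), end)
        yield cur.isoformat() + "Z", nxt.isoformat() + "Z"
        cur = nxt

# Each window fills the time bounds of a per-day query.
for lo, hi in day_windows(datetime(2018, 11, 1), datetime(2018, 11, 4)):
    print(f"... WHERE time >= '{lo}' AND time < '{hi}' ...")
```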
I’m currently reducing my cardinality by dropping the exchange (17) tag. I’m also switching to an infinite shard group duration, since I never need to delete data. However, running this query per day to insert into my new database is taking 300+ seconds per day:
SELECT * INTO us_equities2."almostINF".trades
FROM trades WHERE time >= '{}' AND time < '{}'
GROUP BY symbol
and hardly any of my system resources are being used. That is a pitiful write speed of 87k points/s (26 million points in 300 seconds), which is less than I achieved when inserting into the database in the first place. Also, despite setting my configuration to use the tsi1 index by default for new shards, InfluxDB seems to be using .tsm files when I peek inside my data directory. Here is my config, with not much changed:
[data]
dir = "E:/InfluxDB/data"
wal-dir = "C:/Program Files/influxdb-1.7.1-1/wal"
# wal-fsync-delay = "0s"
index-version = "tsi1"
# trace-logging-enabled = false
# query-log-enabled = true
# validate-keys = false
# cache-max-memory-size = "1g"
cache-snapshot-memory-size = "25m"
cache-snapshot-write-cold-duration = "10m"
compact-full-write-cold-duration = "4h"
# max-concurrent-compactions = 0
# compact-throughput = "48m"
# compact-throughput-burst = "48m"
# max-index-log-file-size = "1m"
# max-series-per-database = 1000000
# max-values-per-tag = 100000
# tsm-use-madv-willneed = false
I understand I can use the influx_inspect buildtsi tool to build tsi1 index files for my existing shards, which I plan on doing once my queries complete. However, even after all this, I’m afraid my aggregate queries will crash again.
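As I understand the docs, the command would be something like this (paths taken from my config above; run while influxd is stopped):

```shell
# Build tsi1 index files for all existing shards from the tsm/wal data.
influx_inspect buildtsi -datadir "E:/InfluxDB/data" -waldir "C:/Program Files/influxdb-1.7.1-1/wal"
```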
In the meantime, with 70 hours of migration ahead of me, I’m seriously questioning InfluxDB. Will tsi1 solve my problems? Do I just need more RAM? How much more? Is InfluxDB unsuited for these types of queries because of how its storage engine works? And why in the world is InfluxDB so slow inserting back into itself?
Thanks for any help.