Awesome. In my mind there are a few ways you could go about structuring your data.
Option 1
Have a single measurement, stock_price, with two tags, ticker and exchange, and three fields: bid, ask, and value.
In line protocol, that would look like this:
stock_price,ticker=<symbol>,exchange=<exchange> bid=1,ask=10,value=17 <timestamp_1>
To me, this is what I would consider the standard InfluxDB structure, because you can now do queries like
SELECT max(value) FROM stock_price WHERE time > now() - 30d AND exchange = 'NASDAQ' GROUP BY time(1d), ticker
where you can do a GROUP BY ticker.
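To make that concrete, here's a minimal sketch, assuming the influxdb-python client and a hypothetical database named market_data, of running that query from Python; the single request comes back already grouped by ticker, so pulling out one symbol's daily maxima is just a filter on the result:

# Minimal sketch: run the Option 1 query through the influxdb-python client.
# The host, port, database name, and the "AAPL" ticker are placeholders.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="market_data")

result = client.query(
    "SELECT max(value) FROM stock_price WHERE time > now() - 30d "
    "AND exchange = 'NASDAQ' GROUP BY time(1d), ticker"
)

# The grouping happens server-side; filter one ticker's series out of the result.
aapl_daily_max = list(result.get_points(tags={"ticker": "AAPL"}))
print(len(aapl_daily_max), "daily points for AAPL")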
Option 2
Have a measurement for each ticker symbol, a single tag, exchange, and three fields: bid, ask, and value.
In line protocol, that would look like this:
symbol_1,exchange=<exchange> bid=1,ask=10,value=17 <timestamp_1>
While there’s nothing technically wrong with this approach, having millions of measurements is usually a bit of a red flag to me. Additionally, you lose the ability to do things like run a GROUP BY across all of the symbols, but if you rarely run queries like this, that’s less of an issue.
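For contrast, here's a rough sketch, again assuming the influxdb-python client and a database named market_data, of what Option 2 pushes onto the client: because each symbol is its own measurement, the per-ticker aggregation turns into one query per measurement instead of a single GROUP BY ticker.

# Rough sketch: the per-measurement loop that Option 2 forces on the client.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="market_data")

# Hypothetical subset of your symbols; with ~1M of them this loop is a lot of queries.
symbols = ["symbol_1", "symbol_2", "symbol_3"]
daily_max = {}
for symbol in symbols:
    # One query per measurement; under Option 1 this would be a single query.
    query = ('SELECT max(value) FROM "%s" WHERE time > now() - 30d '
             "AND exchange = 'NASDAQ' GROUP BY time(1d)") % symbol
    result = client.query(query)
    daily_max[symbol] = list(result.get_points())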
Therefore, the cardinalities of the different properties would be roughly:
symbol: 1M
type: 3
exchange: 10
Does each symbol exist on each exchange? Meaning, for a given symbol symbol_1, will all of the following series exist:
symbol_1,exchange=exchange_0
symbol_1,exchange=exchange_1
symbol_1,exchange=exchange_2
symbol_1,exchange=exchange_3
symbol_1,exchange=exchange_4
symbol_1,exchange=exchange_5
symbol_1,exchange=exchange_6
symbol_1,exchange=exchange_7
symbol_1,exchange=exchange_8
symbol_1,exchange=exchange_9
Or will it exist for only a subset of the 10 different exchanges? For example:
symbol_1,exchange=exchange_0
symbol_1,exchange=exchange_1
symbol_1,exchange=exchange_2
symbol_1,exchange=exchange_3
symbol_1,exchange=exchange_8
symbol_1,exchange=exchange_9
symbol_2,exchange=exchange_0
symbol_2,exchange=exchange_5
symbol_2,exchange=exchange_9
The motivation behind this question is determining the total series cardinality of your instance.
The naive calculation of 1M Symbols * 3 Types * 10 Exchanges brings me to 30M series. As of InfluxDB 1.2.2, the general rule of thumb for the number of series a single instance can handle is 1-5M series per 16GB of RAM, though this number varies depending on your write and query patterns.
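Spelling the arithmetic out (plain Python, no InfluxDB involved; the per-16GB figures are just the rule-of-thumb range above, not measurements):

# Back-of-the-envelope series cardinality and RAM estimate.
symbols = 1000000
types = 3
exchanges = 10

naive_series = symbols * types * exchanges       # 30,000,000 series
# Rule of thumb: roughly 1-5M series per 16GB of RAM.
ram_best_case = naive_series / 5000000.0 * 16    # ~96 GB if 5M series per 16GB holds
ram_worst_case = naive_series / 1000000.0 * 16   # ~480 GB if only 1M series per 16GB
print(naive_series, ram_best_case, ram_worst_case)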
Our setup is currently running on a 32 Core (2x16), 256GB RAM machine.
This hardware is definitely in the ballpark of what should work. Another way to maintain this type of setup would be to use a cluster to scale out.
In the next few months, we’ll have a release with the Time Series Index (TSI), which is specifically suited for these types of high-cardinality workloads. The hardware requirements to maintain this kind of schema will be drastically lower than a 32 Core (2x16), 256GB RAM machine.
HTTP or UDP?
Batching before sending? (batch size etc.) A 100ms delay before the data is available would be acceptable.
From what you’ve described, HTTP with batch sizes of around 5-8k points should be sufficient.
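As a rough sketch of what that could look like with the influxdb-python client (the host, database name, and points are placeholders; batch_size is whatever you settle on in that range):

# Rough sketch: batched HTTP writes of line protocol points.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="market_data")

# Placeholder points using the Option 1 schema; in practice these come from your feed.
points = [
    "stock_price,ticker=symbol_1,exchange=exchange_0 bid=1,ask=10,value=17",
    "stock_price,ticker=symbol_2,exchange=exchange_0 bid=2,ask=11,value=18",
]

# write_points splits the list into batches of batch_size and sends each batch
# as a single HTTP request.
client.write_points(points, protocol="line", batch_size=5000)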