Recommended Setup for tick price time series from stock exchange

#1

Hello,

We are currently using a KairosDB and Cassandra setup to store up to 100K updates per second, spread over 1M symbols. We are performing fewer than 10 simple queries per second, mostly on last month’s data.
Our setup is currently running on a 32-core (2x16), 256GB RAM machine. The main problems are Cassandra’s extensive storage usage and the insufficient use of the available memory for caching.

Now, I have seen that InfluxDB has made great progress over the last few years, and we are considering migrating.

What would be the recommended setup regarding the write method and config:
HTTP or UDP?
Batching before sending (batch size, etc.)? A 100 ms delay before the data is available would be acceptable for us.
How can we use as much of the available memory as possible to reduce the latency of our queries?

Regarding the data layout: currently, we use the symbol identifier as the “measurement” (or metric) and tags to identify whether it is a bid, ask, or last price and which stock exchange it is coming from.
Is this a recommended way to set up the data for about 1M symbols and about 5 different stock exchanges?

Your help is highly appreciated!

Regards,
Michael

#2

@MichaelV

We are performing less than 10 simple queries per second mostly on last month’s data.

Can you give an example of the types of queries?

100K updates per second divided over 1M symbols

Just for clarity, do you mean that each second there will be roughly 100k inserts coming from a total pool of 1M different symbols?

Regarding the data layout: Currently, we use the symbol identifier as “measurement” or metric and tags to identify whether it is a bid, ask or last price and the stock exchange it is coming from.
Is this a recommended way to set up the data for about 1M symbols and about 5 different stock exchanges?

So your data in line protocol looks roughly like

symbol1,type=bid,exchange=NYSE value=100 <timestamp>
symbol2,type=ask,exchange=NASDAQ value=100 <timestamp>

How likely would it be that there would be data like

symbol1,type=bid,exchange=NYSE value=100 <timestamp_1>
symbol1,type=ask,exchange=NYSE value=100 <timestamp_1>
symbol1,type=value,exchange=NYSE value=100 <timestamp_1>
symbol1,type=bid,exchange=NASDAQ value=100 <timestamp_1>
symbol1,type=ask,exchange=NASDAQ value=100 <timestamp_1>
symbol1,type=value,exchange=NASDAQ value=100 <timestamp_1>

where each point has the same symbol and timestamp, but of a different exchange or type?

#3

Hi,

thanks for your reply.

A query typically fetches all data for one given symbol for the last n days (usually not more than 30 days). For instance:

SELECT * FROM symbol1 WHERE time > '2017-03-01T00:00:00Z'

In total, 100K data points will be added per second in peak times (more like 50K on average). There are 1M different symbols in total that they can be attributed to.

Therefore, the cardinalities of the different properties would be roughly:

symbol: 1M
type: 3
exchange: 10

Very likely. Most of the time, the different types are provided simultaneously with the same timestamp.

Thanks for your help!

Regards,
Michael

#4

Awesome. In my mind, there are a few ways you could go about structuring your data.

Option 1
Have a single measurement stock_price with two tags ticker and exchange, and three fields bid, ask, and value.
In line protocol that would look like this

stock_price,ticker=<symbol>,exchange=<exchange> bid=1,ask=10,value=17 <timestamp_1>

To me, this is the standard InfluxDB structure, because you can now do queries like

SELECT max(value) FROM stock_price WHERE time > now() - 30d AND exchange = 'NASDAQ' GROUP BY time(1d), ticker

where you do a GROUP BY ticker.
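As a sketch, a client could serialize each tick into this Option 1 line protocol with a small helper. This is a hypothetical Python function for illustration, not part of any official InfluxDB client:

```python
# Hypothetical helper: serialize one tick update into InfluxDB line
# protocol under the Option 1 schema (single "stock_price" measurement).
def to_line_protocol(ticker, exchange, fields, timestamp_ns):
    # Only the field values present in this update are written.
    field_str = ",".join(f"{k}={v}" for k, v in fields.items())
    return f"stock_price,ticker={ticker},exchange={exchange} {field_str} {timestamp_ns}"

line = to_line_protocol("AAPL", "NASDAQ",
                        {"bid": 1, "ask": 10, "value": 17},
                        1490000000000000000)
print(line)
# stock_price,ticker=AAPL,exchange=NASDAQ bid=1,ask=10,value=17 1490000000000000000
```

The ticker symbol, exchange, and timestamp here are made-up examples.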

Option 2
Have a measurement for each ticker symbol, a single tag exchange, and three fields bid, ask, and value.
In line protocol that would look like this

symbol_1,exchange=<exchange> bid=1,ask=10,value=17 <timestamp_1>

While there’s nothing technically wrong with this approach, having millions of measurements is usually a bit of a red flag to me. Additionally, you lose the ability to do things like run a GROUP BY across all of the symbols, but if you rarely run queries like this, that’s less of an issue.

Therefore, the cardinalities of the different properties would be roughly:

symbol: 1M
type: 3
exchange: 10

Does each symbol exist on each exchange? Meaning, for a given symbol symbol_1, will all of the following series exist:

symbol_1,exchange=exchange_0
symbol_1,exchange=exchange_1
symbol_1,exchange=exchange_2
symbol_1,exchange=exchange_3
symbol_1,exchange=exchange_4
symbol_1,exchange=exchange_5
symbol_1,exchange=exchange_6
symbol_1,exchange=exchange_7
symbol_1,exchange=exchange_8
symbol_1,exchange=exchange_9

Or will it only be defined for a subset of the 10 different exchanges? e.g.

symbol_1,exchange=exchange_0
symbol_1,exchange=exchange_1
symbol_1,exchange=exchange_2
symbol_1,exchange=exchange_3
symbol_1,exchange=exchange_8
symbol_1,exchange=exchange_9
symbol_2,exchange=exchange_0
symbol_2,exchange=exchange_5
symbol_2,exchange=exchange_9

The motivation behind this question is for determining the total series cardinality of your instance.

The naive calculation of 1M symbols * 3 types * 10 exchanges brings me to 30M series. As of InfluxDB 1.2.2, the general rule of thumb for the number of series a single instance can handle is 1-5M series per 16GB of RAM. This number varies depending on your write and query patterns.
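Written out, that back-of-envelope math looks like this (a rough sketch; the RAM figures come only from the 1-5M-per-16GB rule of thumb, not a guarantee):

```python
# Naive series-cardinality estimate and the RAM rule of thumb
# (1-5M series per 16 GB) discussed above.
symbols, types, exchanges = 1_000_000, 3, 10
naive_series = symbols * types * exchanges       # 30,000,000 series
ram_gb_best = naive_series / 5_000_000 * 16      # optimistic bound
ram_gb_worst = naive_series / 1_000_000 * 16     # pessimistic bound
print(naive_series, ram_gb_best, ram_gb_worst)   # 30000000 96.0 480.0
```

So the estimate spans roughly 96-480 GB, which is why the actual write/query patterns matter so much.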

Our setup is currently running on a 32 Core (2x16), 256GB RAM machine.

This hardware is definitely in the ballpark of what should work. Another way to maintain this type of setup would be to use a cluster to scale out.

In the next few months, we’ll have a release with Time Series Index (TSI) which is specifically suited for these types of high cardinality workflows. The hardware requirements to maintain this kind of schema will be drastically lower than the 32 Core (2x16), 256GB RAM machine.

Http or UDP?
Batching before sending? (batch size etc.) a 100ms delay before the data is available would be acceptable for us.

From what you’ve described, HTTP with batch sizes of around 5-8k points should be sufficient.
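A minimal sketch of size-plus-deadline batching in plain Python, with a pluggable `send` callable standing in for the HTTP POST to the write endpoint (the class and its parameters are assumptions for illustration, not a real InfluxDB client API):

```python
import time

# Buffer line-protocol points and flush once the batch is full or the
# oldest buffered point has waited longer than max_delay (~100 ms budget).
class BatchWriter:
    def __init__(self, send, batch_size=5000, max_delay=0.1):
        self.send = send             # e.g. an HTTP POST of the payload
        self.batch_size = batch_size
        self.max_delay = max_delay
        self.buf = []
        self.first_ts = None

    def add(self, line):
        if self.first_ts is None:
            self.first_ts = time.monotonic()
        self.buf.append(line)
        if (len(self.buf) >= self.batch_size
                or time.monotonic() - self.first_ts >= self.max_delay):
            self.flush()

    def flush(self):
        if self.buf:
            self.send("\n".join(self.buf))  # one request per batch
            self.buf = []
            self.first_ts = None

# Demo with a list standing in for the network:
batches = []
w = BatchWriter(batches.append, batch_size=3, max_delay=10.0)
for i in range(7):
    w.add(f"stock_price,ticker=T{i} bid={i}i")
w.flush()            # flush the final partial batch
print(len(batches))  # 3
```

A production version would also want retry handling and a background timer so a lone point is not stuck waiting for the next `add` call.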

#5

Hi,

thank you again for your in-depth answer.

Sometimes only one of the 3 different value types is updated. That info would get lost with the three-field setup if we, for instance, left the other values unchanged in that situation. Or can we just write a subset of the 3 fields?

We don’t use group by queries on the symbol at all.

Since you favour the first option, would that be the best choice too, after Time Series Index (TSI) is released?

No, they typically exist on 2-3 exchanges, giving us about 9M series in total. Therefore, our RAM should be sufficient, I guess?

Do I understand this correctly, the critical factor is how many time series we have in total, not the number of measurements?

So any design that groups a set of symbols (e.g. via a hash) into one measurement and then distinguishes them by a tag would not bring any advantages?

Thanks and regards,

Michael

#6

Happy to help 🙂

Sometimes only one of the 3 different values types is updated. That info would get lost with the 3 fields setup, if we, for instance, leave the other values unchanged in that situation. Or can we just add a subset of the 3 fields?

Adding a subset of the three is perfectly fine.
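For example, an update that only changes the bid could be written with just that one field (the placeholders are the same ones used above):

```
stock_price,ticker=<symbol>,exchange=<exchange> bid=1 <timestamp_2>
```

The ask and value fields for that timestamp simply stay absent instead of repeating stale values.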

We don’t use group by queries on the symbol at all.

Since you favour the first option, would that be the best choice too, after Time Series Index (TSI) is released?

I would probably still favor the first option, but I can’t say definitively which schema would perform better, especially since you don’t need the GROUP BY. Typically, when we do schema consultations, we’ll try both layouts out to see which works better for your particular write/query patterns.

No, they typically exist on 2-3 exchanges. Giving us about 9M series in total. Therefore, our RAM should be sufficient is guess?

The hardware you have specified should be enough, but write/query usage patterns can affect this. If I were doing this, I’d run some tests with sample data.

Do I understand this correctly, the critical factor is how many time series we have in total, not the number of measurements?

The number of series is determined by the total number of unique measurement, tag-set, and field-key combinations.

For example

stock,ticker=A,exchange=NYSE  value=1
stock,ticker=B,exchange=NYSE  value=2
stock,ticker=C,exchange=NYSE  value=3
stock,ticker=D,exchange=NYSE  value=4
A,exchange=NYSE  value=1
B,exchange=NYSE  value=2
C,exchange=NYSE  value=3
D,exchange=NYSE  value=4

and

stock,exchange=NYSE  A=1,B=2,C=3,D=4

all represent 4 unique series. However, each has slightly different memory-utilization patterns.
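That counting rule can be sketched as a small Python helper. This is a hypothetical illustration of the measurement × tag-set × field-key combination count described above, not an InfluxDB API, and it assumes points without trailing timestamps for simplicity:

```python
# Count unique (measurement, tag set, field key) combinations in a
# batch of timestamp-free line-protocol points.
def series_cardinality(lines):
    keys = set()
    for line in lines:
        head, field_part = line.split(None, 1)
        measurement, *tags = head.split(",")
        for field in field_part.split(","):
            field_key = field.split("=")[0]
            keys.add((measurement, frozenset(tags), field_key))
    return len(keys)

one_measurement = [
    "stock,ticker=A,exchange=NYSE value=1",
    "stock,ticker=B,exchange=NYSE value=2",
    "stock,ticker=C,exchange=NYSE value=3",
    "stock,ticker=D,exchange=NYSE value=4",
]
symbols_as_fields = ["stock,exchange=NYSE A=1,B=2,C=3,D=4"]
print(series_cardinality(one_measurement))   # 4
print(series_cardinality(symbols_as_fields)) # 4
```

Both layouts come out to the same 4 series, matching the point above.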

So any design that groups a set of symbols (i.e. via a hash) to one measurement and then distinguishes them by a tag, would not bring any advantages?

Not sure I totally follow what you’re asking here.