Batch versus Stream processing. Combine?

I’d like to get a discussion going on the need for both batch and stream processing. Personally, I am quite frustrated by this distinction: it is costing me a lot of time to figure out how to get even simple things done. The mixing of batch and stream documentation does not help much in this regard.

In my opinion they could be combined into just ‘processing’. Batch processing is simply stream processing over a specific time frame in the past. I am convinced Kapacitor/InfluxDB could be seriously simplified and improved by getting rid of the hard distinction between batch and stream processing.

What are your thoughts on this?

@HansKe I think there may be a bit of confusion going on here. I’ll do my best to clear it up.

Kapacitor has a DSL called TICKscript that is used to define tasks. Beneath each task is a concept called a pipeline, which is made up of individual pieces (nodes) that are connected together. A node can take in (want) either a batch or a stream, and similarly it can output (provide) either a batch or a stream. A batch is just one or more points grouped together. The type of data that a node wants and provides varies by use case:

- want a batch, provide a stream: computing the average or maximum of a batch
- want a batch, provide a batch: finding outliers in a batch of data
- want a stream, provide a batch: grouping together similar points
- want a stream, provide a stream: applying a mathematical function like the logarithm to a value in a point

All of the nodes in Kapacitor follow these want/provide semantics. There are times when this gets a bit blurred (see the InfluxQL node, where some nodes accept both batches and streams).
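To make the want/provide chaining concrete, here is a minimal TICKscript sketch (the measurement, field, and threshold are illustrative, not taken from the docs) with each node’s input and output annotated:

```
stream
    |from()
        .measurement('cpu')
    // from() provides a stream of points
    |window()
        .period(10s)
        .every(10s)
    // window() wants a stream and provides batches
    |mean('usage_user')
    // mean() wants a batch and provides a stream of single points
    |alert()
        .crit(lambda: "mean" > 90)
    // alert() wants a stream and provides a stream
```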

As I’m sure you’re aware, there are two types of tasks in Kapacitor: batch tasks and stream tasks. The two different types of tasks are the result of the two possible initial nodes in a pipeline: one that queries InfluxDB (batch) and one that subscribes to all writes to InfluxDB (stream).

So then the larger question is: why do we need both? The answer really comes down to the amount of memory you have available. Consider the following example, where we want to take an average over the last week’s worth of data, once a day.

As a batch task:

batch
    |query('SELECT mean(usage_user) AS usage_user FROM cpu')
        .period(1w)
        .every(1d)
...

In this case, data is stored in InfluxDB, and each day Kapacitor queries the week of data and yields the result to the Kapacitor pipeline. This way InfluxDB can keep the week’s worth of data on disk rather than in memory (as we’ll see in the next example).

As a stream task:

stream
    |from()
        .measurement('cpu')
    |window()
        .period(1w)
        .every(1d)
    |mean('usage_user')
...

Here we keep the entire week’s worth of data in memory. Users are often writing thousands of individual points per second; at, say, 10,000 points per second (a made-up but plausible rate), a week is 604,800 seconds, which works out to roughly six billion points sitting mostly idle in memory.

The question you might then have is: why would I ever use a stream task? The answer is that for small time windows this isn’t much of an issue, and by using a stream instead of a batch you lower the query load on InfluxDB.
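For instance, a small-window stream task like this sketch (measurement, field, threshold, and log path are all made up) only ever holds five minutes of points per host in memory, and there is no query() node anywhere, so InfluxDB is never queried:

```
stream
    |from()
        .measurement('cpu')
        .groupBy('host')
    |window()
        .period(5m)
        .every(1m)
    |mean('usage_user')
    |alert()
        .warn(lambda: "mean" > 80)
        .log('/tmp/cpu_alerts.log')
```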

@michael, thank you very much for your clear explanation. I am new to the TICK stack, and your info certainly helps put things in perspective.

Regarding the semantics of nodes that require batch and/or stream input and output: while getting acquainted with Kapacitor, it would have been helpful to me if there were a diagram visualising the pipeline concept. A matrix overview in the TICKscript Nodes section with input/output types (batch/stream) would also have been beneficial. Maybe it is worth considering adding such info to the website for other newbies.

Anyway: thanks a lot for your explanation. It is really helpful!

@HansKe Awesome! Glad I could help

Regarding the semantics of nodes that require batch and/or stream input and output: while getting acquainted with Kapacitor, it would have been helpful to me if there were a diagram visualising the pipeline concept. A matrix overview in the TICKscript Nodes section with input/output types (batch/stream) would also have been beneficial. Maybe it is worth considering adding such info to the website for other newbies.

I agree. We’ll be working on adding more Kapacitor documentation.

Doesn’t a stream task need to query InfluxDB first and then store the data in memory?
If so, it would be querying just like a batch task, wouldn’t it?

No. Stream tasks create subscriptions to InfluxDB, so each point that is written to InfluxDB is also forwarded to Kapacitor; no query is involved. (You can see these subscriptions by running SHOW SUBSCRIPTIONS in the influx CLI.)

@michael
Considering that Kapacitor is capable of consuming data straight out of Telegraf, and that according to your post batch tasks query InfluxDB, does that mean that consuming data from Telegraf in Kapacitor can only be done in stream tasks?

That is correct. Kapacitor can only consume data directly from Telegraf as a stream task, since Telegraf writes the points to Kapacitor the same way it would write them to InfluxDB.

@michael
Hi there!

I’m trying to use batch tasks to downsample (mean, max, min, and mode) all my measurements at the same time, but because of issue #10042 I can’t do it.

I’ve read the section When should we use stream tasks vs batch tasks in Kapacitor in the documentation, and I was wondering if I can use a stream task to calculate these functions when InfluxDB receives some delayed points (right now I have a 2d offset, because sometimes I receive all of a day’s points at the same time).

I’ve got 4 tasks: the first one uses a 30s period, the second one 6m, the third one 1h, and the last one 12h. The last one is probably too long, but I’m not sure if a stream task is going to work fine at least in the first two cases, or if it’s impossible with such a big delay.

Ilyasbel

@Ilyasbel - Generally, if you want to downsample or perform aggregations of some form on data points, you should use batch tasks. Stream tasks are more optimized for performing transformations on individual data points. However, if batch tasks are not working for you right now, you can use stream tasks instead, although the data will then take up memory rather than sit on disk. You seem to be keeping your time windows pretty small, though, so hopefully it shouldn’t be too much of a problem.
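As a sketch, a batch downsampling task for the 30s case might look something like this (the database, retention policy, measurement, and field names are placeholders, not from your setup); the query node’s offset() property shifts the queried window back, which is one way to pick up late-arriving points:

```
batch
    |query('SELECT mean(value) AS mean_value FROM "mydb"."autogen"."my_measurement"')
        .period(30s)
        .every(30s)
        .offset(2d)
    |influxDBOut()
        .database('mydb')
        .retentionPolicy('downsampled')
        .measurement('my_measurement')
```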

Thanks for your reply @mschae16! :slight_smile:

And what about the delayed points? Are points that arrive one day late a problem in a stream task?

Ilyasbel

@Ilyasbel To the best of my knowledge, the delayed points should not present a problem in a stream task.

Ok, thank you so much for your answer ^^