Odd Batching Behavior - small amounts of data

Hi all,

I encountered a new problem with my Python InfluxDB injection script.
I have an object that is in charge of injecting data in line protocol format, which I generate myself from a special kind of file format.

Background

Usually I want to write big chunks of time series data, so I use batching and configure the batch size according to the number of data points I have (divided by 3). This seems to work pretty well with my large writes (hundreds of thousands of data points).
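
Roughly, the batching setup looks like this (a simplified sketch - the URL, token, org, bucket and the example line are placeholders, and my own file-parsing code is trimmed):

from influxdb_client import InfluxDBClient, WriteOptions, WritePrecision

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")  # placeholders

# Line protocol strings generated from my own file format (shortened example)
lines = [
    "SOMETHING_40123_ Signal_1=0,Signal_2=2 1675271633832",
    # ...
]

# Batch size derived from the amount of data, as described above
batch_size = max(1, len(lines) // 3)

write_client = client.write_api(write_options=WriteOptions(batch_size=batch_size))
res = write_client.write("my-bucket", record=lines, write_precision=WritePrecision.MS)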

I also have another method for injecting a single line, which I use for a different goal.

Both of these methods work well, with good enough performance.

Problem

I'm trying to use the first method for a small amount of data as well.
For example: 28K lines, with 10 fields each.
When I do that, the batching system doesn't work properly: it doesn't send all the batches.
I track the batches and their sizes using the success/error/retry callbacks, and I can see that batches are missing (I also verified this with tcpdump by looking at the messages).
When I fall back to a batch size of 5000, it all works well. But when only 2-3 batches are required, only the first one (or sometimes two) gets sent.
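
For reference, the callbacks are hooked up roughly like this (a simplified sketch of the client library's batching-callback pattern - the URL, token and the exact logging are placeholders):

from influxdb_client import InfluxDBClient, WriteOptions
from influxdb_client.client.exceptions import InfluxDBError

class BatchingCallback:
    # Log every batch and its length so missing or failed batches are visible.
    def success(self, conf, data: str):
        print(f"Written batch: {conf}, {len(data.splitlines())} lines")

    def error(self, conf, data: str, exception: InfluxDBError):
        print(f"Cannot write batch: {conf} due to: {exception}")

    def retry(self, conf, data: str, exception: InfluxDBError):
        print(f"Retrying batch: {conf} after: {exception}")

callback = BatchingCallback()
client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")  # placeholders
write_client = client.write_api(
    write_options=WriteOptions(batch_size=9_333),  # ~28K lines / 3
    success_callback=callback.success,
    error_callback=callback.error,
    retry_callback=callback.retry,
)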

What should I do? Is this an issue that I should report here or on GitHub?
I can set the write option to synchronous, which works well for the small writes, but I'm unsure about the large ones (especially performance-wise).
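
The synchronous option I mean is roughly this (sketch; the URL, token, bucket and example line are placeholders):

from influxdb_client import InfluxDBClient
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")  # placeholders
write_api = client.write_api(write_options=SYNCHRONOUS)

# Synchronous mode sends the records immediately and blocks until the server responds,
# so nothing sits in a batch buffer - but every call pays the full round trip.
lines = ["SOMETHING_40123_ Signal_1=0 1675271633832"]  # example line protocol
write_api.write("my-bucket", record=lines)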

Thanks for the help,
David

Hello @DavidHy,
What version of InfluxDB are you using and which Python Client Library are you using?
Have you tried increasing your batches?
This tutorial might be useful to you:

Or this script for writing millions of points:

Hi @Anaisdg,

InfluxDB Version: 2.6
Python Client Library version: 1.36.0

I've looked at these examples, but my issue is that the settings I use for large writes don't write all of the data for "small" writes.
Another difference I've found is that in the tutorial you linked, the data is spread over a long period, so converting it to line protocol would produce lots of "short" lines. In my case every line has up to 40 (configurable) different fields in it, and the time difference between lines is small (0.01 to 1 second).

To work around it, I currently set the batch size to 5000 when a couple of conditions are present, but it's a very flawed workaround which I would like to replace with a proper solution.
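
The workaround is essentially this (sketch - the threshold below is only an illustration of the "couple of conditions", not my real checks):

# Hypothetical condition standing in for my real checks
if len(lines) < 30_000:
    batch_size = 5_000                    # fall back to a small, known-good batch size
else:
    batch_size = max(1, len(lines) // 3)  # normal sizing for large writes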

Adding to that:

I'm beginning to suspect that the issue lies somewhere else, and not with the size of the written data.

Before calling the write command I printed one line that I’m writing. It looks like this:

SOMETHING_40123_ Signal_1=0,Signal_2=2,Signal_3=0,Signal_4=1,Signal_5=15,Signal_6=0,Signal_7=3,Signal_8=0,Signal_9=1,Signal_10=1,Signal_11=3,Signal_12=3,Signal_13=0.0,Signal_14=0.0,Signal_15=0,Signal_15 1675271633832

The only thing is that there are actually 40 signals; I just thought it was pointless to write out (and manually rename) all of them. But this is how the line looks.

In this case, a new line appears every 20 ms.

There are 12943 lines.

The batch size is set to 173000, and the other write options are:

flush_interval=100,
jitter_interval=2_000,
retry_interval=2_500,
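
These go into the WriteOptions used to build the write API, roughly like this (sketch; the URL, token and org are placeholders):

from influxdb_client import InfluxDBClient, WriteOptions

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="ORG_ID")  # placeholders
write_client = client.write_api(write_options=WriteOptions(
    batch_size=173_000,      # the calc_batch_size value from the log below
    flush_interval=100,      # ms
    jitter_interval=2_000,   # ms
    retry_interval=2_500,    # ms
))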

I then call:

res = write_client.write(
    self.influx_bucket,
    record=lines,
    write_precision=WritePrecision.MS
)

The logs that I get are (I censored the ORG_ID although probably not important):

DEBUG:root:Writing 12943 lines to influxDB
!!!+++++++ calc_batch_size=173000.0
DEBUG:Rx:timeout: 1.527539
DEBUG:root:Written batch: ('HETT', 'ORG_ID', 'ms')
DEBUG:Rx:timeout: 1.925828
DEBUG:root:Written batch: ('HETT', 'ORG_ID', 'ms')
DEBUG:root:The data has been injected by the database!
writing lines took: 9.076039791107178

You can see that one batch is missing.

In InfluxDB I can see that part of the data is missing.

I've checked that the lines contain all the data, but not all of it gets written.

Is there an obvious mistake in my process?

Thanks for the help

David

Hi @Anaisdg,

Continuing this thread, I've done a couple of experiments, and I can't explain the behavior I'm seeing:

  1. I generated a file with the lines I want to write. Writing the lines from that file works every time, like a charm.
  2. Writing the lines from within my script doesn't - unless the flush interval is set to more than 100 ms. As mentioned above, the flush_interval I had set was 100 ms. When I change it to 500 or 1_000, the data is fully written, but the write takes longer. The difference between 500 ms and 1000 ms is about 3 seconds, which adds up to a long time when writing data continuously (see the sketch after this list).
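
Concretely, the only thing that changes between the failing run and the working run is roughly this (sketch):

from influxdb_client import WriteOptions

# Fails: not all batches get sent from within my script
failing = WriteOptions(batch_size=173_000, flush_interval=100,
                       jitter_interval=2_000, retry_interval=2_500)

# Works: all the data is written, but the overall write takes noticeably longer
working = WriteOptions(batch_size=173_000, flush_interval=500,  # or 1_000
                       jitter_interval=2_000, retry_interval=2_500)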

First of all, is it a bug that I lose data if the flush interval is too low? I can see in a Wireshark trace that the data isn't being sent.

Secondly, how can I optimize the jitter_interval for my needs? Just trial and error?

Hey, I'm having basically the same problem. I have set up my write_api exactly following the tutorial example, and I find that in some cases it writes fine, while in others it appears to succeed (i.e. it does not return any errors) but nothing is actually written. Because it's somewhat sporadic, I can't see a pattern for why it works or doesn't. Playing with the flush interval and batch sizes changed which cases succeeded and which failed, but introduced other failure modes (e.g. 80% of the data would be written but a chunk would be missing). In all cases the write_api returns no errors, and I only spot the problem by examining the written data myself. What's going on?
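
For context, my setup follows the standard batching example roughly like this (a sketch - the URL, token, bucket and records are placeholders, and the option values are just the ones from the example I copied):

from influxdb_client import InfluxDBClient, WriteOptions

records = ["my_measurement,tag=a value=1"]  # placeholder line protocol

with InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org") as client:
    # The write_api is used as a context manager, as in the example I followed,
    # so it should flush any pending batches when the block exits.
    with client.write_api(write_options=WriteOptions(batch_size=500,
                                                     flush_interval=10_000,
                                                     jitter_interval=2_000,
                                                     retry_interval=5_000)) as write_api:
        write_api.write(bucket="my-bucket", record=records)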