Timeouts severe loading data

I upgraded influx, and no matter how I try to write data I get timeouts. Both CLI and Python

Version: InfluxDB 2.0.8 (git: e91d41810f) build_date: 2021-08-13T18:22:30Z

I completely cleared my installation, removed all files under ~/.influxdbv2, recreated buckets, rebooted etc, ran influxd with --http-write-timeout 0

Here is a simple example – loads a few hundred lines of a file then quits

(base) [laplace(fcoetzee):/data01/projects/TMP] $ influx write -o witblits -b kibot -f influx_load.txt
Error: Failed to write data: unexpected error writing points to database: timeout.
See ‘influx write -h’ for help
(base) [laplace(fcoetzee):/data01/projects/TMP] $ wc influx_load.txt
2465000 7395000 329590623 influx_load.txt
(base) [laplace(fcoetzee):/data01/projects/TMP] $ head influx_load.txt
kibot_eod_by_sym,sym=AC__2010F close=2.13,date_gmt=20080114i,high=2.13,low=2.13,oi=0i,open=2.13,volume=0i 1200367800000000000
kibot_eod_by_sym,sym=AC__2010F close=2.1,date_gmt=20080115i,high=2.1,low=2.1,oi=0i,open=2.1,volume=0i 1200454200000000000
kibot_eod_by_sym,sym=AC__2010F close=2.06,date_gmt=20080116i,high=2.06,low=2.06,oi=0i,open=2.06,volume=0i 1200540600000000000
kibot_eod_by_sym,sym=AC__2010F close=2.07,date_gmt=20080117i,high=2.07,low=2.07,oi=0i,open=2.07,volume=0i 1200627000000000000
kibot_eod_by_sym,sym=AC__2010F close=2.08,date_gmt=20080118i,high=2.08,low=2.08,oi=0i,open=2.08,volume=0i 1200713400000000000
kibot_eod_by_sym,sym=AC__2010F close=2.03,date_gmt=20080122i,high=2.03,low=2.03,oi=0i,open=2.03,volume=0i 1201059000000000000
kibot_eod_by_sym,sym=AC__2010F close=2.02,date_gmt=20080123i,high=2.02,low=2.02,oi=0i,open=2.02,volume=0i 1201145400000000000
kibot_eod_by_sym,sym=AC__2010F close=2.02,date_gmt=20080124i,high=2.02,low=2.02,oi=0i,open=2.02,volume=0i 1201231800000000000
kibot_eod_by_sym,sym=AC__2010F close=2.02,date_gmt=20080125i,high=2.02,low=2.02,oi=0i,open=2.02,volume=0i 1201318200000000000
kibot_eod_by_sym,sym=AC__2010F close=2.03,date_gmt=20080128i,high=2.03,low=2.03,oi=0i,open=2.03,volume=0i 1201577400000000000

One thing – even if trying to load just a few hundred lines a ton of reindexing takes place, lot of disk IO

ts=2021-10-08T10:40:03.745363Z lvl=info msg=“index opened with 8 partitions” log_id=0X3sekrG000 service=storage-engine index=tsi
ts=2021-10-08T10:40:03.745398Z lvl=info msg=“index opened with 8 partitions” log_id=0X3sekrG000 service=storage-engine index=tsi
ts=2021-10-08T10:40:03.745438Z lvl=info msg=“Reading file” log_id=0X3sekrG000 service=storage-engine engine=tsm1 service=cacheloader path=/root/.influxdbv2/engine/wal/7109037767e8fb1b/autogen/668/_00001.wal size=296
ts=2021-10-08T10:40:03.745458Z lvl=info msg=“index opened with 8 partitions” log_id=0X3sekrG000 service=storage-engine index=tsi
ts=2021-10-08T10:40:03.745535Z lvl=info msg=“Opened shard” log_id=0X3sekrG000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/root/.influxdbv2/engine/data/7109037767e8fb1b/autogen/222 duration=9.825ms
ts=2021-10-08T10:40:03.745739Z lvl=info msg=“Reading file” log_id=0X3sekrG000 service=storage-engine engine=tsm1 service=cacheloader path=/root/.influxdbv2/engine/wal/7109037767e8fb1b/autogen/442/_00001.wal size=1127
ts=2021-10-08T10:40:03.745883Z lvl=info msg=“Opened shard” log_id=0X3sekrG000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/root/.influxdbv2/engine/data/7109037767e8fb1b/autogen/441 duration=14.045ms
ts=2021-10-08T10:40:03.746002Z lvl=info msg=“Reading file” log_id=0X3sekrG000 service=storage-engine engine=tsm1 service=cacheloader path=/root/.influxdbv2/engine/wal/7109037767e8fb1b/autogen/663/_00001.wal size=292
ts=2021-10-08T10:40:03.746147Z lvl=info msg=“index opened with 8 partitions” log_id=0X3sekrG000 service=storage-engine index=tsi
ts=2021-10-08T10:40:03.746629Z lvl=info msg=“Opened shard” log_id=0X3sekrG000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/root/.influxdbv2/engine/data/7109037767e8fb1b/autogen/651 duration=10.616ms
ts=2021-10-08T10:40:03.746729Z lvl=info msg=“Reading file” log_id=0X3sekrG000 service=storage-engine engine=tsm1 service=cacheloader path=/root/.influxdbv2/engine/wal/7109037767e8fb1b/autogen/382/_00001.wal size=1175
ts=2021-10-08T10:40:03.746972Z lvl=info msg=“Opened shard” log_id=0X3sekrG000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/root/.influxdbv2/engine/data/7109037767e8fb1b/autogen/442 duration=14.700ms
ts=2021-10-08T10:40:03.747095Z lvl=info msg=“Reading file” log_id=0X3sekrG000 service=storage-engine engine=tsm1 service=cacheloader path=/root/.influxdbv2/engine/wal/7109037767e8fb1b/autogen/443/_00001.wal size=1175
ts=2021-10-08T10:40:03.747226Z lvl=info msg=“Reading file” log_id=0X3sekrG000 service=storage-engine engine=tsm1 service=cacheloader path=/root/.influxdbv2/engine/wal/7109037767e8fb1b/autogen/357/_00001.wal size=1355
ts=2021-10-08T10:40:03.747248Z lvl=info msg=“Reading file” log_id=0X3sekrG000 service=storage-engine engine=tsm1 service=cacheloader path=/root/.influxdbv2/engine/wal/7109037767e8fb1b/autogen/678/_00001.wal size=296
ts=2021-10-08T10:40:03.747395Z lvl=info msg=“Opened shard” log_id=0X3sekrG000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/root/.influxdbv2/engine/data/7109037767e8fb1b/autogen/678 duration=15.050ms
ts=2021-10-08T10:40:03.745891Z lvl=info msg=“Opened shard” log_id=0X3sekrG000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/root/.influxdbv2/engine/data/7109037767e8fb1b/autogen/596 duration=12.680ms
ts=2021-10-08T10:40:03.748882Z lvl=info msg=“Opened shard” log_id=0X3sekrG000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/root/.influxdbv2/engine/data/7109037767e8fb1b/autogen/263 duration=17.022ms
ts=2021-10-08T10:40:03.749288Z lvl=info msg=“Opened shard” log_id=0X3sekrG000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/root/.influxdbv2/engine/data/7109037767e8fb1b/autogen/663 duration=17.183ms
ts=2021-10-08T10:40:03.749374Z lvl=info msg=“index opened with 8 partitions” log_id=0X3sekrG000 service=storage-engine index=tsi
ts=2021-10-08T10:40:03.749052Z lvl=info msg=“Opened shard” log_id=0X3sekrG000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/root/.influxdbv2/engine/data/7109037767e8fb1b/autogen/357 duration=12.327ms
ts=2021-10-08T10:40:03.749288Z lvl=info msg=“Opened shard” log_id=0X3sekrG000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/root/.influxdbv2/engine/data/7109037767e8fb1b/autogen/668 duration=13.420ms
ts=2021-10-08T10:40:03.750264Z lvl=info msg=“Opened shard” log_id=0X3sekrG000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/root/.influxdbv2/engine/data/7109037767e8fb1b/autogen/443 duration=13.908ms
ts=2021-10-08T10:40:03.750518Z lvl=info msg=“Opened shard” log_id=0X3sekrG000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/root/.influxdbv2/engine/data/7109037767e8fb1b/autogen/382 duration=18.688ms
ts=2021-10-08T10:40:03.750486Z lvl=info msg=“Reading file” log_id=0X3sekrG000 service=storage-engine engine=tsm1 service=cacheloader path=/root/.influxdbv2/engine/wal/7109037767e8fb1b/autogen/66/_00001.wal size=2344
ts=2021-10-08T10:40:03.752722Z lvl=info msg=“index opened with 8 partitions” log_id=0X3sekrG000 service=storage-engine index=tsi
ts=2021-10-08T10:40:03.752711Z lvl=info msg=“Opened shard” log_id=0X3sekrG000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/root/.influxdbv2/engine/data/7109037767e8fb1b/autogen/66 duration=28.367ms
ts=2021-10-08T10:40:03.753637Z lvl=info msg=“Reading file” log_id=0X3sekrG000 service=storage-engine engine=tsm1 service=cacheloader path=/root/.influxdbv2/engine/wal/7109037767e8fb1b/autogen/642/_00001.wal size=290
ts=2021-10-08T10:40:03.753958Z lvl=info msg=“Opened shard” log_id=0X3sekrG000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/root/.influxdbv2/engine/data/7109037767e8fb1b/autogen/642 duration=16.428ms

@Frans_Coetzee the upcoming v2.1 will add a CLI option to influxd to allow changing the default timeout for write requests, which should help you work around this.

This is the second report I’ve seen of write-performance problems in v2.0.8. I’ll make an issue to investigate.

Thanks. I think it is more than a timeout – when I upload small files I get awful performance on writes, less than 7 lines/sec. No timeout will compensate for this. It is not the machine or disk – I cleared all processes. There seems to be an insane amount of Reindexing taking place on each data point.

I rolled back to V 2.0.7 and 2.0.6 and had similar issues. Any way to set a flag to dump timing/profiling info to look at – or that I can send to you guys?

time influx write -b kibot -o witblits -f tmp2te4q0zb
Error: Failed to write data: unexpected error writing points to database: timeout.
See ‘influx write -h’ for help

real 1m34.105s
user 0m0.012s
sys 0m0.006s
(base) [laplace(fcoetzee):/data01/INFLUX_LOAD_IN/rolltable] $ wc tmp2te4q0zb
696 2088 91035 tmp2te4q0zb
(base) [laplace(fcoetzee):/data01/INFLUX_LOAD_IN/rolltable] $ bc -l
bc 1.07.1
Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006, 2008, 2012-2017 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty’.
696/94.105
7.39599383667180277349

It should sail through this

$ head tmp2te4q0zb
by_utc,gsym=SIN_16,prefix=SIN,roll_no=16 date_roll=19900122i,new_contract=“SIN_1991J”,old_contract=“SIN_1991H” 632956800000000000
by_utc,gsym=SIN_16,prefix=SIN,roll_no=16 date_roll=19900122i,new_contract=“SIN_1991K”,old_contract=“SIN_1991J” 632956800000000000
by_utc,gsym=SIN_16,prefix=SIN,roll_no=16 date_roll=19900219i,new_contract=“SIN_1991M”,old_contract=“SIN_1991K” 635376000000000000
by_utc,gsym=SIN_16,prefix=SIN,roll_no=16 date_roll=19900326i,new_contract=“SIN_1991N”,old_contract=“SIN_1991M” 638400000000000000
by_utc,gsym=SIN_16,prefix=SIN,roll_no=16 date_roll=19900423i,new_contract=“SIN_1991Q”,old_contract=“SIN_1991N” 640815600000000000
by_utc,gsym=SIN_16,prefix=SIN,roll_no=16 date_roll=19900528i,new_contract=“SIN_1991U”,old_contract=“SIN_1991Q” 643839600000000000
by_utc,gsym=SIN_16,prefix=SIN,roll_no=16 date_roll=19900625i,new_contract=“SIN_1991V”,old_contract=“SIN_1991U” 646258800000000000
by_utc,gsym=SIN_16,prefix=SIN,roll_no=16 date_roll=19900723i,new_contract=“SIN_1991X”,old_contract=“SIN_1991V” 648678000000000000
by_utc,gsym=SIN_16,prefix=SIN,roll_no=16 date_roll=19900827i,new_contract=“SIN_1991Z”,old_contract=“SIN_1991X” 651702000000000000
by_utc,gsym=SIN_16,prefix=SIN,roll_no=16 date_roll=19900924i,new_contract=“SIN_1992F”,old_contract=“SIN_1991Z” 654121200000000000

Based on tests it seems write performance is orders of magnitude worse at nano-second vs microsecond resolution.

At microsecond resolution you can get reasonably stable performance with CLI influx write… from files.

The Python influxdb_client also does a lot better but often still truncates data or batch loads not yet started by the time the program exits or client.close() is called.

@Frans_Coetzee what are the retention-duration and shard-group-duration values for the bucket? You can find both with influx bucket list if you aren’t sure.

The “reindexing” you’ve mentioned is probably a red herring, as it’s a log that’s printed any time a shard is created/opened w/o an existing index on disk (i.e. every time a new shard is created). The fact that so many new shards are being created could be the issue.

Here is what I see when I list bucket

(base) [laplace(fcoetzee):/mnt/nvme01/home/fcoetzee/fcoe_base/src/python/hobbes/datasources] $
influx bucket list
ID Name Retention Shard group duration Organization ID
67ff7a90b454e358 _monitoring 168h0m0s n/a e2c37aa8765dad8f
6585cd0cee1da475 _tasks 72h0m0s n/a e2c37aa8765dad8f
0a9686fb98b2a49a federalreserve infinite n/a e2c37aa8765dad8f
bad63ca6144499e2 kibot infinite n/a e2c37aa8765dad8f
c58286ba99806dd7 norgate infinite n/a e2c37aa8765dad8f
d56d74cf57fdc222 quandl infinite n/a e2c37aa8765dad8f
37e0bd4aeb303d82 rolltable infinite n/a e2c37aa8765dad8f
1bc087a4b3a72707 stats infinite n/a e2c37aa8765dad8f

I’m surprised to see the "n/a"s in the output there… I’d only expect that placeholder to appear when listing buckets from InfluxDB Cloud or an OSS version < 2.0.5. To double-check, could you run the following commands and let me know the outputs?

  • influx config
  • curl <your-db-url>/health
  • influx version
  • influxd version

And a few more questions:

  • In your original post you say you “upgraded influx”. Assuming you upgraded from 1.8, were you using the inmem index, or tsi1?
  • Is your disk an HDD, or an SSD?

Thanks for all the help

Running your commands:

  • default http://localhost:8086 witblits
    (base) [laplace(fcoetzee):/mnt/nvme01/home/fcoetzee/fcoe_base/src/python/hobbes/datasources] $ curl http://localhost:8086/health
    {“name”:“influxdb”, “message”:“ready for queries and writes”, “status”:“pass”, “checks”:, “version”: “2.0.8”, “commit”: “e91d41810f”}
    (base) [laplace(fcoetzee):/mnt/nvme01/home/fcoetzee/fcoe_base/src/python/hobbes/datasources] $ influx version
    Influx CLI 2.0.8 (git: e91d41810f) build_date: 2021-08-13T18:22:30Z
    (base) [laplace(fcoetzee):/mnt/nvme01/home/fcoetzee/fcoe_base/src/python/hobbes/datasources] $ influxd version
    InfluxDB 2.0.8 (git: e91d41810f) build_date: 2021-08-13T18:22:30Z
    (base) [laplace(fcoetzee):/mnt/nvme01/home/fcoetzee/fcoe_base/src/python/hobbes/datasources] $

As to your questions re upgrading from old 1.8 to current:

(i) Re inmem or tsi1, I did not set/configure anything. Assume it was thus tsi1? For install, I used rpms, but blew those way (and all related /etc files etc) and am just using binaries directly from tar now.
Note, I completely removed the .influxdbv2 data from my old 1.8 setup,and I am just experimenting on a completely clean install.

(ii) The influxdbv2 directory is on an HDD

Based on your earlier comments I cleaned and re-installed, recreated buckets with shard created explicitly so when I look at buckets the shards now don’t show n/a duration. Not sure if they became corrupted earlier or were improperly created but they look right now

(base) [laplace(fcoetzee):/mnt/nvme01/home/fcoetzee/fcoe_base/src/python/hobbes/datasources] $ influx -o witblits bucket create -n qa --shard-group-duration 0
ID Name Retention Shard group duration Organization ID
eac81f07137504dd qa infinite 168h0m0s b1cf3ebaa7f9a596

Unfortunately the performance is still slow, with lot of timeouts. I am getting a few 100 lines/sec with CLI, and if the file is large I tend to get timeouts:

2021-10-11T14:45:41E 2021-10-11T18:45:41Z laplace.witblits.com InfluxD: Loading fn /data01/INFLUX_LOAD_IN/rolltable/load_c2d6a1878a.txt 0/1 wc=1624 rate = 738.25 lines/sec
2021-10-11T14:45:49E 2021-10-11T18:45:49Z laplace.witblits.com InfluxD: Loading fn /data01/INFLUX_LOAD_IN/rolltable/load_dd2fa2915e.txt 0/1 wc=1624 rate = 596.31 lines/sec
2021-10-11T14:45:56E 2021-10-11T18:45:56Z laplace.witblits.com InfluxD: Loading fn /data01/INFLUX_LOAD_IN/rolltable/load_c6576efafb.txt 0/1 wc=4640 Error: Failed to write data: unexpected error writing points to database: timeout.
See ‘influx write -h’ for help
rate = 374.34 lines/sec
.2021-10-11T14:46:14E 2021-10-11T18:46:14Z laplace.witblits.com InfluxD: Loading fn /data01/INFLUX_LOAD_IN/rolltable/load_2b3c1f630c.txt 0/1 wc=11832 Error: Failed to write data: unexpected error writing points to database: timeout.
See ‘influx write -h’ for help

FWIW I used to easily get around 20,000 lines/sec pre upgrade (and via Python)

May be a red herring, but I do see that when I wrote timestamps to the text file I am loading (line protocol) and file timestamps are in us, and I load from CLI with precision ns, I get fast response, but when I set CLI to have precision of ‘us’ it is slow. Not sure how you index on time but this seems very consistent. Note – for my application (finance) I need nano-second precision (so I can figure out if a quote A happened before quote B, but quotes show up in bursts).

(base) [laplace(fcoetzee):/mnt/nvme01/home/fcoetzee] $ influx -o witblits bucket create -n clean --shard-group-duration 0
ID Name Retention Shard group duration Organization ID
723c710bc7eb5d14 clean infinite 168h0m0s b1cf3ebaa7f9a596
(base) [laplace(fcoetzee):/mnt/nvme01/home/fcoetzee] $ time influx write -o witblits --bucket clean -f /data01/INFLUX_LOAD_IN/rolltable/tmp_c04a803e7c -p us
Error: Failed to write data: unexpected error writing points to database: timeout.
See ‘influx write -h’ for help

real 1m36.544s
user 0m0.016s
sys 0m0.005s

Create a clean bucket, and load exact same data with nano-second precision (wrong but table are ok)

(base) [laplace(fcoetzee):/mnt/nvme01/home/fcoetzee] $ influx -o witblits bucket create -n clean_ns --shard-group-duration 0
ID Name Retention Shard group duration Organization ID
df1f69da97ba5138 clean_ns infinite 168h0m0s b1cf3ebaa7f9a596
(ba
(base) [laplace(fcoetzee):/mnt/nvme01/home/fcoetzee] $ time influx write -o witblits --bucket clean_ns -f /data01/INFLUX_LOAD_IN/rolltable/tmp_c04a803e7c -p ns

real 0m2.941s
user 0m0.008s
sys 0m0.014s
(base) [laplace(fcoetzee):/mnt/nvme01/home/fcoetzee] $ head /data01/INFLUX_LOAD_IN/rolltable/tmp_c04a803e7c
by_utc,gsym=CL__-3 date_roll=19891214i,new_contract=“CL__1990F”,old_contract=“CL__1990F” 629587200000000
by_utc,gsym=CL__-3 date_roll=19891214i,new_contract=“CL__1990F”,old_contract=“CL__1990F” 629587200000000
by_utc,gsym=CL__-3 date_roll=19900116i,new_contract=“CL__1990F”,old_contract=“CL__1990F” 632438400000000
by_utc,gsym=CL__-3 date_roll=19900214i,new_contract=“CL__1990F”,old_contract=“CL__1990F” 634944000000000
by_utc,gsym=CL__-3 date_roll=19900314i,new_contract=“CL__1990F”,old_contract=“CL__1990F” 637363200000000
by_utc,gsym=CL__-3 date_roll=19900416i,new_contract=“CL__1990G”,old_contract=“CL__1990F” 640210800000000
by_utc,gsym=CL__-3 date_roll=19900516i,new_contract=“CL__1990H”,old_contract=“CL__1990G” 642802800000000
by_utc,gsym=CL__-3 date_roll=19900614i,new_contract=“CL__1990J”,old_contract=“CL__1990H” 645308400000000
by_utc,gsym=CL__-3 date_roll=19900716i,new_contract=“CL__1990K”,old_contract=“CL__1990J” 648073200000000
by_utc,gsym=CL__-3 date_roll=19900815i,new_contract=“CL__1990M”,old_contract=“CL__1990K” 650665200000000

Ok I think I see the problem (hard w/o source code)

The data I am loading is spread over 40 years (historical tick data). And some data occurs infrequently (events) while others occur frequently (quote streams). Note the event fields in the examples above are spaced months apart. So I guess when it hits a sparse event sequence like the above each new data point generates a new shard and the disk IO gets overrun ? In any case when I configure shards that are years in duration I seem to get a ton less disk activity.

Is there detailed documentation on how to set up shards for various data types in V2.0.x ? Eg if I do make shards long and then mix event and high-frequency data, will I either get (a) crappy loads (short shards) but fast queries, (b) fast loads (shards=years) but queries will be awful ?

@Frans_Coetzee there are a few things going on here.

First, the default index in 1.8 was inmem, while the only option in 2.x is tsi1. TSI is known to negatively impact ingest performance.

Second, based on your debugging output I was able to identify a bug in the influxd upgrade process that prevented shard-group durations from being properly carried forward, instead being set to 0. As you noticed, shard-group duration can have a significant effect on write performance. I need to double-check with my team before I give a final answer on your questions about the trade-offs of different strategies to setting the duration.

Third, InfluxDB is designed & tested to run on SSDs (the explanation in the 1.8 docs here applies equally well to the 2.x series). Since you weren’t seeing problems before, my guess is that the HDD + TSI is proving to be a very un-performant combination. If you’re in a position to switch to an SSD, I’d recommend it.

I’ve opened a pull request to fix the shard-group duration bug in influxd upgrade. I’ll comment again here once it’s merged & when I have the answers to your other questions.