Influx write fails with "internal error"

Hello,

I’m trying to migrate from 1.7 to 2.x open source.
I have a CSV file with around 7.5m data points.

When running influx write it works fine for some minutes. But then it stops with

Error: Failed to write data: unexpected error writing points to database: timeout.

In the server log I can see this:

ts=2021-09-22T06:10:22.086698Z lvl=debug msg=Request log_id=0Wk0iYG0000 service=http method=POST host=localhost:8086 path=/api/v2/write query="bucket=b4&org=private&precision=ns" proto=HTTP/1.1 status_code=204 response_size=0 content_length=-1 referrer= remote=[::1]:60574 user_agent=Go-http-client took=77.626ms
ts=2021-09-22T06:10:24.453658Z lvl=debug msg="user find by ID" log_id=0Wk0iYG0000 store=new took=0.000ms
ts=2021-09-22T06:10:24.453658Z lvl=debug msg="org find" log_id=0Wk0iYG0000 store=new took=0.000ms
ts=2021-09-22T06:10:24.454659Z lvl=debug msg="bucket find" log_id=0Wk0iYG0000 store=new took=1.001ms
ts=2021-09-22T06:10:31.045872Z lvl=debug msg=Request log_id=0Wk0iYG0000 service=http method=GET host=localhost:8086 path=/metrics query= proto=HTTP/1.1 status_code=200 response_size=4783 content_length=0 referrer= remote=[::1]:54675 user_agent=Go-http-client took=1.593ms body=
ts=2021-09-22T06:10:39.027054Z lvl=info msg="index opened with 8 partitions" log_id=0Wk0iYG0000 service=storage-engine index=tsi
ts=2021-09-22T06:10:39.029051Z lvl=info msg="Reindexing TSM data" log_id=0Wk0iYG0000 service=storage-engine engine=tsm1 db_shard_id=8641
ts=2021-09-22T06:10:39.029051Z lvl=info msg="Reindexing WAL data" log_id=0Wk0iYG0000 service=storage-engine engine=tsm1 db_shard_id=8641
.....
ts=2021-09-22T06:10:49.888342Z lvl=info msg="index opened with 8 partitions" log_id=0Wk0iYG0000 service=storage-engine index=tsi
ts=2021-09-22T06:10:49.889343Z lvl=info msg="Reindexing TSM data" log_id=0Wk0iYG0000 service=storage-engine engine=tsm1 db_shard_id=8769
ts=2021-09-22T06:10:49.889343Z lvl=info msg="Reindexing WAL data" log_id=0Wk0iYG0000 service=storage-engine engine=tsm1 db_shard_id=8769
ts=2021-09-22T06:10:49.898345Z lvl=debug msg=Request log_id=0Wk0iYG0000 service=http method=POST host=localhost:8086 path=/api/v2/write query="bucket=b4&org=private&precision=ns" proto=HTTP/1.1 status_code=500 response_size=90 content_length=-1 referrer= remote=[::1]:60574 user_agent=Go-http-client took=25444.687ms error="internal error" error_code="internal error"

My interpretation is that this “indexing” runs very long and thus causes the “write” to timeout/fail.
Is there any chance to work around this?

I tried with
http-write-timeout: 60s
but it did not have an effect.

hardware: running on windows 10, AMD Ryzen 7 PRO, 32GB, 1TB SSD NVMe

@magcode I think the reindexing logs are a red herring. Reindexing happens when a shard is first opened; this only happens on writes when the shard is new, and then there’s nothing to reindex.

What version of influxd are you running? If you’re not on the latest patch release, I can look for related issues that might have been fixed in newer versions.

InfluxDB 2.0.8 (git: e91d41810f) build_date: 2021-08-13T18:22:30Z

I will give the nightly a try (on linux)

Same issue.

Client (on the same machine)

influx write --bucket b4 --file influxdata.filtered --org private --format lp -t cMG....NFelPxsAtoulg== --rate-limit "50kB/1s"

response:

Error: Failed to write data: unexpected error writing points to database: timeout.
See 'influx write -h' for help

Server

InfluxDB nightly (git: 47007f6988) build_date: 2021-09-23T05:18:47Z

log messages:

 grep -h3 'error' nohup.out 1
ts=2021-09-23T15:35:25.970379Z lvl=info msg="Configuring InfluxQL statement executor (zeros indicate unlimited)." log_id=0WlpatQl000 max_select_point=0 max_select_series=0 max_select_buckets=0
ts=2021-09-23T15:35:26.282759Z lvl=info msg=Starting log_id=0WlpatQl000 service=telemetry interval=8h
ts=2021-09-23T15:35:26.282768Z lvl=info msg=Listening log_id=0WlpatQl000 service=tcp-listener transport=http addr=:8086 port=8086
ts=2021-09-23T15:35:30.230463Z lvl=info msg=Unauthorized log_id=0WlpatQl000 error="session not found"
ts=2021-09-23T15:35:58.800535Z lvl=info msg="index opened with 8 partitions" log_id=0WlpatQl000 service=storage-engine index=tsi
ts=2021-09-23T15:35:58.800896Z lvl=info msg="Reindexing TSM data" log_id=0WlpatQl000 service=storage-engine engine=tsm1 db_shard_id=1509
ts=2021-09-23T15:35:58.800906Z lvl=info msg="Reindexing WAL data" log_id=0WlpatQl000 service=storage-engine engine=tsm1 db_shard_id=1509
--
ts=2021-09-23T17:19:01.833284Z lvl=info msg="index opened with 8 partitions" log_id=0WlpatQl000 service=storage-engine index=tsi
ts=2021-09-23T17:19:01.833609Z lvl=info msg="Reindexing TSM data" log_id=0WlpatQl000 service=storage-engine engine=tsm1 db_shard_id=2225
ts=2021-09-23T17:19:01.833624Z lvl=info msg="Reindexing WAL data" log_id=0WlpatQl000 service=storage-engine engine=tsm1 db_shard_id=2225
ts=2021-09-23T17:19:01.836825Z lvl=info msg="Write failed" log_id=0WlpatQl000 service=storage-engine service=write shard=2532 error="engine: context canceled"
ts=2021-09-23T17:19:01.851716Z lvl=info msg="index opened with 8 partitions" log_id=0WlpatQl000 service=storage-engine index=tsi
ts=2021-09-23T17:19:01.852049Z lvl=info msg="Reindexing TSM data" log_id=0WlpatQl000 service=storage-engine engine=tsm1 db_shard_id=2635
ts=2021-09-23T17:19:01.852059Z lvl=info msg="Reindexing WAL data" log_id=0WlpatQl000 service=storage-engine engine=tsm1 db_shard_id=2635
ts=2021-09-23T17:19:01.867672Z lvl=info msg="index opened with 8 partitions" log_id=0WlpatQl000 service=storage-engine index=tsi
ts=2021-09-23T17:19:01.868221Z lvl=info msg="Reindexing TSM data" log_id=0WlpatQl000 service=storage-engine engine=tsm1 db_shard_id=2741
ts=2021-09-23T17:19:01.868232Z lvl=info msg="Reindexing WAL data" log_id=0WlpatQl000 service=storage-engine engine=tsm1 db_shard_id=2741
ts=2021-09-23T17:19:01.880965Z lvl=info msg="Write failed" log_id=0WlpatQl000 service=storage-engine service=write shard=2108 error="engine: context canceled"
ts=2021-09-23T17:19:01.881719Z lvl=info msg="index opened with 8 partitions" log_id=0WlpatQl000 service=storage-engine index=tsi
ts=2021-09-23T17:19:01.882247Z lvl=info msg="Reindexing TSM data" log_id=0WlpatQl000 service=storage-engine engine=tsm1 db_shard_id=2427
ts=2021-09-23T17:19:01.882258Z lvl=info msg="Reindexing WAL data" log_id=0WlpatQl000 service=storage-engine engine=tsm1 db_shard_id=2427
ts=2021-09-23T17:19:01.897258Z lvl=info msg="index opened with 8 partitions" log_id=0WlpatQl000 service=storage-engine index=tsi
ts=2021-09-23T17:19:01.897753Z lvl=info msg="Reindexing TSM data" log_id=0WlpatQl000 service=storage-engine engine=tsm1 db_shard_id=1767
ts=2021-09-23T17:19:01.897765Z lvl=info msg="Reindexing WAL data" log_id=0WlpatQl000 service=storage-engine engine=tsm1 db_shard_id=1767
ts=2021-09-23T17:19:01.909844Z lvl=info msg="Write failed" log_id=0WlpatQl000 service=storage-engine service=write shard=2225 error="engine: context canceled"
ts=2021-09-23T17:19:01.914729Z lvl=info msg="index opened with 8 partitions" log_id=0WlpatQl000 service=storage-engine index=tsi
ts=2021-09-23T17:19:01.915283Z lvl=info msg="Reindexing TSM data" log_id=0WlpatQl000 service=storage-engine engine=tsm1 db_shard_id=2622
ts=2021-09-23T17:19:01.915294Z lvl=info msg="Reindexing WAL data" log_id=0WlpatQl000 service=storage-engine engine=tsm1 db_shard_id=2622
ts=2021-09-23T17:19:01.926465Z lvl=info msg="Write failed" log_id=0WlpatQl000 service=storage-engine service=write shard=2635 error="engine: context canceled"
ts=2021-09-23T17:19:01.931469Z lvl=info msg="index opened with 8 partitions" log_id=0WlpatQl000 service=storage-engine index=tsi
ts=2021-09-23T17:19:01.932090Z lvl=info msg="Reindexing TSM data" log_id=0WlpatQl000 service=storage-engine engine=tsm1 db_shard_id=2293
ts=2021-09-23T17:19:01.932102Z lvl=info msg="Reindexing WAL data" log_id=0WlpatQl000 service=storage-engine engine=tsm1 db_shard_id=2293
--
ts=2021-09-23T17:19:01.959468Z lvl=info msg="index opened with 8 partitions" log_id=0WlpatQl000 service=storage-engine index=tsi
ts=2021-09-23T17:19:01.960068Z lvl=info msg="Reindexing TSM data" log_id=0WlpatQl000 service=storage-engine engine=tsm1 db_shard_id=2434
ts=2021-09-23T17:19:01.960083Z lvl=info msg="Reindexing WAL data" log_id=0WlpatQl000 service=storage-engine engine=tsm1 db_shard_id=2434
ts=2021-09-23T17:19:01.969298Z lvl=info msg="Write failed" log_id=0WlpatQl000 service=storage-engine service=write shard=2427 error="engine: context canceled"
ts=2021-09-23T17:19:01.974398Z lvl=info msg="index opened with 8 partitions" log_id=0WlpatQl000 service=storage-engine index=tsi
ts=2021-09-23T17:19:01.974902Z lvl=info msg="Reindexing TSM data" log_id=0WlpatQl000 service=storage-engine engine=tsm1 db_shard_id=1972
ts=2021-09-23T17:19:01.974919Z lvl=info msg="Reindexing WAL data" log_id=0WlpatQl000 service=storage-engine engine=tsm1 db_shard_id=1972
ts=2021-09-23T17:19:01.993813Z lvl=info msg="Write failed" log_id=0WlpatQl000 service=storage-engine service=write shard=2622 error="engine: context canceled"
ts=2021-09-23T17:19:02.004764Z lvl=info msg="Write failed" log_id=0WlpatQl000 service=storage-engine service=write shard=2293 error="engine: context canceled"
ts=2021-09-23T17:19:02.015768Z lvl=info msg="index opened with 8 partitions" log_id=0WlpatQl000 service=storage-engine index=tsi
ts=2021-09-23T17:19:02.016179Z lvl=info msg="Reindexing TSM data" log_id=0WlpatQl000 service=storage-engine engine=tsm1 db_shard_id=1785
ts=2021-09-23T17:19:02.016192Z lvl=info msg="Reindexing WAL data" log_id=0WlpatQl000 service=storage-engine engine=tsm1 db_shard_id=1785
ts=2021-09-23T17:19:02.032306Z lvl=info msg="Write failed" log_id=0WlpatQl000 service=storage-engine service=write shard=2434 error="engine: context canceled"
ts=2021-09-23T17:19:02.038857Z lvl=info msg="index opened with 8 partitions" log_id=0WlpatQl000 service=storage-engine index=tsi
ts=2021-09-23T17:19:02.039122Z lvl=info msg="Reindexing TSM data" log_id=0WlpatQl000 service=storage-engine engine=tsm1 db_shard_id=2496
ts=2021-09-23T17:19:02.039134Z lvl=info msg="Reindexing WAL data" log_id=0WlpatQl000 service=storage-engine engine=tsm1 db_shard_id=2496
--

Hardware: linux lxc, 4 cores Intel i3 8100, local ssd

@dan-moran any more thoughts here? Thanks you!

Maybe the 7.5m data points is not a valid use case/too much of data for influx? In this case I probably need to switch to Prometheus or something

Was out of office last week, looking at this more closely today…

I see a “default” write timeout of 10 seconds in the DB’s write handler (default in quotes because we don’t expose a CLI option to change it). I’ll open an issue to add that CLI flag, and will discuss with the team about bumping up the default.