Fatal error: concurrent map iteration and map write

Hello,

Our InfluxDB service keeps failing with this error:

“fatal error: concurrent map iteration and map write”

I’m getting this problem regularly, every 2-3 days.
It seems to be the same problem as described here: Data race on map access/write (influxdata/influxdb issue #8633 on GitHub)

The symptoms are very similar:

  • a massive number of writes via POST /write?db=telegraf,
  • a number of SHOW TAG VALUES ON telegraf WITH KEY queries from a Grafana dashboard,
  • a fatal error.

This is very worrying, because it means the crash will happen even more often as more people in our company start using Grafana dashboards for InfluxDB data.

Is it a bug? Is there a way to prevent this from happening?

Please help us to solve it.

Our setup:
InfluxDB cluster with 2 data nodes.

Version: 1.3.6-c1.3.6

cache-max-memory-size = 4194304000
cache-snapshot-memory-size = 26214400
index-version = "inmem"
max-concurrent-compactions = 3

telegraf database stats:

Size on disk: 40G
Measurements: 23
Series: 225647447

I can also provide the full stack trace from the log.

Short log:

[httpd] 172.20.99.29 - - [19/Oct/2017:13:23:04 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" a3528157-b4d0-11e7-bee5-000000000000 4813
[httpd] 172.20.99.29 - - [19/Oct/2017:13:23:04 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" a352b259-b4d0-11e7-bee6-000000000000 3576
[httpd] 172.20.99.29 - - [19/Oct/2017:13:23:04 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" a351bc5b-b4d0-11e7-bee4-000000000000 9859
[httpd] 172.20.99.29 - - [19/Oct/2017:13:23:04 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" a352f3b9-b4d0-11e7-bee7-000000000000 4282
[httpd] 172.20.99.29 - - [19/Oct/2017:13:23:04 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" a353e2b1-b4d0-11e7-bee8-000000000000 2898
[httpd] 172.20.99.29 - - [19/Oct/2017:13:23:04 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" a353f4e2-b4d0-11e7-bee9-000000000000 4884
[httpd] 172.20.99.29 - - [19/Oct/2017:13:23:04 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" a353ff0d-b4d0-11e7-beea-000000000000 4655
[I] 2017-10-19T13:23:04Z SHOW TAG VALUES ON telegraf WITH KEY = interface WHERE (_name = 'net') AND ((host =~ /^fe-i-07a1b5f8e3e666075.dc-eu.ragnarok.net$/) AND (_tagKey = 'interface')) service=query
[I] 2017-10-19T13:23:04Z SHOW TAG VALUES ON telegraf WITH KEY = path WHERE (_name = 'disk') AND ((host =~ /^fe-i-07a1b5f8e3e666075.dc-eu.ragnarok.net$/) AND (_tagKey = 'path')) service=query
[I] 2017-10-19T13:23:04Z SHOW TAG VALUES ON telegraf WITH KEY = process_name WHERE (_name = 'procstat') AND ((host =~ /^fe-i-07a1b5f8e3e666075.dc-eu.ragnarok.net$/) AND (_tagKey = 'process_name')) LIMIT 30 service=query
[I] 2017-10-19T13:23:04Z SELECT mean(used) FROM telegraf."default".mem WHERE host =~ /^fe-i-07a1b5f8e3e666075.dc-eu.ragnarok.net$/ AND time > now() - 6h GROUP BY time(10m) service=query
[I] 2017-10-19T13:23:04Z SHOW TAG VALUES ON telegraf WITH KEY = "name" WHERE (_name = 'diskio') AND ((host =~ /^fe-i-07a1b5f8e3e666075.dc-eu.ragnarok.net$/) AND (_tagKey = 'name')) service=query
[I] 2017-10-19T13:23:04Z SELECT mean(usage_user) FROM telegraf."default".cpu WHERE host =~ /^fe-i-07a1b5f8e3e666075.dc-eu.ragnarok.net$/ AND cpu = 'cpu-total' AND time > now() - 6h GROUP BY time(10m) service=query
[I] 2017-10-19T13:23:04Z SELECT mean(cached) FROM telegraf."default".mem WHERE host =~ /^fe-i-07a1b5f8e3e666075.dc-eu.ragnarok.net$/ AND time > now() - 6h GROUP BY time(10m) service=query
[I] 2017-10-19T13:23:04Z SELECT mean(buffered) FROM telegraf."default".mem WHERE host =~ /^fe-i-07a1b5f8e3e666075.dc-eu.ragnarok.net$/ AND time > now() - 6h GROUP BY time(10m) service=query
[httpd] 172.20.99.29 - - [19/Oct/2017:13:23:04 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" a3559cdd-b4d0-11e7-beed-000000000000 9285
[httpd] 172.20.99.29 - - [19/Oct/2017:13:23:04 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" a354352f-b4d0-11e7-beeb-000000000000 18601
[I] 2017-10-19T13:23:04Z SELECT mean(free) FROM telegraf."default".mem WHERE host =~ /^fe-i-07a1b5f8e3e666075.dc-eu.ragnarok.net$/ AND time > now() - 6h GROUP BY time(10m) service=query
[httpd] 172.21.1.6, 127.0.0.1,172.20.99.29 - - [19/Oct/2017:13:23:04 +0000] "GET /query?db=telegraf&epoch=ms&q=SELECT+mean%28%22used%22%29+FROM+%22default%22.%22mem%22+WHERE+%22host%22+%3D~+%2F%5Efe-i-07a1b5f8e3e666075%5C.dc-eu.ragnarok%5C.net%24%2F+AND+time+%3E+now%28%29+-+6h+GROUP+BY+time%2810m%29+fill%28null%29%3BSELECT+mean%28%22cached%22%29+FROM+%22default%22.%22mem%22+WHERE+%22host%22+%3D~+%2F%5Efe-i-07a1b5f8e3e666075%5C.dc-eu.ragnarok%5C.net%24%2F+AND+time+%3E+now%28%29+-+6h+GROUP+BY+time%2810m%29+fill%28null%29%3BSELECT+mean%28%22buffered%22%29+FROM+%22default%22.%22mem%22+WHERE+%22host%22+%3D~+%2F%5Efe-i-07a1b5f8e3e666075%5C.dc-eu.ragnarok%5C.net%24%2F+AND+time+%3E+now%28%29+-+6h+GROUP+BY+time%2810m%29+fill%28null%29%3BSELECT+mean%28%22free%22%29+FROM+%22default%22.%22mem%22+WHERE+%22host%22+%3D~+%2F%5Efe-i-07a1b5f8e3e666075%5C.dc-eu.ragnarok%5C.net%24%2F+AND+time+%3E+now%28%29+-+6h+GROUP+BY+time%2810m%29+fill%28null%29 HTTP/1.1" 200 1353 "ragnarok.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36" a3563ac1-b4d0-11e7-bef0-000000000000 8487

fatal error: concurrent map iteration and map write

goroutine 41340066 [running]:
runtime.throw(0xcdfd72, 0x26)
/usr/local/go/src/runtime/panic.go:596 +0x95 fp=0xc6a24611f0 sp=0xc6a24611d0
runtime.mapiternext(0xc6a2461560)
/usr/local/go/src/runtime/hashmap.go:737 +0x7ee fp=0xc6a24612a0 sp=0xc6a24611f0
github.com/influxdata/influxdb/tsdb/index/inmem.(*Measurement).idsForExpr(0xc458ef6480, 0xc47cf573b0, 0x0, 0xc47c939230, 0x8, 0x18, 0xc58f9b7540, 0x0, 0xc42037e800)