InfluxDB Tuning

Hi

Do we have any published tuning settings for InfluxDB community edition? I am hitting scaling limits very often and wondering how much data, or how many series, InfluxDB can handle with a decent config. I’m running InfluxDB on a container platform and have increased its memory from 2 GB to 30 GB so far. I have just 120+ nodes collecting only the basic standard input plug-ins plus Docker metrics. I also reduced the retention policy to just 1 week because InfluxDB was unable to handle the memory properly.

I’m not sure what else needs to be configured. I hit the series limit of 1M today, so I raised it using the max-series-per-database parameter. I’m not sure why I am hitting memory and series limits when the retention policy is only 1 week; I am assuming that after one week the data will be dropped. Whatever memory I give it gets eaten within a few days. I am confused about what the right configuration is and how to handle future growth.

My current configuration:

$ cat influxdb.conf
[admin]
enabled = true
bind-address = ":8083"

[http]
enabled = true
bind-address = ":8086"
auth-enabled = true

[coordinator]
write-timeout = "30s"

[retention]
enabled = true
check-interval = "30m"

[monitor]
store-enabled = true
store-database = "_internal"
store-interval = "30s"

[meta]
dir = "/var/lib/influxdb/meta"

[data]
dir = "/var/lib/influxdb/data"
engine = "tsm1"
wal-dir = "/var/lib/influxdb/wal"
max-series-per-database = 10000000

My Telegraf is configured to collect metrics every 60 seconds. Don’t worry about the {{ }} variables; they are populated automatically by the Ansible template language.

[global_tags]
cluster = "{{ cluster_name }}"
hostname = "{{ ansible_nodename }}"
datacenter = "{{ datacenter }}"

[agent]
interval = "60s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "5s"
flush_interval = "10s"
flush_jitter = "5s"
precision = ""
debug = false
quiet = false
logfile = ""
hostname = ""
omit_hostname = true

[[outputs.influxdb]]
urls = ["http://{{ rtp_db_ingress_ip }}:{{ port }}"]
database = "hosting"
write_consistency = "any"
timeout = "30s"
username = "hosting"
password = "******"
user_agent = "telegraf"
udp_payload = 512

[[outputs.influxdb]]
urls = ["http://{{ rcdn_db_ingress_ip }}:{{ port }}"]
database = "hosting"
write_consistency = "any"
timeout = "30s"
username = "hosting"
password = "********"
user_agent = "telegraf"
udp_payload = 512

[[inputs.cpu]]
percpu = true
totalcpu = true
collect_cpu_time = false

[[inputs.disk]]
ignore_fs = ["tmpfs", "devtmpfs"]

[[inputs.diskio]]

[[inputs.kernel]]

[[inputs.mem]]

[[inputs.processes]]

[[inputs.swap]]

[[inputs.system]]

[[inputs.net]]

[[inputs.netstat]]

[[inputs.docker]]
endpoint = "unix:///var/run/docker.sock"
timeout = "30s"
perdevice = true
total = true
docker_label_exclude = ["*"]

[[inputs.procstat]]
exe = "dockerd-current"
prefix = "docker"

[[inputs.procstat]]
exe = "openshift"
prefix = "openshift"

I didn’t get a reply, unfortunately. Will the upcoming 1.3 release solve some of the memory issues? It seems the number of series has a direct impact on memory usage, but even with a 7-day retention policy, why do the series keep increasing rather than staying stable, given that the number of nodes and measurements is the same?

The number of series you have is your series cardinality. Every time you write a distinct measurement + tag set + field name, a new series is created. There is an in-memory index of all series stored in all shards. If you are continuously adding new series, your memory usage will continue to grow. Since you hit the max-series-per-database limit, that is an indication you may have a schema design/cardinality issue with the way you are storing data. I would take a look at the series you are writing and see if you are writing any tags that continuously change. For example, writing a GUID, timestamp, or other highly variable tag value is a frequent cause of high-cardinality data. You can also run influx_inspect report -detailed /path/to/shard/data to see if any particular tag or measurement stands out as having higher cardinality.
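As a rough illustration of the definition above, the following Python sketch counts distinct series for two made-up sets of points. The `Point` type, the key format, and the sample tag values are assumptions for illustration only; this is not how InfluxDB stores its index internally.

```python
from typing import Iterable, NamedTuple


class Point(NamedTuple):
    measurement: str
    tags: tuple  # (key, value) pairs
    field: str


def series_key(p: Point) -> str:
    # Hypothetical key format: measurement, sorted tag set, field name.
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(p.tags))
    return f"{p.measurement},{tag_part}#{p.field}"


def cardinality(points: Iterable[Point]) -> int:
    # Series cardinality is the number of distinct series keys.
    return len({series_key(p) for p in points})


# 1000 writes that always carry the same tag set: still one series.
stable = [Point("cpu", (("hostname", "node1"),), "usage_idle") for _ in range(1000)]

# 1000 writes where a tag value changes every time: one series per write.
churning = [
    Point("docker_container_cpu", (("container_name", f"k8s_app_{i}"),), "usage_total")
    for i in range(1000)
]

print(cardinality(stable))    # 1
print(cardinality(churning))  # 1000
```

The second case is the pattern to look for: the number of nodes and measurements stays the same, but a tag value that changes with every new container keeps creating series.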

The retention policy defines how long data persists on disk. Once a shard has expired, it is marked for deletion and eventually removed. Any series that existed only in those expired shards are removed from the in-memory index at this time as well.
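To see why a one-week retention policy does not by itself keep the series count low, here is a small Python simulation of the behavior described above. It assumes daily shards with eight live at a time, a fixed set of long-lived series, and a batch of brand-new series every day; all numbers and names are hypothetical, and real containers do not live exactly one day.

```python
from collections import deque


def simulate_index(days, live_shards, stable_series, new_per_day):
    """Return the number of distinct series across live shards after each day."""
    shards = deque()
    sizes = []
    for day in range(days):
        # Long-lived series (hosts, disks, ...) show up in every shard.
        shard = {f"host-{i}" for i in range(stable_series)}
        # Short-lived series (containers created that day) appear only here.
        shard |= {f"container-{day}-{i}" for i in range(new_per_day)}
        shards.append(shard)
        if len(shards) > live_shards:
            # The oldest shard expires; series that lived only in it
            # disappear from the union computed below.
            shards.popleft()
        sizes.append(len(set().union(*shards)))
    return sizes


sizes = simulate_index(days=12, live_shards=8, stable_series=10_000, new_per_day=150_000)
print(sizes[0])   # 160000
print(sizes[7])   # 1210000
print(sizes[11])  # 1210000
```

Under these assumptions the index grows until the first shard expires and then levels off at the stable series plus one retention window of churn. Shortening retention lowers that plateau, but only reducing the number of new series per day removes the cause.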

In your Telegraf config, you may need to adjust the procstat plugin to drop tags that create high-cardinality issues. I’m not sure what the default behavior is currently; there was a bug that recorded the pid as a tag. That can lead to cardinality issues.


Thank you, Jason. This info is really helpful; however, the problem is how to find bottlenecks and tune them. From the documentation I knew there is an issue with high series counts and cardinality, but not much information is available on which tools or commands to use to find the bottlenecks and fix them.

The command you gave me is really helpful for finding the series cardinality and identifying which input plug-in contributes the most. Since I have a one-week retention policy, I have multiple shards:

ls -l /var/lib/influxdb/data/hosting/one_week/
total 32
drwxr-xr-x. 2 root root 4096 May 18 07:38 222
drwxr-xr-x. 2 root root 4096 May 19 04:59 230
drwxr-xr-x. 2 root root 4096 May 20 04:13 238
drwxr-xr-x. 2 root root 4096 May 21 04:13 246
drwxr-xr-x. 2 root root 4096 May 22 04:13 254
drwxr-xr-x. 2 root root 4096 May 23 05:24 262
drwxr-xr-x. 2 root root 4096 May 24 00:10 270
drwxr-xr-x. 2 root root 4096 May 24 03:12 279

I ran the command against a few shards, and the output is below.


influx_inspect report -detailed /var/lib/influxdb/data/hosting/one_week/222/
File Series Load Time
000000772-000000004.tsm 741128 289.931213ms
000000772-000000005.tsm 182434 57.268557ms

Statistics
Series:
Total (est): 923184
Measurements (est):
cpu: 19660 (2%)
disk: 6141 (0%)
docker: 1156 (0%)
docker_container_mem: 196687 (21%)
kubernetes_node: 36 (0%)
kubernetes_pod_container: 143 (0%)
kubernetes_pod_volume: 21 (0%)
docker_container_blkio: 318879 (34%)
mem: 1050 (0%)
procstat: 19349 (2%)
swap: 630 (0%)
kubernetes_system_container: 44 (0%)
net: 27939 (3%)
processes: 944 (0%)
system: 735 (0%)
diskio: 39392 (4%)
docker_container_cpu: 291305 (31%)
docker_data: 315 (0%)
docker_metadata: 315 (0%)
kernel: 419 (0%)
kubernetes_pod_network: 48 (0%)
netstat: 1362 (0%)
Fields (est):
kubernetes_pod_volume: 3
kubernetes_system_container: 11
mem: 10
processes: 9
kubernetes_node: 18
kubernetes_pod_container: 13
docker_container_mem: 35
net: 115
procstat: 53
system: 7
cpu: 10
docker: 13
docker_container_cpu: 9
docker_metadata: 3
netstat: 13
swap: 6
diskio: 110
docker_container_blkio: 11
kernel: 4
kubernetes_pod_network: 4
disk: 7
docker_data: 3
Tags (est):
unit: 1
Vendor: 2
release: 13
version: 13
com.redhat.dev-mode.port: 1
io.kubernetes.container.name: 532
io.kubernetes.container.restartCount: 1242
com.redhat.dev-mode: 1
com.docker.compose.version: 1
process_name: 542
container_image: 375
io.kubernetes.container.terminationMessagePath: 1
volume_name: 3
io.openshift.s2i.scripts-url: 2
exe: 2
datacenter: 3
authoritative-source-url: 1
com.redhat.component: 12
container_name: 5952
distribution-scope: 1
vendor: 2
io.openshift.expose-services: 3
Name: 11
Version: 11
container_version: 539
io.kubernetes.pod.terminationGracePeriod: 4
io.kubernetes.pod.uid: 2217
cpu: 183
com.redhat.build-host: 11
io.k8s.description: 14
namespace: 5
fstype: 2
engine_host: 388
node_name: 2
interface: 1890
com.docker.compose.service: 4
name: 457
Architecture: 1
PostgresVersion: 2
license: 1
com.docker.compose.container-number: 1
hostname: 457
device: 536
architecture: 1
vcs-ref: 18
com.docker.compose.oneoff: 1
Build_Host: 2
io.kubernetes.pod.namespace: 273
com.redhat.deployments-dir: 1
io.openshift.tags: 13
build-date: 35
io.kubernetes.container.ports: 152
com.docker.compose.config-hash: 4
cluster: 3
description: 2
summary: 2
io.openshift.builder-version: 1
com.docker.compose.project: 1
pod_name: 12
path: 7
Authoritative_Registry: 1
vcs-type: 1
BZComponent: 11
Component: 1
Release: 12
io.kubernetes.container.hash: 1230
io.k8s.display-name: 15
io.kubernetes.pod.name: 2033
Completed in 20.936049294s

=================

influx_inspect report -detailed /var/lib/influxdb/data/hosting/one_week/230/
File Series Load Time
000000653-000000005.tsm 847282 345.675756ms
000000653-000000006.tsm 267516 95.470656ms

Statistics
Series:
Total (est): 1097999
Measurements (est):
kubernetes_pod_volume: 21 (0%)
netstat: 1388 (0%)
swap: 642 (0%)
disk: 7279 (0%)
docker_container_cpu: 355845 (32%)
docker_container_mem: 238933 (21%)
kernel: 427 (0%)
mem: 1070 (0%)
diskio: 41045 (3%)
kubernetes_pod_container: 143 (0%)
kubernetes_pod_network: 48 (0%)
net: 28707 (2%)
kubernetes_node: 36 (0%)
kubernetes_system_container: 44 (0%)
processes: 1035 (0%)
cpu: 21049 (1%)
docker: 1178 (0%)
docker_container_blkio: 397246 (36%)
docker_data: 321 (0%)
docker_metadata: 321 (0%)
procstat: 5284 (0%)
system: 749 (0%)
Fields (est):
docker_container_cpu: 9
docker_metadata: 3
kubernetes_pod_container: 13
kubernetes_pod_volume: 3
netstat: 13
system: 7
cpu: 10
docker: 13
diskio: 114
kubernetes_system_container: 11
kernel: 4
kubernetes_node: 18
kubernetes_pod_network: 4
mem: 10
net: 115
swap: 6
docker_container_mem: 35
docker_data: 3
processes: 10
procstat: 55
disk: 7
docker_container_blkio: 11
Tags (est):
io.kubernetes.container.hash: 1261
io.kubernetes.container.name: 550
vcs-ref: 17
io.openshift.s2i.scripts-url: 2
node_name: 2
path: 7
engine_host: 385
Authoritative_Registry: 1
pod_name: 12
volume_name: 3
io.k8s.description: 13
release: 13
description: 2
Release: 12
vcs-type: 1
io.openshift.tags: 12
com.docker.compose.oneoff: 1
hostname: 462
io.kubernetes.pod.namespace: 271
summary: 2
exe: 2
unit: 1
license: 1
interface: 1960
authoritative-source-url: 1
com.redhat.deployments-dir: 1
container_version: 599
com.docker.compose.version: 1
name: 469
architecture: 1
com.redhat.component: 11
com.redhat.build-host: 11
io.kubernetes.container.terminationMessagePath: 1
io.kubernetes.pod.uid: 2504
com.docker.compose.service: 4
device: 656
BZComponent: 10
Component: 1
PostgresVersion: 2
cluster: 3
version: 14
com.redhat.dev-mode.port: 1
io.kubernetes.pod.name: 2205
io.kubernetes.pod.terminationGracePeriod: 4
com.docker.compose.config-hash: 4
datacenter: 3
build-date: 34
com.redhat.dev-mode: 1
cpu: 185
distribution-scope: 1
com.docker.compose.container-number: 1
container_image: 389
io.kubernetes.container.restartCount: 1337
namespace: 5
io.openshift.expose-services: 3
fstype: 2
Name: 10
io.kubernetes.container.ports: 281
io.k8s.display-name: 14
vendor: 2
process_name: 9
Build_Host: 2
Vendor: 2
Version: 12
com.docker.compose.project: 1
Architecture: 1
container_name: 6200
io.openshift.builder-version: 1
Completed in 24.581871901s

=============================

influx_inspect report -detailed /var/lib/influxdb/data/hosting/one_week/238/
File Series Load Time
000000776-000000005.tsm 1095559 296.404829ms
000000776-000000006.tsm 469821 130.299482ms

Statistics
Series:
Total (est): 1561417
Measurements (est):
kubernetes_pod_container: 143 (0%)
kubernetes_pod_volume: 21 (0%)
kubernetes_system_container: 44 (0%)
net: 30566 (1%)
netstat: 1388 (0%)
processes: 1070 (0%)
docker: 1178 (0%)
docker_container_blkio: 557549 (35%)
docker_container_cpu: 537433 (34%)
kernel: 427 (0%)
procstat: 7877 (0%)
swap: 642 (0%)
disk: 7672 (0%)
docker_container_mem: 340009 (21%)
docker_metadata: 321 (0%)
kubernetes_node: 36 (0%)
kubernetes_pod_network: 48 (0%)
mem: 1070 (0%)
system: 749 (0%)
cpu: 21049 (1%)
diskio: 42384 (2%)
docker_data: 321 (0%)
Fields (est):
netstat: 13
procstat: 63
docker_container_blkio: 11
kernel: 4
kubernetes_node: 18
kubernetes_pod_container: 13
kubernetes_system_container: 11
docker: 13
docker_data: 3
kubernetes_pod_network: 4
net: 117
cpu: 10
disk: 7
docker_container_cpu: 9
docker_container_mem: 35
mem: 10
system: 7
diskio: 118
docker_metadata: 3
kubernetes_pod_volume: 3
processes: 10
swap: 6
Tags (est):
fstype: 2
path: 7
Vendor: 2
io.kubernetes.container.name: 540
hostname: 564
com.redhat.build-host: 11
com.redhat.component: 11
com.redhat.dev-mode.port: 1
io.openshift.tags: 12
BZComponent: 10
engine_host: 494
Version: 10
io.kubernetes.container.hash: 1170
io.kubernetes.pod.uid: 1967
io.openshift.expose-services: 3
cpu: 208
Component: 1
build-date: 34
distribution-scope: 1
com.redhat.dev-mode: 1
license: 1
Architecture: 1
unit: 1
Name: 10
io.k8s.display-name: 14
version: 12
com.docker.compose.oneoff: 1
com.docker.compose.project: 1
datacenter: 3
io.kubernetes.pod.terminationGracePeriod: 4
com.docker.compose.container-number: 1
com.docker.compose.version: 1
exe: 2
process_name: 12
io.kubernetes.pod.name: 1829
com.docker.compose.config-hash: 4
vcs-ref: 17
io.kubernetes.container.terminationMessagePath: 1
io.openshift.s2i.scripts-url: 2
volume_name: 3
name: 446
Release: 12
container_image: 440
io.openshift.builder-version: 1
Authoritative_Registry: 1
io.k8s.description: 13
release: 13
summary: 2
node_name: 2
namespace: 5
device: 698
container_name: 7144
container_version: 691
vendor: 2
PostgresVersion: 2
architecture: 1
vcs-type: 1
com.docker.compose.service: 4
cluster: 3
io.kubernetes.pod.namespace: 268
interface: 2196
io.kubernetes.container.ports: 166
io.kubernetes.container.restartCount: 465
com.redhat.deployments-dir: 1
description: 2
pod_name: 12
Build_Host: 2
authoritative-source-url: 1
Completed in 19.121463128s

===================

We have a very big Kubernetes-based Docker platform, and each node runs multiple containers. The rate of pod creation and deletion is also very high. I think that is the reason for the very high cardinality from the docker plugin. How can I fix or reduce the cardinality from the docker input plug-in?

Your help fixing this is highly appreciated, as I am facing a lot of uncertainty with InfluxDB: my metrics solution frequently goes down and is not reliable, although I gave 30 GB of memory to the InfluxDB container.

My current series counts are:

SELECT numSeries FROM "_internal".."database" GROUP BY "database" ORDER BY time DESC LIMIT 1
name: database
tags: database=network
time numSeries


1495595820000000000 0

name: database
tags: database=hosting
time numSeries


1495595820000000000 1252706

name: database
tags: database=heapster
time numSeries


1495595820000000000 185176

name: database
tags: database=ech
time numSeries


1495595820000000000 0

name: database
tags: database=ceph
time numSeries


1495595820000000000 342

name: database
tags: database=capi
time numSeries


1495595820000000000 1854

name: database
tags: database=cae
time numSeries


1495595820000000000 305

name: database
tags: database=aci
time numSeries


1495595820000000000 0

name: database
tags: database=_internal
time numSeries


1495595820000000000 1486

Jason

Your help is appreciated. This issue has been causing trouble for me for quite some time, and I am unable to scale up the metrics platform.

Srinivas Kotaru

@Srinivas_Kotaru Are you running the TICK stack via the tick-charts repo? I maintain that repo and run it in a couple of environments. The extra cardinality is coming primarily from the additional tags added on the docker measurements. They come from the docker daemon. Add the following to your telegraf-ds/values.yaml file:

config:
  inputs:
    docker:
      docker_label_exclude:
        - "annotation.kubernetes.io/*"
        - "io.kubernetes*"
        - "com.docker*"
        - "com.redhat.*"

That will eliminate the most egregious cardinality offenders.

@jackzampolin Thanks for your reply. I am not using the Helm package manager; rather, I use the images directly from Docker Hub and customize the configuration as per my needs. All the core components (InfluxDB, Telegraf, and Grafana) run as containers.

If you look at my Telegraf configuration, I am already excluding all labels.

[[inputs.docker]]
endpoint = "unix:///var/run/docker.sock"
timeout = "30s"
perdevice = true
total = true
docker_label_exclude = ["*"]

Do I need to exclude anything more? Based on the conversation in this thread with @jason, I am thinking we should exclude tags also.

Any help to reduce this docker cardinality issue is highly appreciated.

Srinivas Kotaru

@Srinivas_Kotaru Based on the output above it looks like those labels are still being written. What version of telegraf are you running? The label exclude feature for docker wasn’t merged until 1.3.

@jackzampolin I’m using the latest Telegraf, 1.3.0.

@Srinivas_Kotaru Have you tried to specifically exclude the tags? I’ve had success doing that on my cluster.

@jackzampolin No, I haven’t tried that yet, and it is what I want to try in order to reduce the amount of data coming into InfluxDB. But the confusion is which tags to exclude. I’m using the namespace, pod, and container tags exclusively. Is there any way to identify which tags are causing the most trouble and see whether I am using them in my Grafana graphs and alerts?

@Srinivas_Kotaru The output you posted above contains that information.
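One way to read that information out of the report is to rank the entries under "Tags (est):" by their estimated value count. The Python sketch below does this for the flat text layout pasted earlier in the thread; the function name `top_tags` is made up, and the sample lines are a subset copied from the report for shard 222.

```python
def top_tags(report, n=5):
    """Rank tag keys from an `influx_inspect report -detailed` dump by value count."""
    tags = {}
    in_tags = False
    for line in report.splitlines():
        line = line.strip()
        if line.startswith("Tags (est):"):
            in_tags = True
            continue
        if not in_tags:
            continue
        if line.startswith("Completed in"):
            break
        # Split on the last ": " so dotted or colon-free keys stay intact.
        name, sep, count = line.rpartition(": ")
        if sep and count.isdigit():
            tags[name] = int(count)
    return sorted(tags.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Subset of the "Tags (est):" section from the shard 222 report above.
sample = """\
Tags (est):
container_name: 5952
io.kubernetes.pod.uid: 2217
io.kubernetes.pod.name: 2033
interface: 1890
io.kubernetes.container.restartCount: 1242
hostname: 457
cluster: 3
Completed in 20.936049294s
"""

for name, count in top_tags(sample, n=3):
    print(f"{name}: {count}")
# container_name: 5952
# io.kubernetes.pod.uid: 2217
# io.kubernetes.pod.name: 2033
```

Tags near the top of this ranking that never appear in a Grafana query or alert are the first candidates to drop. The presence of io.kubernetes.* entries also shows that container labels were still being written as tags in that shard.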

@jackzampolin
It seems my problems with InfluxDB are never-ending. Today I received this message from all of my Telegraf agents.

2017-06-16T00:35:20Z E! Error writing to output [influxdb]: Could not write to any InfluxDB server in cluster
2017-06-16T00:35:34Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: max-values-per-tag limit exceeded (100026/100000): measurement="docker_container_blkio" tag="container_name" value="k8s_kong-app.251f7133_kong-app-2645397686-gdof2_coi-iamservices-poc_23f8e542-51f9-11e7-b69f-005056ac69a9_105a76e2" dropped=15]
2017-06-16T00:35:34Z E! Error writing to output [influxdb]: Could not write to any InfluxDB server in cluster
2017-06-16T00:35:45Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: max-values-per-tag limit exceeded (100026/100000): measurement="docker_container_blkio" tag="container_name" value="k8s_kong-app.251f7133_kong-app-2645397686-gdof2_coi-iamservices-poc_23f8e542-51f9-11e7-b69f-005056ac69a9_105a76e2" dropped=15]
2017-06-16T00:35:45Z E! Error writing to output [influxdb]: Could not write to any InfluxDB server in cluster
2017-06-16T00:35:51Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: max-values-per-tag limit exceeded (100026/100000): measurement="docker_container_blkio" tag="container_name" value="k8s_kong-app.251f7133_kong-app-2645397686-gdof2_coi-iamservices-poc_23f8e542-51f9-11e7-b69f-005056ac69a9_105a76e2" dropped=15]
2017-06-16T00:35:51Z E! Error writing to output [influxdb]: Could not write to any InfluxDB server in cluster
2017-06-16T00:36:04Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: max-values-per-tag limit exceeded (100026/100000): measurement="docker_container_blkio" tag="container_name" value="k8s_kong-app.251f7133_kong-app-2645397686-gdof2_coi-iamservices-poc_23f8e542-51f9-11e7-b69f-005056ac69a9_105a76e2" dropped=15]
2017-06-16T00:36:04Z E! Error writing to output [influxdb]: Could not write to any InfluxDB server in cluster
2017-06-16T00:36:10Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: max-values-per-tag limit exceeded (100026/100000): measurement="docker_container_blkio" tag="container_name" value="k8s_kong-app.251f7133_kong-app-2645397686-gdof2_coi-iamservices-poc_23f8e542-51f9-11e7-b69f-005056ac69a9_105a76e2" dropped=15]
2017-06-16T00:36:10Z E! Error writing to output [influxdb]: Could not write to any InfluxDB server in cluster
2017-06-16T00:36:23Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: max-values-per-tag limit exceeded (100026/100000): measurement="docker_container_blkio" tag="container_name" value="k8s_kong-app.251f7133_kong-app-2645397686-gdof2_coi-iamservices-poc_23f8e542-51f9-11e7-b69f-005056ac69a9_105a76e2" dropped=15]
2017-06-16T00:36:23Z E! Error writing to output [influxdb]: Could not write to any InfluxDB server in cluster
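When many agents log the same error, it can help to pull out just the measurement, the tag, and the counts. The Python sketch below does that with a regular expression written against the message format shown in these log lines; the helper name `parse_limit_error` is made up, the tag value in the sample is shortened, and the pattern assumes straight double quotes around the values.

```python
import re

# Matches the "max-values-per-tag limit exceeded" part of a partial-write error.
PATTERN = re.compile(
    r'max-values-per-tag limit exceeded \((\d+)/(\d+)\): '
    r'measurement="([^"]+)" tag="([^"]+)"'
)


def parse_limit_error(line):
    """Return the offending measurement, tag, and counts, or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    values, limit, measurement, tag = m.groups()
    return {
        "measurement": measurement,
        "tag": tag,
        "values": int(values),
        "limit": int(limit),
    }


# Sample line modeled on the log above, with the tag value shortened.
line = (
    'partial write: max-values-per-tag limit exceeded (100026/100000): '
    'measurement="docker_container_blkio" tag="container_name" '
    'value="k8s_kong-app" dropped=15'
)

print(parse_limit_error(line))
```

For the log above this points at the container_name tag on docker_container_blkio, which had accumulated just over 100,000 distinct values.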

For the time being I have raised the limit by setting it to 0, the unlimited value.

Why is the docker input plug-in causing this much trouble? I already excluded all the labels. What further tuning is required?

Srinivas Kotaru

@jackzampolin any ideas, guys, before my InfluxDB goes down again?

Srinivas Kotaru

@Srinivas_Kotaru The container_id is also very high cardinality depending on your deployment. I would suggest moving the database to a larger machine, downsampling with continuous queries or kapacitor, or moving to a scale out cluster. I would guess however that you have a decent amount of headroom. An 8-16GB RAM instance should be able to handle ~2M series with no problem.

Thanks for the response. My platform is a Kubernetes-based cluster, and I have already allocated 30 GB of memory to InfluxDB. I am wondering whether this much memory is really required for InfluxDB on a decent-sized cluster, and what will happen if my platform expands with more nodes.

Is there anything we can do to tune my docker plugin data? I mean reducing unnecessary tags or series in the DB. If you look at my config, I am already using docker_label_exclude = ["*"]. Do I still need to do anything?

Any help in this regard is appreciated. I am frequently facing DB outages, which cause false alarms from Grafana.

Srinivas Kotaru