InfluxDB schema - new Series, Measurement, Database or Server?

paulo · April 19, 2017, 2:09pm

I want to collect:

Server KPIs (cpu/ram/disk/etc);
Web Access Logs (method/path/status/etc);
Services up-status (serviceX is running, serviceY is down, etc);
…

I am already collecting Server KPIs into an InfluxDB server similar to:

InfluxDB
--> database_xyz
----> measurement_CPU
------> tag layer: "web"
        tag server: "webserver-1"
        value cpu_idle: 93
        value cpu_something: 4
----> measurement_RAM
------> tag layer: "web"
        tag server: "webserver-1"
        value RAM_Free: 65536

Now I want to also collect, say, the Web Access Logs but I’m unsure how exactly InfluxDB is meant to be structured.

1) Should we add everything to the same measurement but filtered by Tags?
This just feels so clunky and ugly… However, it looks completely feasible to me and I can’t think of any technical drawbacks. Would this give me the most flexibility and ease for processing the data in the future (ex: CPU used per Request/s)?

2) Should we create a different Measurement per … type of log (ex: ServerKPIs, WebAccessLogs) or per type of sub-log (ex: ServerKPI-CPU, ServerKPI-RAM)?
Feels pretty standard to me, which is why I originally went with this structure. Am I losing any ability to combine data?

3) Should we create a different Database altogether per type of log?
Seems quite reasonable to separate such disparate logs; after all, they really are different logs. Would I lose the ability to combine logs (ex: CPU used per Request/s)?

4) Should we create a different InfluxDB server per type of log?
If I just create a new InfluxDB server (same machine), scaling the system later to remove bottlenecks should be quite easy, by moving each InfluxDB service onto its own dedicated machine. Would I still be able to combine logs, in the likes of Grafana?
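
To make "CPU used per Request/s" concrete, here is a rough sketch of combining the two kinds of data on the client side, after they have been queried separately. It assumes CPU samples arrive as `(timestamp_seconds, busy_percent)` pairs and requests as plain timestamps; the function names and the 60-second bucket width are illustrative choices, not anything InfluxDB or Grafana prescribes.

```python
from collections import defaultdict


def bucket(ts_seconds, width=60):
    """Round a timestamp down to the start of its time bucket."""
    return ts_seconds - ts_seconds % width


def cpu_per_request_rate(cpu_points, request_timestamps, width=60):
    """Mean CPU busy % divided by requests/s, per time bucket.

    Buckets that lack either CPU samples or requests are skipped.
    """
    cpu_sum = defaultdict(float)
    cpu_count = defaultdict(int)
    request_count = defaultdict(int)

    for ts, busy in cpu_points:
        b = bucket(ts, width)
        cpu_sum[b] += busy
        cpu_count[b] += 1
    for ts in request_timestamps:
        request_count[bucket(ts, width)] += 1

    result = {}
    for b in sorted(cpu_sum.keys() & request_count.keys()):
        mean_cpu = cpu_sum[b] / cpu_count[b]
        requests_per_second = request_count[b] / width
        result[b] = mean_cpu / requests_per_second
    return result
```

Because the join key is only the time bucket, this kind of post-processing works the same whether the two inputs came from one measurement, two measurements, or two databases; what changes between the options is how much of it the query layer can do for you.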

jackzampolin · April 20, 2017, 11:12pm

@paulo I would say option 2 would be your best bet. But I would suggest you look at the way telegraf stores this same data. It might also save you a bunch of time. There is a logparser plugin that might help out. There are also cpu and mem plugins along with around 100 other ones.

paulo · April 21, 2017, 8:47am

@jackzampolin,

What about retention policies? What if I want a different retention policy for Server KPIs and Web Access Logs? I’m not able to assign RPs to measurements, right? I’m thinking options #2 and #3 are probably the best ones and I feel I’m being too perfectionist but theoretically it’s still an interesting issue to determine.

I am indeed using Telegraf to collect Server KPIs, which is itself kind of a cross-breed of option #2. I tried using Telegraf to also collect Web Access Logs with the logparser input plugin, but it turned out to be missing a critical feature for avoiding Series explosion, so I made a Feature Request for it:

github.com/influxdata/telegraf

Feature Request - "Transform" Processor plugin

opened 11:30AM - 13 Apr 17 UTC

closed 10:49PM - 21 May 18 UTC

PauloAugusto-Asos

feature request

## Feature Request

Requesting a "Transform" processor plugin.

I am trying to import Web access logs into InfluxDB with Telegraf. However, some of the URL PATHs include identifiers _(session IDs, product IDs, etc)_. Ex:
`/products/cars/12345/view`
`/shoppingBasket/1234567890/view`

The URL PATH is being shipped as a Tag Value _(obviously)_. I need to be able to replace those identifiers in the PATH Tag Value before shipping the data to Influx _(or whatever other DB)_ so that they become easily recognizable as the «same» URL PATH for searches and aggregations and to prevent an explosion of "series" in InfluxDB or Graphite.

### Proposal:

**`[[processors.transformer]]`**
` tagpass = "ApacheLog"`
` tagname = "path"`
` matcher = "/products/cars/(\d+)/view/"`
` matchertype = "regex" # "literal"`
` replaceMatchedIndex = 1 # i0 being whole match. To replace *only* the ID`
` replacement = "{CarID}"`
` tagexclude = "ApacheLog"`

**`[[processors.transformer]]`**
` tagpass = "ApacheLog"`
` tagname = "path"`
` matcher = "/shoppingBasket/(\\d+)/view"`
` matchertype = "regex" # literal`
` replaceMatchedIndex = 1`
` replacement = "{SessionID}"`
` tagexclude = "ApacheLog"`

### Simpler Proposal:

**`[[processors.transformer]]`**
` tagpass = "ApacheLog"`
` tagname = "path"`
` matcher = "/products/cars/\\d+/view/"`
` matchertype = "regex" # "literal"`
` # replaceMatchedIndex = 1`
` replacement = "/products/cars/{CarID}/view/"`
` tagexclude = "ApacheLog"`

### SimplerSimpler Proposal:

**`[[processors.transformer]]`**
` tagpass = "ApacheLog"`
` tagname = "path"`
` replaceDigits = 3 # replace all sequences of X+ digits`
` replaceGuids = true`
` replaceTrimmedGuids = true # guids stripped of dashes`
` tagexclude = "ApacheLog"`
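
The "SimplerSimpler" variant of that request is easy to prototype outside of Telegraf. The following is only a sketch of the idea, not the requested plugin: `normalize_path` and the `{ID}` / `{GUID}` placeholders are names invented here, and the GUID patterns assume the usual 8-4-4-4-12 hexadecimal layout (or 32 hex characters with the dashes stripped).

```python
import re

# Canonical GUID: 8-4-4-4-12 hexadecimal digits.
GUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
# GUID with the dashes stripped: exactly 32 hexadecimal digits.
TRIMMED_GUID_RE = re.compile(r"(?<![0-9a-fA-F])[0-9a-fA-F]{32}(?![0-9a-fA-F])")


def normalize_path(path, min_digits=3):
    """Replace GUIDs and long digit runs so equivalent URLs share one tag value."""
    path = GUID_RE.sub("{GUID}", path)
    path = TRIMMED_GUID_RE.sub("{GUID}", path)
    # Replace every run of at least `min_digits` digits.
    return re.sub(r"\d{%d,}" % min_digits, "{ID}", path)
```

Applied before the path is emitted as a tag value, this collapses `/products/cars/12345/view` and every sibling URL into a single `/products/cars/{ID}/view` value, so the number of series no longer grows with the number of products or sessions.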

I ended up going with LogStash instead (unfortunately, as I now require 2 tools), uploading to InfluxDB. That one was also missing a critical feature, so I created another Feature/Bug Request for it:

github.com/logstash-plugins/logstash-output-influxdb

Enhancement Request - automatic time-bucket conflict resolution

opened 09:41AM - 18 Apr 17 UTC

PauloAugusto-Asos

## Enhancement Request

Requesting that the plugin automatically resolves time-bucket conflicts.

If we send 2 or more data points to the same "series" with the same timestamp, ex:
- same Host "tag",
- same HTTP Method "tag",
- same HTTP Status response "tag",
- same URL Path "tag",
- for the exact same time,
- duration / time-taken as value/data point,

InfluxDB will just overwrite all the data points with the last one received. This is quite likely to happen in high traffic websites, where you'll have the same server respond to more than 1 _equal_ request in a second, while storing the request/response time with only the granularity of Second.

### Proposal:

**`output {`**
`  influxdb {`
`    time_conflict_resolver => "AddMillisecond"`
changes the timestamps
12:34:56, 12:34:56, 12:34:56
To:
12:34:56.001, 12:34:56.002, 12:34:56.003

`    time_conflict_resolver => "AddMicrosecond"`
Same but at the level of Microsecond. Potentially also the same but at the level of Nanosecond.

`    time_conflict_resolver => "AddNewTag"`
`    time_conflict_resolving_tag => "qwerty"`
Adds the following InfluxDB "tags" to ***only and each conflicting*** datapoint:
qwerty=1, qwerty=2, qwerty=3
This one creates new series but it's my favorite, as it doesn't change the timestamp.
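
Both resolution strategies from that request can be sketched in a few lines. This is an illustration of the idea only, not the LogStash plugin's behaviour: points are assumed to be dicts with `measurement`, `tags`, `time` (nanoseconds, but logged at second granularity) and `fields`, and the function names are invented. Unlike the example in the request, the first point of a colliding group keeps its original timestamp here.

```python
from collections import Counter, defaultdict


def series_key(point):
    """A series is identified by its measurement plus its full tag set."""
    return (point["measurement"], tuple(sorted(point["tags"].items())))


def add_millisecond(points):
    """Shift the n-th point of each colliding group by n milliseconds."""
    seen = defaultdict(int)
    out = []
    for p in points:
        key = (series_key(p), p["time"])
        out.append({**p, "time": p["time"] + seen[key] * 1_000_000})
        seen[key] += 1
    return out


def add_new_tag(points, tag="qwerty"):
    """Add a numbering tag to points that collide, and only to those."""
    totals = Counter((series_key(p), p["time"]) for p in points)
    seen = defaultdict(int)
    out = []
    for p in points:
        key = (series_key(p), p["time"])
        if totals[key] > 1:
            seen[key] += 1
            p = {**p, "tags": {**p["tags"], tag: str(seen[key])}}
        out.append(p)
    return out
```

The timestamp variant keeps the series count unchanged but records times that never occurred; the tag variant keeps the timestamps intact at the cost of a few extra series, bounded by the largest number of identical requests seen in one second.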

So I ended up with the old classic, LogStash + ElasticSearch. I wanted one DB and one collector… I’m still hoping that the LogStash InfluxDB output plugin will end up “fixing” that issue and release me from having to rely on an ElasticSearch DB, but for now I’m stuck.

jackzampolin · April 21, 2017, 9:06pm

@paulo How many series were you creating? “Series explosion” shouldn’t be a big issue after tsi gets merged.

What would you need in the logparser to make it work for your usecase?

Luv · April 21, 2017, 10:44pm

I would also go with option 2. It keeps log data in one database, but separates different types of logs by putting them in different measurements.

If you want to parse different log files and store them in different measurements, then you have to define [[inputs.logparser]] once for each log file you want to parse, because a single [[inputs.logparser]] instance can only put the parsed logs under one measurement.

    [[inputs.logparser]]
      # Log file(s) to tail and parse
      files = ["/var/log/nginx/api_access.log"]
      [inputs.logparser.grok]
        # Built-in grok pattern for the combined access log format
        patterns = ["%{COMBINED_LOG_FORMAT}"]
        # Measurement that the parsed lines are written to
        measurement = "api_access.log"

You would repeat this input plugin, storing each file under its own measurement. This is the only way to put different log files under different measurements.

paulo · April 24, 2017, 8:27am

Hi Luv, what do you think about retention policies? They’re per Database, correct? With Option #2 I won’t be able to set different RPs for different kinds of logs. Not sure that bothers me all that much, though.
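
For context on that question: in InfluxDB 1.x a retention policy is created on a database, but a database can hold several of them, and every point is written into one particular policy, so measurements in the same database can end up with different retention if they are written through different policies. A small sketch of the InfluxQL involved, built as plain strings; the policy names and durations are hypothetical, and `create_rp_statement` / `qualified` are helper names invented here.

```python
def create_rp_statement(database, name, duration, replication=1, default=False):
    """Build an InfluxQL CREATE RETENTION POLICY statement."""
    stmt = (
        f'CREATE RETENTION POLICY "{name}" ON "{database}" '
        f"DURATION {duration} REPLICATION {replication}"
    )
    return (stmt + " DEFAULT") if default else stmt


def qualified(database, retention_policy, measurement):
    """Fully qualified measurement name, as used in a FROM clause."""
    return f'"{database}"."{retention_policy}"."{measurement}"'


# Hypothetical policies: keep KPIs for 90 days, access logs for 14 days.
kpi_rp = create_rp_statement("database_xyz", "kpis_90d", "90d", default=True)
log_rp = create_rp_statement("database_xyz", "weblogs_14d", "14d")
query = f'SELECT * FROM {qualified("database_xyz", "weblogs_14d", "WebAccessLogs")}'
```

Whether the collector can route each input to its own policy is a separate question that depends on the writer being used.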

paulo · April 24, 2017, 8:48am

Hi Jack.

> How many series were you creating?

In terms of number of Series, many of the URLs have a Session ID in the PATH, so it would be pretty much one whole set of Series each time a user gets to the website that day.

To be honest, though, I don’t know how much impact this would have on the current engine. But it appears to be strongly discouraged and, from what I understand of the engine, highly undesirable. Is that something the DB engine could easily live with? We don’t want to implement something that’s discouraged only to find ourselves, a few weeks from now, in a deep hole that we dug for ourselves with a bad design.

> “Series explosion” shouldn’t be a big issue after […]

As for the upcoming engine, we cannot wait on «promises of the future». We need the logs «today», not someday in a future that might never come to be. As I said, I don’t really know how much impact that «Series explosion» would have on the server, but if it is substantially harmful to the engine as it is now, then we have to work around it in some way.

> What would you need in the logparser to make it work for your usecase?

At the moment, we would need Telegraf’s logparser to be able to do these 2 things:

# Transform data #
For example, be able to transform this:
/getPersonalAccount/324651-1234-1234-1234-123456678/
Into this:
/getPersonalAccount/{GUID}/

# Aggregate or resolve time-conflicted lines #
We have some high-traffic web servers that serve many requests which are identical in all “Tags” and happen within the exact same second. In InfluxDB all but one of them will be lost, and with them the ability to calculate the frequency of Requests/s and proper statistics (average/percentiles/min/max/etc) for Response Times.

I would need Telegraf to have the ability to either aggregate the requests with strong statistical capabilities, or the ability to resolve those time-bucket conflicts by detecting the conflict and adding a conflict-resolving Tag.
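
As an illustration of the aggregation alternative, here is a sketch that collapses all requests sharing the same tags and the same second into one point carrying a count and response-time statistics. The input shape `(tags, timestamp_seconds, duration_ms)`, the function names, and the choice of a nearest-rank 95th percentile are assumptions made for the example.

```python
import math
from collections import defaultdict
from statistics import mean


def percentile(sorted_values, q):
    """Nearest-rank percentile of an ascending list, with q in (0, 100]."""
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def aggregate_per_second(requests):
    """Group requests by (tags, second) and summarise their durations."""
    groups = defaultdict(list)
    for tags, ts_seconds, duration_ms in requests:
        groups[(tags, ts_seconds)].append(duration_ms)

    summary = {}
    for key, durations in groups.items():
        durations.sort()
        summary[key] = {
            "count": len(durations),  # doubles as Requests/s for this second
            "min": durations[0],
            "max": durations[-1],
            "mean": mean(durations),
            "p95": percentile(durations, 95),
        }
    return summary
```

One aggregated point per series per second can never collide, but percentiles computed this way cannot be recombined exactly into percentiles over longer windows, which is the main cost of aggregating before storage.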

jackzampolin · April 24, 2017, 6:40pm

@paulo A single instance can handle ~5M series pretty comfortably, so there is some headroom depending on how many series you are creating daily and how long you want your retention period to be. Currently each series key is stored in memory to look up the array of values on disk. That means the more series, the larger the RAM requirement. One way a lot of folks deal with this is having the high cardinality data downsampled into much lower cardinality data.
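
A back-of-the-envelope way to check a schema against that headroom: the number of series in a measurement is at most the product of the number of distinct values of each tag. The sketch below uses made-up tag counts; only the ~5M figure comes from the paragraph above.

```python
from math import prod

SERIES_HEADROOM = 5_000_000  # rough figure from the reply above


def series_upper_bound(tag_value_counts):
    """Worst case: every combination of tag values actually occurs."""
    return prod(tag_value_counts.values())


# Hypothetical access-log measurement with normalized paths...
normalized = series_upper_bound({"host": 10, "method": 5, "status": 20, "path": 200})
# ...and the same measurement when raw paths embed IDs.
raw = series_upper_bound({"host": 10, "method": 5, "status": 20, "path": 50_000})

fits_normalized = normalized <= SERIES_HEADROOM
fits_raw = raw <= SERIES_HEADROOM
```

Real cardinality is usually well below this bound because most combinations never occur, so it is best read as a quick sanity check rather than a prediction.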

The promises of the future are only there because you could start collecting your data now at a lower retention period, and once the changes come in lengthen the retention period to infinite.

The logparser can currently do that, you just have to write the proper parsing rules to enable that behavior.

Time conflicted lines are best taken care of by properly tagging the data to avoid timestamp collisions. Session_id would take care of that requirement. Also you could increase the precision of the emitted timestamps. I would say that arbitrarily adding a tag is a bad idea as a way to resolve those conflicts.

paulo · April 25, 2017, 3:46pm

Hi Jack,

> The promises of the future are only there because

Just to be clear, I didn’t mean any kind of criticism of your comment; I was just speaking literally.

I’ve grown quite bitter and skeptical of promises from the likes of “next version all bugs will be fixed, all imaginary features will be there including curing cancer, just stick with us another year, another year, another year, ad infinitum”.

^ you can probably guess some of the high profile company names I’m thinking of who’ve been fooling us for decades.

> The logparser can currently do that, you just have to write the proper parsing rules to enable that behavior.

Are you sure that’s the case? I couldn’t find anything and ended up opening a Feature Request with them:
https://github.com/influxdata/telegraf/issues/2667
It was apparently accepted, which hints to me that it’s not yet possible!?

Maybe that feature request is then useless. How are you thinking that can be done? Can you give me some kind of example?

> Time conflicted lines are best taken care of by properly tagging the data to avoid timestamp collisions. Session_id would take care of that requirement. […] I would say that arbitrarily adding a tag is a bad idea as a way to resolve those conflicts.

I’m a bit confused. You’ve mentioned it’s best to avoid conflicts by tagging but … that adding a tag to resolve conflicts is a bad idea? InfluxDB’s documentation says that if we need to resolve timestamp conflicts, we should add an extra Tag for it.

Also, you seem to be referring to the idea of having the likes of Session ID in some Tags as something desirable. Isn’t that highly or at least moderately undesirable? It would create a huge amount of Series caused by a variable (which theoretically should then be used as a Field instead of as a Tag) and make most of the Series exist merely for periods of 1~20 minutes.
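
The difference between the two placements can be shown by counting distinct tag combinations over the same sample of requests. This is a toy sketch with invented data (1,000 sessions each requesting one path, a single measurement assumed); `count_series` is a name made up for the example.

```python
def count_series(points, tag_keys):
    """Count distinct series: one per unique combination of tag values."""
    return len({tuple(p[k] for k in tag_keys) for p in points})


# Hypothetical sample: 1,000 sessions, each requesting the same path once.
points = [
    {"host": "webserver-1", "method": "GET", "path": "/basket/view", "session_id": f"s{i}"}
    for i in range(1000)
]

with_session_tag = count_series(points, ["host", "method", "path", "session_id"])
without_session_tag = count_series(points, ["host", "method", "path"])
```

With the session ID as a tag, every session becomes its own short-lived series; stored as a field, all 1,000 requests land in one series, at which point requests in the same second collide again unless the timestamps are more precise.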
