Json file > line protocol > influxdb


#1

At the bottom is an example of some metrics in a JSON file I’d like to massage into influxdb. This file is updated every 10 minutes and is converted from XML via xmltodict automatically. This is the first part of a much larger project to gather and display organisation wide infrastructure/system metrics.

Unfortunately, I’m no programmer and despite looking at [Influx Line Protocol documentation](https://docs.influxdata.com/influxdb/v1.7/write_protocols/line_protocol_tutorial/ posts I’m still unclear about how to achieve this. I believe using telegraf’s json parser is the best way to go. However I’m unclear about what values I should use in telegraf.conf. Especially the GJSON paths. Some guidance would be appreciated!

I’m interested in extracting the following values…

“system”: { “name”: “"example1 },
“volume”: “id”: “#”
“volume”: {“name”: “/”}
“volume”: {“state”: “online”}
“volume”: {“spaceUsedGB”: “#####”}
“volume”: {“spaceAvailableGB”: “#####.##”}
“volume”: {“hardQuotaGB”: “###.##”,}
“volume”: {“softQuotaGB”: “###.##”}
“time”: “2018-12-11T09:40:02Z”,

measurement: system-name
tags: volume-id,volume-name,volume-state
fields: volume-SpaceUsedGb,volume-SpaceAvailableGB,volume-HardQuotaGB,volume-SoftQuotaGB
timestamp: time

How would I configure telegraf.conf given the above values???

.
######################### /etc/telegraf/telegraf.conf #######################################

Query is a GJSON path that specifies a specific chunk of JSON to be
parsed, if not specified the whole document will be parsed.
GJSON query paths are described here:
{https://github.com/tidwall/gjson#path-syntax

json_query = ""

Tag keys is an array of keys that should be added as tags.
tag_keys = [
** “my_tag_1”,**
** “my_tag_2”**
** ]**

String fields is an array of keys that should be added as string fields.
json_string_fields = []

Name key is the key to use as the measurement name.
json_name_key = ""

Time key is the key containing the time that should be used to create the
metric.
json_time_key = ""

##################################################################################

What should the resultant influx data point look like?

Thanks in advance

{
“pasxml”: {
“version”: “6.0.0”,
“system”: {
“name”: “example1”,
“IPV4”: “0.0.0.0”,
“alertLevel”: “critical”,
“state”: “online”
},
“time”: “2018-12-11T09:40:02Z”,
“volumes”: {
“volume”: [
{
“id”: “1”,
“name”: “/”,
“bladesetName”: {
“id”: “1”,
#text”: “Set-1”
},
“state”: “Online”,
“raid”: “Object RAID6+”,
“director”: “Shelf-001,1”,
“volservice”: “0x0400000000000004(FM)”,
“objectId”: “I-xD0200000000000004-xG7ee84b0c-xU00004a75726a656e”,
“recoveryPriority”: “1”,
“efsaMode”: “retry”,
“spaceUsedGB”: “0”,
“spaceAvailableGB”: “695667.94”,
“hardQuotaGB”: “0.52”,
“softQuotaGB”: “0.52”,
“userQuotaPolicy”: {
“inherit”: “0”,
#text”: “disabled”
},
“stats”: null
},
{
“id”: “8”,
“name”: “/datacentre/archvol/pan101”,
“bladesetName”: {
“id”: “1”,
#text”: “Set-1”
},
“alertLevel”: “critical”,
“state”: “Online”,
“raid”: “Object RAID6+”,
“director”: “Shelf-008,1”,
“volservice”: “0x04000000000000ec(FM)”,
“objectId”: “I-xD02000000000000ec-xG5c7aef6f-xU00004a75726a656e”,
“recoveryPriority”: “50”,
“efsaMode”: “retry”,
“spaceUsedGB”: “117000.01”,
“spaceAvailableGB”: “695667.94”,
“hardQuotaGB”: “117000.00”,
“softQuotaGB”: “90000.00”,
“userQuotaPolicy”: {
“inherit”: “1”,
#text”: “disabled”
},
“stats”: null
}


#2

Okay so this is a completely different path so if you don’t want to do this no big deal.

Normally I recommend to people that need to do any advanced data parsing to move it outside of Telegraf. Telegraf is really good at collecting stuff. Let something else tell it what data to store.

I would try to create a ruby/python/etc script to gather this data and then return it to Telegraf. You can do this with the exec plugin


#3

Perhaps I’m not seeing benefit though I’m happy to be corrected. Isn’t the exec plugin just a substitute for cron? It executes scripts and puts the resultant datapoints into influx at the specified interval. Also, isn’t GJSON go json parsing which is external to telegraf?

Thank you for the suggestion. However, It probably makes sense to continue the way I’m currently going.

My post is more about understanding the line protocol and how it is extracted from json. If you or someone else could give an example of a gjson path based on the values above that would be great.


#4

The json parser is currently unable to do what you intend.
The line protocol would look something like:

example1,volume_id=1,volume_name=/,volume_state=Online volume_spaceUsedGB="0",volume_spaceAvailableGB="695667.94",volume_hardQuotaGB="0.52",volume_softQuotaGB="0.52" 1544659870000000000

Note that if you use the same timestamp, it will be overwritten in many timeseries databases (influx included) because the timestamp is intended to be unique.

As i mentioned, you can get close with the json parser and an input (file?).:

telegraf config example:

  name_override="example1"
  data_format = "json"
  json_query = "pasxml.volumes.volume"
  json_name_key = "system_name"
  tag_keys = [
    "id",
    "name",
    "state"
  ]

  json_string_fields = [
    "spaceUsedGB",
    "spaceAvailableGB",
    "hardQuotaGB",
    "softQuotaGB"
  ]

output:

example1,id=1,name=/,state=Online hardQuotaGB="0.52",spaceUsedGB="0",spaceAvailableGB="695667.94",softQuotaGB="0.52" 1544661151000000000
example1,id=8,name=/datacentre/archvol/pan101,state=Online hardQuotaGB="117000.00",softQuotaGB="90000.00",spaceUsedGB="117000.01",spaceAvailableGB="695667.94" 1544661151000000000

input:

{
	"pasxml": {
		"version": "6.0.0",
		"system": {
			"name": "example1",
			"IPV4": "0.0.0.0",
			"alertLevel": "critical",
			"state": "online"
		},
		"time": "2018-12-11T09:40:02Z",
		"volumes": {
			"volume": [{
					"id": "1",
					"name": "/",
					"bladesetName": {
						"id": "1",
						"#text": "Set-1"
					},
					"state": "Online",
					"raid": "Object RAID6+",
					"director": "Shelf-001,1",
					"volservice": "0x0400000000000004(FM)",
					"objectId": "I-xD0200000000000004-xG7ee84b0c-xU00004a75726a656e",
					"recoveryPriority": "1",
					"efsaMode": "retry",
					"spaceUsedGB": "0",
					"spaceAvailableGB": "695667.94",
					"hardQuotaGB": "0.52",
					"softQuotaGB": "0.52",
					"userQuotaPolicy": {
						"inherit": "0",
						"#text": "disabled"
					},
					"stats": null
				},
				{
					"id": "8",
					"name": "/datacentre/archvol/pan101",
					"bladesetName": {
						"id": "1",
						"#text": "Set-1"
					},
					"alertLevel": "critical",
					"state": "Online",
					"raid": "Object RAID6+",
					"director": "Shelf-008,1",
					"volservice": "0x04000000000000ec(FM)",
					"objectId": "I-xD02000000000000ec-xG5c7aef6f-xU00004a75726a656e",
					"recoveryPriority": "50",
					"efsaMode": "retry",
					"spaceUsedGB": "117000.01",
					"spaceAvailableGB": "695667.94",
					"hardQuotaGB": "117000.00",
					"softQuotaGB": "90000.00",
					"userQuotaPolicy": {
						"inherit": "1",
						"#text": "disabled"
					},
					"stats": null
				}
			]
		}
	}
}


#5

Thanks, that’s a great help!

With regard to the timestamp uniqueness. Would a reasonable solution be to add a timestamp + 1ns to each pasxml.volumes section to make each entry unique? Would that have any consequences when retrieving data later on?


#6

Perhaps I’ve misunderstood your post with regard to uniqueness. If it’s the case that time series databases cannot have two matching entries, then the following documentation is a little misleading to say the least.

The examples on this page have exactly similar timestamps for two or more entries in the same table. I presume I don’t need to add unique timestamps for each entry in my situation?


#7

Yes, that was my bad. Apparently it’s measurement, timestamp, and tag-set unique. so, you’d be fine as your tags would be different, just as the park_id is different for the same timestamps in that doc you linked.