Calculate average of a single field with two tags

Hi,

I want to calculate the average of a field value with respect to two tags, server and name. server stores the IP address of a server, and name corresponds to the output names of the IPMI sensor readings. We collect them through Telegraf, so we know that we always get all values for each time point.

name can be psu1_pout or psu2_pout, so we want to calculate the average power draw of each server. I tried it with the following TICKscript, but failed. Right now I have no idea why the stateDuration node produces errors. It would be great to get some advice on how to achieve this within a TICKscript.

Since I was able to get the desired result in the Data Explorer in Chronograf, it should be possible. Still, I am stuck, because the documentation on writing complex TICKscripts is quite thin. If anybody has useful links to more documentation or examples, that would also be very welcome.

Or if there are any further recommendations, please share your opinion.

Best Regards,

Stephan

And now the output from Kapacitor for the alert.

root@kapacitor:/etc/kapacitor/templates# kapacitor show CMC_PSU_Off 

ID: CMC_PSU_Off
Error:
Template:
Type: stream
Status: enabled
Executing: true
Created: 05 Aug 19 16:35 CEST
Modified: 05 Aug 19 16:35 CEST
LastEnabled: 05 Aug 19 16:35 CEST
Databases Retention Policies: ["db_cmc"."autogen"]
TICKscript:
var db = 'db_cmc'
var rp = 'autogen'
var measurement = 'ipmi_sensor'
var groupBy = ['server', 'name']
var whereFilter = lambda: ("name" == 'psu1_pout' OR "name" == 'psu2_pout')
var name = 'CMC PSU off'
var idVar = name + '-{{.Group}}'
var message = '
ID {{.ID}}
Name {{.Name}}
TaskName {{.TaskName}}
Level {{.Level}}
GroupBy {{.Group}}
Tags {{.Tags}}
CMC {{ index .Tags "server" }}
Fault Chassi is off
Time {{.Time}}'

var idTag = 'alertID'
var levelTag = 'level'
var messageField = 'message'
var durationField = 'duration'
var outputDB = 'chronograf'
var outputRP = 'autogen'
var outputMeasurement = 'alerts'
var triggerType = 'threshold'

var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
    |stateDuration(lambda: (mean("value") == 0))
        .unit(1m)
        .as('CritDuration')

var trigger = data
    |alert()
        // state duration crit
        .crit(lambda: ("CritDuration" > 5))
        .stateChangesOnly()
        .message(message)
        .id(idVar)
        .idTag(idTag)
        .levelTag(levelTag)
        .messageField(messageField)
        .log('/etc/kapacitor/templates/alert_logs/cmc_psu_off.log')

trigger
    |eval(lambda: float("value"))
        .as('value')
        .keep()
    |influxDBOut()
        .create()
        .database(outputDB)
        .retentionPolicy(outputRP)
        .measurement(outputMeasurement)
        .tag('alertName', name)
        .tag('triggerType', triggerType)

trigger
    |httpOut('output')

DOT:
digraph CMC_PSU_Off {
graph [throughput="0.00 points/s"];

stream0 [avg_exec_time_ns="0s" errors="0" working_cardinality="0" ];
stream0 -> from1 [processed="123404010"];

from1 [avg_exec_time_ns="22.172µs" errors="0" working_cardinality="0" ];
from1 -> state_duration2 [processed="2490097"];

state_duration2 [avg_exec_time_ns="48.357µs" errors="2490097" working_cardinality="1136" ];
state_duration2 -> alert3 [processed="0"];

alert3 [alerts_inhibited="0" alerts_triggered="0" avg_exec_time_ns="0s" crits_triggered="0" errors="0" infos_triggered="0" oks_triggered="0" warns_triggered="0" working_cardinality="0" ];
alert3 -> http_out6 [processed="0"];
alert3 -> eval4 [processed="0"];

http_out6 [avg_exec_time_ns="0s" errors="0" working_cardinality="0" ];

eval4 [avg_exec_time_ns="0s" errors="0" working_cardinality="0" ];
eval4 -> influxdb_out5 [processed="0"];

influxdb_out5 [avg_exec_time_ns="0s" errors="0" points_written="0" working_cardinality="0" write_errors="0" ];
}

Hello @SWalter,

Thanks for your question. Yeah, Kapacitor is tricky like that. Do you mind first sharing the query that you produced in Chronograf (when you say “Since I was able to achieve the wished result within the data explorer in Chronograf”)?

Hi @Anaisdg

The following query produces the expected result. Only servers where the mean of “value” over the tags ‘psu1_pout’ and ‘psu2_pout’ is zero are shown in the graph.

SELECT "mean_value" FROM (SELECT mean("value") AS "mean_value" FROM "db_cmc"."autogen"."ipmi_sensor" WHERE time > :dashboardTime: AND ("name"='psu1_pout' OR "name"='psu2_pout') GROUP BY time(:interval:), "server" FILL(null)) WHERE "mean_value"=0 GROUP BY "server"

PS: The forum is unusable with IE11 Version 11.0.960019377, since it loads 500 MB for each single character of input and creates such high load that the system hangs for several seconds. Chrome is fine.

@SWalter why have you chosen to use Kapacitor for this? Have you tried using a Continuous Query? It might be easier.

We have to create alerts. So from my point of view, this should be done through Kapacitor. Or not?

So no further input?

We have to use Kapacitor, since we also need to send a mail if we detect these events. As far as I know, this is the purpose Kapacitor was built for.

I have looked into Continuous Queries, and they could be helpful at other points, but not for this specific problem.

@SWalter,
Sometimes it makes sense to use CQ instead of using Kapacitor if you’re only performing a few aggregations. You can then use Kapacitor on top to alert on the CQ. You’re right though, the alerting should be done through Kapacitor.

As for your script, I would suggest using a batch task instead, where the query is a subquery. Then do the additional where "mean_value"=0 Group By "server" in the rest of the tickscript.
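A minimal sketch of what such a batch task could look like, assuming the aggregation is done by the query itself and the zero check is moved into the alert node, so no subquery has to run inside Kapacitor. The 5m period, the 1m interval, and the message text are assumptions, not values confirmed in this thread:

```
// Batch task sketch: query the mean PSU output per server and alert when it is zero.
// .period() and .every() are assumed values; adjust them to your collection interval.
var data = batch
    |query('''
        SELECT mean("value") AS "mean_value"
        FROM "db_cmc"."autogen"."ipmi_sensor"
        WHERE ("name" = 'psu1_pout' OR "name" = 'psu2_pout')
    ''')
        .period(5m)
        .every(1m)
        .groupBy('server')

data
    |alert()
        .id('CMC PSU off-{{.Group}}')
        .message('{{.ID}} is {{.Level}}')
        // mean_value is a float, so compare against a float literal
        .crit(lambda: "mean_value" == 0.0)
        .stateChangesOnly()
```

With this shape, each query covers the last 5 minutes per server, so a mean of zero means both PSU readings were zero for the whole period. The task has to be defined as a batch task rather than a stream task.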

@SWalter

```
stream
    |from()
        .database('db_cmc')
        .retentionPolicy('autogen')
        .measurement('ipmi_sensor')
        .where(lambda: "name" == 'psu1_pout' OR "name" == 'psu2_pout')
        .groupBy('server')
    |window()
        .period(1m)
        .every(1m)
    |mean('value')
        .as('mean_value')
```

Ok, that seems quite simple. Is the trick the use of the window function, or that you just group by server?

Right now I would say that the window function just eliminates the need for the stateDuration function: if we increased the window to 5 minutes, a mean of 0 would mean the same as a stateDuration of 5 minutes.

The drawback would be that, as far as I know, it wouldn’t be possible to also define the critReset with a duration.
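One possible way around that limitation, sketched under the assumption that stateDuration nodes can be chained after the windowed mean: track how long the mean has been zero and, separately, how long it has been non-zero, and use the second duration in critReset. The errors on state_duration2 in the original script are probably caused by calling mean() inside the lambda, which does not appear to be among the functions available in lambda expressions; here the lambdas only read the already computed mean_value field. The field name OkDuration and the 5 minute thresholds are assumptions:

```
// Stream task sketch: duration-based crit and critReset on a 1m windowed mean.
var data = stream
    |from()
        .database('db_cmc')
        .retentionPolicy('autogen')
        .measurement('ipmi_sensor')
        .where(lambda: "name" == 'psu1_pout' OR "name" == 'psu2_pout')
        .groupBy('server')
    |window()
        .period(1m)
        .every(1m)
    |mean('value')
        .as('mean_value')
    // minutes the mean has been zero (-1 while the condition is false)
    |stateDuration(lambda: "mean_value" == 0.0)
        .unit(1m)
        .as('CritDuration')
    // minutes the mean has been non-zero (OkDuration is a hypothetical field name)
    |stateDuration(lambda: "mean_value" > 0.0)
        .unit(1m)
        .as('OkDuration')

data
    |alert()
        .crit(lambda: "CritDuration" >= 5.0)
        .critReset(lambda: "OkDuration" >= 5.0)
        .stateChangesOnly()
```

The alert then goes critical after the mean has been zero for 5 minutes and only resets once the mean has stayed above zero for 5 minutes.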

Continuous Queries are interesting, but they would make our setup more complex, since we have separate containers for all TICK components and right now you never have to do anything inside the InfluxDB container. As I understand it, this would no longer be the case if we used CQs.

Thank you for your help. I will try to test it today.