Is it possible for a TICKscript to crash Kapacitor? That seems to be what’s happening, but I can’t confirm it and don’t know where to look.
What I know is that we’ve had Kapacitor running for quite a while without much use. I created a template task and used it to make tasks. Now Kapacitor crashes with the following panic after running for about a minute:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0xa0 pc=0xa07575]
goroutine 3078 [running]:
github.com/influxdata/kapacitor.(*InfluxQLNode).runStreamInfluxQL(0xc4428bcc00, 0xc43bd1edc0, 0xc420020600)
/home/ec2-user/go/src/github.com/influxdata/kapacitor/influxql.go:106 +0xe05
github.com/influxdata/kapacitor.(*InfluxQLNode).runInfluxQLs(0xc4428bcc00, 0x0, 0x0, 0x0, 0xc434550f78, 0xc43bd1edc0)
/home/ec2-user/go/src/github.com/influxdata/kapacitor/influxql.go:43 +0x115
github.com/influxdata/kapacitor.(*InfluxQLNode).(github.com/influxdata/kapacitor.runInfluxQLs)-fm(0x0, 0x0, 0x0, 0xc434550fa0, 0x0)
/home/ec2-user/go/src/github.com/influxdata/kapacitor/influxql.go:36 +0x48
github.com/influxdata/kapacitor.(*node).start.func1(0xc4428bcc00, 0x0, 0x0, 0x0)
/home/ec2-user/go/src/github.com/influxdata/kapacitor/node.go:140 +0x8e
created by github.com/influxdata/kapacitor.(*node).start
/home/ec2-user/go/src/github.com/influxdata/kapacitor/node.go:141 +0x5d
If I delete my tasks, Kapacitor doesn’t crash right away, so I’m thinking there’s something in the TICKscript that’s ticking off Kapacitor.
Here’s the template task:
// API call that this task monitors and triggers on
var targetApi string
// Application this is grouped with
var application string
// Number of errors that triggers an alert
var errorAbsolute int
// Percentage of errors that triggers an alert
var errorPercentage int
// Minimum number of errors before triggering an alert
var minErrorCount int
// Minimum number of requests before triggering an alert
var minRequestCount int
// Any relevant notes about this api; used if there are known issues
var note string
// response time on which to trigger an alert
var responseTime int
// length of sliding window of time
var period = 5m
// frequency it is checked
var every = 1m
var data = stream
    |from()
        .measurement('api_detail')
    |where(lambda: "api" == targetApi)
    |window()
        .period(period)
        .every(every)
    |mean('duration')
        .as('executionTime')
    |log().prefix('initial stream: ').level('DEBUG')

var totalRecords = data
    |count('duration')
    |log().prefix('totalRecords: ').level('DEBUG')

var totalErrors = data
    |where(lambda: "error" == 'true')
    |log().prefix('totalErrors (pre-count): ').level('DEBUG')
    |count('duration')
    |log().prefix('totalErrors (post-count): ').level('DEBUG')

totalErrors
    |log().prefix('pre-join: ').level('DEBUG')
    |join(totalRecords)
        .fill(0)
        .as('totalErrors','totals')
    |eval(lambda: float("totalErrors.count") / float("totals.count" + 1) )
        .as('rate')
        .keep()
    |log().prefix('pre-alert: ').level('DEBUG')
    |alert()
        .id('{{ index .Tags "api"}}')
        .message('Count: {{ index .Fields "totals.count" }}; Errors: {{ index .Fields "totalErrors.count" }} Rate: {{ index .Fields "rate" }} maxDuration: {{ index .Fields "executionTime" }}')
        //.crit(lambda: TRUE)
        .crit(lambda: "rate" > errorPercentage OR "totals.count" > errorAbsolute OR max('duration') > responseTime)
        // Whenever we get an alert write it to a file.
        .log('/tmp/alerts.log')
        .sensu()
So generally, can anyone spot anything that might cause a SIGSEGV? Or shed some light on what I might be doing wrong?
I have some specific questions, too:
In the stack trace, does this goroutine number:
goroutine 3078 [running]:
tell me anything useful about which task might be at fault (if it’s a task)?
This line seems to lay the blame on an InfluxQLNode:
github.com/influxdata/kapacitor.(*InfluxQLNode).runStreamInfluxQL(0xc4428bcc00, 0xc43bd1edc0, 0xc420020600)
Is there a way to tell which InfluxQLNode? And do the numbers after runStreamInfluxQL point to anything useful in tracking down this bug?
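My tentative reading, which I'd welcome corrections on: in Go stack traces of this style, the first value printed for a pointer-receiver method appears to be the receiver pointer. That would explain why 0xc4428bcc00 shows up in both the runStreamInfluxQL and runInfluxQLs frames. The small Go sketch below pulls that value out of a frame line so frames can be compared. The helper name receiverOf and the regular expression are my own, not part of Kapacitor, and the address only identifies an object inside the crashed process, so by itself it does not name a task.

```go
package main

import (
	"fmt"
	"regexp"
)

// frameRe matches a pointer-receiver method frame such as
// "pkg.(*Type).method(0xc4428bcc00, ...)" and captures the type name,
// the method name, and the first printed argument word.
// Assumption: the first word is the receiver pointer (older Go trace format).
var frameRe = regexp.MustCompile(`\(\*(\w+)\)\.([\w.]+)\((0x[0-9a-f]+)`)

// receiverOf is a hypothetical helper: it returns the type, method, and
// first argument word of a stack frame line, or ok=false if the line
// does not look like a pointer-receiver method frame.
func receiverOf(line string) (typ, method, addr string, ok bool) {
	m := frameRe.FindStringSubmatch(line)
	if m == nil {
		return "", "", "", false
	}
	return m[1], m[2], m[3], true
}

func main() {
	// Two frames copied from the panic output in this post.
	frames := []string{
		"github.com/influxdata/kapacitor.(*InfluxQLNode).runStreamInfluxQL(0xc4428bcc00, 0xc43bd1edc0, 0xc420020600)",
		"github.com/influxdata/kapacitor.(*InfluxQLNode).runInfluxQLs(0xc4428bcc00, 0x0, 0x0, 0x0, 0xc434550f78, 0xc43bd1edc0)",
	}
	for _, f := range frames {
		if typ, method, addr, ok := receiverOf(f); ok {
			fmt.Printf("%s.%s receiver=%s\n", typ, method, addr)
		}
	}
}
```

If that reading is right, both frames refer to the same InfluxQLNode instance, but I still don't see how to map that pointer back to a particular task or node in the TICKscript.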
Any help would be appreciated. Still a newbie…