SIGSEGV: segmentation violation caused by TICKscript?

sysdig
kapacitor

Is it possible for a TICKscript to crash kapacitor? That seems to be what’s happening, but I can’t confirm that and don’t even know where to look.

What I know is that we’ve had kapacitor running for quite a while without much use. I created a template task and used it to make tasks. Now Kapacitor crashes with the following after running for about a minute:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0xa0 pc=0xa07575]

goroutine 3078 [running]:
github.com/influxdata/kapacitor.(*InfluxQLNode).runStreamInfluxQL(0xc4428bcc00, 0xc43bd1edc0, 0xc420020600)
        /home/ec2-user/go/src/github.com/influxdata/kapacitor/influxql.go:106 +0xe05
github.com/influxdata/kapacitor.(*InfluxQLNode).runInfluxQLs(0xc4428bcc00, 0x0, 0x0, 0x0, 0xc434550f78, 0xc43bd1edc0)
        /home/ec2-user/go/src/github.com/influxdata/kapacitor/influxql.go:43 +0x115
github.com/influxdata/kapacitor.(*InfluxQLNode).(github.com/influxdata/kapacitor.runInfluxQLs)-fm(0x0, 0x0, 0x0, 0xc434550fa0, 0x0)
        /home/ec2-user/go/src/github.com/influxdata/kapacitor/influxql.go:36 +0x48
github.com/influxdata/kapacitor.(*node).start.func1(0xc4428bcc00, 0x0, 0x0, 0x0)
        /home/ec2-user/go/src/github.com/influxdata/kapacitor/node.go:140 +0x8e
created by github.com/influxdata/kapacitor.(*node).start
        /home/ec2-user/go/src/github.com/influxdata/kapacitor/node.go:141 +0x5d

If I delete my tasks, kapacitor doesn’t crash right away, so I’m thinking there’s something in the TICKscript that’s ticking off kapacitor.

Here’s the template task:

// API call that this task monitors and triggers on
var targetApi string

// Application this is grouped with
var application string

// Number of errors that triggers an alert
var errorAbsolute int

// Percentage of errors that triggers an alert
var errorPercentage int

// Minimum number of errors before triggering an alert
var minErrorCount int

// Minimum number of requests before triggering an alert
var minRequestCount int

// Any relevant notes about this api; used if there are known issues
var note string

// response time on which to trigger an alert
var responseTime int

// length of sliding window of time
var period = 5m

// frequency at which it is checked
var every = 1m


var data = stream
        |from()
                .measurement('api_detail')
        |where(lambda: "api" == targetApi)
        |window()
                .period(period)
                .every(every)
        |mean('duration')
                .as('executionTime')
        |log().prefix('initial stream: ').level('DEBUG')

var totalRecords = data
        |count('duration')
        |log().prefix('totalRecords: ').level('DEBUG')


var totalErrors = data
        |where(lambda: "error" == 'true')
        |log().prefix('totalErrors (pre-count): ').level('DEBUG')
        |count('duration')
        |log().prefix('totalErrors (post-count): ').level('DEBUG')

totalErrors
        |log().prefix('pre-join: ').level('DEBUG')
        |join(totalRecords)
                .fill(0)
                .as('totalErrors','totals')
        |eval(lambda: float("totalErrors.count") / float("totals.count" + 1)  )
                .as('rate')
                .keep()
        |log().prefix('pre-alert: ').level('DEBUG')
        |alert()
                .id('{{ index .Tags "api"}}')
                .message('Count: {{ index .Fields "totals.count" }}; Errors: {{ index .Fields "totalErrors.count" }} Rate: {{ index .Fields "rate" }}  maxDuration: {{ index .Fields "executionTime" }}')
                //.crit(lambda: TRUE)
                .crit(lambda: "rate" > errorPercentage OR "totals.count" > errorAbsolute OR max('duration') > responseTime)
                // Whenever we get an alert write it to a file.
                .log('/tmp/alerts.log')
                .sensu()
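
For completeness, the tasks are defined from this template with a vars file shaped like the following (the values here are placeholders, not my real ones; `period` and `every` fall back to the defaults in the script):

```json
{
  "targetApi":       {"type": "string", "value": "/v1/orders"},
  "application":     {"type": "string", "value": "orders-service"},
  "errorAbsolute":   {"type": "int",    "value": 50},
  "errorPercentage": {"type": "int",    "value": 10},
  "minErrorCount":   {"type": "int",    "value": 5},
  "minRequestCount": {"type": "int",    "value": 100},
  "note":            {"type": "string", "value": "none"},
  "responseTime":    {"type": "int",    "value": 2000}
}
```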

So generally, can anyone spot anything that might cause a SIGSEGV? Or shed some light on what I might be doing wrong?

I have some specific questions, too:

In the stack trace, does this goroutine number:

goroutine 3078 [running]:

tell me anything useful about which task might be at fault (if it’s a task)?
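
While poking at this, I did confirm with a toy program (the names below are entirely mine, nothing from Kapacitor) that this exact runtime error is what Go reports for any field access through a nil pointer, so the panic message by itself doesn't narrow anything down:

```go
package main

import "fmt"

// A stand-in struct; accessing a field through a nil *node
// triggers the same runtime error seen in the Kapacitor trace.
type node struct {
	value *int
}

func main() {
	defer func() {
		// A nil-pointer dereference is a recoverable runtime error,
		// so we can capture and print it instead of crashing.
		if r := recover(); r != nil {
			fmt.Println("recovered:", r)
		}
	}()
	var n *node // nil, like whatever the InfluxQLNode is touching
	fmt.Println(*n.value)
}
```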

This line seems to lay the blame on an InfluxQLNode:

github.com/influxdata/kapacitor.(*InfluxQLNode).runStreamInfluxQL(0xc4428bcc00, 0xc43bd1edc0, 0xc420020600)

Is there a way to tell which InfluxQLNode? And do the numbers after runStreamInfluxQL point to anything useful in tracking down this bug?

Any help would be appreciated. Still a newbie…