I am a novice learning to use the TICK stack for our organization. I have configured the TICK stack for my running processes and servers. However, I would like to automate the following:
1. File system utilization above 80%:
   - Look into the root cause of the disk space usage (it may be caused by log files)
   - Try to clean up the file system
   - Send a notification to Slack
2. httpd process failure:
   - Check the root cause by looking into the log file
   - Try to restart the httpd service and monitor it for some time
   - Send a notification to Slack about the nature of the problem and the recovery steps taken
Is this possible with Kapacitor? Early help would be really appreciated.
Are you asking if Kapacitor can perform the root cause analysis (RCA) or just notify you that there is a problem?
If you want Kapacitor to perform the RCA, how would you, for example, script finding the cause of a disk filling up?
Thanks for getting back, Nathaniel. Yes, I was asking whether Kapacitor can perform RCA, or detect an anomaly from the common pattern, and act upon it.
For example, if the log file directory is filling up with files, can it delete the old files and clean up the file system?
The short answer is yes, Kapacitor can do those things.
The longer answer is Kapacitor does that via scripts that you still have to write.
For example, in the disk usage case there are two parts to the problem. First, detect that disk usage is too high. Second, clean up the disk. Kapacitor can do the first part natively; for the second part you need to write a script, and Kapacitor will trigger it when needed.
Here is an example TICKscript that automatically clears up disk space when usage goes over 80%:

batch
    |query('SELECT last(disk_usage) as disk_usage FROM telegraf.autogen.disk')
        .period(1m).every(1m).groupBy(*)
    |alert()
        .crit(lambda: "disk_usage" > 80)
        // Call the disk cleaning script; the information about which disk etc. is getting full is passed over STDIN as JSON to the process. The script path below is a placeholder.
        .exec('/path/to/clean_disk.sh')
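The cleanup side is the part you have to write yourself. As a rough sketch only, a handler script could read the alert JSON from STDIN, check the alert level, and delete old log files. Everything named here (`json_string_field`, `clean_old_logs`, `LOG_DIR`, `MAX_AGE_DAYS`, `HANDLER_LOG`, the `handle` argument) is hypothetical, and the sketch assumes the alert JSON carries a top-level `"level"` string and a `"path"` tag from the Telegraf disk plugin. The `sed` extraction is deliberately crude; a real JSON parser such as `jq` would be more robust.

```shell
#!/bin/bash
# Hypothetical settings, not from the thread; adjust for your system.
LOG_DIR="${LOG_DIR:-/var/log/myapp}"
MAX_AGE_DAYS="${MAX_AGE_DAYS:-7}"
HANDLER_LOG="${HANDLER_LOG:-/tmp/clean_disk.log}"

# Print the string value of the given JSON key read from STDIN.
# Crude: assumes compact JSON and a simple "key":"value" pair.
json_string_field() {
    sed -n "s/.*\"$1\":[[:space:]]*\"\([^\"]*\)\".*/\1/p" | head -n 1
}

# Delete *.log files under directory $1 that are older than $2 days.
clean_old_logs() {
    find "$1" -type f -name '*.log' -mtime +"$2" -exec rm -f {} +
}

# Read the alert payload from STDIN and clean up on CRITICAL alerts.
main() {
    local payload level mount_path
    payload=$(cat)
    level=$(printf '%s\n' "$payload" | json_string_field level)
    mount_path=$(printf '%s\n' "$payload" | json_string_field path)
    if [ "$level" = "CRITICAL" ]; then
        clean_old_logs "$LOG_DIR" "$MAX_AGE_DAYS"
    fi
    # Leave a trace so it is easy to confirm the handler actually ran.
    printf '%s level=%s path=%s\n' "$(date)" "$level" "$mount_path" >> "$HANDLER_LOG"
}

# Only act when called with the (hypothetical) "handle" argument.
if [ "${1:-}" = "handle" ]; then
    main
fi
```

With this layout the handler would be registered with an extra argument, for example `.exec('/path/to/clean_disk.sh', 'handle')`. Writing a line to `HANDLER_LOG` on every call gives a simple way to confirm that Kapacitor is invoking the script at all.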
Thanks @nathaniel for the info.
I tried the code above with some modifications. Please see the code below:
batch
    |query('SELECT last(used_percent) as disk_usage FROM telegraf.autogen.disk')
        .period(1m)
        .every(1m)
        .groupBy(*)
    |alert()
        .crit(lambda: "disk_usage" > 80)
        // Call the disk cleaning script, the information about which disk etc is getting full is passed over STDIN as JSON to the process.
        .log('/tmp/usage.log')
        .exec('/usr/local/bin/testStdin.sh')
When I run the above code, the log is written, but it seems that testStdin.sh is not being called. Here is the script:
#!/bin/bash
read disk_usage
postToSlack -t "this is a test message" -b "$var" -c "devops" -u "https://hooks.slack.com/services/T592WECRX/B59ND03UM/yT2k4kTb2Quj3cbaSD1fSOx"
I have set the permissions:
-rwxr-xr-x 1 kapacitor kapacitor 164 May 12 05:46 /usr/local/bin/testStdin.sh