InfluxDB linear regression forecasting


#1

I couldn’t find anything on Google about people doing linear regression with InfluxDB, and it doesn’t appear to be supported natively (yet), so I wrote a query that fits a least-squares line so we could forecast time series and predict when disks will fill. I’ve included the query below in case anybody else finds it useful. Sorry it’s so messy!

SELECT
    ((100 - b) / slope) as days_until_full,
    slope,
    b
FROM
(
    SELECT
       (n * sum_xy - sum_x * sum_y) / (n * sum_xx - pow(sum_x,2)) * 24 as slope, -- convert hours -> days
        y_bar - x_bar * (n * sum_xy - sum_x * sum_y) / (n * sum_xx - pow(sum_x,2)) as b,
        n / 24 as n_days -- convert hours -> days
    FROM (
      SELECT
        count(y) as n,
        sum(x) as sum_x,
        sum(y) as sum_y,
        sum(x_times_y) as sum_xy,
        sum(x_times_x) as sum_xx,
        mean(x) as x_bar,
        mean(y) as y_bar
      FROM
      (
        SELECT 
            CUMULATIVE_SUM(elapsed(mean(metric_value), 1h))  as x,
            pow(CUMULATIVE_SUM(elapsed(mean(metric_value), 1h)), 2) as x_times_x,
            mean(metric_value) as y,
            CUMULATIVE_SUM(elapsed(mean(metric_value), 1h)) * mean(metric_value) as x_times_y
        FROM "servicestatus"
        WHERE 
          (
            "check_command" =~ /check_(snmp_)?disk/
             AND metric_name =~ /Percent Usage (Reported|Calculated) % /
          ) 
          AND $timeFilter
        GROUP BY 
          metric_name,
          host_name,
          time(1h)
      )
      GROUP BY
          metric_name,
          host_name
    )
    GROUP BY
        metric_name,
        host_name
)
WHERE
    slope != 0 or slope = 0 -- bootleg version of filtering out nulls (a null slope fails both comparisons)
GROUP BY
  metric_name,
  host_name
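For anyone who wants to sanity-check the arithmetic, here is a minimal Python sketch of the same closed-form least-squares fit the query performs. The function name, the `usage` data, and the default capacity of 100% are all illustrative, not part of the query:

```python
# Closed-form simple linear regression over hourly disk-usage samples,
# mirroring the sum_x / sum_y / sum_xy / sum_xx arithmetic in the query above.
# `usage` is a list of hourly mean disk-usage percentages (made-up data).

def days_until_full(usage, capacity=100.0):
    xs = list(range(len(usage)))                   # data-point indices (hours)
    n = len(usage)
    sum_x = sum(xs)
    sum_y = sum(usage)
    sum_xy = sum(x * y for x, y in zip(xs, usage))
    sum_xx = sum(x * x for x in xs)
    denom = n * sum_xx - sum_x ** 2
    slope_per_hour = (n * sum_xy - sum_x * sum_y) / denom
    b = sum_y / n - (sum_x / n) * slope_per_hour   # intercept, in percent
    slope_per_day = slope_per_hour * 24            # hours -> days
    return (capacity - b) / slope_per_day

# A disk filling by 0.5 %/hour starting at 10 % full:
usage = [10 + 0.5 * h for h in range(48)]
print(days_until_full(usage))  # -> 7.5
```

On perfectly linear data the fit is exact: 90 remaining percentage points at 12 points per day gives 7.5 days.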

#2

Just as a quick explanation: the dependent variable is percent disk utilization (0 to 100).

I first use a mean() function to smooth the data into hourly points, and cumulative_sum(elapsed(…)) to convert timestamps into data-point indices (0, 1, 2, etc).
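As an illustration of that index-building step (in Python rather than InfluxQL, with made-up timestamps): elapsed() reports the gap between consecutive points, and cumulative_sum turns those gaps into a running index. Any constant offset in where the indices start only shifts the intercept, not the slope:

```python
# Illustrative Python version of cumulative_sum(elapsed(...)):
# gaps between consecutive hourly buckets, then a running sum as the index.
hours = [0, 1, 2, 3, 4]                                   # hourly bucket times
gaps = [b - a for a, b in zip(hours, hours[1:])]          # [1, 1, 1, 1]
indices, total = [], 0
for g in gaps:                                            # running sum of gaps
    total += g
    indices.append(total)
print(indices)  # [1, 2, 3, 4]
```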

Finally, the query looks backwards in time according to the Grafana $timeFilter condition, and uses that window of data to build the model.

I’m not 100% confident that it is bulletproof, but it seems to behave properly.