From ifql created by [aanthony1243](https://github.com/aanthony1243) : influxdat…a/ifql#255
# Interpolate Proposed as a replacement for Fill()
The basic notion of fill() in InfluxQL is to fill in missing values under certain conditions, most often in the case of an empty time window or group. A more general notion is that of interpolate() which should solve the existing use cases, and also provide more advanced features. The end of this text includes some concrete examples of InfluxQL usage for the fill() function.
initial writeup of interpolate design, including a discussion of the Window() operation can be found here: https://docs.google.com/document/d/1C4ELMRJblvy7UsyBF_o3TGVfGrydChKET2LdGOYu53k/edit?usp=sharing
from the above doc:
# Types of interpolation:
Interpolation may be defined on a NxM matrix as filling in the null/missing values on that matrix. In influxdb, a matrix can be materialized as a collection of related series where the row-indices are the series timestamp, and the column indices are series ID:
time/seriesID | m,f1=foo | m,f1=bar | m,f1=asd
-- | -- | -- | --
T1 | (t1, 23) | null | (t3,19)
T2 | (t1,1) | (t2,5) | (t3,2)
T3 | (t1,1) | null | null
We’ll initially define two types of interpolation, one that operates only on rows of the matrix, and one that operates only on the columns of the matrix.
# Column-wise interpolation
A column-wise interpolation will focus on a single series, so that other series in the data matrix are not considered. A few types of interpolation on a column to consider are:
1. Fill with null
2. Fill with constant
3. Fill with previous column value from the series
4. Fill with an aggregate or selector from the entire series (min, max, mean, count, etc.)
5. Fill with a moving average from the series
6. Linear interpolation between two known values on the series
# Row-wise interpolation
Row-wise interpolation isn’t necessarily different from column-wise interpolation. In principle, a row-wise interpolation can be achieved by applying a column-wise interpolation to the transposed data matrix. However, in InfluxDB, there are some limitations because while the rows of the data matrix in this context are numeric and sorted by time, the columns of the matrix are categorical and may appear in any arbitrary order. With this in mind, we consider the following row-wise interpolations valid:
1. Fill with null
2. Fill with constant value
3. Fill with aggregate or selector from other series, but in the same time row. (e.g. from(db) |> selectMeasurement(m) |> range(start:t2, end:t3) |> mean() )
If it were possible to assign a total ordering on the rows of the matrix, then the remaining interpolations may be well-defined:
- Fill with ‘previous’ series value in the same time range row
- Fill with a moving average of series values found in the same on the time range row.
Finally, linear interpolation would be difficult to define generally on a row. If some notion of numeric distance between row values may be defined on the tags, then some form of linear interpolation may be applied. Currently, the language has little support for this other than presuming a fixed unit of distance between two adjacent columns in the matrix.
# Column-Wise Interpolation Operator
The interpolation function on a column can be simplified into two parts:
Identify each missing value in the series.
Determine a replacement for that missing value.
The most convenient scenario for interpolation will be where a timestamped NULL value exists on the series. Later, we will lift this assumption, but given that a series contains timestamped nulls, a possible interpolate function would be:
interpolate(table=<-, nullFn, fillFn)
Where nullFn is a function defined on a series value that returns true/false if a value should be considered a candidate for replacement, and fillFn is a function that generates a value to insert in place of the NULL value.
This approach would be the most general, but possibly inefficient unless the fillFn can be optimized such that it can be computed quickly. In general, interpolation is computed using known values that are near the NULL value. We can cache these values into a moving window as we scan the series, and build in the various functions that may be desirable for interpolation:
interpolate(table=<-, nullFn, stepsPrev, stepsNext, fillType=”prev”)
Where stepsPrev and stepsNext indicate the boundary of the window to each side of the NULL value, and the fill Type may be one of:
- Constant value: set the value to some specified constant
- Prev: set the value to the furthest previous value
- Next: set the value to the farthest next value
- Min/Max/Mean/Count/AggFn: aggregate all values found within the window.
- Linear Interpolation: use point-slope computation to determine a value on the line between the furthest previous and farthest next value.
- Linear regression: fit a line to all points in the window, and estimate a value for the NULL based on this line.
- ?? may be tricky to allow this to be customizable with a function, but worth discussing.
# Appendix: InfluxQL examples, with Flux adaptations:
## External requirements
1. Null values: we need to have some representation of null/missing values for a row so that we can identify where to apply interpolation.
2. Many existing InfluxQL queries use grouping/windowing to segment the data, and then a Fill() operation to insert a default value for any empty segments. To get the most out of this feature, we need to make sure that our Window() function outputs empty groups either by default or by argument:
` |> window(every: 10m, keepEmpty: true) `
3. NOTE: depending on how we implement Window() example 1 may not require interpolate() at all, if we did something like: ` |> window(every:10m, emptyValue: 0`
Example 1:
Many chronograf queries require data for each point on an axis. If the axis is populated with the results of a GROUP BY, empty groups must get a default value. Exmaple:
```
// SELECT mean("queueBytes")
// FROM "telegraf"."default"."influxdb_hh_processor"
// WHERE "cluster_id" = :Cluster_Id:
// AND time > :dashboardTime:
// GROUP BY :interval:, "host" fill(0)
HintedHandoffQueueSize = (INTERVAL, CID, DASHTIME) =>
telegrafDashtime(DASHTIME: DASHTIME)
|> withCID(value: CID)
|> select(measurement: "influxdb_hh_processor", field: "queueBytes")
|> window(every: INTERVAL, keepEmpty: true)
|> interpolate(nullFn: (r) => r._field == NULL, 0, 0, fillType="constant", fillValue=0)
|> group(by: ["host"])
|> mean()
```
Example 2:
it's not well-defined what happens here, if anything. But there are some queries on the cloud monitor that use Fill(0) independently of a GROUP BY:
```
// InfluxQL Disk Usage Query
// SELECT last("used")/1073741824 AS "used" FROM
// "telegraf"."default"."disk"
// WHERE time > :dashboardTime:
// AND cluster_id = :Cluster_Id:
// AND (host =~ /.data./ OR host =~ /tot-.*-(3|4)/)
// FILL(0)
DiskUsage = (DASHTIME) =>
fromRange(forDB:"telegraf", forRange:DASHTIME)
|> select(measurement: "disk", field: "used")
//|> withCID(value: CID)
//|> hostfilter()
|> last()
|> map(fn:(r) => r._value / 1073741824)
```
It's most possible that this query started out as a continuous query, where fill() is defined in some contexts in this manner. In the query above, it's not 100% clear what's meant to be filled. This use-case may not require any changes unless we support continuous-Flux queries.