We’re using Influx to store sensor data. It’s a great fit.
We’re now wrestling with the best way to manage our “bad” data. Sometimes sensors are noisy. Sometimes a step in a maintenance procedure is missed and wonky measurements end up in the database. There’s any number of scenarios that can cause known bad data to be stored.
Our analytics and visualizations can be directly impacted by this. In a previous manual, non-Influx-based workflow we’d identify ranges of bad data from logs (e.g. action X raised the temperature in zone Y outside its normal parameters over time span Z) or by simple inspection & investigation (“That doesn’t look right. Hey, Bob…”). That bad data would be culled before processing.
What’s a good strategy and tooling setup to deal with data in a similar way in Influx? Quite honestly we’d like to keep the bad data so we can use it to see patterns and better our procedures.
We think a workable solution is to manually tag measurements with a validity category: good or bad (and, if bad, what kind of badness). Queries can then filter for either appropriately. Of course, the question is how to construct a setup that easily and safely allows us to tag bad data and later edit those tags if necessary. Manually filtering out bad data at the visualization step is not practical in our usage.
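For illustration, a query over data tagged this way might look something like the following (the measurement, field, and validity names are placeholder examples, not a schema we’ve settled on):

-- keep only points marked good, averaged over 5-minute windows
select mean(temperature) from zone_temps where validity = 'good' and time > now() - 6h group by time(5m)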
Is what we’re thinking reasonable, or is there a better way? Any thoughts on best practices and how to implement those practices with specific tools for Influx are much appreciated.
I can speak from experience that garbage data sucks to clean up, so I hear your pain. From our perspective, we did two primary things to address this:
1. drop default retention policy to something much smaller
In our case, we dropped the default retention policy from 7d to 24h to minimize the impact of junk data in a shard. Since dropping a shard is much easier and faster than dropping individual measurements, this was a quick + dirty way to get rid of garbage data. Our window of 24h was large enough that most malformed data would not span multiple shards, though with more data coming in now we’re thinking of dropping to something like 4h or 8h. The point here is that dropping shards is WAY faster than dropping individual measurements; the former simply removes that window from the index, whereas the latter has to remove the measurement from ALL shards. (A sketch of the relevant statements is below, after point 2.)
2. put all measurements behind a relay with active filtering
Our environment is pretty large and globally spread out, so it doesn’t make sense to ‘phone home’ over a WAN-spanning TCP connection. We send data via UDP to a centralized relay (forked from influxdb-relay) that forwards the data to our influxdb endpoints. The key design choice here is opt-in filtering that effectively chooses which measurements to send through to the backend, cleaning up all garbage that might come across the WAN. This puts a bit more stress on the relay itself, but reduces the need to drop shards in step 1.
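For context, the shard housekeeping from point 1 boils down to a couple of InfluxQL statements. This is only a sketch; the database name, durations, and shard id below are illustrative placeholders:

-- shrink the retention policy and its shard group window (durations are examples)
alter retention policy "autogen" on "sensors" duration 24h shard duration 8h default
-- list shards, then drop a specific shard by id once it holds only junk
show shards
drop shard 42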
Happy to discuss further if you’re interested in hearing more, but hope this helps!
Thank you, @sebito91. That’s a much different and more creative approach than we would have otherwise come up with, and it may well help us manage some of the more extreme aspects of our case. We will factor your idea into our strategizing.
By and large we really do want to keep much of the bad data, for a different class of analyses than we run on the good data. Further, we have a variety of sensor types and many individual sensors, and the data is important even down to roughly 10-minute spans. Outright dropping shards is probably not a great fit for much of our problem.
If we do continue down the road of trying to mark certain spurts of data as bad, perhaps the question is: what tools, if any, exist that are especially good at modifying the tags of Influx points selectively and/or in small batches?
Very much like we sketched out in the original question, we ended up adding a field to all measurements called “Validity”, filled with a simple string-based schema of Good and Bad values.

Before values are written to Influx we perform simple sanity checks; any clearly out-of-range values are immediately marked as bad at write time. Maintenance work that could cause values to go wonky causes Validity to be filled with a maintenance-specific value at write time.

Using a Validity field (rather than a tag) allows us to rewrite previous points in Influx. Manual tools we created let us go back in history and retroactively mark data as bad when needed. The downside of using a field to allow rewrites is that it slows down some queries, but so far it’s an entirely manageable tradeoff. We’re able to present “clean” data in dashboards and reports while also specifically monitoring our systems for patterns in Bad data that can be corrected.
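As a rough sketch of the rewrite mechanics (the measurement, tag, and field values below are made up for illustration): a point is identified by its measurement, tag set, and timestamp, so writing it again with the same identity but a different Validity value overwrites that field. In the influx CLI that looks roughly like:

insert temperature,sensor=z1 value=23.4,Validity="Good" 1465839830100400200
insert temperature,sensor=z1 value=23.4,Validity="Bad-Maintenance" 1465839830100400200

Dashboards and reports then simply filter on the field:

select mean(value) from temperature where "Validity" = 'Good' and time > now() - 24h group by time(5m)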
Sharing a solution we are using…
We expect data at a regular interval, so a single query that groups by the expected interval and filters out specific erroneous values gives the result we want.
select first(*) from measurement_name where value1 < bad_value group by time(1m) fill(linear)
In our case, the sensor pegs the value to some high number when there is an error, so I’m looking for all values below that number and doing a linear fill for those points. We expect data at 1-minute intervals, but that grouping can change to fit other situations.
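If you also want to watch how often the sensor errors out, the inverse filter over the same placeholders works too; something like the following (the time range and interval are arbitrary examples) counts the pegged readings per hour over the last day:

select count(value1) from measurement_name where value1 >= bad_value and time > now() - 1d group by time(1h)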