Possible issue with shard corruption

#1

Hi,

We have been running InfluxDB in production now for about two weeks and have noticed something very worrying: aggregated data (i.e. generated using SELECT … INTO) seems to be disappearing.

This is a fairly long post but I crave your indulgence …

We are running Influx version 1.2.2 on Ubuntu 16.04 LTS inside VMware ESXi 6.0.0.

Raw measurements are received in batches from devices out in the field, roughly once every 5 minutes; each batch contains data points collected at one-minute intervals. (That is, each device uploads its data roughly every 5 minutes, but across our entire set of devices the database is being updated continuously.) This data is loaded into one database “solarmon” which was created with the default retention policy and shard duration.

Aggregated data is calculated on 5- and 30-minute intervals and loaded into databases “solarmon_5min” and “solarmon_30min” respectively; both of these have the default retention policy and a shard duration of 90 days. (I set that duration because an earlier comment on this forum suggested that for long-lived data the shards should be larger.) The aggregated data is calculated using a simple but brutal SELECT … INTO, e.g. for the 5-minute aggregate:

SELECT mean(*) as avg, max(*), min(*) INTO "solarmon_5min".."$measurement" FROM $measurement WHERE __devhash='$devhash' AND time>=${start}s AND time<${end}s GROUP BY time(5m),*

(The code is Perl and appropriate substitutions take place before the SELECT is sent to Influx.)
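For illustration, the substitution step can be sketched like this; the real code is Perl, and the function name, measurement name and device hash below are invented for the example:

```python
# Hypothetical sketch of how the aggregation query is assembled before being
# sent to InfluxDB.  The template matches the query shown above; everything
# else (helper name, example values) is made up for illustration.
def build_aggregate_query(measurement, devhash, start, end, interval="5m",
                          target_db="solarmon_5min"):
    """Build the SELECT ... INTO statement for one device and time window.

    start/end are Unix timestamps in seconds; interval is the GROUP BY window.
    """
    return (
        'SELECT mean(*) as avg, max(*), min(*) '
        'INTO "{db}".."{m}" FROM {m} '
        "WHERE __devhash='{dev}' AND time>={start}s AND time<{end}s "
        'GROUP BY time({interval}),*'
    ).format(db=target_db, m=measurement, dev=devhash,
             start=start, end=end, interval=interval)

q = build_aggregate_query("inverter_power", "abc123", 1494806400, 1494806700)
```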

Prior to going live I ran some scripts to load the various measurements from our Postgres database and calculated the aggregates, and this all seemed to work nicely.

A day or two before we went live, the system disk filled up (for other reasons) but no data was being added to Influx and everything still looked OK. We went live and for the next week or so everything worked as expected. But a week ago the server stalled and had to be rebooted (twice), and ever since then it’s behaved oddly.

In the week since then we noticed that 3 months of aggregated data had gone missing for most devices. The missing aggregates exactly match the 90-day shard duration I set up for those databases.

Then we noticed that for many devices up to two weeks of raw data had been lost as well.

The raw data in most cases could be reloaded from the monitoring devices and in the end we have lost only one day’s data for most devices, which is annoying but not the end of the world.

I wrote a script to recalculate the lost aggregated data, and this is where things get “interesting”. The script looks at the number of raw measurements logged for each device for each day, and compares this to the count of aggregated measurements for each day for that device. If the aggregated count is deficient it recalculates the aggregated data for that day for the device. I ran this script to look at all devices and it seemed to work.
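The per-day check the script performs can be sketched as follows. In the real (Perl) script the counts come from InfluxQL count() queries against the raw and aggregated databases; here they are plain dicts, and the 1-raw-point-per-minute / 5-minute-bucket ratio is assumed:

```python
# Hypothetical sketch of the backfill check described above: for each day,
# compare the raw point count with the aggregated point count and flag days
# whose aggregate is deficient.  One aggregate row is expected per
# `points_per_bucket` raw rows (5 for the 5-minute aggregate, given
# one-minute raw data).
def days_needing_backfill(raw_counts, agg_counts, points_per_bucket=5):
    """Return days whose aggregated count falls short of what the raw
    count implies, i.e. the days the script would recalculate."""
    deficient = []
    for day, raw in sorted(raw_counts.items()):
        expected = raw // points_per_bucket
        if agg_counts.get(day, 0) < expected:
            deficient.append(day)
    return deficient
```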

But then we started getting reports from users that the aggregated data was missing again.

I am fairly confident that the raw measurements which were initially lost all belonged to the same pair of shards for that database, and the aggregated measurements which continue to be lost are all in one shard for their respective databases:

> show shards
...
name: solarmon
id  database retention_policy shard_group start_time           end_time             expiry_time          owners
--  -------- ---------------- ----------- ----------           --------             -----------          ------
613 solarmon autogen          613         2011-09-12T00:00:00Z 2011-09-19T00:00:00Z 2011-09-19T00:00:00Z 
349 solarmon autogen          349         2013-04-29T00:00:00Z 2013-05-06T00:00:00Z 2013-05-06T00:00:00Z 
...
596 solarmon autogen          596         2017-05-01T00:00:00Z 2017-05-08T00:00:00Z 2017-05-08T00:00:00Z 
597 solarmon autogen          597         2017-05-08T00:00:00Z 2017-05-15T00:00:00Z 2017-05-15T00:00:00Z 
601 solarmon autogen          601         2017-05-15T00:00:00Z 2017-05-22T00:00:00Z 2017-05-22T00:00:00Z 
619 solarmon autogen          619         2017-05-22T00:00:00Z 2017-05-29T00:00:00Z 2017-05-29T00:00:00Z 
605 solarmon autogen          605         2017-05-29T00:00:00Z 2017-06-05T00:00:00Z 2017-06-05T00:00:00Z 
604 solarmon autogen          604         2017-09-18T00:00:00Z 2017-09-25T00:00:00Z 2017-09-25T00:00:00Z 
603 solarmon autogen          603         2017-10-09T00:00:00Z 2017-10-16T00:00:00Z 2017-10-16T00:00:00Z 


name: solarmon_5min
id  database      retention_policy shard_group start_time           end_time             expiry_time          owners
--  --------      ---------------- ----------- ----------           --------             -----------          ------
615 solarmon_5min autogen          615         2011-06-23T00:00:00Z 2011-09-21T00:00:00Z 2011-09-21T00:00:00Z 
360 solarmon_5min autogen          360         2013-03-14T00:00:00Z 2013-06-12T00:00:00Z 2013-06-12T00:00:00Z 
361 solarmon_5min autogen          361         2013-06-12T00:00:00Z 2013-09-10T00:00:00Z 2013-09-10T00:00:00Z 
379 solarmon_5min autogen          379         2013-09-10T00:00:00Z 2013-12-09T00:00:00Z 2013-12-09T00:00:00Z 
389 solarmon_5min autogen          389         2013-12-09T00:00:00Z 2014-03-09T00:00:00Z 2014-03-09T00:00:00Z 
404 solarmon_5min autogen          404         2014-03-09T00:00:00Z 2014-06-07T00:00:00Z 2014-06-07T00:00:00Z 
419 solarmon_5min autogen          419         2014-06-07T00:00:00Z 2014-09-05T00:00:00Z 2014-09-05T00:00:00Z 
433 solarmon_5min autogen          433         2014-09-05T00:00:00Z 2014-12-04T00:00:00Z 2014-12-04T00:00:00Z 
448 solarmon_5min autogen          448         2014-12-04T00:00:00Z 2015-03-04T00:00:00Z 2015-03-04T00:00:00Z 
464 solarmon_5min autogen          464         2015-03-04T00:00:00Z 2015-06-02T00:00:00Z 2015-06-02T00:00:00Z 
479 solarmon_5min autogen          479         2015-06-02T00:00:00Z 2015-08-31T00:00:00Z 2015-08-31T00:00:00Z 
494 solarmon_5min autogen          494         2015-08-31T00:00:00Z 2015-11-29T00:00:00Z 2015-11-29T00:00:00Z 
508 solarmon_5min autogen          508         2015-11-29T00:00:00Z 2016-02-27T00:00:00Z 2016-02-27T00:00:00Z 
524 solarmon_5min autogen          524         2016-02-27T00:00:00Z 2016-05-27T00:00:00Z 2016-05-27T00:00:00Z 
539 solarmon_5min autogen          539         2016-05-27T00:00:00Z 2016-08-25T00:00:00Z 2016-08-25T00:00:00Z 
555 solarmon_5min autogen          555         2016-08-25T00:00:00Z 2016-11-23T00:00:00Z 2016-11-23T00:00:00Z 
570 solarmon_5min autogen          570         2016-11-23T00:00:00Z 2017-02-21T00:00:00Z 2017-02-21T00:00:00Z 
586 solarmon_5min autogen          586         2017-02-21T00:00:00Z 2017-05-22T00:00:00Z 2017-05-22T00:00:00Z 
609 solarmon_5min autogen          609         2017-05-22T00:00:00Z 2017-08-20T00:00:00Z 2017-08-20T00:00:00Z 
607 solarmon_5min autogen          607         2017-08-20T00:00:00Z 2017-11-18T00:00:00Z 2017-11-18T00:00:00Z 


name: solarmon_30min
id  database       retention_policy shard_group start_time           end_time             expiry_time          owners
--  --------       ---------------- ----------- ----------           --------             -----------          ------
614 solarmon_30min autogen          614         2011-06-23T00:00:00Z 2011-09-21T00:00:00Z 2011-09-21T00:00:00Z 
358 solarmon_30min autogen          358         2013-03-14T00:00:00Z 2013-06-12T00:00:00Z 2013-06-12T00:00:00Z 
359 solarmon_30min autogen          359         2013-06-12T00:00:00Z 2013-09-10T00:00:00Z 2013-09-10T00:00:00Z 
378 solarmon_30min autogen          378         2013-09-10T00:00:00Z 2013-12-09T00:00:00Z 2013-12-09T00:00:00Z 
388 solarmon_30min autogen          388         2013-12-09T00:00:00Z 2014-03-09T00:00:00Z 2014-03-09T00:00:00Z 
403 solarmon_30min autogen          403         2014-03-09T00:00:00Z 2014-06-07T00:00:00Z 2014-06-07T00:00:00Z 
418 solarmon_30min autogen          418         2014-06-07T00:00:00Z 2014-09-05T00:00:00Z 2014-09-05T00:00:00Z 
432 solarmon_30min autogen          432         2014-09-05T00:00:00Z 2014-12-04T00:00:00Z 2014-12-04T00:00:00Z 
447 solarmon_30min autogen          447         2014-12-04T00:00:00Z 2015-03-04T00:00:00Z 2015-03-04T00:00:00Z 
463 solarmon_30min autogen          463         2015-03-04T00:00:00Z 2015-06-02T00:00:00Z 2015-06-02T00:00:00Z 
478 solarmon_30min autogen          478         2015-06-02T00:00:00Z 2015-08-31T00:00:00Z 2015-08-31T00:00:00Z 
493 solarmon_30min autogen          493         2015-08-31T00:00:00Z 2015-11-29T00:00:00Z 2015-11-29T00:00:00Z 
507 solarmon_30min autogen          507         2015-11-29T00:00:00Z 2016-02-27T00:00:00Z 2016-02-27T00:00:00Z 
523 solarmon_30min autogen          523         2016-02-27T00:00:00Z 2016-05-27T00:00:00Z 2016-05-27T00:00:00Z 
538 solarmon_30min autogen          538         2016-05-27T00:00:00Z 2016-08-25T00:00:00Z 2016-08-25T00:00:00Z 
554 solarmon_30min autogen          554         2016-08-25T00:00:00Z 2016-11-23T00:00:00Z 2016-11-23T00:00:00Z 
569 solarmon_30min autogen          569         2016-11-23T00:00:00Z 2017-02-21T00:00:00Z 2017-02-21T00:00:00Z 
585 solarmon_30min autogen          585         2017-02-21T00:00:00Z 2017-05-22T00:00:00Z 2017-05-22T00:00:00Z 
608 solarmon_30min autogen          608         2017-05-22T00:00:00Z 2017-08-20T00:00:00Z 2017-08-20T00:00:00Z 
606 solarmon_30min autogen          606         2017-08-20T00:00:00Z 2017-11-18T00:00:00Z 2017-11-18T00:00:00Z 

The shards in question are those covering the time period 2017-05-15T00:00:00Z - 2017-05-29T00:00:00Z for the “solarmon” (raw) database, and the time period 2017-05-22T00:00:00Z - 2017-08-20T00:00:00Z for the “solarmon_5min” and “solarmon_30min” (aggregated) databases.

All of which leads to the following questions and observations …

  1. The “disk full” condition occurred on 15th May, which is at the start of one of the affected “solarmon” shards, and the “system rebooted” event happened on 26th & 27th May, which is in the middle of the other affected “solarmon” shard. So I am wondering if these two conditions caused those shards to become corrupted in such a way that data continued to be written to them but was then lost when the system was rebooted.
  2. Why does recalculating the historical aggregated data apparently cause the loss of other aggregated data which would be stored in the same shard? (You will notice that the shard ID for the date range 2017-05-22T00:00:00Z - 2017-08-20T00:00:00Z is out of sequence compared to the shard which follows it.) Should I instead simply delete this shard and recreate it from scratch by running my script over the required time range? (And if so, presumably I should shut down our application so that no other database updates are occurring at that time.) Would I get better results by making the shard duration shorter?
  3. Is there a tool to verify the database consistency i.e. its internal file organisation?
  4. You will notice that in the “solarmon” database the first shard is dated 2011-09-12 and its shard ID is out of sequence relative to the next one, dated 2013-04-29. This results from a bug in the monitoring devices whereby, upon rebooting, their system clocks sometimes revert to 12-Sep-2011 and they log data with timestamps starting from that moment. This apparently causes the shard to be recreated … would I be correct in guessing that this is pathologically bad for InfluxDB? (Measurements with those timestamps are of no value so we could probably ignore them anyway.)
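To make the shard bucketing behind points 1, 2 and 4 concrete, here is a rough sketch: each point's timestamp is floored to a fixed-width window, and a shard group exists for any window that receives data. (InfluxDB's actual alignment rules differ in detail, so the epoch-aligned flooring here is only an approximation of the idea.)

```python
# Rough illustration of shard-group bucketing: a rogue 2011 timestamp lands
# in a window years away from the live one, forcing an extra shard group to
# exist.  Epoch-aligned flooring is an approximation of InfluxDB's real rules.
from datetime import datetime, timedelta, timezone

def shard_window(ts, duration_days):
    """Return the (start, end) of the epoch-aligned window containing ts."""
    width = duration_days * 86400
    start = int(ts.timestamp()) // width * width
    return (datetime.fromtimestamp(start, timezone.utc),
            datetime.fromtimestamp(start + width, timezone.utc))

# A device that reboots with its clock reset to 2011-09-12 writes into a
# window far from the current one (90-day aggregate shards assumed here).
rogue = shard_window(datetime(2011, 9, 12, tzinfo=timezone.utc), 90)
live = shard_window(datetime(2017, 5, 25, tzinfo=timezone.utc), 90)
```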

Thanks,
Jeremy Begg

#2

Are your field types changing between shards?

You may be seeing something similar to: https://github.com/influxdata/influxdb/issues/8422

#3

Hi Jason,

I don’t think the field types will be changing - the same code is used to create the aggregated data initially and to back-fill it later, and all the values are floats.

I noticed in another thread you mentioned the influx_inspect utility; I will have a play with that to see if it tells me anything useful. In general, what are the tools for performing consistency checking on the internal structure of the database (TSM files etc.)?

Given the nature of our application, I wonder if a 90-day shard duration might be too ambitious. Most of our measurement data will arrive within a few minutes of being collected, but it’s not uncommon for it to be delayed by up to two weeks, and in theory we could be asked to add measurement data that is much older. Any such data will have to be added to the “raw” database as well as being manually aggregated (using a SELECT … INTO) into the “aggregated” databases.

Is there a write-up which explains the inner workings of the database, e.g. how it handles the situation where a data point must be added with a timestamp that is at the beginning or middle of the range of timestamps held in a shard? (And maybe this varies depending on whether or not the shard being updated with “old” data is the one to which “current” data is also being added?)

Thanks
Jeremy Begg