I am working with meteorological data and need to produce local statistics, for example the low and high temperature within the local day. I can use Flux to do this for a single local day, but I have not found a Flux solution for processing months of data in a single operation.
It is easy to process months of data by aggregating in UTC, but how can I offset this default aggregation to my local day?
I have read countless postings and the docs, and I cannot find a clear solution to this seemingly simple problem.
Everything is in UTC time. That is normal everywhere, whether it is databases, languages, or even system times. When you see local time, that is just an adjustment for the UI or API. MySQL, Java, PHP, Linux system calls, etc. all use UTC unless you specify a timezone. Speaking from experience, you really do want to work in UTC until you need to present times to users; communication between systems is much, much easier. The time at which something occurred is always the same; only your perspective of what day/time of day it was in your part of the Earth changes. Many web projects now base local time on the browser's reported timezone, so the information in the database is always UTC, and a JS library like Moment is usually used to adjust the values in the browser. So your database queries should, I am guessing, be geo-constraining and not worry about the timezone so much. Leave that to the UI layer.
I totally agree with you. I have no problem working with UTC and have for years. Many meteorological products are provided in UTC times.
My challenge is not to escape using UTC, but rather knowing how to run daily queries (local time) with flux to gather specific statistics. In short, UTC times need to be offset.
Consider this from NOAA for one metric; daily maximum temperature: “Highest temperature in °F during the 24hrs from midnight to midnight standard time.” This definition is quite typical and the basis for several climatological measures.
So, working solely in UTC is fine, but I can't simply aggregate daily data on UTC days (midnight to midnight UTC) given the definition above. That I can do, but the metrics do not conform to the NOAA definition.
As I mentioned, I can achieve my goal for a single day, but when I try to process months of data (acquired every minute) into daily statistics, I cannot figure out how to get Flux to shift the UTC day boundaries so the daily aggregates line up with the local day. This is where I could really use some help.
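To make the goal concrete, here is a minimal Python sketch of the statistic I am after (the sample data, station time zone, and function name are all made up for illustration; this is the logic I want Flux to perform over months of data):

```python
# Sketch only: NOAA-style daily max/min over a local ("standard time") day
# computed from minute-resolution UTC samples. All names are hypothetical.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

LOCAL = ZoneInfo("America/New_York")  # assumed station time zone

def daily_extremes(samples, local_day):
    """samples: list of (utc_datetime, temp_f); local_day: a datetime.date."""
    temps = [t for ts, t in samples
             if ts.astimezone(LOCAL).date() == local_day]
    return (min(temps), max(temps)) if temps else None

# Minute samples straddling local midnight (05:00 UTC in winter, EST = UTC-5).
samples = [
    (datetime(2021, 1, 15, 4, 59, tzinfo=timezone.utc), 30.0),  # Jan 14 local
    (datetime(2021, 1, 15, 5, 0, tzinfo=timezone.utc), 28.0),   # Jan 15 local
    (datetime(2021, 1, 15, 18, 0, tzinfo=timezone.utc), 41.0),  # Jan 15 local
]
lo, hi = daily_extremes(samples, datetime(2021, 1, 15).date())
# lo == 28.0, hi == 41.0; the 04:59Z point falls on the previous local day
```

Note how the 04:59 UTC sample belongs to the previous local day; a plain UTC midnight-to-midnight window would count it against January 15 and give the wrong extremes.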
OK, I think I know what you mean. I see the documentation has it planned, but it isn't implemented in Flux yet; they recommend you use InfluxQL at the moment. Honestly, I would do this programmatically myself until it is ready. I would make it a task in Spring Boot or a job in Kubernetes, run it once a day per target, and save the calculation each night. You know the offset to use for standard time, and the query will be the same for every target. A little metadata database for all of the tasks that need to run nightly would be required. Then roll the daily calculations up into monthly queries whenever needed.
Thanks for replying back. Given that the flux language can almost do the entire job very efficiently, I was hoping to avoid writing code to pull out data to calculate the statistics and then insert it back into InfluxDB. This is not a difficult chore.
The lack of this very simple time functionality in Flux discourages the use of InfluxDB for a much broader class of IoT projects that must work on, or provide metrics in, local/standard time. Even I am beginning to wonder if I should start looking at other solutions if I have to write code for such simple tasks. This is not about being lazy; it is a practical consideration of the overall ROI given the current state of platform maturity. It is a truly disappointing thought given that the InfluxDB platform has so incredibly much to offer (including the power of the Flux language).
I see that this issue has been around since 2018 and there are numerous postings for people having similar time-related challenges. Can someone please provide some insight as to where a fix for this problem is in terms of priority or even a timeline?
Can't you just run the task with the offset computed in the Flux script before doing the query? It makes it less readable, but it isn't a complete show-stopper, right? You would need to tailor the query per timezone anyway, right? A list of variables to run through, possibly. I am still getting up to speed on Flux myself.
If there was a way to iterate what worked for a single day, I would have used that. Iterating over a block of code does not appear to be possible.
I looked at the various Flux functions that could be used to offset time and ran quite a number of tests with each. The only way I found to reliably adjust UTC to local time was the timeShift() function. I then leveraged aggregateWindow() with a 24-hour duration. This approach appears to work.
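For a fixed standard-time offset, the timeShift-then-aggregateWindow approach boils down to shifting each timestamp before bucketing it by day. A small Python sketch of that logic (hypothetical data; a fixed UTC-5 offset is assumed, matching timeShift(duration: -5h)):

```python
# Sketch of the timeShift + aggregateWindow(every: 24h) logic for a fixed
# offset (here UTC-5, i.e. EST); data and names are hypothetical.
from collections import defaultdict
from datetime import datetime, timedelta, timezone

OFFSET = timedelta(hours=-5)  # fixed standard-time offset

def daily_max(samples):
    """Group (utc_datetime, value) pairs into shifted 24h windows, take max."""
    buckets = defaultdict(list)
    for ts, v in samples:
        # After shifting, the UTC calendar date IS the local day.
        buckets[(ts + OFFSET).date()].append(v)
    return {day: max(vals) for day, vals in buckets.items()}

samples = [
    (datetime(2021, 1, 15, 4, 0, tzinfo=timezone.utc), 33.0),  # Jan 14 local
    (datetime(2021, 1, 15, 6, 0, tzinfo=timezone.utc), 29.0),  # Jan 15 local
    (datetime(2021, 1, 16, 1, 0, tzinfo=timezone.utc), 35.0),  # Jan 15 local
]
result = daily_max(samples)
# {date(2021, 1, 14): 33.0, date(2021, 1, 15): 35.0}
```

In the Flux version you would also timeShift back by the opposite duration afterwards if you want the aggregate timestamps reported in UTC again.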
The downside is that the timeShift duration is fixed; it is not tied automatically to the zone's actual UTC offset, so rule changes such as DST are not tracked. A simple built-in function to adjust for time zone is missing, and it would make this time adjustment incredibly straightforward. Deploying sensors to multiple time zones is a bit more cumbersome.
@dcaho, I fully agree. Just recently I was trying to get some time-based daily/monthly stats in local time and:
InfluxQL has no monthly / yearly windows
Flux has no local time
So neither of them seems complete enough for time-based analysis (unless I am missing something), which looks very surprising to me given that we are talking about a time series database.
I ended up in the same place, using the timeShift function before windowing, but this is not really accurate, especially since historical query results will change with every DST change.
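The DST problem is easy to demonstrate outside of Flux, for example by asking the tz database for the offset on a winter and a summer date in Python (the zone is just an example):

```python
# Illustrates why a fixed timeShift duration is inaccurate: the UTC offset
# of a DST-observing zone changes during the year (zone is just an example).
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

tz = ZoneInfo("America/New_York")
winter = datetime(2021, 1, 15, 12, tzinfo=timezone.utc).astimezone(tz)
summer = datetime(2021, 7, 15, 12, tzinfo=timezone.utc).astimezone(tz)
print(winter.utcoffset() == timedelta(hours=-5))  # True (EST)
print(summer.utcoffset() == timedelta(hours=-4))  # True (EDT)
# A single timeShift(duration: -5h) is therefore wrong for half the year.
```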
@MichalR has observed the same issues I am finding.
After considerable digging into the details since my last posting, I find that Flux does not accurately do the job. The ONLY way I have been able to obtain accurate, consistent, and predictable results is to leverage the API from my own code (back to @kramik1’s original suggestion).
The handling of time within Flux has a lot of issues. What I have seen:
Windows (including aggregateWindow) cover their full duration regardless of what you might exclude. For example, I wanted to skip the first 6 days of a month within a window duration of 1 month. This is not possible; the window duration is absolute and overrides other filters.
The times in InfluxDB do not match the epoch times submitted with the data point; there is always a difference of at least a few seconds. For example, if you submit the epoch for 2021-01-01T23:59:59 UTC, what is logged in InfluxDB is 2021-01-02T00:00:00 UTC. Unless I back off several seconds, the logged time always rolls over to the next day. This should be precise.
@dcaho, thanks for this eye-opening info on the issues with time windows in Flux. Everything works fine up to 1h, and beyond that it's actually impossible to use aggregateWindow. I have only worked with InfluxDB 2.0 for two months and love the approach to time series with the Flux language, but this is very annoying.
However, I have today decided to address the issue by adding a "local_day" field with text values, e.g. "2021-05-07", for time series that can meaningfully be aggregated above the hour level, and then using ...|> group(columns: ["local_day"]) |> sum()/count()/mean()/... instead of aggregateWindow. The time series are sampled in seconds but will be pre-aggregated to the hour before this attempt.
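For reference, the local_day value each point gets can be derived from its UTC timestamp at write time. A Python sketch (the zone name is just an example; use your own):

```python
# Sketch: derive the "local_day" string to attach to each point at write
# time, so Flux can later group(columns: ["local_day"]) in local time.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

LOCAL = ZoneInfo("Europe/Copenhagen")  # example zone

def local_day(utc_ts):
    # Convert to local wall-clock time, then keep only the calendar date.
    return utc_ts.astimezone(LOCAL).strftime("%Y-%m-%d")

print(local_day(datetime(2021, 5, 6, 23, 30, tzinfo=timezone.utc)))  # 2021-05-07
```

Because the conversion happens once at ingest, DST is handled correctly and historical groupings never change, unlike the timeShift workaround.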
Besides, this only solves it for my own local timezone. Is there another approach that is more flexible? I have not found a function in Flux that can convert UTC to another timezone (why not?); that would be the simplest workaround.
@pgaarde, I have found that I can do everything I want via my own code using the API. I have absolute confidence (after a lot of testing) in my own code.
Part of the problem may be how you view the data. I fell into the trap of using the InfluxDB UI and Data Explorer. I have learned that the 'raw data' view is not truly the raw data. The displayed times are definitely wrong because, by default, the UI is already windowing/aggregating the data it displays; notice the Window Period box in the lower right of the Data Explorer. There is no switch to disable this default. But you can still get to your raw data by opening the Script Editor in the Data Explorer and deleting the |> aggregateWindow line from the script. Then, and only then, can you see the raw data in the table and charts.
@dcaho - yes I have had my part of confusion with the Data Explorer and use the Script Editor like you describe - using the Explorer to build the first draft of the query.
Can you elaborate on how to do correct localized aggregations via the API? Are you fetching raw data and doing aggregations in your own code?
I have implemented my own aggregation via the API. This provides greater flexibility working in local time and I trust the results. I have a raw dataset that is not perfect, so I am also cleaning up the time series at the same time that I am down-sampling and extracting statistics in local time slices.
Sounds doable, and doing data preprocessing at the same time gives it a purpose. I, however, have given up on aggregations above 1h. My luck is that I only have a few parameters that can meaningfully be aggregated above that, and those will be handled by a group of functions in PostgreSQL, which is the main storage anyway.
I came across this thread while trying to aggregate energy consumption, which obviously has to be aggregated in local time. It seems impossible to query one full billing month of data and aggregate it by day. No matter the time shift you apply to start and stop, Flux still considers a day to run from 00:00 UTC to 23:59 UTC. So, without external code (Python?) to do the aggregation, it is impossible to get it right in Flux alone.
That seems to be the only solution for aggregations above 1h. Off the top of my head: if InfluxDB were the primary source of data (meaning the raw data are stored directly in InfluxDB), I would make a task in InfluxDB that produces aggregated values on an hourly basis, then create a client task that computes the higher levels of aggregation and stores them back into InfluxDB with correct UTC start times. That way the InfluxDB UI, Grafana, etc. would be able to acquire the correct aggregations for any locale (and very fast).
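The client-side rollup step could be sketched like this in Python (zone, data, and function names are hypothetical; the hourly means are assumed to come from the InfluxDB hourly task):

```python
# Sketch: roll hourly aggregates up to local-day aggregates in a client
# task; the result would then be written back with UTC start times.
from collections import defaultdict
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

LOCAL = ZoneInfo("Europe/Copenhagen")  # example zone

def rollup_daily_mean(hourly):
    """hourly: list of (utc_hour_start, mean_value) from the hourly task."""
    days = defaultdict(list)
    for ts, v in hourly:
        days[ts.astimezone(LOCAL).date()].append(v)
    return {d: sum(vs) / len(vs) for d, vs in days.items()}

# Two hourly means straddling UTC midnight, both on May 7 local (CEST, UTC+2).
hourly = [(datetime(2021, 5, 6, 22 + h, tzinfo=timezone.utc), float(h))
          for h in range(2)]
print(rollup_daily_mean(hourly))  # {datetime.date(2021, 5, 7): 0.5}
```

Since hourly buckets never straddle a whole-hour-offset timezone boundary, the hourly pre-aggregation loses nothing for those zones, and the local-day grouping stays exact.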
It is ugly that even a 28h aggregation is bound to UTC midnight. For instance:
If I have a range from 03:00 UTC to 02:59 UTC the next day and aggregate over 24h (or 1d), I get two records: one from 03:00 to 00:00 and another from 00:00 to 02:59 UTC. OK, I can accept that 1d is bounded by midnight, but 24h should be bound to the start time. Now try aggregating over 28h or 29h with the same (24h-long) range. What do you expect? I expect that, if I am aggregating in 28h spans, I get one record every 28h, right? No: for some days I get one record, for others I get two (03:00 to 00:00 UTC and 00:00 to 02:59 UTC), with no explanation.
Now imagine: if windows just above hourly make this mess, how will 1 month perform?
@jrbenito, I think that might be related to what dcaho wrote earlier in this thread: windows always cover their full (UTC) duration and thus do not fully respect the range filter.
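As far as I can tell, window boundaries are aligned to the Unix epoch rather than to the range start. For 24h that means UTC midnight; for 28h it means boundaries that drift through the day, so a 24h range sometimes contains one boundary and sometimes none. A Python sketch of that alignment rule (my reading of the observed behavior, not InfluxDB code):

```python
# Sketch of epoch-aligned windowing: a bucket's start is the timestamp
# rounded down to a multiple of `every` since the Unix epoch. This is why
# a 03:00Z..02:59Z range covered by 24h windows yields two partial buckets.
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def window_start(ts, every):
    n = (ts - EPOCH) // every  # whole windows elapsed since the epoch
    return EPOCH + n * every

every = timedelta(hours=24)
a = window_start(datetime(2021, 3, 1, 3, 0, tzinfo=timezone.utc), every)
b = window_start(datetime(2021, 3, 2, 2, 59, tzinfo=timezone.utc), every)
print(a)  # 2021-03-01 00:00:00+00:00
print(b)  # 2021-03-02 00:00:00+00:00  (a second window, hence two records)
```

Under this rule a 28h `every` produces boundaries at 00:00, 04:00, 08:00 UTC and so on, drifting 4 hours per window, which would explain the one-record/two-record inconsistency jrbenito saw.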
Kind of ironic that a time series database cannot really work with time intervals over one hour, but that said: how should a time series database track local days when all times are stored in UTC?
I like @pgaarde's solution of adding a local_day field as the cleanest way to keep the information within InfluxDB, where it belongs, for data that must be handled logically beyond one hour, but it is a shame that there is no better way (within Flux).
I have just bumped into this issue. I deployed InfluxDB to handle the surging amount of time-series data my company gathers. I connected a feed from InfluxDB to a dashboard and it all looked great.
Only when a colleague informed me that the time windows appeared to be a bit off did I realize InfluxDB had been aggregating everything at UTC+00. Now I have to set the aggregation window interval to "1h" and apply the offset before feeding the query results to the dashboard. This totally defeats the purpose of having any interval option above "1h" lol…