Schema for transient data sets

I’m just getting started with InfluxDB and am not sure if my data is a good fit for it. I’m not sure I understand the system well enough to figure that out.

I am logging phase data comparing frequency standards (a device under test vs. a reference) over time, generated from a single measurement setup. Each data run for a DUT may last hours to weeks with a datapoint typically once per second. I need to identify each run so I can link it to the device that’s being tested and export the associated data for further processing. In my manual system, I included comment lines at the top of the file identifying the DUT, the reference, the start date/time, the measurement rate, and any other special info. Then each measurement writes a line to the log with (typically) timestamp and phase value.

What’s the best way with InfluxDB to handle data like that? Is there a way to associate metadata to a time series without including it all along with each measurement (which would take much more space than the data itself)?

Thanks!

Hi @n8ur ,
Great to hear you are checking out InfluxDB! I am not an expert in your domain so my apologies for the butchery of your field. Let’s see if you can sort your data into tags and fields:
Tags:

  • DeviceID - This will be to identify your device the samples are coming from
  • reference - Your associated reference to the DUT being performed
  • start-time - time your test started
  • measurement rate - note if you store this as a tag then it will be considered a string value. depending on if you want to use the value as part of calculations you might want to consider this to be a field instead. Up to you.

field:

  • phase value
  • (optionally) measurement rate

Note when using tags that they are extremely useful for filtering your data and providing metadata. They are also low impact on overall performance. This is a nice blog to consider performance Optimizing InfluxDB Performance for High Velocity Data - InfluxData

Though I do not consider your data high velocity. If you like you could also consider a more hybrid approach. Storing the metadata within a relational DB and your time series data in InfluxDB. Flux supplies the ability to query a relational database and pull that data into InfluxDB. Check out the docs for this great example with an IoT sensor: Query SQL data sources with InfluxDB | InfluxDB OSS 2.0 Documentation

If you have a sample of your data we could try and convert this together into a format for loading into InfluxDB

Thank you, Jay, for the quick and friendly reply! To give you an idea of what I’m doing, below is an example of the most complex metadata I’m likely to have (this is generated by one of the tools I use; others have much less header info).

jra@flob:~$ cat influx_example.tim
;
;Measurement log file C:\Documents and Settings\John Ackermann\My Documents\hp5065a-bva.tim
;Written by TimeLab 1.010 of Mar 12 2012 by John Miles, KE5FX (jmiles@pop.net)
;

STR 0x00000000 "Driver" Symmetricom 5115A / 5120A / 5125A (Frequency stability) ...
STR 0x79000000 "Channel" 1/1
STR 0xC8000000 "Time/Date" 4/9/2012 6:57:38 PM
DBL 0xCD400000 "MJD" 56026.95680392361
STR 0x05000001 "Trace" HP5065A vs. BVA
STR 0x0A000001 "Notes" 
DBL 0x32200041 "Sample Interval" 0.1
S32 0x00000000 "Data Type" 0
DBL 0x2D000000 "Scale Factor" 1.0
DBL 0x14000081 "Input Freq" 5e6
S32 0x5A000000 "Bin Density" 29
S32 0x5F000000 "Bin Threshold" 4
S32 0x50000000 "Trace History" 1
STR 0xF0000000 "Address" femto.febo.com:1298
STR 0xE6000001 "Instrument" TSC-5120A
DBL 0x3C001001 "Duration" 86400000000
S32 0x00000000 "Duration Type" 5
S32 0x00000000 "Stop Condition" 0
DBL 0x32000080 "Sample Rate" 1000
BLN 0x00000000 "Unwrapped" True

TIC 864000
   -2.3176790940984981E-010
   -2.3332144553377970E-010
   -2.3415675380975065E-010
   -2.3545737328480339E-010
   -2.3673402870843350E-010
   [ ~864K lines like this follow... ]

The semicolon lines are commas that provide helpful context. The STR lines are metadata with the parameters of the run and the fields should always follow the same format. The data are the values on the lines following (in this case, the individual lines don’t need to be timetagged because the metadata tells us the start time and the sample rate (in this case 1000 samples/second, but that’s unusually high; more often it’s once per second).

If we can handle this format, anything else I need will be simpler.

If this is too complex for tags, your idea of a relational database could work. For example, I could have the filename be a unique serial number and build a record in the other database with the metadata, and then in InfluxDB just include that serial number to link the data to the metadata.

Let me know if I’ve frightened you off…

Thanks!
John

HI @n8ur ,
I am so sorry for the late reply I did not get notified about this post. Anytime you don’t hear from me. Just @ me in the post. I will start to convert this now

@n8ur ,
So here is an example of the line protocol:

test,Address=femto.febo.com:1298,Channel=1/1,Driver=Symmetricom\ 5115A,Instrument=TSC-5120A,Trace=HP5065A\ vs.\ BVA,host=Jays-MBP Bin\ Density=29,Bin\ Threshold=4,Data\ Type=0,Duration=86400000000,Duration\ Type=5,Input\ Freq=5000000,MJD=56026.95680392361,Sample\ Interval=0.1,Sample\ Rate=1000,Scale\ Factor=1,Stop\ Condition=0,Trace\ History=1 1636728592000000000

Which will look like this in InfluxDB:

I have a few follow up questions tho:

  1. Does the Time/Data indicate only when the test started or is one of these produced for every point.
  2. How should the data under TIC be represented. Should these each be plotted on the graph at specific intervals?
  3. Should all the meta data including non-STR be considered tags?

Hi Jay –

Now it’s my turn to apologize for slow response! My excuse is that I’ve been working with a simpler dataset and trying to get that going. Thanks so much for putting the schema together. To answer your questions:

  1. The “Time/Date” indicates either the start or the finish time – I need to double check that. It’s not generated on every sample.

  2. The TIC field indicates the total number of samples to follow. The Sample Interval tells the interval between samples. So here, the data is 864000 points generated 0.1 second apart – ie, one day of data at a 10 Hz rate.

Naively we can just plot that data and get some visually interesting graphs, but the primary use of a data set like this is to subject it to a specialized statistical algorithm (Allan Deviation) that provides something like a standard deviation for samples taken at different intervals. At some point, I’ll learn how to use Influx and/or Grafana to do that analysis and generate plots. In the short term, I’ll pull the data from the database to reconstruct the input file so it can be fed into the existing analysis tools.

  1. Yes, everything down to and including the TIC line is metadata and I suppose would go into tags. Here my lack of Influx familiarity shows – as you can see, the data sets can be quite large; I’ve routinely had 5 million or more points. If all those tags are literally included with each data record, it seems like the storage required would be enormous. In my limited knowledge, this seem slike something that a simple relationship database could do well – use a key field that links all the data points in one measurement run to a table containing all the metadata for that run. I don’t know if I’m barking up the wrong tree about this…

Thanks so much for your help. I have one other question, but I’ll post it as a new topic (if I can’t find an answer) rather than have scope creep here.