Hi. I’m new to Influx. As part of a research project I was given a database with around 1200 keys and 10TB of data in it. There are one or two specialized tasks I would like to do comprehensively on these data. My understanding is that while databases are good for efficiently extracting specific subsets of data, they are not the right tool if I want to chew through the whole lot.
Are there tools or libraries that would allow me to iterate through all the records?
I was looking at Flux, but I found posts claiming it is actually slower than InfluxQL… I don’t know.
First of all I would ask: do you specifically want to perform timestamp-based
operations on this data? The alternative is general queries where you
search primarily on something other than the timestamp of the records.
It’s important to understand that InfluxDB is a
time-series database and is very good at what
it does, but nothing like as good as a relational database
if you want to perform more general types of queries across the data.
So, before you start dealing with 10 TBytes of data, make sure you’ve chosen
the right tool for the job.
Thanks for your reply.
To answer your questions: generally speaking, I will need to do timestamp-based operations, and I believe Influx is appropriate for the data I have. It’s just one or two computations that I need to run over all the data. I need to compute a binary value that means “these 6k fields with tag A=1 are ‘similar in value’ to those with tag A=0”. I need to do this for all measurements, for all times. Querying the data and making the comparisons one measurement at a time is slow going.
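To make the task concrete, here is a minimal Python sketch of that per-measurement comparison. The similarity metric (mean absolute difference under a tolerance), the tolerance value, and the shape of the inputs are all hypothetical placeholders — the thread never defines what “similar in value” means:

```python
# Hypothetical similarity check: two groups of field values count as
# "similar" when their mean absolute difference is below a tolerance.
# Metric and tolerance are illustrative assumptions, not from the thread.

def similar(values_a1, values_a0, tol=0.05):
    """Return True when paired values for tag A=1 and tag A=0 are close."""
    if len(values_a1) != len(values_a0) or not values_a1:
        return False
    mean_abs_diff = sum(
        abs(x - y) for x, y in zip(values_a1, values_a0)
    ) / len(values_a1)
    return mean_abs_diff < tol
```

In practice each list would come from a query filtered on the tag value (A=1 vs A=0) for one measurement, and the loop over measurements is where the time goes.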
You say that you “need to compute a binary value that means “these 6k fields
with tag A=1 are ‘similar in value’ to those with tag A=0”. I need to do this
for all measurements for all times.”
That (especially the last three words) makes me think that you are not
primarily selecting by timestamp.
If you need to compare all values in the database with tag A=0 against all
values with tag A=1, and then see whether some other field has a “similar
value”, then it doesn’t sound as though timestamp sequencing is important.
If you can give any slightly more specific example of the sort of query you’re
considering doing on the data, we should be able to give a more definitive
opinion on whether a TSDB is the best place to start.
I think your intuition is correct, for this one task timestamp sequencing is not important.
Yet, the data are in an Influx database. So, my question is basically: how can I read through the data without using Influx queries? Dumping the whole database to CSV first seems wasteful. Is there any way to access that data directly? E.g. can I write a Python or Go program that iterates through all rows and does some computations?
I think your intuition is correct, for this one task timestamp sequencing
is not important. Yet, the data are in an influx database.
Taking your questions one at a time:
So, my question is basically how can I read through the data without using
the influx queries?
Almost any way you like, but it will be woefully inefficient.
Dumping the whole database to csv first seems wasteful.
I think you might be surprised. 10Tbytes might seem like a lot, but provided
you have some spare disk capacity, dumping to CSV and then importing into an
RDBMS may well give you far faster results in the long run than trying to do
this in Influx.
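Even before (or instead of) loading the export into an RDBMS, a CSV dump can be processed in a streaming fashion. A minimal sketch, assuming a CSV with a header row — the column names (“measurement”, “value”) are made up for illustration and will not match the actual export schema:

```python
# Stream a large CSV export row by row instead of loading it whole.
# Column names here are illustrative assumptions, not the real schema.

import csv
import io

def column_sum(csv_text, column):
    """Stream a CSV and sum one numeric column without loading it all."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(float(row[column]) for row in reader)
```

For a real 10 TB export you would pass an open file object instead of an in-memory string, but the row-at-a-time pattern is the same.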
Is there any way to access that data directly? E.g. can I write a python or
GO program that iterates through all rows and does some computations?
Yes, you can, but this will by its very nature be inefficient. Part of the
whole purpose of a(ny) database is the power of its query language, and if you
simply suck out all of the data and then sort/select it in something like
Python or Go, you immediately lose that power.
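If you do go that route, the one thing that matters at 10 TB is to stream in batches rather than pull everything into memory. With the InfluxDB 1.x Python client the rows would come from a chunked query (`client.query(..., chunked=True, chunk_size=...)`); the sketch below keeps the batching logic itself server-free and runnable:

```python
# Sketch of streaming rows in fixed-size batches so the full dataset
# never sits in memory at once. The row source here is a plain iterable;
# in practice it would be a chunked InfluxDB query result.

from itertools import islice

def batched(rows, size):
    """Yield lists of at most `size` rows from any iterable of rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def process(rows, size=10000):
    """Example computation: count rows per batch while streaming."""
    return [len(b) for b in batched(rows, size)]
```

The per-batch computation is where your comparison logic would go; the point is only that memory stays bounded by `size`, not by the database.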
I’ll let anyone else step in here and suggest a more efficient way to get the
data out of Influx and into something like MySQL / MariaDB / PostgreSQL / MSSQL
or whatever else you might be able to use, but I remain of the opinion that
analysing this quantity of data in InfluxDB, when the task seems far better
suited to an RDBMS, is going to be very slow.