Optimize query and schema design to speed up querying against the DB

Hi there,

we are currently evaluating InfluxDB as a potential replacement for our current data handling.

The data we want to store is tree-structured (i.e. “System.performance.cpu.usage”) and is always queried by its complete path (never just “System.performance”, always the full path).
For scope: we have a couple of thousand leaf nodes, most of which contain time-series data, and we write as often as every ms.
I translated the structure so that the last path segment is the field key and everything else is the measurement (in the example: measurement = “System.performance.cpu”, field key = “usage”). Is that a viable way of doing it, or is there a better way?
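Under that mapping, a single write for the example path would look roughly like this in line protocol (a sketch; value and timestamp are illustrative):

```
System.performance.cpu usage=42.5 1650000000000
```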

For writing I use the following code:
using (InfluxDBClient client = InfluxDBClientFactory.Create(clientOptions))
{
    client.EnableGzip(); // I’m not sure if this helps, but the best-practices docs recommend it
    task = client.GetWriteApiAsync().WriteRecordAsync(WritePrecision.Ms, write);
}

clientOptions contains my URL (“localhost”), my token (all-buckets read/write), my org and the bucket.
“write” is a string containing a batch of writes, format: “measurement leaf=data TimestampInMs\n…”
This code is executed with 5000 single records in one batch, following the best-practices recommendation.
I only wait for the task to complete if I need to query against any of the written data.

For querying I use the following code:
List<FluxTable> table;
using (InfluxDBClient client = InfluxDBClientFactory.Create(clientOptions))
table = client.GetQueryApiSync().QuerySync(query);

clientOptions is the same as before and the query looks like this:
from(bucket: "MyBucket")
  |> range(start: -30d)
  |> filter(fn: (r) => r._measurement == "path")
  |> filter(fn: (r) => r["_field"] == "leaf")
  |> drop(columns: ["_start", "_stop", "_measurement", "_field"])
  |> last()

As the data is quite often requested without a time period (mostly ‘last n points’), I am currently using my retention period as the start time.
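For the ‘last n points’ case, Flux’s `tail()` can express this directly; a sketch, reusing the bucket and filters from the query above:

```
from(bucket: "MyBucket")
  |> range(start: -30d)
  |> filter(fn: (r) => r._measurement == "path" and r._field == "leaf")
  |> tail(n: 10)
```

Note that the scan cost is still driven by the `range()` window, so this does not by itself avoid the wide time range.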

Now to my Issue:

While the time spent on writing is more than fast enough (0.6 ms to 2 ms per batch), the read time lags massively behind our current implementation (0.5 ms vs. 0.03 ms).
And even though single read and write times are fairly similar, batching amortizes the cost per write, which of course doesn’t help the read time.

Is there any way to make my queries faster?
As Influx beats our current implementation at everything except querying, a trade-off is more than okay.

I am using InfluxDB 2.1 and version 4.0.0 of the C# API client.

Hello @TomsCodingCode,
No, I believe you’re committing a common error. Instead of this:

I translated the structure so that the last part is the field-key and everything else is the measurement (in the example: measurement = “System.performance.cpu” field-key=“usage”). Is that a viable way of doing it or is there a better way?

Your measurement should be system, your tag should be cpu, and your field key usage.
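In line protocol, that suggested schema would look roughly like this (a sketch; the tag key `component` is my own naming, and the value is illustrative):

```
system,component=cpu usage=20.0 1650000000000
```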
I’d take a look at the following to better understand:


Common schema design mistakes that lead to runaway cardinality

Mistake 1 : Log messages as tags.
Solution 1 : We don’t advise that anyone store logs as tags due to the potential for unbounded cardinality (e.g. logs likely contain unique timestamps, UUIDs, etc). You can store attributes of a log as a tag, as long as the cardinality isn’t unbounded. For example, you could extract the log level (error, info, debug) or some key fields from the log message. Storing logs as a field is ok, but it is less efficient to search (essentially table scans), compared to other solutions.

Mistake 2 : Too many measurements. This typically happens when people are moving from —or think of InfluxDB as— a key-value store. So for example, if you’re writing system stats to an InfluxDB instance you might be inclined to write data like so: Cpu.server-5.us-west.usage_user value=20.0
Solution 2 : Instead, encode that information as tags, like so: cpu,host=server-5,region=us-west usage_user=20.0

Mistake 3 : Making ids (such as eventid, orderid, or userid) a tag. This is another example that can cause unbounded cardinality if the tag values aren’t scoped.
Solution 3 : Instead, make these metrics a field.

As for querying for the last() value:
can you decrease your time range? Or do you expect to frequently get no data for an entire month?
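For example, if fresh data usually arrives within the last hour, a narrower range lets the engine scan far fewer blocks. A sketch based on the query above:

```
from(bucket: "MyBucket")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "path" and r._field == "leaf")
  |> last()
```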

Thank you for getting back to me so quickly!
Some data is not actually time-series data and is only generated after every restart of the system (e.g. the software version), so I do expect that sometimes. But because those values might not be generated for long periods either, I will move them into a different bucket and add some checks in my code.
I’m guessing that querying for 1h and then, if I didn’t find anything, querying again with a larger range will also help (with reasonable thresholds, of course).
If the data comes in very regular intervals, is the time saving linear in the size of the time range, or is there a point after which it becomes basically irrelevant? E.g. 1m vs. 10min.

Thank you, I will be trying this out as soon as I’m back to the office on Wednesday.
As I don’t have an underlying logic to the nodes in my data tree like in the case described here

I am wondering if it’d make more sense to have lexicographical tag keys (A=performance, B=cpu, C=usage) or if I should rather use the previous tag value as the next key (System.performance.cpu.usage.currentValue becomes: system,root=performance,performance=cpu,cpu=usage currentValue=…).
Because B lies within a certain range depending on A, which I understand falls under dependent tags; on the other hand, so do the possibilities for the tag keys.
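For concreteness, the two candidate encodings of System.performance.cpu.usage.currentValue would be (sketches; the field value is illustrative):

```
system,A=performance,B=cpu,C=usage currentValue=1.0
system,root=performance,performance=cpu,cpu=usage currentValue=1.0
```

The first keeps a fixed, small set of tag keys regardless of tree depth; the second creates a new tag key per parent value, which multiplies the number of distinct tag keys across the schema.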