Best practices around database/retention policy/measurement design

Hello,

I have read the docs and searched through the GitHub issues and this forum. However, I couldn’t find definitive answers to my questions.

I’m storing financial data in Influx - it’s not a huge amount, and I have no need for any retention policies other than the INF one. The dataset would consist of both raw market data coming from different sources and manually downsampled/derived series.

I’m mostly concerned about operational issues that I might run into due to inappropriate data placement across databases/retention policies/series. After some reading I think the following would work, but I’d like some validation :)

  • database per source of data - this allows independent backup/restore
  • everything in a single autogen retention policy - can’t see any drawbacks to doing that
  • measurement per type of data, each stored with a different epoch precision (e.g. raw with ms, 5min with m, etc.) - I really couldn’t find any information on the special meaning or implications of choosing a measurement vs. a tag.

And the questions:

  1. What do I lose by splitting the data out into separate databases?
  2. Are there any known drawbacks to having everything in the autogen retention policy?
  3. Is there a problem with storing everything in the same measurement given the field set is the same?
  4. If I back up the databases at different points in time and then restore either a subset of them or backups made at different times, will the metastore be in good shape? (I will test it out once I have some time.)

Thanks in advance!

@dm3

  1. You lose the ability to query it together. If the sources have a similar schema, writing them all to the same measurement will allow you to correlate the data. In that situation you would distinguish the data sources with a tag (source=foo).
  2. Nope!
  3. You can think of measurements as tables in SQL, tags as indexed columns, and fields as unindexed columns. Does this help?
  4. This should work for you. Give it a test though. We made some backup changes for the 1.2 release, so make sure you are using the latest version.
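
To make the table analogy from answer 3 and the source tag from answer 1 concrete, here is a minimal sketch of how one point could be laid out in line protocol. The measurement name (quotes), the symbol tag, and the bid field are hypothetical; only source=foo comes from the answer above. Escaping of spaces and commas is deliberately left out, so this is an illustration rather than a complete writer.

```java
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

public class LineProtocolSketch {

    // Lays out one point as: measurement,tagKey=tagValue fieldKey=fieldValue timestamp
    // Assumption: names and values contain no spaces, commas or equals signs (no escaping here).
    static String toLine(String measurement, Map<String, String> tags,
                         Map<String, Double> fields, long timestamp) {
        StringBuilder line = new StringBuilder(measurement);
        // Tags are the indexed part of the point; TreeMap emits them sorted by key.
        for (Map.Entry<String, String> tag : new TreeMap<>(tags).entrySet()) {
            line.append(',').append(tag.getKey()).append('=').append(tag.getValue());
        }
        // Fields hold the actual values and are not indexed.
        StringJoiner fieldSet = new StringJoiner(",");
        for (Map.Entry<String, Double> field : new TreeMap<>(fields).entrySet()) {
            fieldSet.add(field.getKey() + "=" + field.getValue());
        }
        return line.append(' ').append(fieldSet).append(' ').append(timestamp).toString();
    }

    public static void main(String[] args) {
        // Two sources share one measurement and are told apart by the "source" tag.
        System.out.println(toLine("quotes", Map.of("source", "foo", "symbol", "ABC"),
                Map.of("bid", 101.5), 1L));
        System.out.println(toLine("quotes", Map.of("source", "bar", "symbol", "ABC"),
                Map.of("bid", 101.25), 1L));
    }
}
```

With this layout, a query that filters or groups on the source tag can compare both feeds in one statement, which is the ability answer 1 says is lost when the sources live in separate databases.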

@jackzampolin thanks!

Re: 1 - I think I’ll keep it in separate dbs for easier maintenance.
Re: 2 - I actually ended up creating separate RPs with different shard durations, set according to the granularity of the input data (following the recommendations in the official docs).
Re: 3 - I guess it helps to think of the measurement as a relational table. However, I’ve read somewhere that Influx is planning to abolish both measurements and the field/tag distinction in 2.0. Is that true? Is there any timeline for when 2.0 is going to see the light of day?
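
For readers wondering what the separate RPs from "Re: 2" might look like, here is a small sketch that assembles InfluxQL CREATE RETENTION POLICY statements with an infinite duration and an explicit shard duration. The policy names (raw, five_min), the database name (market_foo), and the shard durations are placeholders, not the values actually used in this thread.

```java
public class RetentionPolicySketch {

    // Builds: CREATE RETENTION POLICY "<name>" ON "<db>" DURATION INF REPLICATION 1 SHARD DURATION <d> [DEFAULT]
    // Assumption: single-node setup, hence REPLICATION 1.
    static String createRetentionPolicy(String name, String database,
                                        String shardDuration, boolean isDefault) {
        return "CREATE RETENTION POLICY \"" + name + "\" ON \"" + database + "\""
                + " DURATION INF REPLICATION 1 SHARD DURATION " + shardDuration
                + (isDefault ? " DEFAULT" : "");
    }

    public static void main(String[] args) {
        // Placeholder values: shorter shards for dense raw data, longer ones for sparse derived data.
        System.out.println(createRetentionPolicy("raw", "market_foo", "1d", true));
        System.out.println(createRetentionPolicy("five_min", "market_foo", "4w", false));
    }
}
```

The generated statements can be pasted into the influx CLI or sent through whichever client is in use; data is still kept forever, and only the size of the shard groups differs between the policies.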

@dm3 I wouldn’t worry about the 2.0 stuff. There’s a lot of work between now and then; it is at least 1.5 years out. Also, 2.0 would be able to support the 1.0 data format, so if you prefer that you could keep using the 1.0 syntax.

@jackzampolin let’s dig a little more into #3 here: if you’re likening measurements to tables, are there any issues with points within a measurement having totally different tags and/or fields, let alone values?

Because a database table typically holds a specific entity with a standard set of columns, you can work with the data knowing what you expect to be there. Since InfluxDB doesn’t actually have a schema, every point within a measurement can have different tags and/or fields, let alone different values for those fields. I know other systems like Elasticsearch really push for dense organization of data (i.e. highly consistent collections of data) for performance’s sake.

@andyfeller Ideally you have a standard set of fields/tags within the same measurement. The lack of an enforced schema, as you astutely point out, means that you can have orthogonal data in the same measurement. This is generally not recommended and is not a performant way to store data. That said, there are use cases where it can be advantageous and, because we are a columnar store, performant (this post is a good example).

Thanks @jackzampolin! Perhaps we can dig a little deeper on #3 with a practical example of how you generally recommend capturing measurements. I’ve thrown together an example Java Kafka consumer thread that processes batches of records from Kafka. I’m interested in capturing lots of data about the processing of any given record, including latency and throughput, both within the meat of the code and in the surrounding logic. It’s also worth noting that the volume of data typically driven through Kafka is vast, so it’s possible there will be a lot of points from this.

How would you suggest creating measurement(s) around such a case?

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;

import java.util.*;

public class ExampleKafkaConsumerThread implements Runnable {

	private final KafkaConsumer<String, String> kafkaConsumer;
	private long pollingTimeMillis = 10000;

	public ExampleKafkaConsumerThread(final KafkaConsumer<String, String> kafkaConsumer) {
		this.kafkaConsumer = kafkaConsumer;
	}

	@Override
	public void run() {

		final ConsumerRecords<String, String> records = kafkaConsumer.poll(pollingTimeMillis);

		// TODO: Capture latency around polling kafka and number of records

		for (final TopicPartition topicPartition : records.partitions()) {

			final List<ConsumerRecord<String, String>> partitionRecords = records.records(topicPartition);

			for (final ConsumerRecord<String, String> record : partitionRecords) {

				try {

					// TODO: Capture point around the specifics of the record beginning processing

					// Bunch of code here where we'll want to capture other metrics around latency, rates, etc

					// TODO: Capture latency around processing record and any particulars of record

				} catch (final Exception e) {

					// TODO: Capture point around exception in processing record and specifics of exception
				}
			}

			// TODO: Capture latency around processing records and number of records
		}

		// TODO: Capture latency around processing records and number of records
	}

	public void setPollingTimeMillis(final long pollingTimeMillis) {
		this.pollingTimeMillis = pollingTimeMillis;
	}
}
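
One possible shape for the per-partition TODOs above, offered as a sketch rather than an official recommendation: a single measurement with a fixed tag set (topic, partition) and a fixed field set (latency and record count), written once per batch instead of once per record to keep the point volume manageable. The measurement name kafka_consumer_batch, the field names, and the topic name orders are hypothetical, and the example avoids the Kafka and InfluxDB client libraries so it can run on its own.

```java
import java.util.concurrent.TimeUnit;

public class ConsumerMetricsSketch {

    // One point per processed batch, in line protocol:
    // kafka_consumer_batch,partition=<n>,topic=<t> latency_ms=<n>i,record_count=<n>i <timestamp>
    // Assumption: the topic name needs no escaping and the write uses millisecond precision.
    static String batchPoint(String topic, int partition, long latencyMillis,
                             int recordCount, long timestampMillis) {
        return "kafka_consumer_batch"
                + ",partition=" + partition + ",topic=" + topic
                + " latency_ms=" + latencyMillis + "i,record_count=" + recordCount + "i"
                + " " + timestampMillis;
    }

    public static void main(String[] args) {
        int recordCount = 500;
        long start = System.nanoTime();
        long checksum = 0;
        for (int i = 0; i < recordCount; i++) {
            checksum += i; // stand-in for the real per-record processing
        }
        long latencyMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println("processed checksum " + checksum);
        System.out.println(batchPoint("orders", 0, latencyMillis, recordCount,
                System.currentTimeMillis()));
    }
}
```

Keeping the tag values low in cardinality (topic and partition rather than, say, a record key) keeps the number of series bounded, while per-record details such as exception messages fit better as fields or in a separate measurement with its own consistent field set.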