How does InfluxDB handle duplicate points?

Hi,

According to the FAQ, if the timestamp and tag set are the same, InfluxDB merges the field sets, and for any field key that appears in both points only the new value is kept.

With simple ‘s’ precision, if I send 5 files to Influx with different filenames as field values but the same tag set and timestamp (they are sent so fast that the precision does not register them as different seconds), I only get the last point sent.

So if I send files 1, 2, 3, 4 and 5 within the same second, Influx will only show that I sent file 5.
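
For example (hypothetical measurement and tag names; all five points share the same tag set and timestamp, so only the last value of fname survives):

INSERT upload,site=XRZC fname="file1" 1
INSERT upload,site=XRZC fname="file2" 1
INSERT upload,site=XRZC fname="file3" 1
INSERT upload,site=XRZC fname="file4" 1
INSERT upload,site=XRZC fname="file5" 1
SELECT * FROM upload
name: upload
time                           fname site
----                           ----- ----
1970-01-01T00:00:00.000000001Z file5 XRZC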

According to the FAQ, this is intended behavior:

How does InfluxDB handle duplicate points?
A point is uniquely identified by the measurement name, tag set, and timestamp. If you submit a new point with the same measurement, tag set, and timestamp as an existing point, the field set becomes the union of the old field set and the new field set, where any ties go to the new field set. This is the intended behavior.

The FAQ then gives two suggestions to work around this:

  1. Introduce an arbitrary new tag to enforce uniqueness.

This would explode the index, wouldn’t it?

  2. Increment the timestamp by a nanosecond.

This does not guarantee uniqueness either: another client somewhere else could increment its timestamp to exactly the same nanosecond.
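
Within a single client you could at least make the bump deterministic. A rough Go sketch of what I mean (my own assumption of how a writer might allocate timestamps; it does nothing to coordinate between separate clients):

package main

import (
	"fmt"
	"sync"
	"time"
)

// tsAllocator hands out strictly increasing nanosecond timestamps within one
// process, bumping by 1 ns whenever the wall clock has not moved forward.
type tsAllocator struct {
	mu   sync.Mutex
	last int64
}

func (a *tsAllocator) Next() int64 {
	a.mu.Lock()
	defer a.mu.Unlock()
	now := time.Now().UnixNano()
	if now <= a.last {
		now = a.last + 1 // same or earlier nanosecond: bump past the last one
	}
	a.last = now
	return now
}

func main() {
	var a tsAllocator
	fmt.Println(a.Next(), a.Next()) // the second value is strictly greater
}

Two independent clients can still pick the same nanosecond, which is exactly the problem.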

I’ve already invested a lot of time in my application without realizing this was the case. Unfortunately it is a complete show stopper for me, and I will have to rewrite a lot of code if there is no other way around it.

It is distressing, though, that unique field values are not considered unique and are overwritten by default. There may be a perfectly reasonable use case for this, but for me and my project it is unintuitive and completely broken.

Since it is in the FAQ, this must be a very common question. It would be great if this behavior were configurable. Otherwise it is going to take me weeks to work out another solution for logging my file uploads with my current setup.

Thanks.

I am possibly looking in the wrong section, but:
(from https://github.com/influxdata/influxdb/blob/master/coordinator/points_writer.go)
Maybe I could just always return nil from this function:

// ShardGroupAt attempts to find a shard group that could contain a point
// at the given time.
//
// Shard groups are sorted first according to end time, and then according
// to start time. Therefore, if there are multiple shard groups that match
// this point's time they will be preferred in this order:
//
//  - a shard group with the earliest end time;
//  - (assuming identical end times) the shard group with the earliest start time.
func (l sgList) ShardGroupAt(t time.Time) *meta.ShardGroupInfo {
	idx := sort.Search(len(l), func(i int) bool { return l[i].EndTime.After(t) })

	// We couldn't find a shard group the point falls into.
	if idx == len(l) || t.Before(l[idx].StartTime) {
		return nil
	}
	return &l[idx]
}

@5k3105 if you are writing different field keys with each point, then the separate fields will be stored separately, i.e.:

INSERT foo,bar=baz field_1=1 1
INSERT foo,bar=baz field_2=2 1
SELECT * FROM foo
name: foo
time                           bar field_1 field_2
----                           --- ------- -------
1970-01-01T00:00:00.000000001Z baz 1       2

Does this help? Providing some sample points from your dataset would help tremendously in debugging this issue.

The field is the filename (fname).

time (s), fname
1, file1
1, file2
1, file3

I am using Influx to log my file uploads. Generally the tags are the same but the fields are different, and the files can be uploaded quickly. The problem disappears if I set the precision to us, but that is not really a guarantee.

All I need is for Influx to log everything I send it, regardless of whether points share the same timestamp. I could understand it if the tag set AND the field set were exactly the same in the same second, but in my case it is the fields that are different. And I would hate to ruin my index by putting filenames in it; the database would be useless after I posted a million files.

@5k3105 How about tagging with the filename? The database can easily handle series in the millions, so this should help you solve the issue.

I am using my tag set to recreate the folder structure.

The folder structure is uniform: Year/Domain/Site/Sensor/Product etc.

It’s a bit more complicated than that, but I’ve taken pains to make sure that when I run ‘show series’ I get the unique paths needed to reproduce it.

Each Site stored adds roughly 80 series to the set; I’ve calculated that this will eventually increase to about 48,000 over 10 years.

Using show series allows me to reconstruct this tree in the client application so the user can browse the files. In this use case, users will download the result of show series each time they open the application.

Adding unique filenames to the tag set will destroy that abstraction.
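
For what it’s worth, the reconstruction on the client side amounts to something like this rough Go sketch (hypothetical helper name; the tag names and the Year/Domain/Site/Sensor/Product order come from my schema above, and it assumes no escaped commas or equals signs in the tag values):

package main

import (
	"fmt"
	"strings"
)

// pathFromSeriesKey turns a series key such as
// "log,domain=D90,product=Z,product_level=L200,sensor=X,site=XRZC,visit=90,year=2020"
// into a folder path like "2020/D90/XRZC/X/Z".
func pathFromSeriesKey(key string) string {
	tags := map[string]string{}
	for _, part := range strings.Split(key, ",")[1:] { // skip the measurement name
		if kv := strings.SplitN(part, "=", 2); len(kv) == 2 {
			tags[kv[0]] = kv[1]
		}
	}
	order := []string{"year", "domain", "site", "sensor", "product"}
	parts := make([]string, 0, len(order))
	for _, tag := range order {
		parts = append(parts, tags[tag])
	}
	return strings.Join(parts, "/")
}

func main() {
	fmt.Println(pathFromSeriesKey(
		"log,domain=D90,product=Z,product_level=L200,sensor=X,site=XRZC,visit=90,year=2020"))
	// prints: 2020/D90/XRZC/X/Z
}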

If someone could show me where in the source code the munging of fields occurs, I could compile a build with that change for my application.

Thanks.

Hopefully this is it:

I was asked to share what the expected output is:

Using database ingest_log2
> select * from log
name: log
time                domain filesize fname host       modtime                               objectpath                                    product product_level sensor site spath uploader versionid     visit year
----                ------ -------- ----- ----       -------                               ----------                                    ------- ------------- ------ ---- ----- -------- ---------     ----- ----
1493177738582481000 D90    8214     1.pnr AASD-04911 2017-03-31 16:58:45.3371468 -0600 MDT 2020/fullsite/D90/2020_XRZC_90/L200/X/Z/1.pnr Z       L200          X      XRZC       clxnt    1493177738272 90    2020
1493177738887481000 D90    136      3.pnr AASD-04911 2017-03-31 16:59:18.1551468 -0600 MDT 2020/fullsite/D90/2020_XRZC_90/L200/X/Z/3.pnr Z       L200          X      XRZC       clxnt    1493177738716 90    2020
1493177739297481000 D90    182      2.pnr AASD-04911 2017-03-31 16:59:04.1971468 -0600 MDT 2020/fullsite/D90/2020_XRZC_90/L200/X/Z/2.pnr Z       L200          X      XRZC       clxnt    1493177739008 90    2020
1493177739534481000 D90    124      4.pnr AASD-04911 2017-03-31 16:59:30.1191468 -0600 MDT 2020/fullsite/D90/2020_XRZC_90/L200/X/Z/4.pnr Z       L200          X      XRZC       clxnt    1493177739416 90    2020

This is correct only because

  1. I have set precision to us
  2. I am not sending files concurrently
  3. I am the only user sending files

Tags are: product, product_level, sensor, site, domain, visit, year

In particular - fname is a field.

Because an fname will often have the same tag set as others, any file sent at the same time with the same tag set, regardless of its name, gets overwritten/merged.

I am hoping to narrow down which routine is doing this so I can get the behavior I need for this application, which is simply logging time series data.

Also, in this use case, saving points at nanosecond precision is overkill. I am only doing it now because if I sent a point at second precision, the next point in the same second would get merged.

> show series on ingest_log2
key
---
log,domain=D90,product=Z,product_level=L200,sensor=X,site=XRZC,visit=90,year=2020

I see now that this may have to do with the time field being the B-Tree index key.

Since that is probably the case, having two or more records in the same second would not be allowed: no duplicate keys.

If so, maybe there could be a sequence number - so if configured, timestamp + sequence number would always give a unique key.

The problem then would be that queries depend on the timestamps and might break if a unique sequence key were appended to them.

I can’t say that I am qualified to give an answer, but it would sure be nice if there were a mode where I could be sure that every record I sent to the database was kept ;/

Thanks.

Replying to my own thread: the solution is simple and not new. Use a simplified, Twitter-Snowflake-like method: the time up to the second, then a random number down to the microsecond. Now we have second precision and a unique key.
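
Something like this rough Go sketch (names are mine; the random part fills the sub-second bits at microsecond granularity, so two points in the same second only collide if they also draw the same random offset):

package main

import (
	"fmt"
	"math/rand"
	"time"
)

var rng = rand.New(rand.NewSource(time.Now().UnixNano()))

// uniqueTimestamp keeps the real event time at second precision and fills the
// sub-second part with a random microsecond offset.
func uniqueTimestamp(t time.Time) int64 {
	sec := t.Unix()               // event time, truncated to the second
	micros := rng.Int63n(1000000) // random 0..999999 microseconds
	return sec*int64(time.Second) + micros*int64(time.Microsecond)
}

func main() {
	now := time.Now()
	fmt.Println(uniqueTimestamp(now)) // same second...
	fmt.Println(uniqueTimestamp(now)) // ...almost certainly a different sub-second part
}

Queries that group by time(1s) still see the real second, and the odds of a pair of points colliding within one second are one in a million.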
