According to the FAQ, if the timestamp and tag set are the same but the fields are different, InfluxDB keeps only a single point, with the new field values winning.
With plain ‘s’ precision, if I send 5 files to InfluxDB with different names as fields but the same tag set and timestamp (they are sent so fast that the precision does not register them as different seconds), I only get the last point sent.
So if I send files 1, 2, 3, 4, 5 at the same time, within the same second, InfluxDB will only show that I sent file 5.
According to the FAQ this is intended behavior:
How does InfluxDB handle duplicate points?
A point is uniquely identified by the measurement name, tag set, and timestamp. If you submit a new point with the same measurement, tag set, and timestamp as an existing point, the field set becomes the union of the old field set and the new field set, where any ties go to the new field set. This is the intended behavior.
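To make the merge concrete, here is a minimal sketch of the behavior (the measurement, tags, and field values are only illustrative, and the line protocol strings are just printed, not written anywhere):

package main

import "fmt"

func main() {
	// Two writes with the same measurement ("log"), the same tag set
	// (site=XRZC) and the same second-precision timestamp, but
	// different field sets.
	first := `log,site=XRZC fname="1.pnr",filesize=8214i 1493177738`
	second := `log,site=XRZC fname="2.pnr" 1493177738`

	// Per the FAQ, InfluxDB ends up with a single point whose field set
	// is the union of both writes, with ties going to the newer write:
	//
	//   log,site=XRZC fname="2.pnr",filesize=8214i 1493177738
	//
	// so "1.pnr" is silently replaced and only one row remains.
	fmt.Println(first)
	fmt.Println(second)
}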
The FAQ then gives two suggestions to work around this:
1. Introduce an arbitrary new tag to enforce uniqueness. Wouldn’t this explode the index?
2. Increment the timestamp by a nanosecond. This does not guarantee that another client somewhere else hasn’t incremented its timestamp to the same nanosecond.
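For what it’s worth, here is a rough sketch of my own (not from the docs) of what the second suggestion looks like for a single writer, and why it doesn’t help across clients; uniqueTimestamp is a made-up helper:

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// seq is a per-process counter. Folding it into the nanosecond digits of
// the timestamp keeps points from *this* writer unique, but another
// client doing the same thing can still land on the exact same nanosecond.
var seq int64

// uniqueTimestamp returns the current second with a small per-process
// offset added to the sub-second digits.
func uniqueTimestamp() time.Time {
	n := atomic.AddInt64(&seq, 1)
	return time.Now().Truncate(time.Second).Add(time.Duration(n) * time.Nanosecond)
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println(uniqueTimestamp().UnixNano())
	}
}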
I’ve already invested a lot of time in my application without realizing this was the case. Unfortunately this is a complete showstopper for me, and I will have to rewrite a lot of code if there is no other way around it.
It seems distressing, though, that unique field sets are not considered unique and are overwritten by default. There may be a perfectly reasonable use case for this, but at least for me and my project it’s unintuitive and completely broken.
Since it is in the FAQ this must be a very common question. It would be great if this behavior were configurable. Otherwise it’s going to take me weeks to work out another solution for logging my file uploads with my current setup.
// ShardGroupAt attempts to find a shard group that could contain a point
// at the given time.
//
// Shard groups are sorted first according to end time, and then according
// to start time. Therefore, if there are multiple shard groups that match
// this point's time they will be preferred in this order:
//
// - a shard group with the earliest end time;
// - (assuming identical end times) the shard group with the earliest start time.
func (l sgList) ShardGroupAt(t time.Time) *meta.ShardGroupInfo {
	idx := sort.Search(len(l), func(i int) bool { return l[i].EndTime.After(t) })
	// We couldn't find a shard group the point falls into.
	if idx == len(l) || t.Before(l[idx].StartTime) {
		return nil
	}
	return &l[idx]
}
I am using InfluxDB to log my file uploads. Generally the tags are the same but the fields are different, and the files can be uploaded in quick succession. The problem disappears if I turn the precision up to ‘us’, but that is not really a guarantee.
All I need is for InfluxDB to log everything I send it, regardless of whether it arrives at the same time. I could understand it if the tag set AND the field set were exactly the same at the same second, but in my case it’s the fields that are different. And I would hate to ruin my index by putting filenames in tags; the database would be useless after I posted a million files.
I am using my tag set to recreate the folder structure.
The folder structure is uniform: Year/Domain/Site/Sensor/Product, etc.
It’s a bit more complicated than that, but I’ve taken pains to make sure that when I run ‘show series’ I get the unique paths needed to reproduce it.
Each site stored adds roughly 80 entries to the tag set; I’ve calculated that this will eventually grow to about 48,000 over 10 years.
Using show series lets me reconstruct this tree in the client application so the user can browse the files. In this use case, users download the result of show series each time they open the application.
Adding unique filenames to the tag set would destroy that abstraction.
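For context, this is roughly how the client rebuilds the tree from a series key; a minimal sketch under my own assumptions (pathFromSeriesKey is a made-up helper and the tag order is simplified compared with my real schema):

package main

import (
	"fmt"
	"strings"
)

// pathFromSeriesKey turns a `show series` key such as
// "log,domain=D90,...,year=2020" into a folder path by pulling the tag
// values out in the order the folders are laid out.
func pathFromSeriesKey(key string, order []string) string {
	tags := map[string]string{}
	for _, part := range strings.Split(key, ",")[1:] { // [0] is the measurement
		if kv := strings.SplitN(part, "=", 2); len(kv) == 2 {
			tags[kv[0]] = kv[1]
		}
	}
	var parts []string
	for _, tag := range order {
		parts = append(parts, tags[tag])
	}
	return strings.Join(parts, "/")
}

func main() {
	key := "log,domain=D90,product=Z,product_level=L200,sensor=X,site=XRZC,visit=90,year=2020"
	fmt.Println(pathFromSeriesKey(key, []string{"year", "domain", "site", "sensor", "product"}))
	// prints: 2020/D90/XRZC/X/Z
}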
If someone could show me where in the source code the munging of fields occurs, I could compile a build with that behavior changed for my application.
Using database ingest_log2
> select * from log
name: log
time domain filesize fname host modtime objectpath product product_level sensor site spath uploader versionid visit year
---- ------ -------- ----- ---- ------- ---------- ------- ------------- ------ ---- ----- -------- --------- ----- ----
1493177738582481000 D90 8214 1.pnr AASD-04911 2017-03-31 16:58:45.3371468 -0600 MDT 2020/fullsite/D90/2020_XRZC_90/L200/X/Z/1.pnr Z L200 X XRZC clxnt 1493177738272 90 2020
1493177738887481000 D90 136 3.pnr AASD-04911 2017-03-31 16:59:18.1551468 -0600 MDT 2020/fullsite/D90/2020_XRZC_90/L200/X/Z/3.pnr Z L200 X XRZC clxnt 1493177738716 90 2020
1493177739297481000 D90 182 2.pnr AASD-04911 2017-03-31 16:59:04.1971468 -0600 MDT 2020/fullsite/D90/2020_XRZC_90/L200/X/Z/2.pnr Z L200 X XRZC clxnt 1493177739008 90 2020
1493177739534481000 D90 124 4.pnr AASD-04911 2017-03-31 16:59:30.1191468 -0600 MDT 2020/fullsite/D90/2020_XRZC_90/L200/X/Z/4.pnr Z L200 X XRZC clxnt 1493177739416 90 2020
This is correct only because:
1. I have set precision to us
2. I am not sending files concurrently
3. I am the only user sending files
The tags are: product, product_level, sensor, site, domain, visit, year. In particular, fname is a field.
Because a fname will often share its tag set with other files, any file, regardless of its name, that is sent at the same time with the same tag set is overwritten/merged.
I am hoping to narrow down which routine is doing this so I can get the behavior I need for this application, which amounts to logging time series data.
Also, in this use case, saving points at nanosecond precision is overkill. I am only doing it now because if I sent a point at second precision, the next point during the same second would get merged.
> show series on ingest_log2
key
---
log,domain=D90,product=Z,product_level=L200,sensor=X,site=XRZC,visit=90,year=2020
I see now that this may have to do with the B-Tree index being keyed on the time field.
Since this is probably the case, having two or more records in the same second would not be allowed - no duplicate keys.
If so, maybe there could be a sequence number, so that, if configured, timestamp + sequence number would always give a unique key.
The problem then would be that all the queries depend on the timestamps and might break if this unique sequence key were appended.
Can’t say that I am qualified to give an answer but it would sure be nice if there were a mode where I could be sure that every record I sent to the database was kept ;/
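To sketch what I mean by such a mode, here is a toy model (nothing like InfluxDB’s real internals) of how an extra sequence number in the key would keep both records instead of merging them:

package main

import "fmt"

// pointKey is a toy model of how a point is identified. It is NOT
// InfluxDB's real data structure; it only shows how an extra sequence
// number would keep otherwise-identical points from colliding.
type pointKey struct {
	measurement string
	tagSet      string
	timestamp   int64
	seq         uint64 // the hypothetical, configurable sequence number
}

func main() {
	store := map[pointKey]string{} // key -> field set

	// Today the first three components alone form the key, so these two
	// writes would merge into one point. With a sequence number each
	// write gets its own key and both records are kept.
	store[pointKey{"log", "site=XRZC", 1493177738, 0}] = `fname="1.pnr"`
	store[pointKey{"log", "site=XRZC", 1493177738, 1}] = `fname="2.pnr"`

	fmt.Println(len(store)) // prints 2
}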
Replying to my own thread: the solution is simple and not new. Use a simplified, Twitter-snowflake-like method: keep the time at second precision, then fill in a random number up to microsecond precision. Now we have second precision and a unique key.
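For anyone who finds this later, this is roughly what I mean, sketched in Go under my own assumptions (snowflakeTimestamp is a made-up helper; the random digits exist only to make the point key unique):

package main

import (
	"fmt"
	"math/rand"
	"time"
)

// snowflakeTimestamp keeps the real time at second precision and fills
// the sub-second digits with a random microsecond offset. The seconds
// part stays meaningful; the random part exists only so that two writes
// in the same second (even from different clients) almost never share a
// timestamp and therefore are not merged.
func snowflakeTimestamp(now time.Time) time.Time {
	randomMicros := rand.Int63n(1_000_000) // 0..999999 microseconds
	return now.Truncate(time.Second).Add(time.Duration(randomMicros) * time.Microsecond)
}

func main() {
	t := snowflakeTimestamp(time.Now())
	fmt.Println(t.UnixNano()) // send this as the point's timestamp
}

Two points in the same second with the same tag set then collide with probability about one in a million; with many points per second the birthday effect raises that, so the width of the random range is worth checking against the real upload rate.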