Missing Data when comparing with Raw Data

Anshul_choudhary · March 27, 2019, 10:35am

hi everyone,

hope you can help me

I can see that some data points are missing when I compare the raw CSV file with the data available in InfluxDB (viewed in Chronograf).

Can you tell me an instance when this can happen?
PS: I know that when a point has the same timestamp and the same tags as an existing point, it is overwritten by the latest line protocol write. That is not the case in my scenario.

Raw Data:

From Influx:

The raw data is in EST and the data in InfluxDB (Chronograf) is shown in IST, which is 10 hours 30 minutes ahead of EST.

Regards,
Anshul

noahcrowley · March 27, 2019, 3:48pm

The time difference is likely due to the fact that data is stored in InfluxDB using UTC time, but adjusted for your local time zone when viewed in Chronograf.
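As a minimal sketch of that offset, assuming fixed UTC offsets for EST (UTC-5) and IST (UTC+5:30) and ignoring daylight saving time, the same instant can be rendered in all three zones. The sample timestamp is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for illustration only (assumption: no daylight saving time).
EST = timezone(timedelta(hours=-5), "EST")
IST = timezone(timedelta(hours=5, minutes=30), "IST")

# Hypothetical wall-clock reading taken from a CSV recorded in EST.
recorded = datetime(2019, 3, 27, 9, 0, 0, tzinfo=EST)

stored_utc = recorded.astimezone(timezone.utc)  # the instant expressed in UTC
shown_ist = recorded.astimezone(IST)            # the same instant shown in IST

# Difference between the two wall-clock readings of the same instant.
display_shift = shown_ist.replace(tzinfo=None) - recorded.replace(tzinfo=None)

print(stored_utc.isoformat())
print(shown_ist.isoformat())
print(display_shift)
```

All three values describe one instant, so a shift of this kind changes only how the time is displayed and cannot by itself remove points.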

Can you share your CSV, or an example that exhibits the same issue, and provide more information about how you are loading the data into the database, how you are querying the data back, and what data is missing? If you are receiving a successful response to your write request, no data should be missing.

Anshul_choudhary · March 28, 2019, 9:12am

I compared the row count shown in the logs while inserting the data with the row count of the raw data file, using a simple count(). Both give exactly 233288. After insertion, however, count() in InfluxDB returns 188282 rows, which is a difference of 45006. Note that the timestamps are not the same for the values I compared; I explain this and attach the information below.
The timezone shouldn't be an issue. I use the code below to translate the file's time zone (EST) to UTC. The entries that do match have exactly the expected time after translating from EST to what has been stored in InfluxDB. My InfluxDB instance stores UTC+5:30. I have checked this manually.

    import datetime
    from pytz import timezone

    # Define timezone
    TIME_FORMAT, file_timezone = '%Y-%m-%d %H:%M:%S.%f', 'US/Eastern'

    # Convert timezone
    datetime_naive = datetime.datetime.strptime(str(row[time_column]), TIME_FORMAT)
    datetime_aware = timezone(file_timezone).localize(datetime_naive)
    timestamp = int(datetime_aware.timestamp()) * 1000000000

This is the code being used. I can't share the full code as it contains other non-shareable information.

    try:
        reader = csvfile
        for row in reader:
            tags, fields = [], []

            # Update the tags
            for t in tag_columns:
                if t in row.keys():
                    tags.append("{tag}={value}".format(tag=tag_columns[t], value=format_tags(row[t])))
            # Update the values
            for f in field_columns:
                if f in row.keys():
                    fields.append("{field}={value}".format(field=field_columns[f], value=format_fields(row[f])))
            # Update the time
            datetime_naive = datetime.datetime.strptime(str(row[time_column]), TIME_FORMAT)
            datetime_aware = timezone(file_timezone).localize(datetime_naive)
            timestamp = int(datetime_aware.timestamp()) * 1000000000
            # Create the line protocol
            data_points.append("{MEASUREMENT},{tags} {fields} {time}".format(MEASUREMENT=MEASUREMENT, tags=','.join(tags), fields=','.join(fields), time=timestamp))
            # Count the number of rows that have been read
            count = count + 1
            if count > 10000000000:
                print("Exiting the for loop as rows exceeded 10M")
                break

            # Write a full batch and start a new one
            if len(data_points) % BATCH_SIZE == 0:
                line = '\n'.join(data_points)
                # print('Read %d lines ' % count)
                # print('Inserting %d data_points...' % (len(data_points)))
                ingestion_status = client.write_points(line, protocol=PROTOCOL)
                if ingestion_status is False:
                    print('Problem inserting points, exiting...')
                    exit(1)
                print("Wrote %d, and ingestion status is: %s" % (len(data_points), ingestion_status))
                data_points = []
        # Write whatever is left after the last full batch
        if len(data_points) > 0:
            line = '\n'.join(data_points)
            print('Writing the remaining rows', len(data_points))
            ingestion_status = client.write_points(line, protocol=PROTOCOL)
            print("Final rows are = %d, and ingestion status is %s" % (len(data_points), ingestion_status))
    # (the matching except clause is in the part of the code that was not shared)

The data I tested by comparing the raw file with the data in Chronograf is shown below for value = 41.

From Raw Data when value = 41

From InfluxDB when value = 41

Here the difference can be seen for the value 41: there are 28 entries in the raw data and 26 entries in InfluxDB, even though I know the timestamps are different and, where the timestamp is the same, the tags are different.

Can you please suggest what could be the issue? If you need anything from my end, please let me know.

Anshul_choudhary · April 3, 2019, 10:32am

@noahcrowley, is there anything you can help with?

Anshul_choudhary · April 12, 2019, 10:37am

@noahcrowley, I have been debugging this issue for a long time. I found that when I inject a unique identifier, all the rows are ingested, but there is still around 10% data loss when the unique identifier is removed from the ingestion. The unique identifier also seems to hit a limit as a tag value once the count goes above 100K.

The problem is that even when the timestamp values are different, the points are getting overwritten.

Please suggest something!
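One possible cause worth checking, offered as a sketch and not as a confirmed diagnosis: the posted conversion applies `int()` to `datetime_aware.timestamp()` before multiplying by 1000000000, which drops the fractional seconds parsed by `%f`. Rows that differ only in their sub-second part and share the same tags would then get the same timestamp and overwrite each other. The example below uses UTC in place of `US/Eastern` to stay dependency-free, and the helper names and sample values are hypothetical.

```python
from datetime import datetime, timezone

TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_ns_truncated(text):
    """Same arithmetic as the posted code: int() runs before scaling."""
    dt = datetime.strptime(text, TIME_FORMAT).replace(tzinfo=timezone.utc)
    return int(dt.timestamp()) * 1000000000


def to_ns_exact(text):
    """Integer arithmetic that keeps the microseconds."""
    dt = datetime.strptime(text, TIME_FORMAT).replace(tzinfo=timezone.utc)
    delta = dt - EPOCH
    micros = (delta.days * 86400 + delta.seconds) * 1000000 + delta.microseconds
    return micros * 1000


# Two hypothetical rows that fall within the same second.
row_a = "2019-03-27 09:00:00.125000"
row_b = "2019-03-27 09:00:00.750000"

print(to_ns_truncated(row_a) == to_ns_truncated(row_b))  # same timestamp
print(to_ns_exact(row_a) == to_ns_exact(row_b))          # distinct timestamps
```

If the raw file contains several rows per second for the same tag set, this would also be consistent with a unique identifier tag restoring the full row count.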

MarcV · April 12, 2019, 12:04pm

Hi @Anshul_choudhary,

Is this related to your problem?

max-values-per-tag?

Anshul_choudhary · April 16, 2019, 5:24am

Hi @MarcV, no, that's not the issue. The issue is that after ingestion into InfluxDB I can see missing data when comparing with the raw file used for ingestion. I tried the unique identifier after finding out that I am losing around 8 to 10% of the data in InfluxDB; with the unique identifier as a tag value, 100% of the data is ingested.
So I need an expert suggestion from the community about what is going wrong here.
I have already attached the code I am using for ingestion. Any technique I can follow to debug this issue would also help.

MarcV · April 16, 2019, 6:47am

Hi @Anshul_choudhary,
Maybe it is a problem in the code?
You can try to load the records with value 41 with insert statements using the influx command line.
Or write the data for value 41 from your code to a logfile and check the output.
Best regards
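A minimal sketch of the logfile check, assuming the generated line protocol has been collected as a list of strings. It counts lines that share the same measurement, tag set, and timestamp, since those are the lines that would collapse into one point. The helper names and sample lines are hypothetical, and the split assumes no escaped or quoted spaces inside tags or fields.

```python
from collections import Counter


def series_key_and_time(line):
    """Return (measurement plus tags, timestamp) for one line-protocol line."""
    key, _fields, timestamp = line.strip().rsplit(" ", 2)
    return key, timestamp


def find_collisions(lines):
    """Map each (series key, timestamp) seen more than once to its count."""
    counts = Counter(series_key_and_time(l) for l in lines if l.strip())
    return {k: n for k, n in counts.items() if n > 1}


# Hypothetical lines, as the ingestion script might log them.
sample = [
    "sensor,host=a value=41 1553691600000000000",
    "sensor,host=a value=41 1553691600000000000",
    "sensor,host=b value=41 1553691600000000000",
    "sensor,host=a value=41 1553691601000000000",
]

collisions = find_collisions(sample)
for (key, ts), n in collisions.items():
    print(f"{n} lines share series {key!r} at {ts}")
```

The number of points expected in the database is the number of distinct (series key, timestamp) pairs, which can be compared with the result of count().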

Anshul_choudhary · April 16, 2019, 7:03am

Great suggestion, let me try that out too. I was trying ingestion using the JSON protocol; let me check that and I will get back soon.

Anshul_choudhary · April 16, 2019, 7:17am

Also, one more thing @MarcV: it returns ingestion status True whether the rows are inserted one by one or in batches. Could there still be a problem in the code?

MarcV · April 16, 2019, 8:07am

I don't know, Anshul, but if you write to a file and not to the database, it will become clear, I think.
