Missing Data when comparing with Raw Data

#1

Hi everyone,

I hope you can help me :slight_smile:

I can see that some data points are missing when I compare the raw CSV file with the data available in InfluxDB (Chronograf).

Can you tell me in which situations this can happen?
PS: I know that a point with the same timestamp and the same tag set as an existing point overwrites it, but that is not the case in my scenario.

Raw Data:


From Influx:
image

The raw data is in EST, and the data in InfluxDB (Chronograf) is shown in IST, which is 10 hours 30 minutes ahead of EST.

Regards,
Anshul


#2

The time difference is likely due to the fact that data is stored in InfluxDB using UTC time, but adjusted for your local time zone when viewed in Chronograf.
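To make the offset concrete, here is a minimal sketch of one instant seen in the three zones mentioned in this thread. The sample time is made up, and fixed offsets (EST = UTC-5, IST = UTC+5:30) are an assumption used to keep the example standard-library only; while US daylight time is in effect the gap would be 9:30 rather than 10:30.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, assumed for illustration only (no daylight-saving handling).
EST = timezone(timedelta(hours=-5), "EST")
IST = timezone(timedelta(hours=5, minutes=30), "IST")

raw = datetime(2019, 1, 15, 9, 30, tzinfo=EST)  # made-up CSV time in EST
stored = raw.astimezone(timezone.utc)           # what the database keeps (UTC)
shown = raw.astimezone(IST)                     # what a browser in IST displays

print(stored.isoformat())                       # 2019-01-15T14:30:00+00:00
print(shown.isoformat())                        # 2019-01-15T20:00:00+05:30
print(shown.utcoffset() - raw.utcoffset())      # 10:30:00
```

All three values describe the same instant, so a shifted display on its own does not indicate missing points.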

Can you share your CSV, or an example that exhibits the same issue, and provide more information about how you are loading the data into the database, how you are querying the data back, and what data is missing? If you are receiving a successful response to your write request, no data should be missing.


#3
  1. I compared the row count reported in the logs while inserting with the raw data file, using a simple count(). Both gave exactly 233288 rows. However, after insertion, count() in InfluxDB returns 188282 rows, a difference of 45006. Note that the timestamps are not the same for the values I compared; I explain this and attach the details below.

  2. The timezone shouldn’t be an issue. I use the code below to translate the time to UTC; the file's time zone is EST. The entries that do match have exactly the expected time when translated from EST to what is stored in InfluxDB. My InfluxDB instance stores UTC+5:30. I have checked this manually.

    import datetime
    from pytz import timezone  # .localize() below is the pytz API

    # Define timezone
    TIME_FORMAT, file_timezone = '%Y-%m-%d %H:%M:%S.%f', 'US/Eastern'

    # Convert timezone
    datetime_naive = datetime.datetime.strptime(str(row[time_column]), TIME_FORMAT)
    datetime_aware = timezone(file_timezone).localize(datetime_naive)
    timestamp = int(datetime_aware.timestamp()) * 1000000000

  3. This is the code being used. I can’t share the full code because it contains other non-shareable information.

    try:
        reader = csvfile
        for row in reader:
            tags, fields = [], []

            # Update the tags
            for t in tag_columns:
                if t in row.keys():
                    tags.append("{tag}={value}".format(tag=tag_columns[t], value=format_tags(row[t])))
            # Update the values
            for f in field_columns:
                if f in row.keys():
                    fields.append("{field}={value}".format(field=field_columns[f], value=format_fields(row[f])))
            # Update the time
            datetime_naive = datetime.datetime.strptime(str(row[time_column]), TIME_FORMAT)
            datetime_aware = timezone(file_timezone).localize(datetime_naive)
            timestamp = int(datetime_aware.timestamp()) * 1000000000
            # Create the line protocol
            data_points.append("{MEASUREMENT},{tags} {fields} {time}".format(MEASUREMENT=MEASUREMENT, tags=','.join(tags), fields=','.join(fields), time=timestamp))
            # Count the number of rows that have been read
            count = count + 1
            if count > 10000000000:
                print("Exiting the for loop as rows exceeded 10M")
                break

            # Write a full batch and start a new one
            if len(data_points) % BATCH_SIZE == 0:
                line = '\n'.join(data_points)
                # print('Read %d lines ' % count)
                # print('Inserting %d data_points...' % (len(data_points)))
                ingestion_status = client.write_points(line, protocol=PROTOCOL)
                if ingestion_status is False:
                    print('Problem inserting points, exiting...')
                    exit(1)
                print("Wrote %d, and ingestion status is: %s" % (len(data_points), ingestion_status))
                data_points = []
        # Write whatever is left after the last full batch
        if len(data_points) > 0:
            line = '\n'.join(data_points)
            print('Writing the remaining rows', len(data_points))
            ingestion_status = client.write_points(line, protocol=PROTOCOL)
            print("Final rows are = %d, and ingestion status is %s" % (len(data_points), ingestion_status))
    # (the except clause of this try block is not part of the shared code)
    
  4. The comparison between the raw file and the data seen in Chronograf is shown below for value = 41.

From Raw Data when value = 41

From InfluxDB when value = 41
image
image

Here the difference can be seen for the value 41: there are 28 entries in the raw data and 26 entries in InfluxDB, even though I know that the timestamps are different and, where the timestamp is the same, the tags are different.
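One thing that may be worth checking in the conversion shown in this post: `int(datetime_aware.timestamp())` drops the fractional seconds before the value is scaled to nanoseconds, so two rows that differ only in the `%f` part of the time would end up with the same timestamp. The sketch below is only an illustration under assumptions: the sample times are made up, a fixed UTC-5 offset stands in for the `US/Eastern` localization, and `to_ns_truncated` / `to_ns_exact` are hypothetical helper names.

```python
import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
# Fixed UTC-5 offset, assumed here only to keep the sketch stdlib-only.
EST = datetime.timezone(datetime.timedelta(hours=-5))
EPOCH = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)

def to_ns_truncated(text):
    """Mirror the posted conversion: whole seconds, then scaled to ns."""
    dt = datetime.datetime.strptime(text, TIME_FORMAT).replace(tzinfo=EST)
    return int(dt.timestamp()) * 1_000_000_000

def to_ns_exact(text):
    """Keep the microseconds by doing integer arithmetic on a timedelta."""
    dt = datetime.datetime.strptime(text, TIME_FORMAT).replace(tzinfo=EST)
    delta = dt - EPOCH
    whole_seconds = delta.days * 86_400 + delta.seconds
    return whole_seconds * 1_000_000_000 + delta.microseconds * 1_000

# Two made-up rows that differ only in the fractional second.
a = "2019-01-15 09:30:00.125000"
b = "2019-01-15 09:30:00.750000"

print(to_ns_truncated(a) == to_ns_truncated(b))  # True: same timestamp
print(to_ns_exact(a) == to_ns_exact(b))          # False: timestamps stay distinct
```

If the raw file does contain several rows per second with identical tags, this would also be consistent with a unique identifier tag making every row survive.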

Can you please suggest what the issue could be? If you need anything else from my end, please let me know.


#4

@noahcrowley, is there anything you can help with here?


#5

@noahcrowley, I have been debugging this issue for a long time and found that when I add a unique identifier as a tag, all the rows are ingested, but there is still around 10% data loss when the unique identifier is removed from the ingestion. The unique identifier also seems to hit a limit as a tag value once the count goes above 100K.

The problem is that points are being overwritten even when their timestamp values are different :frowning:

Please suggest something!


#6

Hi @Anshul_choudhary,

Is this related to your problem?

max-values-per-tag?


#7

Hi @MarcV, no, that’s not the issue. The issue is that after ingestion into InfluxDB I can see missing data when comparing with the raw file used for ingestion. I tried adding the unique identifier after finding that I was losing around 8-10% of the data in InfluxDB, and with the unique identifier as a tag value, 100% of the data is ingested.
So I need a suggestion from any expert in the community about what is going wrong here.
I have already attached the code I am using for ingestion; any technique I can follow to debug this issue would also help.


#8

Hi @Anshul_choudhary,
Maybe it is a problem in the code?
You can try loading the records with value 41 with INSERT statements using the influx command line.
Or write the data for value 41 from your code to a log file and check the output.
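As a rough sketch of what checking that output could look like: group the generated line-protocol lines by measurement, tag set and timestamp, and report every group with more than one line, since such lines would be treated as the same point. The helper names `point_key` and `find_collisions` and the sample lines are made up for illustration, and the parsing assumes there are no escaped spaces or commas in the measurement or tags.

```python
from collections import Counter

def point_key(line):
    """Return (measurement, sorted tags, timestamp) for one line-protocol line.

    Assumes no escaped spaces or commas in the measurement, tag keys or tag values.
    """
    series, rest = line.strip().split(" ", 1)
    timestamp = rest.rsplit(" ", 1)[1]
    measurement, *tags = series.split(",")
    return measurement, tuple(sorted(tags)), timestamp

def find_collisions(lines):
    """Map each key that occurs more than once to its number of lines."""
    counts = Counter(point_key(line) for line in lines if line.strip())
    return {key: n for key, n in counts.items() if n > 1}

# Made-up sample; in practice this would be the data_points list or a log file.
sample = [
    "trades,symbol=ABC value=41 1547562600000000000",
    "trades,symbol=ABC value=41 1547562600000000000",
    "trades,symbol=XYZ value=41 1547562600000000000",
    "trades,symbol=ABC value=42 1547562601000000000",
]

collisions = find_collisions(sample)
for key, n in collisions.items():
    print(key, n)

# Number of lines that would not survive as separate points.
lost = len(sample) - len({point_key(line) for line in sample})
print(lost)  # 1
```

If the number printed at the end is close to the gap between the raw count and the count in the database, the lines are colliding before they ever reach the server.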
Best regards


#9

Great suggestion, let me try that out too. I was also trying ingestion using the JSON protocol; let me check that and I will get back soon.


#10

Also, one more thing @MarcV: it returns ingestion status TRUE whether the rows are inserted one by one or in batches. Could there still be a problem in the code?


#11

I don’t know, Anshul,
but if you write to a file instead of the database,
I think it will become clear.
