influxDB python query is slow

hello im comparing some DB for my projects.
and using python API to upload DataFrame and querying.

comparing influxDB with mssql, influxdb shows slower querying speed than mssql.
i queried very simple query but mssql showed better speed than influxDB.

my server spec :
Ryzen 48-core
128GB memory
SSD
influxDB 1.8.10

from htop i saw that mssql loading all-core constantly, but influxDB loading all-core sparsely.
it usually loads single-core.

is there any problem on my code?
or should i optimize influxDB it self?

i`m new to influxdb and db work so maybe it could look silly…
thank you!

Performance chart(sec by rows)

| (sec)    | No Cond  |          | Cond(10%)|          |Cond(50%) |          |
------------------------------------------------------------------------------
|          | MS       | IFL      | MS       | IFL      | MS       | IFL      |
| DAY      | 0.8904   | 2.8463   | 20.388   | 0.5493   | 0.5991   | 30.0634  |
| WEEK     | 6.1464   | 20.6774  | 0.7643   | 4.0616   | 3.175    | 11.1019  |
| MONTH    | 25.8782  | 93.2356  | 3.1635   | 15.4812  | 13.6533  | 51.1597  |
| YEAR     | 901.7113 | inf      | 54.0475  | 188.9322 | 165.214  | 1780.7145|

Used queries for test

##### influxDB_no_condition
temp = client.query("SELECT Error_Code FROM test \
                WHERE time >= '2022-04-20 00:00:00' \
                AND time < '2022-04-21 00:00:00'")
##### MS-SQL_no_condition
cursor.execute("SELECT Error_Code FROM test \
                WHERE Regist_Date >= '2022-04-20 00:00:00' \
                AND Regist_Date < '2022-04-21 00:00:00'")
temp = cursor.fetchall()
##### influxDB_cond
temp = client.query("SELECT Error_Code FROM test \
                WHERE Idx % 10 = 1 \
                AND time >= '2022-04-20 00:00:00' \
                AND time < '2022-04-21 00:00:00'")
##### MS-SQL_cond
cursor.execute("SELECT Error_Code FROM test \
                WHERE Idx % 10 = 1 \
                AND Regist_Date >= '2022-04-20 00:00:00' \
                AND Regist_Date < '2022-04-21 00:00:00'")
temp = cursor.fetchall()

performance compare time measuring way

same way to mssql

sum_ = 0
for i in range(round_):
    stamp = time.time()
    temp = client.query("SELECT Error_Code FROM history_5x \
                     WHERE time >= '2022-04-20 00:00:00' \
                     AND time < '2022-04-21 00:00:00'")
    sum_ += (time.time() - stamp)
    
print(round(sum_ / round_, 4), 'sec')

influxDB Python API dummy db generate code

i expanded my orginal sensor dataset (21 columns, 60000row) to 500,000,000 rows
shuffled data to make dummy data and duplicated to fill DB
dummy time index freq = 100ms(0.1s)

host = 'localhost'
port = '9000'

client = influxdb.DataFrameClient(host, port, timeout= 30_000)


...


#21 columns, 60000row, 'Regist_Date' Column is timeseries index
xlsx = pd.read_excel('./GEIM_history_sample.xlsx', sheet_name='Data-History')

#NaN to ''
xlsx['Modify_Date'] = xlsx['Modify_Date'].astype('int64')
xlsx['AlertDATE'] = xlsx['AlertDATE'].astype('int64')

xlsx = xlsx.replace(np.nan, '',regex=True)
xlsx = xlsx.replace(-9223372036854775808, 0,regex=True)

#generate dummy timeformat freq 100ms
time_table = pd.date_range(start = '2022-1-31 20', \
                           periods= 500000000, freq = '100ms')

for i in range(10000):
    #shuffle original data to make dummy data
    df = xlsx.drop(['Index', 'Regist_Date'], axis = 1)
    df_shuffled = df.sample(frac=1)
    df = df_shuffled[:50000]
    
    #insert ordered dummy index & time index 'Regist_Date'
    df.insert(0, 'Idx', np.arange(50000 * i, 50000 * (i + 1)))
    df.insert(5, 'Regist_Date', time_table[50000 * i: 50000 * (i + 1)])
    df.set_index('Regist_Date', inplace = True)

    client.write_points(df,'test', protocol = 'line')
    print(i)
1 Like

The query itself seems simple enough and there is not much to try in order to improve its performance:

  • are you filtering a tag or a field?
  • you can try to pass an explicit list of inclusion/exclusions (or a regex)

you can use EXPLAIN or EXPLAIN ANALYZE in order to have a breakdown of what’s going on, you may get some useful info about of it.

On a broader topic, performance depends on the kind of query you are trying to run, and more generally on your use case.

If all you need is a plain select with some filters MSSQL might be faster as it also allows for flexible indexing. (not your sample as WHERE Idx % 10 = is not sargable).

Meanwhile, InfluxDB has no way of having flexible indexing, it indexes tags (which are saved only as strings) and that’s it. You can also filter on fields, which are not indexed but with a worse performance.

The two DB engines are completely different and are meant for different purposes, SQL DBs are definitely more flexible and can be used for a broader range of cases compared to Time-Series DBs, which given the name shine in just one specific usage.

Can you provide some more context?
What kind of data are you storing and what do you need out of those data?

(compliments for the post, very well made and detailed)