hello im comparing some DB for my projects.
and using python API to upload DataFrame and querying.
comparing influxDB with mssql, influxdb shows slower querying speed than mssql.
i queried very simple query but mssql showed better speed than influxDB.
my server spec :
Ryzen 48-core
128GB memory
SSD
influxDB 1.8.10
from htop i saw that mssql loading all-core constantly, but influxDB loading all-core sparsely.
it usually loads single-core.
is there any problem on my code?
or should i optimize influxDB it self?
i`m new to influxdb and db work so maybe it could look silly…
thank you!
Performance chart(sec by rows)
| (sec) | No Cond | | Cond(10%)| |Cond(50%) | |
------------------------------------------------------------------------------
| | MS | IFL | MS | IFL | MS | IFL |
| DAY | 0.8904 | 2.8463 | 20.388 | 0.5493 | 0.5991 | 30.0634 |
| WEEK | 6.1464 | 20.6774 | 0.7643 | 4.0616 | 3.175 | 11.1019 |
| MONTH | 25.8782 | 93.2356 | 3.1635 | 15.4812 | 13.6533 | 51.1597 |
| YEAR | 901.7113 | inf | 54.0475 | 188.9322 | 165.214 | 1780.7145|
Used queries for test
##### influxDB_no_condition
temp = client.query("SELECT Error_Code FROM test \
WHERE time >= '2022-04-20 00:00:00' \
AND time < '2022-04-21 00:00:00'")
##### MS-SQL_no_condition
cursor.execute("SELECT Error_Code FROM test \
WHERE Regist_Date >= '2022-04-20 00:00:00' \
AND Regist_Date < '2022-04-21 00:00:00'")
temp = cursor.fetchall()
##### influxDB_cond
temp = client.query("SELECT Error_Code FROM test \
WHERE Idx % 10 = 1 \
AND time >= '2022-04-20 00:00:00' \
AND time < '2022-04-21 00:00:00'")
##### MS-SQL_cond
cursor.execute("SELECT Error_Code FROM test \
WHERE Idx % 10 = 1 \
AND Regist_Date >= '2022-04-20 00:00:00' \
AND Regist_Date < '2022-04-21 00:00:00'")
temp = cursor.fetchall()
performance compare time measuring way
same way to mssql
sum_ = 0
for i in range(round_):
stamp = time.time()
temp = client.query("SELECT Error_Code FROM history_5x \
WHERE time >= '2022-04-20 00:00:00' \
AND time < '2022-04-21 00:00:00'")
sum_ += (time.time() - stamp)
print(round(sum_ / round_, 4), 'sec')
influxDB Python API dummy db generate code
i expanded my orginal sensor dataset (21 columns, 60000row) to 500,000,000 rows
shuffled data to make dummy data and duplicated to fill DB
dummy time index freq = 100ms(0.1s)
host = 'localhost'
port = '9000'
client = influxdb.DataFrameClient(host, port, timeout= 30_000)
...
#21 columns, 60000row, 'Regist_Date' Column is timeseries index
xlsx = pd.read_excel('./GEIM_history_sample.xlsx', sheet_name='Data-History')
#NaN to ''
xlsx['Modify_Date'] = xlsx['Modify_Date'].astype('int64')
xlsx['AlertDATE'] = xlsx['AlertDATE'].astype('int64')
xlsx = xlsx.replace(np.nan, '',regex=True)
xlsx = xlsx.replace(-9223372036854775808, 0,regex=True)
#generate dummy timeformat freq 100ms
time_table = pd.date_range(start = '2022-1-31 20', \
periods= 500000000, freq = '100ms')
for i in range(10000):
#shuffle original data to make dummy data
df = xlsx.drop(['Index', 'Regist_Date'], axis = 1)
df_shuffled = df.sample(frac=1)
df = df_shuffled[:50000]
#insert ordered dummy index & time index 'Regist_Date'
df.insert(0, 'Idx', np.arange(50000 * i, 50000 * (i + 1)))
df.insert(5, 'Regist_Date', time_table[50000 * i: 50000 * (i + 1)])
df.set_index('Regist_Date', inplace = True)
client.write_points(df,'test', protocol = 'line')
print(i)