Hello,
I have around 100,000 numpy arrays, each of approximate shape [3600 * 256, 20]. Each one represents a recording with duration ~3600 s (one hour), num_channel ~20 channels, and a sampling rate of Fs ~256 Hz. These are medical recordings from around 25,000 patients, each with around 4 recording sessions.
I need to store these numpy arrays with some tags like (patient_id, recording_id) so I can easily filter each numpy array later to access that patient's data.
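For concreteness, a single recording array looks like this (the random data is just a stand-in):
import numpy as np

Fs = 256           # sampling rate, Hz
duration = 3600    # one hour, in seconds
num_channel = 20

# one recording: shape [3600 * 256, 20] = [921600, 20]
recording = np.random.randn(duration * Fs, num_channel)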
I need to store all of the above data in my database at creation time and emulate a structure as follows:
database
└── patient_id (multiple signals per patient; similar to a folder + some metadata)
    └── recording_id (holds the signal itself; similar to a file + some metadata)
After creating the dataset, I have the following tasks:
Task 1: Write Small Chunks of Data Every 10 Seconds
About every 10 seconds, I need to append some data to a few specific arrays. The data comes from up to 10 devices, each sending an array of shape [Fs * 10, num_channel] together with a (patient_id, recording_id) tuple.
Task 2: Read Small Chunks of Data Very Often
Given a (patient_id, recording_id) tuple and a (start_offset, read_duration) tuple, I should read the data for that specific patient and recording at an offset from the start of the recording.
My question:
How should I structure the data so that InfluxDB knows I have this kind of folder-like structure and can query it fast for reads? The write tasks seem easy to handle.
If possible, please answer in Python.
Thanks in advance.
Hello @Dariush_Karamati,
Welcome!
First, InfluxDB fields can only be strings, ints, floats, or booleans; they can’t be numpy arrays. So I’m not quite sure InfluxDB is a good fit if you’re looking to store the numpy arrays themselves.
Now, if you’re storing the recording values point by point (rather than the whole numpy array as one object), I would write line protocol like:
recordings,patient_id=1234,recording_id=abcd,channel=0 value=0.52 1715800000000000000
Where you have:
- recordings — the measurement
- patient_id, recording_id, channel — tags (tags are indexed, so filtering on them is fast; together they give you the folder-like hierarchy)
- value — the field holding a single sample
- a timestamp in nanoseconds
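For a single point like that, a minimal sketch with the Python client could look like this (the URL, token, org, and bucket are placeholders):
from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS

with InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org") as client:
    write_api = client.write_api(write_options=SYNCHRONOUS)
    # Tags identify the series; the field holds the sample value
    point = (
        Point("recordings")
        .tag("patient_id", "1234")
        .tag("recording_id", "abcd")
        .tag("channel", "0")
        .field("value", 0.52)
        .time(1715800000000000000, WritePrecision.NS)
    )
    write_api.write(bucket="my-bucket", record=point)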
Here’s an example of writing with a pandas DataFrame:
from datetime import datetime, timedelta, timezone
import numpy as np
import pandas as pd
from influxdb_client import InfluxDBClient, WriteOptions
# Config
Fs = 256 # sampling frequency
num_channels = 20
chunk_duration = 10 # seconds
num_points = Fs * chunk_duration
# Simulated signal: shape = [Fs * chunk_duration, num_channels]
signal = np.random.randn(num_points, num_channels)
# Time index
start_time = datetime.now(tz=timezone.utc)
timestamps = [start_time + timedelta(seconds=i / Fs) for i in range(num_points)]
# Flatten to long format: one row per channel per timestamp
rows = []
patient_id = "patient_001"
recording_id = "rec_20250515"
for ch in range(num_channels):
    rows.append(pd.DataFrame({
        "time": timestamps,
        "value": signal[:, ch],
        "channel": f"ch_{ch}",
        "patient_id": patient_id,
        "recording_id": recording_id
    }))

df = pd.concat(rows).set_index("time")

# Connect to InfluxDB and write the chunk; the tag columns become the folder-like hierarchy
with InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org") as client:
    with client.write_api(write_options=WriteOptions(batch_size=1000, flush_interval=10_000)) as write_api:
        write_api.write(
            bucket="my-bucket",
            org="my-org",
            record=df,
            data_frame_measurement_name="eeg_recordings",
            data_frame_tag_columns=["patient_id", "recording_id", "channel"]
        )
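For Task 2, here is a minimal sketch of reading a chunk back. It assumes you keep each recording’s start time in your own metadata (recording_start below is a placeholder), since the offsets have to be translated into absolute timestamps. The Flux query filters on the indexed tags, restricts range() to the requested window, and pivots the channel tag back into one column per channel:
from datetime import datetime, timedelta, timezone
from influxdb_client import InfluxDBClient

# Hypothetical: the wall-clock start of this recording, looked up from your own metadata
recording_start = datetime(2025, 5, 15, 8, 0, 0, tzinfo=timezone.utc)
start_offset = 120    # seconds from the start of the recording
read_duration = 10    # seconds to read

start = recording_start + timedelta(seconds=start_offset)
stop = start + timedelta(seconds=read_duration)

# Filter on tags (served from the index), restrict to the window,
# then pivot the channel tag back into columns, one row per timestamp
query = f'''
from(bucket: "my-bucket")
  |> range(start: {start.isoformat()}, stop: {stop.isoformat()})
  |> filter(fn: (r) => r._measurement == "eeg_recordings")
  |> filter(fn: (r) => r.patient_id == "patient_001" and r.recording_id == "rec_20250515")
  |> pivot(rowKey: ["_time"], columnKey: ["channel"], valueColumn: "_value")
'''

with InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org") as client:
    # query_data_frame can return a list of DataFrames for multi-table results;
    # with the pivot above it is typically a single DataFrame
    df = client.query_api().query_data_frame(query)

# Back to a [Fs * read_duration, num_channels] numpy array
channel_cols = sorted(
    (c for c in df.columns if c.startswith("ch_")),
    key=lambda c: int(c.split("_")[1]),
)
chunk = df[channel_cols].to_numpy()
Because patient_id, recording_id, and channel are tags, this filter is answered from the index rather than by scanning field values, which is what keeps these reads fast.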
I hope that helps!