Designing a Folder-Like Schema in InfluxDB for High-Volume Medical Time-Series Data, Emphasizing Fast Reads and Few Writes

Hello :sun_with_face:

I have around 100,000 numpy arrays, each of approximate shape [3600 * 256, 20], representing recordings with duration ~3600 s (one hour), num_channels ~20, and a sampling rate of Fs ~256 Hz. These are medical recordings of around 25,000 patients, each with around 4 recording sessions.

I need to store these numpy arrays with some tags like (patient_id, recording_id) so I can easily filter each numpy array later to access the patient's data.

I need to store all of the above data in my database at creation time and emulate a structure as follows:

database
   └── patient_id (Multiple signals per patient, similar to a folder + some metadata)
      └── recording_id (Holds signal itself similar to a file + some metadata)
    

After creating the dataset, I have the following tasks:

Task 1: Write Small Chunks of Data Every 10 Seconds

About every 10 seconds, I need to append some data to a few specific arrays. Data comes from up to 10 devices, where each device sends data with shape [Fs*10, num_channels] and a (patient_id, recording_id) tuple.
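For concreteness, here's a minimal sketch of what one incoming chunk looks like (the IDs are just illustrative):

import numpy as np

Fs, num_channels = 256, 20

# One 10-second chunk from a single device: shape [Fs * 10, num_channels]
chunk = np.random.randn(Fs * 10, num_channels)

# Each chunk arrives together with identifiers that tell me where to append it
# (the values here are made up for illustration)
patient_id, recording_id = "patient_017", "rec_2"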

Task 2: Read Small Chunks of Data Very Often

Given a (patient_id, recording_id) tuple and a (start_offset, read_duration) pair, I should read the data for that specific patient and recording at an offset from the start of the recording.
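In other words, I want an interface like the following (just a sketch of the signature I have in mind, not an implementation):

import numpy as np

def read_chunk(patient_id: str, recording_id: str,
               start_offset: float, read_duration: float) -> np.ndarray:
    """Return samples of shape [Fs * read_duration, num_channels], starting
    start_offset seconds after the beginning of the given recording."""
    raise NotImplementedError  # this is the part I need the database to make fast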

MY QUESTION:

How should I structure the data so InfluxDB knows I have this kind of folder-like structure and can query the data pretty FAST for READS? The write tasks seem easy to handle.
If possible, please talk in Python :grin:.

Thanks in advance.

Hello @Dariush_Karamati,
Welcome!
First, InfluxDB fields can only be strings, ints, floats, or bools. They can't be numpy arrays, so I'm not quite sure that InfluxDB is a good fit if you're looking to store numpy arrays as-is.
Now, if you're storing the individual sample values (tagged with patient and recording IDs) rather than the numpy array itself, then I would make my example line protocol look like:

recordings,patient_id=1234,recording_id=abcd,channel=0 value=0.52 1715800000000000000

Where you have:

- measurement: recordings
- tags: patient_id, recording_id, channel
- field: value = 0.52
- timestamp: 1715800000000000000 (nanosecond precision)
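If it helps, here's a minimal sketch of writing that single point with the Python client's Point builder (URL, token, org, and bucket are placeholders):

from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS

with InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org") as client:
    write_api = client.write_api(write_options=SYNCHRONOUS)
    # Mirrors the line protocol above: measurement, three tags, one field, ns timestamp
    point = (
        Point("recordings")
        .tag("patient_id", "1234")
        .tag("recording_id", "abcd")
        .tag("channel", "0")
        .field("value", 0.52)
        .time(1715800000000000000, WritePrecision.NS)
    )
    write_api.write(bucket="my-bucket", record=point)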

Here's an example of writing with a pandas DataFrame:

from datetime import datetime, timedelta, timezone
import numpy as np
import pandas as pd

from influxdb_client import InfluxDBClient, WriteOptions

# Config
Fs = 256  # sampling frequency
num_channels = 20
chunk_duration = 10  # seconds
num_points = Fs * chunk_duration

# Simulated signal: shape = [Fs * chunk_duration, num_channels]
signal = np.random.randn(num_points, num_channels)

# Time index
start_time = datetime.now(tz=timezone.utc)
timestamps = [start_time + timedelta(seconds=i / Fs) for i in range(num_points)]

# Flatten to long format: one row per channel per timestamp
rows = []
patient_id = "patient_001"
recording_id = "rec_20250515"

for ch in range(num_channels):
    rows.append(pd.DataFrame({
        "time": timestamps,
        "value": signal[:, ch],
        "channel": f"ch_{ch}",
        "patient_id": patient_id,
        "recording_id": recording_id
    }))

df = pd.concat(rows).set_index("time")

# Connect to InfluxDB
with InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org") as client:
    with client.write_api(write_options=WriteOptions(batch_size=1000, flush_interval=10_000)) as write_api:
        write_api.write(
            bucket="my-bucket",
            org="my-org",
            record=df,
            data_frame_measurement_name="eeg_recordings",
            data_frame_tag_columns=["patient_id", "recording_id", "channel"]
        )
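
For your read task, here's a sketch of the query side under the same schema. It assumes you keep each recording's start time as metadata somewhere (so you can turn (start_offset, read_duration) into an absolute time range) and that recording_start is a timezone-aware UTC datetime; the bucket and measurement names match the example above:

from datetime import timedelta

def read_chunk(client, patient_id, recording_id, recording_start, start_offset, read_duration):
    """Query one recording's samples as a pandas DataFrame, one column per channel.

    recording_start: timezone-aware datetime of the recording's first sample,
    so isoformat() yields a valid RFC3339 timestamp for Flux.
    start_offset, read_duration: seconds.
    """
    start = recording_start + timedelta(seconds=start_offset)
    stop = start + timedelta(seconds=read_duration)
    flux = f'''
from(bucket: "my-bucket")
  |> range(start: {start.isoformat()}, stop: {stop.isoformat()})
  |> filter(fn: (r) => r._measurement == "eeg_recordings")
  |> filter(fn: (r) => r.patient_id == "{patient_id}" and r.recording_id == "{recording_id}")
  |> pivot(rowKey: ["_time"], columnKey: ["channel"], valueColumn: "_value")
'''
    return client.query_api().query_data_frame(flux)

Because channel is a tag, the pivot turns each channel's series into its own column, so you can select the ch_* columns and call .to_numpy() to get back a [samples, channels] array.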

I hope that helps!