I think your estimate is a little high, since InfluxDB only stores information about the actual series that exist in the database, not the entire set of possibilities. I'll try to explain a bit, and then use some of the information you provided to give you a better estimate of your expected memory usage.
Trying to predict exact memory usage is very difficult since we don't have a fixed-schema database, so our in-memory structures are dynamically sized. That said, we've generally found memory usage to grow linearly: if you have 10,000 series, it would use about 1/10th the memory of 100,000 series (again, roughly).
The index is based around series, which are made up of the measurement name and the tag key-value pairs. As an example, given the data:
cpu,host=a.example.com,region=us-east-1,core=0 user=54 idle=35 system=11
The series key is the portion before the first space (`cpu,host=a.example.com,region=us-east-1,core=0`), which is 46 bytes, but it's going to take more than a single 46-byte allocation to represent that using our in-memory index.
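To make the series key concrete, here's a tiny Python sketch (not InfluxDB code, and it ignores line-protocol escaping of spaces) that splits a point into its series key and its fields:

```python
# Split a line-protocol point into series key and fields.
# The series key is everything before the first space:
# the measurement name plus the tag key-value pairs.
line = "cpu,host=a.example.com,region=us-east-1,core=0 user=54 idle=35 system=11"

series_key, fields = line.split(" ", 1)

print(series_key)       # cpu,host=a.example.com,region=us-east-1,core=0
print(len(series_key))  # 46
```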
The most important aspect of estimating memory usage and load on the server is what we call "series cardinality" — simply put, the total number of unique series. You can read our short explanation in the documentation, but I'll reiterate the ideas here.
Going back to our example data, we have the tags `host`, `region`, and `core`. If we have 200 hosts, each with 8 cores, that would be 1600 series. If those hosts are spread across 8 regions, we still only have 1600 series. That is because `region` will only ever have one value per `host`. One reason we include `region` as a tag when it could probably be deduced from `host` is because of queries: tags allow for more efficient filtering than values would. If `region` were a value instead of a tag, a query with `where region = 'us-east-1'` would have to scan through all of the data, correlate the region value with the value we're interested in (matching the timestamps), and then filter. Since `region` is a tag, we can just scan for values in the series that are appropriate.
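The way a dependent tag avoids inflating cardinality is easiest to see by enumerating the series keys directly. A quick Python sketch (the hostnames and the host-to-region mapping are made up for illustration):

```python
# 200 hosts x 8 cores, where each host lives in exactly one of 8 regions.
hosts = [f"host-{i:03d}" for i in range(200)]
region_of = {h: f"region-{i % 8}" for i, h in enumerate(hosts)}  # one region per host

series = {
    f"cpu,core={core},host={h},region={region_of[h]}"
    for h in hosts
    for core in range(8)
}

# region adds no new combinations, because it is fully determined by host.
print(len(series))  # 1600
```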
Let's go back to your example, and consider a few things: what tags will only contain dependent values (like `region` in our example), what tags we can eliminate (things that don't necessarily make sense to group by or filter on), and what tags will have a scaling effect (`core` in our example).
- `robot_id` – Unavoidable if you want to query a single robot, though perhaps `id` would be a better name (like we used `host` instead of `hostname` in our example), since hopefully the database or measurement name would be enough to deduce that we're tracking robot measurements.
- `type_of_measurement` – Without more examples, I'm not sure what kind of values would appear here, and it might be better off as just the field name — unless there will be multiple values for each type, the same value fields could appear under multiple `type_of_measurement` values, and you'd like to be able to filter them apart. Again, I would try to find a shorter name that is still descriptive enough to be valuable.
- `unit_type` – For the purposes of analysis, I would recommend settling the units before writing data. Example: "What is the average temperature?" You probably want a single value, not one average for Celsius and one for Fahrenheit, and trying to average the raw values when both °C and °F are present would give a very strange/incorrect answer. InfluxDB doesn't have the ability to conditionally apply math to a value based on a tag value. You might be better off just calling the fields `temp_C` and `temp_F`, but trying to do analysis would still be difficult.
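One simple way to settle the units is to normalize every reading to a single unit client-side before writing. A hypothetical Python sketch (the function name is mine, not an InfluxDB API):

```python
def to_celsius(value, unit):
    """Normalize a temperature reading to Celsius before it is written."""
    if unit == "C":
        return value
    if unit == "F":
        return (value - 32) * 5.0 / 9.0
    raise ValueError(f"unknown unit: {unit}")

# Every point is then stored as a single field, so averages make sense.
print(to_celsius(212, "F"))  # 100.0
print(to_celsius(35, "C"))   # 35
```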
- `robot_type` – This seems analogous to `region` in our example: something that is directly tied to the `id`, and thus does not contribute to cardinality. However, also like `region`, you might want to aggregate or filter results based on this type, so it would probably be a good candidate for a tag. Again, perhaps with a shorter-but-still-descriptive name.
- `sitecode` – Definitely sounds like another good candidate for a tag; perhaps `site` would be sufficient as a name, though? Depending on the scenario, this could be either a scaling tag or a dependent tag. If robots are deployed to a single site and stay there forever, it's a dependent tag. If robots will visit multiple sites, then it becomes a scaling tag.
Now let's do some off-the-wall, baseless estimation!
If you have:
- 100 robots
- 350 `type_of_measurement` values each (see the note above about considerations there)
- 2 units per measurement (again, see note above)
- 1 type per robot
- 10 sites per robot
You will have 100 × 350 × 2 × 1 × 10 = 700,000 series (and this could be greatly reduced).
Using your original values, an example series key would be ~150 bytes long. But let's say the measured overhead is actually about 1KB per series (a wild, probably pessimistic estimate); that means you would end up with roughly a 700MB index for that measurement.
If you consolidate units into the fields, and have multiple fields rather than a single tag per measured value, that number starts dropping a lot.
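To see how much that consolidation helps, here's the same arithmetic with the `unit_type` tag dropped (units settled before writing) and each measurement type stored as a field name instead of a tag value — again using the made-up counts from the estimate above:

```python
robots, sites = 100, 10

# unit_type is gone (normalized client-side) and each measurement type
# becomes a field name, so neither multiplies the series count anymore.
series = robots * sites
print(series)  # 1000 series, down from 700,000
```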
I would recommend that you also check out our schema design page for some more recommendations.