I think your estimate is a little high, since InfluxDB only stores information about the actual series that exist in the database, not the entire set of possibilities. I'll try to explain a bit, and then use some of the information you provided to give you a better estimate of your expected memory usage.
Trying to predict exact memory usage is very difficult since we don't have a fixed-schema database, so our in-memory structures are dynamically sized. That said, we've generally found memory usage to grow linearly: if you have 10,000 series, it would use about 1/10th the memory of 100,000 series (again, roughly).
The index is based around series, which are made up of the measurement name and the tag key-value pairs. As an example, given the data:
cpu,host=a.example.com,region=us-east-1,core=0 user=54 idle=35 system=11
The series key is the portion before the first space (`cpu,host=a.example.com,region=us-east-1,core=0`), which is 46 bytes, but it's going to take more than a single 46-byte allocation to represent that using our in-memory index.
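To make the series key concrete, here's a tiny Python sketch (not InfluxDB code, and it ignores line-protocol escaping of spaces) that splits a point into its series key and its fields:

```python
# Split a line-protocol point into series key and fields.
# The series key is everything before the first space:
# the measurement name plus the tag key-value pairs.
line = "cpu,host=a.example.com,region=us-east-1,core=0 user=54 idle=35 system=11"

series_key, fields = line.split(" ", 1)

print(series_key)       # cpu,host=a.example.com,region=us-east-1,core=0
print(len(series_key))  # 46
```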
The most important aspect of estimating memory usage and load on the server is what we call "series cardinality" — simply put, the total number of unique series. You can read our short explanation in the documentation, but I'll reiterate the ideas here.
Going back to our example data, we have the tags `host`, `region`, and `core`. If we have 200 hosts, each with 8 cores, that would be 1600 series. If those hosts are spread across 8 regions, we still only have 1600 series. That is because `region` will only ever have one value per `host`. One reason we include `region` as a tag when it could probably be deduced from `host` is because of queries: tags allow for more efficient filtering than values would. If `region` were a value instead of a tag, a query with `where region = 'us-east-1'` would have to scan through all of the data, correlate the region value with the value we're interested in (matching the timestamps), and then filter. Since `region` is a tag, we can just scan for values in the series that are appropriate.
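The way a dependent tag avoids inflating cardinality is easiest to see by enumerating the series keys directly. A quick Python sketch (the hostnames and the host-to-region mapping are made up for illustration):

```python
# 200 hosts x 8 cores, where each host lives in exactly one of 8 regions.
hosts = [f"host-{i:03d}" for i in range(200)]
region_of = {h: f"region-{i % 8}" for i, h in enumerate(hosts)}  # one region per host

series = {
    f"cpu,core={core},host={h},region={region_of[h]}"
    for h in hosts
    for core in range(8)
}

# region adds no new combinations, because it is fully determined by host.
print(len(series))  # 1600
```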
Let's go back to your example, and consider a few things: what tags will only contain dependent values (like `region` in our example), what tags we can eliminate (things that don't necessarily make sense to group by or filter on), and what tags will have a scaling effect (`core` in our example).
- `robot_id` – Unavoidable if you want to query a single robot, though perhaps `id` would be a better name (like we used `host` instead of `hostname` in our example), since hopefully the database or measurement name would be enough to deduce that we're tracking robot measurements.
- `type_of_measurement` – Without more examples, I'm not sure what kind of values would appear here, and it might be better off as just the field name — unless there will be multiple values for each type, the same value fields could appear under multiple `type_of_measurement` values, and you'd like to be able to filter them apart. Again, I would try to find a shorter name that is still descriptive enough to be valuable.
- `unit_type` – For the purposes of analysis, I would recommend settling the units before writing data. Example: "What is the average temperature?" You probably want a single value, not one average for Celsius and one for Fahrenheit, and trying to average the raw values when both °C and °F are present would give a very strange/incorrect answer. InfluxDB doesn't have the ability to conditionally apply math to a value based on a tag value. You might be better off just calling the fields `temp_C` and `temp_F`, but trying to do analysis would still be difficult.
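One simple way to settle the units is to normalize every reading to a single unit client-side before writing. A hypothetical Python sketch (the function name is mine, not an InfluxDB API):

```python
def to_celsius(value, unit):
    """Normalize a temperature reading to Celsius before it is written."""
    if unit == "C":
        return value
    if unit == "F":
        return (value - 32) * 5.0 / 9.0
    raise ValueError(f"unknown unit: {unit}")

# Every point is then stored as a single field, so averages make sense.
print(to_celsius(212, "F"))  # 100.0
print(to_celsius(35, "C"))   # 35
```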
- `robot_type` – This seems analogous to `region` in our example: something that is directly tied to the `id`, and thus does not contribute to cardinality. However, also like `region`, you might want to aggregate or filter results based on this type, so it would probably be a good candidate for a tag. Again, perhaps with a shorter-but-still-descriptive name.
- `sitecode` – Definitely sounds like another good candidate for a tag; perhaps `site` would be sufficient as a name, though? Depending on the scenario, this could be either a scaling tag or a dependent tag. If robots are deployed to a single site and stay there forever, it's a dependent tag. If robots will visit multiple sites, then it becomes a scaling tag.
Now let's do some off-the-wall, baseless estimation!
If you have:
- 100 robots
- 350 `type_of_measurement` values each (see the note above about considerations there)
- 2 units per measurement (again, see note above)
- 1 type per robot
- 10 sites per robot
You will have 100 × 350 × 2 × 1 × 10 = 700,000 series (and this could be greatly reduced).
Using your original values, an example series key would be ~150 bytes long. But let's say the measured overhead is actually about 1KB per series (a wild, probably pessimistic estimate); that means you would end up with roughly a 700MB index for that measurement.
If you consolidate units into the fields, and have multiple fields rather than a single tag per measured value, that number starts dropping a lot.
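To see how much that consolidation helps, here's the same arithmetic with the `unit_type` tag dropped (units settled before writing) and each measurement type stored as a field name instead of a tag value — again using the made-up counts from the estimate above:

```python
robots, sites = 100, 10

# unit_type is gone (normalized client-side) and each measurement type
# becomes a field name, so neither multiplies the series count anymore.
series = robots * sites
print(series)  # 1000 series, down from 700,000
```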
I would recommend that you also check out our schema design page for some more recommendations.