How can I estimate the index size based on tags?

Hello,

I was wondering how I could estimate the size of the index (RAM usage) based on my tag set.
First of all, is there a formula or a tool to simulate index size or RAM usage?

Then, here is my use case:

I have information that will probably need to be tags (the type of measure taken and the unit used), but I might have 350 entries in this “type_of_measure” tag for each robot, with maybe 100 robots.

Based on UTF-8 encoding, here is what I estimated for tags and RAM usage:

robot_id: about 10 chars = 10 bytes
type_of_measure: about 30 chars = 30 bytes
unit_type: about 10 chars = 10 bytes
robot_type: about 10 chars = 10 bytes
sitecode: about 30 chars = 30 bytes

As robot_id is the “unique identifier” for my robot, I will multiply everything that is not shared among all robots by it.
Assuming 100 robots, 300 types for each robot, 30 shared unit types, 4 shared robot types, and one shared sitecode:

(robot_id * number of type_of_measure) * (shared things => number of unit_type * number of robot_type * sitecode)
((10 * 1) * (30 * 300)) * ((10 * 30) * (10 * 4) * (30 * 1)) = 32,400,000,000 bytes => 32.4 GB, which is already far too much, considering that there might be 1 or 2 more tags with low occurrence, and that’s for a single measurement…
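For what it’s worth, here is the same arithmetic as a short, runnable Python sketch (the byte counts and cardinalities are the rough guesses above, not measured values):

```python
# Rough per-value byte sizes guessed above (UTF-8, ~1 byte per char)
ROBOT_ID_BYTES = 10
TYPE_BYTES = 30
UNIT_BYTES = 10
ROBOT_TYPE_BYTES = 10
SITECODE_BYTES = 30

# Multiply the per-robot part by everything "shared", as described above
per_robot = (ROBOT_ID_BYTES * 1) * (TYPE_BYTES * 300)
shared = (UNIT_BYTES * 30) * (ROBOT_TYPE_BYTES * 4) * (SITECODE_BYTES * 1)
estimate = per_robot * shared

print(estimate)              # 32400000000 bytes
print(estimate / 1e9, "GB")  # 32.4 GB
```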

Would that be correct? Or not at all?
Maybe I should split measurements by unit type or something? I don’t know…

Hi @NoxWorld,

I think your estimate is a little high, since InfluxDB only stores information about the actual series that exist in the database, not the entire set of possibilities. I’ll try to explain a little, and then use some of the information you provided to give you a better estimate of your expected memory usage.

Predicting exact memory usage is very difficult because InfluxDB is not a fixed-schema database, so our in-memory structures are dynamically sized. That said, we’ve generally found memory usage to grow linearly: if you have 10,000 series, expect roughly 1/10th the memory usage of 100,000 series (again, roughly).

The index is based around series, which are made up of the measurement name and the tag key-value pairs. As an example, given the data:

cpu,host=a.example.com,region=us-east-1,core=0 user=54 idle=35 system=11
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                      Series Key

The series key is highlighted; it is 46 bytes, but it will take more than a single 46-byte allocation to represent it in our in-memory index.
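To make the series-key boundary concrete, here is a tiny Python sketch that splits a line-protocol point into its series key and its fields (a simplification: it ignores escaped spaces and commas, which the real parser handles):

```python
def series_key(line: str) -> str:
    """Return the measurement + tag set portion of a line-protocol point.

    Simplified: assumes no escaped spaces or commas in tag values.
    """
    key, _fields = line.split(" ", 1)
    return key

line = "cpu,host=a.example.com,region=us-east-1,core=0 user=54 idle=35 system=11"
key = series_key(line)
print(key)       # cpu,host=a.example.com,region=us-east-1,core=0
print(len(key))  # 46
```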

The most important factor in estimating memory usage and load on the server is what we call “series cardinality”: simply put, the total number of unique series. You can read our short explanation here, but I’ll reiterate the ideas below.

Going back to our example data, we have the tags host, region, and core. If we have 200 hosts, each with 8 cores, that is 1600 series. If those hosts are spread across 8 regions, we still only have 1600 series, because region will only ever have one value per host. One reason we include region as a tag, even though it could probably be deduced from host, is queries: tags allow for more efficient filtering than field values would. If region were a field instead of a tag, a query with where region = 'us-east-1' would have to scan through all of the data, correlate the region value with the value we’re interested in (matching the timestamps), and then filter. Since region is a tag, we can just scan the values in the appropriate series.
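That counting rule can be sketched in a few lines (hypothetical numbers from the example above): scaling tags multiply into cardinality, dependent tags don’t:

```python
hosts = 200
cores_per_host = 8
regions = 8  # dependent on host: each host lives in exactly one region

# Only tags that vary independently multiply into series cardinality
series = hosts * cores_per_host
print(series)  # 1600 -- region adds nothing, since it is fixed per host

# If region were independent of host (it isn't here), it would multiply:
worst_case = hosts * cores_per_host * regions
print(worst_case)  # 12800
```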

Let’s go back to your example, and consider a few things: What tags will only contain dependent values (like region in our example), what tags can we eliminate (things that don’t necessarily make sense to group by/filter on), and what tags will have a scaling effect (core in our example).

  • robot_id – unavoidable if you want to query a single robot. Perhaps id would be a better name (like we used host instead of hostname in our example), since hopefully the database or measurement name is enough to convey that we’re tracking robot measurements.
  • type_of_measurement – Without more examples, I’m not sure what kind of values would appear here; it might be better off as just the field name. The exception is if there will be multiple values for each type, the same fields could appear under multiple type_of_measurement values, and you’d like to be able to filter on them. Again, I would try to find a shorter name that is still descriptive enough to be valuable.
  • unit_type – For the purposes of analysis, I would recommend settling on units before writing data. Example: “What is the average temperature?” You probably want a single value, not one average for Celsius and one for Fahrenheit. Averaging the raw values, if both °C and °F are present, would give a very strange/incorrect answer, and InfluxDB cannot conditionally apply math to a value based on a tag value. You might be better off just calling the fields temp_C and temp_F, but analysis would still be difficult.
  • robot_type – This seems analogous to region in our example: something directly tied to the id, so it does not contribute to cardinality. But also like region, you might want to aggregate or filter results based on this type, so it would probably be a good candidate for a tag. Again, perhaps with a shorter-but-still-descriptive name.
  • sitecode – Definitely sounds like another good candidate for a tag; perhaps site would be sufficient as a name, though? Depending on the scenario, this could be either a scaling tag or a dependent tag: if robots are deployed to a single site and stay there forever, it’s a dependent tag; if robots will visit multiple sites, it becomes a scaling tag.

Now let’s do some off-the-wall, baseless estimations!

If you have:

  • 100 robots
  • 350 type_of_measurements each (see the note above about considerations there)
  • 2 units per measurement (again, see note above)
  • 1 type per robot
  • 10 sites per robot

You will have 700,000 series (this could be greatly reduced)

Using your original values, an example series key would be ~150 bytes long, but let’s say the actual in-memory overhead is about 1KB per series (a wild estimate, probably an over-estimate). That means you would end up with roughly a 700MB index for that measurement.
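Here is that back-of-the-envelope estimate in code (all numbers are the rough guesses above, including the deliberately pessimistic 1KB-per-series overhead):

```python
robots = 100
types_per_robot = 350
units_per_measurement = 2
robot_types_per_robot = 1  # dependent tag: contributes nothing extra
sites_per_robot = 10       # scaling tag in this scenario

# Only the independently-varying tags multiply into cardinality
series = robots * types_per_robot * units_per_measurement * sites_per_robot
print(series)  # 700000

BYTES_PER_SERIES = 1024  # wild over-estimate of in-memory overhead per series
index_mb = series * BYTES_PER_SERIES / 1e6
print(round(index_mb), "MB")  # ~717 MB, i.e. roughly the 700MB figure above
```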

If you consolidate units into the fields, using multiple fields rather than a tag per measured value, that number starts dropping a lot.

I would recommend that you also check out our schema design page for some more recommendations.

Thanks a lot for your answer.

So in my use case :

  • robot_id - will be unique per robot, and yes, I’m very verbose in my naming, but I really don’t want people to confuse this with a data id like those found in relational databases.
  • type_of_measurement - Long story short: all the data comes from robot Y, but robot Y may take multiple kinds of measurements from an installation. For example, I can have temperatures, watts, or other units, and these units can often be converted into each other, so writing the unit alone doesn’t necessarily mean anything. I really need something like this, because it’s the only way to track the meaning of the value and which sensor of robot Y it comes from.
  • unit_type - actually this is the unit used (°F, °C, kWh, V, …)
  • robot_type - there should only be a very few values here, but our robots can take the same kind of measurement, so it’s nice to know the type when there is a problem (it can be OS-related for a specific robot type, for example).
  • sitecode - This is actually a poor fit here, because in fact a robot can only be present on one site and has one associated sitecode. So I’ll move that into my “infos_robot” measurement.

I’ll come back with more details ASAP.

Thanks again !

Regarding units, hopefully you have some control over whatever is writing this data to the database. I would recommend naming the fields appropriately (temp, power, batt_voltage, etc.) and deciding on a unit ahead of time: power is always in watts, batt_voltage is always in volts, and pick one scale for temperature, converting at the source if necessary. Then, on the analysis/application side, you will already know the units and won’t need the extra tag.
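A minimal sketch of that convert-at-the-source idea (the function and field names are my own illustration, not an InfluxDB feature):

```python
def to_celsius(value: float, unit: str) -> float:
    """Normalize a temperature reading to Celsius before it is written."""
    if unit in ("C", "°C"):
        return value
    if unit in ("F", "°F"):
        return (value - 32) * 5 / 9
    raise ValueError(f"unknown temperature unit: {unit}")

# Whatever the sensor reports, the stored field (say, temp) is always in °C:
print(to_celsius(212, "F"))   # 100.0
print(to_celsius(21.5, "C"))  # 21.5
```

With the unit fixed per field, queries like `mean(temp)` give one meaningful answer, and the unit_type tag disappears from the cardinality calculation entirely.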

I would really suggest you consider providing full example data, including fields and some representative queries.

That’s one of my main problems.

  • I have no idea what queries they will perform; I can only guess.
  • I know they will run queries over large periods of time (months or even years).
  • I have no control over the incoming units, and I’m asked to display a specific unit depending on the data requested.

The good news: I’m the only one writing to the database; every single data point goes through one of my scripts, which processes it and inserts it into Influx.
The bad news: I have no idea which unit is coming, and it can even change over time…
E.g.: “We forgot to set a unit, so the unit is N.D.” Then later: “We have now set a unit; the unit is now kWh.”

And I need the database to keep working through all of that…
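To keep ingestion running despite that, I imagine my script will need some defensive unit mapping, roughly like this (just a sketch; the names are made up):

```python
# Hypothetical normalization table for units as they arrive over time.
# "N.D." (no unit set yet) maps to None so it can be handled explicitly.
UNIT_ALIASES = {
    "N.D.": None,
    "kwh": "kWh",
    "KWH": "kWh",
    "kWh": "kWh",
}

def normalize_unit(raw: str):
    # Unknown units pass through unchanged so ingestion never stops
    cleaned = raw.strip()
    return UNIT_ALIASES.get(cleaned, cleaned)

print(normalize_unit(" kwh "))  # kWh
print(normalize_unit("N.D."))   # None
print(normalize_unit("V"))      # V
```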

I’ll try providing some example data, but I don’t know if I’m allowed to do that ^^