Query for sum of gpu used with nvidia-smi



hi, i’m using the nvidia-smi input to get statistics from my gpu cluster, i have numerous gpus in a few servers.

there are two searches that i would like to show from the data:

  1. how well utilised the gpus are (how efficient the codes running on them are)
  2. how many gpus are in use (as the cards are set to exclusive use mode)

i naively thought i could do a sum(), but then realised that that sums over every data point in the interval rather than over the gpus. i could do mean(), but that will be skewed somewhat if a gpu isn’t in use.

i finally came up with

SELECT max("utilization_gpu") FROM "nvidia_smi" WHERE ("host" =~ /cryoem-gpu.*/) AND time >= now() - 7d GROUP BY time(30m), "host", "uuid" fill(null)

however, what i really want to show is the sum of the max gpu utilisations per host (ie like a stack of the gpu utilisations). can i do this in influxql?
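
one shape this might take, assuming an influxdb version with subquery support (1.2 or later): wrap the per-gpu max in a subquery and sum it per host in the outer query. this is an untested sketch; the alias "max_util" is just a name made up for illustration, and the time filter is repeated in the outer query on the assumption that the outer GROUP BY time() needs its own bounds.

```sql
-- sketch only: "max_util" is an illustrative alias, not a field in nvidia_smi
SELECT sum("max_util")
FROM (
  -- inner query: max utilisation per gpu (uuid) per 30m bucket
  SELECT max("utilization_gpu") AS "max_util"
  FROM "nvidia_smi"
  WHERE "host" =~ /cryoem-gpu.*/ AND time >= now() - 7d
  GROUP BY time(30m), "host", "uuid"
)
WHERE time >= now() - 7d
-- outer query: drop "uuid" so the per-gpu maxima are summed per host
GROUP BY time(30m), "host"
```

with this, a host with 4 gpus that are all fully busy in a bucket would show 400 for that bucket.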

for the second representation, i guess what i want is if the utilisation is greater than 0, then count the gpu as being in use (not necessarily accurate, but good enough). then i want to show the total number of gpus that are in use (like the sum of each host in the above).
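
a possible sketch of that count, again assuming subquery support (influxdb 1.2 or later) and untested: take the max utilisation per gpu per bucket in a subquery, keep only the values above 0, and count what is left. the alias "max_util" is a made-up name for illustration.

```sql
-- sketch only: "max_util" is an illustrative alias, not a field in nvidia_smi
SELECT count("max_util")
FROM (
  -- inner query: max utilisation per gpu (uuid) per 30m bucket
  SELECT max("utilization_gpu") AS "max_util"
  FROM "nvidia_smi"
  WHERE "host" =~ /cryoem-gpu.*/ AND time >= now() - 7d
  GROUP BY time(30m), "host", "uuid"
)
-- a gpu counts as "in use" if its max utilisation in the bucket is above 0
WHERE "max_util" > 0 AND time >= now() - 7d
GROUP BY time(30m)
```

adding "host" to the outer GROUP BY would give the count per host instead of the cluster-wide total.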