Hi, I'm using the nvidia-smi input to get statistics from my GPU cluster; I have numerous GPUs spread across a few servers.
There are two things I would like to show from the data:
- how well utilised the GPUs are (i.e. how efficient the code running on them is)
- how many GPUs are in use (the cards are set to exclusive use mode, so each card is either running one job or idle)
I naively thought I could use sum(), but then realised that it adds up every data point in the interval, so the result depends on how many samples were collected rather than on how busy the GPUs are. I could use mean(), but that gets dragged down whenever a GPU isn't in use.
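To make the sum()/mean() problem concrete, here is a small Python sketch using made-up samples for one 30-minute window on one host with one busy GPU and one idle GPU. The values, GPU names and sampling interval are all hypothetical, and the sketch only models the arithmetic, not how InfluxDB evaluates these functions.

```python
from statistics import mean

# Hypothetical utilization_gpu samples (percent) for ONE 30-minute window
# on one host with two GPUs; names and sampling interval are assumptions.
samples = {
    "gpu-a": [90, 95, 92],  # busy GPU
    "gpu-b": [0, 0, 0],     # idle GPU
}

all_points = [v for values in samples.values() for v in values]

# sum() adds every data point, so it scales with the number of samples
# taken in the window rather than with how busy the GPUs are.
total = sum(all_points)

# The same load sampled twice as often doubles the sum.
doubled = sum(all_points + all_points)

# mean() over all points averages the busy GPU together with the idle one.
average = mean(all_points)

# max() per GPU keeps each card's peak, independent of the sample count.
per_gpu_max = {gpu: max(values) for gpu, values in samples.items()}

print(total, doubled, average, per_gpu_max)
```

With these numbers, sum() is 277 and becomes 554 at twice the sampling rate, mean() lands around 46 even though one card is nearly saturated, and the per-GPU max keeps 95 and 0.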
I finally came up with:
SELECT max("utilization_gpu") FROM "nvidia_smi" WHERE ("host" =~ /cryoem-gpu.*/) AND time >= now() - 7d GROUP BY time(30m), "host", "uuid" fill(null)
However, what I really want to show is the sum of the max GPU utilisations per host (i.e. the per-GPU utilisations stacked on top of each other). Can I do this in InfluxQL?
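What is being asked for here is a two-stage aggregation: first the max per GPU per 30-minute window (what the query above returns), then a sum of those maxima per host. InfluxQL expresses nested aggregations like this with subqueries (InfluxDB 1.2 and later), with the per-GPU max query as the inner query. The Python sketch below only models the two stages on hypothetical points; the host names follow the cryoem-gpu pattern, while the UUIDs, timestamps and values are invented.

```python
from collections import defaultdict

BUCKET_SECONDS = 30 * 60  # mirrors GROUP BY time(30m)

# Hypothetical points: (unix_time, host, uuid, utilization_gpu)
points = [
    (0,    "cryoem-gpu01", "GPU-1", 80),
    (600,  "cryoem-gpu01", "GPU-1", 95),
    (300,  "cryoem-gpu01", "GPU-2", 40),
    (900,  "cryoem-gpu01", "GPU-2", 10),
    (120,  "cryoem-gpu02", "GPU-3", 0),
    (1900, "cryoem-gpu02", "GPU-3", 70),
]


def bucket(ts):
    """Start of the 30-minute window containing ts."""
    return ts - ts % BUCKET_SECONDS


# Stage 1: max per (window, host, uuid), like the original query.
per_gpu_max = {}
for ts, host, uuid, util in points:
    key = (bucket(ts), host, uuid)
    per_gpu_max[key] = max(util, per_gpu_max.get(key, util))

# Stage 2: sum those maxima per (window, host), i.e. a "stacked" host total.
per_host_sum = defaultdict(int)
for (start, host, _uuid), peak in per_gpu_max.items():
    per_host_sum[(start, host)] += peak

for key in sorted(per_host_sum):
    print(key, per_host_sum[key])
```

Here cryoem-gpu01 gets 95 + 40 = 135 for the first window. Windows with no data simply do not appear in the result, rather than showing up as null the way they would with fill(null).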
For the second representation, I think what I want is: if a GPU's utilisation is greater than 0, count it as being in use (not necessarily accurate, but good enough). Then I want to show the total number of GPUs in use, summed per host in the same way as above.
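The in-use count can be modelled the same way: take each GPU's peak per window, treat a peak above 0 as "in use" (the heuristic from the post), count those per host, then add the per-host counts for a cluster-wide total. As before, this is a hypothetical Python sketch of the logic with invented data, not an InfluxQL query.

```python
from collections import Counter

BUCKET_SECONDS = 30 * 60  # mirrors GROUP BY time(30m)

# Hypothetical points: (unix_time, host, uuid, utilization_gpu)
points = [
    (0,   "cryoem-gpu01", "GPU-1", 85),
    (600, "cryoem-gpu01", "GPU-1", 0),
    (60,  "cryoem-gpu01", "GPU-2", 0),
    (660, "cryoem-gpu01", "GPU-2", 0),
    (30,  "cryoem-gpu02", "GPU-3", 12),
    (90,  "cryoem-gpu02", "GPU-4", 50),
]

# Peak utilisation per (window, host, uuid).
peaks = {}
for ts, host, uuid, util in points:
    key = (ts - ts % BUCKET_SECONDS, host, uuid)
    peaks[key] = max(util, peaks.get(key, util))

# Heuristic from the post: a GPU counts as "in use" in a window if its
# peak utilisation there is above 0.
in_use_per_host = Counter()
for (start, host, _uuid), peak in peaks.items():
    if peak > 0:
        in_use_per_host[(start, host)] += 1

# Total GPUs in use per window across all hosts.
in_use_total = Counter()
for (start, _host), n in in_use_per_host.items():
    in_use_total[start] += n

print(dict(in_use_per_host), dict(in_use_total))
```

For the sample data, cryoem-gpu01 has one busy card out of two and cryoem-gpu02 has two, giving a total of 3 for the window.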