Advice needed for storing, aggregating and showing distributed web "hits"

I’m stuck with some annoying corner-cases that can create wrong values or nasty drops/peaks in my graphs, so I’d appreciate some advice from anyone who has faced similar issues!

I have multiple web servers on which I’m gathering the following extra info using telegraf every 10s :

  • Hits (i.e. pages served)
  • Time (average page load time, a single value already averaged over all the Hits in that interval)

A data point looks like this:

mystats hits=512i,perf=0.0269 1516201001679044096

First question: should “hits” be a counter or a gauge?
I’ve tried both. A counter seems best for getting correct values for “total hits over the last 24h”, but counter resets make that less simple than it sounds. A gauge is nice for graphing rates and has no counter-reset issues, but “late” data then creates peaks in the graphs.
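To make the counter-reset problem concrete, here is a minimal Python sketch of how counter readings could be turned into per-interval hit counts. The function name `counter_to_deltas` and the sample readings are made up for illustration, and the sketch assumes that a reading lower than the previous one means the counter restarted from zero.

```python
from typing import Iterable, List, Optional


def counter_to_deltas(samples: Iterable[int]) -> List[int]:
    """Turn successive counter readings into per-interval increments.

    Assumption: a reading lower than the previous one means the counter
    was reset to zero, so the new reading itself is taken as the
    increment since the reset.
    """
    deltas: List[int] = []
    previous: Optional[int] = None
    for value in samples:
        if previous is not None:
            if value >= previous:
                deltas.append(value - previous)
            else:
                # Counter reset: count only what accumulated after it.
                deltas.append(value)
        previous = value
    return deltas


# Hypothetical readings taken every 10s; the drop to 30 is a reset.
readings = [100, 150, 210, 30, 90]
print(counter_to_deltas(readings))  # [50, 60, 30, 60]
```

Hits that happened between the last reading before the reset and the reset itself are lost with this approach, so a 24h total built from these deltas would be a slight undercount.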

What I’m looking for is:

  • Graphs for the total hits/s
  • Graphs for the average time
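As a rough illustration of what these two graphs need per 10s interval, the Python sketch below combines per-server points into a total hits/s value and a hit-weighted average time. The function name `combine_interval` and the sample numbers are hypothetical. The sketch assumes each server reports the hits for that interval (not a running counter) together with its already-averaged time, which is why the times are weighted by hits rather than simply averaged or summed.

```python
from typing import List, Tuple


def combine_interval(
    points: List[Tuple[int, float]], interval_s: float = 10.0
) -> Tuple[float, float]:
    """Combine (hits, avg_time) pairs from several servers for one interval.

    Returns (total hits per second, hit-weighted average time).
    """
    total_hits = sum(hits for hits, _ in points)
    if total_hits == 0:
        # No traffic in this interval: nothing to average.
        return 0.0, 0.0
    weighted_time = sum(hits * avg for hits, avg in points) / total_hits
    return total_hits / interval_s, weighted_time


# Hypothetical interval: a busy fast server and a quiet slow one.
rate, avg_time = combine_interval([(500, 0.02), (100, 0.08)])
print(rate)      # 60.0 hits/s
print(avg_time)  # approximately 0.03, not the unweighted mean of 0.05
```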

Since there are a lot of servers involved, I have already created a continuous query for this:

 SELECT sum(*) INTO "mystats_sum"."autogen".:MEASUREMENT FROM "mystats" GROUP BY time(10s), domain

The problem is that I initially had counters, so when a single server’s counter resets, the sum() across all servers suddenly drops, which shows up as a nasty dip in the graph.
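The effect can be reproduced outside the database. The Python sketch below uses two made-up servers (`web1`, `web2`) with invented readings and compares summing the raw counters first against computing per-server increments first; the reset handling assumes a counter restarts from zero. It only illustrates the order of operations and is not a statement about how the continuous query should be written.

```python
from typing import Dict, List


def deltas(readings: List[int]) -> List[int]:
    """Per-interval increments, treating a drop as a reset to zero."""
    return [
        cur - prev if cur >= prev else cur
        for prev, cur in zip(readings, readings[1:])
    ]


# Hypothetical counters sampled every 10s; web2 resets at the third sample.
servers: Dict[str, List[int]] = {
    "web1": [1000, 1050, 1100, 1150],
    "web2": [2000, 2040, 20, 60],
}

# Sum the raw counters first, then take differences: the reset leaks through.
summed = [sum(values) for values in zip(*servers.values())]
naive = [cur - prev for prev, cur in zip(summed, summed[1:])]
print(naive)  # [90, -1970, 90]

# Take differences per server first, then sum: the reset stays contained.
per_server = [deltas(values) for values in servers.values()]
safe = [sum(column) for column in zip(*per_server)]
print(safe)  # [90, 70, 90]
```

In InfluxQL terms, the second variant roughly corresponds to applying something like non_negative_derivative() to each server's series before summing, although that function discards the interval containing the reset instead of counting the post-reset value.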