Hello all, this is my first post to this forum. I'm really new to InfluxDB and machine learning, and I'm hoping someone can help me with what I assume is a common goal for anyone monitoring network devices (routers/switches) for interface bandwidth usage. My immediate application is a carrier environment, but I think it would apply to an enterprise environment easily enough.
The ultimate goal is to collect interface stats (in/out octets) on enabled and operational interfaces. I'd like a machine learning algorithm processing the data and making predictions about what the level should be for the next hour. Ideally, the model would be aware of the daily fluctuations and how they vary on weekdays, weekends, and holidays, so it can predict more accurately. This would need to scale to thousands of interfaces across hundreds of devices, and alerts would be generated for interfaces that are well off their predicted level. The most common example would be traffic bottoming out (0 bps) or maxing out during "normal business hours", and those hours would vary from entity to entity. However, during "non-business hours" the system wouldn't alert on low bandwidth usage (no one is using the network, maintenance windows, etc.) or high bandwidth usage (backups, updates, etc.). I think this gets especially tricky since different entities have their own ideas of "business hours": a small law firm might be 8-5 on weekdays only, while a restaurant may be 6 am to 10 pm all week long, yet the carrier core equipment is 24/7 and would need to know about under/over-used links at all times.
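To make the "business hours vary per entity" part concrete, here is a rough sketch (plain Python, nothing InfluxDB-specific) of the kind of per-interface baseline I'm imagining: bucket the history by hour-of-week, learn a tolerance band per bucket, and flag samples that fall outside it. The 4x-spread band and the 10% floor are placeholder numbers, not anything I've validated.

```python
from datetime import datetime
from statistics import median

def hour_of_week(ts: datetime) -> int:
    """Bucket a timestamp into one of 168 hour-of-week slots (0 = Monday 00:00)."""
    return ts.weekday() * 24 + ts.hour

def build_baseline(samples):
    """samples: iterable of (datetime, bits_per_second) for ONE interface/direction.
    Returns {hour_of_week_slot: (low, high)} tolerance bands learned from history."""
    buckets = {}
    for ts, bps in samples:
        buckets.setdefault(hour_of_week(ts), []).append(bps)
    baseline = {}
    for slot, values in buckets.items():
        mid = median(values)
        # Median absolute deviation as a crude spread; fall back to 10% of the median
        # so a perfectly flat history still gets a non-zero band.
        spread = median([abs(v - mid) for v in values]) or mid * 0.1
        baseline[slot] = (max(0.0, mid - 4 * spread), mid + 4 * spread)
    return baseline

def is_anomalous(baseline, ts, bps):
    """True if a new sample falls outside the learned band for its hour-of-week slot."""
    low, high = baseline.get(hour_of_week(ts), (0.0, float("inf")))
    return bps < low or bps > high
```

In my head, this is what makes the law firm vs. restaurant problem go away: each interface learns its own weekly shape, so "business hours" never has to be configured explicitly (holidays would still need special handling).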
My first concern is that my goal is not realistic; I have so little experience with these technologies that I may just be expecting too much. I went cross-eyed on the examples in the InfluxDB docs about machine learning, and I'm hoping that's just because I'm so far out of my element that I'm missing things that become easier with experience. One thing that tripped me up in those docs was the example about water depth as affected by the tide, where the basic high/low cycle was approximately 4 hours. For my purposes the cycle is more complicated than that, which may preclude using any of InfluxDB's built-in machine learning algorithms.
From my research, I think I know what needs to be done, but I'm not sure I know how to do it. (Please feel free to correct me if I seem to be heading in the wrong direction.)
I think I should use a TICKscript to batch-process the data and create the predictions. After that, it should be a simple matter to have another TICKscript running every few minutes to look for interfaces that are experiencing anomalies.
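To make that second script concrete, this is roughly the comparison I'm picturing, sketched in Python against the HTTP API just to think it through (I realize Kapacitor would express it as a TICKscript alert node instead). The database, measurement, field, and tag names and the 3x thresholds are guesses I'd replace with whatever LibreNMS actually writes; "port_forecast" is a measurement the prediction job would populate (see the last sketch below).

```python
from influxdb import InfluxDBClient

# Guessed names: database "librenms", measurement "ports" with an "ifInOctets_rate"
# field, and a "port_forecast" measurement written by the prediction job.
client = InfluxDBClient(host="localhost", port=8086, database="librenms")

actual = client.query(
    'SELECT mean("ifInOctets_rate") AS bps FROM "ports" '
    'WHERE time > now() - 5m GROUP BY "hostname", "ifName"'
)
predicted = client.query(
    'SELECT mean("predicted_in_bps") AS bps FROM "port_forecast" '
    'WHERE time > now() - 5m GROUP BY "hostname", "ifName"'
)

def by_interface(result):
    """Collapse an InfluxDB ResultSet into {(hostname, ifName): mean_bps}."""
    out = {}
    for (_, tags), points in result.items():
        rows = list(points)
        if rows and rows[0]["bps"] is not None:
            out[(tags["hostname"], tags["ifName"])] = rows[0]["bps"]
    return out

observed, forecast = by_interface(actual), by_interface(predicted)
for key, expected in forecast.items():
    got = observed.get(key, 0.0)
    # Placeholder rule: alert on anything more than 3x above or below its prediction.
    if expected > 0 and (got > 3 * expected or got < expected / 3):
        print(f"ALERT {key}: observed {got:.0f} bps vs predicted {expected:.0f} bps")
```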
I intend to use LibreNMS to gather the data and forward it to InfluxDB, and I have this working in a lab. In the future, I might even run multiple instances so core devices/interfaces are polled every minute, while aggregation and access devices/interfaces are polled every 5 minutes. I envision keeping the data for a month (or two) and then downsampling it for longer-term storage. This easily provides a lot of data (CPU, temperature, bandwidth/interface stats, etc.) from various vendors to experiment with.
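For the retention/downsampling piece, I was thinking along these lines, with the caveat that the database and measurement names are guesses and the InfluxQL could just as easily be run from the CLI:

```python
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="librenms")

# Keep raw polls for ~60 days and hourly rollups for a year (names/durations are placeholders).
client.query('CREATE RETENTION POLICY "raw_60d" ON "librenms" DURATION 60d REPLICATION 1 DEFAULT')
client.query('CREATE RETENTION POLICY "rollup_1y" ON "librenms" DURATION 52w REPLICATION 1')

# Continuous query that downsamples the port counters into hourly means, keeping all tags.
client.query(
    'CREATE CONTINUOUS QUERY "cq_ports_1h" ON "librenms" BEGIN '
    'SELECT mean(*) INTO "librenms"."rollup_1y"."ports_1h" FROM "ports" '
    'GROUP BY time(1h), * END'
)
```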
Assuming I have not wandered into the impossible, I could use some pointers/guidance on setting up a template that runs a model for every interface with data points in the past hour. (I'll worry about identifying disabled/disconnected interfaces later.) It would calculate the in and out traffic rates for each interface and insert/update predicted values at one-minute intervals for the next 60 minutes. I think getting to the point where the predicted and actual data rates can be graphed together would be a good milestone to work towards.
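For that milestone, here's the shape of what I think the prediction step would write back, so the actual and predicted rates end up as two series that can be graphed side by side (e.g. in Grafana). The measurement, tag, and field names are placeholders, and `build_baseline` refers to the rough sketch earlier in this post.

```python
from datetime import datetime, timedelta, timezone
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="librenms")

def write_forecast(hostname, ifname, baseline, start=None):
    """Write 60 one-minute predicted points for one interface into "port_forecast".
    `baseline` is the {hour_of_week_slot: (low, high)} dict from the earlier sketch;
    the midpoint of the band is used as the predicted value."""
    start = start or datetime.now(timezone.utc).replace(second=0, microsecond=0)
    points = []
    for minute in range(60):
        ts = start + timedelta(minutes=minute)
        slot = ts.weekday() * 24 + ts.hour
        low, high = baseline.get(slot, (0.0, 0.0))
        points.append({
            "measurement": "port_forecast",
            "tags": {"hostname": hostname, "ifName": ifname},
            "time": ts.isoformat(),
            "fields": {
                "predicted_in_bps": (low + high) / 2.0,
                "predicted_low": low,
                "predicted_high": high,
            },
        })
    client.write_points(points)
```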