Interface Bandwidth - Machine Learning

#1

Hello all, this is my first post to this forum. I’m really new to InfluxDB and machine learning, and I’m hoping someone will be able to help me with what I assume is a common goal for anyone monitoring network devices (routers/switches) for interface bandwidth usage. My immediate application is a carrier environment, but I think it would apply to an enterprise environment easily enough.

The ultimate goal is to be collecting interface stats (in/out octets) on enabled and operational interfaces. I’d like to have a machine learning algorithm processing the data and making predictions about what the level should be for the next hour. Ideally, the model would be aware of the daily fluctuations and how they vary on weekdays, weekends, and holidays, to make predictions more accurately. This would need to scale to thousands of interfaces over hundreds of devices, and alerts would be generated for interfaces that are far off their predicted level. The most common example would be traffic bottoming out (0 bps) or maxing out during “normal business hours”, and those hours would vary from entity to entity. However, during “non-business hours” the system wouldn’t generate alerts on low bandwidth usage (because no one is using the network, maintenance, etc.) or high bandwidth usage (e.g. backups, updates, etc.). I think this gets especially tricky since different entities have their own ideas of “business hours”. A small law firm might be 8-5 on weekdays only, while a restaurant may be 6 am to 10 pm all week long, yet the carrier core equipment would be 24/7 and would need to know of under/over-used links at all times.

My first concern is that my goal is not realistic, since I have so little experience with these technologies that I may just be expecting too much. I went cross-eyed on the examples given in the InfluxDB docs about machine learning, and I’m hoping that’s just because I am so far out of my element that I am missing things that become easier with experience. What tripped me up in the machine learning docs was that the example was about water depth as affected by the tide, where the basic high/low cycle was approximately 4 hours. For my purposes the cycle is more complicated than that, which may preclude using any of InfluxDB’s built-in machine learning algorithms.

From my research, I think I know what needs to be done, but I’m not sure I know how to do it. (Please feel free to correct me if I appear to go in the wrong direction.)

I think I should use a TICK script to batch process the data and then create the predictions. After that, it should be a simple matter to have another TICK script running every few minutes to look for interfaces that are experiencing anomalies.
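For that second script, a rough sketch of what the comparison could look like in TICKscript. Everything here is an assumption on my part: the database name, the “ports”/“ports_predicted” measurements, the “value” field, and the alert threshold are all placeholders, not anything a tool creates for you.

```tick
// Sketch only: assumes actual rates land in "ports" and predicted
// rates in "ports_predicted", grouped by host and interface.
var actual = batch
    |query('SELECT mean(value) AS value FROM "telemetry"."raw"."ports"')
        .period(5m)
        .every(5m)
        .groupBy('host', 'interface')

var predicted = batch
    |query('SELECT mean(value) AS value FROM "telemetry"."raw"."ports_predicted"')
        .period(5m)
        .every(5m)
        .groupBy('host', 'interface')

actual
    |join(predicted)
        .as('actual', 'predicted')
    // absolute difference between observed and predicted rate
    |eval(lambda: abs("actual.value" - "predicted.value"))
        .as('deviation')
    |alert()
        // threshold is arbitrary here; it would need tuning per interface class
        .crit(lambda: "deviation" > 50000000.0)
        .log('/var/log/kapacitor/bandwidth_alerts.log')
```

The join keys the two series together per interface, so one task covers every interface in the group-by.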

I intend to use LibreNMS to gather the data and forward it to InfluxDB, and I have this working in a lab. In the future, I might even have multiple instances, so core devices/interfaces are polled every minute while aggregate and access devices/interfaces are polled every 5 minutes. I envision the data being kept for a month (or two) and then downsampled for longer-term storage. This easily provides a lot of data (CPU, temp, bandwidth/interface stats, etc.) from various vendors that can be experimented on.
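The keep-then-downsample part can be done with retention policies plus a continuous query in InfluxQL. A minimal sketch, assuming a database named “telemetry” and a measurement named “ports” with octet-counter fields (these names are placeholders, not anything LibreNMS guarantees):

```sql
-- keep raw polls for ~60 days
CREATE RETENTION POLICY "raw" ON "telemetry" DURATION 60d REPLICATION 1 DEFAULT

-- keep downsampled data for a year
CREATE RETENTION POLICY "downsampled" ON "telemetry" DURATION 52w REPLICATION 1

-- roll raw points up into 15-minute means for long-term storage
CREATE CONTINUOUS QUERY "cq_ports_15m" ON "telemetry"
BEGIN
  SELECT mean("ifInOctets") AS "ifInOctets", mean("ifOutOctets") AS "ifOutOctets"
  INTO "downsampled"."ports"
  FROM "raw"."ports"
  GROUP BY time(15m), *
END
```

The `GROUP BY time(15m), *` keeps all tags (host, interface, etc.) intact in the downsampled series.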

Assuming I have not wandered into the impossible, I could use some pointers/guidance on setting up a template to run a model for all interfaces that have data points in the past hour. (I’ll worry about identifying disabled/disconnected interfaces later.) It would calculate the rate for in and out traffic on each interface and insert/update those values at one-minute intervals for the next 60 minutes. I think getting to the point where the predicted and actual data rates can be graphed would be a good milestone to work towards.
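As a first step toward that milestone, the rate calculation itself can be sketched as a batch TICKscript using the derivative node. The database, measurement, field, and tag names below are assumptions for illustration only:

```tick
// Sketch only: assumes a "ports" measurement with an "ifInOctets"
// counter field, tagged by host and interface.
batch
    |query('SELECT ifInOctets FROM "telemetry"."raw"."ports"')
        .period(1h)
        .every(5m)
        .groupBy('host', 'interface')
    // per-second rate of the octet counter; nonNegative() papers over
    // counter resets/wraps
    |derivative('ifInOctets')
        .unit(1s)
        .nonNegative()
        .as('rate_in_bps')
    // octets/s -> bits/s
    |eval(lambda: "rate_in_bps" * 8.0)
        .as('rate_in_bps')
    |influxDBOut()
        .database('telemetry')
        .retentionPolicy('raw')
        .measurement('ports_rates')
```

With actual rates in their own measurement, graphing them against a predicted-rates measurement on the same panel becomes straightforward.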

#2

I’d suggest searching the forum for a topic on setting up a Facebook Prophet UDF. I think it’s something you can apply to predict network usage patterns (if you don’t need sub-hour predictions); maybe it works for your case.

The high-level steps would be:

  1. Collect real usage metrics into Influx
  2. Install fbprophet (I suggest with Python2)
  3. Create Python UDF that implements the interface for Kapacitor to work with fbprophet (see the topic for info on that)
  4. Create .tick script that reads a batch of data from the network usage measurement and feeds into the Prophet UDF, and writes the predicted points back to InfluxDB (see same topic)
  5. Visually inspect the plot for the predicted points, adjust Prophet parameters if required
  6. If/Once happy with the results, you can create another tick script that compares the real values with the predicted ones.
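Roughly, step 4 might look like the TICKscript below, assuming the UDF has been registered in kapacitor.conf under the name `prophet`. The database, measurement, and field names are placeholders, and any UDF options depend on how the handler in the linked topic is written:

```tick
// Sketch only: feed ~30 days of hourly history per interface to the
// Prophet UDF and write its predicted points back to InfluxDB.
batch
    |query('SELECT mean(rate_in_bps) AS value FROM "telemetry"."raw"."ports_rates"')
        .period(30d)
        .every(1h)
        .groupBy('host', 'interface')
    // hand the batch to the UDF; it emits predicted points
    @prophet()
    |influxDBOut()
        .database('telemetry')
        .retentionPolicy('raw')
        .measurement('ports_predicted')
```

Because the task groups by host and interface, one script covers all interfaces rather than needing a task per port.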

Hope this helps!

#3

This article gives a good rundown on using Loud ML for predictions.


I’ve recently started looking into this myself and watched an interesting webinar on Loud ML and InfluxDB. My knowledge of it isn’t great at the moment, but getting set up, producing my first predictions, and graphing them in Chronograf and Grafana was fairly straightforward using the above article.

It also looks like the newer version of Chronograf has been updated to incorporate machine learning.

Kapacitor has some basic Holt-Winters predictors at the moment, but the documentation states it is a very basic form. The TICK script idea would work, but the machine learning software seems like too much fun not to try.
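For reference, that built-in option uses InfluxQL’s `holt_winters()` inside a Kapacitor batch query. A hedged sketch, with database, measurement, and field names assumed, predicting the next 10 hourly points with a 24-point (daily) seasonal pattern:

```tick
// Sketch only: holt_winters(mean(...), 10, 24) asks InfluxDB for 10
// predicted points using a seasonal pattern of 24 intervals.
batch
    |query('SELECT holt_winters(mean(rate_in_bps), 10, 24) FROM "telemetry"."raw"."ports_rates"')
        .period(7d)
        .every(1h)
        .groupBy(time(1h), 'host', 'interface')
    |influxDBOut()
        .database('telemetry')
        .measurement('ports_hw_predicted')
```

A single daily seasonality like this is exactly the limitation discussed above: it won’t capture weekday/weekend/holiday differences the way Prophet or Loud ML can.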