CPU Thresholds, Warnings, and Risk Scoring

The following Splunk query does what the title says. Risk scoring is done via color codes in the chart options of the dashboard source (see below for an example). The query averages CPU load in 30-minute chunks over the selected time range (for example, the average CPU usage on a given server over the last 24 hours), averages those chunks per host, assigns each host a risk level, and counts the hosts at each level.

sourcetype="Perfmon:CPU Load" | bucket _time span=30m | eval Load=round(Value,2) | stats avg(Load) as AVGL by host, _time | chart avg(AVGL) as AverageCPU by host | eval "CPULoad"=case(AverageCPU>90, "Critical", AverageCPU>=70, "Warning", AverageCPU>=40, "Elevated", AverageCPU<40, "Normal") | stats count by CPULoad | sort - count CPULoad

The above assumes that you want your chunks of time (for the average) to be 30 minutes; you can change that by modifying the “span” value (for example, span=1h). The query also assumes that Critical is anything above 90%, Warning is 70% up to 90%, Elevated is 40% up to 70%, and Normal is anything below 40%.
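If you want to sanity-check the thresholds outside of Splunk, the same logic can be sketched in a few lines of Python. This is only an illustration, not part of the query: the function names, the sample tuple layout, and the hosts below are made up for the example, and it assumes the bands are contiguous (an average of exactly 90 counts as Warning, and anything below 40 is Normal).

```python
from collections import defaultdict
from statistics import mean

SPAN_SECONDS = 30 * 60  # mirrors "span=30m" in the query


def classify(average_cpu):
    """Map an average CPU percentage to a risk label.

    Conditions are checked in order, so each band only needs a lower bound.
    """
    if average_cpu > 90:
        return "Critical"
    if average_cpu >= 70:
        return "Warning"
    if average_cpu >= 40:
        return "Elevated"
    return "Normal"


def risk_counts(samples):
    """Count hosts per risk label.

    samples: iterable of (host, epoch_seconds, cpu_percent) tuples
    (a hypothetical layout chosen for this sketch).
    """
    # Step 1: average the rounded samples per host and 30-minute chunk.
    chunks = defaultdict(list)
    for host, ts, value in samples:
        chunk_start = ts - (ts % SPAN_SECONDS)
        chunks[(host, chunk_start)].append(round(value, 2))

    # Step 2: average the chunk averages per host.
    per_host = defaultdict(list)
    for (host, _chunk_start), values in chunks.items():
        per_host[host].append(mean(values))

    # Step 3: label each host and count hosts per label.
    counts = defaultdict(int)
    for chunk_means in per_host.values():
        counts[classify(mean(chunk_means))] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical sample data: three hosts, up to two 30-minute chunks each.
    samples = [
        ("web01", 0, 95.0), ("web01", 1800, 97.0),
        ("db01", 0, 89.5), ("db01", 1800, 89.5),
        ("app01", 0, 10.0),
    ]
    print(risk_counts(samples))
```

Note the db01 host in the sample data: its average of 89.5 sits between 89 and 90, which is exactly the kind of value worth testing when you adjust the thresholds.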

To define the corresponding colors, you’ll need to add an option to the Simple XML source behind your dashboard. Edit the source, find the panel that references the above query, and add the following to the chart options:

<option name="charting.fieldColors">{"Critical":0xD64541,"Warning":0xF89406,"Elevated":0x3498DB,"Normal":0x2ECC71}</option>