Troubleshooting High SoftIRQ
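SoftIRQ stands for software interrupt. When a FortiGate receives traffic that the CPU must process because it is not accelerated by the NPU, softirq can rise and stay high. If CPU usage is high and `get system performance status` shows a large softirq share, check first for Layer 2 loops, broadcast storms, abnormal traffic, sessions that are not offloaded, and Device Detection enabled on interfaces.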
How to check
Check the CPU usage and confirm whether `softirq` accounts for most of it.

    # get sys performance status
    CPU states: 0% user 0% system 0% nice 67% idle 0% iowait 0% irq 33% softirq
    CPU0 states: 0% user 0% system 0% nice 55% idle 0% iowait 0% irq 45% softirq
    CPU1 states: 0% user 0% system 0% nice 19% idle 0% iowait 0% irq 81% softirq
    CPU2 states: 1% user 0% system 0% nice 32% idle 0% iowait 0% irq 67% softirq
    CPU3 states: 0% user 0% system 0% nice 66% idle 0% iowait 0% irq 34% softirq
    Memory: 1911192k total, 1002652k used (52.5%), 645292k free (33.8%), 263248k freeable (13.8%)
    ......
    Average sessions: 291687 sessions in 1 minute, 293226 sessions in 10 minutes, 293696 sessions in 30 minutes
    ......
    Average NPU sessions: 35 sessions in last 1 minute, 36 sessions in last 10 minutes, 36 sessions in last 30 minutes
    ......

Use `diagnose sys mpstat` to watch how the software interrupt load changes on each CPU core over time.

Tip
`diagnose sys mpstat` refreshes every 5 seconds by default; press Ctrl+C to stop. `diagnose sys mpstat 3 5` prints one sample every 3 seconds, 5 samples in total.

    # diagnose sys mpstat
    # diagnose sys mpstat 3 5
    Gathering data, wait 3 sec, press any key to quit.
    ..0..1..2
    TIME          CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal   %idle
    10:51:45 AM   all    0.42    0.00    0.18    0.00    0.00   74.36    0.00   25.04
                    0    0.99    0.00    0.99    0.00    0.00   96.04    0.00    1.98
                    1    0.00    0.00    0.00    0.00    0.00   99.00    0.00    1.00
                    2    0.00    0.00    0.00    0.00    0.00   98.02    0.00    1.98
                    3    0.99    0.00    0.00    0.00    0.00  100.00    0.00    0.00
    TIME          CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal   %idle
    10:51:48 AM   all    0.29    0.00    0.12    0.00    0.00   78.21    0.00   21.38
                    0    0.00    0.00    0.99    0.00    0.00   97.03    0.00    1.98
                    1    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00
                    2    0.99    0.00    0.00    0.00    0.00   99.01    0.00    0.00
                    3    0.00    0.00    0.99    0.00    0.00   98.02    0.00    0.99
    ......
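When the performance output is collected repeatedly, reading it by eye gets tedious. The following is a minimal Python sketch, not a Fortinet tool, that pulls two signals out of saved `get system performance status` text: the cores whose softirq share is at or above a threshold, and the share of sessions that are NPU sessions. The function names and the default threshold of 60 are illustrative assumptions, and the line formats are assumed to match the sample output shown above.

```python
import re

# Per-core lines look like "CPU1 states: 0% user ... 81% softirq".
# The aggregate "CPU states:" line has no core number and is not matched.
CPU_LINE = re.compile(r"^CPU(\d+) states:.*?(\d+)% softirq", re.MULTILINE)
AVG_SESSIONS = re.compile(r"Average sessions: (\d+) sessions in 1 minute")
AVG_NPU_SESSIONS = re.compile(r"Average NPU sessions: (\d+) sessions in last 1 minute")


def busy_softirq_cores(output, threshold=60):
    """Return {core: softirq %} for cores at or above the threshold.

    The threshold is an arbitrary illustrative default, not a vendor value.
    """
    return {
        int(core): int(pct)
        for core, pct in CPU_LINE.findall(output)
        if int(pct) >= threshold
    }


def npu_session_share(output):
    """Return NPU sessions / all sessions (1-minute averages), or None if absent."""
    total = AVG_SESSIONS.search(output)
    npu = AVG_NPU_SESSIONS.search(output)
    if not total or not npu or int(total.group(1)) == 0:
        return None
    return int(npu.group(1)) / int(total.group(1))


SAMPLE = """\
CPU states: 0% user 0% system 0% nice 67% idle 0% iowait 0% irq 33% softirq
CPU0 states: 0% user 0% system 0% nice 55% idle 0% iowait 0% irq 45% softirq
CPU1 states: 0% user 0% system 0% nice 19% idle 0% iowait 0% irq 81% softirq
CPU2 states: 1% user 0% system 0% nice 32% idle 0% iowait 0% irq 67% softirq
CPU3 states: 0% user 0% system 0% nice 66% idle 0% iowait 0% irq 34% softirq
Average sessions: 291687 sessions in 1 minute, 293226 sessions in 10 minutes, 293696 sessions in 30 minutes
Average NPU sessions: 35 sessions in last 1 minute, 36 sessions in last 10 minutes, 36 sessions in last 30 minutes
"""

if __name__ == "__main__":
    print("busy cores:", busy_softirq_cores(SAMPLE))
    print("NPU session share:", npu_session_share(SAMPLE))
```

A very small NPU session share alongside busy softirq cores matches the pattern this article describes; what counts as "small" depends on the model and the traffic mix, so the sketch reports the number and leaves the judgment to the operator.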
Common causes
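- Layer 2 loops, broadcast storms, floods of ARP requests, or abnormal IPv6, multicast, or ESP traffic continuously reaching the FortiGate.
- Hardware offload is disabled on the firewall policy, or traffic passes through an interface that does not support NPU offload.
- A large volume of traffic denied by policy hits the FortiGate. By default denied sessions are not kept, so the FortiGate keeps creating and deleting temporary denied sessions.
- `ses-denied-traffic` is enabled under `config system settings`, so a large number of distinct denied sessions enter the session table and add CPU load.
- Device Detection is enabled on an interface, so the kernel has to copy packets and hand them to a user-space process for inspection.
- Software Switch traffic is not offloaded by the NP, especially when `inter-switch-policy`, the interface type, and hardware support do not match.
- HA uses a hardware switch and STP is not enabled, so a fault in the upstream switching network can create a loop.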
Troubleshooting steps
Check the interface drop counters, and watch in particular whether `Host Tx dropped` keeps increasing.

    # diagnose hardware deviceinfo nic <interface>
    ============ Counters ===========
    Rx_CRC_Errors :0
    Rx_Frame_Too_Longs:0
    rx_undersize :0
    Rx Pkts :64880428536
    Rx Bytes :29923981233538
    Tx Pkts :82496472350
    Tx Bytes :42412599845273
    rx_rate :0
    tx_rate :0
    nr_ctr_reset :0
    Host Rx Pkts :64867748559
    Host Rx Bytes :28413202957398
    Host Tx Pkts :88100655721
    Host Tx Bytes :48030145695805
    Host Tx dropped :1316
    FragTxCreate :0
    FragTxOk :0
    FragTxDrop :0

If you suspect a broadcast storm or a Layer 2 loop, check the CPU state repeatedly and confirm whether many cores have `softirq` pinned near 100%.

    # get sys perf status
    CPU states: 0% user 0% system 0% nice 55% idle 0% iowait 0% irq 45% softirq
    CPU0 states: 1% user 0% system 0% nice 57% idle 0% iowait 0% irq 42% softirq
    CPU1 states: 0% user 0% system 0% nice 0% idle 0% iowait 0% irq 100% softirq
    CPU2 states: 0% user 0% system 0% nice 0% idle 0% iowait 0% irq 100% softirq
    CPU3 states: 0% user 0% system 0% nice 0% idle 0% iowait 0% irq 100% softirq
    CPU4 states: 0% user 0% system 0% nice 0% idle 0% iowait 0% irq 100% softirq
    CPU5 states: 0% user 0% system 0% nice 15% idle 0% iowait 0% irq 85% softirq
    CPU6 states: 0% user 0% system 0% nice 0% idle 0% iowait 0% irq 100% softirq
    CPU7 states: 0% user 0% system 0% nice 0% idle 0% iowait 0% irq 100% softirq
    CPU8 states: 0% user 0% system 0% nice 100% idle 0% iowait 0% irq 0% softirq
    CPU9 states: 0% user 0% system 0% nice 100% idle 0% iowait 0% irq 0% softirq
    CPU10 states: 0% user 0% system 0% nice 100% idle 0% iowait 0% irq 0% softirq
    CPU11 states: 0% user 0% system 0% nice 100% idle 0% iowait 0% irq 0% softirq
    CPU12 states: 2% user 0% system 0% nice 98% idle 0% iowait 0% irq 0% softirq
    CPU13 states: 6% user 0% system 0% nice 94% idle 0% iowait 0% irq 0% softirq
    CPU14 states: 1% user 0% system 0% nice 99% idle 0% iowait 0% irq 0% softirq
    CPU15 states: 1% user 0% system 0% nice 99% idle 0% iowait 0% irq 0% softirq

Check the per-interface receive and transmit packet rates to see whether any interface shows an abnormally high rate.

    # diagnose netlink interface packet-rate
    Interface    RX-rate(per second)    TX-rate(per second)
    port1        600                    496504920
    port_ha      25                     47
    ha           26                     38

Capture packets with no filter or a loose filter to determine whether unknown traffic, flooding, IPv6, ESP, or other abnormal packets are present.
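On devices with many interfaces, the packet-rate table is easier to read once it is sorted. The sketch below is a small Python helper (a hypothetical name, not part of FortiOS) that parses saved `diagnose netlink interface packet-rate` text and returns the interfaces with the highest rate in either direction. It assumes each data row has exactly three whitespace-separated fields, as in the sample output above.

```python
def top_packet_rates(output, limit=3):
    """Return (interface, rx_pps, tx_pps) tuples, highest rate first.

    Rows are ranked by the larger of the RX and TX rates, so an interface
    flooding in only one direction still rises to the top.
    """
    rows = []
    for line in output.splitlines():
        parts = line.split()
        # Data rows have three fields with two integer rates; the command
        # echo and the header line do not, so they are skipped.
        if len(parts) == 3 and parts[1].isdigit() and parts[2].isdigit():
            rows.append((parts[0], int(parts[1]), int(parts[2])))
    rows.sort(key=lambda row: max(row[1], row[2]), reverse=True)
    return rows[:limit]


SAMPLE = """\
# diagnose netlink interface packet-rate
Interface RX-rate(per second) TX-rate(per second)
port1 600 496504920
port_ha 25 47
ha 26 38
"""

if __name__ == "__main__":
    for name, rx, tx in top_packet_rates(SAMPLE):
        print(f"{name}: rx={rx}/s tx={tx}/s")
```

The interface at the top of the list is a reasonable first candidate for the capture in the next step.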
General capture without a traffic filter, excluding only SSH and default HTTPS management access:

    diagnose sniffer packet any "!port 22 and !port 443" 4 0 l

No filter:

    diagnose sniffer packet <interface> '' 6 2000 l

If you suspect ESP packets:

    diagnose sniffer packet any 'esp' 6 2000 l

An example capture of packets without a VLAN tag follows. For tagged packets, the capture shows the VLAN information.

    2024-08-27 19:14:20.969882 port1-- 202.103.1.1 -> 221.5.3.1: ESP(spi=0xdba59c0a,seq=0x35b)
    2024-08-27 19:14:20.973437 port1-- 202.103.1.1 -> 221.5.3.1: ESP(spi=0xdba59c0a,seq=0x35b)
    ......
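In the sample capture, the same ESP packet (same SPI and sequence number) appears twice on the same interface. Spotting such repeats in a long capture can be scripted. The following Python sketch counts identical ESP packets per interface field in saved sniffer text. It assumes the line format matches the sample above; the function name is hypothetical.

```python
import re
from collections import Counter

# Captures: interface field as printed, source, destination, SPI, sequence number.
ESP_LINE = re.compile(
    r"(\S+)\s+(\S+) -> (\S+): ESP\(spi=(0x[0-9a-fA-F]+),seq=(0x[0-9a-fA-F]+)\)"
)


def repeated_esp_packets(capture):
    """Return {(iface, src, dst, spi, seq): count} for packets seen more than once.

    The interface field is part of the key, so a packet that simply enters on
    one interface and leaves on another is not reported as a repeat.
    """
    counts = Counter(ESP_LINE.findall(capture))
    return {key: n for key, n in counts.items() if n > 1}


SAMPLE = """\
2024-08-27 19:14:20.969882 port1-- 202.103.1.1 -> 221.5.3.1: ESP(spi=0xdba59c0a,seq=0x35b)
2024-08-27 19:14:20.973437 port1-- 202.103.1.1 -> 221.5.3.1: ESP(spi=0xdba59c0a,seq=0x35b)
"""

if __name__ == "__main__":
    for key, n in repeated_esp_packets(SAMPLE).items():
        print(n, "copies of", key)
```

A repeat count that keeps growing for the same sequence number suggests the packet is circulating rather than being retransmitted, since a sender normally increments the ESP sequence number for every packet.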
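Important
On a device where SoftIRQ is already high, packet capture itself consumes CPU. Limit the packet count and verbosity level according to the risk on site, and run the capture in a maintenance window if necessary.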
Run debug flow against the abnormal traffic found in the capture, and check for reverse path check failures or other CPU drops that keep increasing.

    # diagnose debug flow filter addr 192.0.2.10
    # diagnose debug flow trace start 100
    # diagnose debug enable
    id=20085 trace_id=1107 func=ip_route_input_slow line=1704 msg="reverse path check fail, drop"
    ......

If you suspect a Layer 2 loop, temporarily shut down interfaces one at a time (LAN, WAN, DMZ, or any suspicious physical port) and watch whether `softirq` drops immediately.

Check whether sessions are not being offloaded to hardware, paying particular attention to the `no_ofld_reason` field in the session. For the meaning of each value, see: Troubleshooting → Session Tools → Reasons a Session Cannot Be Accelerated.

    session info: proto=6 proto_state=01 duration=10 expire=3590 timeout=3600 flags=00000000 sockflag=00000000 sockport=0 av_idx=0 use=4
    state=may_dirty
    statistic(bytes/packets/allow_err): org=840/10/1 reply=760/8/1 tuples=2
    orgin->sink: org pre->post, reply pre->post dev=5->6/6->5 gwy=192.0.2.1/198.51.100.1
    hook=pre dir=org act=noop 192.0.2.10:53124->198.51.100.10:443(0.0.0.0:0)
    hook=post dir=reply act=noop 198.51.100.10:443->192.0.2.10:53124(0.0.0.0:0)
    misc=0 policy_id=10 auth_info=0 chk_client_info=0 vd=0
    serial=00000123 tos=ff/ff app_list=0 app=0 url_cat=0
    npu_state=0x000000
    no_ofld_reason: non-npu-intf

If `device-identification` is enabled on an interface and the site does not depend on it, disable `device-identification` in the GUI or CLI and then observe the CPU.

    config system interface
        edit "lan1"
            set device-identification disable
        next
    end

If HA uses a hardware switch, confirm that STP is enabled.

    config system interface
        edit "hw1"
            set type hard-switch
            set stp enable
        next
    end

If the high CPU is related to traffic denied by policy, first determine which kind of denied traffic it is.
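A large volume of single-session unicast or multicast traffic is denied by the FortiGate. With the default configuration, denied traffic also creates temporary sessions. Because these sessions are not kept in the session table, the FortiGate keeps creating and deleting denied sessions, which can drive CPU usage up. In this case, enable `ses-denied-traffic` / `ses-denied-multicast-traffic` so that sessions for denied traffic stay in the session table and subsequent packets match the existing session and are dropped.

Tip
Afterwards, trace and block the abnormal traffic upstream, or add the correct allow policy.

    # Unicast sessions
    config system settings
        set ses-denied-traffic enable
    end
    # Multicast sessions
    config system settings
        set ses-denied-multicast-traffic enable
    end

A large volume of multi-session unicast traffic is denied by the FortiGate while `ses-denied-traffic` is enabled. The denied traffic then creates many new sessions that occupy the session table, which can affect CPU load. Enabling denied session offload lets the NP handle these sessions.

Related information
- Applies to UDP traffic only.
- Supported only on NP6 and NP7 platforms; NP6XLITE and NP7XLITE platforms are not supported.
- NP7 platforms support this feature from FortiOS 7.6.5 onward (NP6 platforms already supported it earlier).

    config system npu
        set session-denied-offload enable
    end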
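The two denied-traffic cases above differ in how many distinct flows the denied packets belong to. One way to tell them apart, sketched here in Python under stated assumptions, is to count distinct flows among the denied packets. The input is assumed to be a list of `(src, sport, dst, dport, proto)` tuples already extracted from a capture or log by some other means. The function name and the sample data are hypothetical, and no threshold is implied.

```python
from collections import Counter


def summarize_denied_flows(packets):
    """Summarize denied packets given as (src, sport, dst, dport, proto) tuples."""
    flows = Counter(packets)
    total = sum(flows.values())
    distinct = len(flows)
    return {
        "packets": total,
        "distinct_flows": distinct,
        # High value: few flows repeated many times (single-session pattern).
        # Value near 1: almost every packet is a new flow (multi-session pattern).
        "packets_per_flow": total / distinct if distinct else 0.0,
    }


# Illustrative data only: one flow repeated 1000 times.
repeated = [("192.0.2.10", 5000, "198.51.100.10", 53, "udp")] * 1000
# Illustrative data only: every packet uses a different source port.
scattered = [("192.0.2.10", 1024 + i, "198.51.100.10", 53, "udp") for i in range(1000)]

if __name__ == "__main__":
    print(summarize_denied_flows(repeated))
    print(summarize_denied_flows(scattered))
```

A high packets-per-flow figure resembles the first case, where keeping the denied session avoids repeated session creation. A figure close to 1 resembles the second case, where each kept session is rarely reused and the session table fills up.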
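If the switch side is confirmed to be continuously forwarding abnormal traffic to the FortiGate, investigate loops, untagged traffic, traffic legitimacy, and rate-limiting policies on the switch. On the FortiGate side, depending on model capability, consider using an ACL to block the abnormal traffic at the physical interface.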
    config firewall acl
        edit 1
            set interface "port1"
            set srcaddr "all"
            set dstaddr "all"
            set service "IKE" "ESP"
        next
    end

Tip
`config firewall acl` is not supported on all models. Before using it, confirm the device model, FortiOS version, and business impact.
Interpreting the results
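- `Average sessions` is high but `Average NPU sessions` is low: this usually means a large number of sessions are being handled by the CPU.
- `softirq` is near 100% on multiple CPU cores, especially together with an abnormal packet rate on one interface: suspect a Layer 2 loop, a broadcast storm, or abnormal forwarding on the switch side first.
- The same sequence number repeating in ESP packets may indicate that packets are being forwarded in a loop between the firewall and the switching devices.
- If the FortiGate itself may be creating the loop, collect the output of `get system performance status`, `diagnose sys mpstat`, and `diagnose netlink interface packet-rate`, together with packet captures and debug flow output, then contact TAC.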
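When many sessions are handled by the CPU, the next question is why they were not offloaded. The Python sketch below tallies the `no_ofld_reason` values in a saved session dump so that the dominant reason stands out. It assumes the dump was saved as text from a session listing (for example `diagnose sys session list`, which this article does not name explicitly) and that each session prints a `no_ofld_reason:` line as in the session example earlier. The function name is hypothetical.

```python
from collections import Counter


def offload_block_reasons(session_dump):
    """Count the values of 'no_ofld_reason:' lines in a saved session dump.

    The whole value after the colon is used as the key, so a line that
    lists several reasons is counted as one combined entry.
    """
    reasons = Counter()
    for line in session_dump.splitlines():
        line = line.strip()
        if line.startswith("no_ofld_reason:"):
            reason = line.split(":", 1)[1].strip() or "(empty)"
            reasons[reason] += 1
    return reasons


# Trimmed illustrative dump: only the lines relevant to offload are kept.
SAMPLE = """\
serial=00000123 tos=ff/ff app_list=0 app=0 url_cat=0
npu_state=0x000000
no_ofld_reason: non-npu-intf
serial=00000124 tos=ff/ff app_list=0 app=0 url_cat=0
npu_state=0x000000
no_ofld_reason: non-npu-intf
"""

if __name__ == "__main__":
    for reason, count in offload_block_reasons(SAMPLE).most_common():
        print(f"{count:6d}  {reason}")
```

Sessions that were offloaded may not print a reason at all, so the tally describes only the sessions that stayed on the CPU.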