MISC-26: Ubuntu Dmesg being spammed with rtl_counters_cond == 1



Issue Information

Issue Type: Bug
 
Priority: Major
Status: Closed

Reported By:
Ben Tasker
Assigned To:
Ben Tasker
Project: Miscellaneous (MISC)
Resolution: Fixed (2019-03-19 13:21:26)
Environment: Ubuntu 16.04.5 LTS (Xenial)
Linux 4.4.0-72-generic #93-Ubuntu SMP

Created: 2019-03-19 08:21:26
Time Spent Working


Description
My dmesg is getting spammed with lots of identical entries
[213308.745502] r8169 0000:1c:00.0 p3p1: rtl_counters_cond == 1 (loop: 1000, delay: 10).
[213308.756755] r8169 0000:1c:00.0 p3p1: rtl_counters_cond == 1 (loop: 1000, delay: 10).
[213308.767991] r8169 0000:1c:00.0 p3p1: rtl_counters_cond == 1 (loop: 1000, delay: 10).
[213308.779243] r8169 0000:1c:00.0 p3p1: rtl_counters_cond == 1 (loop: 1000, delay: 10).
[213308.790516] r8169 0000:1c:00.0 p3p1: rtl_counters_cond == 1 (loop: 1000, delay: 10).
[213308.801655] r8169 0000:1c:00.0 p3p1: rtl_counters_cond == 1 (loop: 1000, delay: 10).
[213308.812869] r8169 0000:1c:00.0 p3p1: rtl_counters_cond == 1 (loop: 1000, delay: 10).
[213308.824014] r8169 0000:1c:00.0 p3p1: rtl_counters_cond == 1 (loop: 1000, delay: 10).
[213309.704324] r8169 0000:1c:00.0 p3p1: rtl_counters_cond == 1 (loop: 1000, delay: 10).
[213309.715648] r8169 0000:1c:00.0 p3p1: rtl_counters_cond == 1 (loop: 1000, delay: 10).
[213310.941632] r8169 0000:1c:00.0 p3p1: rtl_counters_cond == 1 (loop: 1000, delay: 10).


Since it started, I'm also regularly getting apport pop up to claim Slack and Skype have crashed (they're still open, but you can see a thread segfaulting in dmesg). As r8169 is a NIC driver, my guess is those electron apps are picking up on a "change" in network state and failing to handle it.


Toggle State Changes

Activity


We know r8169 is a NIC driver (painful memories of troubleshooting it in the past). Line also gives us the bus address and the interface name:
[213310.941632] r8169 0000:1c:00.0 p3p1: rtl_counters_cond == 1 (loop: 1000, delay: 10).


So it's interface p3p1 at address 0000:1c:00.0


That interface isn't actually cabled up, or in use:
ben@milleniumfalcon:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: p1p1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
    link/ether 00:e0:ff:68:04:c4 brd ff:ff:ff:ff:ff:ff
3: p3p1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
    link/ether 00:e0:4c:68:11:be brd ff:ff:ff:ff:ff:ff
4: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 80:c1:6e:f1:cb:3c brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.70/24 brd 192.168.1.255 scope global dynamic eth0
       valid_lft 532sec preferred_lft 532sec
    inet6 fe80::c2b6:54a1:bf5a:65db/64 scope link 
       valid_lft forever preferred_lft forever


It's also curious, because this box has just 2 NICs - 1 Broadcom (eth0) and the Realtek. Broadcom is on-board, Realtek is a PCI card.

Curious that the "two" Realtek interfaces have different MACs too.... For avoidance of doubt, I've just been on my knees behind the machine double-checking that there are in fact only 2 NICs.

Rather than just ifdowning it, https://askubuntu.com/questions/909213/ethernet-fails-to-initialize suggests we can rescan the bus

find /sys/devices -name p3p1
echo 1 > `dirname $dev`/../remove


Now, that page suggests we can then re-scan, but that doesn't work for me:
root@milleniumfalcon:~# echo 1 >/sys/devices/pci0000:00/0000:1c:00.0/rescan
bash: /sys/devices/pci0000:00/0000:1c:00.0/rescan: No such file or directory


Which makes sense given the thing doesn't actually exist.


The loglines have stopped hitting dmesg but it feels like an incomplete fix as all we've actually done is remove the interface. So, lets try pulling the module out and putting it back to make sure it stays fixed
root@milleniumfalcon:~# rmmod r8169; modprobe r8169


dmesg looks happy, and the phantom interface no longer shows up for ip
root@milleniumfalcon:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
4: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 80:c1:6e:f1:cb:3c brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.70/24 brd 192.168.1.255 scope global dynamic eth0
       valid_lft 481sec preferred_lft 481sec
    inet6 fe80::c2b6:54a1:bf5a:65db/64 scope link 
       valid_lft forever preferred_lft forever
5: p1p1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
    link/ether 00:e0:ff:68:04:c4 brd ff:ff:ff:ff:ff:ff


Which resolves the issue, but doesn't really explain why r8169 decided to show a phantom interface with a hardware address that doesn't actually exist on the system. The OUI is definitely Realtek though.
I don't know where that other MAC came from, but there is an old, old bug with the Realtek driver that I've been hit by before which might be relevant. The bug itself only applies to systems using bonded interfaces (which this isn't), but the underlying fuck-up probably has other applications.

When 2 (or more) interfaces are in a bond, they share a MAC address (generally taken from one of the slaves). The new - shared - MAC is written into the NICs on-chip memory.

However, if (at least) one of those interfaces is Realtek (and it was specifically the r8169 driver I was hit with this on), following reboot the bond may fail to come up, or may come up with just one slave. A look in system logs will generally show complaints that the bond's MAC address is already in use. This is because, on boot, when the r8169 driver polls the NIC for it's MAC address, it fetches it from the nic's memory, rather than from the PHY. So, it get's the bond's MAC back rather than the true MAC. The "fix" is to isolate the system from power for 5 mins so that the NIC loses any residual charge.

There's a mailing thread post somewhere from around 2010 where this was identified, and it remained unfixed as of 2 years ago (don't know about now, but I can take an educated guess).

Now, this system doesn't have a bond, and never has, but I'm wondering if something similar is going on - I do occasionally run some VMs on this machine which get bridged to an adaptor (though I don't see any with that MAC). There was a (very brief) power interruption the other night as a result of the trip switch being thrown (cause of that is unrelated), so the system powered down suddenly and uncleanly, but didn't stay isolated from power for the NIC to lose any charge.

What that wouldn't explain though is the phantom bus address... clearly I'm missing something
Recording for posterity's sake
# modinfo r8169
filename:       /lib/modules/4.4.0-72-generic/kernel/drivers/net/ethernet/realtek/r8169.ko
firmware:       rtl_nic/rtl8107e-2.fw
firmware:       rtl_nic/rtl8107e-1.fw
firmware:       rtl_nic/rtl8168h-2.fw
firmware:       rtl_nic/rtl8168h-1.fw
firmware:       rtl_nic/rtl8168g-3.fw
firmware:       rtl_nic/rtl8168g-2.fw
firmware:       rtl_nic/rtl8106e-2.fw
firmware:       rtl_nic/rtl8106e-1.fw
firmware:       rtl_nic/rtl8411-2.fw
firmware:       rtl_nic/rtl8411-1.fw
firmware:       rtl_nic/rtl8402-1.fw
firmware:       rtl_nic/rtl8168f-2.fw
firmware:       rtl_nic/rtl8168f-1.fw
firmware:       rtl_nic/rtl8105e-1.fw
firmware:       rtl_nic/rtl8168e-3.fw
firmware:       rtl_nic/rtl8168e-2.fw
firmware:       rtl_nic/rtl8168e-1.fw
firmware:       rtl_nic/rtl8168d-2.fw
firmware:       rtl_nic/rtl8168d-1.fw
version:        2.3LK-NAPI
license:        GPL
description:    RealTek RTL-8169 Gigabit Ethernet driver
author:         Realtek and the Linux r8169 crew <netdev@<Domain Hidden>>
srcversion:     A6873B87C6B6105AB10CC7F
alias:          pci:v00000001d00008168sv*sd00002410bc*sc*i*
alias:          pci:v00001737d00001032sv*sd00000024bc*sc*i*
alias:          pci:v000016ECd00000116sv*sd*bc*sc*i*
alias:          pci:v00001259d0000C107sv*sd*bc*sc*i*
alias:          pci:v00001186d00004302sv*sd*bc*sc*i*
alias:          pci:v00001186d00004300sv*sd*bc*sc*i*
alias:          pci:v00001186d00004300sv00001186sd00004B10bc*sc*i*
alias:          pci:v000010ECd00008169sv*sd*bc*sc*i*
alias:          pci:v000010ECd00008168sv*sd*bc*sc*i*
alias:          pci:v000010ECd00008167sv*sd*bc*sc*i*
alias:          pci:v000010ECd00008136sv*sd*bc*sc*i*
alias:          pci:v000010ECd00008129sv*sd*bc*sc*i*
depends:        mii
intree:         Y
vermagic:       4.4.0-72-generic SMP mod_unload modversions 
parm:           use_dac:Enable PCI DAC. Unsafe on 32 bit PCI slot. (int)
parm:           debug:Debug verbosity level (0=none, ..., 16=all) (int)


And visible interfaces
root@milleniumfalcon:~# lspci | grep Eth
01:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5764M Gigabit Ethernet PCIe (rev 10)
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 03)
Out of curiosity, what happens if I trigger a rescan via the bus the realtek is on?
root@milleniumfalcon:~# echo 1 > /sys/devices/pci0000:00/0000:00:01.0/rescan


Hmmm, we have p3p1 back. Dmesg shows
[231587.286019] pci 0000:1c:00.0: [10ec:8168] type 00 class 0x020000
[231587.286061] pci 0000:1c:00.0: reg 0x10: [io  0x0000-0x00ff]
[231587.286084] pci 0000:1c:00.0: reg 0x18: [mem 0x00000000-0x00000fff 64bit]
[231587.286100] pci 0000:1c:00.0: reg 0x20: [mem 0x00000000-0x00003fff 64bit pref]
[231587.286175] pci 0000:1c:00.0: supports D1 D2
[231587.286177] pci 0000:1c:00.0: PME# supported from D0 D1 D2 D3hot D3cold
[231587.291168] lpc_ich 0000:00:1f.0: BAR 13: [io  size 0x0080] has bogus alignment
[231587.291171] lpc_ich 0000:00:1f.0: BAR 14: [io  size 0x0040] has bogus alignment
[231587.291179] pci 0000:1c:00.0: BAR 4: assigned [mem 0xee000000-0xee003fff 64bit pref]
[231587.291191] pci 0000:1c:00.0: BAR 2: assigned [mem 0xec000000-0xec000fff 64bit]
[231587.291203] pci 0000:1c:00.0: BAR 0: assigned [io  0xe000-0xe0ff]
[231587.291267] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
[231587.291278] r8169 0000:1c:00.0: can't disable ASPM; OS doesn't have ASPM control
[231587.291283] r8169 0000:1c:00.0: enabling device (0000 -> 0003)
[231587.291731] r8169 0000:1c:00.0 eth1: RTL8168d/8111d at 0xffffc9000395a000, 00:e0:4c:68:11:be, XID 083000c0 IRQ 28
[231587.291735] r8169 0000:1c:00.0 eth1: jumbo features [frames: 9200 bytes, tx checksumming: ko]
[231587.305043] r8169 0000:1c:00.0 p3p1: renamed from eth1
[231587.340221] IPv6: ADDRCONF(NETDEV_UP): p3p1: link is not ready
[231587.396391] r8169 0000:1c:00.0 p3p1: link down
[231587.396494] IPv6: ADDRCONF(NETDEV_UP): p3p1: link is not ready


And we can see it in ip
root@milleniumfalcon:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
4: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 80:c1:6e:f1:cb:3c brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.70/24 brd 192.168.1.255 scope global dynamic eth0
       valid_lft 395sec preferred_lft 395sec
    inet6 fe80::c2b6:54a1:bf5a:65db/64 scope link 
       valid_lft forever preferred_lft forever
5: p1p1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
    link/ether 00:e0:ff:68:04:c4 brd ff:ff:ff:ff:ff:ff
6: p3p1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
    link/ether 00:e0:4c:68:11:be brd ff:ff:ff:ff:ff:ff


MAC Address is the same as before.
root@milleniumfalcon:~# lspci | grep Eth
01:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5764M Gigabit Ethernet PCIe (rev 10)
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 03)
1c:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 03)

I've just got onto my knees again and there are definitely only 2 NICs in this box.

Fucking Muppet.

There is in fact a second (unused) NIC hiding behind the big fat plug that comes of my NVIDIA graphics card. That at least solves that mystery, and the driver's still stopped spamming the original messages into dmesg and syslog, even with the device visible again, so I'll consider this resolved and hopefully find time for the lie-down I so obviously need...
btasker changed status from 'Open' to 'Resolved'
btasker added 'Fixed' to resolution
btasker changed status from 'Resolved' to 'Closed'