2020-05-15

Why is my Samba connection failing?

Or how nice it is to have a problem that has long eluded you finally explained.

Part 1: The horror

You see, I've been using a FriendlyArm/FriendlyElec RK3399-based NanoPC-T4 to run Linux services, such as a staging web server for Rufus, network print host, various other things as well as a Samba File Server...

However, this Samba functionality seemed to be plagued like there was no tomorrow: Almost every time I tried to fetch a large file from it, Windows would freeze during transfer, with no recovery and not even the possibility of cancelling unless the Samba service was restarted manually on the server.

But what was more vexing is that these problems with Samba did not manifest themselves until I switched from using an old Lubuntu distribution, that was provided by the manufacturer of that device, to a more up to date Armbian. With Lubuntu, Samba seemed rock-solid, but with Armbian, it was hopeless.


This became so infuriating that I had to giveup on using Samba on that machine altogether and, considering that things usually seemed to be okay-ish after the service had restarted, I dismissed it as a pure Samba/arch64 bug, that newer versions of Samba or Debian had triggered, and that would eventually get fixed. But of course, that long awaited fix never seemed to manifest itself and I had better things to do than invest time I didn't have trying to troubleshoot a functionality that wasn't that critical to my workflow.


Besides, the Samba logs were all but useless. Nothing in there seemed to provide any indication that Samba was even remotely unhappy. And of course, you can forget about Windows giving you any clue about why the heck your Samba file transfers are freezing...

Part 2: The light at the end of the tunnel

Recently however, in the course of the Raspberry Pi 4 UEFI firmware experiments, it turns out that I was using that same server to test UEFI HTTP boot of a large (900 MB) ISO, that was being served from the Apache server running on that NanoPC machine, and had no joy with getting the full transfer complete either. Except, there, it wasn't freezing. It just seemed to produce a bunch of TcpInput: received a checksum error packet before giving up on the transfer altogether...

URI: http://10.0.0.7/~efi/ubuntu.iso
File Size: 916357120 Bytes
Downloading...1%
TcpInput: received a checksum error packet TcpInput: Discard a packet TcpInput: received a checksum error packet TcpInput: Discard a packet TcpInput: received a checksum error packet TcpInput: Discard a packet TcpInput: received a checksum error packet TcpInput: Discard a packet TcpInput: received a checksum error packet TcpInput: Discard a packet TcpInput: received a checksum error packet TcpInput: Discard a packet HttpTcpReceiveNotifyDpc: Aborted! Error: Server response timeout.

Yet, serving the same content from the native python3 HTTP server (python3 -m http.server 80, which is a super convenient command to know as it acts as an HTTP server and serves any content from the current directory through the specified port) appeared to be okay, albeit with the occasional checksum errors. This is suddenly starting to look like a lot of compounded network errors... Could this be related to that Samba issue?


Now, the first thing you do when you get reports of TCP checksum errors, is try a different cable, a different switch and so on, to make sure that this is not a pure hardware problem. But I had of course tried that during the process of trying to troubleshoot the failing Samba server, and, once again, the results of switching equipment and cabling around were all negative.

But at least a bunch of checksum errors does give you something to start to work with.

For one thing, you can monitor these errors with tcpdump (tcpdump -i eth0 -vvv tcp | grep incorrect) and, more importantly, you may find some very relevant articles that point you to the very root of the problem.

Long story short, if tcpdump -i eth0 -vvv tcp | grep incorrect produces loads of checksum errors on the platform you serve content from, you may want to look into disabling offloading from the network adapter with something like:

ethtool -K eth0 rx off tx off

Or you may continue to hope that the makers of your distro will take action, but that might just turn out to be wishful thinking...

4 comments:

  1. IMHO switching off the crc checks is very dangerous because retransmission of faulty packets might no longer occur?

    ReplyDelete
    Replies
    1. My understanding from the ethtool manual is that this only controls the offloading of checksum (-K option), and it does *NOT* disable checksumming completely. In other words, it moves the checksumming from the hardware NIC to the software, which is precisely what you want to try if you find that offloading may be causing issues.

      From what I could see, there are a few NICs out there that are supposed to support checksum offloading but where the hardware implementation has issues, in which case Linux will default to disabling it in the same manner as ethtool would do with option -K.

      Delete
    2. OMG I did not realise it is moving the CRC calculation from the ethernet chip to the CPU...

      Do you have any ideas why the calculation in the NIC fails only in case of samba transfers?

      Delete
    3. It doesn't fail only in the case of Samba. It's just that Samba clients (such as the Windows native one) are a lot less forgiving about to CRC errors and give up after they see a few too many errors, instead of asking the server to resend the data.

      Delete