January 18, 2023

Kyu networking -- Orange Pi slow memcpy

The previous page gave details on how we discovered that memcpy is at the root of our problems with poor TCP performance on the Orange Pi.

I added some printout to the ip_output() routine in tcp/kyu_main.c to get this:

memcpy(1) 52 bytes
memcpy(2) 500 bytes

This tells us that ip_output() is being called with an mbuf chain of two components. The first has 52 bytes and holds the various headers. The second has 500 bytes and is the payload. My timing code times (sort of by accident) just the memcpy of the second component, and it takes 0.34 milliseconds to copy those 500 bytes. That works out to about 0.68 microseconds per byte, or roughly 1.5 megabytes per second. This is the copy from a BSD mbuf into a Kyu netbuf. Of course we copy a second time later, from the netbuf into the buffer used by the network chip. They always did tell me that copying was a big deal when it came to network performance.
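
To make the two memcpy calls concrete, here is a minimal sketch of flattening a chain into one buffer. It is not the Kyu source: struct sketch_mbuf and chain_to_flat() are hypothetical names invented for illustration, and a real BSD mbuf carries far more than these three fields.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a BSD mbuf: one link in a chain of data pieces. */
struct sketch_mbuf {
    struct sketch_mbuf *next;   /* next piece in the chain, or NULL */
    const unsigned char *data;  /* start of this piece's bytes */
    size_t len;                 /* number of bytes in this piece */
};

/* Copy every piece of the chain, in order, into one flat buffer.
 * Returns the total byte count, or 0 if the chain does not fit. */
size_t chain_to_flat(const struct sketch_mbuf *m, unsigned char *dst, size_t dst_size)
{
    size_t total = 0;

    for (; m != NULL; m = m->next) {
        if (m->len > dst_size - total)
            return 0;                           /* would overflow dst */
        memcpy(dst + total, m->data, m->len);   /* one memcpy per piece */
        total += m->len;
    }
    return total;
}
```

With a 52 byte header piece followed by a 500 byte payload piece, this does exactly two copies and returns 552.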

The question of course arises, how does the BBB (BeagleBone Black) run so fast? It uses the same code and has to perform the same copy, at least the one from the mbuf to the netbuf, and probably one to a network device buffer as well. What is different? A look at the assembly code shows that the Orange Pi is using some unoptimized C code to do a byte by byte copy, which is about the worst possible thing. A look at the BBB code, though, shows that it is using the exact same routine, so the copy loop itself cannot be the difference.
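
For reference, a byte by byte copy of the sort described above looks roughly like the sketch below. This is not the routine from the Kyu tree, and byte_copy() is a made-up name. Every pass through the loop does one byte load and one byte store, so the speed of the loop depends heavily on how fast each individual memory access is.

```c
#include <stddef.h>

/* Naive copy: one byte load and one byte store per iteration.
 * Hypothetical illustration, not the actual Kyu memcpy. */
void *byte_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n-- > 0)
        *d++ = *s++;

    return dst;
}
```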

At this point it seems pretty clear what the issue is. The data cache has not been enabled for the Orange Pi -- that is my bet. Looking at locore.S I see that the BBB uses an older (and very simple) bit of assembly startup code. The Orange Pi has different code because it supports multiple cores. There are notes that when the BBB used the Orange Pi startup code, the system "ran slow". (So, on the BBB, I avoid the Orange Pi setup code and use the old tried and true BBB setup).
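
One way to check this bet, assuming an ARMv7-A core, is to read the System Control Register (SCTLR) and look at its enable bits: bit 0 (M) for the MMU, bit 2 (C) for the data cache, and bit 12 (I) for the instruction cache. The sketch below is not from Kyu and the function names are hypothetical. The register read is only compiled on 32-bit ARM and has to run in a privileged mode; the decode helpers are plain C.

```c
#include <stdint.h>

#define SCTLR_M (1u << 0)   /* MMU enable */
#define SCTLR_C (1u << 2)   /* data and unified cache enable */
#define SCTLR_I (1u << 12)  /* instruction cache enable */

#if defined(__arm__)
/* Read SCTLR via CP15 (c1, c0, 0). Privileged modes only. */
static inline uint32_t read_sctlr(void)
{
    uint32_t val;
    __asm__ volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(val));
    return val;
}
#endif

/* Decode helpers: take a raw SCTLR value, return 1 if the bit is set. */
int sctlr_mmu_enabled(uint32_t sctlr)    { return (sctlr & SCTLR_M) != 0; }
int sctlr_dcache_enabled(uint32_t sctlr) { return (sctlr & SCTLR_C) != 0; }
int sctlr_icache_enabled(uint32_t sctlr) { return (sctlr & SCTLR_I) != 0; }
```

My understanding is that the C bit alone is not enough on these cores: with the MMU off, data accesses are not cached, so I would expect to need both M and C set before memcpy speeds up.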

So now we have our fingers on a reason why TCP is so different on the BBB versus the Orange Pi.

I am now trying to switch the Orange Pi back to the simpler single core initialization in arm/locore.S that the BBB uses. It works for the BBB, but the Orange Pi is not happy with it. So I am going to take time to learn about ARM MMU and cache setup.

Side note: old Kyu blinking LED diagnostic

I was excited (briefly) when I saw in the Kyu IO test menu:
"Test 21: LED blink test (via delay)"

Notes in the source say that this test is supposed to tell me whether the D cache is enabled. That was true only in the context of the BBB. The delay routine is a busy loop that counts up to a fixed number, and I arrived at that number with a stopwatch, on a system that properly enabled the cache. Later, when I had a setup that did not enable the cache, the blink would be very slow.

Proper behavior is that it causes the on board LED to blink twice, once a second.

This is useless for the Orange Pi, for the following reason. It currently blinks just fine, but only because I calibrated the delay by stopwatch on the current system, with its botched cache setup. Someday, when I get the cache working right, the LED will blink really fast, and I will need to recalibrate the delay count.
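
The recalibration itself is simple proportion. As a sketch (hypothetical names, not the Kyu delay code): if a loop count that was meant to give one delay actually gives a different one on the stopwatch, scale the count by the ratio of the two.

```c
/* Busy-wait: spin 'count' times. The volatile counter keeps the
 * compiler from optimizing the empty loop away. */
void delay_spin(unsigned long count)
{
    volatile unsigned long i;

    for (i = 0; i < count; i++)
        ;
}

/* Given a count that produced measured_ms on the stopwatch, return the
 * count that should produce target_ms. Leaves the count unchanged if
 * the measurement is zero, to avoid dividing by zero. */
unsigned long recalibrate(unsigned long old_count,
                          unsigned long measured_ms,
                          unsigned long target_ms)
{
    if (measured_ms == 0)
        return old_count;

    return (unsigned long)(((unsigned long long)old_count * target_ms)
                           / measured_ms);
}
```

For example, if a count of 100000 was meant to give one second but now finishes in 50 milliseconds, the new count would be 2000000. This assumes the loop time scales linearly with the count, which is only approximately true.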

