January 2, 2023

BSD TCP/IP for Kyu - The Kyu hang continues

A new year has begun and we are still at it. What do we know about this problem?

There is a fair summary of what we know thus far.

I add some debug and run a test. After 489 transfers it is hung. The last message on the Kyu console is "Send blocked". Typing "u wang_thr" yields:

PC = 40010910 ( resched+288 )
Called from thr_block+7c -- 4000fa40
Called from sem_block_cpu+4c -- 40010e3c
Called from sem_block+24 -- 40010e68
Called from sbwait+38 -- 40030fbc
Called from sosend+31c -- 4002f19c
Called from tcp_send+80 -- 4002ee68
Called from big_wangdoodle+f4 -- 4003230c
Called from one_wangdoodle+124 -- 40032580

This is the relatively infrequent block in "sbwait". This looks like a bona fide deadlock, and we may as well investigate it.

The first thing to notice is that this seems to happen every time sosend decides that it needs to call sbwait and block. This doesn't happen very often, but when it does it does not seem to be handled properly. Who is supposed to awaken us? There are two cases in tcp_input() that call so_wwakeup_ext(). I added some debug to these, and by the time I got back to the Kyu console there had been more activity: the message "Send unblocked", along with "out of netbufs", and it was entirely hung. So I am just going to reboot, run another test, and perhaps the next time the added debug will be helpful (so_wwakeup now prints a message).
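
For reference, the pattern the BSD code is meant to follow looks roughly like this. This is a paraphrase of the 4.4BSD logic, simplified, and not the exact Kyu code:

    /* in sosend(): block while the send buffer has no room */
    while ( sbspace ( &so->so_snd ) < resid ) {
        so->so_snd.sb_flags |= SB_WAIT;     /* note that a sender is waiting */
        error = sbwait ( &so->so_snd );     /* sleep until somebody wakes us */
        if ( error )
            goto release;
    }

    /* in tcp_input(), once an ACK lets acknowledged data be dropped */
    sbdrop ( &so->so_snd, acked );          /* frees send buffer space */
    sowwakeup ( so );                       /* wake any sender stuck in sbwait() */

In this port, so_wwakeup() / so_wwakeup_ext() play the sowwakeup() role. If that wakeup is skipped, or the code that would issue it can never run, the sender sleeps in sbwait() with nobody to wake it -- which is exactly the symptom here.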

I now get 3 messages at the end of each transfer from so_wwakeup(), when we are disconnecting and disconnected. I think I can just ignore these. Just watching the network activity LED lets me know at a glance if the test is running (it blinks rapidly and steadily) or if the test has hung (it blinks slowly and intermittently). Now we wait for a hang to analyze. Each transfer now takes about 2 seconds (the console messages slow it down), so 489 transfers would take 978 seconds (about 16 minutes). The next run hangs after 1195 transfers (30 minutes). No message about being blocked or awakened. The traceback is:

PC = 400104c8 ( change_thread+1f8 )
Called from thr_unblock+178 -- 4000fd14
Called from sem_unblock+70 -- 40010c20
Called from user_unlock+1c -- 40023a3c
Called from tcp_send+88 -- 4002ee80
Called from big_wangdoodle+f4 -- 4003231c
Typing "t 9" shows that we are rapidly cycling between 3 threads:
Switch wang_thr --> tcp-bsd
Switch tcp-bsd --> tcp-timer
Switch tcp-timer --> wang_thr
Why are we continually running tcp-bsd? It should only run when Kyu has a TCP packet that has arrived and needs to be processed. Running tcp-timer of course makes sense (it should run every 100 ms). I add debug to tcp-bsd to tell me how many actual packets are processed every time it awakens to pull packets off the queue, and now we run again and wait for a hang.
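
The debug itself is nothing fancy -- just a counter in the drain loop, something like the sketch below. The names here (input_sem, get_packet_from_queue, tcp_input_packet) are placeholders, not the actual Kyu identifiers:

    /* tcp-bsd thread body, sketched: count packets drained per wakeup */
    for ( ;; ) {
        struct netbuf *nbp;
        int count = 0;

        sem_block ( input_sem );            /* wait for the arrival announcement */
        while ( (nbp = get_packet_from_queue ()) != NULL ) {
            tcp_input_packet ( nbp );       /* hand the packet to the BSD code */
            count++;
        }
        printf ( "tcp-bsd: %d packets processed\n", count );
    }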

Away we go again with this debug in place. Now we are setting records, having gone for 2400+ transfers. It makes me wonder if just the tiny bit of delay that has now been added to tcp-bsd has made the difference. Now I am up to 3870 transfers (each taking 1.8 seconds according to netcat). Round this up to 2 seconds each and that is a bit over 2 hours of running. Finally we get this:

Wangdoodle connection 4126
cksum: out of data
 *** *** *** tcp_input 2, bad cksum = 3f
m, tlen, len = 401686f0, 12849, 12869
.....
.....
Wangdoodle connection 4148
 *** *** *** tcp_input 2, bad cksum = 4201
m, tlen, len = 40168230, 552, 572

Bad checksums should not lead to a hang though; the packet should just get dropped and then ultimately be resent. I type "u wang_thr" to get:
PC = 40010464 ( change_thread+194 )
Called from thr_unblock+178 -- 4000fd14
Called from sem_unblock+70 -- 40010c20
Called from user_unlock+1c -- 40023a3c
Called from tcp_send+88 -- 4002eed8
Called from big_wangdoodle+f4 -- 40032374

Typing "t 9" shows absolutely nothing, which is odd. What is happening though is that tcp-bsd is probably in the very loop I wanted to diagnose (at priority 14). But the test code runs the "test" (i.e. t 9) at priority 20, so it is getting starved by the looping tcp-bsd.

DEAD threads

The above long test led to a bunch of these:
Kyu, ready> l
  Thread:       name (  &tp   )    state      pc       sp     pri
  Thread:      shell (400a799c)   READY J 40010464 40568000   11
  Thread:        net (400a75dc)     SEM J 40010464 40578000   12
  Thread:   net_slow (400a74ec)  REPEAT C 40015a78 4057c000   13
* Thread:    tcp-bsd (400a77bc)   READY I 40024b08 40570000   14
  Thread:  tcp-timer (400a76cc)   READY C 40023c48 40574000   15
  Thread: wangdoodle (400a730c)     SEM J 40010464 40584000   30
  Thread:   wang_thr (400a721c)    DEAD I 4000f2ec 40580000   35
  Thread:   wang_thr (400a703c)    DEAD I 400117e0 40580000   35
  Thread:   wang_thr (400a6f4c)    DEAD I 400117d0 40588000   35
  Thread:   wang_thr (400a6e5c)    DEAD I 4000f2f8 4058c000   35
  Thread:   wang_thr (400a6d6c)    DEAD I 4000f2dc 4058c000   35
  Thread:   wang_thr (400a6c7c)    DEAD I 4000f2fc 4058c000   35
  Thread:   wang_thr (400a6a9c)    DEAD I 4000f280 4058c000   35
  Thread:   wang_thr (400a69ac)    DEAD I 4000f2fc 40590000   35
  Thread:   wang_thr (400a68bc)    DEAD I 4000f2dc 40590000   35
  Thread:   wang_thr (400a66dc)    DEAD I 4000f2d8 40590000   35
  Thread:   wang_thr (400a65ec)    DEAD I 4000f2d8 40590000   35
  Thread:   wang_thr (400a64fc)    DEAD I 4000f27c 40590000   35
  Thread:   wang_thr (400a640c)    DEAD I 4000f2d8 40594000   35
  Thread:   wang_thr (400a631c)    DEAD I 4000f2fc 40594000   35
  Thread:   wang_thr (400a622c)    DEAD I 4000f2d8 40594000   35
  Thread:   wang_thr (400a613c)   READY J 40010464 40594000   35
  Thread:       idle (400a78ac)   READY C 40010964 4056c000 1234
The traceback for any of these looks like the following (with perhaps different exact addresses in cleanup_thread()):
PC = 4000f2d8 ( cleanup_thread+88 )
Called from xxx_arm+bffeeddc -- 00000004
Called from thr_exit -- 4000f770

My guess as to what is happening is that an interrupt happens during thr_exit(), and since the thread is marked as "DEAD" the scheduler will never return to it to finish the exit processing. Putting a lock on exit processing should solve this. But an examination of the thr_exit() code shows the expected locks. This will have to remain a mystery for the time being.
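
For what it is worth, the sort of window I am imagining would look like this if the locks were missing. This is a guess at the shape of the problem, with made-up details, not the real thr_exit() code:

    /* hypothetical, unlocked exit path -- NOT the actual code */
    void
    thr_exit ( void )
    {
        cur_thread->state = DEAD;       /* scheduler will never pick us again */
        /* an interrupt taken right here switches away and never comes back ... */
        cleanup_thread ( cur_thread );  /* ... so this never finishes */
        resched ();
    }

Since the real code apparently does hold the expected locks around this, the puzzle is why these threads still end up parked inside cleanup_thread().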
I rearranged a bunch of priorities as follows - the idea being to get some elbow room to run things above the network priorities. In particular, we can now type "t 9" and the thread that runs that "test" and displays debug information can now run at priority 14.

  Thread:       name (  &tp   )    state      pc       sp     pri
  Thread:      shell (400a799c)   READY I 4000460c 40568000   11
  Thread:        net (400a75dc)   READY I 400109d4 40578000   20
  Thread:   net_slow (400a74ec)  REPEAT C 40015b40 4057c000   21
  Thread:    tcp-bsd (400a77bc)     SEM J 4001052c 40570000   24
* Thread:  tcp-timer (400a76cc)   READY I 400109d4 40574000   25
  Thread: wangdoodle (400a730c)     SEM I 400109d4 40584000   31
  Thread:       idle (400a78ac)   READY C 40010a2c 4056c000 1234
Away we go again with a series of transfers, starting at 5:54 PM. It hangs after only 114 transfers, this time with the "sbwait bug". It does announce that it is blocking with the "Send blocked" message. And, lo and behold, there is no message from so_wwakeup, so it is simply not being notified of new buffer space. Why not? Either I screwed up and eliminated a wakeup that was once in the BSD code, or I am holding a lock and preventing other code from running that would free up space. Or maybe there is a more basic problem that is clogging up the output channel: things are piling up waiting to be sent, and whatever is causing that is never getting rectified.

After a long wait, I get the message "Send unblocked", but without ever seeing a message that so_wwakeup did anything. How can this happen? This can only happen if sbwait() returned.

Wireshark reveals even more curious and confusing information.

Kyu sends 500 bytes with seq 157001, next seq 157501 (data is PK311)
Linux gives the ACK for this (ack 157501)
Kyu next sends 500 bytes with seq 151501, next seq 152001 (data is PK300)
Linux repeats the ACK for 157501

Why has TCP regressed to the tune of 6000 bytes or so?

Also worth noting is that while hung in this way, every TCP facility is stuck.
Happily though the Kyu console is alive and well.

What information do we need to understand this better?

When TCP wants to send a packet, it calls ip_output(), which calls the Kyu routine ip_send(). This calls ip_arp_send -- net_send -- board_net_send -- emac_send. Nowhere is there a queue (and no overall locking either). I can see at least two problems here. One is that in general tcp_output() is at the top of all of this and is holding the overall tcp_lock. It would certainly be better to have a TCP queue so that we could quickly put a packet on the queue and return. What happens for example if ARP needs to get involved? There will be ARP packets getting send and received while the TCP packet sits and waits. At any rate, there is nothing like an output queue with packets pending -- but there probably should be.
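
A minimal sketch of what such a queue might look like, using the producer/consumer scheme discussed below. Every name here (tcp_out_q, out_lock, out_sem, queue_add, queue_remove) is hypothetical; only ip_send() and the sem_block/sem_unblock primitives correspond to routines that already exist, and even the argument passed to ip_send() is a guess:

    /* tcp_output() (or whoever) just queues the packet and returns quickly */
    void
    tcp_out_enqueue ( struct netbuf *nbp )
    {
        sem_block ( out_lock );         /* mutex protecting the queue */
        queue_add ( tcp_out_q, nbp );
        sem_unblock ( out_lock );
        sem_unblock ( out_sem );        /* signal: work is available */
    }

    /* a dedicated output thread owns the slow path (ARP, emac_send),
     * so the big TCP lock is never held across a hardware send */
    void
    tcp_out_thread ( void )
    {
        struct netbuf *nbp;

        for ( ;; ) {
            sem_block ( out_sem );      /* wait for work */
            sem_block ( out_lock );
            nbp = queue_remove ( tcp_out_q );
            sem_unblock ( out_lock );
            if ( nbp )
                ip_send ( nbp );        /* ip_arp_send / net_send / emac_send */
        }
    }

This would also take the sting out of the ARP case: the TCP side just queues the packet and moves on, and any ARP traffic happens on the output thread's time rather than while the TCP lock is held.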

So, with some new features to display sizes of buffers and queues, away we go again. After 530 transfers it is still going. This has taken 16 minutes (960 seconds), which closely matches 1.8 seconds per transfer times 530. So, a person could estimate 15 minutes for 500 transfers, or 2000 transfers per hour. Each transfer is 200,000 bytes, so we are moving 400 megabytes per hour, or 0.4 gigabytes per hour. Not terrific for a 100 megabit ethernet, but adequate for the things I want to do -- and I'll be happy if it just works reliably.

And after only 1216 transfers we are hung! I see the messages "Send blocked" as well as "Send unblocked" (but I was busy and distracted, so who knows how long it was from one to the other). Also I see:

Kyu input queue size: 737
Kyu has an active console, but will not respond to ping, and all aspects of TCP are hung. There are 3 wang_thr -- all DEAD. Wireshark shows the strange sequence number regression we saw on the previous hang. My bet though is that if I had been on the ball and caught this shortly after the hang, I might have seen an sbwait situation.

Interlude: producer and consumer

I usually handle this with a pair of semaphores. One is used for mutual exclusion (locking) and the other is used for signaling. I have always worried about a brief interval where a race condition seems possible. The producer will acquire the lock and add an "item" to a queue. Then it will release the lock and signal. It would be nice (and perhaps essential) to signal and release the lock as an atomic action.
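
Spelled out, the pattern and the window I worry about look like this. The sem_block/sem_unblock usage follows the tracebacks above, but the rest (mutex, avail, queue_add, queue_remove) is just a sketch:

    /* producer */
    sem_block ( mutex );            /* acquire the lock */
    queue_add ( q, item );
    sem_unblock ( mutex );          /* release the lock */
    /* <-- window: anything can run here, before the signal goes out */
    sem_unblock ( avail );          /* signal: an item is available */

    /* consumer */
    for ( ;; ) {
        sem_block ( avail );        /* wait for the signal */
        sem_block ( mutex );
        item = queue_remove ( q );
        sem_unblock ( mutex );
        process ( item );
    }

If avail behaves as a counting semaphore, the window is harmless -- the signal arrives a little late but is never lost. The worry becomes real only if the primitive can drop a wakeup when nobody happens to be blocked on it at that instant, which, if Kyu's semaphores have any such corner case, would have exactly the flavor of the "thr_unblock" hang.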

Conclusions

I am calling it a day.

First of all, since I added some code to tcp-bsd where it pulls packets off the queue, I have had much longer runs and none of the "thr_unblock" style hangs. Clearly this is some tricky timing issue having to do with the semaphore announcing packet arrivals, but it is progress in that it points to the root of the problem. But the last run of the day went only 272 transfers then got one of these, so I am hardly out of the woods here.

The sbwait problem has more drastic effects. Every aspect of Kyu networking stops. It is as though some thread is holding the interrupt lock, although if that were true my interrupt-driven console would not work. So it is some fundamental semaphore thing. It could even be the same bug as the "thr_unblock" bug, but this time having to do with pulling packets off the Kyu main packet queue. Wouldn't that be something. If I can figure out why packet processing is frozen up, I will have learned something. Here is my guess: Kyu needs to deliver TCP packets via tcp_rcv(), and this needs to get a lock to put them on the TCP input queue. If it is waiting to get that lock, that would explain this hang.
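
Written out, the cycle I am guessing at looks something like this. The lock names are guessed from the tracebacks (user_unlock appears there, so presumably a user_lock exists); none of this is verified:

    /* wang_thr, somewhere under tcp_send()/sosend() -- conjecture */
    user_lock ();
    error = sbwait ( &so->so_snd );     /* sleeps while still holding the lock */
    user_unlock ();

    /* tcp-bsd (or net via tcp_rcv()), wanting to deliver the ACK
     * that would free up send buffer space -- conjecture */
    user_lock ();                       /* waits forever on the lock above */
    tcp_input ( m, iphlen );            /* never runs, so no so_wwakeup() */
    user_unlock ();

If anything along the delivery path blocks on a lock held by the thread sleeping in sbwait(), nothing will ever wake that thread, and every other TCP user piles up behind the same lock -- which would explain why all of TCP freezes while the rest of Kyu stays healthy.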

Look at sem_block_cpu ( sem ). This is designed to do the atomic blocking on a semaphore while holding the INT lock.
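
The window such a routine is meant to close is the classic lost wakeup. A sketch of the hazard, with invented names:

    /* the hazard sem_block_cpu() presumably exists to prevent */
    if ( sem->count == 0 ) {
        /* an interrupt here could sem_unblock() this semaphore before
         * we are on its wait list -- that wakeup would be lost forever */
        block_current_thread_on ( sem );
    }

Doing the test and the block as one atomic step under the INT lock means no wakeup can slip into that gap.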

A handy addition to Kyu would be a "clock" that just counts seconds since bootup or some such. It would let me print times and then be able to evaluate the timing between events (such as Send blocked/unblocked). It turns out I already have this as get_timer_count_s(), which returns a value in seconds, so as my last trick for the day I add this to the Send blocked/unblocked printout and start another run.
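
The change is trivial -- just tag the existing messages with the seconds count, something like this (the format is approximate; get_timer_count_s() is the routine mentioned above):

    /* where sosend() announces it is blocking */
    printf ( "Send blocked at %d seconds\n", get_timer_count_s () );

    /* and where sbwait() returns */
    printf ( "Send unblocked at %d seconds\n", get_timer_count_s () );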

Also, I am convinced that I do need a TCP output queue (and probably a network output queue as well).

