A close look at the backtrace and exact PC address should have enabled me to understand this long ago. It was more or less hidden in plain sight.
What the "u wang_thr" output gives is a captured snapshot. Most importantly, it indicates where the thread will run when resumed. As it is, the thread will resume in change_thread() and then run code that immediately switches to another thread (either tcp-input or tcp-timer, and it doesn't really matter which). If it gets resumed again, it will do exactly the same thing, again and again.
So it is clear why the thread is jammed, but how did we get into this mess? Apparently we took an interrupt at a bad moment. We had decided to switch, and then that decision was frozen in stone. The answer, of course, is to put all of this in a critical section. We need to decide which thread to resume and then resume it as one atomic action.
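Here is a minimal sketch of the idea, not the actual Kyu code. It simulates the interrupt mask with a plain variable so it can be compiled and run on a host machine. Only INT_lock is a name from Kyu; INT_unlock, pick_next(), switch_to(), resched() and the thread structure are stand-ins I made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simulated interrupt mask. On real hardware this would be
 * the interrupt disable bit in the processor status register. */
static int int_locked = 0;

static void INT_lock(void)   { int_locked = 1; }
static void INT_unlock(void) { int_locked = 0; }  /* assumed counterpart */

/* Hypothetical, stripped down thread descriptor. */
struct thread {
    const char *name;
    int ready;
};

static struct thread *cur_thread;

/* Hypothetical policy: pick the first ready thread in the table. */
static struct thread *pick_next(struct thread *tab, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (tab[i].ready)
            return &tab[i];
    return NULL;
}

/* Hypothetical stand-in for the real context switch.
 * It insists that it is never entered with interrupts enabled. */
static void switch_to(struct thread *tp)
{
    assert(int_locked);
    cur_thread = tp;
}

/* Decide and switch as one unit. No interrupt can sneak in between
 * the decision and the switch, so a half-made decision can never be
 * captured in a saved thread state. */
static struct thread *resched(struct thread *tab, size_t n)
{
    struct thread *next;

    INT_lock();
    next = pick_next(tab, n);
    if (next)
        switch_to(next);
    INT_unlock();

    return next;
}
```

In a real kernel the unlock would most likely not be a separate call after the switch. Interrupts would come back on as part of restoring the resumed thread's saved status register. The sketch only shows that the decision and the switch sit inside the same locked region.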
I once tried adding a lock (a call to INT_lock) to thr_unblock to fix this, but that got me into hot water immediately. I got "Panic: do_irq, resume" repeatedly when I did this. This panic is in armv7/interrupts.c at the end of do_irq(), which expects finish_interrupt() to never return.
However, now when I add the INT_lock (uncommenting it) in thr_block, I get none of that. Apparently something else I fixed along the way took care of it. Adding the lock did get me into a situation where every thread ended up DEAD, which forced me to find and fix that bug.
Both thr_unblock() and sem_unblock() boldly advertise that they may be called from interrupt code (as well they should). Something to bear in mind perhaps.
Kyu / [email protected]