Since you asked basic questions I'm going to start this at a basic level. Apologies if it covers some stuff you already know or if I misinterpreted the questions. Note that I haven't actually looked at the patch that went in so this is generally wrt the original.
The first thing is get the word "lock" out of your mind because we aren't really locking anything. Yes, that API is in use but it's only to create a semaphore or baton. Nobody is ever prevented from doing anything. It just happens that on Unix the most portable (i.e. oldest) way of implementing a semaphore is with the advisory locking API. All cooperating processes agree not to proceed unless and until they are able to acquire the exclusive lock on a shared file descriptor, but it's not necessary to ever actually write anything to that descriptor.
Second, the original implementation (not sure if I ever sent that one one in though) actually created a temp file to use as the semaphore fd. But then I discovered that stdout can be locked in the same way, which is simpler. But applying the lock to stdout is just a frill; it could be a temp file, especially if some platform turned out to need it that way. I just figured that stdout is always available, or at least if it's closed you don't have to worry about synchronizing output.
Third, yes, nothing is locked while the child runs. If a shared resource was locked during child runs it would have the effect of re-serializing the build as each supposedly parallel child waited on the lock. So what happens here is really very simple: each child (aka recipe) runs asynchronously, assuming -j of course, and dumps its output to one or two temp files. Only when the child has finished and wants to report results does it enter the queue waiting for the baton. When it gets it, it holds it just long enough to copy its output from the temp files to stdout/stderr and then lets the next guy have his turn. Thus, assuming the average job runs for a significant amount of time (multiples of a write() system call anyway) there will not be much contention on the semaphore and it won't be a bottleneck.
You're right that simply writing to temp files and dumping everything at once when the job finished would be likely to reduce the incidence of garbling even without the semaphore, but not to zero.
It may be that the locking of stdout is only useful on Unix due to the fact that it's inherited into child processes. I don't know what Paul or Frank is thinking, and as mentioned I haven't looked at the current version, but my thinking originally was that Windows could easily handle this using its own far richer set of semaphore/locking APIs. I'd actually expect this to be easier and more natural on Windows than Unix. All that's required is to choose a semaphore to synchronize on, dump output to temp files, and copy it to stdout/stderr only after acquiring the semaphore. And remove the temp files of course.
-David Boyce