On 3 Jan 2003, at 2:12, Henrik Nordstrom wrote:
> Robert Collins wrote:
>
> > Nope. cond_signal is 1 or more releases.
>
> In all the documentation I have seen, cond_signal signals exactly 1 thread per
> call while there are threads waiting on the condition.
It seems to me you have somewhat too bright expectations of threads.
Thread coding is a style of its own. It insists that all shared data be
accessed only under a mutex lock, and it is designed to scale to tens of
CPUs and thousands of concurrent threads. That easily means hundreds of
thousands of mutex locks and unlocks per second. They CAN'T be heavy
calls; they can't always pass through the scheduler, or any hardware
platform would melt down instantly. Therefore they don't.
Mutex locks are primitives, fundamentals, and they are simple. A mutex
lock is just a memory instruction that all CPUs can see happen. If the
mutex is already locked, the scheduler is called and the thread blocks;
if it is not locked, the scheduler is never involved.
An LWP is given a time quantum and runs to use it up in full, so taking
the CPU away on a simple mutex unlock would be wrong and never happens.
The CPU is taken away when the LWP either blocks or its quantum expires
at the system tick; only then is the scheduler called. An LWP may carry
many threads, and the threads library may implement its own scheduler
on top of it. If the threads are not kernel threads, libthread may well
prefer a thread switch on every pass through it, since those switches
are very fast and give a sense of more parallelism. But when a thread
switch implies a full context switch, it means high overhead; libthread
may still do it, but that is a waste.
Ever wondered about mutex deadlocks: "why the hell can't it detect that
it already owns the bloody mutex?" It would be a simple compare of a
long, but implementations don't do it because it is "time-consuming"...
Everything for raw performance. Do you really think the scheduler is
consulted at every mutex? Mutex unlock is even simpler: clear the lock
and that's it. That's fast. To kickstart any blocked thread, the OS
would have to run the scheduler. Slow.
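A toy sketch of that fast path, just to make it concrete. This is not
any real libthread, only an illustration in C11 atomics; sched_yield()
here stands in for whatever blocking mechanism a real implementation uses.

    #include <sched.h>
    #include <stdatomic.h>

    typedef struct { atomic_int locked; } toy_mutex_t;   /* 0 = free, 1 = held */

    void toy_mutex_lock(toy_mutex_t *m)
    {
        int expected = 0;
        /* fast path: one atomic compare-and-swap that every CPU can see.
         * No system call, no scheduler, nothing heavy. */
        while (!atomic_compare_exchange_weak(&m->locked, &expected, 1)) {
            expected = 0;
            /* slow path: the mutex is held.  Only now would a real library
             * involve the scheduler and block the thread. */
            sched_yield();
        }
    }

    void toy_mutex_unlock(toy_mutex_t *m)
    {
        /* unlock is even simpler: clear the word and return.  Nothing here
         * forces a thread switch; a blocked waiter is found by the
         * scheduler later. */
        atomic_store(&m->locked, 0);
    }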
The library must provide decent progress for every thread. So if there
are many threads in a process, they are made to progress by some
lightweight means, such as round-robin. The library MAY (RFC speak)
make a thread switch at suitable points, like a mutex unlock, if that
costs nothing; that is the intra-CPU/intra-process/LWP case. If it is
implemented to make round-robin progress of threads, it may decide to
switch threads at every such switch-point (or every Nth one). But then
again, it may not. It is a matter of tuning and implementation; in any
case, not our business. God knows.
On an SMP system, unlocking a mutex does nothing to magically kickstart
a thread on another CPU. If the other CPU is busy, it keeps working; if
it is idle, it sits halted until a hardware IRQ wakes it, and only then
does it run the scheduler and discover the thread that was unblocked by
the mutex unlock. At best, a mutex unlock appends the unblocked threads
to the tail of the runnable list. The currently running LWP cannot
kickstart another LWP into running; only the kernel can do that, through
the scheduler. The current LWP can only block on something to let the
kernel reschedule, or perhaps yield the CPU. Again, a matter of the
libthread implementation. For the coder: unpredictable.
Cond_signal is documented to unblock "at least one thread", and there
is a good reason for the "at least" part. A condition variable is
little more than a plain boolean flag, and signalling it is little more
than setting that flag to true. When the kernel scheduler runs, it goes
through all runnable threads and hands out time quanta by priority.
With one CPU, only one thread can run at a time, so the cond can only
unblock exactly one thread. But on SMP with many CPUs, many threads run
concurrently and are all put to "compete" for the same mutex; they spin
on it and eventually only one gets through. That is, IF a mutex is used
at all. cond_signal does not require a mutex, and in that case several
threads can potentially be unblocked on an MP system. Whether it is one
or more depends on many things, including microtiming, and that is why
they say "at least one thread".
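This is also why the textbook usage keeps the real condition in its own
flag and re-checks it in a loop around pthread_cond_wait(): however many
threads a signal happens to wake, only those that find the flag set
proceed. The names below are only illustrative:

    #include <pthread.h>

    pthread_mutex_t mx = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
    int work_available = 0;              /* the real "condition" lives here */

    void wait_for_work(void)
    {
        pthread_mutex_lock(&mx);
        while (!work_available)          /* not "if": several threads may wake */
            pthread_cond_wait(&cv, &mx);
        work_available = 0;              /* claim the work under the mutex */
        pthread_mutex_unlock(&mx);
    }

    void post_work(void)
    {
        pthread_mutex_lock(&mx);
        work_available = 1;
        pthread_cond_signal(&cv);        /* "at least one" waiter will re-check */
        pthread_mutex_unlock(&mx);
    }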
> > Huh? The state should be (slightly compressed):
> > thread S, W1, W2
> > active B-waitingforsignal B-waitingforsignal
>
> for clarity this should be added:
> mutex_lock() B-waitingforsignal B-waitingforsignal
>
> > cond_signal() B-inpthread_mutex_lock B-waitingforsignal
> > cond_signal() B-inpthread_mutex_lock B-inpthread_mutex_lock
> > mutex_unlock() active B-inpthread_mutex_lock
> > active mutex_unlock() active
> > active active active
You miss a few very crucial steps in that list: the runnable-thread
scheduler and the CPU context/stack/VM switches. The table above is an
assumption, the result of vague documentation meant to give a general
idea while avoiding details. Nowhere will a single serious threading
model tell you that threads are unblocked in exactly that way. On the
contrary, if you look closer, they will tell you that threads run at
unspecified moments and that the only way to force an order is mutex
synchronisation. Only hitting a locked mutex is guaranteed to cause a
thread switch. They say "unspecified" to account for all the blabla I
wrote above and an awful lot of implementation details not mentioned,
which depend on UP/MP, thread scope, mutex scope, number of CPUs,
threads, processes, and hell knows what else. A cond variable and its
mutex are not related in any way as far as the OS is concerned; the
mutex unlock here is no more important than any other mutex unlock
anywhere in the code, and whether a cond_signal was made does not
change its importance.
So if you think that a mutex_unlock after a signal always causes a
thread switch, you are wrong. If you think it never causes a thread
switch, you are wrong. If you think a switch is likely, you are wrong;
if you think it is unlikely, you are wrong. If you want to know exactly,
your only option is to dig into the kernel and libthread sources and
understand one specific implementation. There is obviously a hell of a
lot more to the details than I could ever type here. But the point is
that each thread is an independent entity, and the coder must make no
assumptions about when it makes progress unless he forces that by
algorithm.
All of the above is completely irrelevant for coding, unless you are
stuck with some wrong assumptions. The only assumption I've found to
hold is that _anything_ that CAN happen WILL happen, about 100 times
sooner than you'd expect. For example, take two threads changing
(read-modify-write) a single int once per request without a mutex lock.
The likelihood that they hit it concurrently seems negligible,
especially on a UP system, but you can bet money that such a program
will break in less than a day. That means a thread switch happened at
the exact moment when one thread had read the variable and modified it,
but not yet written it back. Imagine: nanosecond timing.
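A minimal demonstration of that race, in case anyone wants to watch it
bite (my own example, not squid code); expect it to print less than
2000000 most of the time:

    #include <pthread.h>
    #include <stdio.h>

    static int counter;                  /* deliberately NOT protected */

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++)
            counter++;                   /* read, modify, write: three steps */
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* updates are lost whenever a switch (or the other CPU) hits
         * between the read and the write of counter++ */
        printf("counter = %d\n", counter);
        return 0;
    }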
> > Ok, what's the high overhead in the current model? I'll check the code in
> > detail in the weekend.
The current overhead is quite small. I meant more that we should not
introduce any new overhead.
> > Ah. Well, what about:
> > pthread_mutex_lock(&queuemutex)
> > do {
> > schedule_io_request()
> > pthread_cond_signal()
> > } for < requests
> > pthread_mutex_unlock(&queuemutex)
> > pthread_mutex_lock(&queuemutex)
> > pthread_mutex_unlock(&queuemutex)
This will lose all signals but the last one; only one thread will run.
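How many of those signals coalesce is the implementation's business. The
portable way around it is to keep the real state in a counter that the
workers re-check, so it doesn't matter how many wakeups actually arrive.
A rough sketch with invented names, not the actual squid code:

    #include <pthread.h>

    pthread_mutex_t queuemutex = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t  queuecond  = PTHREAD_COND_INITIALIZER;
    int pending;                         /* protected by queuemutex */

    void schedule_requests(int n)
    {
        pthread_mutex_lock(&queuemutex);
        pending += n;                    /* state survives coalesced signals */
        pthread_cond_broadcast(&queuecond);
        pthread_mutex_unlock(&queuemutex);
    }

    void *worker_thread(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&queuemutex);
            while (pending == 0)
                pthread_cond_wait(&queuecond, &queuemutex);
            pending--;                   /* claim one request */
            pthread_mutex_unlock(&queuemutex);
            /* ... perform the io request ... */
        }
        return NULL;
    }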
> By intention the main thread never uses pthread_mutex_lock, only
> pthread_mutex_trylock. This to avoid lock congestion.
Yes, when we schedule per request. But if we append to a local queue
and, after the comm run, append the whole queue to the requestqueue, we
would want a thread to run. Creating a reliable thread switch for a
specific job is quite a brain exercise. It should be possible with two
mutexes alone, but try it, knowing that a mutex unlock may, but also
might not, cause a thread switch...
I thought we might send only one signal to the worker threads if there
is work to do and there are idle threads, and then let the threads
themselves make sure they wake up enough workers. We only make sure
that the first worker reliably gets a chance to run before we proceed
with poll().
Main:
  while comm_loop {
    ...comm_stuff...                // gather io-requests to ioqueue
    if (!ioqueue) continue;
    m_lock(queue);
    append(request, ioqueue);       // send requests to worker threads
    c_signal(queue, run);
    // to make sure the backsignal isn't made before we block,
    // unlock queue later..
    m_lock(main);
    if (idle_threads > 0) {
      m_unlock(queue);
      c_wait(main, allrun);         // maybe timed
    } else {
      m_unlock(queue);
    }
    m_unlock(main);
    // by here at least the 1st iothread has reliably been run
  }
io_thread:
  while () {
    m_lock(queue);
    idle_threads++;
    c_wait(queue, run);
    pop(request);
    idle_threads--;
    if (request && idle_threads > 0) {
      c_signal(queue, run);
    } else {
      m_lock(main);
      c_signal(main, allrun);
      m_unlock(main);
    }
    m_unlock(queue);
    ... do_aio_stuff()
  }
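For what it's worth, here is roughly how that sketch could look as real
pthread calls. This is my own rough translation, not tested code: it
collapses the two mutexes into one and adds predicate flags (hence the
while loops), which sidesteps the backsignal race the second mutex was
guarding against; the request type and all names are invented.

    #include <pthread.h>
    #include <stddef.h>

    typedef struct request { struct request *next; } request_t;

    static pthread_mutex_t queue_mx = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  run_cv   = PTHREAD_COND_INITIALIZER;  /* "queue,run"   */
    static pthread_cond_t  taken_cv = PTHREAD_COND_INITIALIZER;  /* "main,allrun" */
    static request_t *queue_head;
    static int idle_threads;
    static int handed_off;          /* set once a worker has dequeued something */

    /* main/comm thread: hand a batch of requests to the workers, then wait
     * until the first worker has reliably run before going back to poll() */
    static void submit_requests(request_t *list)
    {
        pthread_mutex_lock(&queue_mx);
        queue_head = list;               /* simplified: real code would append */
        handed_off = 0;
        pthread_cond_signal(&run_cv);
        while (idle_threads > 0 && !handed_off)
            pthread_cond_wait(&taken_cv, &queue_mx);   /* maybe a timed wait */
        pthread_mutex_unlock(&queue_mx);
    }

    /* worker (io) thread */
    static void *io_thread_fn(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&queue_mx);
            idle_threads++;
            while (queue_head == NULL)
                pthread_cond_wait(&run_cv, &queue_mx);
            idle_threads--;
            request_t *req = queue_head;
            queue_head = req->next;
            if (!handed_off) {
                handed_off = 1;
                pthread_cond_signal(&taken_cv);  /* tell main the hand-off happened */
            }
            if (queue_head != NULL && idle_threads > 0)
                pthread_cond_signal(&run_cv);    /* wake the next idle worker */
            pthread_mutex_unlock(&queue_mx);
            /* ... do_aio_stuff(req) ... */
        }
        return NULL;
    }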
------------------------------------
Andres Kroonmaa
CTO, Microlink Data AS
Tel: 6501 731, Fax: 6501 725
Pärnu mnt. 158, Tallinn
11317 Estonia