[GE users] semaphore leftovers

David Farrell d-farrell2 at northwestern.edu
Thu Feb 10 00:38:34 GMT 2005


On Feb 9, 2005, at 4:33 PM, Reuti wrote:

> Hi,
>
> are you using MPICH as you mentioned cleanipcs? One solution would be to
> compile MPICH without shared memory support. cleanipcs will simply remove all
> ipcs stuff from one user on a node. If there are two jobs from the same user,
> the second might be killed also by removing the "wrong" semaphores by accident.
> But instead of a cron job, the cleanipcs could be put in the stop_proc_args or
> queue_epilog.
Yes, this is MPICH, and I will give this a try. The issue here is that 
the users tend to kill a job with Ctrl-C when running in interactive 
mode, rather than using a more elegant technique. In a way the users 
are bypassing SGE by doing this, which may be part of the problem. I 
think an epilog script that cleans up after a job has ended may work 
well, but I wonder if there are other ways (just so I know). It appears 
they are using the cluster to debug their code, and in doing so they 
end up using it like a series of workstations through interactive 
shells. Perhaps there is a better way for them to use the nodes in this 
manner?
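
For reference, here is a minimal sketch of what such an epilog cleanup 
might look like in Python. It assumes the Linux `ipcs -s` column layout 
(key, semid, owner, perms, nsems) and that `ipcrm -s <semid>` is 
available on the nodes. The function names are made up for 
illustration. As noted above, it removes every semaphore set the user 
owns on the node, so it is only safe when that user has a single job 
there.

```python
import subprocess


def semaphore_ids_for(ipcs_output, user):
    """Return the semids owned by `user` in `ipcs -s` style output."""
    ids = []
    for line in ipcs_output.splitlines():
        fields = line.split()
        # Assumed data row layout: key semid owner perms nsems.
        # Title, header and blank lines fail these checks and are skipped.
        if len(fields) >= 5 and fields[1].isdigit() and fields[2] == user:
            ids.append(int(fields[1]))
    return ids


def clean_semaphores(user, dry_run=True):
    """List the user's semaphore sets; remove them unless dry_run is set."""
    out = subprocess.run(["ipcs", "-s"], capture_output=True, text=True,
                         check=True).stdout
    ids = semaphore_ids_for(out, user)
    if not dry_run:
        for semid in ids:
            # A set that has already gone away makes ipcrm fail; ignore that.
            subprocess.run(["ipcrm", "-s", str(semid)], check=False)
    return ids
```

The job owner's name would have to be handed to the script by the 
epilog, for example as a command-line argument. Running it with 
dry_run=True first shows which ids would be removed.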

>
> Another solution for dynamically linked applications:
>
> A wrapper lib which will trap semget(), shmget(), msgget() and so on which is
> loaded before the job by using LD_PRELOAD for this wrapper lib. This wrapper
> will call the real semget() and remember the assigned ids on the return to the
> application. When you shut down the application by qdel, you would know all the
> ids of the semaphores you have to remove. It's just an idea, but it would be a
> cool addition to SGE.
That does sound interesting, but I am not sure my skills are up to the 
task.
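
The cleanup side of the idea at least looks manageable. A rough sketch 
in Python, assuming the wrapper appends one semid per line to a per-job 
file (the file format and the function names here are hypothetical, 
not anything SGE provides):

```python
import subprocess


def read_recorded_ids(path):
    """Read semids, one per line, skipping blanks, junk and duplicates."""
    ids = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.isdigit() and int(line) not in ids:
                ids.append(int(line))
    return ids


def ipcrm_commands(ids):
    """Build one `ipcrm -s <semid>` argument list per recorded id."""
    return [["ipcrm", "-s", str(semid)] for semid in ids]


def remove_recorded(path):
    """Remove only the semaphore sets that the job itself created."""
    for cmd in ipcrm_commands(read_recorded_ids(path)):
        # Sets the application already removed make ipcrm fail; ignore that.
        subprocess.run(cmd, check=False)
```

Because only the recorded ids are touched, a second job from the same 
user on the node would keep its semaphores.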

Thanks again,

Dave

>
> Cheers - Reuti
>
>
> Quoting David Farrell <d-farrell2 at northwestern.edu>:
>
>> I am running into the issue where semaphores build up and create a
>> situation in which users can no longer start jobs, and on occasion, jobs
>> die prematurely. The errors point to the semaphore issue and cleaning
>> it manually has become a bother, as users wish to use the machine for
>> testing (so abnormal exits are common). Is there any good solution to
>> this problem? I have heard that making a cleanup script into a cron job
>> sometimes results in jobs being killed, so I would be interested in
>> other possibilities. In addition, I sometimes see a situation in which
>> running the cleanipcs script as root does not clean out some of the
>> semaphores. Is there any solution?
>>
>> Thanks in advance,
>>
>> Dave
>>
>>
>>
>> David E. Farrell
>> Graduate Student
>> Mechanical Engineering
>> Northwestern University
>> email: d-farrell2 at northwestern.edu
>
>
>
David E. Farrell
Graduate Student
Mechanical Engineering
Northwestern University
email: d-farrell2 at northwestern.edu


