[GE users] complex use of complexes

reuti reuti at staff.uni-marburg.de
Fri May 14 12:40:30 BST 2010


On 12.05.2010 at 23:36, gragghia wrote:

> One idea I had was to modify the terminate_method for this job's queue 
> so that it isn't killed when the rank zero MPI process uses too much 
> RAM.  The side effect would be that I couldn't stop the job using qdel 
> for any reason (without making the terminate script more 
> sophisticated).  Surely there is a cleaner way to exempt one job from 
> the h_vmem limits?

To be honest, I fear there is no clean solution available in SGE right now. Do all your machines have 128 GB or more? Depending on your MPI requirements, you could try to use machines exclusively:

- the master process of a job gets one machine of its own (no h_vmem needed at all)
- all other processes are bundled per machine, but since the exclusive attribute applies per job (not per process), you can have an allocation of:

1 - master process on slave1
64 - processes on slave2
64 - processes on slave3

with:

qsub -pe mpich 129 -masterq all.q@pc1 -q slave.q@@commonhosts -l exclusive test.sh

The available slots in slave.q could then limit the number of allowed processes per slave node.
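A minimal sketch of how such an exclusive setup could look; the complex definition, host names, and slot counts below are illustrative assumptions, not taken from this thread:

```shell
# Sketch only -- names and values are illustrative assumptions,
# for SGE 6.2u3 or later (which introduced exclusive host usage).

# 1. Define a per-host boolean consumable "exclusive" (qconf -mc), e.g.:
#      exclusive  excl  BOOL  EXCL  YES  YES  0  1000
#    and enable it on each execution host (qconf -me <host>):
#      complex_values exclusive=true

# 2. Cap the processes per slave node via the queue's slot count:
#      qconf -mq slave.q    ->    slots  64

# 3. Submit: one master process alone on pc1 (no h_vmem limit needed),
#    plus 2 x 64 worker processes on hosts of the @commonhosts group:
qsub -pe mpich 129 -masterq all.q@pc1 -q slave.q@@commonhosts \
     -l exclusive test.sh
```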

Unfortunately, the handling of disjoint -masterq and -q requests was fixed in the past, but it seems to be broken again (see the following post).

-- Reuti


> - Gerald
> 
> On 5/11/2010 4:37 PM, reuti wrote:
>> On 11.05.2010 at 21:52, gragghia wrote:
>> 
>> 
>>> Are you suggesting to break the job up into two jobs with different
>>> resource requests?  They would have to be running at the same time
>>> (something that I don't think you can guarantee), and MPI wouldn't
>>> know
>>> how to communicate with the processes of a different job.
>>> 
>> In principle it is possible to hijack slots from another parallel job.
>> So you could submit one job requesting 128 GB, plus one parallel job
>> (which only runs a `sleep` or the like inside, with
>> "job_is_first_task FALSE" set [it could also wait for a file "+DONE"
>> written by the master job, to quit automatically]) with e.g. 7 slots
>> requesting the usual 2 GB per slot. The master job can then launch
>> processes with `qrsh -inherit` into the slots of the other job, if
>> you change $JOB_ID to the one of the 7-slot job. Depending on the
>> MPI version used, it might be tricky anyway.
>> 
>> The bigger problem, as you mentioned, is how to force SGE to run both
>> jobs at the same time or not at all.
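The slot-hijacking recipe quoted above could look roughly like the two job scripts sketched here; every file name, host, and slot count is an illustrative assumption:

```shell
# Sketch only -- all names, hosts, and counts are illustrative assumptions.

# sleeper.sh: placeholder job holding 7 x 2 GB slots. Its PE must set
# "job_is_first_task FALSE" so all 7 slots remain free for qrsh -inherit.
# Submitted e.g. with:  qsub -pe mpich 7 -l h_vmem=2G sleeper.sh
#
#   until [ -f "$HOME/+DONE" ]; do sleep 60; done   # quit once marker appears

# master.sh: the single 128 GB process.
# Submitted e.g. with:  qsub -l h_vmem=128G master.sh
#
#   JOB_ID=<id of the sleeper job>           # pretend to be the placeholder job
#   export JOB_ID
#   qrsh -inherit slave2 /path/to/worker &   # start workers in hijacked slots
#   wait
#   touch "$HOME/+DONE"                      # release the placeholder job
```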
>> 
>> -- Reuti
>> 
>> 
>> 
>>>> Would it be possible to restructure the job so that the first process
>>>> is a "master" requesting 128G for a single process, and that single
>>>> process then fires off the remaining parts, each requesting 2G?
>>>> 
>>>> 
>>> -- 
>>> Gerald Ragghianti
>>> 
>>> Newton HPC Program http://newton.utk.edu/
>>> Office of Information Technology
>>>   Research Computing Support
>>>   Professional Technical Services
>>> 
>>> The University of Tennessee
>>> 2309 Kingston Pike
>>> Knoxville, TN 37996
>>> Phone: 865-974-2448
>>> 
>>> /-------------------------------------\
>>> | One Contact       OIT: 865-974-9900 |
>>> | Many Solutions         help.utk.edu |
>>> \-------------------------------------/
>>> 
> 
> 

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=257269

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].



More information about the gridengine-users mailing list