[GE users] complex use of complexes

gragghia gragghia at utk.edu
Tue May 11 21:09:35 BST 2010


That's what I was thinking.  The memory usage has to be managed on each 
node because that is where the potential over-subscription problem is.

- Gerald

>> Another workaround would be to make h_vmem consumable = JOB instead of YES (I believe I have the terminology correct).
>>
>> You would then request h_vmem as the sum for the entire job and not per task.
>>
>> You'd have to train your users appropriately to think in terms of summing their RAM requirements.
>>      
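>> A minimal sketch of that workaround (assuming the standard qconf -mc column layout; the PE name, slot count, and sizes below are illustrative):
>>
>>      # qconf -mc -- make h_vmem a per-JOB consumable
>>      #name    shortcut  type    relop  requestable  consumable  default  urgency
>>      h_vmem   h_vmem    MEMORY  <=     YES          JOB         0        0
>>
>>      # then request the sum for the whole job: 128G + 479 x 2G = 1086G
>>      qsub -pe mpi 480 -l h_vmem=1086G job.sh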
> I fear this won't work: the requested amount will be debited once, and only from the master queue. Consumption on the slave nodes is ignored, even though it still takes place.
>
> -- Reuti
>
>
>    
>>
>> From: gragghia [mailto:gragghia at utk.edu]
>> Sent: Monday, May 10, 2010 5:44 PM
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] complex use of complexes
>>
>> I was planning to set it up exactly as you describe; however, it won't work for this particular job (as far as I know).  The problem is that when a single large MPI job makes a complex request (-l h_vmem=128G), that request is understood to be _per process_: a 480-process job with "-l h_vmem=128G" will actually be requesting 61TB of RAM.  I need to be able to request 128GB of RAM for only the first process of the job and then 2GB of RAM for all the other processes.  This is not possible (please prove me wrong though), so I am looking at ways to make an exception for this particular job.
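>>
>> For example, with per-slot semantics (the PE name here is illustrative):
>>
>>      qsub -pe mpi 480 -l h_vmem=128G job.sh   # scheduler reserves 480 x 128G, about 61TB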
>>
>> - Gerald
>>
>> On 5/10/2010 6:08 PM, iank wrote:
>> Hi Gerald,
>>
>> On Mon, May 10, 2010 at 2:43 PM, gragghia <gragghia at utk.edu> wrote:
>> We want to add management of RAM usage on our system by setting
>> h_vmem to consumable and adding a reasonable default value.  The problem
>> is that we have one parallel job that needs 128GB on its rank-zero
>> "master" process but only 2GB of RAM on the slave processes.
>> Requesting h_vmem=2G will cause the rank-zero process to fail, but a
>> job requesting h_vmem=128G will not be able to run because the other
>> compute nodes do not have that much RAM.  How can we make this work?  If
>> it isn't possible, can we make an exception to the management of vmem
>> so that this job doesn't get killed due to over-use?
>>
>> Thanks,
>>
>>
>> You should be able to make h_vmem consumable and requestable. That way, the default can be set at 2GB: if a job is not submitted with -l h_vmem=XG, it will grab 2GB by default, but it can request more. The "master" process can be submitted with -l h_vmem=128G, thus ensuring it can only run on the one system with the correct amount of RAM. You can also require the -l h_vmem flag so that every job submitted must ask for the proper amount of RAM or else it will not run.
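>>
>> A sketch of that setup (assuming the standard qconf -mc column layout; the host name and RAM size are illustrative):
>>
>>      # qconf -mc -- h_vmem as a per-slot consumable with a 2G default
>>      #name    shortcut  type    relop  requestable  consumable  default  urgency
>>      h_vmem   h_vmem    MEMORY  <=     YES          YES         2G       0
>>      # (set requestable to FORCED to make an explicit -l h_vmem request mandatory)
>>
>>      # on each execution host, advertise the installed RAM, e.g.:
>>      # qconf -me bignode   ->   set: complex_values h_vmem=256G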
>>
>> Ian
>> -- 
>> Ian Kaufman
>> Research Systems Administrator
>> UC San Diego, Jacobs School of Engineering ikaufman AT ucsd DOT edu

-- 
Gerald Ragghianti

Newton HPC Program http://newton.utk.edu/
Office of Information Technology
   Research Computing Support
   Professional Technical Services

The University of Tennessee
2309 Kingston Pike
Knoxville, TN 37996
Phone: 865-974-2448

/-------------------------------------\
| One Contact       OIT: 865-974-9900 |
| Many Solutions         help.utk.edu |
\-------------------------------------/
