[GE users] qhost MEMUSE

futurity neil at futurity.co.uk
Tue Feb 17 18:52:26 GMT 2009


We are using vf as a consumable.  

Our users have used our legacy grid for many years without any restrictions.
As a result they are very worried about configuring the grid to kill off
their jobs when they exceed specific limits (although I see the benefits of
killing off such rogue jobs).

Many users have the qsub commands so deeply embedded in their scripts
(scripts that submit other scripts, etc.) that it will be really hard for
them to change their arguments.  Users submitting jobs with large memory
requirements will have to be forced to state their required memory, as
otherwise their jobs will push the machine into swap, but for most users
we'll have to use a sensible default value for vf.

To do this I have to try to calculate a memory value that most jobs will
fall under.  I can't just make this a large number, otherwise our 8-core
machines will have their vf fully allocated while cores still sit unused.  I
also have to police user jobs, so that if I see jobs exceeding the default
value I can force their owners to declare their memory requirements.
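
As a rough sketch of that trade-off (Python; the function names and the
16 GB / 1 GB figures are purely illustrative assumptions, not values from our
cluster): the largest default that still lets every slot fill is the host's
usable memory divided by its slot count, and any job whose measured peak
exceeds that default is a candidate for a mandatory explicit request.

```python
def max_default_vf_mb(host_mem_mb, slots, reserved_mb=0):
    """Largest per-job default (in MB) that still lets all slots fill.

    reserved_mb is memory held back for the OS and daemons (an assumption;
    pick a value that matches what the host uses when idle).
    """
    if slots <= 0:
        raise ValueError("slots must be positive")
    usable = host_mem_mb - reserved_mb
    if usable <= 0:
        raise ValueError("nothing left after the reservation")
    return usable // slots


def jobs_over_default(peak_mb_by_job, default_mb):
    """Return the ids of jobs whose peak usage exceeded the default, sorted."""
    return sorted(job for job, peak in peak_mb_by_job.items() if peak > default_mb)


if __name__ == "__main__":
    # Hypothetical 8-slot host with 16 GB of RAM, 1 GB held back for the OS.
    default = max_default_vf_mb(16384, 8, reserved_mb=1024)
    print(default)
    # Hypothetical peak usage per job id, in MB.
    print(jobs_over_default({"101": 900, "102": 2500, "103": 1920}, default))
```

Where the peak figures come from is left open here; they would have to be
gathered per job by whatever monitoring turns out to be workable.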

After reading your and Andreas' responses, I can now see why it's not quite
as simple as monitoring the memory consumed by a user's job.  I think I've
also seen that jobs have a knock-on effect on the OS, so a job may be 256MB
in size, but may cause the OS to use up additional memory in support
services.  Do the grid processes on the machine also consume significantly
more memory with each additional job that runs?
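
For the per-job part, summing over a job's process tree as Reuti suggested
earlier in the thread might look roughly like this.  It is only a sketch: the
function name is made up, and it assumes the process table has already been
collected as (pid, ppid, rss_kb) tuples, for example from the output of
"ps -e -o pid=,ppid=,rss=" on Linux.

```python
from collections import defaultdict


def tree_rss_kb(processes, root_pid):
    """Sum the resident memory (kB) of root_pid and all of its descendants.

    processes: iterable of (pid, ppid, rss_kb) tuples describing every
    process on the host.
    """
    children = defaultdict(list)
    rss = {}
    for pid, ppid, rss_kb in processes:
        children[ppid].append(pid)
        rss[pid] = rss_kb

    total = 0
    stack = [root_pid]
    seen = set()  # guards against a malformed table containing a cycle
    while stack:
        pid = stack.pop()
        if pid in seen:
            continue
        seen.add(pid)
        total += rss.get(pid, 0)
        stack.extend(children.get(pid, []))
    return total


if __name__ == "__main__":
    # Hypothetical table: pid 10 stands in for a job's top-level process.
    table = [(1, 0, 100), (10, 1, 50), (11, 10, 200), (12, 10, 300), (20, 1, 999)]
    print(tree_rss_kb(table, 10))
```

Resident size counts pages shared between processes once per process, so the
sum tends to overstate what a job really uses, and it says nothing about the
extra memory the OS spends on the job's behalf.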

I'm wondering if the easiest solution is to measure the memory use when no
jobs are running on a machine, then submit lots of jobs of the same type
until the machine's slots are filled.  Subtract the "no job memory used"
from the "all slots filled memory used" and then divide the result by
the number of slots?
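
In code the arithmetic would be something like the following (a sketch only;
the function name and the sample figures are invented for illustration).
Both readings should be the "used" figure with buffers and caches excluded,
since those shrink under pressure.

```python
def per_slot_mb(idle_used_mb, full_used_mb, slots):
    """Estimate memory per job as (full - idle) / slots, in MB."""
    if slots <= 0:
        raise ValueError("slots must be positive")
    if full_used_mb < idle_used_mb:
        raise ValueError("full reading is below the idle reading")
    return (full_used_mb - idle_used_mb) / slots


if __name__ == "__main__":
    # Hypothetical readings: 600 MB used when idle, 4600 MB with 8 slots busy.
    print(per_slot_mb(600, 4600, 8))
```

This gives an average per job of that type, including any knock-on memory
the OS uses, but it hides the spread between individual jobs.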

Neil

-----Original Message-----
From: reuti [mailto:reuti at staff.uni-marburg.de] 
Sent: 17 February 2009 18:08
To: users at gridengine.sunsource.net
Subject: Re: [GE users] qhost MEMUSE

Am 17.02.2009 um 18:14 schrieb futuritymmx:

> Yes we are experiencing jobs not being able to reserve memory.  At 
> such times the physical and swap memory appears to have been totally 
> used up.
>
> Thanks to your last response about the difference between the "free -m"
> value and the "qhost" value, it appears that free memory may be
> used by buffers and caches, but when processes require all the
> memory these buffers and caches disappear, as expected.
>
> I'm just trying to track down which users are submitting the largest 
> memory jobs so that they can provide accurate "vf" values to qsub.  As 
> you say, you have to track down the sum of all the memory usage by all 
> the processes created by each job.

You made vf consumable? Another option is to use h_vmem in a similar manner.
The difference is that h_vmem will be enforced, so jobs are killed if
they consume too much memory. vf is only guidance.


-- Reuti

> Scary task!
>
> Neil
>
> -----Original Message-----
> From: reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: 17 February 2009 13:31
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] qhost MEMUSE
>
> Hi,
>
> Am 16.02.2009 um 20:43 schrieb futurity:
>
>> Thanks Reuti.
>>
>> Is there any easy way to gather job memory usage?
>
> Well, you could sum up in a script the consumption of all processes 
> belonging to the sge_execd. Do you need this to get information 
> about memory used by local interactive usage of a workstation outside 
> of SGE?
>
> -- Reuti
>
>
>> Regards
>>
>> Neil
>>
>> -----Original Message-----
>> From: reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: 16 February 2009 17:38
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] qhost MEMUSE
>>
>> Hi,
>>
>> Am 16.02.2009 um 18:22 schrieb futurity:
>>
>>> I was wondering if the MEMUSE value returned by "qhost" represents 
>>> the memory used by all processes on a machine, or just the memory 
>>> used by grid jobs running on it?
>>
>> It's from all processes on a node. It's the same figure you also get from 
>> a command like:
>>
>> $ free -m
>>
>> (or -g) next to "+/- buffers", i.e. it is system information. Otherwise 
>> the output would read zero in an empty cluster.
>>
>> -- Reuti
>>
>>
>>> Regards
>>>
>>> Neil
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?
>> dsForumId=38&dsMessageId=107424
>>
>> To unsubscribe from this discussion, e-mail: [users- 
>> unsubscribe at gridengine.sunsource.net].
>>

______________________________________________________________________
This email has been scanned by the MessageLabs Email Security System.
For more information please visit http://www.messagelabs.com/email
______________________________________________________________________


------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=108317

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


