[GE users] Strange Consequence of Changing h_vmem .. N1GE 6.1

Reuti reuti at staff.uni-marburg.de
Sat Sep 29 15:10:17 BST 2007

Am 29.09.2007 um 04:33 schrieb Graham Jenkins:

> We're running N1GE 6.1 on an AMD-64 cluster with Scientific Linux 5,
> where each node has 2 CPUs (and therefore 2 slots). And we're using
> environment-modules 3.2.4 to adjust the path and other environment
> variables for specific applications like the Intel compilers, etc.
> And we have a problem where a job running in one slot on a CPU can
> progressively consume almost all of the available 4 GB of memory,
> thereby causing problems with a job running in the other slot.
> To get around this, we decided to change the value of h_vmem on our
> 'Short' queue from INFINITY to 2.0g, with a view to subsequently
> changing it on our 'Default' queue.
> This actually achieved the desired result. But there was one
> unanticipated consequence: our 'module load' statement no longer
> worked for jobs in 'sque'; it failed with nasty messages like:
>   Tcl_InitNotifier: unable to start notifier thread
>   ../bin/pi3a: error while loading shared libraries:
>   libmpi_f77.so.0:cannot open shared object file: No such file or
>   directory
> The only differences between the 2 queues on the cluster (apart  
> from the
> hostlist) are as follows:
>   sdiff -s /tmp/sq /tmp/dq
>   qname  sque     | qname  dque
>   seq_no 10       | seq_no 100
>   h_rt   00:15:00 | h_rt   INFINITY
>   h_vmem 2.0g     | h_vmem INFINITY
> So .. what did we do wrong here??

Along with h_vmem, the limits h_data and h_stack will also be set.
Some applications need the stack limit to be smaller: if h_vmem and
h_stack are both INFINITY all is fine, but once h_vmem is set to a
finite value, h_stack needs to be considerably smaller; around 128M
is often fine.
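For example, a job could request a smaller stack limit explicitly alongside h_vmem (a minimal sketch; the 2.0G/128M values are illustrative, taken from this thread, and `pi3a` stands in for any job command). Inside the job, the shell's `ulimit` builtin shows what limits the execution daemon actually applied:

```shell
#!/bin/sh
# Hypothetical SGE job script sketch: request a smaller h_stack so that
# thread creation (e.g. Tcl's notifier thread) still succeeds under a
# finite h_vmem. The "#$" lines are SGE directives, comments to sh.
#$ -l h_vmem=2.0G
#$ -l h_stack=128M

# Verify the limits the job actually runs under:
echo "virtual memory limit (kB): $(ulimit -v)"
echo "stack limit (kB):          $(ulimit -s)"
```

Alternatively, the queue itself can set the stack limit: `qconf -mq sque` and change the `h_stack` line from INFINITY to something like 128M.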

Any changes when you try this?

-- Reuti

To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net
