[GE users] Strange Consequence of Changing h_vmem .. N1GE 6.1

Graham Jenkins Graham.Jenkins at its.monash.edu.au
Sat Sep 29 03:33:41 BST 2007


We're running N1GE 6.1 on an AMD64 cluster with Scientific Linux 5,
where each Node has 2 CPUs (and therefore 2 Slots). We're using
environment-modules 3.2.4 to adjust PATH and other environment variables
for specific applications like the Intel compilers.
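
For context, a job script here looks something like the sketch below
(the module name, init-script path and MPI details are illustrative,
not our exact setup):

  #!/bin/sh
  #$ -S /bin/sh
  #$ -cwd
  # Make 'module' available in batch shells; the init path may differ
  . /etc/profile.d/modules.sh
  # Illustrative module name
  module load openmpi
  mpirun -np 2 ../bin/pi3a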

We have a problem where a job running in one Slot on a CPU can
progressively consume almost all of the available 4 GB of memory, thereby
causing problems for the job running in the other Slot.

To get around this, we decided to change the value of h_vmem on our
'Short' queue from INFINITY to 2.0g, with a view to subsequently
changing it on our 'Default' queue.
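
For the record, the change was just the queue's h_vmem limit; the
one-line qconf equivalent would be roughly:

  qconf -mattr queue h_vmem 2.0g sque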

This actually achieved the desired result, but there was one
unanticipated consequence: our 'module load' statements no longer worked
for jobs in 'sque', which came out with nasty messages like:

  Tcl_InitNotifier: unable to start notifier thread
  ../bin/pi3a: error while loading shared libraries: libmpi_f77.so.0:
  cannot open shared object file: No such file or directory
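
A quick way to compare what jobs actually inherit in each queue would
presumably be to print the resource limits from inside a job in each,
e.g.:

  echo 'ulimit -a' | qsub -q sque -cwd -j y -o sque.limits
  echo 'ulimit -a' | qsub -q dque -cwd -j y -o dque.limits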

The only differences between the 2 queues on the cluster (apart from the
host list) are as follows:

  sdiff -s /tmp/sq /tmp/dq

  qname  sque     | qname  dque
  seq_no 10       | seq_no 100
  h_rt   00:15:00 | h_rt   INFINITY
  h_vmem 2.0g     | h_vmem INFINITY  
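
(For completeness: /tmp/sq and /tmp/dq above are just the two queue
configurations, presumably dumped along these lines:)

  qconf -sq sque > /tmp/sq
  qconf -sq dque > /tmp/dq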

So .. what did we do wrong here??

-- 
Graham Jenkins
Senior Software Specialist, E-Research

Email: Graham.Jenkins at its.monash.edu.au
Tel:   +613 9905-5942
Mob:   +614 4850-2491
