[GE users] Errors after setting h_vmem to 16G and consumable

reuti reuti at staff.uni-marburg.de
Tue Feb 24 11:52:28 GMT 2009


Am 23.02.2009 um 22:59 schrieb prentice:

> I set h_vmem to 16G on all of my execution hosts like this:
>
> for i in $(seq -w 64); do qconf -mattr exechost complex_values
> h_vmem=16G node${i}; done
>
> looking at the hosts in qmon shows that this worked. I then set h_vmem
> to be consumable using qmon, with a default of 2G:
>
> qconf -sc | grep h_vmem
> h_vmem              h_vmem     MEMORY      <=    YES         YES        2G       0
>
> Now when I submit a job, it runs briefly (I have sleep statements, so
> the program should run for at least 90 seconds), and then the state
> goes to 'dr'. All the output files are empty.
>
> Here's my job submission script:
>
> #!/bin/bash
> #$ -N mpihello
> #$ -pe orte 2
> #$ -l h_vmem=8G
> #$ -cwd
> #$ -V
> #$ -R y
>
> MPI=/usr/local/openmpi/pgi/x86_64
> PATH=${MPI}/bin:${PATH}
> LD_LIBRARY_PATH=${MPI}/lib:${LD_LIBRARY_PATH}
>
> mpirun ./mpihello

This is not the sleep statement you mentioned above. The best approach
is to change only one thing at a time and observe the results.

- Make h_vmem consumable again and submit a serial job with a plain
"sleep 90": no mpirun or the like, and no resource requests. On the
node where the job runs, `qhost -F h_vmem` should show the correct
value with the default subtracted.
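That check can be scripted; a minimal sketch that parses the hc:h_vmem
line from `qhost -F h_vmem` output (the sample text below is made up,
standing in for a real node with 16G configured and one job charged the
2G default):

```shell
# Hypothetical check: extract the remaining consumable from qhost output.
# The here-string is a fabricated stand-in for `qhost -F h_vmem`; with
# 16G configured and one serial job charged the 2G default, 14G remains.
sample='HOSTNAME                ARCH       NCPU  LOAD   MEMTOT   MEMUSE
node01                  lx26-amd64    8  0.01    15.7G   512.0M
    hc:h_vmem=14.000G'
echo "$sample" | awk -F= '/hc:h_vmem/ { print $2 }'
# → 14.000G
```

On a live cluster you would pipe `qhost -F h_vmem -h node01` into the
same awk instead of the canned sample.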

- Submit the same job, this time with an explicit request of 8 GB.
Does qhost show the correct output again?

- Submit the same job as a parallel one without a request: do you see
the correct subtraction on each of the granted nodes?

- Submit the same job as a parallel one with the 8 GB request: do you
see the correct subtraction on each of the granted nodes?
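The expected bookkeeping for the four checks can be worked out in
advance. A sketch of the arithmetic, assuming the 16G complex_values
and 2G default from your setup, that h_vmem is charged per slot, and
(for the parallel cases) that both slots land on the same node:

```shell
# Expected per-node leftover: configured 16G minus request-per-slot
# times slots on that node. The 16G and 2G figures come from the
# original post; the single-node slot placement is an assumption.
total=16
for pair in "2 1" "8 1" "2 2" "8 2"; do
  set -- $pair
  echo "${1}G x ${2} slot(s): hc:h_vmem = $(( total - $1 * $2 ))G left"
done
# → 14G, 8G, 12G and 0G left, respectively
```

Note the last combination: 8 GB times 2 slots consumes the whole 16G
of the node, which is worth keeping in mind when reading the qhost
output.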

If all this works well, we can start to use mpirun. Is the mpihello
the one from the sunsource site? It's supposed to run endlessly, as
its main purpose is to verify the correct tight integration of the job
into SGE. Often the parallel test jobs are too short to ssh to a node
and check everything; therefore it's intended to be killed only by a
qdel.

-- Reuti


>
> Any ideas?
>
> When I remove the '-l h_vmem=8G' line from the submit script, the job
> just seems to hang indefinitely in the run state. Any ideas?
>
> -- 
> Prentice
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=112954
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=113414

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].



More information about the gridengine-users mailing list