AW: AW: [GE users] non-advancing jobs in gridengine

carsten carsten.ochtrup at eds.com
Thu Aug 27 09:49:51 BST 2009



Joe,

maybe you "only" have a problem with a single node that you hit every time you submit the job via SGE. Try forcing your SGE job onto exactly the same nodes you use in your manual runs: submit the job with a list of nodes and make this a hard request for SGE. (Or is that what you meant by "forced node by hand"?)
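
A minimal sketch of such a hard host request, assuming the placeholder names node01 and node02 stand in for the hosts from your manual runs. It relies on the built-in "hostname" resource (short form "h"), which as far as I know accepts several hosts separated by "|"; the helper hosts_to_request is made up for illustration and is not part of SGE.

```shell
#!/bin/bash
# Join host names into an SGE resource request of the form h=hostA|hostB.
# hosts_to_request is a hypothetical helper, not an SGE command.
hosts_to_request() {
    local IFS='|'            # "$*" joins the arguments with the first IFS character
    printf 'h=%s\n' "$*"
}

# Hypothetical usage on the submit host (node01/node02 are placeholders):
#   qsub -hard -l "$(hosts_to_request node01 node02)" \
#        -pe openmpi 128 -cwd ./run_script_SGE.bash
```

The listed hosts must together offer enough slots for the PE request, otherwise the job simply stays pending.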
 
Have you checked your InfiniBand switch and cleaned up the fabric? Does a Pallas benchmark submitted through SGE run fine?

If the job runs fine with a host file, try starting it as if your Open MPI had no SGE support built in, launching it via the PE start procedures you use with other MPI versions.
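
One way to try this, sketched under assumptions: build an Open MPI host file from the file SGE names in $PE_HOSTFILE (one "host slots queue processors" line per host) and hand it to mpirun explicitly. The function name pe_to_ompi_hostfile and the program name are placeholders; an SGE-aware Open MPI build may still notice the SGE environment, so treat this as an experiment rather than a guaranteed bypass.

```shell
#!/bin/bash
# Convert an SGE pe_hostfile ("host slots queue processors" per line)
# into an Open MPI host file ("host slots=N" per line).
# pe_to_ompi_hostfile is a hypothetical helper name.
pe_to_ompi_hostfile() {
    awk 'NF >= 2 { print $1 " slots=" $2 }' "$1"
}

# Hypothetical usage inside the job script (./your_program is a placeholder):
#   pe_to_ompi_hostfile "$PE_HOSTFILE" > "$TMPDIR/machines"
#   mpirun -np "$NSLOTS" --hostfile "$TMPDIR/machines" ./your_program
```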

It should not be a problem, but have you checked that all MPI ranks run where they should? Years ago I had a problem with another MPI implementation: it did not start the correct number of processes on the nodes given in the host file.
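
A rough way to check the placement, assuming you capture one host name per rank (for example from "mpirun hostname") into a file and compare it with the slots SGE granted in $PE_HOSTFILE. The function check_rank_placement is a made-up helper, and it assumes both files spell the host names the same way (short versus fully qualified names would show up as mismatches).

```shell
#!/bin/bash
# Compare slots granted per host (pe_hostfile) with ranks actually started
# per host (one host name per line). Prints every mismatch and returns
# non-zero if any is found. check_rank_placement is a hypothetical helper.
check_rank_placement() {
    awk '
        FILENAME == ARGV[1] { if (NF >= 2) want[$1] = $2; next }  # granted slots
        NF                  { got[$1]++ }                         # observed ranks
        END {
            bad = 0
            for (h in want) if (got[h] + 0 != want[h] + 0) {
                printf "%s: expected %d, got %d\n", h, want[h], got[h]; bad = 1
            }
            for (h in got) if (!(h in want)) {
                printf "%s: %d unexpected ranks\n", h, got[h]; bad = 1
            }
            exit bad
        }
    ' "$1" "$2"
}

# Hypothetical usage inside the job script:
#   mpirun -np "$NSLOTS" hostname > ranks.txt
#   check_rank_placement "$PE_HOSTFILE" ranks.txt
```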

I don't know whether it makes any difference, but try increasing the values for "pending signals" and "max user processes" to 268288, and add "ulimit -a" to your ./run_script_SGE.bash script, just to be sure. qrsh might give different results (I think something about that was mentioned in the link).
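
Something along these lines near the top of the job script would record the limits the batch job really sees, so they can be compared with the qrsh output. This is only a sketch: log_limits and the log file name are made up, and the commented-out ulimit calls will fail unless the hard limits on the execution host allow the higher values.

```shell
#!/bin/bash
# Write the limits of the current shell to a file, tagged with the node name.
# log_limits is a hypothetical helper name.
log_limits() {
    {
        echo "=== limits on $(uname -n) ==="
        ulimit -a
    } > "$1"
}

# Hypothetical use near the top of ./run_script_SGE.bash:
#   ulimit -i 268288    # pending signals (needs a high enough hard limit)
#   ulimit -u 268288    # max user processes (same caveat)
#   log_limits "limits.$JOB_ID.txt"
```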
 
Carsten

-----Original Message-----
From: joelandman [mailto:landman at scalableinformatics.com] 
Sent: Monday, 24 August 2009 22:48
To: users at gridengine.sunsource.net
Subject: Re: AW: [GE users] non-advancing jobs in gridengine

joelandman wrote:

> It looks like
> 
> 	ulimit -s unlimited
> 
> in the very top of the SGE execd script helped here.
> 

I spoke too soon.  Looks like it ran once, but not the way I wanted. 
Restarted it correctly, and we get the same problem.  I can confirm

landman at scalable:~> qrsh ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
pending signals                 (-i) 71680
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 71680
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

so we aren't running out of limits.

If I let SGE select the hosts, and don't use a machinefile, the job 
fails to advance.  If I force those by hand, the job works.

job gets submitted with

	qsub -pe openmpi 128 -cwd ./run_script_SGE.bash

and

landman at scalable:~> qconf -sp openmpi
pe_name            openmpi
slots              128
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE




-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=214040

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=214525



