[GE users] How to limit the minimum slots for parallel jobs?

wzlu wzlu at gate.sinica.edu.tw
Thu Nov 1 01:54:22 GMT 2007



Reuti wrote:

>
> this is the way to go and in principle it should work.
>
> -) the prolog script is executable?
>
> -) the exec node is also a submit node, in case you want to issue a
> qdel there?
>
> -- Reuti
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
Thanks, Reuti.
But the error message is still the same.
Is there something I need to change to fix the problem? Thanks again.

wzlu

1) The prolog script is executable, but it does not seem to be executed.
$ ls -l /prj/tmp/sge
total 4
-rwxr-xr-x 1 sgeadmin sgeadmin 340 Oct 31 14:41 prolog.p0

$ cat /prj/tmp/sge/prolog.p0
##!/bin/sh

touch /prj/tmp/sge/xxx
echo "\$USER: $USER" >> /prj/tmp/sge/prolog_p0.log
echo "\$JOB_ID: $JOB_ID" >> /prj/tmp/sge/prolog_p0.log
echo "\$HOSTNAME: $HOSTNAME" >> /prj/tmp/sge/prolog_p0.log
echo "\$NSLOTS: $NSLOTS" >> /prj/tmp/sge/prolog_p0.log
echo "----------------------------------------------" >> /prj/tmp/sge/prolog_p0.log
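
One way to see which exit status the script returns outside the scheduler is to run it by hand with the variables it expects. The helper below is only a sketch: the name check_prolog is hypothetical, and the values for JOB_ID, NSLOTS and the fallbacks for USER and HOSTNAME are placeholders taken from this post, not what sge_execd would really export.

```shell
#!/bin/sh
# Hypothetical helper: run a prolog script by hand and report its exit status.
# The variable values are placeholders; a real job gets them from Grid Engine.
check_prolog() {
    script=$1
    USER=${USER:-wzlu} JOB_ID=632 NSLOTS=4 HOSTNAME=${HOSTNAME:-test00114} \
        sh "$script"
    status=$?
    # Same wording as the message that appears in the qmaster log.
    echo "exit_status of prolog = $status"
    return "$status"
}
```

For example, `check_prolog /prj/tmp/sge/prolog.p0`, run as the job owner on one of the failing hosts. Because the script is passed to sh explicitly, its first line is not used to pick the interpreter, so this only checks the commands inside the script.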

2) I added all hosts as submit hosts.
$ qconf -ss
test00101
test00102
test00103
test00105
test00106
test00107
test00108
test00109
test00110
test00111
test00112
test00113
test00114
test00115
test00116
test00117
test00118
test00119
test00120
test00121
test00122

3) The error message is the same:
11/01/2007 09:48:57|qmaster|test00101|W|job 632.1 failed on host test00114 general in prolog because: 11/01/2007 09:48:56 [302:21433]: exit_status of prolog = 1
11/01/2007 09:48:57|qmaster|test00101|W|rescheduling job 632.1
11/01/2007 09:48:57|qmaster|test00101|E|queue p0-x86-ge marked QERROR as result of job 632's failure at host test00114
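
To see which hosts are affected, the qmaster messages can be filtered for prolog failures. This is a rough sketch only: the function name prolog_failed_hosts is hypothetical, and it assumes each log entry sits on a single line in the format shown above.

```shell
#!/bin/sh
# Hypothetical helper: read qmaster "messages" lines on stdin and print
# each host on which a job failed in the prolog, once per host.
prolog_failed_hosts() {
    sed -n 's/.*|job [0-9.]* failed on host \([^ ]*\) .* in prolog because:.*/\1/p' |
        sort -u
}
```

It would be fed the qmaster messages file on standard input; where that file lives depends on the installation.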

4) Some hosts ended up in the error state, but others were normal.

5) The job state in the queue cycled qw -> r -> qw -> r -> qw .... -> qw
$ qstat
job-ID  prior    name      user  state  submit/start at      queue                slots  ja-task-ID
---------------------------------------------------------------------------------------------------
632     0.55500  para2.sh  wzlu  qw     11/01/2007 09:07:24                       4
$ qstat
job-ID  prior    name      user  state  submit/start at      queue                slots  ja-task-ID
---------------------------------------------------------------------------------------------------
632     0.55500  para2.sh  wzlu  r      11/01/2007 09:07:34  p0-x86-ge@test00120  4
$ qstat
job-ID  prior    name      user  state  submit/start at      queue                slots  ja-task-ID
---------------------------------------------------------------------------------------------------
632     0.55500  para2.sh  wzlu  qw     11/01/2007 09:07:24                       4
$ qstat
job-ID  prior    name      user  state  submit/start at      queue                slots  ja-task-ID
---------------------------------------------------------------------------------------------------
632     0.55500  para2.sh  wzlu  r      11/01/2007 09:07:54  p0-x86-ge@test00105  4
$ qstat
job-ID  prior    name      user  state  submit/start at      queue                slots  ja-task-ID
---------------------------------------------------------------------------------------------------
632     0.55500  para2.sh  wzlu  qw     11/01/2007 09:07:24                       4
$ qstat
job-ID  prior    name      user  state  submit/start at      queue                slots  ja-task-ID
---------------------------------------------------------------------------------------------------
632     0.55500  para2.sh  wzlu  qw     11/01/2007 09:07:24                       4


