[GE users] Interactive jobs not starting

VS Ang vs_ang at yahoo.com
Thu Nov 29 23:12:06 GMT 2007



Hello,

When I try to submit interactive jobs with the qrsh or qlogin commands, the jobs never start. The qrsh command simply gives up after a short timeout:

$ qrsh -verbose
Your job 45 ("QRLOGIN") has been submitted
waiting for interactive job to be scheduled ...timeout (3 s) expired while waiting on socket fd 4

Could not start interactive job.

The same thing happens with qlogin:

$ qlogin -verbose
Your job 46 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...timeout (4 s) expired while waiting on socket fd 4

Could not start interactive job.

Also, in the execd messages file on the compute nodes, I see the following errors:

11/29/2007 16:01:26|execd|compute-1-5|E|shepherd of job 26.1 exited with exit status = 9
11/29/2007 16:01:26|execd|compute-1-5|W|reaping job "26" ptf complains: Job does not exist
11/29/2007 16:07:23|execd|compute-1-5|E|shepherd of job 27.1 exited with exit status = 9
11/29/2007 16:07:23|execd|compute-1-5|W|reaping job "27" ptf complains: Job does not exist
11/29/2007 16:44:37|execd|compute-1-5|W|reaping job "33" ptf complains: Job does not exist
11/29/2007 17:57:05|execd|compute-1-5|E|shepherd of job 43.1 exited with exit status = 11
11/29/2007 17:57:05|execd|compute-1-5|W|reaping job "43" ptf complains: Job does not exist
11/29/2007 18:07:34|execd|compute-1-5|W|reaping job "46" ptf complains: Job does not exist


And, on the qmaster host:

11/29/2007 17:57:06|qmaster|admin|W|job 43.1 failed on host compute-1-5.local general before job because: 11/29/2007 17:57:05 [0:18592]: can't open file /tmp/43.1.all.q/pid: No such file or directory
11/29/2007 17:57:06|qmaster|admin|W|rescheduling job 43.1
11/29/2007 18:06:27|qmaster|admin|W|job 44.1 failed on host compute-1-4.local assumedly after job because: job 44.1 died through signal KILL (9)
11/29/2007 18:07:03|qmaster|admin|W|job 45.1 failed on host compute-1-1.local assumedly after job because: job 45.1 died through signal KILL (9)
11/29/2007 18:07:35|qmaster|admin|W|job 46.1 failed on host compute-1-5.local assumedly after job because: job 46.1 died through signal KILL (9)
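In case it helps anyone triaging similar output, the failures above can be tallied per execution host. This is only a rough sketch: it assumes every messages line has the five pipe-delimited fields seen above (timestamp, daemon, host, level, message) and that failure warnings begin with "job <id> failed on host <hostname>". The function names are my own and are not part of SGE.

```python
from collections import Counter


def parse_line(line):
    """Split one messages line into its five pipe-delimited fields.

    Assumed layout (inferred from the excerpts, not from SGE docs):
    timestamp|daemon|host|level|message
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None  # not a line in the assumed layout
    timestamp, daemon, host, level, message = parts
    return {
        "timestamp": timestamp,
        "daemon": daemon,
        "host": host,
        "level": level,
        "message": message,
    }


def failures_by_exec_host(lines):
    """Count 'job <id> failed on host <hostname>' warnings per hostname."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue
        words = record["message"].split()
        # Assumed message shape: job <id> failed on host <hostname> ...
        if len(words) >= 6 and words[0] == "job" and words[2:5] == ["failed", "on", "host"]:
            counts[words[5]] += 1
    return counts


# The qmaster lines quoted above, copied verbatim.
sample = [
    "11/29/2007 17:57:06|qmaster|admin|W|job 43.1 failed on host compute-1-5.local general before job because: 11/29/2007 17:57:05 [0:18592]: can't open file /tmp/43.1.all.q/pid: No such file or directory",
    "11/29/2007 17:57:06|qmaster|admin|W|rescheduling job 43.1",
    "11/29/2007 18:06:27|qmaster|admin|W|job 44.1 failed on host compute-1-4.local assumedly after job because: job 44.1 died through signal KILL (9)",
    "11/29/2007 18:07:03|qmaster|admin|W|job 45.1 failed on host compute-1-1.local assumedly after job because: job 45.1 died through signal KILL (9)",
    "11/29/2007 18:07:35|qmaster|admin|W|job 46.1 failed on host compute-1-5.local assumedly after job because: job 46.1 died through signal KILL (9)",
]

if __name__ == "__main__":
    for hostname, count in sorted(failures_by_exec_host(sample).items()):
        print(hostname, count)
```

On the lines quoted here, the failures are spread over three different compute nodes, so the problem does not look specific to one host.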

I am using SGE 6.1u2 (compiled from source). Any pointers would be appreciated!

Srihari


