[GE users] sge_qmaster doesn't respond or works very slowly?

Alex Chekholko chekh at pcbi.upenn.edu
Wed Jun 4 20:27:44 BST 2008


Hi,

It looks like a user submitted ~75k jobs at once, and sge_qmaster and sge_schedd crashed.  I'm trying to start them up again: sge_qmaster starts but is unresponsive, so sge_schedd can't do anything useful.

I found an older mailing list post:
http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=21014

So I ran "export SGE_ND=''" and manually started sge_qmaster and sge_schedd.

sge_qmaster takes a very long time to start up, and a long time to read in the jobs:

# /gpfs/fs0/share/ge-6.1u3/bin/lx24-amd64/sge_qmaster 
local configuration beta.genomics.upenn.edu not defined - using global configuration
Reading in complex attributes.
Reading in execution hosts.
Reading in administrative hosts.
Reading in submit hosts.
Reading in host group entries:
        Host group entries for group "@allhosts".
Reading in usersets:
        Userset "defaultdepartment".
        Userset "deadlineusers".
Reading in queues:
        Queue "all.q".
Reading in parallel environments:
        PE "make".
        PE "DJ".
Reading in Master_Job_List.
.....................................................................................critical error: job file "jobs/00/0067/1167" has zero size
................................................................................................................................................................................................................................................................................................................critical error: job file "jobs/00/0029/9895" has zero size
critical error: job file "jobs/00/0029/9897" has zero size
critical error: job file "jobs/00/0029/9898" has zero size
critical error: job file "jobs/00/0029/9899" has zero size
critical error: job file "jobs/00/0029/9900" has zero size
critical error: job file "jobs/00/0029/9901" has zero size
critical error: job file "jobs/00/0029/9902" has zero size
critical error: job file "jobs/00/0029/9903" has zero size
critical error: job file "jobs/00/0029/9904" has zero size
.....................................................................................................................................................................................................................................................................................................................................................

read job database with 72966 entries in 2285 seconds
local configuration beta.genomics.upenn.edu not defined - using global configuration
Reading in users:
        User "smonni".
        User "caspian".
        User "shukai".
        User "mingyao".
        User "jichun".
        User "kai".
        User "weili1".
qmaster hard descriptor limit is set to 8192
qmaster soft descriptor limit is set to 8192
qmaster will use max. 8172 file descriptors for communication
qmaster will accept max. 99 dynamic event clients
starting up GE 6.1u3 (lx24-amd64)
error: commlib error: got read error (closing "beta.genomics.upenn.edu/qstat/1")
error: execd at node-r1-u5-c30-p10-o4.local reports running job (671167.1/master) in queue "all.q@node-r1-u5-c30-p10-o4.local" that was not supposed to be there - killing
error: event client "drmaa" (beta.genomics.upenn.edu/drmaa/941) reregistered - it will need a total update
error: event client "drmaa" (beta.genomics.upenn.edu/drmaa/1009) reregistered - it will need a total update
error: acknowledge timeout after 600 seconds for event client (drmaa:941) on host "beta.genomics.upenn.edu"
error: acknowledge timeout after 600 seconds for event client (drmaa:1009) on host "beta.genomics.upenn.edu"
error: acknowledge timeout after 600 seconds for event client (schedd:1) on host "beta.genomics.upenn.edu"
failed to deliver job 671175.1 to queue "all.q@node-r1-u28-c9-p11-o20.local"
failed to deliver job 671176.1 to queue "all.q@node-r1-u31-c6-p10-o22.local"
[snip]
failed to deliver job 671217.1 to queue "all.q@node-r1-u19-c16-p10-o14.local"
failed to deliver job 671218.1 to queue "all.q@node-r1-u10-c25-p11-o6.local"
failed to deliver job 671219.1 to queue "all.q@node-r1-u4-c31-p11-o3.local"
[snip]
failed to deliver job 671233.1 to queue "all.q@node-r1-u4-c31-p11-o3.local"
failed to deliver job 671234.1 to queue "all.q@node-r1-u10-c25-p11-o6.local"
error: acknowledge timeout after 600 seconds for event client (drmaa:1009) on host "beta.genomics.upenn.edu"
error: acknowledge timeout after 600 seconds for event client (drmaa:941) on host "beta.genomics.upenn.edu"
error: acknowledge timeout after 600 seconds for event client (schedd:1) on host "beta.genomics.upenn.edu"
error: no event client known with id 1 to modify
failed to deliver job 671251.1 to queue "all.q@node-r1-u21-c14-p10-o23.local"
failed to deliver job 671255.1 to queue "all.q@node-r1-u5-c30-p10-o4.local"
...
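
The zero-size job files reported in the output above can be pulled out of a saved copy of the qmaster output. This is only a minimal Python sketch and not something from the original run: the function names are hypothetical, and the mapping from a path to a job ID (concatenating the three numeric path components) is an assumption, suggested only by jobs/00/0067/1167 sharing its digits with job 671167 that execd reports.

```python
import re

# Matches the "critical error" lines printed by sge_qmaster in the output above.
ZERO_SIZE = re.compile(
    r'critical error: job file "(jobs/\d{2}/\d{4}/\d{4})" has zero size'
)


def zero_size_job_files(log_text):
    """Return the job file paths reported as zero size, in log order."""
    return ZERO_SIZE.findall(log_text)


def guess_job_id(path):
    """ASSUMPTION: the numeric path components, concatenated, are the job ID."""
    digits = "".join(path.split("/")[1:])
    return int(digits)


if __name__ == "__main__":
    # Two lines copied from the qmaster output (progress dots shortened).
    sample = (
        '.....critical error: job file "jobs/00/0067/1167" has zero size\n'
        'critical error: job file "jobs/00/0029/9897" has zero size\n'
    )
    for path in zero_size_job_files(sample):
        print(path, guess_job_id(path))
```

The regular expression is not anchored to the start of a line because the first error on a line follows a run of progress dots.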



Meanwhile, sge_schedd prints:
# /gpfs/fs0/share/ge-6.1u3/bin/lx24-amd64/sge_schedd 
local configuration beta.genomics.upenn.edu not defined - using global configuration
starting up GE 6.1u3 (lx24-amd64)
Q:30, AQ:60 J:72939(72939), H:63(64), C:47, A:1, D:1, P:2, CKPT:0, US:7, PR:0, RQS:0, S:nd:0/lf:0 
failed receiving gdi request
--------------STOP-SCHEDULER-RUN-------------
Q:30, AQ:60 J:72939(72939), H:63(64), C:47, A:1, D:1, P:2, CKPT:0, US:7, PR:0, RQS:0, S:nd:0/lf:0 
failed receiving gdi request
NULL ptr passed to sge_gdi_extract_answer()
--------------STOP-SCHEDULER-RUN-------------
qmaster alive timeout expired
error: commlib error: got read error (closing "beta.genomics.upenn.edu/qmaster/1")
error: failed receiving gdi request
error: failed receiving gdi request
Q:30, AQ:60 J:72863(72863), H:63(64), C:47, A:1, D:1, P:2, CKPT:0, US:7, PR:0, RQS:0, S:nd:0/lf:0 
failed receiving gdi request
--------------STOP-SCHEDULER-RUN-------------
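
To check whether the scheduler's view changes between runs, the J: field of the status lines above can be compared over time. A rough Python sketch follows; reading J as the job count is an assumption (its value is close to the size of the job database that qmaster reported), and the function name is hypothetical.

```python
import re

# Matches the J:n(m) field of the sge_schedd status lines shown above.
STATUS_J = re.compile(r"\bJ:(\d+)\((\d+)\)")


def job_counts(schedd_output):
    """Return the first number of every J:n(m) field, in output order."""
    return [int(m.group(1)) for m in STATUS_J.finditer(schedd_output)]


if __name__ == "__main__":
    # Two status lines copied from the sge_schedd output (tails shortened).
    sample = (
        "Q:30, AQ:60 J:72939(72939), H:63(64), C:47\n"
        "failed receiving gdi request\n"
        "Q:30, AQ:60 J:72863(72863), H:63(64), C:47\n"
    )
    counts = job_counts(sample)
    print("J values:", counts)
    print("change from first to last:", counts[-1] - counts[0])
```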

Meanwhile, user commands don't work:
[chekh@beta ~]$ qhost
error: failed receiving gdi request

Any suggestions?  I've been waiting a couple of hours for sge_qmaster to work its way through the jobs, but at this rate it will take days or weeks to get through the 72k jobs.  Can I stop sge_qmaster and just delete the directories under jobs/00/... ?
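
Before removing anything, it may help to see how many job files in the spool are actually empty. The Python sketch below only lists zero-size files and changes nothing; the function name is hypothetical, and the directory to scan is an assumption that has to be supplied on the command line (it defaults to a relative "jobs" directory, matching the relative paths in the qmaster output).

```python
import os
import sys


def find_zero_size_files(jobs_dir):
    """Return a sorted list of regular files of size 0 under jobs_dir.

    Read-only: nothing is deleted or modified.
    """
    found = []
    for root, _dirs, files in os.walk(jobs_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.isfile(path) and os.path.getsize(path) == 0:
                found.append(path)
    return sorted(found)


if __name__ == "__main__":
    # ASSUMPTION: the first argument is the "jobs" directory of the
    # qmaster spool; without an argument, ./jobs is scanned.
    jobs_dir = sys.argv[1] if len(sys.argv) > 1 else "jobs"
    for path in find_zero_size_files(jobs_dir):
        print(path)
```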

Regards,
-- 
Alex Chekholko 
