[GE users] Re: new error messages on large # job submissions

Reuti reuti at staff.uni-marburg.de
Thu Aug 5 19:18:49 BST 2004


Hi again,

>So I submitted about 10,000 jobs that do a simple iteration with find and
>dd that seemed to be going well and completing, besides the fact they
>would heavily slow down the nfs server. I've now checked on these a few
>days later and I see all the jobs are now in state "Eqw" and the following
>message is included scrolled in the qmaster messages file:

Do you have the spool directories for all the nodes on the master, or on each
node in something like /var/spool/sge? There is a HowTo at Sunsource on
minimizing the NFS traffic of SGE by keeping the spool directories local on
the nodes.


>3.q" that was not supposed to be there - killing
>Wed Aug  4 04:04:20 2004|qmaster|linga|E|execd at compute-9-3.local reports
>running job (21361.1/master) in queue "compute-9-3.q" that was not
>supposed to be there - killing

Yes, I also see this from time to time. Shut down the execd on the node and
remove everything left over in the active_jobs, job_scripts and jobs
directories under /var/spool/sge/<nodename> (or under
$SGE_ROOT/default/spool/... if you use a central spool directory, of course).
Then restart the execd and the message shouldn't appear any longer.
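
If you have to do this on several nodes, the cleanup step could be scripted
roughly as below. This is only a sketch: the function name clear_execd_spool
and the dry_run switch are made up for illustration, and it assumes the execd
on that node has already been stopped by whatever means your installation
provides. It only empties the three directories named above and leaves the
directories themselves in place.

```python
import shutil
from pathlib import Path
from typing import List

# The three spool subdirectories named in the message above.
SPOOL_SUBDIRS = ("active_jobs", "job_scripts", "jobs")


def clear_execd_spool(node_spool: Path, dry_run: bool = True) -> List[Path]:
    """Empty the leftover job entries under one node's execd spool directory.

    node_spool is the per-node directory, e.g. /var/spool/sge/<nodename>.
    Hypothetical helper: run it only while the execd on that node is stopped.
    Returns the entries that were removed (or would be, when dry_run is True).
    """
    removed = []
    for name in SPOOL_SUBDIRS:
        subdir = node_spool / name
        if not subdir.is_dir():
            continue  # nothing spooled here, or a different layout
        for entry in sorted(subdir.iterdir()):
            removed.append(entry)
            if dry_run:
                continue
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)  # per-job directory
            else:
                entry.unlink()  # plain file or symlink
    return removed
```

Call it first with dry_run=True and look at the returned list before running
it with dry_run=False; then restart the execd.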


Cheers - Reuti
