[GE users] Unexpected behavior during simultaneous job submissions?

reuti reuti at staff.uni-marburg.de
Fri Nov 14 12:38:09 GMT 2008



Hi,

On 14.11.2008 at 02:03, Jonathan Pierce wrote:

> One of our users has a script that submits 494 jobs, which he ran this

Could your user turn these into a single array job?
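
As a rough sketch of what such an array job could look like, assuming
the 494 jobs differ only in their input: the script below picks line
$SGE_TASK_ID of a list file. The names inputs.txt, process_one and
array_job.sh are made up for illustration; the -t option of qsub and
the SGE_TASK_ID variable are standard.

```shell
#!/bin/sh
# array_job.sh (hypothetical name): one task of an SGE array job.

# input_for_task N FILE: print line N of FILE, i.e. the input for task N.
input_for_task() {
    sed -n "${1}p" "$2"
}

# SGE sets SGE_TASK_ID to the task index for array jobs and to the
# string "undefined" for ordinary jobs; do nothing outside an array task.
task=${SGE_TASK_ID:-undefined}
if [ "$task" != "undefined" ]; then
    # inputs.txt is an assumed list file with one input per line.
    input=$(input_for_task "$task" inputs.txt)
    echo "task $task: processing $input"
    # ./process_one "$input"   # hypothetical per-input command
fi
```

It would be submitted once with "qsub -t 1-494 array_job.sh", which
gives a single job ID with 494 tasks instead of 494 separate jobs.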

> morning; based on qstat, the first job was reported at 09:17:59 and
> the last at 09:18:53. While most are now happily executing, 19 of
> those jobs started out in a "zombie" state, so to speak.  'qstat -j
> [jobID]' returns information, and qstat -f shows the job in state
> 'r'.  However, 'qacct -j [jobID]', reports the error job id not found
> (and a quick inspection on the node the job is supposedly running
> confirms nothing is executing).

The accounting record is written only when the job finishes. As the job
is still in state "r" as far as SGE is concerned, nothing has been
written so far. You can get rid of these jobs with "qdel -f <job_id>".

What do the messages files of the qmaster and of the node say? The
qmaster's file is $SGE_ROOT/default/spool/qmaster/messages.

Was the queue put into error state?
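
A small helper for both checks, as a sketch only. It assumes that each
line in a messages file has the form "date time|daemon|host|level|text"
with level being I, W, E or C; please verify this against your own
files. The function name problems_in is made up.

```shell
#!/bin/sh
# problems_in FILE: print only the warning (W), error (E) and
# critical (C) lines of an SGE messages file.
# Assumption: fields are separated by "|" and the 4th field is the level.
problems_in() {
    awk -F'|' '$4 == "W" || $4 == "E" || $4 == "C"' "$1"
}

# Possible usage (not executed here):
#   problems_in "$SGE_ROOT/default/spool/qmaster/messages" | tail -n 20
#   qstat -f -explain E    # lists queue instances in error state with the reason
```

The messages file of the node itself is usually in the execd spool
directory of that node; its exact location depends on your
configuration.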

> Has anybody seen this behavior before?
>
> Taking a step back, we've been discovering a number of zombie jobs
> recently, most of which do not originate from this script.  Is this
> behavior indicative of some greater failure?

It could be a problem with a particular node. All jobs ending up
there will get this state, like a black hole in the cluster.
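
To check whether the zombie jobs cluster on one node, you could tally
them by host. The sketch below assumes the default qstat listing, where
the first column is the job ID and the eighth is "queue@host"; the
function name hosts_for_jobs and the job IDs are only placeholders.

```shell
#!/bin/sh
# hosts_for_jobs ID...: read a qstat listing on stdin and print, for the
# given job IDs only, "count host" lines, most frequent host first.
# Assumption: column 1 is the job ID and column 8 is "queue@host".
hosts_for_jobs() {
    awk -v ids="$*" '
        BEGIN { n = split(ids, a, " "); for (i = 1; i <= n; i++) want[a[i]] = 1 }
        ($1 in want) && $8 ~ /@/ { split($8, q, "@"); count[q[2]]++ }
        END { for (h in count) print count[h], h }
    ' | sort -rn
}

# Possible usage (not executed here), with placeholder job IDs:
#   qstat -u '*' -s r | hosts_for_jobs 101 102 103
```

If nearly all of the affected job IDs map to the same host, that node
is the first place to look.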

-- Reuti


> Thank you very much,
> Jonathan
>
> ·····················
> Jonathan Pierce
> Systems Administrator
> Laboratory of Neuro Imaging, UCLA
> 635 Charles E. Young Drive South,
> Suite 225 Los Angeles, CA 90095-7334
> Tel: 310.267.5076
> Fax: 310.206.5518
> jonathan.pierce at loni.ucla.edu
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=88716
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=88754
