[GE users] Error jobs hanging SGE

Andy Schwierskott andy.schwierskott at sun.com
Thu Apr 15 11:08:40 BST 2004


Bryan,

> So yesterday one of our users submitted several thousand jobs that all went
> into error state because of write permissions on the -o directory.
> Eventually, no more jobs were dispatched to exec hosts.  Submissions were
> still accepted with normal speed and the communication between qmaster and
> exec hosts was ok.  The CPU usage by qmaster and schedd looked normal given
> that there were thousands of pending jobs.
>
> The qmaster log was getting
>
> Wed Apr  7 12:57:29 2004|qmaster|hai7|W|could not decrease "max_u_jobs" job counter
>
> about 50 times/sec.  I rebooted the qmaster host, which didn't change
> anything.  Finally I deleted all the error and pending jobs of the user; once
> this had been done jobs started running normally and the max_u_jobs message
> stopped going into the log.
>
> When the qdel began, the log started getting
>
> Wed Apr  7 12:57:30 2004|qmaster|hai7|E|job "717322" does not exist
>
> for the jobs being deleted, but I suppose that's normal(?)

Which version are you using?

> Is there a way for jobs in error state to be deleted automatically?

No.
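
Nothing built in does this, but a site could script it from outside, e.g.
from cron. What follows is only a rough sketch, not a tested tool. It
assumes that plain `qstat` prints one job per line with the job ID in the
first column and the state (such as "Eqw") in the fifth, and that job names
contain no spaces; check this against the output of your version first. The
function names are made up for illustration.

```python
import subprocess


def error_job_ids(qstat_text):
    """Return the IDs of jobs whose state column contains 'E' (e.g. 'Eqw')."""
    ids = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Skip the header, the dashed separator, and blank lines.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        # Assumption: the state is the fifth whitespace-separated column.
        # Array tasks can repeat a job ID, so keep each ID only once.
        if "E" in fields[4] and fields[0] not in ids:
            ids.append(fields[0])
    return ids


def delete_error_jobs(dry_run=True):
    """Run qstat, then qdel each job found in error state.

    With dry_run=True (the default) nothing is deleted; the IDs that
    would be deleted are only returned.
    """
    result = subprocess.run(["qstat"], capture_output=True, text=True, check=True)
    ids = error_job_ids(result.stdout)
    if not dry_run:
        for job_id in ids:
            # One qdel call per job keeps the sketch simple.
            subprocess.run(["qdel", job_id], check=True)
    return ids
```

Run it with dry_run=True first and compare the returned IDs with what qstat
shows. Deleting a job in error state also throws away the error reason, so
it is worth looking at a few of them before cleaning up.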

Andy

