[GE users] Error jobs hanging SGE

Bryan Bayerdorffer bryan.bayerdorffer at analog.com
Thu Apr 8 22:51:32 BST 2004


So yesterday one of our users submitted several thousand jobs that all went 
into error state because the -o (output) directory was not writable. 
Eventually, no more jobs were dispatched to exec hosts.  Submissions were 
still accepted with normal speed and the communication between qmaster and 
exec hosts was ok.  The CPU usage by qmaster and schedd looked normal given 
that there were thousands of pending jobs.

The qmaster log was getting

Wed Apr  7 12:57:29 2004|qmaster|hai7|W|could not decrease "max_u_jobs" job counter

about 50 times/sec.  I rebooted the qmaster host, which didn't change 
anything.  Finally I deleted all the error and pending jobs of the user; once 
this had been done jobs started running normally and the max_u_jobs message 
stopped going into the log.
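
For anyone wanting to put a number on how often a message like this repeats, 
here is a rough Python sketch that tallies qmaster log lines per timestamp. 
It assumes the pipe-separated layout seen in the lines quoted in this post 
(timestamp|component|host|level|message); the function name tally_log and 
the sample list are only illustrative, not part of SGE.

```python
from collections import Counter


def tally_log(lines):
    """Count identical (timestamp, level, message) entries in qmaster-style log lines.

    Assumes five pipe-separated fields per line, as in the lines quoted in
    this post; anything that does not split into five fields is skipped.
    """
    counts = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("|", 4)
        if len(parts) != 5:
            continue
        timestamp, _component, _host, level, message = parts
        counts[(timestamp, level, message)] += 1
    return counts


# Sample input built from the two log lines quoted in this post
# (the first one is repeated to mimic the flood).
sample = [
    'Wed Apr  7 12:57:29 2004|qmaster|hai7|W|could not decrease "max_u_jobs" job counter',
    'Wed Apr  7 12:57:29 2004|qmaster|hai7|W|could not decrease "max_u_jobs" job counter',
    'Wed Apr  7 12:57:30 2004|qmaster|hai7|E|job "717322" does not exist',
]

for (timestamp, level, message), n in tally_log(sample).most_common():
    print(n, timestamp, level, message)
```

Since the timestamps have one-second resolution, the count per key is 
roughly the per-second rate of that message.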

When the qdel began, the log started getting

Wed Apr  7 12:57:30 2004|qmaster|hai7|E|job "717322" does not exist

for the jobs being deleted, but I suppose that's normal(?)


Is there a way for jobs in error state to be deleted automatically?
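
In the meantime, something like the following Python sketch is roughly what 
I have in mind as a stopgap (e.g. run periodically from cron). It is not a 
built-in SGE feature. It assumes the default qstat listing, where the job ID 
is the first column and the state is the fifth, and that a state containing 
"E" (e.g. Eqw) marks an error job; the column layout and the -u option may 
differ between versions, so check against your own qstat output. The names 
error_job_ids and delete_error_jobs, and the sample listing, are made up for 
illustration.

```python
import subprocess


def error_job_ids(qstat_text):
    """Return job IDs whose state column contains 'E' (e.g. 'Eqw').

    Assumption: job ID is column 1 and state is column 5, as in the
    default qstat listing. Header and separator lines are skipped
    because their first field is not numeric.
    """
    ids = []
    for line in qstat_text.splitlines():
        fields = line.split()
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        if "E" in fields[4] and fields[0] not in ids:
            ids.append(fields[0])
    return ids


def delete_error_jobs(user, dry_run=True):
    """List (and, unless dry_run, qdel) the error-state jobs of one user."""
    listing = subprocess.run(
        ["qstat", "-u", user],
        capture_output=True, text=True, check=True,
    ).stdout
    ids = error_job_ids(listing)
    if ids and not dry_run:
        # One qdel call for all IDs; batch this if the list gets very long.
        subprocess.run(["qdel"] + ids, check=True)
    return ids


# Illustrative listing only, not real output from our cluster.
SAMPLE = """\
job-ID  prior name    user   state submit/start at     queue
-------------------------------------------------------------
717322  0     sim.sh  user1  Eqw   04/07/2004 12:50:01
717400  0     sim.sh  user2  qw    04/07/2004 12:51:10
717401  0     run.sh  user2  r     04/07/2004 12:52:00 hai7.q
"""

print(error_job_ids(SAMPLE))
```

With dry_run=True (the default) delete_error_jobs only reports which jobs 
would be removed, which seems like the safer way to try it first.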




