[GE users] Error jobs hanging SGE

Bryan Bayerdorffer bryan.bayerdorffer at analog.com
Thu Apr 15 14:08:10 BST 2004


Andy Schwierskott wrote:
> Bryan,
> 
> 
>>So yesterday one of our users submitted several thousand jobs that all went
>>into error state because the -o directory was not writable.
>>Eventually, no more jobs were dispatched to exec hosts.  Submissions were
>>still accepted with normal speed and the communication between qmaster and
>>exec hosts was ok.  The CPU usage by qmaster and schedd looked normal given
>>that there were thousands of pending jobs.
>>
>>The qmaster log was getting
>>
>>Wed Apr  7 12:57:29 2004|qmaster|hai7|W|could not decrease "max_u_jobs" job counter
>>
>>about 50 times/sec.  I rebooted the qmaster host, which didn't change
>>anything.  Finally I deleted all the error and pending jobs of the user; once
>>this had been done jobs started running normally and the max_u_jobs message
>>stopped going into the log.
>>
>>When the qdel began, the log started getting
>>
>>Wed Apr  7 12:57:30 2004|qmaster|hai7|E|job "717322" does not exist
>>
>>for the jobs being deleted, but I suppose that's normal(?)
> 
> 
> Which version are you using?

5.3p4
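
To put a number on the "about 50 times/sec" rate, the qmaster log can be tallied per timestamp. This is only a minimal sketch, assuming every log line follows the pipe-separated layout of the two messages quoted above (timestamp|daemon|host|level|message). The function name count_per_second and the sample list are hypothetical, and the sample is built from the two quoted lines rather than from a real log file.

```python
from collections import Counter


def count_per_second(lines, needle):
    """Count log lines whose message contains `needle`, keyed by timestamp."""
    counts = Counter()
    for line in lines:
        # Assumed layout: timestamp|daemon|host|level|message
        parts = line.rstrip("\n").split("|", 4)
        if len(parts) == 5 and needle in parts[4]:
            # The timestamp has one-second resolution, so it serves as the bucket.
            counts[parts[0]] += 1
    return counts


# Hypothetical sample assembled from the two messages quoted in this thread.
sample = [
    'Wed Apr  7 12:57:29 2004|qmaster|hai7|W|could not decrease "max_u_jobs" job counter',
    'Wed Apr  7 12:57:29 2004|qmaster|hai7|W|could not decrease "max_u_jobs" job counter',
    'Wed Apr  7 12:57:30 2004|qmaster|hai7|E|job "717322" does not exist',
]

rates = count_per_second(sample, "max_u_jobs")
for timestamp, n in rates.items():
    print(timestamp, n)
```

Against a real log, the same function could be fed an open file object instead of the sample list; the path to the qmaster messages file depends on the installation and is not assumed here.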




