[GE users] Queue on Error state

Fco. Javier Modrego modrego at unizar.es
Tue Jul 29 12:18:15 BST 2008

I frequently find my queues in error state, so that new jobs cannot 
start. An example of the error messages is below. The problem seems 
to be (I think...) that a preceding job has erased the entire 
content (files and subdirectories) of the local spooling directory on 
the compute nodes (/tmp/sge in my installation), so the spooling 
files for new jobs cannot be created. I think I understand what is 
happening, but I have no clue how to solve it.

My main suspects are parallel Turbomole jobs, but I cannot find 
anything in their scripts that would explain this behaviour. I would be 
grateful if anybody with experience integrating Turbomole with SGE 
could give me a hand... Maybe a symbol in conflict with SGE and an 
assassin "rm"... I have no clue.

Also, clearing the error state is not just a matter of running qmod 
-cq, as that doesn't work straight away. The daemons on the node are 
still running and must be killed, and then the queue stopped and started...
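
For anyone wanting to confirm the symptom on a node before touching the 
queue, here is a minimal Python sketch that reports which spool 
subdirectories are missing. It is only an illustration: the function name 
missing_spool_dirs is made up, and the layout it checks (active_jobs, jobs 
and job_scripts under <spool root>/<host>) is an assumption about how the 
execution daemon arranges its local spool directory, not something taken 
from the logs below.

```python
import os

# Assumption: the execution daemon keeps these subdirectories under
# <spool_root>/<host>. Adjust the tuple if your installation differs.
EXPECTED_SUBDIRS = ("active_jobs", "jobs", "job_scripts")


def missing_spool_dirs(spool_root, host):
    """Return the expected spool subdirectories that do not exist.

    spool_root -- local spool directory, e.g. "/tmp/sge"
    host       -- host name used for the per-host subdirectory (hypothetical
                  parameter; use whatever name appears under spool_root)
    """
    host_dir = os.path.join(spool_root, host)
    return [name for name in EXPECTED_SUBDIRS
            if not os.path.isdir(os.path.join(host_dir, name))]


# Example call on a compute node (not executed here):
#     missing_spool_dirs("/tmp/sge", "nodo01")
```

An empty list would mean the directories are in place. A non-empty list 
would be consistent with a job having wiped the spool area, although it 
does not say which job did it.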

	Thanks in advance
	F.J. Modrego

Note: the installed version of SGE is 6.1u4

07/29/2008 05:59:29|qmaster|ml350|W|job 1304.1 failed on host nodo01.localdomain general assumedly before job because: can't create directory active_jobs/1304.1: No such file or directory
07/29/2008 05:59:29|qmaster|ml350|W|rescheduling job 1304.1
07/29/2008 05:59:29|qmaster|ml350|E|queue larga marked QERROR as result of job 1304's failure at host nodo01.localdomain
07/29/2008 05:59:29|qmaster|ml350|W|queue "larga@nodo01.localdomain" is marked QERROR
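
To find out which job first pushed a queue into error state, the qmaster 
messages can be scanned automatically. The Python sketch below is a rough 
illustration, assuming each entry sits on a single line in the form 
"timestamp|daemon|host|level|message" as in the excerpt above. The function 
names and the regular expression are my own and are matched only against 
the wording of the "E" line shown here; other versions may phrase the 
message differently.

```python
import re

# Pattern taken from the single "E" line in the excerpt; an assumption, not
# an official message format.
QERROR_RE = re.compile(
    r"queue (\S+) marked QERROR as result of job (\d+)'s failure at host (\S+)"
)


def parse_line(line):
    """Split one 'timestamp|daemon|host|level|message' line into a dict."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None  # not a complete log entry (e.g. a wrapped fragment)
    keys = ("timestamp", "daemon", "host", "level", "message")
    return dict(zip(keys, parts))


def qerror_events(lines):
    """Return one dict per error-level entry that marks a queue QERROR."""
    events = []
    for line in lines:
        rec = parse_line(line)
        if rec is None or rec["level"] != "E":
            continue
        match = QERROR_RE.search(rec["message"])
        if match:
            events.append({
                "timestamp": rec["timestamp"],
                "queue": match.group(1),
                "job": int(match.group(2)),
                "host": match.group(3),
            })
    return events
```

Feeding it the lines of the messages file (for example via 
open(path).readlines()) gives a list of queue, job and host triples that 
can be compared against the accounting data to see which job ran on that 
host just before the failure.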

  Dr. F.J. Modrego
  Department of Inorganic Chemistry
  Facultad de Ciencias
  University of Zaragoza
  50009 ZARAGOZA
  Tel <34>-976-762288
  Fax <34>-976-761187
  E-mail:  modrego at unizar.es
