[GE users] Re: Queue on Error state

Reuti reuti at staff.uni-marburg.de
Thu Jul 31 21:58:52 BST 2008


On 31.07.2008 at 21:19, Fco. Javier Modrego wrote:

> The installed version is Turbomole 5.10, but it is just a suspect; I
> cannot swear it's guilty... maybe the origin of the problem is
> elsewhere. One symptom is that all the queue instances where a
> multi-CPU parallel job has been running enter the E state for the
> next incoming jobs, because all the nodes' spooling directories are
> empty and the spool files cannot be created...
> The cluster is running RHEL4 on HP ProLiant nodes, which have two
> quad-core Xeon CPUs each...

- What PE did you define for Turbomole? You will need
"job_is_first_task FALSE", as Turbomole (except ricc2) always needs
one task more than requested to collect the data, and it increases the
slot count on its own.
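
A minimal sketch of such a PE, written to a file so it can be loaded with "qconf -Ap". The PE name "turbomole", the slot count, the allocation rule and the /path/to/sge prefix are only placeholders, and I assume a tight integration using the startmpi.sh/stopmpi.sh scripts from SGE's mpi directory. Only the job_is_first_task line comes from the point above:

```shell
# Write a PE definition to a file. Name, slots and paths are placeholders.
pe_file="${TMPDIR:-/tmp}/turbomole.pe"
cat > "$pe_file" <<'EOF'
pe_name            turbomole
slots              64
user_lists         NONE
xuser_lists        NONE
start_proc_args    /path/to/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args     /path/to/sge/mpi/stopmpi.sh
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
EOF
# As an SGE manager, register it with:  qconf -Ap "$pe_file"
```

The PE must afterwards be attached to the queue's pe_list before jobs can request it.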

- As 5.10 uses HP-MPI, did you set:

export MPI_REMSH=rsh
export PBS_NODEFILE=$TMPDIR/machines
export TURBOTMPDIR=$TMPDIR
export PARA_ARCH=MPI
export PARNODES=$NSLOTS

- Can you check with:

ps -e f -o pid,ppid,pgrp,command

(f without a leading -) whether all tasks are children of sge_shepherd
during a parallel run?
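
If you prefer to test a single process instead of reading the whole tree, something along these lines might do. It is only a sketch: the function name is made up, and it relies on nothing more than the standard "ps -o comm=" and "ps -o ppid=" output.

```shell
# Walk up the parent chain of a PID and report (via exit status)
# whether an sge_shepherd process is among its ancestors.
# has_shepherd_ancestor is a hypothetical helper name.
has_shepherd_ancestor() {
    pid=$1
    while [ -n "$pid" ] && [ "$pid" -gt 1 ]; do
        name=$(ps -o comm= -p "$pid" 2>/dev/null)
        case "$name" in
            *sge_shepherd*) return 0 ;;
        esac
        # Move on to the parent; an empty result ends the loop
        pid=$(ps -o ppid= -p "$pid" 2>/dev/null | tr -d ' ')
    done
    return 1
}
```

On an execution node you would call it with the PID of one of the Turbomole tasks, e.g. "has_shepherd_ancestor 12345 && echo ok" (12345 being just an example PID).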

- Maybe you also need in the job script a:

kdg scratch

(During one run Turbomole will add it to the control file and reuse
it in the next run. When you use the usual $TMPDIR for it, it won't
exist the next time, as the job number is part of the name. Turbomole
will only add a data group "scratch" on its own, but it will never
alter an existing one, even if you specify a different TURBOTMPDIR
before the next run.)
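
Put together, the start of a job script could look roughly like this. The PE name and slot count are examples only, the ${...:-...} fallbacks and the "command -v" guard are just there so the fragment also runs outside of SGE, and the actual Turbomole call is left out:

```shell
#!/bin/sh
#$ -cwd
#$ -pe turbomole 8      # hypothetical PE name and slot count

# Settings from above; SGE provides TMPDIR and NSLOTS inside a job
export MPI_REMSH=rsh
export PBS_NODEFILE="${TMPDIR:-/tmp}/machines"
export TURBOTMPDIR="${TMPDIR:-/tmp}"
export PARA_ARCH=MPI
export PARNODES="${NSLOTS:-1}"

# Remove a stale "scratch" entry left in the control file by an earlier run
if command -v kdg >/dev/null 2>&1; then
    kdg scratch
fi

# ... the Turbomole program call follows here ...
```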

-- Reuti


> 	F.J. Modrego
>
>
>
>
>
>> users Digest 31 Jul 2008 12:53:04 -0000 Issue 1475
>>
>> Topics (messages 25445 through 25455):
>>
>> Queue on Error state
>> 	25445 by: Fco. Javier Modrego
>> 	25446 by: Reuti
>>
>>
>> From: Reuti <reuti at staff.uni-marburg.de>
>> Date: Tue, 29 Jul 2008 13:30:02 +0200
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] Queue on Error state
>>
>> Hi,
>>
>> On 29.07.2008 at 13:18, Fco. Javier Modrego wrote:
>>
>>> Frequently I found my queues in error state and new jobs cannot  
>>> start. An example of the error
>
> .....
>
>>> Also, clearing the error state is not just a matter of running
>>> qmod -cq, as it doesn't work straight away. The daemons on the node
>>> are running and must be killed, and then the queue stopped and
>>> started...
>>
>> which version of Turbomole?
>>
>> -- Reuti
>>
>>> 	Thanks in advance
>>> 	F.J. Modrego
>>>
>>> Note: the installed version of SGE is 6.1u4
>>>
>>
>
>
> -- 
>  Dr. F.J. Modrego
>  Department of Inorganic Chemistry
>  Facultad de Ciencias
>  University of Zaragoza
>  50009 ZARAGOZA
>  SPAIN
>  Tel <34>-976-762288
>  Fax <34>-976-761187
>  E-mail:  modrego at unizar.es
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net

