[GE users] some MPI jobs disappear

Reuti reuti at staff.uni-marburg.de
Tue May 10 16:37:14 BST 2005


Hello again,

Since the spool directories are local to the nodes: is there enough
space left on the disk? Or, put another way: is it always the same
nodes that are hit when an MPI job fails? If that's the case, e.g.
node087, can you please check (while no job is running there) that
the active_jobs, jobs and job_scripts dirs are empty?
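
Something like this, just as a rough sketch - the paths below assume a
local execd spool directory under /var/spool/sge (adjust to whatever
execd_spool_dir points to on your nodes):

   # on node087, while no job is running there
   df -h /var/spool/sge/node087                # enough space left?
   ls -lA /var/spool/sge/node087/active_jobs
   ls -lA /var/spool/sge/node087/jobs
   ls -lA /var/spool/sge/node087/job_scripts

All three dirs should be empty on an idle node; anything left in there
points to a job the execd lost track of.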

Another idea: maybe setting "loglevel" to "log_info" in the
configuration will give you more information. Is there any further
hint in the messages file in the spool dir of the node?
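
E.g. (assuming you change it in the global cluster configuration; a
host-specific "qconf -mconf node087" would also do):

   qconf -mconf        # change the line to:  loglevel   log_info
   # then watch the execd messages file on the node, e.g.:
   tail -f /var/spool/sge/node087/messages

Again, the path to the messages file depends on your execd_spool_dir
setting.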

Cheers - Reuti


Karen Brazier wrote:
> Hi Reuti,
> 
> Yes, the spool directories are local to the nodes.  The active_jobs
> directory is owned by user and group 'codine' (sorry!) and has 755
> permissions.  I've had no reports that non-MPI jobs are affected, and
> even MPI jobs are only affected sometimes.  The startmpi.sh script from
> before is identical to the one I have now.  The only non-standard bits
> I found during the upgrade were in the start-up script in /etc/init.d, to
> handle failover to a standby master.
> 
> Many thanks,
> Karen
> 
> On Tue, 10 May 2005, Reuti wrote:
> 
> 
>>Hi Karen,
>>
>>do you have a local spool directory on the nodes, or is it central on an
>>NFS server? Are only MPI jobs affected? Did you modify startmpi.sh
>>before the upgrade?
>>
>>CU - Reuti
>>
>>
>>Karen Brazier wrote:
>>
>>>Hi,
>>>
>>>I've recently upgraded from 6.0u1 to 6.0u3, which means that the memory
>>>leak in schedd is cured but now some MPI jobs are failing to start and
>>>others produce an error message on exit.
>>>
>>>A summary of symptoms for the jobs that fail is:
>>>
>>>. startmpi.sh can't read (or can't find) the pe_hostfile
>>>. jobs fail with error "21 : in recognising job"
>>>. qacct has very little info: it gives qsub time as 1 Jan 01:00:00 1970
>>>  and has 'UNKNOWN' for hostname, group and owner
>>>. SGE messages file reports, e.g.:
>>>05/09/2005 09:18:57|qmaster|hamilton|W|job 18560.1 failed on host
>>>node087.beowulf.cluster in recognising job because: execd doesn't know
>>>this job
>>>05/09/2005 09:19:03|qmaster|hamilton|E|execd node087.beowulf.cluster
>>>reports running state for job (18560.1/master) in queue
>>>"ether.q@node087.beowulf.cluster" while job is in state 65536
>>>05/09/2005 09:19:43|qmaster|hamilton|E|execd@node087.beowulf.cluster
>>>reports running job (18560.1/master) in queue
>>>"ether.q@node087.beowulf.cluster" that was not supposed to be there -
>>>killing
>>>
>>>Other MPI jobs complete, but produce error "100 : assumedly after job" and
>>>the messages file states:  05/09/2005 15:28:01|qmaster|hamilton|E|tightly
>>>integrated parallel task 18578.1 task 1.node025 failed - killing job
>>>
>>>
>>>Can anyone point me towards the problem?
>>>
>>>Many thanks,
>>>Karen


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



