[GE users] high CPU load for sge_qmaster

Reuti reuti at staff.uni-marburg.de
Tue May 10 15:36:14 BST 2005


Hi Karen,

do you have a local spool directory on the nodes, or is it central on an
NFS server? Are only MPI jobs affected? Did you modify startmpi.sh
before the upgrade?
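
To see whether the failures cluster on particular nodes or only on MPI jobs, it may help to pull every qmaster messages entry for one job. Below is a minimal Python sketch, assuming the pipe-separated layout visible in the quoted log lines (timestamp|component|host|level|text) and assuming each entry sits on a single line in the actual file (the mail wrapped them). The names parse_message_line and lines_for_job are hypothetical helpers, not part of SGE.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class MessageLine(NamedTuple):
    timestamp: str
    component: str
    host: str
    level: str
    text: str


def parse_message_line(line: str) -> Optional[MessageLine]:
    """Split one messages-file line into its five assumed fields.

    Returns None for lines that do not match the assumed layout.
    """
    # Split at most 4 times so any '|' inside the message text is kept.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return MessageLine(*parts)


def lines_for_job(lines: Iterable[str], job_id: str) -> Iterator[MessageLine]:
    """Yield parsed entries whose text mentions job_id.

    This is a plain substring match, so "18560.1" would also match
    "118560.1"; good enough for a quick look, not for accounting.
    """
    for raw in lines:
        entry = parse_message_line(raw)
        if entry is not None and job_id in entry.text:
            yield entry


# Two entries taken from the quoted report, joined back onto single lines.
sample = [
    "05/09/2005 09:18:57|qmaster|hamilton|W|job 18560.1 failed on host "
    "node087.beowulf.cluster in recognising job because: execd doesn't "
    "know this job",
    "05/09/2005 15:28:01|qmaster|hamilton|E|tightly integrated parallel "
    "task 18578.1 task 1.node025 failed - killing job",
]

for entry in lines_for_job(sample, "18560.1"):
    print(entry.timestamp, entry.level, entry.text)
```

In practice the list would be replaced by an open file handle on the qmaster messages file; its path depends on the installation and is not given in this thread.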

CU - Reuti


Karen Brazier wrote:
> Hi,
> 
> I've recently upgraded from 6.0u1 to 6.0u3, which means that the memory
> leak in schedd is cured but now some MPI jobs are failing to start and
> others produce an error message on exit.
> 
> A summary of symptoms for the jobs that fail is:
> 
> . startmpi.sh can't read (or can't find) the pe_hostfile
> . jobs fail with error "21 : in recognising job"
> . qacct has very little info: it gives qsub time as 1 Jan 01:00:00 1970
>   and has 'UNKNOWN' for hostname, group and owner
> . SGE messages file reports, e.g.:
> 05/09/2005 09:18:57|qmaster|hamilton|W|job 18560.1 failed on host
> node087.beowulf.cluster in recognising job because: execd doesn't know
> this job
> 05/09/2005 09:19:03|qmaster|hamilton|E|execd node087.beowulf.cluster
> reports running state for job (18560.1/master) in queue
> "ether.q@node087.beowulf.cluster" while job is in state 65536
> 05/09/2005 09:19:43|qmaster|hamilton|E|execd@node087.beowulf.cluster
> reports running job (18560.1/master) in queue
> "ether.q@node087.beowulf.cluster" that was not supposed to be there -
> killing
> 
> Other MPI jobs complete, but produce error "100 : assumedly after job" and
> the messages file states:  05/09/2005 15:28:01|qmaster|hamilton|E|tightly
> integrated parallel task 18578.1 task 1.node025 failed - killing job
> 
> 
> Can anyone point me towards the problem?
> 
> Many thanks,
> Karen
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net





