[GE users] high CPU load for sge_qmaster

Karen Brazier karen.brazier at durham.ac.uk
Tue May 10 14:13:33 BST 2005


I've recently upgraded from 6.0u1 to 6.0u3. This has cured the memory
leak in schedd, but now some MPI jobs fail to start and others produce
an error message on exit.

A summary of symptoms for the jobs that fail is:

. startmpi.sh can't read (or can't find) the pe_hostfile
. jobs fail with error "21 : in recognising job"
. qacct has very little info: it gives qsub time as 1 Jan 01:00:00 1970
  and has 'UNKNOWN' for hostname, group and owner
. SGE messages file reports, e.g.:
05/09/2005 09:18:57|qmaster|hamilton|W|job 18560.1 failed on host node087.beowulf.cluster in recognising job because: execd doesn't know this job
05/09/2005 09:19:03|qmaster|hamilton|E|execd node087.beowulf.cluster reports running state for job (18560.1/master) in queue "ether.q@node087.beowulf.cluster" while job is in state 65536
05/09/2005 09:19:43|qmaster|hamilton|E|execd@node087.beowulf.cluster reports running job (18560.1/master) in queue "ether.q@node087.beowulf.cluster" that was not supposed to be there -

Other MPI jobs complete, but produce error "100 : assumedly after job" and
the messages file states:
05/09/2005 15:28:01|qmaster|hamilton|E|tightly integrated parallel task 18578.1 task 1.node025 failed - killing job
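
To line up what qmaster logged for each failing job, the messages file can be grouped by job ID. The following is only a rough sketch: it assumes each entry sits on one line in the form timestamp|daemon|host|level|message (as the excerpts above suggest), and that the first "number.number" token in the message is the job and task ID. The function names and the SAMPLE data are made up for illustration and are not part of SGE.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumption: the first "digits.digits" token in a message is job.task.
JOB_ID = re.compile(r"\b\d+\.\d+\b")


def parse_line(line):
    """Split one messages-file line into its five assumed fields."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None  # not in the assumed layout
    stamp, daemon, host, level, message = parts
    try:
        when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None  # timestamp does not match the assumed format
    return {"time": when, "daemon": daemon, "host": host,
            "level": level, "message": message}


def entries_by_job(lines):
    """Group parsed entries by the job ID found in their message."""
    jobs = defaultdict(list)
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        match = JOB_ID.search(entry["message"])
        if match:
            jobs[match.group(0)].append(entry)
    return jobs


# Two entries copied from the excerpts above, unwrapped onto one line each.
SAMPLE = [
    "05/09/2005 09:18:57|qmaster|hamilton|W|job 18560.1 failed on host "
    "node087.beowulf.cluster in recognising job because: execd doesn't "
    "know this job",
    "05/09/2005 15:28:01|qmaster|hamilton|E|tightly integrated parallel "
    "task 18578.1 task 1.node025 failed - killing job",
]

if __name__ == "__main__":
    for job, entries in sorted(entries_by_job(SAMPLE).items()):
        for e in entries:
            print(job, e["time"], e["level"], e["message"])
```

To run it against a real file, pass an open file object to entries_by_job() in place of SAMPLE; the location of the messages file depends on the installation.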

Can anyone point me towards the problem?

Many thanks,
