[GE users] Yet another MPICH tight-integration problem

Andreas Haas Andreas.Haas at Sun.COM
Wed Sep 8 13:31:00 BST 2004


This indicates that the task hangs in the function handle_task()
in daemons/execd/execd_job_exec.c, which means the job is at
least known to execd. Possibly all slots are already consumed
by tasks, or an error occurred in execd.

To diagnose the problem further, you should run an execd in
monitoring mode. This gives us additional diagnostic output
that will hopefully provide new insights. Since you have so many
execds and the error always happens on the master node, you
should pick one machine to become the master node. Then
shut down that execd using

      qconf -ke <host>

and restart it as described under "Running daemons in
monitoring mode" in

   http://gridengine.sunsource.net/unbranded-source/browse/%7Echeckout%7E/gridengine/source/libs/rmon/rmon.html?content-type=text/html

If you look at the code in handle_task(), you'll see that for each
task a monitoring message such as

   "got task ( ... "

must appear in your monitoring output. I recommend _not_ using a
monitoring level higher than "dl 1". Otherwise you won't actually
see anything because there is too much output.
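
If you redirect the monitoring output to a file, a small filter
makes the "got task (" lines easy to spot. This is only a sketch:
the file name execd_monitor.log and the function name
count_got_task are placeholders, and the text following
"got task (" depends on your execd version.

```shell
#!/bin/sh
# Count the "got task (" lines in captured execd monitoring output.
# The file passed in is hypothetical: use whatever file you
# redirected the monitoring output to.
count_got_task() {
    # -F matches the literal string, -c prints the number of matching lines
    grep -F -c 'got task (' "$1"
}

# Example (hypothetical file name):
#   count_got_task execd_monitor.log
```

Compare the count with the number of tasks you expect to be
started on the master node.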

Please note that to specify a particular machine as the
master node you can use the

   -masterq "qname@<host>"

option when submitting the test job.
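
As a rough illustration of such a submission, the sketch below
only assembles and prints a qsub command line, it does not submit
anything. The PE name "mpich", the slot count, the queue name
"all.q", the host "eee006" and the script name mpi_test.sh are
all made-up values, and build_submit_cmd is a helper invented
for this example; substitute your own settings.

```shell
#!/bin/sh
# Assemble (and only print) a qsub command line that pins the master task
# of a parallel job to one queue instance via -masterq.
# Arguments: PE name, slot count, queue name, host, job script.
build_submit_cmd() {
    pe="$1"; slots="$2"; queue="$3"; host="$4"; script="$5"
    printf 'qsub -pe %s %s -masterq %s@%s %s\n' \
        "$pe" "$slots" "$queue" "$host" "$script"
}

# Placeholder values only; prints the command instead of running it.
build_submit_cmd mpich 4 all.q eee006 mpi_test.sh
```

Once the printed line looks right for your setup, run it directly.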

Regards,
Andreas


On Tue, 7 Sep 2004, David S. wrote:

>
> > Your allocation rule should be fine. But try switching
> > job_is_first_task from TRUE to FALSE. If this does not
> > resolve the problem you need to look into execd's messages
> > file. There you should find diagnostic information that should
> > help you.
>
> Changing the value of 'job_is_first_task' makes no difference.
> In either case, the grid engine appears to start the master
> process and one slave process on a node, walks through
> '$TMPDIR/machines' starting slaves on the nodes listed there,
> then tries and fails to start a second slave on the node
> running the master.  At that point the job aborts.  All that's
> in the 'messages' file in the spool directory of the master's
> node is a message like
>
> 	09/07/2004 20:48:25|execd|eee006|E|no free queue for job 75 of user dgs at eee008.grid.gs.washington.edu (localhost = eee006.grid.gs.washington.edu)
>
> David S.
>
> >
>
