[GE users] nodes overloaded: processes placed on already full nodes

steve_s elcortogm at googlemail.com
Tue Dec 21 18:21:48 GMT 2010

On Dec 21 18:22 +0100, reuti wrote:
> >> What does:
> >> 
> >> ps -e f
> >> 
> >> (f w/o -) show on such a node? Are all the processes bound to an
> >> sge_shepherd, or did some jump out of the processes tree and weren't
> >> killed?
> > 
> > There are no sge_shepherd's on the nodes. I did not set up SGE on the
> > machine but what I understand from the documentation is that
> > sge_shepherd is only used in the case of "tight integration" of PEs.
> > In our case, the PE starts the MPI processes.
> Well, even with a loose integration, you have to honor the list of
> granted machines for your job. What do you mean in detail by "the PE
> starts the MPI processes"? You will need at least an sge_execd on the
> nodes, so that SGE is aware of their existence and can make a suitable
> slot allocation for your job. (The sge_execd will then start the
> shepherd in case of a tight integration.)

Yes, sge_execd is present on each node, as well as sge_shepherd-$JOB_ID
on the master node, where the job-script is executed:

 4693 ?        Sl    33:32 /cm/shared/apps/sge/current/bin/lx26-amd64/sge_execd
12165 ?        S      0:00  \_ sge_shepherd-60013 -bg
12389 ?        S      0:00                  \_ python /cm/shared/apps/intel/impi/ ....

Apparently, we have tight integration then. I had looked for sge_shepherd
on the wrong node (not the master node). This is the first time I've taken
a closer look at these daemons, hence the confusion (we got the machine
pre-configured, and getting familiar with a system always takes a factor
of pi longer than expected). Sorry for the noise.

Now that we know what to look for, we can search for job processes which do
not stay within their sge_shepherd's process tree.

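As a rough sketch of that check (not from this thread; the process table and
names below are invented for illustration), one could parse `ps -e -o
pid,ppid,comm` output and walk parent links to flag processes that never
reach an sge_* daemon, i.e. candidates that "jumped out" of the shepherd's
tree:

```python
# Hypothetical helper: flag processes that escaped the sge_shepherd tree
# by walking PPID links from `ps -e -o pid,ppid,comm` output.

def parse_ps(text):
    """Parse ps output into {pid: ppid} and {pid: comm} maps."""
    ppid, comm = {}, {}
    for line in text.strip().splitlines():
        pid_s, ppid_s, name = line.split(None, 2)
        ppid[int(pid_s)] = int(ppid_s)
        comm[int(pid_s)] = name
    return ppid, comm

def escaped(ppid, comm):
    """PIDs whose ancestor chain never reaches an sge_* daemon."""
    def managed(pid):
        seen = set()
        while pid in ppid and pid not in seen:
            seen.add(pid)
            if comm[pid].startswith("sge_"):
                return True          # self or an ancestor is SGE-managed
            pid = ppid[pid]
        return False                 # reached the top without meeting sge_*
    return sorted(p for p in comm if not managed(p))

# Invented sample: 7777/7778 were reparented to init and lost their shepherd.
sample = """\
 4693     1 sge_execd
12165  4693 sge_shepherd-60013
12389 12165 python
 7777     1 mpirun
 7778  7777 a.out
"""
print(escaped(*parse_ps(sample)))   # -> [7777, 7778]
```

In practice one would feed it live `ps` output per node and filter for the
job owner's UID as well, but the tree walk is the essential part.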

