[GE users] Unused add_grp_id error? Cannot get connection error?

Heywood, Todd heywood at cshl.edu
Fri Dec 14 14:57:57 GMT 2007


Hi,

I am getting job errors, where the job stderr says (for example):

error: cannot get connection to "shepherd" at host "blade14"

When I go to look at the /var/spool/sge/blade14/messages file (for example),
there are 4 of these messages:

can't start job "4353897": can not find an unused add_grp_id

OK. But my sge_conf has gid_range set to 20000-20100, and only 4 jobs are
allowed to run per blade/node (hence 4 messages in the /var/spool/...
messages file).
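
As a sanity check on the arithmetic, here is a minimal sketch that counts the IDs in a gid_range value. It assumes the bounds are inclusive and that the value may be a comma-separated list of ranges or single numbers; the function name gid_range_size is only illustrative and is not part of SGE.

```python
def gid_range_size(spec: str) -> int:
    """Count the group IDs covered by a gid_range string such as "20000-20100"."""
    total = 0
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            low, high = (int(x) for x in part.split("-", 1))
            total += high - low + 1  # assumption: both bounds are inclusive
        else:
            int(part)  # raises ValueError if the entry is not a number
            total += 1  # a single ID
    return total


if __name__ == "__main__":
    size = gid_range_size("20000-20100")
    slots_per_node = 4  # jobs allowed per blade, as described above
    print(f"{size} group IDs configured, {slots_per_node} slots per node")
```

By this count the configured range is far larger than the 4 slots per node, which is why the "unused add_grp_id" message is surprising.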

So I look in the qmaster .../spool/qmaster/messages file, and see this:

12/13/2007 23:02:15|qmaster|bhmnode2|E|tightly integrated parallel task 4353897.1 task 2623.blade7 failed - killing job

This timestamp is a couple of minutes after the other errors (same job ID).

So I go to blade7, and the messages file there says:

12/13/2007 23:00:36|execd|blade7|E|slave shepherd of job 4353897.1 exited with exit status = 11
12/13/2007 23:00:36|execd|blade7|E|slave shepherd of job 4353897.1 exited with exit status = 11
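
To line up the entries from the different messages files, something like the following sketch could help. It assumes each entry is one line of the form timestamp|daemon|host|level|message, as in the excerpts above, and that the clocks on the hosts are roughly in sync. The names parse_line and errors_for_job are hypothetical helpers, not SGE tools.

```python
from datetime import datetime

# Sample entries copied from the messages files quoted above.
SAMPLE = [
    "12/13/2007 23:02:15|qmaster|bhmnode2|E|tightly integrated parallel task "
    "4353897.1 task 2623.blade7 failed - killing job",
    "12/13/2007 23:00:36|execd|blade7|E|slave shepherd of job 4353897.1 "
    "exited with exit status = 11",
    "12/13/2007 23:00:36|execd|blade7|E|slave shepherd of job 4353897.1 "
    "exited with exit status = 11",
]


def parse_line(line):
    """Split one messages entry into its five pipe-separated fields."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None  # not in the assumed format; skip it
    timestamp, daemon, host, level, message = parts
    return {
        "time": datetime.strptime(timestamp, "%m/%d/%Y %H:%M:%S"),
        "daemon": daemon,
        "host": host,
        "level": level,
        "message": message,
    }


def errors_for_job(lines, job_id):
    """Return error-level entries mentioning job_id, oldest first."""
    hits = []
    for line in lines:
        record = parse_line(line)
        if record and record["level"] == "E" and job_id in record["message"]:
            hits.append(record)
    hits.sort(key=lambda r: r["time"])  # stable sort keeps file order for ties
    return hits


if __name__ == "__main__":
    for rec in errors_for_job(SAMPLE, "4353897"):
        print(rec["time"], rec["host"], rec["daemon"], rec["message"])
```

Feeding it the lines from each node's messages file plus the qmaster one would give a single timeline for the job, with the blade7 shepherd exits ahead of the qmaster "killing job" entry.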

So, I'm totally confused. Any idea what is going on?

Thanks,

Todd Heywood


