[GE users] Unused add_grp_id error? Cannot get connection error?

Rayson Ho rayrayson at gmail.com
Fri Dec 14 15:24:08 GMT 2007


On Dec 14, 2007 9:57 AM, Heywood, Todd <heywood at cshl.edu> wrote:
> error: cannot get connection to "shepherd" at host "blade14"
>
> When I go to look at the /var/spool/sge/blade14/messages file (for example),
> there are 4 of these messages:
>
> can't start job "4353897": can not find an unused add_grp_id

What is the "gid_range" of the host?

Also see the sge_conf(5) manpage.

Rayson
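
As a quick sanity check on the question above, the size of a gid_range can be counted with a short sketch. It assumes gid_range is a comma-separated list of m-n ranges (check sge_conf(5) for the exact syntax on your version); the function name parse_gid_range is hypothetical and not part of Grid Engine. The value used is the one quoted in the message below.

```python
def parse_gid_range(expr):
    """Expand a gid_range-style string into a sorted list of group IDs.

    Assumption: the value is a comma-separated list of "m-n" ranges,
    for example "20000-20100".
    """
    gids = set()
    for part in expr.split(","):
        part = part.strip()
        if not part:
            continue
        low_text, high_text = part.split("-", 1)
        low, high = int(low_text), int(high_text)
        if low > high:
            raise ValueError("range start exceeds range end: " + part)
        # Both ends of the range are inclusive.
        gids.update(range(low, high + 1))
    return sorted(gids)


gids = parse_gid_range("20000-20100")
print(len(gids))  # 101
```

By this count the configured range holds 101 IDs, well above the 4 jobs per node mentioned below, so the size of the range alone does not obviously explain the "can not find an unused add_grp_id" message.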



>
> OK. But my sge_conf has gid_range set to 20000-20100, and only 4 jobs are
> allowed to run per blade/node (hence 4 messages in the /var/spool/...
> messages file).
>
> So I look in the qmaster .../spool/qmaster/messages file, and see this:
>
> 12/13/2007 23:02:15|qmaster|bhmnode2|E|tightly integrated parallel task 4353897.1 task 2623.blade7 failed - killing job
>
> This time stamp is a couple of minutes after the other errors (same job ID).
>
> So I go to blade7, and the messages file there says:
>
> 12/13/2007 23:00:36|execd|blade7|E|slave shepherd of job 4353897.1 exited with exit status = 11
> 12/13/2007 23:00:36|execd|blade7|E|slave shepherd of job 4353897.1 exited with exit status = 11
>
> So, I'm totally confused. Any idea what is going on?
>
> Thanks,
>
> Todd Heywood
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>
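
Lining up the messages files from several hosts by job ID, as done by hand in the quoted message, could be scripted roughly as follows. This is only a sketch: it assumes each entry sits on one line in the pipe-separated form seen in the thread (timestamp|component|host|level|message), and the names LogEntry, parse_message_line and entries_for_job are hypothetical helpers, not Grid Engine tools.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    when: datetime
    component: str
    host: str
    level: str
    message: str


def parse_message_line(line: str) -> Optional[LogEntry]:
    """Split one messages-file line into fields; return None if it does not fit."""
    parts = line.strip().split("|", 4)
    if len(parts) != 5:
        return None
    try:
        when = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return LogEntry(when, *parts[1:])


def entries_for_job(lines, job_id):
    """Return the parsed entries that mention job_id, oldest first."""
    pattern = re.compile(r"\b" + re.escape(str(job_id)) + r"\b")
    parsed = (parse_message_line(line) for line in lines)
    matching = [e for e in parsed if e and pattern.search(e.message)]
    return sorted(matching, key=lambda e: e.when)


# Lines taken from the thread, rejoined onto single lines.
sample = [
    "12/13/2007 23:02:15|qmaster|bhmnode2|E|tightly integrated parallel task "
    "4353897.1 task 2623.blade7 failed - killing job",
    "12/13/2007 23:00:36|execd|blade7|E|slave shepherd of job 4353897.1 "
    "exited with exit status = 11",
]

for entry in entries_for_job(sample, 4353897):
    print(entry.when, entry.host, entry.message)
```

Sorting by timestamp puts the blade7 shepherd exit ahead of the qmaster "killing job" entry, matching the order of events described in the message.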




