[GE users] sge_shepherd : free(): invalid pointer crash for more than 1032 slots

henk h.a.slim at durham.ac.uk
Thu May 27 14:45:17 BST 2010


Reuti

The discussion at

http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=255555

answers the 1032 slots limit.

> did you run into an openMPI/sun HPC Cluster Tools related limit:
>
> % ompi_info -all | grep plm_rsh_num_concurrent
>  MCA plm: parameter "plm_rsh_num_concurrent" (current value: "128", data source: default value)

Apparently 128+1 hosts also works but a higher number fails?

However, the shepherd sometimes also crashes with fewer than 129 hosts,
so I think this is a different issue.
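
As a rough sanity check on the numbers in this thread, the slot counts
can be converted to host counts. The sketch below is illustrative only:
it assumes 8 slots per host (as in the original report), takes 128 from
the ompi_info output quoted above, and assumes the master host is not
counted among the remotely launched hosts. The helper name hosts_for is
made up for this example.

```python
CORES_PER_HOST = 8        # from the original report: each server has 8 cores
DEFAULT_CONCURRENT = 128  # plm_rsh_num_concurrent default shown by ompi_info


def hosts_for(slots, cores_per_host=CORES_PER_HOST):
    """Return the number of fully used hosts needed for a slot count."""
    if slots % cores_per_host:
        raise ValueError("slot count is not a multiple of cores per host")
    return slots // cores_per_host


for slots in (1032, 1040):
    hosts = hosts_for(slots)
    remote = hosts - 1  # assumption: the master host is not launched remotely
    # Columns: slots, hosts, remote hosts, remote hosts above the default?
    print(slots, hosts, remote, remote > DEFAULT_CONCURRENT)
```

Under these assumptions 1032 slots corresponds to 128 remote hosts,
exactly the default value, and 1040 slots to 129. That is consistent
with the observed threshold, but it does not account for the crashes
seen with fewer hosts.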

Thanks

Henk

> -----Original Message-----
> From: reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: 14 May 2010 10:30
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] sge_shepherd : free(): invalid pointer crash
> for more than 1032 slots
> 
> Hi,
> 
> On 13.05.2010 at 16:15, henk wrote:
> 
> > On our system the gridengine 6.2u5 shepherd crashes for a simple
> > parallel job with 1040 slots. It is fine for 1032 slots (each server
> > has 8 cores and I increment the job size by adding a server). I attach
> > the error file with a memory map. MPI is OpenMPI 1.4.1 and OS is SLES
> > 11.0. As a test I kept the 1032 slots on fixed servers and varied the
> > server that supplied the additional 8 slots, all giving this problem.
> > Is there some magical number beyond 1032 that causes a problem for
> > the shepherd exe?
> 
> I don't know for sure, but 1032 sounds like 8 on the master host of the
> parallel job, plus 1024 slaves - and 1024 is a commonly used value for a
> limit somewhere. But it should generate an error then and not crash.
> 
> -- Reuti
> 
> 
> > Thanks
> >
> > Henk
> >
> > ------------------------------------------------------
> >
> > http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=257181
> >
> > To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].<shepherd_crash_1040slots.e1284.txt>
> 

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=259015

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


