[GE users] schedd hangs with infinite loop :-((

Andy Schwierskott andy.schwierskott at sun.com
Thu Apr 1 13:01:33 BST 2004


Christian,

We need to find out why you got this problem (I understand you didn't have
it before). Three questions:

   - Did you recently upgrade your glibc version?
   - Or did you move the master machine to this new machine?
   - Or did you begin to use functional tickets?

Please send your scheduler config (qconf -ssconf) as well.

I'm asking because we've seen a similar problem on AMD64 with the same glibc
version in the functional ticket calculation. We've changed the code; however,
we think that the code was correct and that there could be a bug in this glibc
version.

Please send us your feedback as soon as possible, since we need to find out
if this problem requires a fix in SGE 5.3p6, which we are currently
preparing.

Andy

> Hello!
>
> I've got a serious problem with our SGEEE 5.3p5 installation since
> yesterday: for no apparent reason the scheduler just goes into an
> infinite loop and does no more scheduling.
>
> I would really appreciate quick help on this, since our cluster isn't
> working anymore.
>
> The scheduler runs on Linux x86, kernel 2.4.22 with glibc 2.3.2 on
> Debian testing. We have only a moderate number of jobs waiting in the
> queue. To narrow down the problem, I've set loglevel to 2 with dl.sh.
>
> This is the backtrace:
>
> # gdb /usr/local/sge53/bin/glinux/sge_schedd core.16309
> GNU gdb 5.3-debian
> This GDB was configured as "i386-linux"...
> Core was generated by `sge_schedd'.
> Program terminated with signal 3, Quit.
> Reading symbols from /lib/libm.so.6...done.
> Loaded symbols for /lib/libm.so.6
> Reading symbols from /lib/libc.so.6...done.
> Loaded symbols for /lib/libc.so.6
> Reading symbols from /lib/ld-linux.so.2...done.
> Loaded symbols for /lib/ld-linux.so.2
> Reading symbols from /lib/libnss_files.so.2...done.
> Loaded symbols for /lib/libnss_files.so.2
> #0  0x400b73f6 in mallopt () from /lib/libc.so.6
> (gdb) bt
> #0  0x400b73f6 in mallopt () from /lib/libc.so.6
> #1  0x400b727e in mallopt () from /lib/libc.so.6
> #2  0x400b5faf in free () from /lib/libc.so.6
> #3  0x080a3520 in lFreeElem ()
> #4  0x080a3a11 in lRemoveElem ()
> #5  0x080a356d in lFreeList ()
> #6  0x0807bab9 in free_fcategories ()
> #7  0x0807f135 in sge_calc_tickets ()
> #8  0x08080f59 in sge_scheduler ()
> #9  0x0804e7a3 in dispatch_jobs ()
> #10 0x0804df06 in scheduler ()
> #11 0x0805168a in event_handler_default_scheduler ()
> #12 0x0804a86a in main ()
> #13 0x4005adc6 in __libc_start_main () from /lib/libc.so.6
> (gdb)
>
> I've attached the schedd debug log.
>
> Regards
>   Christian
