[GE users] one node keeps going into error state

David Mathog mathog at mendel.bio.caltech.edu
Tue Nov 23 23:26:05 GMT 2004





> >I put some debug lines into shepherd.c and rebuilt sge_shepherd
> >on Solaris.  Traced the problem to here:
> >
> >   if (getcwd(shepherd_job_dir, 2047) == NULL) {
> >
> >which was coming back NULL. This is really odd.
> 
> Can you add perror(3) right after the call to getcwd?
> 
> Something like:
> 
> if (getcwd(shepherd_job_dir, 2047) == NULL) {
>  perror("getcwd : ");
> }

perror went off into the ozone.  I used this instead:

  (void) fprintf(flog,"DEBUG 2.5 strerror %s\n",strerror(errno));

and it showed up in the log file this way:

DEBUG 2.5 strerror Permission denied
DEBUG 2.52 PWD >/usr/SGE/gridengine_v53p6/source<
DEBUG 2.53 LOGNAME >root<
DEBUG 2.54 PATH >/usr/SGE/bin/solaris64:/usr/ccs/bin:/opt/SUNWspro/bin:/usr/SGE/bin/solaris64:/usr/sbin:/usr/bin:/usr/local/bin:/opt/SUNWppro/bin:/usr/sadm/bin<
DEBUG 2.55 SHELL >/sbin/sh<
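
For anyone repeating this test, here is a minimal sketch of the same
diagnostic as a standalone helper.  The name log_getcwd() and the message
text are made up for illustration and are not part of shepherd.c.  The one
point it adds is that errno is copied right after getcwd() fails, before
any other library call has a chance to change it.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not in shepherd.c): call getcwd() and log the result.
 * Returns 0 on success, or the errno value from getcwd() on failure. */
int log_getcwd(FILE *flog, char *buf, size_t size)
{
    assert(flog != NULL && buf != NULL);

    if (getcwd(buf, size) == NULL) {
        int saved = errno;  /* copy before fprintf() can overwrite errno */
        (void) fprintf(flog, "DEBUG getcwd failed: %s (errno %d)\n",
                       strerror(saved), saved);
        (void) fflush(flog);  /* make sure the line reaches the log file */
        return saved;
    }

    (void) fprintf(flog, "DEBUG getcwd ok >%s<\n", buf);
    (void) fflush(flog);
    return 0;
}
```

Called as log_getcwd(flog, shepherd_job_dir, 2047), a nonzero return can be
compared against EACCES, ERANGE and so on instead of parsing the message.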

Turns out PWD is the directory where root was when it started
SGE on Solaris, i.e.,

% cd /tmp
% /etc/init.d/rcsge start

then shows PWD as /tmp

Why would running getcwd while in /tmp as root possibly fail?
Maybe it's really writing over something in memory and it just happens
to blow up here?

Note, I just tried the prepackaged sge_shepherd binary from
the Sun site and it also locks up (of course, without logging
anything useful.)

Tried using getenv("PWD") to replace getcwd and it went on a bit
further and then stopped here:

      shepherd_error("can't write to \"trace\" file");
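
The getenv("PWD") substitution mentioned above can be sketched roughly as
follows.  This is only an illustration under my own assumptions: the helper
name cwd_or_pwd() is invented, and it is not the change actually made to
shepherd.c.  Note that PWD is just an inherited environment variable; as the
log above shows, it held the directory where SGE was started, so it is a
debugging stopgap rather than a reliable replacement for getcwd().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: fill buf with the working directory.
 * Returns  1 if getcwd() succeeded,
 *          0 if getcwd() failed and $PWD was copied instead,
 *         -1 if neither source produced a usable path. */
int cwd_or_pwd(char *buf, size_t size)
{
    const char *pwd;

    assert(buf != NULL);

    if (getcwd(buf, size) != NULL)
        return 1;

    pwd = getenv("PWD");            /* may be stale or unset */
    if (pwd != NULL && strlen(pwd) < size) {
        strcpy(buf, pwd);           /* fits, including the terminating NUL */
        return 0;
    }
    return -1;
}
```

A return value of 0 tells the caller that the path came from the
environment and should be treated with suspicion.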

So I commented THAT out and it went quite a bit further on.  This
time it ran showinfo.sh, created the .e and .o output files (with
the correct output in the latter, the former empty).  Then it hung
up somewhere after the last line of main(), which was:

   return return_code;

/usr/SGE/default/spool/mendel/messages ended with:

Tue Nov 23 15:19:15 2004|execd|mendel|I|checking for old jobs
Tue Nov 23 15:19:15 2004|execd|mendel|I|no old jobs at startup
Tue Nov 23 15:19:34 2004|execd|mendel|E|"abnormal termination of shepherd for job 4384.1: no "exit_status" file"
Tue Nov 23 15:19:34 2004|execd|mendel|E|cant open file active_jobs/4384.1/error: No such file or directory
Tue Nov 23 15:19:35 2004|execd|mendel|I|sending admin mail mail to user "mathog at mendel.bio.caltech.edu"|mailer "/bin/mailx"|"SGE 5.3p6: Job 4384 failed"

Where are the shepherd_trace() calls supposed to be writing their
output???

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
