[GE users] one node keeps going into error state

David Mathog mathog at mendel.bio.caltech.edu
Tue Nov 23 22:11:23 GMT 2004


I put some debug lines into shepherd.c and rebuilt sge_shepherd
on Solaris.  Traced the problem to here:

   if (getcwd(shepherd_job_dir, 2047) == NULL) {

which was coming back NULL. This is really odd.

I added these lines to dump some of the environment:

  envstring=getenv("PWD");
  if(envstring)fprintf(flog,"DEBUG 2.52 PWD >%s<\n",envstring);
  envstring=getenv("LOGNAME");
  if(envstring)fprintf(flog,"DEBUG 2.53 LOGNAME >%s<\n",envstring);
  envstring=getenv("PATH");
  if(envstring)fprintf(flog,"DEBUG 2.54 PATH >%s<\n",envstring);
  envstring=getenv("SHELL");
  if(envstring)fprintf(flog,"DEBUG 2.55 SHELL >%s<\n",envstring);
      sprintf(err_str, "can't read cwd - getcwd failed: %s", strerror(errno));
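
Since library calls such as getenv() and fprintf() are allowed to change errno, the value formatted by the sprintf above may no longer be the one getcwd() set. A minimal sketch of capturing it right at the failure follows. The helper name report_cwd() and its buffers are hypothetical and not part of shepherd.c:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper, not from shepherd.c: call getcwd() and describe the
 * result in msg. Returns 0 on success, otherwise the errno set by getcwd(). */
static int report_cwd(char *dir, size_t dirlen, char *msg, size_t msglen)
{
    int saved;

    errno = 0;
    if (getcwd(dir, dirlen) != NULL) {
        snprintf(msg, msglen, "cwd >%s<", dir);
        return 0;
    }
    saved = errno;  /* save before any other call can overwrite it */
    snprintf(msg, msglen, "getcwd failed: errno=%d (%s)",
             saved, strerror(saved));
    return saved;
}
```

Writing msg to the same log as the DEBUG lines would show whether the failure is, for example, EACCES (a parent directory is not readable or searchable) or ENOENT (the directory no longer exists).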

And it emits to the log file:

DEBUG 2.52 PWD >/usr/SGE/gridengine_v53p6/source<
DEBUG 2.53 LOGNAME >root<
DEBUG 2.54 PATH >/usr/ccs/bin:/opt/SUNWspro/bin:/usr/SGE/bin/solaris64:/usr/sbin:/usr/bin:/usr/local/bin:/opt/SUNWppro/bin:/usr/sadm/bin<
DEBUG 2.55 SHELL >/sbin/sh<

Now the _really_ odd thing about this is that the qsub is being
run from a different account and directory.  The directory in 
PWD is where 

% ./aimk -only-core

was run.  Somehow or other that "sticks" in the resulting binary
and when sge_shepherd starts, that's where it starts.  But getcwd
comes back with NULL.
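
One possible reading, not confirmed in this thread: PWD is an ordinary environment variable that a process inherits from whatever started it (presumably the shell in which rcsge start was run), while getcwd() asks the system for the process's actual working directory. chdir() does not update PWD, so the two can disagree. A small sketch for checking that, using a hypothetical helper pwd_matches_cwd() that is not part of the SGE source:

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical check, not from the SGE source: returns 1 if $PWD equals the
 * directory reported by getcwd(), 0 if they differ, and -1 if either one
 * is unavailable. */
static int pwd_matches_cwd(void)
{
    char cwd[2048];
    const char *pwd = getenv("PWD");

    if (pwd == NULL || getcwd(cwd, sizeof cwd) == NULL)
        return -1;
    return strcmp(pwd, cwd) == 0;
}
```

A result of 0 or -1 inside sge_shepherd would suggest that the PWD value in the log is just inherited environment and says nothing about where the process is actually running.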

Here's what the rebuild did:

% ./aimk -only-core
making in SOLARIS64/ for SOLARIS64
________C_O_R_E__S_Y_S_T_E_M_________
cc -I../daemons/shepherd -Xc -DNLIST_STRUCT -xO4 -v  -DENABLE_438_FIX
-DSOLARIS -DSOLARIS64 -xarch=v9 -DSOLARIS7 -D__EXTENSIONS__
-D_POSIX_C_SOURCE=199506L  -DCOMPILE_DC -D__SGE_COMPILE_WITH_GETTEXT__ 
-D__SGE_NO_USERMAPPING__ -I../security/sec -I../common -I../libs/uti
-I../libs/gdi -I../libs/cull -I../libs/rmon -I../libs/comm
-I../libs/sched -I../daemons/common -I../daemons/commd
-I../daemons/qmaster -I../daemons/execd -I../daemons/schedd
-I../clients/common -I.  -c ../daemons/shepherd/shepherd.c
cc -xildoff -xarch=v9 -L. -o sge_shepherd shepherd.o  builtin_starter.o
 am_chdir.o  setrlimits.o  signal_queue.o  setjoblimit.o config_file.o 
err_trace.o  execution_states.o  job.o  qlogin_starter.o  setenv.o 
setosjobid.o  sge_parse_num_par.o  pdc.o  procfs.o  sge_processes_irix.o
-lgdi   -lcull -lcom -luti -lrmon  -lsocket -lnsl -lm
%# these were done manually
% /etc/init.d/rcsge stop
% cp ./SOLARIS64/sge_shepherd /usr/SGE/bin/solaris64
% /etc/init.d/rcsge start

Then from another account

% qsub -q testm showinfo.sh

What's next???
 
David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
