[GE users] Tight integration with AFS tokens or Kerberos ticket passing

jducom at nd.edu jducom at nd.edu
Wed May 24 18:26:11 BST 2006



We are running SGE 5.3p2 in a heterogeneous environment (dual-core Opteron 175 and
Xeon 3.Ghz nodes) under Red Hat Enterprise Linux 4.
The filesystem is AFS.

Using set_token_cmd (which calls the SetToken program) and other AFS programs on
top of SGE, we can run serial jobs as well as mpich1.2.7 parallel jobs without
tight integration, because our versions of ssh and rsh forward AFS tokens
from the master.

The problem appears when we try to use tight integration, because qrsh/qsh then
controls the communication instead of the AFS-token-passing version of rsh.
I have attached the output from a 2-node job (dcopt144 is the master, dcopt143 is
the slave).

Has anybody had any success with AFS token OR Kerberos ticket passing with
tight integration (mpich1.2.7)? Should I recompile SGE from source with
Kerberos support?

Any help/insights would be greatly appreciated.

Jean-Christophe



----------
Master dcopt144:
Job 201920 caused action: Job 201920 set to ERROR
 User        = jducom
 Queue       = dcopt144.q
 Host        = dcopt144
 Start Time  = <unknown>
 End Time    = <unknown>
failed assumedly before job:AFS token does not exist or has zero length
Shepherd trace:
05/22/2006 15:06:39 [0:31045]: shepherd called with uid = 0, euid = 0
05/22/2006 15:06:39 [0:31045]: sigaction for signal 32 failed: Invalid argument
05/22/2006 15:06:39 [0:31045]: sigaction for signal 33 failed: Invalid argument
05/22/2006 15:06:39 [93541:31045]: starting up 5.3p7
05/22/2006 15:06:39 [93541:31045]: setpgid(31045, 31045) returned 0
05/22/2006 15:06:39 [93541:31045]: beginning AFS setup
05/22/2006 15:06:39 [93541:31045]: /afs/user37/sgeadmin/util/set_token_cmd
jducom 2592000
05/22/2006 15:06:39 [93541:31045]: sucessfully set AFS token
05/22/2006 15:06:39 [93541:31045]: AFS setup done
05/22/2006 15:06:39 [93541:31045]: no prolog script to start
05/22/2006 15:06:39 [93541:31045]: /afs/user37/sgeadmin/mpi/startmpi.sh
-catch_rsh $pe_hostfile
05/22/2006 15:06:39 [93541:31045]: /afs/user37/sgeadmin/mpi/startmpi.sh
-catch_rsh
/afs/user37/sgeadmin/hpcc/spool/dcopt144/active_jobs/201920.1/pe_hostfile
05/22/2006 15:06:39 [93541:31055]: pid=31055 pgrp=31055 sid=31055 old pgrp=31045
getlogin()=<no login set>
05/22/2006 15:06:39 [93541:31045]: forked "pe_start" with pid 31055
05/22/2006 15:06:39 [93541:31045]: using signal delivery delay of 120 seconds
05/22/2006 15:06:39 [93541:31045]: child: pe_start - pid: 31055
05/22/2006 15:06:39 [82784:31055]: closing all filedescriptors
05/22/2006 15:06:39 [82784:31055]: further messages are in "error" and "trace"
05/22/2006 15:06:39 [82784:31055]: using "/bin/tcsh" as shell of user "jducom"
05/22/2006 15:06:39 [82784:31055]: using stdout as stderr
05/22/2006 15:06:39 [82784:31055]: execvp(/afs/user37/sgeadmin/mpi/startmpi.sh,
/afs/user37/sgeadmin/mpi/startmpi.sh -catch_rsh
/afs/user37/sgeadmin/hpcc/spool/dcopt144/active_jobs/201920.1/pe_hostfile)
05/22/2006 15:06:39 [93541:31045]: wait3 returned 31055 (status: 0; WIFSIGNALED:
0,  WIFEXITED: 1, WEXITSTATUS: 0)
05/22/2006 15:06:39 [93541:31045]: pe_start exited with exit status 0
05/22/2006 15:06:39 [93541:31045]: reaped "pe_start" with pid 31055
05/22/2006 15:06:39 [93541:31045]: pe_start exited not due to signal
05/22/2006 15:06:39 [93541:31045]: pe_start exited with status 0
05/22/2006 15:06:39 [93541:31086]: pid=31086 pgrp=31086 sid=31086 old pgrp=31045
getlogin()=<no login set>
05/22/2006 15:06:39 [93541:31086]: setosjobid: uid = 0, euid = 93541
05/22/2006 15:06:39 [93541:31045]: forked "job" with pid 31086
05/22/2006 15:06:39 [93541:31086]: RLIMIT_CPU setting: (soft
18446744073709551615 hard 18446744073709551615) resulting: (soft
18446744073709551615 hard 18446744073709551615)
05/22/2006 15:06:39 [93541:31086]: RLIMIT_FSIZE setting: (soft
18446744073709551615 hard 18446744073709551615) resulting: (soft
18446744073709551615 hard 18446744073709551615)
05/22/2006 15:06:39 [93541:31086]: RLIMIT_DATA setting: (soft
18446744073709551615 hard 18446744073709551615) resulting: (soft
18446744073709551615 hard 18446744073709551615)
05/22/2006 15:06:39 [93541:31086]: RLIMIT_STACK setting: (soft
18446744073709551615 hard 18446744073709551615) resulting: (soft
18446744073709551615 hard 18446744073709551615)
05/22/2006 15:06:39 [93541:31086]: RLIMIT_CORE setting: (soft
18446744073709551615 hard 18446744073709551615) resulting: (soft
18446744073709551615 hard 18446744073709551615)
05/22/2006 15:06:39 [93541:31086]: RLIMIT_VMEM/RLIMIT_AS setting: (soft
18446744073709551615 hard 18446744073709551615) resulting: (soft
18446744073709551615 hard 18446744073709551615)
05/22/2006 15:06:39 [93541:31086]: RLIMIT_RSS setting: (soft
18446744073709551615 hard 18446744073709551615) resulting: (soft
18446744073709551615 hard 18446744073709551615)
05/22/2006 15:06:39 [93541:31045]: child: job - pid: 31086
05/22/2006 15:06:39 [82784:31086]: closing all filedescriptors
05/22/2006 15:06:39 [82784:31086]: further messages are in "error" and "trace"
05/22/2006 15:06:39 [82784:31086]: using stdout as stderr
05/22/2006 15:06:39 [82784:31086]:
execvp(/afs/user37/sgeadmin/hpcc/spool/dcopt144/job_scripts/201920,
/afs/user37/sgeadmin/hpcc/spool/dcopt144/job_scripts/201920)



Slave:
Job 201920 caused action: Job 201920 set to ERROR
 User        = jducom
 Queue       = dcopt143.q
 Host        = dcopt143
 Start Time  = <unknown>
 End Time    = <unknown>
failed assumedly before job:AFS token does not exist or has zero length
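
For anyone comparing traces like the one above between master and slave, here
is a minimal Python sketch that rejoins the wrapped lines and pulls out the
AFS-related steps. The line layout (date, time, [number:pid]: message) is only
inferred from the trace in this mail, and the names parse_trace, TraceEntry and
afs_steps are hypothetical helpers, not part of SGE. The first bracketed number
is kept as an opaque value since I am not certain what it represents.

```python
import re
from typing import List, NamedTuple

# Assumed layout, inferred only from the trace in this mail:
#   MM/DD/YYYY HH:MM:SS [number:pid]: message
TRACE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) \[(\d+):(\d+)\]: ?(.*)$"
)


class TraceEntry(NamedTuple):
    timestamp: str
    ident: int    # first bracketed number, left uninterpreted
    pid: int      # second bracketed number (process id of the writer)
    message: str


def parse_trace(text: str) -> List[TraceEntry]:
    """Parse trace text, appending wrapped lines to the preceding entry."""
    entries: List[TraceEntry] = []
    for line in text.splitlines():
        match = TRACE_RE.match(line)
        if match:
            stamp, ident, pid, message = match.groups()
            entries.append(TraceEntry(stamp, int(ident), int(pid), message))
        elif entries and line.strip():
            # No timestamp: treat it as the continuation of a wrapped line.
            last = entries[-1]
            joined = (last.message + " " + line.strip()).strip()
            entries[-1] = last._replace(message=joined)
    return entries


def afs_steps(entries: List[TraceEntry]) -> List[str]:
    """Return the messages that mention AFS (case-sensitive match)."""
    return [e.message for e in entries if "AFS" in e.message]


if __name__ == "__main__":
    sample = (
        "05/22/2006 15:06:39 [93541:31045]: beginning AFS setup\n"
        "05/22/2006 15:06:39 [93541:31045]: /afs/user37/sgeadmin/util/set_token_cmd\n"
        "jducom 2592000\n"
        "05/22/2006 15:06:39 [93541:31045]: sucessfully set AFS token\n"
    )
    for step in afs_steps(parse_trace(sample)):
        print(step)
```

Running the same helper over the trace from each host should make it easier to
see at which step the token setup differs between dcopt144 and dcopt143.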


