[GE users] qrsh config?

Alex Chekholko chekh at pcbi.upenn.edu
Fri Dec 19 19:15:40 GMT 2008


On Fri, 19 Dec 2008 20:03:39 +0100
reuti <reuti at staff.uni-marburg.de> wrote:

> On 19.12.2008 at 19:27, Alex Chekholko wrote:
> 
> > <snip>
> >>> qlogin_daemon                /usr/sbin/sshd -i
> >>> rlogin_daemon                /usr/sbin/sshd -i
> >>> rsh_daemon                   /usr/sbin/sshd -i
> >>> rsh_command                  /usr/bin/ssh -o StrictHostChecking=no
> >>> rlogin_command               /usr/bin/ssh -o StrictHostChecking=no
> >>
> >> Maybe a typo: StrictHostKeyChecking
> >
> > Good catch! I changed that. It may or may not have helped. qrsh
> > now works as root (maybe it did before), but it still does not
> > work as a regular user:
> 
> Was sge_execd started by root?
> 
> $ ps -e f -o user,ruser,command
> 

Yes.
root     root     /gpfs/fs0/share/ge-6.1u3/bin/lx24-amd64/sge_execd

> > [chekh at beta.genomics.upenn.edu] ~ [0]
> > $ qrsh uname -a
> > error: error reading returncode of remote command
> > [chekh at beta.genomics.upenn.edu] ~ [1]
> > $ qconf -sconf |grep ssh
> > qlogin_daemon                /usr/sbin/sshd -i
> 
> For qlogin with ssh this might help:
> 
> http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html

I use the qlogin wrapper as on that page, and qlogin works fine for everyone.

> 
> > rlogin_daemon                /usr/sbin/sshd -i
> > rsh_daemon                   /usr/sbin/sshd -i
> > rsh_command                  /usr/bin/ssh -o StrictHostKeyChecking=no
> > rlogin_command               /usr/bin/ssh -o StrictHostKeyChecking=no
> > [chekh at beta.genomics.upenn.edu] ~ [0]
> > $ qrsh hostname
> > error: error reading returncode of remote command
> >
> > I see the command show up in qstat as running, and when I look on  
> > the node I see:
> >
> > root      5549  0.0  0.0  88392  4456 ?        S    Dec04  17:02 /gpfs/fs0/share/ge-6.1u3/bin/lx24-amd64/sge_execd
> > root     19322  0.0  0.0  32828  3356 ?        S    13:20   0:00  \_ sge_shepherd-1178088 -bg
> > root     19323  0.0  0.0  33340  2908 ?        Ss   13:20   0:00      \_ sge_shepherd-1178088 -bg
> >
> > and then in the log after they disappear:
> > Dec 19 13:21:51 node-r1-u1-c34-p10-o2 kernel: sge_shepherd[19322]: segfault at 0000000000000001 rip 00002ae3c44087a7 rsp 00007fffe6fba440 error 4
> > Dec 19 13:21:51 node-r1-u1-c34-p10-o2 kernel: sge_shepherd[19323]: segfault at 0000000000000001 rip 00002ae3c44087a7 rsp 00007fffe6fbcce0 error 4
> 
> Is there any output from these daemons in /tmp on the node?
> 

Nope.  There is some output in the spool messages:
12/19/2008 13:43:39|execd|node-r1-u21-c14-p10-o23|E|shepherd of job 1178094.1 died through signal = 11
12/19/2008 13:43:39|execd|node-r1-u21-c14-p10-o23|E|abnormal termination of shepherd for job 1178094.1: "exit_status" file is empty
12/19/2008 13:43:39|execd|node-r1-u21-c14-p10-o23|E|can't open usage file "active_jobs/1178094.1/usage" for job 1178094.1: No such file or directory
12/19/2008 13:43:39|execd|node-r1-u21-c14-p10-o23|E|shepherd exited with exit status 19
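
To pull just the signal deaths out of a busy messages file, something like the following could work. It is only a minimal sketch: the function name shepherd_signal_deaths is made up for illustration, and it assumes every line has the pipe-separated layout shown above (timestamp|daemon|host|level|text).

```shell
# Hypothetical helper: list jobs whose shepherd was killed by a signal.
#   $1 = path to an execd "messages" file
# Prints one line per match: host, job.task id, signal number.
shepherd_signal_deaths() {
    awk -F'|' '
        $2 == "execd" && $5 ~ /^shepherd of job .* died through signal = / {
            n = split($5, w, " ")
            # w[4] is the job.task id, w[n] is the signal number
            print $3, w[4], w[n]
        }
    ' "$1"
}
```

Pass it the path to the node's messages file; where that file lives depends on the local spool configuration.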

I tried attaching strace to the sge_shepherd before it dies, but I don't see any useful information:

root      3180  0.1  0.0  88420  4476 ?        S    Nov14  70:03 /gpfs/fs0/share/ge-6.1u3/bin/lx24-amd64/sge_execd
root     22776  0.0  0.0  32832  3336 ?        S    14:13   0:00  \_ sge_shepherd-1178096 -bg
root     22777  0.0  0.0  33344  2916 ?        Ss   14:13   0:00      \_ sge_shepherd-1178096 -bg
[root at node-r1-u29-c8-p10-o21.local] ~ [0] 
# strace -p 22776
Process 22776 attached - interrupt to quit
wait4(4294967295, 0x7fff8aa3176c, 0, 0x7fff8aa31810) = ? ERESTARTSYS (To be restarted)
--- SIGTSTP (Stopped) @ 0 (0) ---
rt_sigreturn(0x14)                      = -1 EINTR (Interrupted system call)
alarm(0)                                = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=3208, ...}) = 0
geteuid()                               = 0
getuid()                                = 0
geteuid()                               = 0
setresuid(-1, 0, -1)                    = 0
write(3, "12/19/2008 14:14:55 [0:22776]: w"..., 49) = 49
fstat(3, {st_mode=S_IFREG|0644, st_size=3257, ...}) = 0
geteuid()                               = 0
getuid()                                = 0
geteuid()                               = 0
setresuid(-1, 0, -1)                    = 0
write(3, "12/19/2008 14:14:55 [0:22776]: m"..., 65) = 65
fstat(3, {st_mode=S_IFREG|0644, st_size=3322, ...}) = 0
geteuid()                               = 0
getuid()                                = 0
geteuid()                               = 0
setresuid(-1, 0, -1)                    = 0
write(3, "12/19/2008 14:14:55 [0:22776]: q"..., 50) = 50
open("/tmp/1178096.1.all.q/pid", O_RDONLY) = -1 ENOENT (No such file or directory)
fstat(3, {st_mode=S_IFREG|0644, st_size=3372, ...}) = 0
geteuid()                               = 0
getuid()                                = 0
geteuid()                               = 0
setresuid(-1, 0, -1)                    = 0
write(3, "12/19/2008 14:14:55 [0:22776]: c"..., 99) = 99
fstat(3, {st_mode=S_IFREG|0644, st_size=3471, ...}) = 0
geteuid()                               = 0
getuid()                                = 0
geteuid()                               = 0
setresuid(-1, 0, -1)                    = 0
write(3, "12/19/2008 14:14:55 [0:22776]: k"..., 50) = 50
getuid()                                = 0
getgid()                                = 0
getuid()                                = 0
getegid()                               = 0
geteuid()                               = 0
open("/tmp/1178096.1.all.q/pid", O_RDONLY) = -1 ENOENT (No such file or directory)
fstat(3, {st_mode=S_IFREG|0644, st_size=3521, ...}) = 0
geteuid()                               = 0
getuid()                                = 0
geteuid()                               = 0
setresuid(-1, 0, -1)                    = 0
write(3, "12/19/2008 14:14:55 [0:22776]: c"..., 99) = 99
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
Process 22776 detached
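
The trace above follows only the parent shepherd and cuts each write() string off at strace's default of 32 characters, so a fuller capture might show more. A minimal sketch, assuming strace is installed on the node; trace_shepherd is a hypothetical helper name. Here -f also follows forked children, -s 256 prints longer strings, -tt adds timestamps, and -o writes the trace to a file. It is not certain that this reveals the cause of the segfault.

```shell
# Hypothetical helper: attach strace to a running sge_shepherd and its children.
#   $1 = PID of the shepherd, $2 = file to write the trace to
trace_shepherd() {
    strace -f -tt -s 256 -o "$2" -p "$1"
}

# Example (run as root on the execution node, with the PID taken from ps):
# trace_shepherd 22776 /tmp/shepherd-22776.strace
```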

Regards,
Alex

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=93429
