[GE users] qrsh config?

Alex Chekholko chekh at pcbi.upenn.edu
Mon Dec 22 18:54:00 GMT 2008


Hi all,

Just to follow up, I upgraded from 6.1u3 to 6.2u1 following Lubomir's excellent guide:
https://slx.sun.com/1179271114

Using the new "builtin" interactive job support, I have qlogin/qrsh working without any issue.
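
For anyone comparing setups, here is a minimal Python sketch that scans configuration text in the "name value" layout printed by `qconf -sconf` and reports which interactive-job entries are not set to "builtin". The helper name non_builtin_entries and the sample text are illustrative assumptions. The sample mixes the ssh-based values from the quoted message below with "builtin" entries and is not copied from a real 6.2u1 configuration.

```python
# Sketch only: list interactive-job settings that do not use "builtin".
# The entry names are the ones documented in sge_conf(5); the helper
# name is hypothetical.
INTERACTIVE_KEYS = (
    "qlogin_command", "qlogin_daemon",
    "rlogin_command", "rlogin_daemon",
    "rsh_command", "rsh_daemon",
)

def non_builtin_entries(conf_text):
    """Return {name: value} for interactive entries whose value is not 'builtin'."""
    found = {}
    for line in conf_text.splitlines():
        parts = line.split(None, 1)  # name, then the rest of the line as value
        if len(parts) == 2 and parts[0] in INTERACTIVE_KEYS:
            value = parts[1].strip()
            if value != "builtin":
                found[parts[0]] = value
    return found

# Sample text (assumed): two ssh-based entries and two builtin entries.
sample = """\
rlogin_daemon                /usr/sbin/sshd -i
rsh_daemon                   /usr/sbin/sshd -i
rsh_command                  builtin
rlogin_command               builtin
"""

for name, value in sorted(non_builtin_entries(sample).items()):
    print(name, "->", value)
```

On a live cluster the text could come from saving the output of `qconf -sconf` to a file and reading it in; that step is left out here so the sketch stays self-contained.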

Regards,
Alex

On Fri, 19 Dec 2008 14:15:40 -0500
Alex Chekholko <chekh at pcbi.upenn.edu> wrote:
[snip]
> 
> > 
> > > rlogin_daemon                /usr/sbin/sshd -i
> > > rsh_daemon                   /usr/sbin/sshd -i
> > > rsh_command                  /usr/bin/ssh -o StrictHostKeyChecking=no
> > > rlogin_command               /usr/bin/ssh -o StrictHostKeyChecking=no
> > > [chekh at beta.genomics.upenn.edu] ~ [0]
> > > $ qrsh hostname
> > > error: error reading returncode of remote command
> > >
> > > I see the command show up in qstat as running, and when I look on the node I see:
> > >
> > > root      5549  0.0  0.0  88392  4456 ?        S    Dec04  17:02 /gpfs/fs0/share/ge-6.1u3/bin/lx24-amd64/sge_execd
> > > root     19322  0.0  0.0  32828  3356 ?        S    13:20   0:00  \_ sge_shepherd-1178088 -bg
> > > root     19323  0.0  0.0  33340  2908 ?        Ss   13:20   0:00      \_ sge_shepherd-1178088 -bg
> > >
> > > and then in the log after they disappear:
> > > Dec 19 13:21:51 node-r1-u1-c34-p10-o2 kernel: sge_shepherd[19322]: segfault at 0000000000000001 rip 00002ae3c44087a7 rsp 00007fffe6fba440 error 4
> > > Dec 19 13:21:51 node-r1-u1-c34-p10-o2 kernel: sge_shepherd[19323]: segfault at 0000000000000001 rip 00002ae3c44087a7 rsp 00007fffe6fbcce0 error 4
> > 
> > Is there any output in /tmp on the node from these daemons?
> > 
> 
> Nope.  There is some output in the spool messages:
> 12/19/2008 13:43:39|execd|node-r1-u21-c14-p10-o23|E|shepherd of job 1178094.1 died through signal = 11
> 12/19/2008 13:43:39|execd|node-r1-u21-c14-p10-o23|E|abnormal termination of shepherd for job 1178094.1: "exit_status" file is empty
> 12/19/2008 13:43:39|execd|node-r1-u21-c14-p10-o23|E|can't open usage file "active_jobs/1178094.1/usage" for job 1178094.1: No such file or directory
> 12/19/2008 13:43:39|execd|node-r1-u21-c14-p10-o23|E|shepherd exited with exit status 19
> 
> I tried attaching strace to the sge_shepherd before it dies, but I don't see any useful information:
> 
> root      3180  0.1  0.0  88420  4476 ?        S    Nov14  70:03 /gpfs/fs0/share/ge-6.1u3/bin/lx24-amd64/sge_execd
> root     22776  0.0  0.0  32832  3336 ?        S    14:13   0:00  \_ sge_shepherd-1178096 -bg
> root     22777  0.0  0.0  33344  2916 ?        Ss   14:13   0:00      \_ sge_shepherd-1178096 -bg
> [root at node-r1-u29-c8-p10-o21.local] ~ [0] 
> # strace -p 22776
> Process 22776 attached - interrupt to quit
> wait4(4294967295, 0x7fff8aa3176c, 0, 0x7fff8aa31810) = ? ERESTARTSYS (To be restarted)
> --- SIGTSTP (Stopped) @ 0 (0) ---
> rt_sigreturn(0x14)                      = -1 EINTR (Interrupted system call)
> alarm(0)                                = 0
> fstat(3, {st_mode=S_IFREG|0644, st_size=3208, ...}) = 0
> geteuid()                               = 0
> getuid()                                = 0
> geteuid()                               = 0
> setresuid(-1, 0, -1)                    = 0
> write(3, "12/19/2008 14:14:55 [0:22776]: w"..., 49) = 49
> fstat(3, {st_mode=S_IFREG|0644, st_size=3257, ...}) = 0
> geteuid()                               = 0
> getuid()                                = 0
> geteuid()                               = 0
> setresuid(-1, 0, -1)                    = 0
> write(3, "12/19/2008 14:14:55 [0:22776]: m"..., 65) = 65
> fstat(3, {st_mode=S_IFREG|0644, st_size=3322, ...}) = 0
> geteuid()                               = 0
> getuid()                                = 0
> geteuid()                               = 0
> setresuid(-1, 0, -1)                    = 0
> write(3, "12/19/2008 14:14:55 [0:22776]: q"..., 50) = 50
> open("/tmp/1178096.1.all.q/pid", O_RDONLY) = -1 ENOENT (No such file or directory)
> fstat(3, {st_mode=S_IFREG|0644, st_size=3372, ...}) = 0
> geteuid()                               = 0
> getuid()                                = 0
> geteuid()                               = 0
> setresuid(-1, 0, -1)                    = 0
> write(3, "12/19/2008 14:14:55 [0:22776]: c"..., 99) = 99
> fstat(3, {st_mode=S_IFREG|0644, st_size=3471, ...}) = 0
> geteuid()                               = 0
> getuid()                                = 0
> geteuid()                               = 0
> setresuid(-1, 0, -1)                    = 0
> write(3, "12/19/2008 14:14:55 [0:22776]: k"..., 50) = 50
> getuid()                                = 0
> getgid()                                = 0
> getuid()                                = 0
> getegid()                               = 0
> geteuid()                               = 0
> open("/tmp/1178096.1.all.q/pid", O_RDONLY) = -1 ENOENT (No such file or directory)
> fstat(3, {st_mode=S_IFREG|0644, st_size=3521, ...}) = 0
> geteuid()                               = 0
> getuid()                                = 0
> geteuid()                               = 0
> setresuid(-1, 0, -1)                    = 0
> write(3, "12/19/2008 14:14:55 [0:22776]: c"..., 99) = 99
> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> Process 22776 detached
> 
> Regards,
> Alex
>
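
The execd spool messages quoted above are pipe-separated (timestamp, component, host, level, text), so shepherd crashes can be picked out mechanically. Below is a minimal Python sketch, assuming that five-field layout holds for every line of interest. The helper name shepherd_signals is hypothetical, and the sample lines are copied from the quoted output.

```python
import re
import signal

# Matches the message text seen in the quoted execd log.
DIED = re.compile(r"shepherd of job (\S+) died through signal = (\d+)")

def shepherd_signals(lines):
    """Yield (host, job_id, signal_number, signal_name) for shepherd deaths."""
    for line in lines:
        fields = line.rstrip("\n").split("|", 4)
        if len(fields) != 5:
            continue  # not in the assumed five-field layout
        _stamp, _component, host, _level, text = fields
        match = DIED.search(text)
        if not match:
            continue
        signo = int(match.group(2))
        try:
            name = signal.Signals(signo).name
        except ValueError:
            name = "unknown"  # number not known on this platform
        yield host, match.group(1), signo, name

sample = [
    "12/19/2008 13:43:39|execd|node-r1-u21-c14-p10-o23|E|shepherd of job 1178094.1 died through signal = 11",
    "12/19/2008 13:43:39|execd|node-r1-u21-c14-p10-o23|E|shepherd exited with exit status 19",
]

for record in shepherd_signals(sample):
    print(record)
```

On a node, the same function could be fed an open file handle for the execd messages file in the spool directory; the exact path depends on the installation.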
