[GE users] qrsh /bin/bash error mark all Queue to Error state

Angel Arancibia angel.arancibia at gmail.com
Fri Jun 27 18:29:03 BST 2008


I don't know if this is a bug or a misconfiguration, but it seems to be
a real problem.

When a user who has started a "qrsh /bin/bash" session closes the
terminal (instead of typing "exit"), the job starts to cycle through
the queues, and the queues end up in the Error state.

This is one of the error messages from the failed job.

Job 9086 caused action: Queue "sistint at era-q1.cluster.ifir.edu.ar" set to ERROR
 User        = ggrinblat
 Queue       = sistint at era-q1.cluster.ifir.edu.ar
 Host        = era-q1.cluster.ifir.edu.ar
 Start Time  = <unknown>
 End Time    = <unknown>
failed before job:06/26/2008 14:38:07 [0:11922]: can't get qrsh_exit_code
Shepherd trace:
06/26/2008 14:36:04 [0:11922]: shepherd called with uid = 0, euid = 0
06/26/2008 14:36:05 [0:11922]: starting up 6.1
06/26/2008 14:36:05 [0:11922]: setpgid(11922, 11922) returned 0
06/26/2008 14:36:05 [0:11922]: no prolog script to start
06/26/2008 14:36:05 [0:11923]: processing qlogin job
06/26/2008 14:36:05 [0:11923]: pid=11923 pgrp=11923 sid=11923 old pgrp=11922 getlogin()=<no login set>
06/26/2008 14:36:05 [0:11923]: reading passwd information for user 'root'
06/26/2008 14:36:05 [0:11923]: setosjobid: uid = 0, euid = 0
06/26/2008 14:36:05 [0:11923]: setting limits
06/26/2008 14:36:05 [0:11923]: RLIMIT_CPU setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
06/26/2008 14:36:05 [0:11923]: RLIMIT_FSIZE setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
06/26/2008 14:36:05 [0:11923]: RLIMIT_DATA setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
06/26/2008 14:36:05 [0:11923]: RLIMIT_STACK setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
06/26/2008 14:36:05 [0:11923]: RLIMIT_CORE setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
06/26/2008 14:36:05 [0:11923]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
06/26/2008 14:36:05 [0:11923]: RLIMIT_RSS setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
06/26/2008 14:36:05 [0:11923]: setting environment
06/26/2008 14:36:05 [0:11923]: Initializing error file
06/26/2008 14:36:05 [0:11922]: forked "job" with pid 11923
06/26/2008 14:36:05 [0:11922]: child: job - pid: 11923
06/26/2008 14:36:05 [0:11923]: switching to intermediate/target user
06/26/2008 14:36:10 [1018:11923]: closing all filedescriptors
06/26/2008 14:36:10 [1018:11923]: further messages are in "error" and "trace"
06/26/2008 14:36:10 [0:11923]: now running with uid=0, euid=0
06/26/2008 14:36:10 [0:11923]: start qlogin
06/26/2008 14:36:10 [0:11923]: calling qlogin_starter(/local/sys/sge/era-q1/active_jobs/9086.1, /usr/local/sbin/sshd -i );
06/26/2008 14:36:10 [0:11923]: uid = 0, euid = 0, gid = 0, egid = 0
06/26/2008 14:36:10 [0:11923]: using sfd 2
06/26/2008 14:36:10 [0:11923]: bound to port 38356
06/26/2008 14:36:10 [0:11923]: write_to_qrsh - data = 0:38356:/home/sys/sge/utilbin/lx24-amd64:/local/sys/sge/era-q1/active_jobs/9086.1:era-q1.cluster.ifir.edu.ar
06/26/2008 14:36:10 [0:11923]: write_to_qrsh - address = cluster.ifir.edu.ar:54593
06/26/2008 14:36:10 [0:11923]: write_to_qrsh - host = cluster.ifir.edu.ar, port = 54593
06/26/2008 14:36:15 [0:11923]: waiting for connection.
06/26/2008 14:36:15 [0:11923]: accepted connection on fd 5
06/26/2008 14:36:15 [0:11923]: daemon to start: |/usr/local/sbin/sshd -i |
06/26/2008 14:38:07 [0:11922]: wait3 returned 11923 (status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
06/26/2008 14:38:07 [0:11922]: job exited with exit status 0
06/26/2008 14:38:07 [0:11922]: reaped "job" with pid 11923
06/26/2008 14:38:07 [0:11922]: job exited not due to signal
06/26/2008 14:38:07 [0:11922]: job exited with status 0
06/26/2008 14:38:07 [0:11922]: found pid of qrsh client command: -11928
06/26/2008 14:38:07 [0:11922]: now sending signal KILL to pid -11928
06/26/2008 14:38:07 [0:11922]: get_exit_code_of_qrsh_starter - TMPDIR = /local/9086.1.sistint, pe_task_id = 0
06/26/2008 14:38:07 [0:11922]: can't open file /local/9086.1.sistint/qrsh_exit_code: No such file or directory
06/26/2008 14:38:07 [0:11922]: can't get qrsh_exit_code

06/26/2008 14:38:07 [0:11922]: write_to_qrsh - data = 1:can't get qrsh_exit_code

06/26/2008 14:38:07 [0:11922]: write_to_qrsh - address = cluster.ifir.edu.ar:54593
06/26/2008 14:38:07 [0:11922]: write_to_qrsh - host = cluster.ifir.edu.ar, port = 54593
06/26/2008 14:38:07 [0:11922]: error connecting stream socket: Connection refused

Shepherd error:
06/26/2008 14:38:07 [0:11922]: can't get qrsh_exit_code


Shepherd pe_hostfile:
era-q1.cluster.ifir.edu.ar 1 sistint at era-q1.cluster.ifir.edu.ar <NULL>
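
For anyone comparing several reports like this one, here is a rough
sketch of how the failing steps could be pulled out of a shepherd trace.
It is not part of Grid Engine. The function names parse_trace and
find_failures are made up for illustration, the "[uid:pid]" reading of
the bracket is an assumption based on how the values change in the trace
above, and the substring match on "can't" and "error" is deliberately
crude (it would also flag harmless lines such as "Initializing error
file").

```python
import re

# Trace lines in this report look like:
#   "06/26/2008 14:38:07 [0:11922]: message"
# The bracket is assumed to hold "uid:pid".
TRACE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) \[(\d+):(\d+)\]: (.*)$"
)


def parse_trace(text):
    """Return a list of (timestamp, uid, pid, message) tuples.

    A non-empty line without a leading timestamp is treated as a
    continuation of the previous entry, so wrapped entries are rejoined.
    """
    entries = []
    for line in text.splitlines():
        match = TRACE_RE.match(line)
        if match:
            ts, uid, pid, msg = match.groups()
            entries.append([ts, int(uid), int(pid), msg.strip()])
        elif entries and line.strip():
            joined = entries[-1][3] + " " + line.strip()
            entries[-1][3] = joined.strip()
    return [tuple(entry) for entry in entries]


def find_failures(entries, needles=("can't", "error")):
    """Return the entries whose message contains any of the needles."""
    return [e for e in entries if any(n in e[3] for n in needles)]


# A short excerpt copied from the trace in this report.
SAMPLE = """\
06/26/2008 14:38:07 [0:11922]: job exited with status 0
06/26/2008 14:38:07 [0:11922]: can't open file
/local/9086.1.sistint/qrsh_exit_code: No such file or directory
06/26/2008 14:38:07 [0:11922]: can't get qrsh_exit_code
"""

if __name__ == "__main__":
    for ts, uid, pid, msg in find_failures(parse_trace(SAMPLE)):
        print(ts, pid, msg)
```

Run on the full trace, this would single out the missing
qrsh_exit_code file and the refused connection back to the qrsh client,
which are the two steps that precede the queue being set to ERROR.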
