[GE users] qrsh /bin/bash error mark all Queue to Error state

Angel Arancibia angel.arancibia at gmail.com
Fri Jul 4 15:58:57 BST 2008



2008/7/3 Reuti <reuti at staff.uni-marburg.de>:
> Am 03.07.2008 um 17:34 schrieb Angel Arancibia:
>
> What is "qacct -j 9483" saying - is there any record with an  error code?
> Are you getting any eMails with "-m bea" with a reason? The setting of
> "loglevel" is "log_info" in SGE's configuration and still no output about
> the cause of reason in any of SGE's messages in
> $SGE_ROOT/spool/qmaster/messages or era-q9/messages and so on files (job
> rescheduled because of...)?
>

This time it is job 9515. loglevel is set to log_info.

aarancibia at cluster:~$qrsh -m bea -q sistint
Warning: Permanently added 'era-q1.cluster.ifir.edu.ar' (RSA) to the list of known hosts.
aarancibia at era-q1:~$

This mail arrives:
###############
Job 9515 (bash) Started
 User       = aarancibia
 Queue      = sistint
 Host       = era-q1.cluster.ifir.edu.ar
 Start Time = 07/04/2008 11:47:02
##########

Now I close the terminal.

Mail sent to the admin:
##################
Job 9515 caused action: Queue "sistint at era-q1.cluster.ifir.edu.ar" set to ERROR
 User        = aarancibia
 Queue       = sistint at era-q1.cluster.ifir.edu.ar
 Host        = era-q1.cluster.ifir.edu.ar
 Start Time  = <unknown>
 End Time    = <unknown>
failed before job:07/04/2008 11:47:34 [0:10297]: can't get qrsh_exit_code
Shepherd trace:
07/04/2008 11:47:02 [0:10297]: shepherd called with uid = 0, euid = 0
07/04/2008 11:47:02 [0:10297]: starting up 6.1
07/04/2008 11:47:02 [0:10297]: setpgid(10297, 10297) returned 0
07/04/2008 11:47:02 [0:10297]: no prolog script to start
07/04/2008 11:47:02 [0:10298]: processing qlogin job
07/04/2008 11:47:02 [0:10298]: pid=10298 pgrp=10298 sid=10298 old pgrp=10297 getlogin()=<no login set>
07/04/2008 11:47:02 [0:10298]: reading passwd information for user 'root'
07/04/2008 11:47:02 [0:10298]: setosjobid: uid = 0, euid = 0
07/04/2008 11:47:02 [0:10298]: setting limits
07/04/2008 11:47:02 [0:10298]: RLIMIT_CPU setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
07/04/2008 11:47:02 [0:10298]: RLIMIT_FSIZE setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
07/04/2008 11:47:02 [0:10298]: RLIMIT_DATA setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
07/04/2008 11:47:02 [0:10298]: RLIMIT_STACK setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
07/04/2008 11:47:02 [0:10298]: RLIMIT_CORE setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
07/04/2008 11:47:02 [0:10298]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
07/04/2008 11:47:02 [0:10298]: RLIMIT_RSS setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
07/04/2008 11:47:02 [0:10298]: setting environment
07/04/2008 11:47:02 [0:10298]: Initializing error file
07/04/2008 11:47:02 [0:10298]: switching to intermediate/target user
07/04/2008 11:47:02 [0:10297]: forked "job" with pid 10298
07/04/2008 11:47:02 [0:10297]: child: job - pid: 10298
07/04/2008 11:47:02 [1038:10298]: closing all filedescriptors
07/04/2008 11:47:02 [1038:10298]: further messages are in "error" and "trace"
07/04/2008 11:47:02 [0:10298]: now running with uid=0, euid=0
07/04/2008 11:47:02 [0:10298]: start qlogin
07/04/2008 11:47:02 [0:10298]: calling qlogin_starter(/local/sys/sge/era-q1/active_jobs/9515.1, /usr/local/sbin/sshd -i);
07/04/2008 11:47:02 [0:10298]: uid = 0, euid = 0, gid = 0, egid = 0
07/04/2008 11:47:02 [0:10298]: using sfd 1
07/04/2008 11:47:02 [0:10298]: bound to port 47001
07/04/2008 11:47:02 [0:10298]: write_to_qrsh - data = 0:47001:/home/sys/sge/utilbin/lx24-amd64:/local/sys/sge/era-q1/active_jobs/9515.1:era-q1.cluster.ifir.edu.ar
07/04/2008 11:47:02 [0:10298]: write_to_qrsh - address = cluster.ifir.edu.ar:60008
07/04/2008 11:47:02 [0:10298]: write_to_qrsh - host = cluster.ifir.edu.ar, port = 60008
07/04/2008 11:47:02 [0:10298]: waiting for connection.
07/04/2008 11:47:02 [0:10298]: accepted connection on fd 2
07/04/2008 11:47:02 [0:10298]: daemon to start: |/usr/local/sbin/sshd -i|
07/04/2008 11:47:34 [0:10297]: wait3 returned 10298 (status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
07/04/2008 11:47:34 [0:10297]: job exited with exit status 0
07/04/2008 11:47:34 [0:10297]: reaped "job" with pid 10298
07/04/2008 11:47:34 [0:10297]: job exited not due to signal
07/04/2008 11:47:34 [0:10297]: job exited with status 0
07/04/2008 11:47:34 [0:10297]: found pid of qrsh client command: -10303
07/04/2008 11:47:34 [0:10297]: now sending signal KILL to pid -10303
07/04/2008 11:47:34 [0:10297]: get_exit_code_of_qrsh_starter - TMPDIR = /local/9515.1.sistint, pe_task_id = 0
07/04/2008 11:47:34 [0:10297]: can't open file /local/9515.1.sistint/qrsh_exit_code: No such file or directory
07/04/2008 11:47:34 [0:10297]: can't get qrsh_exit_code

07/04/2008 11:47:34 [0:10297]: write_to_qrsh - data = 1:can't get qrsh_exit_code

07/04/2008 11:47:34 [0:10297]: write_to_qrsh - address = cluster.ifir.edu.ar:60008
07/04/2008 11:47:34 [0:10297]: write_to_qrsh - host = cluster.ifir.edu.ar, port = 60008
07/04/2008 11:47:34 [0:10297]: error connecting stream socket: Connection refused

Shepherd error:
07/04/2008 11:47:34 [0:10297]: can't get qrsh_exit_code


Shepherd pe_hostfile:
era-q1.cluster.ifir.edu.ar 1 sistint at era-q1.cluster.ifir.edu.ar <NULL>
##################

And another mail:

####
Job 9515 (bash) Aborted
 Exit Status      = -1
 Signal           = unknown signal
 User             = aarancibia
 Queue            = sistint at era-q1.cluster.ifir.edu.ar
 Host             = era-q1.cluster.ifir.edu.ar
 Start Time       = <unknown>
 End Time         = <unknown>
 CPU              = NA
 Max vmem         = NA
failed before job because:
07/04/2008 11:47:34 [0:10297]: can't get qrsh_exit_code
####

The qmaster messages file reports:

###################
aarancibia at cluster:~$cat /home/sys/sge/default/spool/qmaster/messages
(...)
07/04/2008 11:47:35|qmaster|fs|W|job 9515.1 failed on host era-q1.cluster.ifir.edu.ar general before job because: 07/04/2008 11:47:34 [0:10297]: can't get qrsh_exit_code
07/04/2008 11:47:35|qmaster|fs|W|rescheduling job 9515.1
07/04/2008 11:47:36|qmaster|fs|E|queue sistint marked QERROR as result of job 9515's failure at host era-q1.cluster.ifir.edu.ar
07/04/2008 11:47:48|qmaster|fs|W|job 9515.1 failed on host era-q2.cluster.ifir.edu.ar general before job because: 07/04/2008 11:47:47 [0:10408]: can't open file /local/9515.1.sistint/pid: No such file or directory
07/04/2008 11:47:48|qmaster|fs|W|rescheduling job 9515.1
07/04/2008 11:47:48|qmaster|fs|E|queue sistint marked QERROR as result of job 9515's failure at host era-q2.cluster.ifir.edu.ar
07/04/2008 11:48:03|qmaster|fs|W|job 9515.1 failed on host era-q3.cluster.ifir.edu.ar general before job because: 07/04/2008 11:48:02 [0:10140]: can't open file /local/9515.1.sistint/pid: No such file or directory
07/04/2008 11:48:03|qmaster|fs|W|rescheduling job 9515.1
07/04/2008 11:48:03|qmaster|fs|E|queue sistint marked QERROR as result of job 9515's failure at host era-q3.cluster.ifir.edu.ar
07/04/2008 11:48:12|qmaster|fs|I|aarancibia has deleted job 9515
###########
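
To make the sequence above easier to follow (one host after another being marked QERROR as the job is rescheduled), here is a minimal Python sketch that pulls those events out of a messages file. It assumes each entry has the layout "date time|daemon|host|level|message" seen above, and that long entries may have been wrapped by the mail client. The function names join_wrapped and qerror_events are only illustrative; they are not part of SGE.

```python
import re

# Assumed entry layout, taken from the excerpt above:
#   MM/DD/YYYY HH:MM:SS|daemon|host|level|message
ENTRY_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\|")
QERROR = re.compile(
    r"queue (\S+) marked QERROR as result of job (\d+)'s failure at host (\S+)"
)


def join_wrapped(lines):
    """Merge mail-wrapped continuation lines back into whole log entries."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        if ENTRY_START.match(line) or not entries:
            entries.append(line)
        else:
            # No "timestamp|" prefix: treat it as the tail of the previous entry.
            entries[-1] += " " + line.strip()
    return entries


def qerror_events(lines):
    """Return (timestamp, queue, job id, host) for every QERROR entry."""
    events = []
    for entry in join_wrapped(lines):
        fields = entry.split("|", 4)
        if len(fields) != 5:
            continue  # not a log entry, e.g. the "(...)" placeholder
        timestamp, _daemon, _host, level, message = fields
        match = QERROR.search(message)
        if level == "E" and match:
            events.append(
                (timestamp, match.group(1), int(match.group(2)), match.group(3))
            )
    return events
```

Calling qerror_events(open(path)) with the path of the qmaster messages file should list one tuple per queue instance that was put into the error state.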

And the qacct report:

aarancibia at cluster:~$qacct -j 9515
==============================================================
qname        sistint
hostname     era-q1.cluster.ifir.edu.ar
group        UNKNOWN
owner        aarancibia
project      NONE
department   defaultdepartment
jobname      bash
jobnumber    9515
taskid       undefined
account      sge
priority     0
qsub_time    Fri Jul  4 11:47:02 2008
start_time   -/-
end_time     -/-
granted_pe   NONE
slots        0
failed       11  : before job
exit_status  0
ru_wallclock 0
ru_utime     0
ru_stime     0
ru_maxrss    0
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    0
ru_majflt    0
ru_nswap     0
ru_inblock   0
ru_oublock   0
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     0
ru_nivcsw    0
cpu          0
mem          0.000
io           0.000
iow          0.000
maxvmem      0.000
==============================================================
qname        sistint
hostname     era-q2.cluster.ifir.edu.ar
group        UNKNOWN
owner        aarancibia
project      NONE
department   defaultdepartment
jobname      bash
jobnumber    9515
taskid       undefined
account      sge
priority     0
qsub_time    Fri Jul  4 11:47:02 2008
start_time   -/-
end_time     -/-
granted_pe   NONE
slots        0
failed       11  : before job
exit_status  0
ru_wallclock 0
ru_utime     0
ru_stime     0
ru_maxrss    0
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    0
ru_majflt    0
ru_nswap     0
ru_inblock   0
ru_oublock   0
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     0
ru_nivcsw    0
cpu          0
mem          0.000
io           0.000
iow          0.000
maxvmem      0.000
==============================================================


> As I can't reproduce it (openSUSE server & terminal, besides Mac terminal) -
> maybe it's not related to SGE but to Debian/Ubuntu (you mentioned to use
> this).

Maybe, I don't know. I'm going to try it from a Windows machine.

Thanks

Angel
