[GE users] SGE 6 sending emails

Gavin Kelman gavin.kelman at uk.lionbioscience.com
Thu Jan 6 16:21:51 GMT 2005


Ron Chen wrote:
> --- Gavin Kelman wrote: 
> 
>>I compiled a new shepherd and replaced the old one. Jobs
>>are still running and producing the emails, but there's no
>>exit_status file in /tmp.
> 
> 
> Then look at the trace again, it should have something
> new.

I've since found out that all the jobs the users submit
are in fact interactive, as they submit them all by qrsh.

>>All jobs seem to produce these emails. Most of our jobs are
>>run by one user, and their jobs on this one machine produce
>>these emails.
> 
> 
> So if you run a hello world job, it also fails?

Running qrsh as a normal user, as opposed to the user who is
always running loads of jobs:

staines@dechirico> qrsh -now n -q all_times.q@watts -verbose "echo hello"
waiting for interactive job to be scheduled ...
Your interactive job 78303 has been successfully scheduled.
Establishing /usr/local/common/apps/sge/utilbin/lx24-x86/rsh session to host watts.lionbio.co.uk ...
hello
/usr/local/common/apps/sge/utilbin/lx24-x86/rsh exited with exit code 0
reading exit code from shepherd ... 0

And I get an email:

Job 78303 caused action: none
  User        = staines
  Queue       = all_times.q@watts.lionbio.co.uk
  Host        = watts.lionbio.co.uk
  Start Time  = 01/06/2005 16:17:12
  End Time    = 01/06/2005 16:17:12
failed before writing exit_status:shepherd exited with exit status 19
Shepherd trace:
01/06/2005 16:17:12 [2041:12161]: closing all filedescriptors
01/06/2005 16:17:12 [2041:12161]: further messages are in "error" and "trace"
01/06/2005 16:17:12 [0:12161]: calling qlogin_starter(/usr/local/common/apps/sge/default/spool/watts/active_jobs/78303.1, /usr/local/common/apps/sge/utilbin/irix65/rshd -l);
01/06/2005 16:17:12 [0:12161]: uid = 0, euid = 0, gid = 0, egid = 0
01/06/2005 16:17:12 [0:12161]: uid = 0, euid = 0, gid = 0, egid = 0
01/06/2005 16:17:12 [0:12161]: using sfd 1
01/06/2005 16:17:12 [0:12161]: bound to port 1277

01/06/2005 16:17:12 [0:12161]: write_to_qrsh - data = 0:1277:/usr/local/common/apps/sge/utilbin/irix65:/usr/local/common/apps/sge/default/spool/watts/active_jobs/78303.1:watts.lionbio.co.uk
01/06/2005 16:17:12 [0:12161]: write_to_qrsh - address = dechirico.lionbio.co.uk:58775
01/06/2005 16:17:12 [0:12161]: write_to_qrsh - host = dechirico.lionbio.co.uk, port = 58775
01/06/2005 16:17:12 [0:12161]: waiting for connection.
01/06/2005 16:17:12 [0:12161]: accepted connection on fd 2
01/06/2005 16:17:12 [0:12161]: daemon to start: |/usr/local/common/apps/sge/utilbin/irix65/rshd -l|
01/06/2005 16:17:12 [2108:12159]: setosjobid: uid = 0, euid = 2108
01/06/2005 16:17:12 [0:12159]: in irix code
01/06/2005 16:17:12 [0:12159]: 45
01/06/2005 16:17:12 [0:12159]: can't get id for project "none"
01/06/2005 16:17:12 [2108:12160]: wait3 returned 12161 (status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
01/06/2005 16:17:12 [2108:12160]: job exited with exit status 0
01/06/2005 16:17:12 [2108:12160]: reaped "job" with pid 12161
01/06/2005 16:17:12 [2108:12160]: job exited not due to signal
01/06/2005 16:17:12 [2108:12160]: job exited with status 0
01/06/2005 16:17:13 [0:12160]: found pid of qrsh client command: -12165
01/06/2005 16:17:13 [2108:12160]: now sending signal KILL to pid -12165
01/06/2005 16:17:13 [2108:12160]: get_exit_code_of_qrsh_starter()
01/06/2005 16:17:13 [2108:12160]: get_exit_code_of_qrsh_starter - TMPDIR = /tmp/78303.1.all_times.q, pe_task_id = 0
01/06/2005 16:17:13 [2108:12160]: error code from remote command is 0
01/06/2005 16:17:13 [2108:12160]: get_error_of_qrsh_starter()
01/06/2005 16:17:13 [2108:12160]: get_error_of_qrsh_starter - TMPDIR = /tmp/78303.1.all_times.q, qrsh_task_id = 0
01/06/2005 16:17:13 [2108:12160]: job exited normally, exit code is 0

01/06/2005 16:17:13 [2108:12160]: writing usage file to "usage"
01/06/2005 16:17:13 [2108:12160]: no tasker to notify
01/06/2005 16:17:13 [2108:12160]: no epilog script to start
01/06/2005 16:17:13 [2108:12160]: write_exit_code_to_qrsh(0)
01/06/2005 16:17:13 [2108:12160]: write_exit_code_to_qrsh - TMPDIR = /tmp/78303.1.all_times.q, pe_task_id = 0
01/06/2005 16:17:13 [2108:12160]: error code from remote command is 0
01/06/2005 16:17:13 [2108:12160]: write_to_qrsh - data = 0
01/06/2005 16:17:13 [2108:12160]: write_to_qrsh - address = dechirico.lionbio.co.uk:58775
01/06/2005 16:17:13 [2108:12160]: write_to_qrsh - host = dechirico.lionbio.co.uk, port = 58775

Shepherd pe_hostfile:
watts.lionbio.co.uk 1 all_times.q@watts.lionbio.co.uk UNDEFINED
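
The trace above interleaves lines from several processes (12159, 12160
and 12161), which makes it awkward to follow by eye. A minimal sketch
for grouping the lines by pid, assuming the bracketed pair is
[euid:pid] and that any line without a timestamp is a mail-wrapped
continuation of the previous one. The function name group_by_pid and
the regular expression are illustrative only, not part of SGE:

```python
import re
from collections import defaultdict

# Assumed layout of a trace line: "MM/DD/YYYY HH:MM:SS [euid:pid]: message"
TRACE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) \[(\d+):(\d+)\]: ?(.*)$"
)

def group_by_pid(lines):
    """Return {pid: [[timestamp, euid, message], ...]} in trace order.

    Lines that do not start with a timestamp are treated as wrapped
    continuations and appended to the previous message.
    """
    by_pid = defaultdict(list)
    last = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines inside the trace
        match = TRACE_RE.match(line)
        if match:
            stamp, euid, pid, msg = match.groups()
            last = [stamp, int(euid), msg]
            by_pid[int(pid)].append(last)
        elif last is not None:
            # continuation of a wrapped line
            last[2] = (last[2] + " " + line).strip()
    return dict(by_pid)

if __name__ == "__main__":
    sample = [
        '01/06/2005 16:17:12 [0:12159]: can\'t get id for project "none"',
        "01/06/2005 16:17:12 [2108:12160]: wait3 returned 12161 (status: 0; ",
        "WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)",
        "01/06/2005 16:17:12 [2108:12160]: job exited with exit status 0",
    ]
    for pid, entries in sorted(group_by_pid(sample).items()):
        for stamp, euid, msg in entries:
            print(pid, euid, stamp, msg)
```

Feeding it the trace file from the job's active_jobs directory, read
line by line, gives one list per process, so the lines written by
12159 can be read separately from those written by 12160 and 12161.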


Cheers,
Gavin.

-- 
Gavin Kelman
UNIX Administrator

