[GE users] switched to 6.1 and many problems

Yifan Zhang yifanz at softsound.com
Wed Aug 8 16:59:28 BST 2007


Hi, all,

We had been using SGE 5.3, and it did its job for years. We have now switched to GE 6.1. At the beginning it worked, but then, for no apparent reason, every host started being set to the error state automatically. The job itself finishes fine. It is a qsub job, not qrsh. Please help if you know what is causing this. A typical error message is as follows:

Job 13851 caused action: Queue "real.q@hal00.softsound.com" set to ERROR
 User        = ajr
 Queue       = real.q@hal00.softsound.com
 Host        = hal00.softsound.com
 Start Time  = <unknown>
 End Time    = <unknown>
failed before job:08/08/2007 16:45:40 [0:15106]: can't open file /tmp/13851.1.real.q/pid: No such file or directory
Shepherd trace:
08/08/2007 16:45:37 [16384:15105]: shepherd called with uid = 0, euid = 16384
08/08/2007 16:45:37 [16384:15105]: setpgid(15105, 15105) returned 0
08/08/2007 16:45:37 [16384:15105]: no prolog script to start
08/08/2007 16:45:37 [16384:15105]: forked "job" with pid 15106
08/08/2007 16:45:37 [16384:15106]: processing qlogin job
08/08/2007 16:45:37 [16384:15106]: pid=15106 pgrp=15106 sid=15106 old pgrp=15105 getlogin()=<no login set>
08/08/2007 16:45:37 [16384:15105]: child: job - pid: 15106
08/08/2007 16:45:37 [16384:15106]: reading passwd information for user 'root'
08/08/2007 16:45:37 [16384:15106]: setting limits
08/08/2007 16:45:37 [16384:15106]: RLIMIT_CPU setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
08/08/2007 16:45:37 [16384:15106]: RLIMIT_FSIZE setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
08/08/2007 16:45:37 [16384:15106]: RLIMIT_DATA setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
08/08/2007 16:45:37 [16384:15106]: RLIMIT_STACK setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
08/08/2007 16:45:37 [16384:15106]: RLIMIT_CORE setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
08/08/2007 16:45:37 [16384:15106]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
08/08/2007 16:45:37 [16384:15106]: RLIMIT_RSS setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
08/08/2007 16:45:37 [16384:15106]: setting environment
08/08/2007 16:45:37 [16384:15106]: Initializing error file
08/08/2007 16:45:37 [16384:15106]: switching to intermediate/target user
08/08/2007 16:45:37 [16519:15106]: closing all filedescriptors
08/08/2007 16:45:37 [16519:15106]: further messages are in "error" and "trace"
08/08/2007 16:45:37 [0:15106]: now running with uid=0, euid=0
08/08/2007 16:45:37 [0:15106]: start qlogin
08/08/2007 16:45:37 [0:15106]: calling qlogin_starter(/home/sound0/softsound/GridEngine/ssqueue/spool/hal00/active_jobs/13851.1, /usr/local/softsound/GridEngine/utilbin/lx24-amd64/rshd -l);
08/08/2007 16:45:37 [0:15106]: uid = 0, euid = 0, gid = 0, egid = 0
08/08/2007 16:45:37 [0:15106]: using sfd 1
08/08/2007 16:45:37 [0:15106]: bound to port 52903
08/08/2007 16:45:37 [0:15106]: write_to_qrsh - data = 0:52903:/usr/local/softsound/GridEngine/utilbin/lx24-amd64:/home/sound0/softsound/GridEngine/ssqueue/spool/hal00/active_jobs/13851.1:hal00.softsound.com
08/08/2007 16:45:37 [0:15106]: write_to_qrsh - address = munge.softsound.com:33046
08/08/2007 16:45:37 [0:15106]: write_to_qrsh - host = munge.softsound.com, port = 33046
08/08/2007 16:45:40 [0:15106]: error connecting stream socket: No route to host
08/08/2007 16:45:40 [0:15106]: communication with qrsh failed
08/08/2007 16:45:40 [0:15106]: forked "job" with pid 0
08/08/2007 16:45:40 [0:15106]: child: job - pid: 0
08/08/2007 16:45:40 [0:15106]: wait3 returned -1
08/08/2007 16:45:40 [0:15106]: can't open file /tmp/13851.1.real.q/pid: No such file or directory
08/08/2007 16:45:40 [0:15106]: write_to_qrsh - data = 1:can't open file /tmp/13851.1.real.q/pid: No such file or directory
08/08/2007 16:45:40 [0:15106]: write_to_qrsh - address = munge.softsound.com
08/08/2007 16:45:40 [0:15106]: illegal value for qrsh_control_port: "munge.softsound.com". Should be host:port
08/08/2007 16:45:40 [16384:15105]: wait3 returned 15106 (status: 2816; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 11)
08/08/2007 16:45:40 [16384:15105]: job exited with exit status 11
08/08/2007 16:45:40 [16384:15105]: reaped "job" with pid 15106
08/08/2007 16:45:40 [16384:15105]: job exited not due to signal
08/08/2007 16:45:40 [16384:15105]: job exited with status 11
08/08/2007 16:45:40 [0:15105]: can't open file /tmp/13851.1.real.q/pid: No such file or directory
08/08/2007 16:45:40 [0:15105]: write_to_qrsh - data = 1:can't open file /tmp/13851.1.real.q/pid: No such file or directory
08/08/2007 16:45:40 [0:15105]: write_to_qrsh - address = munge.softsound.com:33046
08/08/2007 16:45:40 [0:15105]: write_to_qrsh - host = munge.softsound.com, port = 33046
08/08/2007 16:45:43 [0:15105]: error connecting stream socket: No route to host

Shepherd error:
08/08/2007 16:45:40 [0:15106]: can't open file /tmp/13851.1.real.q/pid: No such file or directory
08/08/2007 16:45:40 [0:15105]: can't open file /tmp/13851.1.real.q/pid: No such file or directory
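
In case it helps with reading the trace, here is a minimal Python sketch that pulls out the host and port that write_to_qrsh tried to contact, together with the error lines. It is not part of Grid Engine: the function and pattern names are made up for illustration, and it assumes the "date time [uid:pid]: message" line layout shown above.

```python
import re

# Layout assumed from the trace above: "MM/DD/YYYY HH:MM:SS [uid:pid]: message".
# LINE_RE, TARGET_RE and summarize_trace are hypothetical helper names.
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) \[(\d+):(\d+)\]: (.*)$"
)
TARGET_RE = re.compile(r"write_to_qrsh - host = (\S+), port = (\d+)")

# Message prefixes treated as errors; chosen from the messages in this trace only.
ERROR_PREFIXES = (
    "error connecting",
    "communication with qrsh failed",
    "can't open file",
    "illegal value",
)


def summarize_trace(text):
    """Return (targets, errors) found in a shepherd trace.

    targets: list of (host, port) pairs that write_to_qrsh reported.
    errors:  list of (timestamp, pid, message) for lines matching ERROR_PREFIXES.
    """
    targets = []
    errors = []
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip headers such as "Shepherd trace:" and blank lines
        timestamp, _uid, pid, message = match.groups()
        target = TARGET_RE.search(message)
        if target:
            targets.append((target.group(1), int(target.group(2))))
        elif message.startswith(ERROR_PREFIXES):
            errors.append((timestamp, int(pid), message))
    return targets, errors


if __name__ == "__main__":
    import sys

    # Usage: python summarize_trace.py < trace_file
    found_targets, found_errors = summarize_trace(sys.stdin.read())
    for host, port in found_targets:
        print("callback target:", host, port)
    for timestamp, pid, message in found_errors:
        print(timestamp, pid, message)
```

Fed the trace above, this only lists what the shepherd itself logged; it does not test the connection.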


Thank you very much!

-- 
/*
 * Yifan Zhang
 *
 * Softsound
 * +44 (0)1223 448 097
 */