[GE users] switched to 6.1 and many problems

Reuti reuti at staff.uni-marburg.de
Wed Aug 8 21:57:02 BST 2007



On 08.08.2007 at 17:59, Yifan Zhang wrote:

> We were using SGE5.3, and it has been doing its job for years. Now
> we have decided to switch to GE6.1. At the beginning it worked, but then,
> for no apparent reason, every host gets set to error automatically. The
> job finishes fine. It is a qsub job, not qrsh. Please help if you know
> what is causing this. A typical error message is as
> follows:
>
> Job 13851 caused action: Queue "real.q at hal00.softsound.com" set to  
> ERROR
> User        = ajr
> Queue       = real.q at hal00.softsound.com
> Host        = hal00.softsound.com
> Start Time  = <unknown>
> End Time    = <unknown>
> failed before job:08/08/2007 16:45:40 [0:15106]: can't open file
> /tmp/13851.1.real.q/pid: No such file or directory
> Shepherd trace:
> 08/08/2007 16:45:37 [16384:15105]: shepherd called with uid = 0,  
> euid = 16384
> 08/08/2007 16:45:37 [16384:15105]: setpgid(15105, 15105) returned 0
> 08/08/2007 16:45:37 [16384:15105]: no prolog script to start
> 08/08/2007 16:45:37 [16384:15105]: forked "job" with pid 15106
> 08/08/2007 16:45:37 [16384:15106]: processing qlogin job
> 08/08/2007 16:45:37 [16384:15106]: pid=15106 pgrp=15106 sid=15106  
> old pgrp=15105 getlogin()=<no login set>
> 08/08/2007 16:45:37 [16384:15105]: child: job - pid: 15106
> 08/08/2007 16:45:37 [16384:15106]: reading passwd information for  
> user 'root'
> 08/08/2007 16:45:37 [16384:15106]: setting limits
> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_CPU setting: (soft  
> 18446744073709551615 hard 18446744073709551615) resulting: (soft  
> 18446744073709551615 hard 18446744073709551615)
> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_FSIZE setting: (soft  
> 18446744073709551615 hard 18446744073709551615) resulting: (soft  
> 18446744073709551615 hard 18446744073709551615)
> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_DATA setting: (soft  
> 18446744073709551615 hard 18446744073709551615) resulting: (soft  
> 18446744073709551615 hard 18446744073709551615)
> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_STACK setting: (soft  
> 18446744073709551615 hard 18446744073709551615) resulting: (soft  
> 18446744073709551615 hard 18446744073709551615)
> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_CORE setting: (soft  
> 18446744073709551615 hard 18446744073709551615) resulting: (soft  
> 18446744073709551615 hard 18446744073709551615)
> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_VMEM/RLIMIT_AS setting:  
> (soft 18446744073709551615 hard 18446744073709551615) resulting:  
> (soft 18446744073709551615 hard 18446744073709551615)
> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_RSS setting: (soft  
> 18446744073709551615 hard 18446744073709551615) resulting: (soft  
> 18446744073709551615 hard 18446744073709551615)
> 08/08/2007 16:45:37 [16384:15106]: setting environment
> 08/08/2007 16:45:37 [16384:15106]: Initializing error file
> 08/08/2007 16:45:37 [16384:15106]: switching to intermediate/target  
> user
> 08/08/2007 16:45:37 [16519:15106]: closing all filedescriptors
> 08/08/2007 16:45:37 [16519:15106]: further messages are in "error"  
> and "trace"
> 08/08/2007 16:45:37 [0:15106]: now running with uid=0, euid=0
> 08/08/2007 16:45:37 [0:15106]: start qlogin
> 08/08/2007 16:45:37 [0:15106]: calling qlogin_starter(
> /home/sound0/softsound/GridEngine/ssqueue/spool/hal00/active_jobs/13851.1,
> /usr/local/softsound/GridEngine/utilbin/lx24-amd64/rshd -l);
> 08/08/2007 16:45:37 [0:15106]: uid = 0, euid = 0, gid = 0, egid = 0
> 08/08/2007 16:45:37 [0:15106]: using sfd 1
> 08/08/2007 16:45:37 [0:15106]: bound to port 52903
> 08/08/2007 16:45:37 [0:15106]: write_to_qrsh - data =
> 0:52903:/usr/local/softsound/GridEngine/utilbin/lx24-amd64:/home/sound0/softsound/GridEngine/ssqueue/spool/hal00/active_jobs/13851.1:hal00.softsound.com
> 08/08/2007 16:45:37 [0:15106]: write_to_qrsh - address =  
> munge.softsound.com:33046
> 08/08/2007 16:45:37 [0:15106]: write_to_qrsh - host =  
> munge.softsound.com, port = 33046

Your qmaster is hal00 and you want to access munge.softsound.com?
Is there something special about munge? Does it show up in qhost? Does
this node have more than one name and/or network card?
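
One way to check these points from the execution host is to look at what the submit host's name resolves to and whether a TCP connection gets through at all. The following is only a minimal Python sketch and not part of Grid Engine; the helper names resolve_all, can_connect and report are made up for illustration. The port 33046 in the trace belongs to that one qrsh client, so you would have to probe a port on munge that something is actually listening on (22 is used below as an assumption that sshd runs there).

```python
import socket


def resolve_all(hostname):
    """Return the sorted, de-duplicated IP addresses the resolver gives for hostname."""
    try:
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Name does not resolve at all on this machine.
        return []
    return sorted({info[4][0] for info in infos})


def can_connect(host, port, timeout=3.0):
    """Try a plain TCP connect; return (ok, error_text)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except OSError as exc:
        # Covers refused connections, timeouts and "No route to host".
        return False, str(exc)


def report(host, port):
    """Print what this machine sees for host and whether port is reachable."""
    addresses = resolve_all(host)
    print(host, "resolves to", addresses if addresses else "nothing")
    ok, err = can_connect(host, port)
    print("connect to port", port, "->", "ok" if ok else "failed: " + err)
```

Calling report("munge.softsound.com", 22) on hal00 shows which address hal00 uses for munge. More than one address, or an address on a network that hal00 cannot reach, would be consistent with the "No route to host" in the trace.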

-- Reuti


> 08/08/2007 16:45:40 [0:15106]: error connecting stream socket: No  
> route to host
> 08/08/2007 16:45:40 [0:15106]: communication with qrsh failed
> 08/08/2007 16:45:40 [0:15106]: forked "job" with pid 0
> 08/08/2007 16:45:40 [0:15106]: child: job - pid: 0
> 08/08/2007 16:45:40 [0:15106]: wait3 returned -1
> 08/08/2007 16:45:40 [0:15106]: can't open file
> /tmp/13851.1.real.q/pid: No such file or directory
> 08/08/2007 16:45:40 [0:15106]: write_to_qrsh - data = 1:can't open  
> file /tmp/13851.1.real.q/pid: No such file or directory
> 08/08/2007 16:45:40 [0:15106]: write_to_qrsh - address =  
> munge.softsound.com
> 08/08/2007 16:45:40 [0:15106]: illegal value for qrsh_control_port:  
> "munge.softsound.com". Should be host:port
> 08/08/2007 16:45:40 [16384:15105]: wait3 returned 15106 (status:  
> 2816; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 11)
> 08/08/2007 16:45:40 [16384:15105]: job exited with exit status 11
> 08/08/2007 16:45:40 [16384:15105]: reaped "job" with pid 15106
> 08/08/2007 16:45:40 [16384:15105]: job exited not due to signal
> 08/08/2007 16:45:40 [16384:15105]: job exited with status 11
> 08/08/2007 16:45:40 [0:15105]: can't open file
> /tmp/13851.1.real.q/pid: No such file or directory
> 08/08/2007 16:45:40 [0:15105]: write_to_qrsh - data = 1:can't open  
> file /tmp/13851.1.real.q/pid: No such file or directory
> 08/08/2007 16:45:40 [0:15105]: write_to_qrsh - address =  
> munge.softsound.com:33046
> 08/08/2007 16:45:40 [0:15105]: write_to_qrsh - host =  
> munge.softsound.com, port = 33046
> 08/08/2007 16:45:43 [0:15105]: error connecting stream socket: No  
> route to host
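
The trace above shows the same address once as "munge.softsound.com:33046" and once as plain "munge.softsound.com", and the second form is rejected with "Should be host:port". As an illustration of that format check only, here is a small Python sketch. It is not the shepherd's actual parser, and the function name parse_control_address is hypothetical.

```python
def parse_control_address(value):
    """Split 'host:port' into (host, port); raise ValueError for anything else."""
    # rpartition splits on the last colon, so the port is whatever follows it.
    host, sep, port_text = value.rpartition(":")
    if not sep or not host or not port_text.isdigit():
        raise ValueError(f"expected host:port, got {value!r}")
    port = int(port_text)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range in {value!r}")
    return host, port
```

With the two values from the trace, parse_control_address("munge.softsound.com:33046") returns the host and port as a tuple, while parse_control_address("munge.softsound.com") raises ValueError because there is no port part.
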
>
> Shepherd error:
> 08/08/2007 16:45:40 [0:15106]: can't open file
> /tmp/13851.1.real.q/pid: No such file or directory
> 08/08/2007 16:45:40 [0:15105]: can't open file
> /tmp/13851.1.real.q/pid: No such file or directory
>
>
> Thank you very much!
>
> -- 
> /*
> * Yifan Zhang
> *
> * Softsound
> * +44 (0)1223 448 097
> */
>
> "Autonomy's search technology is becoming a de facto standard for
> companies" - FT, 4th July 07
>
> Cream is a configuration of the Vim text editor that consists of a  
> set of scripts which can be run within Vim to make it behave more  
> like an editor now common to most personal computers which conform  
> to the Common User Access standards of interface and operability.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net




