[GE users] switched to 6.1 and many problems

Chansup Byun Chansup.Byun at Sun.COM
Wed Aug 8 22:06:09 BST 2007



Reuti wrote:
> Am 08.08.2007 um 17:59 schrieb Yifan Zhang:
>
>> We were using SGE 5.3, and it has been doing its job for years. Now we
>> have decided to switch to GE 6.1. At the beginning it worked, and then,
>> for no apparent reason, every host gets set to the error state
>> automatically. The job itself finishes fine. It is a qsub job, not
>> qrsh. Please help if you know what is causing this. The typical error
>> message is as follows:
>>
>> Job 13851 caused action: Queue "real.q@hal00.softsound.com" set to ERROR
>> User        = ajr
>> Queue       = real.q@hal00.softsound.com
>> Host        = hal00.softsound.com
>> Start Time  = <unknown>
>> End Time    = <unknown>
>> failed before job:08/08/2007 16:45:40 [0:15106]: can't open file 
>> /tmp/13851.1.real.q/pid: No such file or directory
>> Shepherd trace:
>> 08/08/2007 16:45:37 [16384:15105]: shepherd called with uid = 0, euid 
>> = 16384
>> 08/08/2007 16:45:37 [16384:15105]: setpgid(15105, 15105) returned 0
>> 08/08/2007 16:45:37 [16384:15105]: no prolog script to start
>> 08/08/2007 16:45:37 [16384:15105]: forked "job" with pid 15106
>> 08/08/2007 16:45:37 [16384:15106]: processing qlogin job
>> 08/08/2007 16:45:37 [16384:15106]: pid=15106 pgrp=15106 sid=15106 old 
>> pgrp=15105 getlogin()=<no login set>
>> 08/08/2007 16:45:37 [16384:15105]: child: job - pid: 15106
>> 08/08/2007 16:45:37 [16384:15106]: reading passwd information for 
>> user 'root'
>> 08/08/2007 16:45:37 [16384:15106]: setting limits
>> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_CPU setting: (soft 
>> 18446744073709551615 hard 18446744073709551615) resulting: (soft 
>> 18446744073709551615 hard 18446744073709551615)
>> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_FSIZE setting: (soft 
>> 18446744073709551615 hard 18446744073709551615) resulting: (soft 
>> 18446744073709551615 hard 18446744073709551615)
>> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_DATA setting: (soft 
>> 18446744073709551615 hard 18446744073709551615) resulting: (soft 
>> 18446744073709551615 hard 18446744073709551615)
>> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_STACK setting: (soft 
>> 18446744073709551615 hard 18446744073709551615) resulting: (soft 
>> 18446744073709551615 hard 18446744073709551615)
>> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_CORE setting: (soft 
>> 18446744073709551615 hard 18446744073709551615) resulting: (soft 
>> 18446744073709551615 hard 18446744073709551615)
>> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_VMEM/RLIMIT_AS setting: 
>> (soft 18446744073709551615 hard 18446744073709551615) resulting: 
>> (soft 18446744073709551615 hard 18446744073709551615)
>> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_RSS setting: (soft 
>> 18446744073709551615 hard 18446744073709551615) resulting: (soft 
>> 18446744073709551615 hard 18446744073709551615)
>> 08/08/2007 16:45:37 [16384:15106]: setting environment
>> 08/08/2007 16:45:37 [16384:15106]: Initializing error file
>> 08/08/2007 16:45:37 [16384:15106]: switching to intermediate/target user
>> 08/08/2007 16:45:37 [16519:15106]: closing all filedescriptors
>> 08/08/2007 16:45:37 [16519:15106]: further messages are in "error" 
>> and "trace"
>> 08/08/2007 16:45:37 [0:15106]: now running with uid=0, euid=0
>> 08/08/2007 16:45:37 [0:15106]: start qlogin
>> 08/08/2007 16:45:37 [0:15106]: calling 
>> qlogin_starter(/home/sound0/softsound/GridEngine/ssqueue/spool/hal00/active_jobs/13851.1, 
>> /usr/local/softsound/GridEngine/utilbin/lx24-amd64/rshd -l);
>> 08/08/2007 16:45:37 [0:15106]: uid = 0, euid = 0, gid = 0, egid = 0
>> 08/08/2007 16:45:37 [0:15106]: using sfd 1
>> 08/08/2007 16:45:37 [0:15106]: bound to port 52903
>> 08/08/2007 16:45:37 [0:15106]: write_to_qrsh - data = 
>> 0:52903:/usr/local/softsound/GridEngine/utilbin/lx24-amd64:/home/sound0/softsound/GridEngine/ssqueue/spool/hal00/active_jobs/13851.1:hal00.softsound.com 
>>
>> 08/08/2007 16:45:37 [0:15106]: write_to_qrsh - address = 
>> munge.softsound.com:33046
>> 08/08/2007 16:45:37 [0:15106]: write_to_qrsh - host = 
>> munge.softsound.com, port = 33046
>
> Your qmaster is hal00 and you want to access munge.softsound.com?
> Is there something special about munge? Does it show up in qhost? Does
> this node have more than one name and/or network card?

hal00 should be an execution host.
munge.softsound.com should be a submit host.
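
Since the trace shows the shepherd on hal00 trying to connect back to
munge.softsound.com and getting "No route to host", it may be worth
checking plain TCP reachability from the execution host to the submit
host. The following is only a minimal sketch: the helper name
can_connect is made up for illustration, and the port has to be one
that something on the submit host is actually listening on (the 33046
in the trace was chosen for that one session and will differ each time).

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Try one TCP connection; return (True, "") or (False, reason)."""
    try:
        # create_connection resolves the name and connects in one step
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except OSError as exc:
        # Covers resolver failures, timeouts, "No route to host", refusals
        return False, str(exc)
```

Run on hal00, something like can_connect("munge.softsound.com", 22)
would show whether the name resolves and whether a route exists at all.
A "No route to host" reason here would point at routing or a firewall
rather than at Grid Engine itself.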

>
> -- Reuti
>
>
>> 08/08/2007 16:45:40 [0:15106]: error connecting stream socket: No 
>> route to host
>> 08/08/2007 16:45:40 [0:15106]: communication with qrsh failed
>> 08/08/2007 16:45:40 [0:15106]: forked "job" with pid 0
>> 08/08/2007 16:45:40 [0:15106]: child: job - pid: 0
>> 08/08/2007 16:45:40 [0:15106]: wait3 returned -1
>> 08/08/2007 16:45:40 [0:15106]: can't open file 
>> /tmp/13851.1.real.q/pid: No such file or directory
>> 08/08/2007 16:45:40 [0:15106]: write_to_qrsh - data = 1:can't open 
>> file /tmp/13851.1.real.q/pid: No such file or directory
>> 08/08/2007 16:45:40 [0:15106]: write_to_qrsh - address = 
>> munge.softsound.com
>> 08/08/2007 16:45:40 [0:15106]: illegal value for qrsh_control_port: 
>> "munge.softsound.com". Should be host:port
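
The last trace message above rejects "munge.softsound.com" because the
port part is missing, while the earlier write_to_qrsh call used
"munge.softsound.com:33046". To illustrate the kind of check involved,
here is a small sketch. It is not the shepherd's actual code, and the
function name parse_host_port is hypothetical.

```python
def parse_host_port(value):
    """Split 'host:port' into (host, port); raise ValueError otherwise."""
    # rpartition splits on the last colon; sep is "" if there is none
    host, sep, port_text = value.rpartition(":")
    if not sep or not host or not port_text.isdecimal():
        raise ValueError('illegal value "%s", should be host:port' % value)
    port = int(port_text)
    if not 0 < port < 65536:
        raise ValueError("port out of range: %d" % port)
    return host, port
```

With this, "munge.softsound.com:33046" parses into a host and a port,
whereas the bare "munge.softsound.com" from the failing call raises
ValueError.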

How is ignore_fqdn set?
Could the FQDN hostname have confused the SGE shepherd?
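
To make the question concrete: with ignore_fqdn set to true, the domain
part of a host name is ignored when names are compared, so "munge" and
"munge.softsound.com" count as the same host. With it set to false they
do not. The sketch below only illustrates that idea under the assumption
that "ignoring the domain" means comparing the part before the first
dot. It is not Grid Engine's own resolver code, and the function names
are invented.

```python
def short_name(hostname):
    """Return the part of a host name before the first dot, lowercased."""
    return hostname.split(".", 1)[0].lower()


def hosts_match(a, b, ignore_fqdn):
    """Compare two host names the way the ignore_fqdn setting suggests."""
    if ignore_fqdn:
        # Domain part is dropped, so short and long forms are equal
        return short_name(a) == short_name(b)
    # Otherwise the full names must agree (case-insensitively)
    return a.lower() == b.lower()
```

If the submit host reports itself under a short name on one side and a
fully qualified name on the other, a false setting would make the two
look like different hosts.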

Regards,

- Chansup


>> 08/08/2007 16:45:40 [16384:15105]: wait3 returned 15106 (status: 
>> 2816; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 11)
>> 08/08/2007 16:45:40 [16384:15105]: job exited with exit status 11
>> 08/08/2007 16:45:40 [16384:15105]: reaped "job" with pid 15106
>> 08/08/2007 16:45:40 [16384:15105]: job exited not due to signal
>> 08/08/2007 16:45:40 [16384:15105]: job exited with status 11
>> 08/08/2007 16:45:40 [0:15105]: can't open file 
>> /tmp/13851.1.real.q/pid: No such file or directory
>> 08/08/2007 16:45:40 [0:15105]: write_to_qrsh - data = 1:can't open 
>> file /tmp/13851.1.real.q/pid: No such file or directory
>> 08/08/2007 16:45:40 [0:15105]: write_to_qrsh - address = 
>> munge.softsound.com:33046
>> 08/08/2007 16:45:40 [0:15105]: write_to_qrsh - host = 
>> munge.softsound.com, port = 33046
>> 08/08/2007 16:45:43 [0:15105]: error connecting stream socket: No 
>> route to host
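
As an aside on the "wait3 returned 15106 (status: 2816; ...)" line in
the trace above: the raw status 2816 and the reported WEXITSTATUS of 11
are the same information, since on Linux the exit code sits in the high
byte of the status (2816 = 11 * 256). A small sketch to decode such a
value on a Unix host, with decode_wait_status being a made-up helper
name:

```python
import os


def decode_wait_status(status):
    """Break a raw wait status into the fields the shepherd prints."""
    exited = os.WIFEXITED(status)
    return {
        "WIFSIGNALED": os.WIFSIGNALED(status),
        "WIFEXITED": exited,
        # The exit code is only meaningful for a normal exit
        "WEXITSTATUS": os.WEXITSTATUS(status) if exited else None,
    }
```

Calling decode_wait_status(2816) reports a normal exit with exit code
11, matching the "job exited with exit status 11" line.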
>>
>> Shepherd error:
>> 08/08/2007 16:45:40 [0:15106]: can't open file 
>> /tmp/13851.1.real.q/pid: No such file or directory
>> 08/08/2007 16:45:40 [0:15105]: can't open file 
>> /tmp/13851.1.real.q/pid: No such file or directory
>>
>>
>> Thank you very much!
>>
>> -- 
>> /*
>> * Yifan Zhang
>> *
>> * Softsound
>> * +44 (0)1223 448 097
>> */
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net



