[GE users] switched to 6.1 and many problems

Harald Pollinger Harald.Pollinger at Sun.COM
Thu Aug 9 10:49:45 BST 2007



Hi,

this is the trace file of a qrsh job. Are you sure there is no
link, alias or wrapper involved, and that the "qrsh" binary was not renamed
or copied to "qsub"?
"qsub -help" prints what the binary really is.

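A minimal sketch of how such a link or copy could be detected, assuming Python is available on the submit host. The helper names resolve_command and same_binary are hypothetical and not part of Grid Engine. The sketch resolves both command names through PATH and any symlinks, then compares the resulting files; it cannot see shell aliases or functions, which would have to be checked in the shell itself.

```python
import filecmp
import os
import shutil


def resolve_command(name, path=None):
    """Return the real file behind a command name, or None if it is not found."""
    found = shutil.which(name, path=path)
    return os.path.realpath(found) if found else None


def same_binary(name_a, name_b, path=None):
    """True if both commands end up at the same file or at identical contents."""
    a = resolve_command(name_a, path)
    b = resolve_command(name_b, path)
    if a is None or b is None:
        return False
    if os.path.samefile(a, b):
        # Symlink or hard link: both names point at one file.
        return True
    # A renamed copy is a different file with byte-identical contents.
    return filecmp.cmp(a, b, shallow=False)


if __name__ == "__main__":
    print("qsub ->", resolve_command("qsub"))
    print("qrsh ->", resolve_command("qrsh"))
    print("same binary:", same_binary("qsub", "qrsh"))
```

If same_binary("qsub", "qrsh") reports True, the "qsub" being run is really qrsh, which would explain why the shepherd processes the job as a qlogin job.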

 > 08/08/2007 16:45:40 [0:15106]: error connecting stream socket: No
 > route to host

This is why the job fails; all later error messages are only
follow-on errors. "No route to host" is a system error (EHOSTUNREACH)
returned by the connect(3SOCKET) call (errno 148 on Solaris, 113 on
Linux): the execution host really cannot connect to "munge.softsound.com".
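
The failing connect can be reproduced outside Grid Engine with a few lines of Python run on the execution host. This is only a sketch: probe_connect is a hypothetical helper, and the port passed to it is an assumption, because qrsh listens on a port that changes with every job (33046 in the trace above).

```python
import errno
import socket


def probe_connect(host, port, timeout=5.0):
    """Try a TCP connect; return (True, None) or (False, symbolic error name)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except socket.timeout:
        # No answer at all within the timeout.
        return False, "TIMEOUT"
    except socket.gaierror:
        # The host name could not be resolved.
        return False, "NAME_RESOLUTION_FAILED"
    except OSError as exc:
        # Map the numeric errno to its symbolic name, e.g. EHOSTUNREACH.
        return False, errno.errorcode.get(exc.errno, str(exc.errno))
```

On hal00, a call such as probe_connect("munge.softsound.com", 33046) that returns (False, "EHOSTUNREACH") reproduces the error in the trace. A result of (False, "ECONNREFUSED") would instead mean the host is reachable and merely has nothing listening on that port.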

Regards,
Harald


Yifan Zhang wrote:
> Hi,
> 
> Thanks for helping. ignore_fqdn was set to true. The funny thing is that
> the failure "can't open file /tmp/13851.1.real.q/pid: No such file" does
> not actually stop the job from finishing. I don't know the details
> of GE: does every qsub job start as a qlogin job?
> Also, we have Linux systems of different flavours; could that cause
> problems? Say the submit host is FC and the execution host is SuSE.
> 
> Thanks
> 
> Chansup Byun wrote:
> 
>> Reuti wrote:
>>
>>> On 08.08.2007 at 17:59, Yifan Zhang wrote:
>>>
>>>> We were using SGE 5.3, and it has been doing its job for years. Now we
>>>> decided to switch to GE 6.1. At the beginning it worked, and then for no
>>>> apparent reason every host gets set to error state automatically. The
>>>> job finishes fine. It is a qsub job, not qrsh. Please help if you know
>>>> what is causing this. A typical error message is as
>>>> follows:
>>>>
>>>> Job 13851 caused action: Queue "real.q at hal00.softsound.com" set to 
>>>> ERROR
>>>> User        = ajr
>>>> Queue       = real.q at hal00.softsound.com
>>>> Host        = hal00.softsound.com
>>>> Start Time  = <unknown>
>>>> End Time    = <unknown>
>>>> failed before job:08/08/2007 16:45:40 [0:15106]: can't open file 
>>>> /tmp/13851.1.real.q/pid: No such file or directory
>>>> Shepherd trace:
>>>> 08/08/2007 16:45:37 [16384:15105]: shepherd called with uid = 0, 
>>>> euid = 16384
>>>> 08/08/2007 16:45:37 [16384:15105]: setpgid(15105, 15105) returned 0
>>>> 08/08/2007 16:45:37 [16384:15105]: no prolog script to start
>>>> 08/08/2007 16:45:37 [16384:15105]: forked "job" with pid 15106
>>>> 08/08/2007 16:45:37 [16384:15106]: processing qlogin job
>>>> 08/08/2007 16:45:37 [16384:15106]: pid=15106 pgrp=15106 sid=15106 
>>>> old pgrp=15105 getlogin()=<no login set>
>>>> 08/08/2007 16:45:37 [16384:15105]: child: job - pid: 15106
>>>> 08/08/2007 16:45:37 [16384:15106]: reading passwd information for 
>>>> user 'root'
>>>> 08/08/2007 16:45:37 [16384:15106]: setting limits
>>>> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_CPU setting: (soft 
>>>> 18446744073709551615 hard 18446744073709551615) resulting: (soft 
>>>> 18446744073709551615 hard 18446744073709551615)
>>>> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_FSIZE setting: (soft 
>>>> 18446744073709551615 hard 18446744073709551615) resulting: (soft 
>>>> 18446744073709551615 hard 18446744073709551615)
>>>> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_DATA setting: (soft 
>>>> 18446744073709551615 hard 18446744073709551615) resulting: (soft 
>>>> 18446744073709551615 hard 18446744073709551615)
>>>> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_STACK setting: (soft 
>>>> 18446744073709551615 hard 18446744073709551615) resulting: (soft 
>>>> 18446744073709551615 hard 18446744073709551615)
>>>> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_CORE setting: (soft 
>>>> 18446744073709551615 hard 18446744073709551615) resulting: (soft 
>>>> 18446744073709551615 hard 18446744073709551615)
>>>> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_VMEM/RLIMIT_AS setting: 
>>>> (soft 18446744073709551615 hard 18446744073709551615) resulting: 
>>>> (soft 18446744073709551615 hard 18446744073709551615)
>>>> 08/08/2007 16:45:37 [16384:15106]: RLIMIT_RSS setting: (soft 
>>>> 18446744073709551615 hard 18446744073709551615) resulting: (soft 
>>>> 18446744073709551615 hard 18446744073709551615)
>>>> 08/08/2007 16:45:37 [16384:15106]: setting environment
>>>> 08/08/2007 16:45:37 [16384:15106]: Initializing error file
>>>> 08/08/2007 16:45:37 [16384:15106]: switching to intermediate/target 
>>>> user
>>>> 08/08/2007 16:45:37 [16519:15106]: closing all filedescriptors
>>>> 08/08/2007 16:45:37 [16519:15106]: further messages are in "error" 
>>>> and "trace"
>>>> 08/08/2007 16:45:37 [0:15106]: now running with uid=0, euid=0
>>>> 08/08/2007 16:45:37 [0:15106]: start qlogin
>>>> 08/08/2007 16:45:37 [0:15106]: calling 
>>>> qlogin_starter(/home/sound0/softsound/GridEngine/ssqueue/spool/hal00/active_jobs/13851.1, 
>>>> /usr/local/softsound/GridEngine/utilbin/lx24-amd64/rshd -l);
>>>> 08/08/2007 16:45:37 [0:15106]: uid = 0, euid = 0, gid = 0, egid = 0
>>>> 08/08/2007 16:45:37 [0:15106]: using sfd 1
>>>> 08/08/2007 16:45:37 [0:15106]: bound to port 52903
>>>> 08/08/2007 16:45:37 [0:15106]: write_to_qrsh - data = 
>>>> 0:52903:/usr/local/softsound/GridEngine/utilbin/lx24-amd64:/home/sound0/softsound/GridEngine/ssqueue/spool/hal00/active_jobs/13851.1:hal00.softsound.com 
>>>>
>>>> 08/08/2007 16:45:37 [0:15106]: write_to_qrsh - address = 
>>>> munge.softsound.com:33046
>>>> 08/08/2007 16:45:37 [0:15106]: write_to_qrsh - host = 
>>>> munge.softsound.com, port = 33046
>>>
>>>
>>> Your qmaster is hal00 and you want to access munge.softsound.com? 
>>> Is there something special about munge? Does it show up in qhost? Does
>>> this node have more than one name and/or network card?
>>
>>
>> hal00 should be an execution host.
>> munge.softsound.com should be a submit host.
>>
>>>
>>> -- Reuti
>>>
>>>
>>>> 08/08/2007 16:45:40 [0:15106]: error connecting stream socket: No 
>>>> route to host
>>>> 08/08/2007 16:45:40 [0:15106]: communication with qrsh failed
>>>> 08/08/2007 16:45:40 [0:15106]: forked "job" with pid 0
>>>> 08/08/2007 16:45:40 [0:15106]: child: job - pid: 0
>>>> 08/08/2007 16:45:40 [0:15106]: wait3 returned -1
>>>> 08/08/2007 16:45:40 [0:15106]: can't open file 
>>>> /tmp/13851.1.real.q/pid: No such file or directory
>>>> 08/08/2007 16:45:40 [0:15106]: write_to_qrsh - data = 1:can't open 
>>>> file /tmp/13851.1.real.q/pid: No such file or directory
>>>> 08/08/2007 16:45:40 [0:15106]: write_to_qrsh - address = 
>>>> munge.softsound.com
>>>> 08/08/2007 16:45:40 [0:15106]: illegal value for qrsh_control_port: 
>>>> "munge.softsound.com". Should be host:port
>>>
>>
>> How is ignore_fqdn set?
>> Could the FQDN hostname have confused the SGE shepherd?
>>
>> Regards,
>>
>> - Chansup
>>
>>
>>>> 08/08/2007 16:45:40 [16384:15105]: wait3 returned 15106 (status: 
>>>> 2816; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 11)
>>>> 08/08/2007 16:45:40 [16384:15105]: job exited with exit status 11
>>>> 08/08/2007 16:45:40 [16384:15105]: reaped "job" with pid 15106
>>>> 08/08/2007 16:45:40 [16384:15105]: job exited not due to signal
>>>> 08/08/2007 16:45:40 [16384:15105]: job exited with status 11
>>>> 08/08/2007 16:45:40 [0:15105]: can't open file 
>>>> /tmp/13851.1.real.q/pid: No such file or directory
>>>> 08/08/2007 16:45:40 [0:15105]: write_to_qrsh - data = 1:can't open 
>>>> file /tmp/13851.1.real.q/pid: No such file or directory
>>>> 08/08/2007 16:45:40 [0:15105]: write_to_qrsh - address = 
>>>> munge.softsound.com:33046
>>>> 08/08/2007 16:45:40 [0:15105]: write_to_qrsh - host = 
>>>> munge.softsound.com, port = 33046
>>>> 08/08/2007 16:45:43 [0:15105]: error connecting stream socket: No 
>>>> route to host
>>>>
>>>> Shepherd error:
>>>> 08/08/2007 16:45:40 [0:15106]: can't open file 
>>>> /tmp/13851.1.real.q/pid: No such file or directory
>>>> 08/08/2007 16:45:40 [0:15105]: can't open file 
>>>> /tmp/13851.1.real.q/pid: No such file or directory
>>>>
>>>>
>>>> Thank you very much!
>>>>
>>>> -- 
>>>> /*
>>>> * Yifan Zhang
>>>> *
>>>> * Softsound
>>>> * +44 (0)1223 448 097
>>>> */
>>>>
>>>> "Autonomy's search technology is becoming a de facto standard for 
>>>> companies" - FT, 4th July 07
>>>>
>>>> Cream is a configuration of the Vim text editor that consists of a 
>>>> set of scripts which can be run within Vim to make it behave more 
>>>> like an editor now common to most personal computers which conform 
>>>> to the Common User Access standards of interface and operability.
>>>>
>>>> The information contained in this message is for the intended 
>>>> addressee only and may contain confidential and/or privileged  
>>>> information.  If you are not the intended addressee, please delete 
>>>> this message and notify the sender, and do not copy or distribute 
>>>> this message or disclose its contents to anyone.  Any views or 
>>>> opinions expressed in this message are those of the author and do 
>>>> not necessarily represent those of Autonomy Systems Limited or of 
>>>> any of its associated companies.  No reliance may be placed on this 
>>>> message without written confirmation from an authorised 
>>>> representative of the company.  Autonomy Systems Limited, Registered 
>>>> Office:  Cambridge Business Park, Cowley Road, Cambridge CB3 0WZ, 
>>>> Registered Number 03063054.
>>>>
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>
>>>
>>>
>>
>>
>>
>>
> 
> 


-- 
Sun Microsystems GmbH         Harald Pollinger
Dr.-Leo-Ritter-Str. 7         N1 Grid Engine Engineering
D-93049 Regensburg            Phone: +49 (0)941 3075-209  (x60209)
Germany                       Fax: +49 (0)941 3075-222  (x60222)
http://www.sun.com/gridware
mailto:harald.pollinger at sun.com
Registered office: Sun Microsystems GmbH, Sonnenallee 1,
D-85551 Kirchheim-Heimstetten
Commercial register, Amtsgericht Muenchen: HRB 161028
Managing directors: Wolfgang Engels, Dr. Roland Boemer
Chairman of the supervisory board: Martin Haering




