[GE users] drmaa return value for getJobProgramStatus

Daniel Templeton Dan.Templeton at Sun.COM
Tue Jul 3 19:28:17 BST 2007


Ryan,

OK.  I see where your problem is.  Sometimes the function of the DRMAA 
session can be a little confusing.  According to the DRMAA standard, it 
is only possible to wait() for a job within the same session in which it 
was submitted.  In your code, you submit the job, exit the session, then 
try to wait().  Exiting the session unregisters the DRMAA client for job 
events from that session.  The wait() call is dependent on those events, 
so unregistering the client prevents it from working.  Same for 
synchronize().

getJobProgramStatus() and control(), on the other hand, are allowed by 
the DRMAA standard to function on jobs outside of the scope of the 
session.  Because they are synchronous RPC calls, they can work without 
the need for an event client.  (Actually, Issue 1485 says that 
getJobProgramStatus() does *not* work with jobs outside the session, but 
your results and my review of the source code say otherwise.  I'll have 
to see if we can figure out when 1485 was addressed and close it.)

So, there are two solutions to your problem.  The first is simply not to 
exit() the session until you are done calling wait().  The second is to 
use the new reconnectable sessions feature to reconnect to the session 
that you just closed.
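
The session rules above can be illustrated with a small, self-contained
toy model.  This is only a sketch: ToySession and SessionScopeDemo are
hypothetical names, the in-memory maps stand in for the qmaster and the
per-session job list, and none of it is the real org.ggf.drmaa API (the
wait call is named waitFor here).  It assumes, as described above, that
status queries work from any open session, that waiting is limited to the
session that submitted the job, and that a closed session can be rejoined
by handing its saved contact string back to init().

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy stand-in for a DRMAA session; NOT the org.ggf.drmaa API. */
class ToySession {
    // Simulated qmaster: every job it knows about, shared by all sessions.
    private static final Map<String, String> QMASTER = new HashMap<>();
    // Jobs submitted under each session contact; kept after exit() so the
    // session can be rejoined later (assumed reconnect behaviour).
    private static final Map<String, Set<String>> SESSION_JOBS = new HashMap<>();
    private static int nextJobId = 1;

    private String contact; // null while no session is open

    /** Opens a new session (null) or rejoins the session named by contact. */
    void init(String contactOrNull) {
        contact = (contactOrNull != null)
                ? contactOrNull
                : "session=" + (SESSION_JOBS.size() + 1);
        SESSION_JOBS.computeIfAbsent(contact, k -> new HashSet<>());
    }

    String getContact() { return contact; }

    /** Closes the session; the client no longer receives its job events. */
    void exit() { contact = null; }

    String runJob() {
        requireOpen();
        String id = String.valueOf(nextJobId++);
        QMASTER.put(id, "RUNNING");
        SESSION_JOBS.get(contact).add(id);
        return id;
    }

    /** Synchronous query: answers for any job the qmaster still knows. */
    String getJobProgramStatus(String jobId) {
        requireOpen();
        String state = QMASTER.get(jobId);
        if (state == null) {
            throw new IllegalStateException("job does not exist: " + jobId);
        }
        return state;
    }

    /** Event based: only answers for jobs submitted in the open session. */
    String waitFor(String jobId) {
        requireOpen();
        if (!SESSION_JOBS.get(contact).contains(jobId)) {
            throw new IllegalStateException("job does not exist: " + jobId);
        }
        // Once a wait succeeds, the job is gone from session and qmaster.
        SESSION_JOBS.get(contact).remove(jobId);
        QMASTER.remove(jobId);
        return "exited";
    }

    private void requireOpen() {
        if (contact == null) throw new IllegalStateException("no active session");
    }
}

class SessionScopeDemo {
    /** The reported problem: submit, exit, open a new session, then wait. */
    static boolean waitFailsInNewSession() {
        ToySession s = new ToySession();
        s.init(null);
        String job = s.runJob();
        s.exit();
        s.init(null); // brand-new session, not the submitting one
        boolean statusWorks = "RUNNING".equals(s.getJobProgramStatus(job));
        try {
            s.waitFor(job);
            return false;
        } catch (IllegalStateException e) {
            return statusWorks; // status answered, wait did not
        }
    }

    /** Solution 1: do not exit the session before waiting. */
    static String waitInSameSession() {
        ToySession s = new ToySession();
        s.init(null);
        String job = s.runJob();
        return s.waitFor(job);
    }

    /** Solution 2: save the contact string and rejoin the session. */
    static String waitAfterReconnect() {
        ToySession s = new ToySession();
        s.init(null);
        String contact = s.getContact();
        String job = s.runJob();
        s.exit();
        s.init(contact); // rejoin the session that submitted the job
        return s.waitFor(job);
    }
}
```

In this model, waitFailsInNewSession() shows the status query succeeding
while the wait is rejected, and both waitInSameSession() and
waitAfterReconnect() get the job's exit info back.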

Educational reading:

http://blogs.sun.com/templedf/entry/drmaa_internals1
http://blogs.sun.com/templedf/entry/good_drmaa_news

Daniel

Ryan Golhar wrote:
> Dan,
>
> I'm still stuck.  I've stripped down my code to the basics to test this:
>
> 1.  I initialize the grid engine
> 2.  I submit the job and get the jobid
> 3.  I poll the status of the jobid using getJobProgramStatus.  The first
> time I call getJobProgramStatus, it returns QUEUED_ACTIVE.  
>
> If I call Session.wait(..) immediately after, I get an Exception that the
> job does not exist.  However qstat reports the job is queued.  
>
> If I wait for a few seconds, I see the job start running (using qstat).
> The program then calls getJobProgramStatus, which returns RUNNING.  I then
> call Session.wait and get the same exception.
>
> I'm not sure what is wrong.  If the program status is RUNNING, then I would
> expect Session.wait to work.  I've attached the java code I'm using.  Any
> help would be appreciated.  Thanks,
>
> Ryan
>
>
> -----Original Message-----
> From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM] 
> Sent: Saturday, June 30, 2007 4:21 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] drmaa return value for getJobProgramStatus
>
>
> Actually, I guess there is a third option.  In the DRMAA Hands-on Lab I 
> did for JavaOne, I just had the polling thread stop asking about a job 
> once it got the InvalidJobException, with the assumption that the job 
> was now in the hands of the wait thread.
>
> Daniel
>
> Daniel Templeton wrote:
>> Gah!  My apologies.  Please ignore my previous two emails.  That's
>> what I get for writing emails at midnight.  I reread my emails this 
>> morning, and I don't know what I was thinking. :(
>>
>> The implementation *does* already do what I was suggesting.  It uses
>> the local job info cache to return the job state if the qmaster 
>> doesn't remember the job.  The reason why you're seeing the 
>> InvalidJobException is that you also have a thread doing a wait(ANY) 
>> call in a loop.  (Am I right?)  Once a wait() call has succeeded for a 
>> job, that job no longer exists.  Period.
>>
>> There are two ways to deal with the problem.  Either have the wait
>> thread notify the polling thread once a job has ended, or build the 
>> wait() call into the polling thread after a job's state is FINISHED or 
>> ERROR.
>>
>> Sorry for the confusion.
>>
>> Daniel
>>
>> Ryan Golhar wrote:
>>> Thanks Daniel.  It seems a bit odd that there is a Session.DONE but it
>>> will never get used.  If the DRMAA implementation does have information,
>>> will it always work or is it just because of the session instance?  If
>>> the DRMAA implementation will always have information on completed jobs,
>>> then it makes sense to use that information, but if it's not guaranteed,
>>> then I don't know if that is the best solution (in my opinion).  In
>>> either case, I think it would be good to file it as an RFE.  How do I do
>>> that?
>>>
>>> Ryan
>>>
>>>
>>> -----Original Message-----
>>> From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM]
>>> Sent: Saturday, June 30, 2007 3:11 AM
>>> To: users at gridengine.sunsource.net
>>> Subject: Re: [GE users] drmaa return value for getJobProgramStatus
>>>
>>>
>>> Ryan,
>>>
>>> The exception happens because the qmaster disavows all knowledge of
>>> finished jobs.  (Not exactly, but close enough for this discussion.)  
>>> Since the DRMAA implementation actually does have the information 
>>> about the job on hand, though, it really would make sense for the 
>>> getJobProgramStatus() method to use that information in the case of 
>>> finished jobs instead of only relying on the qmaster.  If you'd like 
>>> to file that as an RFE, that would be helpful.
>>>
>>> Thanks,
>>> Daniel
>>>
>>> Daniel Templeton wrote:
>>>> Ryan,
>>>>
>>>> That is indeed how the implementation works.  To confirm that the 
>>>> InvalidJobException from getJobProgramStatus() means that the job 
>>>> has ended, wait() for the job with the timeout set to 
>>>> Session.TIMEOUT_NO_WAIT.  If the job has finished, the wait() call 
>>>> will return its exit info, including why/how it exited.  If the job 
>>>> simply doesn't exist for some reason, you'll get another 
>>>> InvalidJobException.
>>>>
>>>> Daniel
>>>>
>>>> Ryan Golhar wrote:
>>>>> I'm able to successfully submit a job through DRMAA to the
>>>>> appropriate queue and set other settings.  If the job is running
>>>>> and I call getJobProgramStatus (Java), I get a return value of
>>>>> Session.RUNNING (32), which is correct.  Once the job completes and I
>>>>> call getJobProgramStatus, I get an exception about the job id not
>>>>> being valid:
>>>>>
>>>>> org.ggf.drmaa.InvalidJobException: The job specified by the 'jobid' does not exist.
>>>>>         at com.sun.grid.drmaa.SessionImpl.nativeGetJobProgramStatus(Native Method)
>>>>>         at com.sun.grid.drmaa.SessionImpl.getJobProgramStatus(SessionImpl.java:213)
>>>>>         at org.umdnj.JBLAST.LocalSGEBLAST.exeGet(LocalSGEBLAST.java:82)
>>>>>         at org.umdnj.JBLAST.BlastResultThread.run(BlastResultThread.java:62)
>>>>>
>>>>> I can interpret this exception as the job has completed, however I
>>>>> don't think this is the correct way of doing things as I can't tell if
>>>>> the job completed successfully or if something else happened.  Am I
>>>>> missing something?
>>>>>
>>>>> Ryan
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net




