URGENT Issue with Job Queues

CNC Guy

Well Known Member
Folks,

We have been experiencing a very weird and critical issue in our setup since today. We have a bunch of job queues (some single-threaded and some multi-threaded), but even when there is only 1 job running (in P status) in one of the single-threaded queues, jobs submitted to the other queues go to S (in queue) and the rest of the jobs sit in W. So even though there isn't a job running in a given job queue, its jobs are stuck in S and W status. Even QBATCH has 4 jobs in queue (S) and the next jobs in W, whereas there is no job in P status in QBATCH.

We bounced the JDE services once and it worked for 15 minutes, but then we started getting the same weird issue again.

Please advise what we can do now to resolve this and what might be causing it.

Thanks,
CNC Guy
8.11 SP1
UNIX Solaris
Oracle 10G
OAS on windows
 
Can you verify at that point when the queues are messed up that there are no processes from prior reports still running in the background on the logic server?
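
One way to run this check on a Unix logic server is to filter the process list for batch processes. The following is only a minimal sketch: it assumes the report processes show up with `runbatch` in their command line (the name used later in this thread) and that the input is the text printed by `ps -e -o pid,args`. The function name is made up for illustration.

```python
def find_batch_processes(ps_output, name="runbatch"):
    """Return (pid, command) pairs whose command line mentions `name`.

    `ps_output` is expected to be the text printed by `ps -e -o pid,args`.
    """
    matches = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        # Skip the header row and anything that does not start with a PID.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pid, command = int(parts[0]), parts[1]
        if name in command:
            matches.append((pid, command))
    return matches
```

Comparing the PIDs it returns against the jobs shown in P status would reveal any leftover processes from prior reports.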
 
Please check the JDE log; there might be no space left in the JDE812 database, and its log may need to be truncated.

Regards,

Jawed Akhtar
[email protected]


 
Well Dan, this is what happens. We've even restarted services, and when we start the scheduler it sends jobs, which get processed. Then users start submitting, and for some time everything looks fine.

But then the jobs start behaving erratically. Jobs go to W; from there some go to P, but some go to S and stay there and hold up the queues. Sometimes there are no jobs in either S or P (in the same queue) but there are still jobs in W. So somewhere, some component that pushes jobs from W to P isn't working properly. Also, once this starts happening, we see that even the scheduler stops pushing new jobs. When we tried to stop it, it threw an error stating it was not able to check the status, so something is wrong somewhere.
We also noted some jobs in mutexes in SAW. If we go to our Unix box and list the jobs, we see more process IDs than the ones we see in P; the jobs in S are also displayed as processes on Unix.

This is getting very alarming here. We have even contacted Oracle but have yet to hear back from them, so all help would be greatly appreciated.

Thanks,
CNC Guy
 
I've had this explained to me in the past. The way it works is that the queue kernel moves a job to 'S', then launches a runbatch, which does the spec merge for the version specs (data selection and sequencing); then the runbatch moves it to 'P' to start processing. All jobs spend at least a tiny amount of time in S before getting to P.

So yes, jobs in S are involved in actual runbatch processes.

If it is locked there, maybe something weird is happening that is causing the report specs to be locked so the other jobs are waiting in 'S' to write to specs? Did someone deploy a package on you?
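
Given that flow, a queue with a backlog in W or S and nothing in P is only suspicious if it stays that way. A rough sketch of how such queues could be flagged from a snapshot of job statuses follows. The input format (a list of queue name and status pairs) and the function name are assumptions for illustration, not anything JDE provides.

```python
from collections import defaultdict


def find_stalled_queues(jobs):
    """Return sorted queue names that have jobs in W or S but none in P.

    `jobs` is an iterable of (queue_name, status) pairs, where status is
    "W" (waiting), "S" (in queue) or "P" (processing).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for queue, status in jobs:
        counts[queue][status] += 1

    stalled = []
    for queue, by_status in counts.items():
        backlog = by_status["W"] + by_status["S"]
        if backlog > 0 and by_status["P"] == 0:
            stalled.append(queue)
    return sorted(stalled)
```

Since every job passes through S briefly, a single snapshot can give false positives; a queue that is flagged across several snapshots a minute or so apart is the more meaningful signal.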
 
I am surprised not to see any suggestions here from our JDE family.
It was a very tough last 4 days for us. The update is that we tried almost everything we could (logged a Sev1 with Oracle; consultants from Oracle worked with us from Denver, UK, Singapore and India), but there were not too many suggestions.

Since we could not let the business suffer for that long, we redirected the user-submitted as well as the scheduled jobs (this was a bit tricky) to one of our logic servers, so currently the old batch server is kind of just there, since it hosts our DB as well. Oracle is still testing something and we are waiting to hear what they say. The last resort is to deploy the last full package and the subsequent update packages to that server.

In fact, we had another issue that started happening when we redirected jobs to the new server: people weren't able to open the PDF from web, although it worked from a fat client. We have 4 web servers, so the issue wasn't on a web server, and users were able to open PDFs on other servers. We finally resolved it by reindexing the job master table for that new server, and it worked!

But as I said earlier the original issue is still unresolved and we need to do something about it.
 
Guys,

Since there aren't any replies, I will take this further.

As said earlier, we tried almost everything, and even though we still have a Sev 1 running with Oracle, there aren't any recommendations coming from them!

Our last hope was to build a full package and deploy it to this server. But we tried a test package and the deployment hangs: it does not complete, and the package deployment report stays running. The build was fine but the deployment isn't happening. This means we cannot even try deploying a full package to see if it resolves our issue.

Can somebody here advise what we can do now?

Thanks,
CNC Guy
 
Well, as a matter of fact, you can. Manually. Just copy your bin & specs folders from the package folder over into the pathcode folder on the same server, overwriting the old files, while the server is shut down. Plus, delete the usual DD files from the destination pathcode. Does this make sense?
 
Hi,

Let me try and deal with each issue in turn.

The job queue issue would seem to indicate that your queue kernel is closing down periodically for some reason. There is no need to stop and start the services to get the jobs submitting again; all you need to do is go to the queue setup application (P986130), highlight each queue that is not processing, and then click on the refresh queue row exit hyperlink. This should restart the queue. This does not resolve your underlying issue (I will leave that to Oracle support) but should keep you running.

Secondly, on the package deploy hanging: this is normally because it cannot get back a response from all kernels that they have been placed into a "suspend / hold mode" in order to allow the package to deploy. Review the last entries in your SvrPkgBuild.log in the package directory on your enterprise / logic server(s) and you will see that the last messages are :-

Mutex resource created. Attempting to lock kernels via jdenet broadcast message.
Kernels locked attempting to get an exclusive WRITE lock.

This normally means that JDE is trying to lock kernels and cannot, either because one or more UBEs are still running (always check submitted reports before attempting a deploy to ensure that no UBEs are running, and place queues on hold so no new jobs can start before you deploy) OR because it is trying to stop kernels that have blown (normally call object kernels) that it thinks still exist BUT the server does not recognise.

Go to Server Manager and remove ALL zombie kernels before attempting to deploy. This is not foolproof but may help.

If you do all this before a deploy and it still hangs, then the only resolution is to restart the server to kill off the rogue kernel processes.

You may want to change the following parameter in your server jde.ini:-

[DEBUG]
LogErrors=1

This will create .log files in your server log directory and an empty debug file BUT at least you can then review the log files and see which kernels are giving errors.

Hope this helps
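
To confirm what the server is actually configured with, the `[DEBUG]` section can be read back from the file. This is a small sketch using Python's standard `configparser`; it assumes the jde.ini is plain INI text that this parser can handle, and the function name is hypothetical.

```python
import configparser


def read_debug_settings(ini_text):
    """Return the key/value pairs of the [DEBUG] section, or {} if absent."""
    parser = configparser.ConfigParser(
        strict=False,         # tolerate repeated sections or keys
        interpolation=None,   # treat '%' in values literally
        allow_no_value=True,  # tolerate keys without '='
    )
    parser.optionxform = str  # keep key case as written, e.g. "LogErrors"
    parser.read_string(ini_text)
    if not parser.has_section("DEBUG"):
        return {}
    return dict(parser.items("DEBUG"))
```

For example, passing the contents of the server jde.ini and checking that the result has `LogErrors` equal to `"1"` would confirm the change suggested above took effect.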
 
Thanks for the replies Terry and Alex but as I said earlier we have tried most of the stuff here. Let me write the current situation:

- Our primary batch server was the one with the problem, so we routed all batch jobs (user-submitted as well as scheduler jobs, by changing the *PUBLIC mapping for user-submitted jobs and the individual scheduler entries for existing scheduler jobs) to one of our logic servers (we had 2 logic servers executing BSFNs), and currently there is a workaround in place.

But the original server is still an issue. We have deactivated all queues on that server except QBATCH, restarted it several times, deleted the DD & GLTBL specs and restarted again, but the issue is still there. Every time we try to submit a report from web to that server it goes to S and stays there; at times we have even seen JDE crash. So currently we have restarted JDE on that server, and there isn't any queue kernel running on it as of now.

We thought maybe deploying a full package would resolve it, but when we tried deploying a test package it hung. Here's the log. Are there any issues with TAM specs?
I am also attaching a set of logs which we sent to Oracle:
1. Two sets of ES logs for two JDE crashes.
2. Two core dumps for the crashes.

--------------
Thu Jul 23 19:59:51 - Server Package Deploy Log
Thu Jul 23 19:59:51 - Package Name: PDTESTING
Thu Jul 23 19:59:51 - Package Description: Update Package
Thu Jul 23 19:59:51 - Parent Package: PD811FH
Thu Jul 23 19:59:51 - Path Code: PD811
Thu Jul 23 19:59:51 - Server Name: NZNSFN42
Thu Jul 23 19:59:51 - Process ID\Thread ID: 5760\928
Thu Jul 23 19:59:51 -
Thu Jul 23 19:59:51 - ----------------------------------------------------------
Thu Jul 23 19:59:51 - Initialize detail structure and working folders on server.
Thu Jul 23 19:59:52 - Initializing internal structures for server nznsfn42.
Thu Jul 23 19:59:52 - Message received from server.
Thu Jul 23 19:59:52 - Server initialized.
Thu Jul 23 19:59:52 - Detail server structure initialized.
Thu Jul 23 19:59:52 -
Thu Jul 23 19:59:52 - Setting up server nznsfn42 for server package deploy.
Thu Jul 23 19:59:52 - Sending lock message to server.
Thu Jul 23 19:59:52 - Server nznsfn42 locked and ready for package deploy.
Thu Jul 23 19:59:52 - Deploying TAM Package to Server nznsfn42.
Thu Jul 23 19:59:52 - Package was not compressed.
Thu Jul 23 20:17:51 -

---------------
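
In the log above, the last two entries are about 18 minutes apart (19:59:52 to 20:17:51), right after the TAM package deploy starts. A rough sketch of how such a stall could be located automatically in a longer deploy log is shown below. It assumes every entry starts with a timestamp in the `Thu Jul 23 19:59:52 - message` form seen here, with English day and month names. The log lines carry no year, so `year` is an assumed parameter, and the function name is made up for illustration.

```python
import re
from datetime import datetime

# Timestamp such as "Thu Jul 23 19:59:52", then " -" and an optional message.
LINE_RE = re.compile(r"^(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) - ?(.*)$")


def largest_gap(log_text, year=2009):
    """Return (seconds, message_before_gap) for the longest pause between entries."""
    entries = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # separators and other non-timestamped lines
        stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %a %b %d %H:%M:%S")
        entries.append((stamp, match.group(2).strip()))

    best = (0.0, None)
    for (t0, msg0), (t1, _) in zip(entries, entries[1:]):
        gap = (t1 - t0).total_seconds()
        if gap > best[0]:
            best = (gap, msg0)
    return best
```

The message returned is the last step logged before the pause, which is usually the step worth investigating first.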

Any assistance would be greatly appreciated folks.

But yep we have not tried the bin32 and spec folder copy. We are on unix. Can we still do it? How?

Thanks,
CNC Guy
 

Attachments

  • 148679-Files for Oracle.zip
    565.3 KB · Views: 106
Try deploying the last full build you had on this machine (before this build) and see if that hangs also. If it does it is probably not spec / tam related.

If that fails then you will need to watch the processes that are executing and see which ones are running and whether they hang.

You may want to switch full logging on as well as this is a stand alone server and not being used by the user base at the moment.

If there is anything else pointing to this server (i.e. security server) then you may want to repoint that as well.

You may also want to check the permissions on the specs etc in the pathcodes on this server as well to ensure they are set correctly and can be updated - compare against the logic server.
 
Hi, we had an issue once where batch jobs would not run. We found a message in a log file (AS400) which advised of duplicate keys on F986110. Eventually we took the decision to copy and then clear the Job Control Master (F986110), and jobs started to run again. We then reinserted the old data back into this table and the system has carried on without fault for more than 12 months.
 
Yep, we've tried deploying old packages which were successfully deployed to this server earlier, but they hung too.

Yep that server is still used as a security server in some fat clients but i don't think that is any problem.

On the permissions, it was working until last Friday, so I don't suspect a permission issue.

Another part of the problem is that when we restart this server and then submit a job from a fat client, it goes through successfully, but when the same job is then submitted from a web client it goes to S and stays there, and subsequently all jobs (whether from fat client or web) go to S.

In fact, we had initially noted that even the scheduler jobs were going through fine, but once the web jobs came in they kind of stuck the queue. It was intermittent, though: some jobs went to P and completed, while others went to S and stayed there.

I am just curious to know whether this can be a tam related issue since the logs somehow say it is locked at TAM specs.. can we do a TAMFTP or something?

Thanks,
CNC Guy
 
Might be a dumb question but something had to trigger this event.

- Did you update the web server prior to the crash (not running the correct version of OAS, Java, etc.)?

- Do you have multiple web servers but only issues with 1 (points to an issue with a specific web server)?

- Drive space issue?

- Log space issue?

- Does a fat client job submitted through a 'J' environment cause the issue?

- Redirect the web server to submit jobs to one of the app servers and see if it locks up.

- Windows/Java automatic updates running?

You need to isolate the issue. Trying the different combinations of setup should help point to the problem. If nothing seems to work, I would be getting a VM web server up and running to see what it does with a fresh install.

Sounds to me like the web server -> enterprise server handoff is crashing the job kernel. Prove that and you really narrow down the problem to the web server itself.
 
Is the Oracle Database on a SAN or using disks on the Server?

When the job(s) start going into "S" status -- can you check the following --
vmstat (check the CPU "id" column)
& iostat (disk "wait" or cpu "wt" columns).

At one of my clients (Oracle / Solaris / XE) we had the same issue -- it was tracked down to a SAN LUN contention issue. The server was not receiving the data from the SAN in a timely manner.

This was an EMC SAN, and the problem had to do with I/O contention with another process using data from the same LUN. The issue was resolved by EMC.
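
The vmstat check suggested above can be sketched as a small parser that pulls out the CPU "id" (idle) column, so it can be watched over time while jobs sit in "S". This is only an illustration: it assumes the text comes from something like `vmstat 5 12`, that the column header row contains a field literally named `id`, and that data rows are purely numeric. The function name is hypothetical.

```python
def idle_percentages(vmstat_output):
    """Return the values of the CPU "id" column from vmstat output."""
    idle_index = None
    values = []
    for line in vmstat_output.splitlines():
        fields = line.split()
        if "id" in fields:
            # Column header row (vmstat repeats it periodically).
            idle_index = fields.index("id")
            continue
        if idle_index is None or len(fields) <= idle_index:
            continue
        if not all(field.isdigit() for field in fields):
            continue  # group header such as "kthr memory page ..."
        values.append(int(fields[idle_index]))
    return values
```

The same approach could be applied to the iostat wait columns, with the column name changed to match the header on your platform.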
 
Is debug set to "File" on your enterprise server? There could be excessive logging, which keeps one job from finishing while all the others wait in the queue.

Sriram
 