Accumulation of outstanding requests on callobjects processes of enterprise servers

antoine_mpo

Hi list.

For a couple of months now, we have been hitting the following problem more and more often on our production platform
(JDE OneWorld Xe SP23_Q1):

Some UBEs and some interactive transactions start to get stuck (it can be only UBEs, only interactive, or both).

When I look at web SAW for each JAS application, I see an increasing number of "busy" and timeouts.
In the jas.log of the JAS application where the timeouts occur, I can see timeout errors with the process ID on the enterprise server, and a message that the server may be unavailable.
In web SAW for the enterprise servers, one or two call object processes start to show several "outstanding requests".
Of course, from time to time the outstanding requests drop back to zero, once all the requests have timed out.
But the problem is still there, and it starts to appear on other call object processes.
If we let the situation go on, more and more processes are affected (but some transactions and
UBEs still run correctly, so not all the processes are ghosts).
Looking at the processes on the server, they are still there, but they seem to have no activity.
Looking at the jde.log of these "ghost" processes, I can't see anything.
Looking at the jde.log of the related jdenet process: nothing.

We tried several times to kill a ghost process, but it's useless, because the problem then occurs on more and more others.
The only solution we have: stop and restart the whole platform.

Lately, we have hit the issue several times per week.
We have even had it twice in less than 24 hours.

Some technical information on the platform:
All servers are Wintel, running Windows 2003.
The database (Oracle 10.2.0.3) is on a dedicated server.
We use 2 enterprise servers (one is a BSFN-only server, the other is UBE + BSFN).
We use 2 web servers (WebSphere 6.0.2.13) and 6 JAS applications (3 on each server), using horizontal and vertical WebSphere cloning.
The web and enterprise servers are virtual servers, on 2 physical VMware ESX machines.

Any idea or suggestion would be appreciated.

Thanks for your help.
 
Re: Accumulation of outstanding requests on callobjects processes of enterprise servers

Hi,

1. How many users do you have?
2. Could you please post your JDE.INI, JAS.INI and JDBJ.INI
(do not post your passwords)?
3. Did it start right after installing SP23Q1, or has
this problem been around for a long time?
4. Have you checked CPU, RAM, network and disk activity
on your servers?
 
Re: Accumulation of outstanding requests on callobjects processes of enterprise servers

Hi Sebastian,

Here are some answers to your questions :
1- During the day, there are something like 15-20 user connections at a time per JAS application. We have 6 applications, so we have about 100 users.
2- I'll attach the jde.ini and jas.ini (there's no jdbj in Xe) when I'm back at my office.
3- We upgraded our platform (SP20 to SP23, Oracle 8 to Oracle 10, Windows 2000 to Windows 2003, new servers, and use of VMware) 2 years ago. Since then, we have had the issue a couple of times. But since February, it has appeared more and more often.
4- Our production platform is outsourced (the servers are on site, but administered remotely), and according to our outsourcing company, there's nothing wrong on the servers (CPU, RAM, ...). No alerts, no overload. But that is according to the indicators that are being monitored. Maybe we're missing something (and I can't check anything, as I don't have access to the servers).

Here is the information I can give you for the moment.

Thanks.
 
Re: Accumulation of outstanding requests on callobjects processes of enterprise servers

Are you taking automatic MS Windows updates on any servers?
 
Re: Accumulation of outstanding requests on callobjects processes of enterprise servers

Hi Alex,

Automatic updates, God no !!!
:)
 
Re: Accumulation of outstanding requests on callobjects processes of enterprise servers

Every time I've seen this in the past it's been the database. A classic symptom is single-threaded CallObject kernels (which they are in Xe) with a stack of requests waiting; these are highlighted red.

The way we got around this was to find out what each user was doing at the time to cause the issue, debug it, find the query, and then liaise with the DBAs to optimise that particular query. Sometimes the DBA was able to give us the query that was taking time. We had all sorts of problems which only really came down to three queries.

With Oracle it is vital that all stats etc. are kept up to date, and Oracle can get it quite badly wrong. We had one once that was caused by having a query running over 10 histogram buckets instead of 1 - whatever that means!

Paul
 
Re: Accumulation of outstanding requests on callobjects processes of enterprise servers

Hi Sebastian,

I'm attaching the jde.ini of both enterprise servers (Q11 is a BSFN server, Q10 is a UBE/BSFN server), and one jas.ini.


Regards,
 

Attachments

  • 145372-ini_files.zip (9.5 KB)

Re: Accumulation of outstanding requests on callobjects processes of enterprise servers

Hi Antoine,

I'm reading your INI files, here are my comments :

JAS.INI

1. I'm surprised by the high number of connections you
open on each JAS instance (up to 110); that may have
a noticeable impact on your DB.

You have 6 JAS instances, so you can have up to 660 connections open.

I suggest reducing those connections to 25 per JAS;
you only have 20 users per JAS, don't you?

2. Are you using the proper Oracle JDBC drivers?
The ones matching your DB and Xe levels?

3. Why did you set DBFetchLimitBeforeWarning=200000?
Isn't that too much? Fetching 200,000 records from DB
without any warning? That will definitely kill DB and
JAS performance
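
To put numbers on point 1, here is a small Python sketch of the worst-case connection count. It uses only the figures quoted in this thread (6 JAS instances, a ceiling of 110 per instance, and the suggested 25). The function and constant names are made up for illustration and are not JAS.INI settings.

```python
# Figures quoted in this thread; the names are illustrative, not INI keys.
JAS_INSTANCES = 6
CURRENT_MAX_PER_JAS = 110
SUGGESTED_MAX_PER_JAS = 25


def worst_case_connections(instances: int, max_per_instance: int) -> int:
    """Upper bound on connections if every instance reaches its ceiling."""
    return instances * max_per_instance


if __name__ == "__main__":
    print(worst_case_connections(JAS_INSTANCES, CURRENT_MAX_PER_JAS))    # 660
    print(worst_case_connections(JAS_INSTANCES, SUGGESTED_MAX_PER_JAS))  # 150
```

This is only an upper bound; how many connections are really open at a given moment depends on the actual load.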

JDE.INI #10

You're having Call Objects contention; have you tried
increasing Call Objects from 20 to, let's say, 30?

JDE.INI #11

Seems OK, but why do you have one server for BSFNs and
another for UBE+BSFNs? Is there anything special there?

Both servers :

Have you set your antivirus to exclude /b7333 folder
and its contents? (most critical : /spec, /bin32, /log
and /PrintQueue folders)
 
Re: Accumulation of outstanding requests on callobjects processes of enterprise servers

Hi Sebastian,

First, thanks a lot for checking the configuration files.

The setup was made by someone from Oracle, when we installed the SP23 platform.

Here are some answers to your questions:

1 - You're probably right. Maybe we should decrease the max number of sessions. But 25 wouldn't be enough, I guess. For instance, as I'm writing this, the sessions are not very well balanced between the JAS applications: I have 46 open sessions on one, 28 on another, and 3 on a third one...
And as far as I know, we wanted to have enough possible sessions per JAS in case a web server goes down.

2 - Yes, we have the proper JDBC drivers (it's something that was checked thoroughly in a previous issue we had with PO losses at runtime, an SR that had been open for a year :)

3 - We set DBFetchLimitBeforeWarning so high because it was very annoying to get the warning message in P31225 when people are checking which WOs they have to process. Several times per day, each department searches for the WOs in a given status range and for a given planner number. As you know, in this screen, the status range criteria are applied after the SQL request, in the record fetch of the grid, to display only the matching rows. If we leave DBFetchLimitBeforeWarning too low (like the default of 2000), users get the warning every 2000 lines fetched in the grid. They can't work that way.
I agree we should do something (for instance, move old WO records to another table), but it's been a long-running discussion in my company...

About our enterprise servers:

Recently, in another SR I opened, the Oracle analyst checked the jde.ini and suggested lowering the call object kernels from 20 to 17, as the current recommendation is one call object kernel per 6 users.
So I don't know what to think about setting 30 kernels. I'm getting confused.
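
As a rough check of the analyst's rule of thumb, the sketch below divides a user count by 6 and rounds up. It assumes the roughly 100 concurrent users mentioned earlier in the thread all count against one server, which may not match how the OCM mappings actually split them. The function and constant names are made up for illustration.

```python
import math

USERS_PER_KERNEL = 6   # rule of thumb quoted by the Oracle analyst
ASSUMED_USERS = 100    # assumption: the approximate total given earlier in the thread


def kernels_needed(users: int, users_per_kernel: int = USERS_PER_KERNEL) -> int:
    """Smallest kernel count that keeps users per kernel at or below the target."""
    return math.ceil(users / users_per_kernel)


if __name__ == "__main__":
    print(kernels_needed(ASSUMED_USERS))  # 17
```

Under that assumption the rule lands on 17, the figure the analyst suggested. It says nothing about whether more kernels would relieve the contention seen here.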


Why do we have a BSFN server and a BSFN+UBE server? Well, it's due to the history of our platform:
At the beginning (when we set up the new SP23 platform), there was one BSFN server (Q11) and one UBE server (Q10). It was nice because we could stop the UBE server if needed without impact on interactive users. But we ran into overload issues on Q11 during our peak activity period of the year. As a consequence, we set up OCM mappings (and reviewed the jde.ini of the servers for an appropriate kernel setup) so that a group of users runs BSFNs on Q10.
It's not a very satisfying solution, but we are planning to add a new (virtual) server, to have 2 BSFN servers and one UBE server. (But it requires a memory upgrade on our ESX servers, and you know what: we had an issue on one of them after adding 8 GB, so we had to remove it from that ESX!! :-( .... Did you say Murphy's law???)

And the last question:
It was a good idea to check the antivirus setup!
I asked our outsourcing company about the enterprise servers, and the result was:
- no exclusions at all on Q11 (BSFN server)
- on Q10 (UBE+BSFN): only B7333\system was excluded

So we just changed it to exclude everything under B7333.
I don't know if it's going to change the world, but it's probably a good thing.
 
Re: Accumulation of outstanding requests on callobjects processes of enterprise servers

Hi Antoine,

I've met quite a few accounts that were used to working with
gigantic grids (>2000 records) from the old FAT client days.

However, that's not the way Web applications are designed
to work.

A Web grid is not an Excel and shouldn't be taken as such.

In fact, no human user is able to read, memorize and
process 2,000 rows at a glance; I've noticed that
most users click on Find, retrieve 1,000 or 100,000
records and just work on 5, 10 or 20 of them.
The rest is just wasted network, JAS and DB time.

You should probably customize that screen to provide
more restrictive queries or some summarized information
to your users.

Please, remember that users are not wrong, it's the way
the interface is presented to them. If you provide them
with a better interface they'll run lighter queries.

On the other hand, excluding the antivirus from scanning
B7333 folders will give you a disk performance boost,
I'm sure your users will notice that change.

Finally, be careful when splitting UBE from BSFN servers,
because UBEs call a lot of BSFNs, so your network should
be very well tuned to provide the necessary throughput.

I agree that separating DB from UBE+BSFN gives better
scalability and sometimes performance too, but I don't
think there are such important gains when separating UBEs
from BSFNs.

Good luck with your changes!
Bonne chance!
 
Re: Accumulation of outstanding requests on callobjects processes of enterprise servers

Hi Sebastian,

I already replied, but I can't see my reply. So I'm doing it again (hoping my answer won't appear twice).

I probably didn't explain the problem of the warning in some interactive applications, like P31225, very well.
I didn't mean to say our users were dealing with thousands of lines. In that case, you're right, I would have asked for some development to work in another way (like developing a UBE with CSV output).
I meant that our production departments check every day (several times per day) which work orders they need to produce. For that, they search in P31225 for the WOs of a particular planner (which identifies their department) and within a particular status range (for instance blank to 20).
That application (and every other application where you can filter on a range of values for a field) is designed as follows:
- The SQL request sent to the DB has no criteria on the status (in our case, the SQL on F4801 is only filtered on the planner number). It therefore returns a large number of rows (thousands, because we have almost all work orders since the start of the ERP).
- In the "grid record is fetched" event, the event rules check whether the status of the row is in the range you want. If not, the row isn't displayed in the grid.

So, from what I remember, the warning (due to the setting in jas.ini) is raised very often because you get a lot of rows from the SQL statement, and not because of the number of rows actually displayed.
That's why I was talking about moving records to another table.
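
The gap between rows fetched and rows displayed can be sketched in a few lines of Python. This is a simplified model of the behaviour described above, not JDE code: the `WorkOrder` class, the function names and the sample data are all invented, and the warning is modelled as firing once per `warn_every` rows fetched.

```python
from dataclasses import dataclass


@dataclass
class WorkOrder:
    number: int
    planner: int
    status: str


def fetch_for_planner(table, planner):
    """Stand-in for the SQL on F4801: filtered on the planner only."""
    return [wo for wo in table if wo.planner == planner]


def search(table, planner, status_from, status_to, warn_every):
    """Return (rows fetched, rows displayed, warnings raised)."""
    fetched = fetch_for_planner(table, planner)
    # Stand-in for the "grid record is fetched" event: the status range
    # is checked row by row, after the database has returned the row.
    displayed = [wo for wo in fetched if status_from <= wo.status <= status_to]
    # Simplified model: one warning each time another warn_every rows arrive.
    warnings = len(fetched) // warn_every
    return len(fetched), len(displayed), warnings


if __name__ == "__main__":
    # Invented data: 5000 closed work orders and 12 open ones for planner 7.
    table = [WorkOrder(n, 7, "99") for n in range(5000)]
    table += [WorkOrder(5000 + n, 7, "10") for n in range(12)]
    print(search(table, planner=7, status_from="  ", status_to="20", warn_every=2000))
    # (5012, 12, 2)
```

In this model the user sees 12 rows but the warning count follows the 5012 rows fetched, which is why archiving old work orders would reduce the warnings without changing what is displayed.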


About splitting UBEs and BSFNs across different servers, I think it's not an issue. Indeed, all the BSFNs called in a UBE are executed on the server where the UBE is running, whatever OCM mapping you set up for BSFNs (and I would say fortunately).

I hope our changes to the antivirus exclusions will help, but I'm not sure users will notice. I must confess that after several years of experience, I have learnt to be extra careful about what should be (and seems logical) versus the reality of these big and complex systems we work with every day! ;-)

Thanks for your help Sebastian.
We'll see if the platform is more stable.
 
Re: Accumulation of outstanding requests on callobjects processes of enterprise servers

Hi Antoine,

No problem, I thought users were scrolling thru 10,000 or
50,000 records (I saw users doing that!)
Tell us how it goes with the Antivirus...

Good luck / Bonne chance
 
Re: Accumulation of outstanding requests on callobjects processes of enterprise servers

Hi List,

Here are some updates on my issue.
It still occurs, in spite of the changes made to the antivirus exclusion settings.
But I was able to get more information about what's happening with the processes. Maybe it can help find the cause.

During the last 2 occurrences (April 30th and May 11th), I did the following, working with our outsourcing company:
When the issue was starting, I asked the outsourcing company to check whether they could see database connections for the enterprise server processes accumulating outstanding requests, and which SQL statements were in progress.
At the same time, I activated debug logging on the fly for those processes, through web SAW for the enterprise server.

On April 30th, each process that was accumulating outstanding requests had a database connection, with the SQL statement:
SELECT * FROM PRODDTA.F4111 WHERE ( ILDOC = :KEY1 AND ILDCT = :KEY2 AND ILKCO = :KEY3 )
Looking in the debug log file of a process that seemed stuck, I could find this request:
Apr 30 11:08:49 ** 3616/1680 SELECT * FROM PRODDTA.F4111 WHERE ( ILDOC = 2286535.000000 AND ILDCT = 'IM' AND ILKCO = '00106' )
And nothing was written to the log file for a while.
In the meantime, I executed this SQL statement in a SQL editor. I got the result in less than a second, and the explain plan was very good:
Plan
SELECT STATEMENT ALL_ROWS Cost: 5 Bytes: 379 Cardinality: 1
  2 TABLE ACCESS BY INDEX ROWID TABLE PRODDTA.F4111 Cost: 5 Bytes: 379 Cardinality: 1
    1 INDEX RANGE SCAN INDEX PRODDTA.F4111_2 Cost: 4 Cardinality: 1
After 9 minutes, information started to be written to the debug file again:
Apr 30 11:17:49 ** 3616/1680 ORACLE DBFetch: Invoke OCI Fetch fetchNumRows = 100
Apr 30 11:17:49 ** 3616/1680 Entering JDB_CloseTable(Table = F4111)
Apr 30 11:17:49 ** 3616/1680 Entering JDB_ClearSequencing
Apr 30 11:17:49 ** 3616/1680 Entering JDB_ClearSelection
Apr 30 11:17:49 ** 3616/1680 ORACLE DBFreeReq conn=02F4BF10 requ=030B7008 CLOSE
Apr 30 11:17:49 ** 3616/1680 Entering JDB_ClearBuffers
Apr 30 11:17:49 ** 3616/1680 Exiting JDB_ClearBuffers with success.
Apr 30 11:17:49 ** 3616/1680 ORACLE DBFreeReq conn=02F4BF10 requ=06D38658 DROP
and then the same problem appeared again.
In the same debug log, we can see it:
Apr 30 11:17:58 ** 3616/1680 SELECT * FROM PRODDTA.F4111 WHERE ( ILDOC = 2286537.000000 AND ILDCT = 'IM' AND ILKCO = '00106' )
Apr 30 11:29:37 ** 3616/1680 ORACLE DBFetch: Invoke OCI Fetch fetchNumRows = 100

Almost 12 minutes this time between the select statement and the fetch.
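
Spotting these silences by eye gets tedious in a long debug log, so here is a minimal Python sketch that reports gaps between consecutive lines of the same process/thread. The line layout is inferred only from the excerpts above, and lines that do not match it are skipped. The log lines carry no year, so `ASSUMED_YEAR` is an arbitrary placeholder, and `find_stalls` is a made-up helper name.

```python
import re
from datetime import datetime, timedelta

# Layout inferred from the excerpts: "Apr 30 11:08:49 ** 3616/1680 <text>"
LINE_RE = re.compile(r"^(\w{3}) +(\d{1,2}) (\d{2}:\d{2}:\d{2}) \*\* (\d+/\d+) (.*)$")
ASSUMED_YEAR = 2000  # placeholder: the log lines do not include a year


def parse(line, year=ASSUMED_YEAR):
    """Return (process/thread, timestamp, text), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    month, day, clock, proc, text = m.groups()
    stamp = datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
    return proc, stamp, text


def find_stalls(lines, threshold=timedelta(minutes=1), year=ASSUMED_YEAR):
    """List (process/thread, gap, last text before the gap) for every long silence."""
    last_seen = {}
    stalls = []
    for line in lines:
        parsed = parse(line, year)
        if parsed is None:
            continue
        proc, stamp, text = parsed
        if proc in last_seen:
            prev_stamp, prev_text = last_seen[proc]
            gap = stamp - prev_stamp
            if gap >= threshold:
                stalls.append((proc, gap, prev_text))
        last_seen[proc] = (stamp, text)
    return stalls


SAMPLE = [
    "Apr 30 11:08:49 ** 3616/1680 SELECT * FROM PRODDTA.F4111 WHERE ( ILDOC = 2286535.000000 AND ILDCT = 'IM' AND ILKCO = '00106' )",
    "Apr 30 11:17:49 ** 3616/1680 ORACLE DBFetch: Invoke OCI Fetch fetchNumRows = 100",
    "Apr 30 11:17:49 ** 3616/1680 Entering JDB_CloseTable(Table = F4111)",
    "Apr 30 11:17:58 ** 3616/1680 SELECT * FROM PRODDTA.F4111 WHERE ( ILDOC = 2286537.000000 AND ILDCT = 'IM' AND ILKCO = '00106' )",
    "Apr 30 11:29:37 ** 3616/1680 ORACLE DBFetch: Invoke OCI Fetch fetchNumRows = 100",
]

if __name__ == "__main__":
    for proc, gap, text in find_stalls(SAMPLE):
        print(proc, gap, text[:60])
```

On the sample lines this reports two silences, each one starting right after a SELECT on F4111.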

After the April 30th occurrence, we were wondering if something was wrong with the F4111 table, but it seemed a bit strange, as I could execute the statement very fast in an editor, and other statements on the same table were working fine...

The occurrence of May 11th killed that hypothesis.
We did the same, but this time all the processes that were accumulating outstanding requests seemed stuck on the following SQL statement:
SELECT * FROM proddta.f4211 WHERE ( sdkcoo = :KEY1 AND sddoco = :KEY2 AND sddcto = :KEY3 AND sdmcu = :KEY4) ORDER BY sddoco ASC, sddcto ASC, sdkcoo ASC, sdlnid ASC

Indeed, here is what we see in the debug log file of one of them:
May 11 11:38:38 ** 2232/2128 SELECT * FROM PRODDTA.F4211 WHERE ( SDKCOO = '00106' AND SDDOCO = 4330101.000000 AND SDDCTO = 'F1' AND SDMCU = ' F03' ) ORDER BY SDDOCO ASC,SDDCTO ASC,SDKCOO ASC,SDLNID ASC
May 11 11:53:47 ** 2232/2128 Entering JDB_Fetch

It took 15 minutes, but in a SQL editor I get the result in less than a second, and the explain plan is good again.

Any idea what to do next time to get further information?

Oh, by the way, did I mention that I still don't have any answer from Oracle on the SR I opened on March 18th?
 
Re: Accumulation of outstanding requests on callobjects processes of enterprise servers

Hi List,

Just a little update to say that we still hit the issue from time to time, and that Oracle support doesn't seem to care much.
After I sent the debug logs to support, they answered that they had opened an internal case with a development team.

So far, this development team hasn't given any information about it.
The only update I've had lately from Oracle was to ask when we upgraded the database, what the value of the database's optimizer_mode parameter is, and whether we were using custom or standard E1 reports...

The SR has been open for almost 3 months now, and Oracle hasn't helped us on this matter. When you know the price we pay Oracle each year for E1 maintenance...
 
Re: Accumulation of outstanding requests on callobjects processes of enterprise servers

Sounds like a regular performance problem. The fact that these servers are virtualised may also be contributing.

The plan is calculated for every run, so what you get running it in SQL*Plus is not necessarily the same as what JDE would get.

Get an independent DBA to look at it. And another one. And maybe one more after that ;-)
 
Re: Accumulation of outstanding requests on callobjects processes of enterprise servers

Hi Alex,

I guess you're right that it sounds like a regular performance issue.

And you're also right that the plan I get running SQL*Plus is probably not the same as the one JDE is getting. While analysing another issue (a custom application that sometimes returns no rows for a search in the web client, with criteria that should return something) and searching DBA forums, I learned that an explain plan alone cannot be trusted, especially for SQL requests with bind variables (which is the case with E1).

The only way to really see what happens with a SQL request is to activate SQL trace and analyse it.
And I wonder if the problem I have with this custom program is the same symptom as the outstanding requests issue. For that application, I noticed that sometimes the search works fine, and sometimes you don't get any rows in web; but in fact, from my tests in the fat client, you don't get any because a SQL request (on F4211) takes 10 minutes to return the result set. So I guess in the web client, the 5 min timeout leads to the "no rows".
When I looked at the SQL request, the explain plan said "full access on F4211". So I thought: "bingo, that's the problem". But analysing it a bit more (I had the same explain plan in development, but a very fast answer), I went on DBA forums and found the explanation of why the explain plan alone cannot be trusted.
So I did a SQL trace, and indeed the "row source operation", which shows the real plan used by Oracle at execution time, is using an index.
But so far, I have only been able to produce the SQL trace for a run that worked well. I have to catch a moment when the program is slow, to compare it with what I get in the SQL trace.
 
Re: Accumulation of outstanding requests on callobjects processes of enterprise servers

These slow-downs would likely depend on the values used in the query, but because the DB caches what it has read, re-running the same query right away would likely be much quicker.

Have you tried noting the search values used and then trying the same ones after this server has been restarted? - if you can reproduce this issue, you will have a much better chance of fixing it, of course...
 
Re: Accumulation of outstanding requests on callobjects processes of enterprise servers

Hi Alex,

I noticed the following during my tests in the fat client:
- I first logged in to PD7333 and did a search. It was fast.
- I logged out, logged in to JPD7333 and did the same search (with the same criteria), and it was fast.
- Still logged in, several minutes after that attempt, I launched the search again (same criteria) and it took 10 minutes to get the result.
- I launched the search again straight away: it was slow again. In the meantime, I logged in on another fat client, in PD7333 this time, and did the same search: slow too.

So, as you can see, with the same values, it worked very well for almost the whole day, and for a while it went very, very slow.
 
Re: Accumulation of outstanding requests on callobjects processes of enterprise servers

Well, in this case look at what else is running - there may be some ad-hoc (or not) queries using 100% of disk bandwidth for ~10 minutes at a time. Maybe even in a different DB entirely, depending on what databases you have on this disk.

So, if you're unlucky and overlap, it would take longer, and if not - quicker.

Could this be the case?
 
Re: Accumulation of outstanding requests on callobjects processes of enterprise servers

You're right, it could be something related to disk access. I can't tell right now. Furthermore, our production platform is managed by an outsourcing company, so we can't check this ourselves (anyway, I don't have the skills for that; I don't know anything about SAN devices).

But if the problem is what you suggest, then if I manage to generate a SQL trace for the slow SQL requests, I shouldn't see any incorrect "real" execution plan, should I?
I think that's the first thing I should get, to know where to look further: the Oracle database or disk access.
If the execution plan when the request is slow differs from the execution plan when it's fast, then it would point more to Oracle (maybe something in the Oracle cache being evicted by transactions requiring a lot of memory, or something like that); and if the execution plan is the same, then it would point more to disk access.
Do you agree?
 