Kernels getting to zombie with a specific error

CNC Guy

Well Known Member
Folks,

Some of our kernels (CALL OBJECT, WORKFLOW, jdenet_k) are going zombie regularly and we have to clean them up with a weekly restart. What we've noted is as below:

1) Usually there is a DB lock on the database which subsequently has to be forcefully killed, and then one of the kernels becomes a zombie.

2) We see the following error in the logs of the kernels that get zombied:

"QUEUE04950060-Fetch failed for _01_6013 in F986130"

Can you suggest what might be going wrong?

Thanks,
CNC Guy
E1 8.11 SP1
8.96 D1
UNIX Solaris
 
Hello,

I think you should look into the reason why you are experiencing the DB locks. My bet is that if you figure that out, the rest (or the majority) of your zombied processes will go away too.

The cause of a DB lock can be difficult to trace. I can offer some starting points to look at, but I am sure someone else can provide better details.

First, go talk to your DBA. He/She should be able to tell you which table has the lock on it. Depending on your DB platform, you may even get the person that caused the lock. Take that info and a really big stick and go find the person in question. Use the stick to "persuade" the person into telling you what they were doing at the time that the lock occurred (The stick may or may not be necessary depending on their level of cooperation.).
From there you can determine if it is possibly bad coding or a process not being followed, etc... My guess is the latter.

If you can't discern the user from the logs on the server or with the DBA's help, hopefully you can at least get a table name. If so, you might be able to use that info to talk to a business analyst and figure out who in the organization would be using the table (writing to, deleting from, or inquiring on it). That may point you toward the department or group of individuals causing the issue. Start asking around in that area whether anyone had a hung session around that time. Then follow the advice above concerning the big stick when you find the culprit.
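
If it helps, and assuming an Oracle back end (adjust for your own platform), the query below is roughly the kind of thing your DBA can run to show who is blocking whom and which table the lock is on. Treat it as a sketch -- the account and connect string are placeholders.

[ CODE ]
# Hedged sketch, assuming an Oracle back end and the cx_Oracle driver.
# Lists blocked sessions, the session blocking them, and the objects the
# blocker currently holds locks on.
import cx_Oracle

# placeholders -- use a read-only monitoring account and your own TNS alias
conn = cx_Oracle.connect("monitor_user", "password", "jdeprod")

SQL = """
SELECT w.sid       AS waiter_sid,
       w.username  AS waiter,
       b.sid       AS blocker_sid,
       b.username  AS blocker,
       b.machine,
       b.program,
       o.object_name
  FROM v$session w
  JOIN v$session b        ON b.sid         = w.blocking_session
  JOIN v$locked_object lo ON lo.session_id = b.sid
  JOIN dba_objects o      ON o.object_id   = lo.object_id
 WHERE w.blocking_session IS NOT NULL
"""

cur = conn.cursor()
for waiter_sid, waiter, blocker_sid, blocker, machine, program, obj in cur.execute(SQL):
    print(f"session {waiter_sid} ({waiter}) is blocked by session {blocker_sid} "
          f"({blocker} on {machine}, {program}) which holds a lock on {obj}")
[/ CODE ]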


Hope this helps a little.

Dan
 
Thanks Dan. I do get the point you are trying to make, and it could indeed be one of those reasons. Unfortunately the DB is handled by a separate group of people in our setup, and they aren't up to the mark, so I guess I would need the stick to get the required info from the DBAs themselves ;-)

But yes, in short it looks like there are a whole lot of possibilities and it is like finding a needle in a haystack.

I will wait to see if we can figure this out somehow.

Thanks,
CNC Guy
 
99% of these sorts of errors have a DB problem behind them. The only issue I have right now is on a Java Dynamic Connector coming from a system on the other end of a WAN (don't ask, and it's being replaced by EBSS). Very occasionally there's a glitch on the WAN and a CallObject kernel almost instantly zombies.

An especially memorable one was at a client where we implemented 8.97 on Oracle 10.

The kernels would come up, run for a few minutes, then zombie; new ones would start, run for a few minutes, then zombie, and so on. After some time it progressively got even worse.

It turned out that the F986110 had a bad index on it, which hadn't been noticed before because on the test system a full table scan was OK. But the week before, a developer, in their infinite wisdom, had written a UBE interconnect which called another UBE on a PER LINE basis. This was fine on the instance it was developed on, but on the real test system it resulted in 80,000 UBEs being submitted to the job queue all at once. That caused the full scan to take longer than the timeout period on the queue kernels, which then bombed and restarted, but crucially left the query running in the background; after about 15 minutes the DB became so busy that everything else cascaded around it.

These issues are a complete nightmare to find. JAS makes it easier, since if the issue was caused by a JAS session it will usually log it. Alternatively, UBEs and BSFNs called by external systems can also cause it, and these are a LOT harder to track down. On the last two sites I've had decent DBAs and they've been able to tell me about long-running queries, locks, etc. on an alert basis. Finding the query is pretty much 90% of the battle; once you have a specific problem you can add an index, a hint, etc.
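
For what it's worth, the "alert basis" part doesn't need anything fancy. Something along these lines, run from cron, is roughly what I've had in place -- again assuming Oracle, and the account, connect string and threshold are all placeholders.

[ CODE ]
# Hedged sketch of a long-running-query alert, assuming Oracle/cx_Oracle.
# Flags active user sessions that have been in their current call for more
# than LIMIT_SECONDS -- run it from cron and mail yourself the output.
import cx_Oracle

LIMIT_SECONDS = 300   # assumption: tune the threshold to your environment

# placeholders -- read-only monitoring account and your own TNS alias
conn = cx_Oracle.connect("monitor_user", "password", "jdeprod")
cur = conn.cursor()
cur.execute("""
    SELECT sid, username, program, sql_id, last_call_et
      FROM v$session
     WHERE status = 'ACTIVE'
       AND type   = 'USER'
       AND last_call_et > :limit
""", limit=LIMIT_SECONDS)

for sid, username, program, sql_id, elapsed in cur:
    print(f"SID {sid} ({username}, {program}) active for {elapsed}s, sql_id {sql_id}")
[/ CODE ]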

Literally last week the AB purge was taking 93 seconds on a lookup on the F4211. A quick addition of an index took the lookup rate in proof mode to 50 per second...
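
For anyone wanting to do the same, the fix itself is a one-liner once you know the offending query -- something like the sketch below. The column is only an example (SDAN8 is my assumption; check what the purge is actually filtering on), and ideally you'd add the index through Table Design Aid so the spec stays in sync, but this is the gist.

[ CODE ]
# Hedged sketch of the kind of fix -- a custom index over the column(s) the
# purge is filtering on. SDAN8 (address number) is an assumption; check the
# actual WHERE clause in the debug log / explain plan first.
import cx_Oracle

# placeholders -- a DBA account with rights on the business data schema
conn = cx_Oracle.connect("dba_user", "password", "jdeprod")
cur = conn.cursor()
cur.execute("CREATE INDEX F4211_ABPURGE ON PRODDTA.F4211 (SDAN8)")
[/ CODE ]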
 
FWIW, the indexes on E1 tables are horribly, horribly wrong with some tables not having a clustered index and some tables having no indexes at all, resulting in heap storage.

I have asked Oracle to do an analysis of the indexes, starting with the Central Objects database. I asked them to start with the CO tables since the new XML spec storage tables are now there. The indexes on the CO tables are horrible, and since the spec storage for auto package discovery and dynamic generation requires higher performance from these tables, seek performance has become more critical. Also, the serialized objects tables are there and their indexes are not optimal either.

For maximum web client performance we must ensure that serialized objects tables and specs tables used in dynamic generation are optimized. Having the proper indexes (and keeping them defragmented) is critical.
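
If anyone wants to see how widespread this is on their own system, here's a rough sketch that lists tables stored as heaps, i.e. with no clustered index at all. It assumes the database in question is SQL Server (heaps being a SQL Server concept), and the driver name, server and database are placeholders.

[ CODE ]
# Hedged sketch: list SQL Server tables with no clustered index (heaps).
# Connection string is a placeholder; point it at the database holding
# Central Objects / serialized objects that you want to check.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=jdedb;"
    "DATABASE=JDE_CO;Trusted_Connection=yes"
)

SQL = """
SELECT s.name AS schema_name, t.name AS table_name
  FROM sys.tables t
  JOIN sys.schemas s ON s.schema_id = t.schema_id
  JOIN sys.indexes i ON i.object_id = t.object_id
 WHERE i.type = 0          -- 0 = heap, i.e. no clustered index
 ORDER BY s.name, t.name
"""

for schema, table in conn.cursor().execute(SQL):
    print(f"{schema}.{table} is stored as a heap (no clustered index)")
[/ CODE ]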




[ QUOTE ]
99% of these sorts of errors have a DB problem behind them.
[/ QUOTE ]
 
Amen. Trouble is the whole system has been built incrementally...
 