debugging zombie kernel

nkuebelbeck

So from time to time random kernels go zombie. I have no way to reproduce the zombie kernels. All I have is the BSSV log, the zombie kernel logs and the jdenet log. I know what function derps. I don't know what data was sent in without turning on a ton of logging and waiting for the process to derp. This could take weeks or months (slowing everything down).

The zombie process is created when BSSV executes a BSFN and the call object kernel derps (again, I do not know why and can't reproduce it). This causes the BSSV to return 500 errors for that user, as the next executing BSSV will not attach to a new kernel (I've complained about this and it's working as designed per Oracle: any BSSV executing a BSFN using cache will NEVER attach to a new process without restarting the Windows BSSV service).

What's the best way to go about trying to find the source of the zombie process? Are there any other logs to point me in a direction without turning on debug logging?
 
I don't have a specific answer for you other than to tell you it can be a very lengthy process.

We were/are having a similar issue with XML CallObject (zombie kernels). We tried to figure it out for months w/o success. It finally took a very concentrated effort involving people from CNC, developers and BAs to get to a point where we could reproduce it. We had regular meetings, formed hypotheses, constructed tests, etc. I guess the best way to go about it is to start by simply logging every time it happens and look for patterns.

In our case it appears to be a bug in the tools release (Oracle is able to reproduce the issue) but we are still waiting on a resolution.
 
Ours is so sporadic. Sometimes 1 a week, sometimes 1 a month, sometimes 1 every other month.

I was hoping there would be some system log telling me why the process took a dirt nap

Sounds like some precision guess work is in my future.
 
Does the jde.log of the kernel show a stack dump? That will indicate exactly which low level function is executing when the process crashes and can be helpful.

Craig
 
Yes, I have the jde.log for the kernel with the stack dump. It's unclear how I should read it. Top down? Bottom up?

The jdenet log also recorded which function fails but doesn't tell me why that function failed.
 
Is there an actual error listed in the log?

It appears that the MathCopy in I3101250_GetWorkOrderInfo of F3112WorkOrderRoutingsBeginDoc is causing some exception. Any reference to a null pointer in the log?
 
Compare this log to others when you have the same issue. See if the call stack is the same (the last called function is on the top). If it always crashes at the same place, that points to some kind of specific logic bug that may be trappable. If it's different each time, the problem may be more resource oriented.
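A rough sketch of that comparison, in case it helps: keep a running tally of the top-of-stack function from each crash and see whether one name dominates. Everything here (CrashTally, tally_add, tally_most_frequent) is a made-up helper for illustration, not a JDE API, and it assumes you copy the top frame name out of each kernel jde.log by hand, since the stack dump layout isn't shown in this thread.

```c
#include <assert.h>
#include <string.h>

#define MAX_FRAMES 64    /* hypothetical cap on distinct function names */
#define NAME_LEN   128   /* hypothetical max stored name length */

/* One row per distinct top-of-stack function seen across crash logs. */
typedef struct {
    char name[NAME_LEN];
    int  count;
} FrameTally;

typedef struct {
    FrameTally rows[MAX_FRAMES];
    int        used;
} CrashTally;

void tally_init(CrashTally *t)
{
    t->used = 0;
}

/* Record one crash. Returns the updated count for that function,
 * or -1 if the table has no room for a new name. */
int tally_add(CrashTally *t, const char *topFrame)
{
    int i;
    assert(topFrame != NULL);
    for (i = 0; i < t->used; i++) {
        /* Compare only as many characters as we store. */
        if (strncmp(t->rows[i].name, topFrame, NAME_LEN - 1) == 0)
            return ++t->rows[i].count;
    }
    if (t->used == MAX_FRAMES)
        return -1;
    strncpy(t->rows[t->used].name, topFrame, NAME_LEN - 1);
    t->rows[t->used].name[NAME_LEN - 1] = '\0';
    t->rows[t->used].count = 1;
    t->used++;
    return 1;
}

/* Index of the most frequent top frame, or -1 if nothing was recorded.
 * Ties go to the name that was recorded first. */
int tally_most_frequent(const CrashTally *t)
{
    int i, best = -1;
    for (i = 0; i < t->used; i++) {
        if (best < 0 || t->rows[i].count > t->rows[best].count)
            best = i;
    }
    return best;
}
```

If one name ends up with most of the hits, that leans toward a specific logic bug. If the counts stay spread out over many names, that leans toward the resource theory.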
 
Is I3101250_GetWorkOrderInfo always in the log when it crashes?

Nope :/ that's just the latest log.

I've got random functions failing at seemingly random times.

I'm going to start saving kernel log messages and try to find a rhyme or reason to it.
 
Don't the security tokens for BSSV E1 connections expire after a while? Like 30-odd days?

Could it be that?
 
I would hope that if this was the case that it would be more apparent in the log files. (and not kill a call object kernel)
 
That's unfortunate. Needle in a haystack.

I'm on Apps 9.0 but I'm guessing that sub has changed very little. There is some dangerous code in my version.

Pristine:
Code:
      if (dsB3100330.idF4801Pointer > (ID)0)
      {
         lpdsWorkCache->lpF4801Pointer = 
            (LPF4801)jdeRemoveDataPtr(lpdsInternal->hUser,
            (unsigned long)dsB3100330.idF4801Pointer);

         MathCopy(&lpdsInternal->mnShortItemNumber, &lpdsWorkCache->lpF4801Pointer->waitm);
         jdeStrncpy((JCHAR*)(lpdsInternal->szBranchPlant),
            (const JCHAR*)(lpdsWorkCache->lpF4801Pointer->wammcu),
            DIM(lpdsInternal->szBranchPlant)-1);
      }

If jdeRemoveDataPtr fails, the MathCopy is going to fail with a null ptr. One possible needle in the haystack: you may have something somewhere leaking pointer handles (jdeStoreDataPtr w/o a corresponding jdeRemoveDataPtr; there is a finite supply), which could be why you see random failures.

Safer:
Code:
      if (dsB3100330.idF4801Pointer > (ID)0)
      {
         lpdsWorkCache->lpF4801Pointer = 
            (LPF4801)jdeRemoveDataPtr(lpdsInternal->hUser,
            (unsigned long)dsB3100330.idF4801Pointer);
         
         if(lpdsWorkCache->lpF4801Pointer)
         {
            MathCopy(&lpdsInternal->mnShortItemNumber, &lpdsWorkCache->lpF4801Pointer->waitm);
            jdeStrncpy((JCHAR*)(lpdsInternal->szBranchPlant),
               (const JCHAR*)(lpdsWorkCache->lpF4801Pointer->wammcu),
               DIM(lpdsInternal->szBranchPlant)-1);
         }
      }


Maybe that is one pattern you could look for... when a function fails and you get lucky enough to have a small enough section of code to look through like in this case, look for jdeStoreDataPtr/jdeRetrieveDataPtr/jdeRemoveDataPtr calls.
 
I'm actually thinking something similar: that the magic array for handling pointers reaches its magic number and gets full.
 
BTW, looking through that entire sub-routine (I3101250_GetWorkOrderInfo) on my system you can see where that routine itself could leak ptr handles.
 
Is there any way to know/log when the array for handling ptrs fills up?
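To make the "finite supply" idea concrete, here is a toy model only. It is NOT the real jdeStoreDataPtr/jdeRemoveDataPtr implementation. HandleTable, handleStore, handleRemove and the SLOT_COUNT of 8 are all invented for illustration. It shows how a fixed-size table can track a high-water mark and log the moment it fills, and why a leaked handle eventually makes an unrelated store fail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define SLOT_COUNT 8   /* hypothetical capacity, kept tiny so exhaustion is easy to see */

typedef struct {
    void *slots[SLOT_COUNT];  /* NULL means the slot is free */
    int   inUse;              /* handles currently held */
    int   highWater;          /* most handles ever held at once */
} HandleTable;

void handleTableInit(HandleTable *t)
{
    int i;
    for (i = 0; i < SLOT_COUNT; i++)
        t->slots[i] = NULL;
    t->inUse = 0;
    t->highWater = 0;
}

/* Store a pointer. Returns a handle id >= 1, or 0 when the table is full. */
unsigned long handleStore(HandleTable *t, void *p)
{
    int i;
    assert(p != NULL);
    for (i = 0; i < SLOT_COUNT; i++) {
        if (t->slots[i] == NULL) {
            t->slots[i] = p;
            t->inUse++;
            if (t->inUse > t->highWater)
                t->highWater = t->inUse;
            if (t->inUse == SLOT_COUNT)
                fprintf(stderr, "handle table is now full (%d slots)\n", SLOT_COUNT);
            return (unsigned long)(i + 1);
        }
    }
    fprintf(stderr, "handle table exhausted: store failed\n");
    return 0;
}

/* Remove a handle. Returns the stored pointer and frees the slot,
 * or NULL if the id is out of range or the slot is already empty. */
void *handleRemove(HandleTable *t, unsigned long id)
{
    void *p;
    if (id == 0 || id > SLOT_COUNT)
        return NULL;
    p = t->slots[id - 1];
    if (p != NULL) {
        t->slots[id - 1] = NULL;
        t->inUse--;
    }
    return p;
}
```

In this model every store that is never paired with a remove permanently eats a slot, so the failure shows up later in whatever code happens to ask for the next handle. That would fit random functions failing at random times. Whether the real kernel behaves this way is an assumption on my part.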
 