debugging zombie kernel

nkuebelbeck · Jun 1, 2016

So from time to time random kernels go zombie. I have no way to reproduce the zombie kernels. All I have is the BSSV log, the zombie kernel logs, and the jdenet log. I know what function derps. I don't know what data was sent in without turning on a ton of logging and waiting for the process to derp. This could take weeks or months (slowing everything down).

The zombie process is created when BSSV executes a BSFN and the call object kernel derps (again, I do not know why and can't reproduce it). This causes the BSSV to return 500 errors for that user, as the next executing BSSV will not attach to a new kernel. (I've complained about this and it's working as designed per Oracle: any BSSV executing a BSFN using cache will NEVER attach to a new process without restarting the Windows BSSV service.)

What's the best way to go about trying to find the source of the zombie process? Are there any other logs to point me in a direction without turning on debug logging?

BOster · Jun 1, 2016

I don't have a specific answer for you other than to tell you it can be a very lengthy process.

We were/are having a similar issue with XML CallObject (zombie kernels). We tried to figure it out for months w/o success. It finally took a very concentrated effort involving people from CNC, developers and BAs to get to a point where we could reproduce it. We had regular meetings, formed hypotheses, constructed tests, etc. I guess the best way to go about it is to start by simply logging every time it happens and look for patterns.

In our case it appears to be a bug in the tools release (Oracle is able to reproduce the issue) but we are still waiting on a resolution.

nkuebelbeck · Jun 2, 2016

Ours is so sporadic: one a week sometimes, one a month, one every other month.

I was hoping there would be some system log telling me why the process took a dirt nap.

Sounds like some precision guess work is in my future.

craig_welton · Jun 2, 2016

Does the jde.log of the kernel show a stack dump? That will indicate exactly which low level function is executing when the process crashes and can be helpful.

Craig

nkuebelbeck · Jun 2, 2016

Yes, I have the jde.log for the kernel with the stack dump. It's unclear how I should read it. Top down? Bottom up?

The jdenet log also recorded which function fails, but it doesn't tell me why that function failed.

nkuebelbeck · Jun 2, 2016

Here is a gist of this particular zombie kernel stack dump:

https://gist.github.com/anonymous/3c746bae67291c7b8013498fe9058fa3

craig_welton · Jun 2, 2016

Is there an actual error listed in the log?

It appears that the MathCopy in I3101250_GetWorkOrderInfo of F3112WorkOrderRoutingsBeginDoc is causing some exception. Any reference to a null pointer in the log?

nkuebelbeck · Jun 2, 2016

No reference to a null pointer in the log.

craig_welton said:
Is there an actual error listed in the log?

Not sure what you mean by an actual error.

craig_welton · Jun 2, 2016

Can you post the entire jde.log?

nkuebelbeck · Jun 2, 2016

https://gist.github.com/anonymous/ac7fb3b71d014b539d9167a5906922d1

craig_welton · Jun 2, 2016

Compare this log to others when you have the same issue. See if the call stack is the same (the last called function is at the top). If it always crashes at the same place, that points to some kind of specific logic bug that may be trappable. If it's different each time, the problem may be more resource oriented.
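The comparison can be sketched roughly as below. This is only an illustration: it assumes the top-of-stack function name has already been pulled out of each crash log by hand, and `sameCrashSite` is a made-up helper name, not anything shipped with the tools.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: topFrames[i] is the top-of-stack function name
   copied by hand from the i-th crash log.
   Returns 1 when every log ended in the same function, 0 when at least
   two differ or when there are fewer than two samples to compare. */
int sameCrashSite(const char *const topFrames[], size_t count)
{
    assert(topFrames != NULL);
    if (count < 2) {
        return 0;   /* one crash is not a pattern yet */
    }
    for (size_t i = 1; i < count; i++) {
        if (strcmp(topFrames[0], topFrames[i]) != 0) {
            return 0;   /* different crash site: resource issue is more likely */
        }
    }
    return 1;           /* same crash site every time: look for a logic bug */
}
```

A result of 1 across many logs would point toward one specific piece of code; a result of 0 would fit the resource-oriented explanation.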

nkuebelbeck · Jun 2, 2016

Thanks for the advice. This is going to be a long process.

BOster · Jun 2, 2016

Is I3101250_GetWorkOrderInfo always in the log when it crashes?

nkuebelbeck · Jun 2, 2016

BOster said:
Is I3101250_GetWorkOrderInfo always in the log when it crashes?

Nope :/ that's just the latest log.

I've got random functions failing at seemingly random times.

I'm going to start saving kernel log messages and try to find a rhyme or reason to it.

johndanter · Jun 2, 2016

Don't the security tokens for BSSV E1 connections expire after a while? Like 30-odd days?

Could it be that?

nkuebelbeck · Jun 2, 2016

johndanter said:
Don't the security tokens for BSSV E1 connections expire after a while? Like 30-odd days?

Could it be that?

I would hope that if this were the case it would be more apparent in the log files (and not kill a call object kernel).

BOster · Jun 2, 2016

nkuebelbeck said:
Nope :/ that's just the latest log.

I've got random functions failing at seemingly random times.

That's unfortunate. Needle in a haystack.

I'm on Apps 9.0 but I'm guessing that sub has changed very little. There is some dangerous code in my version.

Pristine:

Code:

      if (dsB3100330.idF4801Pointer > (ID)0)
      {
         lpdsWorkCache->lpF4801Pointer = 
            (LPF4801)jdeRemoveDataPtr(lpdsInternal->hUser,
            (unsigned long)dsB3100330.idF4801Pointer);

         MathCopy(&lpdsInternal->mnShortItemNumber, &lpdsWorkCache->lpF4801Pointer->waitm);
         jdeStrncpy((JCHAR*)(lpdsInternal->szBranchPlant),
            (const JCHAR*)(lpdsWorkCache->lpF4801Pointer->wammcu),
            DIM(lpdsInternal->szBranchPlant)-1);
      }

If jdeRemoveDataPtr fails and returns NULL, the MathCopy is going to dereference a null pointer and crash. One possible needle in the haystack would be that you have something somewhere leaking pointer handles (jdeStoreDataPtr w/o a corresponding jdeRemoveDataPtr; there is a finite supply), which could be why you see random failures.

Safer:

Code:

      if (dsB3100330.idF4801Pointer > (ID)0)
      {
         lpdsWorkCache->lpF4801Pointer = 
            (LPF4801)jdeRemoveDataPtr(lpdsInternal->hUser,
            (unsigned long)dsB3100330.idF4801Pointer);
         
         if(lpdsWorkCache->lpF4801Pointer)
         {
            MathCopy(&lpdsInternal->mnShortItemNumber, &lpdsWorkCache->lpF4801Pointer->waitm);
            jdeStrncpy((JCHAR*)(lpdsInternal->szBranchPlant),
               (const JCHAR*)(lpdsWorkCache->lpF4801Pointer->wammcu),
               DIM(lpdsInternal->szBranchPlant)-1);
         }
      }

Maybe that is one pattern you could look for... when a function fails and you get lucky enough to have a small enough section of code to look through, like in this case, look for jdeStoreDataPtr/jdeRetrieveDataPtr/jdeRemoveDataPtr calls.
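To make the leak idea concrete, here is a toy model of a finite handle table. It is a sketch under loud assumptions: the table size, the `demo*` names, and the "0 means failure" convention are all invented for illustration and say nothing about how jdeStoreDataPtr/jdeRemoveDataPtr are actually implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a finite pointer-handle table.
   NOT the real implementation; size and names are assumptions. */
#define MAX_HANDLES 4

static void *g_slots[MAX_HANDLES];

/* Store a pointer; returns a 1-based handle, or 0 when the table is full. */
unsigned long demoStorePtr(void *p)
{
    assert(p != NULL);
    for (unsigned long i = 0; i < MAX_HANDLES; i++) {
        if (g_slots[i] == NULL) {
            g_slots[i] = p;
            return i + 1;
        }
    }
    return 0;   /* every slot is taken, e.g. by handles that were never removed */
}

/* Remove and return the pointer behind a handle; NULL if the handle is bad. */
void *demoRemovePtr(unsigned long h)
{
    if (h == 0 || h > MAX_HANDLES) {
        return NULL;
    }
    void *p = g_slots[h - 1];
    g_slots[h - 1] = NULL;   /* free the slot for reuse */
    return p;
}

/* Count handles that were stored but never removed. */
int demoOutstanding(void)
{
    int n = 0;
    for (int i = 0; i < MAX_HANDLES; i++) {
        if (g_slots[i] != NULL) {
            n++;
        }
    }
    return n;
}
```

In this model, once enough stores go without a matching remove, the next store hands back 0, and a later remove on that 0 handle returns NULL. A caller that skips the NULL check then crashes in whatever code happens to run next, which would look like random functions failing at random times.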

nkuebelbeck · Jun 2, 2016

I'm actually thinking something similar: that the magic array of pointer handles hits its magic number and fills up.

BOster · Jun 2, 2016

BTW, looking through that entire subroutine (I3101250_GetWorkOrderInfo) on my system, you can see where that routine itself could leak ptr handles.

nkuebelbeck · Jun 2, 2016

Is there any way to know/log when the array for handling ptrs fills up?
