EnterpriseOne down on AS400! Help us please!

mauragio

Member
Hi all.
We would like to describe a problem we are having with EnterpriseOne services on the AS400 architecture.
Over the past month there have been 3 cases of the system going down:
- all EnterpriseOne services in the JDEE900 subsystem go into MTXW status (JDENET_K and JDENET_N jobs)
- some jobs in the QBATCH subsystem also go into MTXW status
- some jobs in the QBATCH subsystem are still in RUN status, very slow, but running
- on the web console (EnterpriseOne) we see the instance running, with some zombie processes and/or some callobjects
- all users are unable to log in (web & local) because of an error reaching the security server

In this scenario, the only thing that we can do is an IPL, an EnterpriseOne Services restart and a double IPC clear. All these procedures are in the A98 OneWorld Menu.

Our question for now is not what causes these repeated failures, but where on the AS400 system we can look for information about the cause. Are there logs that we can investigate? Where are these logs stored?

Thanks in advance.

We are frustrated!
 
The only time I have seen everything go MTXW is during a package deployment. Are you deploying packages when this occurs? Are you using a third-party tool that interfaces directly with JDE, like a barcode system? What do the kernel logs show?

-Ethan
 
Thank you Ethan for your reply.
No, we are not deploying a package when the mutex condition occurs. We can look at the kernel logs, but where are they? Are there commands on the AS400 to view these logs? Can I view them from the web console (EnterpriseOne)?
Thanks in advance.
 
You can view the kernel logs from the Server Manager console. On AS400, you can view the logs of the kernels in Mutex state using option 10.
 
Thx cncsts.
We no longer have the logs of the mutex processes/jobs, but I want to ask you: what kind of information should I look for in the log of a mutex job? I have attached 2 screenshots to this reply: option 10 (AS400) and the log file from the console of a kernel job (not in mutex state), but I cannot imagine what kind of statement I would find in case of an error.
Can you explain it to me?

Thx in advance.
 

Attachments

  • 167240-10-03-2011 08-48-30.png
... and this is the second screenshot
 

Attachments

  • 167241-10-03-2011 08-51-16.png
Hi,
We have an AS400 running around 500 users. We have had mutex wait issues on numerous occasions in the past and have made various tuning tweaks to get rid of them. Below is a list of the server JDE.INI changes we have made. In brackets are settings we were thinking of changing but did not need to.

In terms of the kernel jobs themselves, you can get MTXW on the job or on the thread. It is always worth doing an option 20 to view the thread info. We have had occasions where a thread has gone rogue and looped, using thousands of CPU seconds. In that case you can use option 4 to end the thread without killing the kernel. When all of the kernels lock up, it is worth checking whether you have a kernel in THDW. That one is probably holding things up, and if you kill it the others will become free. Hope this helps,
Rich
(MaxNetConnections reduced from 1190)
JDENET_N reduced from 30 to 20 DONE
Security Kernels reduced from 25 to 15 DONE
(Call Object Kernels reduced from 160)
(change IPC range startIPCKeyValue=6101 value to 8101)
increase maxIPCQueueMsgs from 100 [JDENET] to 150 DONE
increase internalqueuetimeout from 30 to 45 DONE
([JDENET_KERNEL_DEF30] METADATA KERNEL reduce )
(maxNumberOfProcesses=4 to 2 or 3)
(numberOfAutoStartProcesses=4 to 2 or 3)
PS if you get further MTXW issues please post your ini file and we can take a look at it.....
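
Before posting an ini file, it can help to pull out the per-kernel process limits so they are easy to compare. The following is only a rough sketch in Python using the standard configparser module. The stanza and key names are the ones mentioned in this thread; the sample values are placeholders, not recommendations, and the function name kernel_process_limits is made up for illustration. A real JDE.INI may need extra handling.

```python
import configparser

# Hypothetical excerpt: names come from this thread, values are placeholders.
SAMPLE_INI = """
[JDENET]
maxIPCQueueMsgs=150
internalqueuetimeout=45

[JDENET_KERNEL_DEF30]
maxNumberOfProcesses=4
numberOfAutoStartProcesses=4
"""


def kernel_process_limits(ini_text):
    """Map each JDENET_KERNEL_DEFxx stanza to (max processes, auto-start processes)."""
    parser = configparser.ConfigParser(
        strict=False,          # tolerate repeated stanzas or keys
        interpolation=None,    # treat '%' in values literally
        allow_no_value=True,   # tolerate lines without '='
        delimiters=("=",),
    )
    parser.read_string(ini_text)

    limits = {}
    for section in parser.sections():
        if not section.startswith("JDENET_KERNEL_DEF"):
            continue
        # Option names are lower-cased by configparser, so the lookup
        # does not depend on the capitalisation used in the file.
        max_procs = parser.getint(section, "maxNumberOfProcesses", fallback=None)
        auto_start = parser.getint(section, "numberOfAutoStartProcesses", fallback=None)
        limits[section] = (max_procs, auto_start)
    return limits


if __name__ == "__main__":
    for stanza, (max_procs, auto_start) in kernel_process_limits(SAMPLE_INI).items():
        print(stanza, "max:", max_procs, "auto-start:", auto_start)
```

To use it on a real file, read the file contents into a string and pass that in instead of SAMPLE_INI.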
 
Thx very much rgreensl.
We appreciate your help and will consider your info.

bye
 
mauragio,

Next time you have the problem, log into the A98 OneWorld Menu, as you mention in your first post. Then type SAW on the command line. This brings up the logs that can also be viewed in server manager, but in a format that for me is much easier to read (being an AS400 geek). Take option 2, Work with Server Process, then Option 3, Display OneWorld Processes (DANGER: Do NOT take Option 2 or services will end without prompting). If you put option 7 next to the processes of interest it brings up the logs. I usually put 7's all the way down the line and simply hit f12 to go from log to log quickly.

We had that Mutex wait issue a couple of times when we went to 8.98.3.4, and I was able to see that the Metadata Kernel was getting overwhelmed and locking. We had to increase it from 1 to 2, which is the opposite of what Richard had to do. It goes to show that it really depends on what your logs are telling you when it happens.

Even after you have restarted your services or re-IPL, you should still be able to view these logs, just not through this method. You can use server manager or the wrklnk AS400 command, but they can be a little hard to find. You will have to use the last updated date and time.
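
If the log directory is reachable from a machine with Python (for example a copied or mounted folder), picking logs by last updated date and time could look roughly like the sketch below. This is an illustration only: the function name logs_modified_between is made up, and the directory you pass in is whatever location holds your logs, which I am not assuming here.

```python
from datetime import datetime
from pathlib import Path


def logs_modified_between(log_dir, start, end):
    """Return files under log_dir last modified between start and end, newest first."""
    matches = []
    for path in Path(log_dir).rglob("*"):
        if not path.is_file():
            continue
        # Last-modified time of the file, as a local datetime.
        modified = datetime.fromtimestamp(path.stat().st_mtime)
        if start <= modified <= end:
            matches.append((modified, path))
    # Newest first, so the logs written just before the lock-up come up on top.
    matches.sort(key=lambda item: item[0], reverse=True)
    return [path for _, path in matches]
```

Call it with the folder and the time window around the outage, e.g. logs_modified_between(folder, datetime(2011, 3, 12, 3, 0), datetime(2011, 3, 12, 5, 0)), where the dates are just an example.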

Good Luck.
 
Hi all.
We had the same problem last night. Fortunately, this time we saved all available logs and the list of processes in mutex state (with their job numbers).
Which logs can I post to help you understand what happened?

thanks
 
Thx markdcci for your info.
We appreciate it.
The system is down again and we are going to save all logs... we will look at them later!

thanks

 
Hi all,
after last weekend's outage, this morning we have been looking through the AS400 messages and noticed a strange condition.

Every night at 4:00 AM, many jdenet_k jobs are terminated. When the system goes down, the total number of these jobs is about 50. Under normal conditions, however, there are 3 or 4.
Could this information help identify the cause of the mutex condition and the outage?

Thanks.
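
To check whether the nights with many terminated jobs line up with the outages, the terminations could be counted per day and compared against the normal level. Below is a rough Python sketch. It assumes the termination messages have already been collected by hand into (timestamp, job name) pairs, and the function name nights_above_baseline is made up for illustration.

```python
from collections import Counter


def nights_above_baseline(terminations, baseline, job_prefix="JDENET_K"):
    """Return {date: count} for days with more than `baseline` terminated jobs.

    `terminations` is a list of (datetime, job_name) pairs; only job names
    starting with `job_prefix` (compared in upper case) are counted.
    """
    per_day = Counter(
        timestamp.date()
        for timestamp, job_name in terminations
        if job_name.upper().startswith(job_prefix)
    )
    return {day: count for day, count in sorted(per_day.items()) if count > baseline}
```

With a baseline of 4, a night with about 50 terminations would be reported and a normal night with 3 or 4 would not.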
 
We have attached to this post the log of a terminated jdenet_k job (see my previous post).
In this log I read that there is an invalid handle in a thread tree and then the "IPC Handle State structures" are set "to abandoned".

Is it possible that the removal of an IPC handle causes the mutex state of some or all of the processes that are still using it?

Thanks
 

Attachments

  • 167329-14-03-2011 09-29-21.png
Looks like your kernel recycling is kicking in, according to the message. There is a section in the JDE.INI file which states when it occurs. There is also a timeout value which states how long it leaves the kernel before killing it off.
Perhaps you could post your ini file so we could take a look? It is on the IFS in E812SYS.
Rich
 
Also, in addition to the ini as Rich has requested can you provide your system details? Application release level (sounds like 9.0), tools release and OS400 release.
 
Hi all.
I have attached our jde.ini to this post.
The system details (that I know) are:
- EnterpriseOne 9.0 (E900)
- Tool Release 8.98.23
- OS400 Release 6.1.0

Thanks all.
 

Attachments

  • 167349-JDE.INI.txt
Thanks for that. Your ini file looks a bit odd generally; all the DEFxx stanzas are out of order. It shouldn't matter, but I would be interested to know how you have been changing it (is it via Server Manager?). The kernel recycling section states that you are currently recycling twice a day and forcing users off 1.5 hours after the times stated. Why don't you recycle once a day at a quiet time (say 7pm) before your batch window?
The other thing I would suggest is to log this issue on the main JDELIST rather than iSeries only. There is a bigger audience there who may have the answer to your issue.
Rich
 
Next time you see MTXW, try to go to SAW through the AS400 command line. Most probably you will not be able to get to the SAW menu. In that case, locate your SAW kernel through wrkactjob and try to end it with the immediate option. Chances are the job will not end. Wait for 10 minutes and then try ENDJOBABN. Once this job is killed you will see the remaining kernels go back to normal.

If this works, stop recycling your kernels (change the JDE.INI) and you will not see any kernel locking issue.
 
Hi,
We also have the same issue. In our case, when we submit UBE jobs they go to error (WSJ), and later all users are unable to access the server. In WRKACTJOB, the JDE subsystem shows MTXW.

I have to end all the JDE services and start them again. Everything is back to normal after that.

Is there anything to check or change?
 