MTXW and SEMW on iSeries, call object kernel timeouts.

HolderAndrew

Well Known Member
Hi all,

We are experiencing business-critical problems in our E1 8.11 (TR 8.94_P1) production system that cause kernel jobs to go into mutex wait (MTXW) and sometimes semaphore wait (SEMW) status on the iSeries. The end result is that all user web sessions become unstable and eventually we need to restart services.

The problem can only be seen on our production machine (despite trying very hard to reproduce it in test) and can be replicated in P42101 by doing the following:

1. Inquire on an order within the "Manage Existing Order" form W42101C.
2. Select any order line in the grid and then exit to the detail revisions form W42101D.
3. Press the 'Submit and Close' button after changing some information on the detail line, e.g. quantity. (Note: we can also reproduce the problem when no change is made.)
4. When returned to the inquiry screen, press the 'Find' button immediately.
5. The system hangs, and after the timeout period (currently set to 90 seconds) has been exceeded we get a message saying:

JAS_MSG346: JAS database failure [BSFN - CALL OBJECT ERROR] An error occurred during the call to trigger business function F4211_CURRENCY:

[Aside: we can reproduce the problem in P4205 by doing similar actions (i.e. update and then press Find immediately).]

I suspect that the message may be bogus and is displayed purely because the system is no longer communicating, as we see various other messages on other user sessions when this problem occurs.

Possibly the asynchronous event that is triggered by pressing the 'Submit and Close' button either never completes or, if the user presses 'Find' too quickly, runs into a conflict?
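
To make that hypothesis concrete, here is a toy Python model of it. This is not E1 code: the lock, the timings and the function name simulate_find_after_submit are all invented for illustration. A background "submit" holds a lock for a while, and a "Find" that arrives too early gives up after its timeout, much like the call object timeout described above.

```python
import threading
import time


def simulate_find_after_submit(find_delay, submit_duration=0.3, find_timeout=0.1):
    """Return True if the 'Find' obtains the lock before its timeout expires.

    All names and timings here are hypothetical; this only models the idea of
    a background update still holding a resource when the next request arrives.
    """
    record_lock = threading.Lock()       # stands in for whatever both actions need
    submit_started = threading.Event()

    def async_submit():
        # Background work kicked off by the button press; holds the lock while it runs.
        with record_lock:
            submit_started.set()
            time.sleep(submit_duration)

    worker = threading.Thread(target=async_submit)
    worker.start()
    submit_started.wait()                # make sure the submit really holds the lock
    time.sleep(find_delay)               # how long the user waits before pressing Find

    acquired = record_lock.acquire(timeout=find_timeout)
    if acquired:
        record_lock.release()
    worker.join()
    return acquired


if __name__ == "__main__":
    print("Find pressed immediately:  ", simulate_find_after_submit(find_delay=0.0))
    print("Find pressed after a pause:", simulate_find_after_submit(find_delay=0.5))
```

With these made-up timings the immediate Find should time out and the delayed one should succeed. If the real system behaved this way, the open question would be why the background work holds its resource so much longer in production than in test.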

We have a case open (4133679) and we are getting a good response from Oracle, but so far we have been unable to find the problem. The usual things have been checked (ini files, logs, kernel settings, IPC settings, etc.), too numerous to describe in detail here, but I am hoping that somebody out there has seen this kind of problem before.

We have built a new prod package in case of spec corruption but this has not helped. We have also checked all PTFs and settings on the iSeries but have not discovered any discrepancies.

The latest things we are checking are the RTE settings in SY811/F90705. Does anybody know the significance of active records here?

We have modified P42101 along with many other objects, but we have never seen these problems in our test environments since our E1 project started almost a year ago. We can only see the problem in production.

Any help or suggestions would be appreciated.
 

Sebastian Sajaroff

Legendary Poster
Hi,

Have you taken a look at your server package build logs?
(Especially those regarding JDBTRIG1 to JDBTRIG4.)
I don't mean the R9622 PDF; I mean reading the compile logs
in /OneWorld/Packages/xxxxx/text/JDBTRIG.*
Those triggers are BSFNs, so there could be a bug there.
Also try the following: go to a development PC,
check out F4211 and try building its triggers. Are there
any errors?
Finally, are there any errors when you WebGen F4211?

Regards and good luck,
 

techmgr

Member
We had similar issues here, and the cause was missing setup data in the F0006, F0010 and F0013 tables. The application fails when it can't find the constants.
 

cdawes

VIP Member
Try increasing the MaxDBConnections in the JDBJ.INI. What's the current setting?

Can you post your JDBJ.INI, or at least this small section?

Can you post the kernel settings from the JDE.INI on the iSeries as well?


Colin
 

HolderAndrew

Well Known Member
Hi Sebastian/techmgr/colin,

Many thanks for your replies.

I have already checked the F4211.c trigger code and rebuilt it along with other related system 42 objects. I will look, though, to see if there are any logs from the JDBTRIG functions, as this is where the F4211_CURRENCY code is. I also checked the setup in F0010/F0013, as this is what the trigger code uses, but I will now check more closely. I think (and hope) that the problem is data related, since we have only seen it in production.

I will also pass on the comments made about our jdbj settings and get those checked.

Here are the kernel settings from the ini file on the iSeries (IPC settings also included, just in case).

********* start **************
[JDEIPC]
maxNumberOfResources=1000
startIPCKeyValue=4101
avgResourceNameLength=15
maxMsgqEntries=1024
maxMsgqBytes=65536
ipcTrace=0

[JDENET]
serviceNameListen=6013
serviceNameConnect=6013
maxNetProcesses=4
maxNetConnections=800
netShutdownInterval=15
maxKernelProcesses=142
maxKernelRanges=24
netTrace=0
enablePredefinedPorts=0

[JDENET_KERNEL_DEF1]
krnlName=JDENET RESERVED KERNEL
dispatchDLLName=JDENET
dispatchDLLFunction=JDENET_DispatchMessage
maxNumberOfProcesses=1
numberOfAutoStartProcesses=0

[JDENET_KERNEL_DEF2]
krnlName=UBE KERNEL
dispatchDLLName=JDEKRNL
dispatchDLLFunction=JDEK_DispatchUBEMessage
maxNumberOfProcesses=1
numberOfAutoStartProcesses=0

[JDENET_KERNEL_DEF3]
krnlName=REPLICATION KERNEL
dispatchDLLName=JDEKRNL
dispatchDLLFunction=DispatchRepMessage
maxNumberOfProcesses=1
numberOfAutoStartProcesses=0

[JDENET_KERNEL_DEF4]
krnlName=SECURITY KERNEL
dispatchDLLName=JDEKRNL
dispatchDLLFunction=JDEK_DispatchSecurity
maxNumberOfProcesses=7
numberOfAutoStartProcesses=2

[JDENET_KERNEL_DEF5]
krnlName=LOCK MANAGER KERNEL
dispatchDLLName=JDEKRNL
dispatchDLLFunction=TM_DispatchTransactionManager
maxNumberOfProcesses=1
numberOfAutoStartProcesses=0

[JDENET_KERNEL_DEF6]
krnlName=CALL OBJECT KERNEL
dispatchDLLName=XMLCALLOBJ
dispatchDLLFunction=XMLCallObjectDispatch
maxNumberOfProcesses=100
numberOfAutoStartProcesses=10

[JDENET_KERNEL_DEF7]
krnlName=JDBNET KERNEL
dispatchDLLName=JDEKRNL
dispatchDLLFunction=JDEK_DispatchJDBNETMessage
maxNumberOfProcesses=1
numberOfAutoStartProcesses=0

[JDENET_KERNEL_DEF9]
krnlName=SAW KERNEL
dispatchDLLName=JDESAW
dispatchDLLFunction=JDEK_DispatchSAWMessage
maxNumberOfProcesses=1
numberOfAutoStartProcesses=0

[JDENET_KERNEL_DEF10]
krnlName=SCHEDULER KERNEL
dispatchDLLName=JDEKRNL
dispatchDLLFunction=JDEK_DispatchScheduler
maxNumberOfProcesses=1
numberOfAutoStartProcesses=0

[JDENET_KERNEL_DEF11]
krnlName=PACKAGE BUILD KERNEL
dispatchDLLName=JDEKRNL
dispatchDLLFunction=JDEK_DispatchPkgBuildMessage
maxNumberOfProcesses=1
numberOfAutoStartProcesses=0

[JDENET_KERNEL_DEF12]
krnlName=UBE SUBSYSTEM KERNEL
dispatchDLLName=JDEKRNL
dispatchDLLFunction=JDEK_DispatchUBESBSMessage
maxNumberOfProcesses=1
numberOfAutoStartProcesses=1

[JDENET_KERNEL_DEF13]
krnlName=WORK FLOW KERNEL
dispatchDLLName=WORKFLOW
dispatchDLLFunction=JDEK_DispatchWFServerProcess
maxNumberOfProcesses=5
numberOfAutoStartProcesses=0

[JDENET_KERNEL_DEF14]
krnlName=QUEUE KERNEL
dispatchDLLName=JDEKRNL
dispatchDLLFunction=DispatchQueueMessage
maxNumberOfProcesses=1
numberOfAutoStartProcesses=0

[JDENET_KERNEL_DEF15]
krnlName=XML TRANS KERNEL
dispatchDLLName=XMLTRANS
dispatchDLLFunction=XMLTransactionDispatch
maxNumberOfProcesses=1
numberOfAutoStartProcesses=0

[JDENET_KERNEL_DEF16]
krnlName=XML LIST KERNEL
dispatchDLLName=XMLLIST
dispatchDLLFunction=XMLListDispatch
maxNumberOfProcesses=1
numberOfAutoStartProcesses=0

[JDENET_KERNEL_DEF19]
krnlName=EVN KERNEL
dispatchDLLName=JDEIE
dispatchDLLFunction=JDEK_DispatchITMessage
maxNumberOfProcesses=1
numberOfAutoStartProcesses=0

[JDENET_KERNEL_DEF20]
krnlName=IEO KERNEL
dispatchDLLName=JDEIEO
dispatchDLLFunction=JDEK_DispatchIEOMessage
maxNumberOfProcesses=1
numberOfAutoStartProcesses=0

[JDENET_KERNEL_DEF22]
krnlName=XML DISPATCH KERNEL
dispatchDLLName=XMLDSPATCH
dispatchDLLFunction=XMLDispatch
maxNumberOfProcesses=1
numberOfAutoStartProcesses=0

[JDENET_KERNEL_DEF23]
krnlName=XTS KERNEL
dispatchDLLName=XTSKRNL
dispatchDLLFunction=JDEK_DispatchXTSMessage
maxNumberOfProcesses=1
numberOfAutoStartProcesses=0

[JDENET_KERNEL_DEF24]
krnlName=XML SERVICE KERNEL
dispatchDLLName=XMLSERVICE
dispatchDLLFunction=XMLServiceDispatch
maxNumberOfProcesses=1
numberOfAutoStartProcesses=0


********* end *****************
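
For anyone wanting to sanity-check a listing like this, here is a minimal Python sketch that totals the per-kernel maxNumberOfProcesses values and compares them with maxKernelProcesses. It rests on an assumption not stated in this thread: that the combined per-kernel maximums should not exceed maxKernelProcesses, which is a commonly quoted rule of thumb. The SAMPLE text is a cut-down excerpt of the settings above, and kernel_summary is a made-up helper name.

```python
import configparser

# Cut-down excerpt of the jde.ini posted above (only two kernel definitions).
SAMPLE = """
[JDENET]
maxKernelProcesses=142

[JDENET_KERNEL_DEF4]
krnlName=SECURITY KERNEL
maxNumberOfProcesses=7
numberOfAutoStartProcesses=2

[JDENET_KERNEL_DEF6]
krnlName=CALL OBJECT KERNEL
maxNumberOfProcesses=100
numberOfAutoStartProcesses=10
"""


def kernel_summary(ini_text):
    """Return (maxKernelProcesses, {kernel name: maxNumberOfProcesses})."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the mixed-case key names as written
    parser.read_string(ini_text)

    limit = parser.getint("JDENET", "maxKernelProcesses")
    kernels = {}
    for section in parser.sections():
        if section.startswith("JDENET_KERNEL_DEF"):
            name = parser.get(section, "krnlName")
            kernels[name] = parser.getint(section, "maxNumberOfProcesses")
    return limit, kernels


limit, kernels = kernel_summary(SAMPLE)
total = sum(kernels.values())
print(f"sum of per-kernel maximums: {total}, maxKernelProcesses: {limit}")
if total > limit:
    print("WARNING: kernel definitions could ask for more processes than allowed")
```

Summed by hand over the full listing above, the per-kernel maximums come to 129, which is under the maxKernelProcesses value of 142, so on that (assumed) rule the posted settings look consistent.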

And here is the jdbj.ini from the JAS server. (I haven't included commented-out settings or the bootstrap section, to keep this post shorter.)
****** start of jdbj.ini ********
[JDBj-JDBC DRIVERS]
###ORACLE=oracle.jdbc.driver.OracleDriver
AS400=com.ibm.as400.access.AS400JDBCDriver
###SQLSERVER=com.microsoft.jdbc.sqlserver.SQLServerDriver
###UDB=COM.ibm.db2.jdbc.app.DB2Driver

[JDBj-ORACLE]
tns=D:\jdbcdrivers\tnsnames.ora

[JDBj-LOGS]
jdbcTrace=false

[JDBj-SERVER]
dbcsConversionTolerant=true

[JDBj-CONNECTION POOL]
minConnection=5
maxConnection=150
initialConnection=5
poolGrowth=5
connectionTimeout=1800000
cleanPoolInterval=600000
cachePurgeSize=5

[JDBj-RUNTIME PROPERTIES]
ocmCachePurge=3600000

securityCachePurge=3600000
serviceCachePurge=3600000
specConsistencyCheck=none
triggerAutoFetch=all
usageResultSetOpenThreshold=-1

usageTransactionActiveThreshold=-1
****** end of jdbj.ini ********
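
As a reading aid for the pool section, here is a small Python sketch that converts the timer values into minutes and counts how many growth steps separate the initial pool size from the maximum. Two assumptions are baked in and neither is confirmed in this thread: that the timer values are in milliseconds, and that the pool grows by poolGrowth connections at a time. The helper names are invented.

```python
import math

# Values copied from the [JDBj-CONNECTION POOL] section above.
POOL = {
    "initialConnection": 5,
    "maxConnection": 150,
    "poolGrowth": 5,
    "connectionTimeout": 1800000,   # assumed to be milliseconds
    "cleanPoolInterval": 600000,    # assumed to be milliseconds
}


def ms_to_minutes(milliseconds):
    """Convert a millisecond value to minutes."""
    return milliseconds / 60000


def growth_steps(initial, maximum, step):
    """Number of growth increments needed to go from initial to maximum."""
    return math.ceil((maximum - initial) / step)


print("connectionTimeout (min):", ms_to_minutes(POOL["connectionTimeout"]))
print("cleanPoolInterval (min):", ms_to_minutes(POOL["cleanPoolInterval"]))
print("growth steps to max:    ",
      growth_steps(POOL["initialConnection"], POOL["maxConnection"], POOL["poolGrowth"]))
```

Under those assumptions an idle connection would time out after 30 minutes, the pool would be cleaned every 10 minutes, and it would take 29 growth steps to reach 150 connections.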

Does anybody know the role of F90705 (Real Time Events)? We are getting many of these messages: "Failed to get event defintion for event RTCOSTOUT. Please check is event definition correct?" Are these relevant? We are investigating them and will probably remove all records from F90701 and F90705.

Thanks again for the replies, I will let you know if we have any luck.

Regards

Andrew
 

CGIRoman99

Member
I would try one of two things: stop clicking so fast, or hard-code a delay... just kidding.

We had a performance issue on our iSeries as well, although we were never able to reproduce it at will. At first we ended our services just as you do; then we let the system "ride out the storm" to see what would happen. It did eventually recover. The recovery time was different every time (it depended on the amount of cleanup), but it always recovered.

After some consultation with IBM we changed our garbage collection settings in WAS to the complete opposite of everything we had seen documented for tuning this setting.

Initially we changed our settings to a minimum heap size of 1GB and a maximum heap size of 10GB, and our performance improved. After about a month we still had issues every once in a while, so we changed our settings to a minimum of 2GB and a maximum of 8GB. We also created one memory pool just for our QEJBAS5 subsystem and one pool just for HTTP, and changed the minimum memory allocation for the QEJBAS5 subsystem to 17% of the total memory on our system. We are running a 570 with 1TB of disk, 60GB of memory and 4 processors allocated to this particular partition.
 

HolderAndrew

Well Known Member
Hi,

The system works perfectly without any users! You are probably right: forcing a time delay of about 5 seconds would probably fix the problem. I don't think I can get the customer to accept this though!

Thanks for the info, I will pass it on to our iSeries guys to test. I think the problem here is more fundamental, though. It can be replicated after restarting services with just a single user on the system, over a slim set of business data (about 10,000 F4211 lines).

During a normal day (although the days of normality seem a distant cry) we do also experience peaks and troughs where the performance degrades and then recovers. I think that your input might help in those scenarios.

When the issue with clicking Find after inquiring on an order in P42101 occurs, the system sometimes recovers if we wait about 5 minutes. But if we get a semaphore wait (SEMW), this normally means that other jobs also start to go into SEMW/MTXW and eventually we have to restart services. When this happens with all users on the system, we notice that every user starts to see problems (e.g. buttons disappearing, JAS messages appearing, the application hanging, etc.).

We will keep looking and I will post the solution when we find it.

Regards

Andrew
 

cdawes

VIP Member
The JDBJ and JAS settings look good. The only thing I can think of is increasing maxConnection to a higher value. I've run out of DB connections before, but more on Oracle/DB2 UDB.

[JDBj-CONNECTION POOL]
maxConnection=150 --> 200?

Check IBM Redbook SG246359. Set the following:

[JDBj-RUNTIME PROPERTIES]
as400PackageLibrary=QRECOVERY


The only other thing I can think of is running out of memory inside the JVM. If users are consistently closing the browser with the "X" instead of logging out, you will be very limited for memory inside the JVM. Garbage collection will occur more frequently and eventually you get the dreaded java.lang.OutOfMemoryError. The only solution is to get new users or increase the size of the JVM.

Colin
 

HolderAndrew

Well Known Member
Hi Colin,

Thanks again. I mentioned in my last post that the problem can be replicated when there is only 1 user on the system, immediately after restarting services. So I don't think that it has anything to do with the number of allowed connections or running out of memory.

We are still looking .....

Rgds

Andrew
 

HolderAndrew

Well Known Member
Hi Sebastian,

Thanks for reply again. There is 1 latest ESU which we haven't applied yet as it affects many objects in system 42. The objects affected do not include F4211 though. Even if it did I would expect the problem to exist on our other pathcodes?

Rgds

Andrew
 

cassh1

Well Known Member
This may not be the same problem but we had a nasty SEMW issue with Xe - fixed with SAR 7487608 (in SP23J1 I believe, we had to get a custom one-off). It sounds like the same symptoms. Maybe they fixed it for Xe but not 8.11? Stranger things have happened.
Have a look at the SAR.

Sue Shaw
Xe Coexistent SP23i1 iSeries V5R3
 

HolderAndrew

Well Known Member
Thanks for the tip on the MTXW monitor and for the info on the SAR. I think we have at last managed to solve the problem, although at the moment we do not really understand how!

The F4201 and F4211 tables (and indexes) were regenerated onto the iSeries from within the toolset (as suggested by Sebastian) and the data restored. The problem has now disappeared!

Corrupt specs may be a possibility, but then why could the problem not be replicated on our test machine, given that the specs were rebuilt there and deployed to prod? Maybe there were (or are) authorisation problems, as we have noticed that tables (and other object types) have different authorities (a mixture of JDE, PSFT and ONEWORLD).

We are now holding an internal review of how these authority discrepancies arise, although we do not know that they are the reason the problem was occurring.

One thing I did notice was that, when the MTXW appeared, the call object kernel job went through the statuses DEQW-->TIMA-->RUN-->TIMA-->RUN-->TIMA-->MTXW. After the F4211/F4201 objects were regenerated, the status of the kernel job went DEQW-->RUN-->DEQW immediately, and there were no problems. This happened each time we tried to reproduce the problem.
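
For anyone who wants to watch for the same pattern, here is a minimal Python sketch that classifies a sampled status sequence like the two above. It only works on status strings you have already collected by hand; it does not query the iSeries. The function names, the "-->" trace format and the three labels are made up for illustration, and reading DEQW as "idle, waiting for work" is an assumption.

```python
# Job statuses that this thread associates with the hang.
LOCK_WAIT_STATES = {"MTXW", "SEMW"}


def parse_trace(trace):
    """Split 'DEQW-->RUN-->DEQW' into ['DEQW', 'RUN', 'DEQW']."""
    return [status.strip() for status in trace.split("-->") if status.strip()]


def diagnose(trace):
    """Classify a kernel job by the last status observed in the trace."""
    statuses = parse_trace(trace)
    if not statuses:
        return "no samples"
    last = statuses[-1]
    if last in LOCK_WAIT_STATES:
        return f"stuck in {last}"
    if last == "DEQW":
        return "idle"   # assumed meaning: waiting on a queue for the next request
    return "busy"


print(diagnose("DEQW-->TIMA-->RUN-->TIMA-->RUN-->TIMA-->MTXW"))
print(diagnose("DEQW-->RUN-->DEQW"))
```

The first trace is the failing sequence described above and the second is the healthy one seen after the tables were regenerated.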

Thanks again for all input to our problem - we can now move on to our other business critical problems! E1 does keep us busy!

Rgds

Andrew
 