Moved LDAP to a different DC, intermittent UBE failures

cbartlet

Hi all

Don't know if anyone can help with this problem.

We are running E1 8.12, TR 8.98.4.1, on iSeries V5R4 for the Enterprise Server.

We have been using LDAP for E1 for over 2 years with no problems. However, our DCs are built on Win2003, so we are building 2008 DCs and migrating to those.

So I added a new config via P95928, activated it, deactivated the old one, restarted E1, and tested online login and some UBEs. All good.

The next day we had no problems with users logging in at all, and UBEs ran through the day with no problem.

The problems started when our overnight batch schedule ran. Most UBEs completed, but a handful failed with "runube: jdeSecGetExtendedTokenByPwd failed"

A handful of our security kernels had this in the log:

7052/148 MAIN_THREAD Wed Mar 27 21:44:59.753560 netqueue.c2763
putExternalQueue0x04 (kernel) failed for msg id 9, pid=7106, queue name=<Krnl7106RspQ>, lastIPCError=<eIPCNotFound>.

7052/148 MAIN_THREAD Wed Mar 27 21:44:59.754528 jdeksec.c1571
JDENET Error = JDENET eIPCErr: eIPCNotFound
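To check whether these errors really cluster overnight, the kernel logs can be bucketed by hour. The following is only a rough sketch: the header layout is inferred from the two entries above, and `errors_by_hour` is a made-up helper name, not anything shipped with E1.

```python
import re
from collections import Counter

# Header layout assumed from the two log entries quoted above:
# <pid>/<thread id> <thread name> <weekday> <month> <day> <HH:MM:SS.ffffff> <source file and line>
HEADER = re.compile(
    r"^(?P<pid>\d+)/(?P<tid>\d+)\s+(?P<thread>\S+)\s+"
    r"(?P<dow>\w{3}) (?P<mon>\w{3})\s+(?P<day>\d{1,2}) "
    r"(?P<hour>\d{2}):(?P<minute>\d{2}):(?P<second>\d{2})\.\d+\s+"
    r"(?P<source>\S+)$"
)

def errors_by_hour(log_text, needle="eIPCNotFound"):
    """Count log entries whose message contains `needle`, keyed by hour of day."""
    counts = Counter()
    current_hour = None
    for line in log_text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            # A new entry starts; remember its hour for the message lines that follow.
            current_hour = int(match.group("hour"))
        elif current_hour is not None and needle in line:
            counts[current_hour] += 1
            current_hour = None  # count each entry at most once
    return counts

sample = """\
7052/148 MAIN_THREAD Wed Mar 27 21:44:59.753560 netqueue.c2763
putExternalQueue0x04 (kernel) failed for msg id 9, pid=7106, queue name=<Krnl7106RspQ>, lastIPCError=<eIPCNotFound>.

7052/148 MAIN_THREAD Wed Mar 27 21:44:59.754528 jdeksec.c1571
JDENET Error = JDENET eIPCErr: eIPCNotFound
"""
print(errors_by_hour(sample))
```

Running it over a full day of security kernel logs would show whether the failures line up with the quiet hours or with the batch schedule itself.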

Our Server Manager was alerting with lots of these messages:

Event Type: Outstanding Requests
Event Time: 27/03/13 21:43
Managed Instance: RAVEN_ES_6014
Additional Information:
Enterprise Server Process Detail
======================================
OS Status: 1 (RUNNING)
E1 Status: 1 (BUSY WITH STHG)
AS/400 Job Id: 958609
Process Id: 7052
User Id: 387
Group Id: 107
Process Type: 0 [Kernel Process]
Start Time: 27/03/13 20:09
Last Updated: 27/03/13 20:09
Process Name: SECURITY KERNEL

UBEs are submitted using the RUNUBE command on the iSeries.

Thinking it might just be a rogue security kernel, we restarted E1 on each of the next 2 nights, but the problems persisted.

Eventually we had to back out the change and point back to the original DC. We restarted E1 again and all is back to normal.

I can't figure out what is going on. The new DC looks fine, with no resource issues overnight.

Has anyone had any experience of this sort of thing happening?

I have logged this with Oracle and they are investigating, but they say it isn't an LDAP problem, which I don't believe, as the only change made was the DC.

Thanks in advance.

Chris
 
I've run into something similar. The root cause was that the Security Kernel would open an LDAP bind to a specific AD server, then for some reason lose the connection (e.g. the AD server rebooted). The underlying OS-level function called by E1, ldap_search_s(), would not let go. So the Security Kernel got stuck waiting for the OS call to return, stopped handling incoming security requests, and stacked up lots of Outstanding Requests on its message queue.

The quick fix was to kill Security Kernels with outstanding requests to allow a new one to fire up and create new ldap binds. Things were more stable when the E1 ldap configuration was set to talk to a single AD server. The final solution was to create a DNS entry to resolve to several more reliable AD servers located in the data centers, and to insist that CNCs be notified when the AD servers were rebooted.
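To illustrate the kind of hang described above, here is a minimal sketch using plain TCP sockets. It is not LDAP and not E1 code; the "silent" listener simply stands in for a server that accepts a connection and then never answers, and `read_reply` is a made-up helper. With `timeout=None` the read would wait indefinitely, which is the behaviour that leaves a kernel stuck; with a timeout the caller gets control back.

```python
import socket

def read_reply(host, port, timeout=None):
    """Send a request and wait for a reply; timeout=None means wait forever."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.settimeout(timeout)
        conn.sendall(b"request")
        try:
            return conn.recv(1024)
        except socket.timeout:
            return None  # no answer in time; the caller can drop the connection and reconnect

# Stand-in for a server that has gone quiet: the OS completes the TCP
# handshake for the listening socket, but nothing ever reads or replies.
silent = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
silent.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
silent.listen(1)
host, port = silent.getsockname()

reply = read_reply(host, port, timeout=0.5)
print("reply:", reply)
silent.close()
```

The sketch only shows why a synchronous call with no timeout can block a single-threaded worker; whether E1 exposes any timeout setting for its LDAP calls is a question for Oracle.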

-- David
 
David

Thanks for the reply. We did some more testing and tried moving E1 LDAP to the 2008 DC again last night; it failed again with the same errors. We managed to get debug logs off the security kernels, so I have sent those to Oracle.

It does seem to be tripping up and timing out. I see some timeout errors in the logs, but we are pointing to one DC only, by IP address, and it wasn't rebooted.

During the day, when things are busy, we get no errors at all, only overnight when it's quieter. It's like LDAP on the DC goes to sleep and doesn't respond, the security kernel times out and then hangs as you suggest.

Have reviewed LDAP policy values on the DC and they are the same as the 2003 server.

We have 10 security kernels - is this too many for 500 users? If we reduced the number, would that drive the remaining ones harder so they might not time out? Just a thought.
 
Here's something you can try: when you see a Security Kernel with outstanding requests, have a look at the call stack for its threads. I was able to do that on the Solaris command line using pstack. That's when I saw threads waiting in the ldap_search_s() function call. To me that suggests E1 is waiting to hear back from the other end of its LDAP connection, and it's not going to because, as you pointed out, it seems to have gone to sleep.
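If you can capture a stack dump to a text file with whatever tool your platform offers, a few lines of script will pick out the waiting frames. This is only a sketch: the sample dump below is invented and heavily simplified (real pstack output looks different), and `security_kernel_worker` is a hypothetical frame name.

```python
import re

def threads_stuck_in(stack_dump, function="ldap_search_s"):
    """Return the lines of a text stack dump that mention `function` as a whole word."""
    pattern = re.compile(r"\b" + re.escape(function) + r"\b")
    return [line.strip() for line in stack_dump.splitlines() if pattern.search(line)]

# Hypothetical, simplified dump text used only to exercise the function.
sample_dump = """\
thread 1
  poll
  ldap_search_s
  security_kernel_worker
thread 2
  msgrcv
  main
"""
hits = threads_stuck_in(sample_dump)
print(len(hits), "frame(s) waiting in ldap_search_s")
```

Taking two dumps a minute or so apart and finding the same thread in the same frame both times is a stronger sign of a hang than a single snapshot.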

I believe the recommendation is still 1 Security Kernel per 90 E1 users. I don't think it hurts to have more. For the problem we're talking about, having more kernels would sort of spread the problem out as individual kernels were affected, which might make the problem seem more intermittent from the end user's perspective.
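As a quick sanity check on the sizing, taking the 1-per-90 figure above as a rule of thumb from memory rather than a confirmed Oracle number:

```python
import math

def kernels_for(users, users_per_kernel=90):
    """Minimum number of kernels so that none serves more than `users_per_kernel` users."""
    return math.ceil(users / users_per_kernel)

print(kernels_for(500))  # 6
```

So 10 kernels for 500 users is above that minimum of 6, not below it.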
 