E1 to AD LDAP Failing After 7 Days

jdel6654

We upgraded from Tools Release 8.98.1.1 to 8.98.32 6 months ago. Since that upgrade we have had a problem with our E1 <-> Security Server <-> Active Directory LDAP communications. Almost exactly after 7 days, E1 services quit working. On the 8th day, the system just does not function.

The problem appears to center on LDAP communications. Looking at the UBE and COK kernel logs, I see authentication failure errors. Then, looking at the Security kernel logs, I see error messages stating:

"
Re-initializing LDAP connection f

1681424/5611 MAIN_THREAD Sat Jun 18 23:30:03.314152 secldap.c154
"

The problem seems to correct itself and LDAP communications are reestablished during the first 7 days. After the 7th day, there are large numbers of these errors. JAS logins fail and scheduler jobs do not run. The only way to fix the problem is to bounce ES services. We now bounce services every Sunday to avoid this problem. So far, that is our workaround.
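One way to check whether the errors really ramp up after day 7 is to count them per day in the Security kernel logs. Below is a minimal Python sketch, not anything Oracle supplies. It assumes every log entry has a header line shaped like the one quoted above (pid/tid, thread name, day, month, date, time, source file) and it treats any header whose source file starts with "secldap.c" as an LDAP event. The function name and that matching rule are illustrative assumptions.

```python
import re
from collections import Counter

# Header shape taken from the excerpt quoted in this post, e.g.
# "1681424/5611 MAIN_THREAD Sat Jun 18 23:30:03.314152 secldap.c154"
HEADER = re.compile(
    r"^\d+/\d+\s+\S+\s+"                                  # pid/tid, thread name
    r"(?P<dow>\w{3})\s+(?P<mon>\w{3})\s+(?P<day>\d{1,2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2})\.\d+\s+"                # time with fraction
    r"(?P<src>\S+)\s*$"                                   # source file + line
)

def count_ldap_events_per_day(lines, source_prefix="secldap.c"):
    """Count header lines logged from the given source file, per calendar day.

    The log header carries no year, so days are keyed as "Mon DD".
    """
    counts = Counter()
    for line in lines:
        match = HEADER.match(line.strip())
        if match and match.group("src").startswith(source_prefix):
            key = f"{match.group('mon')} {int(match.group('day')):02d}"
            counts[key] += 1
    return counts
```

Feeding it the lines of each Security kernel log and comparing the daily totals before and after the 7-day mark should show whether the failures grow gradually or start abruptly.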

Oracle gave us a diagnostic patch that captured the LDAP error code as "LDAP_STRONG_AUTH_REQUIRED". If you google this, the usual explanation is that AD "LDAP signing" is turned on. However, it can also mean a general disruption in TCP communications. Since we don't have LDAP signing turned on in Active Directory, our AD admins don't know what the problem is.

Long story short, I'm stumped. We didn't have this problem at 8.98.1.1 and we don't have the same problems in our development and prototype environments (they use AD LDAP too). It's just production. I don't believe this is a token expiration or SSO issue because we have the token settings set up for 30 days.

Is anyone else having a problem similar to this with AD LDAP?
 
If you bounce the services on, say, a Tuesday, does the problem interval remain at 7 days or does it revert to the weekend?
 
The problem is that, because it's production, I am not permitted to test that scenario specifically. It seems to be consistently 7 days.
 
When do your call object kernels recycle? Usually every 7 days.

For LDAP, are you isolating to the Global Catalogue? That is, what port are you connecting to?
 
I am isolating to 2 specific OUs in the GC. I have a JDE groups OU and the users OU.

This is what I have for kernel recycling:
[RECYCLING]
krnlRecycleTimeOfDay=
krnlRecycleElapsedTime=
inactiveUserTimeout=6:00
timeToForcedExit=12:00
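
To compare these settings across production, development and prototype, a small script can list which recycling keys are blank in each environment's INI file. This is only a sketch: it assumes the section parses as plain INI text with "=" as the only delimiter, and the function names are made up for illustration. It says nothing about what the server does when a key is left blank.

```python
import configparser

# The [RECYCLING] section exactly as posted above.
POSTED_SETTINGS = """
[RECYCLING]
krnlRecycleTimeOfDay=
krnlRecycleElapsedTime=
inactiveUserTimeout=6:00
timeToForcedExit=12:00
"""

def recycling_settings(ini_text):
    """Return the [RECYCLING] keys and values as a dict (empty if absent)."""
    parser = configparser.ConfigParser(
        interpolation=None,   # values are literal, no %(...)s expansion
        strict=False,         # tolerate duplicate keys or sections
        delimiters=("=",),    # keep "6:00" intact as a value
    )
    parser.optionxform = str  # preserve the mixed-case key names
    parser.read_string(ini_text)
    if not parser.has_section("RECYCLING"):
        return {}
    return dict(parser.items("RECYCLING"))

def blank_keys(settings):
    """Names of keys whose value is empty, in sorted order."""
    return sorted(key for key, value in settings.items() if not value.strip())
```

Running `blank_keys(recycling_settings(POSTED_SETTINGS))` on the settings above reports the two recycle keys, so any difference from the other environments is easy to spot.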
 
I saw something similar at a client with Solaris on 8.97 (E1 8.12). In that case there was an additional symptom whereby outstanding requests would grow for the affected Security Kernels. That means messages were coming in but not being handled. (And not all Security Kernels were affected so some users were able to function just fine. Gotta love those intermittent issues!)

Are you also seeing outstanding requests for the Security Kernels? (If you're on Unix, the "netwm" command is handy to show that info.)

I was able to turn on debug in production. The logs were huge, but when the problem occurred again I found that the JDE code was calling out to ldap_search_s() and not coming back.

As a work-around I found that I could avoid bouncing E1 services by simply killing the Security Kernels. The client sessions do a good job of reconnecting, and if a new Security Kernel is needed it's automatically created with a fresh LDAP bind.

In the end I created a cron script that looks for Security Kernels with too many outstanding requests and kills them automatically. It works great.
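
The general idea of such a watchdog can be sketched in Python. This is not the original cron script. It assumes the kernel status has already been reduced to hypothetical "<pid> <outstanding requests>" lines (real "netwm" output would need its own parsing), and the threshold, the function names and the use of SIGTERM are illustrative assumptions.

```python
import os
import signal

def kernels_over_threshold(report_lines, threshold):
    """Return pids whose outstanding-request count exceeds the threshold.

    Each line is expected to be "<pid> <outstanding>" (a hypothetical,
    pre-processed format); anything else, such as a header, is skipped.
    """
    stuck = []
    for line in report_lines:
        parts = line.split()
        if len(parts) != 2 or not all(part.isdigit() for part in parts):
            continue
        pid, outstanding = int(parts[0]), int(parts[1])
        if outstanding > threshold:
            stuck.append(pid)
    return stuck

def kill_kernels(pids, dry_run=True):
    """Terminate the given Security Kernel processes.

    dry_run defaults to True so a cron entry can first be tested by
    logging what it would do. SIGTERM is an assumed choice of signal.
    """
    for pid in pids:
        if dry_run:
            print(f"would kill security kernel pid {pid}")
        else:
            os.kill(pid, signal.SIGTERM)
```

A cron job would build the report lines, call `kernels_over_threshold`, and pass the result to `kill_kernels`. Keeping selection separate from killing lets the threshold be tuned in dry-run mode before anything is terminated.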

I hope that helps!

David Scheeff
 