Bulldog,
I have a similar issue with one of my customers.
Customer is on iSeries
Apps 9.1
Tools 9.1.0.4
WebSphere app server.
If we let their prod server run, the security kernel crash error appears in roughly 8 days. Interestingly, I see the issue from two directions. The users all access JDE on the iSeries, and they also have a Windows web server running WebSphere for some custom mobile apps. I see the impending crash on the Windows server first: the mobile app crashes and needs to be restarted. An hour or so after the Windows WebSphere apps crash, the iSeries WebSphere instances start to crash.
When we see this condition on the Windows servers, I find this message in the logs: "Cannot connect to any OneWorld Security Server.Failure in retrieving extended token from Security Kernel....."
When it starts to hit the iSeries, I see this message: "Sign on: error message ID = 340 (Security Service is down, please contact system adminstrator)"
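For anyone chasing the same symptoms, a minimal sketch of scanning a log file for those exact failure strings (the strings come from the messages quoted above; the function name and how you feed it lines are just illustrations):

```python
# Known failure strings from the WebSphere / JDE logs quoted above.
NEEDLES = (
    "Cannot connect to any OneWorld Security Server",
    "Failure in retrieving extended token from Security Kernel",
    "Security Service is down",
)

def find_kernel_errors(lines):
    """Yield (line_number, line) for each line containing a known failure string."""
    for n, line in enumerate(lines, start=1):
        if any(needle in line for needle in NEEDLES):
            yield n, line.rstrip()

# Example: pass an open log file, e.g.
#   with open("SystemOut.log") as f:          # path is hypothetical
#       for n, hit in find_kernel_errors(f):
#           print(n, hit)
```

Run something like this on a schedule against the web server logs and you get an early warning an hour or so before the iSeries instances start to go.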
They are running multiple security kernels, and not all of the kernels crash. If the users are persistent and try to log in a few times, they eventually hit a working security kernel and can get in.
They have a prod and a non-prod instance on their iSeries, and I only see the issue on the prod server. Oracle support could not find anything; Oracle tried to blame the issue on WebSphere.
I looked at all of the various timeout settings and scoured the knowledge jungle for clues. We have all of the recommended settings in place, but the issue persists and is consistent. The closest hit I found for their issue was Oracle document 885414.1. In that document, Oracle points out some issues with the regular and extended token lifetimes, lists some parameters for optimizing the token lifetime settings, and says they are trying to resolve this in a later tools release. I'm still holding my breath.
Our workaround was a scheduled reboot process: we cycle JDE nightly. With that process in place, we have not seen a recurrence of the security kernel issue.
We have added extensive scripting and testing around this nightly reboot to ensure that the customer's system reboots cleanly. If there are issues that our scripts can't handle, we have a secondary process that proactively tests their system 24/7. If their system has a meltdown, our automated system sends an alert to me or a member of my CNC team to start troubleshooting, before the customer even knows they are having an issue.
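Our monitoring is custom, but the core of the proactive test is just a periodic connectivity probe. A rough sketch, assuming a hypothetical host name and JDENET port (check your own jde.ini for the real listen port, and wire in your actual email/pager alerting where noted):

```python
import socket

def port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_jde(host: str = "jdeprod", port: int = 6014) -> str:
    """Probe the enterprise server's JDENET listen port.

    'jdeprod' and 6014 are placeholders, not the customer's real values.
    """
    if port_open(host, port):
        return f"OK: {host}:{port} answering"
    # Hook your real alerting (email/pager) in here; the string is a stand-in.
    return f"ALERT: {host}:{port} not answering"
```

Schedule it from cron or your job scheduler every few minutes; a failed probe is your cue to start troubleshooting before the users notice.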
- Gregg