Anyone out there using MS-Cluster?

CNC_Kid

Active Member
Hi Folks,

Made the posting below a while back and must admit I'm surprised no one has hit a similar issue! Is anyone out there configured similarly at all?

<Begin Original Post>
We have a config of an MS clustered (Active/Passive) App Server and a clustered (Active/Passive) Database (Oracle 9i) server. We run subsystem jobs and I'm looking for ideas on automating the cleardown and startup of these subsystem jobs if a failover occurs. I had originally thought of running a command file to manage this, but to date have not found a way of having this command file executed should a failover occur :crazy:

Of course lateral thinking ideas are welcome!

PeopleSoft EnterpriseOne 8.0 SP22_M1
<End Original Post>
 
I think people are avoiding this one because it is messy.

With SQL Server, we have the "Start automatically when SQL Server Agent starts" scheduler option. That coupled with a "Cluster GROUP <groupname> /OFF" and then a "Cluster GROUP <groupname> /ON" could cycle the JDE cluster. If the Oracle and JDE cluster groups do not share hardware, this would need to be done using the PSEXEC command from Sysinternals.
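The cycle described above could be sketched roughly like this. This is only a sketch: the group name "JDE", the node name DBNODE1, and the tool paths are placeholders for your own setup, and PsExec is only needed when the JDE and Oracle groups live on different hardware.

```bat
@echo off
rem Sketch: bounce the JDE cluster group so the "start automatically"
rem scheduler option fires again. Names below are placeholders.

rem If running locally on a node that owns the group:
cluster group "JDE" /offline
cluster group "JDE" /online

rem If the groups are on separate hardware, run it remotely via
rem Sysinternals PSEXEC instead:
psexec \\DBNODE1 cluster group "JDE" /offline
psexec \\DBNODE1 cluster group "JDE" /online
```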

Here are some of the problems:
1. You are running Oracle; if Oracle does not have a similar scheduler option, you will have to find another service to bind to the cluster group. AUTOEXNT from the W2K resource kit might fulfil that requirement if bound to the Oracle cluster group.
2. Typically, if the RDBMS goes offline before JDE goes offline, JDE will not shut down cleanly. I do not believe that MS Cluster will kill the run-away orphan jde_kernels. You could use PSEXEC to launch a PsTools program to enumerate these threads and then use PSKILL to kill them, but that would be a very messy solution. A less messy and more robust solution would be to offline the cluster group, reboot the cluster node and then online the group. The problem with that solution is that high-end nodes in cluster groups often take as long as 20 minutes to boot (e.g. a Dell 6650 with 32GB of RAM). That defeats the idea of only having a few minutes' outage.
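For what it's worth, the "messy" PsTools route from point 2 might look something like the sketch below. The node name APPNODE1 is a placeholder, and the kernel process name (jdenet_k here) can differ by release, so verify against your own app server before trusting this.

```bat
@echo off
rem Sketch: enumerate and kill orphaned JDE kernel processes on the
rem app node after a dirty shutdown. PsList/PsKill are from the
rem Sysinternals PsTools suite; APPNODE1 is a placeholder.

rem List any surviving kernel processes on the remote node:
pslist \\APPNODE1 jdenet_k

rem Kill them by name (blunt -- only after the services are stopped):
pskill \\APPNODE1 jdenet_k
```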

I hope some of this gets you moving in the right direction.
 
We do run MS Cluster services and subsystem jobs as well but are not really automated for restarting the subsystem jobs in a failover.

We do start and stop the subsystem jobs using a batch file with the runube command in it (to start it) and a SQL statement (to stop it). On a Monday morning we start the subsystem jobs through a scheduled task. We have the batch file scheduled on both nodes. On the node where the application server is not resident, it just fails to execute. We also schedule the stop command on a Friday night on both servers.
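A rough shape of that start/stop pair, for anyone wanting to copy the idea. Every name here is a placeholder (user, environment, report/version, queue, and the stop script), the exact runube argument list varies by release, and the actual stop SQL is site-specific, so check the usage on your own enterprise server first.

```bat
@echo off
rem Sketch of the scheduled start batch file. On the node that does not
rem currently own the app server group, runube simply fails, which is
rem the behaviour relied on above.
rem Argument order varies by release -- run "runube" with no args to
rem see the expected usage on your server.
runube JDEUSER JDEPWD PD7333 R03B50D XJDE0001 QBATCH Batch Hold Save

rem Sketch of the scheduled stop: the SQL that flags the subsystem job
rem to end is kept in a site-specific script (placeholder name below).
sqlplus jde/password@jdedb @stop_subsystem.sql
```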

I don't know what your situation is, but ours is that we really only use the subsystem job between 7.00am and 5.00pm Monday to Friday. If we get a failover overnight, the helpdesk just runs the batch file to start it when we get the first complaint. If the failover happens over the weekend, then the normal Monday morning schedule starts the process on the correct server.

It probably doesn't solve your problem, but it works well enough for us.

Regards
Marty
 
Hi Jeremy,

Thanks for replying on this and the tip for AUTOEXNT. I'm hoping it has possibilities. AFAICR the MS Cluster does kill the run-away orphan jde_kernels but it's SLOW..........
Depending on where you're at a COLD boot of system can be quicker ;)
As our database is on its own cluster we're exploiting the reduced possibility of both failing at the same time. We have a dual feed of power into all machines, and the App and DB servers are "criss-crossed" on separate racks.
It's ironic really that, despite all of the above, I'm still having to look for a solution to manage the subsystem during failover of the App server. Ah well, that's computing for you!
 
We used to use MS clustering to handle 350+ users. The problem is that you are still dependent on the shared storage, and do you really want the server to fail over automatically without you knowing why in the first place? In our case (even with dual power supplies, dual controllers, dual HBAs, etc.) the shared storage failed. (The controllers need to communicate amongst themselves at the backplane level, and that is the weak link.) So now we have opted for a dual standalone system with third-party online SQL database replication software. Works a treat, and we do not rely on any particular hardware vendor to perform data mirroring between our servers. We can use fibre on the primary site and SCSI for the backup system. In our case the SCSI outguns the fibre, as 2Gbit fibre = 256MB/s per channel whereas the SCSI controllers give 320MB/s.

We use different sets of ODBCs to access the individual SQL servers. In the event of a failure we have minimal downtime, which we are manually in control of.

For our application servers we have a single load balanced logical source called APPBAL behind which reside 3 application servers. If one fails the remaining servers handle the load.

It all depends on what you want. In our case having manual control over the system was the priority.
 
Thanks for your input. We have the cluster configured to fail over but not fail back. As for the server failing over automatically, I would think that if something "triggers" a failover then there is probably a good reason for it. I'd nearly prefer to look at the why afterwards and have the system remain alive for the end users.

As you've pointed out, the shared storage (in our case located in a SAN) is a single point of failure. We have not 'sprung' for a second SAN with replication configured.

Another influencing factor in our case is that we are not 24 x 7 and do not have support staff covering all shifts, hence the leaning towards automatic failover.

A question for you (if you don't mind).
[1] You mention you used to use MS clustering. What prompted your change to your current configuration?

Must admit I'm surprised that you can manually fail over quicker than an automatic failover.
 
My experience with failover clustering has been that it is still very immature. I personally believe that the risk of system outage due to the immaturity of clustering is much higher than the risk of single-server hardware failure.

DISCLAIMER: The preceding was just the thoughts of one techie and do not reflect any formal analysis.
 
The main issue we had was with the cluster shared storage. The device failed even though it was totally redundant (dual controllers, dual power, dual separate UPS supplies, dual HBAs on the servers, etc.), and data still got corrupted.

It takes us 6 hours+ to restore a full copy of our data.

In my opinion the cluster disk gives slower throughput than in native mode.

There were problems with the JDE service failing over to the secondary server. Sometimes it did and sometimes it didn't. We are using OWXE Update 7. When we had jobs in 'S' or 'P' status, on failover these status flags did not get cleared and we had to modify the F986110 table manually to put the jobs into 'E' (error) status. This meant that we still had to stop the JDE services manually, clear the table, then restart the services.
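The manual F986110 cleanup described above amounts to something like the statement below. This assumes the standard column names (JCJOBSTS for the job status, JCEXEHOST for the execution host); verify them against your own release, and APPNODE1 is a placeholder for the failed node's host name.

```sql
-- Sketch: after a failover, flip stranded wait ('S') and processing
-- ('P') jobs to error ('E') so the services can be restarted cleanly.
UPDATE F986110
   SET JCJOBSTS = 'E'
 WHERE JCJOBSTS IN ('S', 'P')
   AND JCEXEHOST = 'APPNODE1';  -- placeholder for the failed app node
```

Run this only with the JDE services stopped, as the post above describes, then restart them.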

We now use a load balancing configuration for our application servers.

The manual failover process is less than 10 minutes which is OK for us.
 
Can you tell me which product you are using for replication that you refer to in the statement "So now we have opted for a dual standalone system with a third party online SQL database replication software" ?

Can you explain your setup a little more? Is the primary a fibre-connected SAN and the secondary/backup traditional SCSI arrays?

Thanks in advance for any info.
Barry
 