Oracle GoldenGate Failover Strategy

This post deals with the failover strategy I decided to use for a GoldenGate Hub architecture
It works like this:

The technologies involved

Overview Of GG Component Roles
Since Oracle has all of the components to effectively implement this I decided to stay within the Oracle stack completely

Oracle RAC
Oracle DataGuard
Oracle Clusterware (CRS)
Oracle Grid Infrastructure Agents / XAG (9.1)
Oracle DBFS
Oracle GoldenGate

Oracle Real Application Clusters (RAC)
GoldenGate uses database and GRID infrastructure for several key processes.
Clusterware resources are defined for both DBFS and GoldenGate.
Definition for failover is also located here for the resiliency of the databases that are hosted here for DataGuard.
File systems are defined in the database with ability to be switched and failed over via Dataguard.

Oracle DataGuard
Oracle’s recommended strategy for failover of GoldenGate processes and DBFS.
ACFS is explicitly not recommended due to lack of geographic failover in automated fashion.
Seamless integration w/Oracle database
Can be mostly automated
Well understood process for most Oracle DBA’s
Used on virtually all resiliency patterns for Oracle databases

Oracle Clusterware (CRS) Cluster Ready Services
Used for defining service architecture
GoldenGate requires service-based architecture to achieve local and geographic resiliency
XAG is a pass-thru for CRS definitions for GoldenGate
Allows clusterware to detect state transitions between primary and standby
Role based startup and shutdown during failover/switchover
Establishes dependencies on DBFS host database

Oracle Grid Infrastructure Agents / XAG (ver 9.1)

The GRID Infrastructure components that provide HA to application resources like GoldenGate
Provides a “ready-to-use” template for application HA
Could be defined as a wrapper for CRS definition of resources for GoldenGate
Pre-defined clusterware resources for GoldenGate, simplifies dependency creation, however does not eliminate knowing the mix of parameters to make it all work properly
Provides a replacement interface (AGCTL) to start, stop, relocate, and manage the location of the environment.
Operates based on the status and role of the database (database up,down and primary or standby)
Forces the use of defining GoldenGate “instances”

Oracle Database File System (DBFS)
Oracle Database File System
Provides file system entirely hosted within database
Allows for DataGuard to failover easily without additional work
Simply provides resiliency to file systems hosting GoldenGate architecture components and trail file and recovery data

Oracle Overall Product Stack Integration

100% Oracle technology stack
Robust resiliency
Downside is complexity for configuration
Once setup it works !!!!!
Takes a lot of testing (weeks and months) of many scenarios to get it right
Very little to no documentation on the whole recipe from Oracle
Takes understanding of all of the components of the stack to get it right
Once it’s right, it rocks!!!!

DBFS and GoldenGate Hub

There are (2) separate classes of DBFS disk, discovered as a result of the Oracle consulting engagement
GoldenGate “common” disk per instance of GoldenGate
Project specific disk per project and failover clusters for GoldenGate
DBFS common disk for certain GoldenGate directories in an admin capacity
DBFS specific disk allocated to trail files for specific projects
Mountpoint will be identified by /dbfs_XXXXXXXX
Mountpoint = /dbfs_xxxxxxxx filesystem =
Naming is “not” nodally dependent and has no relationship to nodal affinity, any instance of GoldenGate can and should be able to run on any node in the local or remote cluster.
File system naming will be consistent to identify GoldenGate instances
ggxxyzzz format
gg = GoldenGate instance
xx = geographic location of the instance like “tx” = Texas
y = environment type like “d” for development
zzz = sequential number “001” 1st environment

DBFS Disk in GG Hub Architecture

Each node running GoldenGate can have one or more “instances” of GoldenGate running, at a minimum each node in a cluster will run one instance of GoldenGate
Having multiple instances per node can be a by-product or a demand of isolation or scalability.
Each instance of GoldenGate needs it own DBFS filesystem from an admin and overhead standpoint. This is strictly for overhead components of the instance and not for client consumption.
Each project will need to estimate the amount of volume for DBFS project specific data, for trail files and recovery. Since this is project specific the GoldenGate can add whatever is determined to be appropriate for trail file definition.
For example a 6 node clustered pair will require minimum of 7 DBFS mounts, one for each of the 6 instances, and at least one for project specific trail files. If another version of GoldenGate were to be installed, this example doubles, as the admin DBFS cannot be utilized by a different set of binaries.

Code trees

There will need to be a code tree for each node and replicated on every other node.
For example: The directory /dbfs_ggtxd001/ggtxd001 will need to exist on all 6 nodes in a cluster pair. This is due to the failover scenarios could allow for the instance to run on any of the local cluster nodes, in addition to any of the nodes on the geographic failover cluster (only when in a cluster failover state)

Version Support

On each node there must be a unique DBFS common file system and code tree for each version of GoldenGate.
There must be a “new” install of GoldenGate for each version of source database supported.
Example:
Database Source version = 12.1
GoldenGate Version = 12.2
This equals one directory code tree, furthermore must be copied to “all” other nodes in the cluster pair, not just local cluster nodes. If we now want to support a 11g database on the source with the same version of GoldenGate we will need a complete new install of the code tree and copied to all clustered pairs. This includes setup of all resiliency components.

Failover Limitations

Currently the failover scenarios provide for:
Node-to-node failover
This would include all DBFS filesystem mounts on this node failing over to another node. If there are multiple applications / replications running in the instance, all of the replications are failed over and affected. Individual failover of a single set of replications within a GoldenGate instance is currently “not supported”, and is not being planned for at this time. The detail and effort to support this currently would be overwhelming, in addition, all of the process would be manual and largely susceptible to human error.
Geographic Cluster Failover
This would entail the “entire” cluster DBFS database which contains “all” common data for instances running on all of the nodes and project specific DBFS disk failing over. There is no individual application replication group failover. This is a complete failover of the entire all customers would failover running on that cluster. The GoldenGate hub was designed to achieve resiliency and not provide graceful “individual” application level switchover. The GoldenGate code tree itself is not failed over only the DBFS filesystems which has “state” data. It is unnecessary to failover code trees since they are duplicated on all available failover nodes. To date, there has been no design requirement to achieve individual application level switchover as far as GoldenGate processes are concerned.

Planning GoldenGate Hub Onboarding

Placement
Which clusters? Nodes? Geographic location?
Capacity Planning
Which nodes are underutilized, make sure to leave allocation for nodes to fail over to “this” node
How much CPU is enough, based on the replications requested?
How much disk? Transaction volume? IOPS is critical, having enough “arms” on the data is critical
Do we need to add nodes to environment to accommodate project? The cluster can add nodes and therefore capacity while remaining in the resiliency group already established.
Do we need to add clusters? Should the application be on an isolated set of nodes? Resiliency can be extended to add additional clusters (DBFS databases).
Monitoring of existing processes to determine how to load-level the environment, the is overall oversight to analyze the environment over time to see how it behaves over time.

Disaster Recovery
What failover options are needed? Any at all? Local, geographic? Both?

Understanding the Clusters / Nodes
Each node stands alone, however each node can host a failed node from any other node in the cluster pair
A cluster pair is defined as two clusters reciprocating in a failover capacity ie: TX 3 node cluster and VA 3 node cluster are a clustered pair.
Either is designed and limited so that a cluster can take on the traffic from the other cluster in the pair
Definitions for each node must house the infrastructure from ALL other nodes in the cluster pair (local cluster and remote cluster)
Each instance of GoldenGate has an overhead infrastructure for itself independent of the client data or replication groups (common DBFS)

Understanding the Clusters / Nodes (part 2)
Each version of GoldenGate must support one and only one source version per code tree (Oracle verified)
A system will need to be devised to easily track code trees and a naming scheme that will eliminate or at least minimize error.
XAG (CRS) tracks GoldenGate instances at the cluster level not at the node level

Scripts
When configuring the environment, building the components and building the resiliency. Several hundred scripts were written in order to make the process of setup easier.
Directory structure
Script examples
Concept of function libraries

ITRemote

Ready for Action?