This paper addresses Oracle’s requirement that the interconnect be ‘private’ and ‘separate’ and evaluates this requirement in terms of latency and bandwidth in a shared Ethernet network environment. A shared Ethernet network, in an Oracle Clusterware interconnect context, is a network where a switch, network interface or network segment is configured to handle network traffic that is unrelated to the interconnect traffic. This unrelated traffic may be private interconnect traffic from consolidated databases or consolidated virtual environments, public traffic from adjacent Local Area Network (LAN) segments, storage traffic, backup and replication traffic or any other network traffic unrelated to the cluster interconnect. A shared Ethernet switch network is usually partitioned for broadcast isolation using Virtual Local Area Networks (VLANs). Partitioning may be configured at the port level on the switch (the most common configuration) for tagged or untagged VLANs, depending on the topology. Partitioning may also occur on the host network adapter using tagged VLANs. A shared Ethernet network implies shared switch and NIC resources and, as such, is potentially subject to increased contention, performance degradation and diminished availability.
This paper focuses on shared Ethernet switch and shared NIC VLAN configuration and deployment practices intended to optimize Oracle Clusterware interconnect performance and availability. Oracle Real Application Clusters (RAC) interconnect latency and bandwidth baselines are described in generic terms as guidelines and are not intended to apply to Oracle Engineered Systems. These baselines should be considered when deploying the Oracle Clusterware interconnect in a shared Ethernet network topology. The target audience is anyone with an interest in Oracle Clusterware interconnect network deployment requirements, specifically Architects, DBAs, System Administrators and Network Engineers. This paper is not intended to provide specific VLAN configuration guidance but will assist in network architectural design based on the requirements of the Oracle Clusterware interconnect, variable workload and the supporting network components.
Oracle has recommendations for the networking components of RAC.
UDP is the default interface protocol for Oracle RAC and Oracle Clusterware.
You must use a switch for the interconnect.
Oracle recommends that you use a dedicated switch.
Oracle does not support token-rings or crossover cables for the interconnect.
Each node must have at least two network adapters: one for the public network interface and the other for the private network interface or interconnect.
In addition, the interface names associated with the network adapters for each network must be the same on all nodes.
For the public network, each network adapter must support TCP/IP.
For the private network, the interconnect must support UDP (TCP for Windows) using high-speed network adapters and switches that support TCP/IP.
Gigabit Ethernet (minimum) or an equivalent is recommended.
Note: For a more complete list of supported protocols, see MetaLink Note: 278132.1.
Before starting the installation, each node requires an IP address and an associated host name registered in the DNS or the /etc/hosts file for each public network interface.
One unused virtual IP address and an associated VIP name registered in the DNS or the /etc/hosts file that you configure for the primary public network interface are needed for each node.
The virtual IP address must be in the same subnet as the associated public interface.
After installation, you can configure clients to use the VIP name or IP address. If a node fails, its virtual IP address fails over to another node. For the private IP address and optional host name for each private interface, Oracle recommends that you use private network IP addresses for these interfaces, for example, 10.*.*.* or 192.168.*.*. You can use the /etc/hosts file on each node to associate private host names with private IP addresses.
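The two addressing rules above (the VIP must share a subnet with the public interface, and the private interfaces should use private address ranges) can be checked with a few lines of shell. This is a minimal sketch, not an Oracle tool: the function names and the sample addresses are hypothetical, the 172.16.0.0/12 range is taken from RFC 1918 rather than from the text above, and only IPv4 dotted-quad addresses without leading zeros are handled.

```shell
# Convert a dotted-quad IPv4 address to a single integer.
ip_to_int() (
    IFS=.
    set -- $1                      # split the address into four octets
    echo "$(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))"
)

# Succeed if two addresses fall in the same subnet for a given prefix length.
# Usage: same_subnet ADDRESS1 ADDRESS2 PREFIX_LENGTH
same_subnet() (
    mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$2")
    [ $(( a & mask )) -eq $(( b & mask )) ]
)

# Succeed if the address lies in a private range. 10.* and 192.168.* come
# from the recommendation above; 172.16-31.* is added from RFC 1918.
is_private_ipv4() {
    case "$1" in
        10.*|192.168.*) return 0 ;;
        172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical addresses: a public interface, its VIP, and a private interface.
same_subnet 192.168.10.11 192.168.10.111 24 && echo "VIP is in the public subnet"
is_private_ipv4 10.0.0.1 && echo "interconnect address is private"
```

A failed check returns a non-zero status, so the functions can be used directly in a pre-installation script.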
The interconnect network is a critical component in a RAC environment. It is very important to minimize its latency, increase its throughput and avoid the loss of UDP packets.
There’s more than a heartbeat
Sometimes, when we work with system or network administrators, we find they expect a “classic” interconnect network for a RAC deployment, that is, a network used only to exchange configuration and keepalive information between nodes. With Oracle RAC, we must treat the interconnect like an I/O system, because Cache Fusion uses it to send data blocks from one cluster node to any other without first spending time writing them to disk so that the other instance can read the current value. As a result, we can see heavy network traffic made up of 8 KiB blocks (if the database uses the default block size).
Some years ago, Oracle said it was necessary to use a dedicated physical network, separate from the public network, as the interconnect. Many customers did not use dedicated hardware for this; instead, the interconnect was integrated into the consolidated LAN infrastructure, using specific VLANs for these networks. In 2012, Oracle published a white paper called Oracle Real Application Clusters (RAC) and Oracle Clusterware Interconnect Virtual Local Area Networks (VLANs) Deployment Considerations. This document discusses the concepts of private and separate.
In summary, we should keep in mind:
We can use tagged or untagged VLANs; dedicated hardware is not necessarily required.
RAC servers should be connected with OSI layer 2 adjacency, within the same broadcast domain and with single-hop communication.
Disabling or restricting STP (Spanning Tree Protocol) is very important to avoid traffic suspension that could result in a split brain.
Enable pruning or private VLANs, so that multicast and broadcast traffic is never propagated beyond the access layer.
Additionally, from 11.2.0.2 onward, we need to enable multicast traffic for network 230.0.1.0, or, from 11.2.0.2.1, for network 224.0.0.251. Further information is available in the MOS document Grid Infrastructure Startup During Patching, Install or Upgrade May Fail Due to Multicasting Requirement (Doc ID 1212703.1). That note also includes a tool for checking multicast traffic in a network.
The interconnect in an Oracle RAC environment is the backbone of your cluster. A highly performing, reliable interconnect is a crucial ingredient in making Cache Fusion perform well. Remember that the assumption in most cases is that a read from another node’s memory via the interconnect is much faster than a read from disk—except perhaps for Solid State Disks (SSDs). The interconnect is used to transfer data and messages among the instances. An Oracle RAC cluster requires a high-bandwidth solution with low latency wherever possible. If you find that the performance of the interconnect is subpar, your Oracle RAC cluster performance will also most likely be subpar. In the case of subpar performance in an Oracle RAC environment, the interconnect configuration, including both hardware and software, should be one of the first areas you investigate.
You want the fastest possible network to be used for the interconnect. To maximize your speed and efficiency on the interconnect, you should ensure that the User Datagram Protocol (UDP) buffers are set to the correct values. On Linux, you can check this via the following command:
sysctl net.core.rmem_max net.core.wmem_max net.core.rmem_default net.core.wmem_default
net.core.rmem_max = 4194304
net.core.wmem_max = 1048576
net.core.rmem_default = 262144
net.core.wmem_default = 262144
Alternatively, you can read the associated values directly from the respective files in the directory /proc/sys/net/core. These values can be increased via the following sysctl commands:
sysctl -w net.core.rmem_max=4194304
sysctl -w net.core.wmem_max=1048576
sysctl -w net.core.rmem_default=262144
sysctl -w net.core.wmem_default=262144
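The check described above can also be scripted by reading the files under /proc/sys/net/core and comparing them with the target values. The following is a sketch only: the function name check_udp_buffers is hypothetical, and the thresholds are simply the four values from the sysctl commands above, treated here as minimums.

```shell
# Compare the kernel's socket buffer settings with the values used above.
# Usage: check_udp_buffers [DIRECTORY]   (DIRECTORY defaults to /proc/sys/net/core)
check_udp_buffers() {
    dir=${1:-/proc/sys/net/core}
    rc=0
    for pair in rmem_max=4194304 wmem_max=1048576 \
                rmem_default=262144 wmem_default=262144; do
        name=${pair%%=*}           # setting name, e.g. rmem_max
        want=${pair#*=}            # minimum value taken from the text above
        if ! have=$(cat "$dir/$name" 2>/dev/null); then
            echo "net.core.$name: cannot read $dir/$name"
            rc=1
            continue
        fi
        if [ "$have" -lt "$want" ]; then
            echo "net.core.$name = $have (below $want)"
            rc=1
        else
            echo "net.core.$name = $have (ok)"
        fi
    done
    return $rc
}
```

Running check_udp_buffers with no argument inspects the live system and returns a non-zero status if any value is too low. Note that values changed with sysctl -w do not survive a reboot; on most Linux distributions they are made persistent by adding the same settings to /etc/sysctl.conf and loading them with sysctl -p.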
The numbers in this example are the recommended values for Oracle RAC on Linux and are more than sufficient for the majority of configurations. Nevertheless, let’s talk about some background of the UDP buffers. The values determined by rmem_max and wmem_max are on a “per-socket” basis. So if you set rmem_max to 4MB, and you have 400 processes running, each with a socket open for communications in the interconnect, then each of these 400 processes could potentially use 4MB, meaning that the total memory usage could be 1.6GB just for this UDP buffer space. However, this is only “potential” usage. So if rmem_default is set to 1MB and rmem_max is set to 4MB, you know for sure that at least 400MB will be allocated (1MB per socket). Anything more than that will be allocated only as needed, up to the max value. So the total memory usage depends on the rmem_default, rmem_max, the number of open sockets, and the variable piece of how much buffer space each process is actually using. This is an unknown—but it could depend on the network latency or other characteristics of how well the network is performing and how much network load there is altogether. To get the total number of Oracle-related open UDP sockets, you can execute this command:
netstat -anp --udp | grep ora | wc -l
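The memory arithmetic above can be reproduced with a small helper. This is a sketch under the same simplification as the text (every socket holds at least rmem_default and at most rmem_max of receive buffer space); the function name udp_buffer_bounds is hypothetical, and results are reported in MiB.

```shell
# Print the lower and upper bound, in MiB, of UDP receive buffer memory.
# Usage: udp_buffer_bounds SOCKETS RMEM_DEFAULT_BYTES RMEM_MAX_BYTES
udp_buffer_bounds() {
    sockets=$1
    rmem_default=$2
    rmem_max=$3
    mib=$(( 1024 * 1024 ))
    # Lower bound: every socket at its default size.
    # Upper bound: every socket grown to the maximum size.
    echo "$(( sockets * rmem_default / mib )) $(( sockets * rmem_max / mib ))"
}

# The example from the text: 400 sockets, 1 MB default, 4 MB maximum.
udp_buffer_bounds 400 1048576 4194304   # prints "400 1600"
```

The socket count can be supplied from the netstat pipeline, for example udp_buffer_bounds "$(netstat -anp --udp | grep ora | wc -l)" 262144 4194304, which uses the recommended default and maximum values.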
Our assumption here is that UDP is being used for the interconnect. Although that will be true in the vast majority of cases, there are some exceptions. For example, on Windows, TCP is used for Cache Fusion traffic. When InfiniBand is in use (more details on InfiniBand are provided later in the section “Interconnect Hardware”), the Reliable Datagram Sockets (RDS) protocol may be used to enhance the speed of Cache Fusion traffic. However, any other proprietary interconnect protocols are strongly discouraged, so starting with Oracle Database 11g, your primary choices are UDP, TCP (Windows), or RDS (with InfiniBand).
Another option to increase the performance of your interconnect is the use of jumbo frames. When you use Ethernet, a frame with a variable payload of 46 to 1500 bytes is the transfer unit used between all Ethernet participants. The upper bound of 1500 bytes is the MTU (Maximum Transmission Unit). Jumbo frames allow the Ethernet frame to exceed the MTU of 1500 bytes, up to a maximum of 9000 bytes on most platforms (though platforms will vary). In Oracle RAC, the setting of DB_BLOCK_SIZE multiplied by DB_FILE_MULTIBLOCK_READ_COUNT determines the maximum size of a message for the global cache, and PARALLEL_EXECUTION_MESSAGE_SIZE determines the maximum size of a message used in Parallel Query. These message sizes can range from 2K to 64K or more, and hence are fragmented more heavily with a lower or default MTU. Increasing the frame size (by enabling jumbo frames) can improve the performance of the interconnect by reducing fragmentation when shipping large amounts of data across the wire. A note of caution is in order, however: not all hardware supports jumbo frames. Therefore, due to differences in specific server and network hardware requirements, jumbo frames must be thoroughly tested before implementation in a production environment.
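To make the fragmentation effect concrete, the sketch below estimates how many IP fragments a single UDP message needs at a given MTU. It rests on assumptions that are not in the text: IPv4 with a 20-byte header and no options, an 8-byte UDP header, and no additional Oracle protocol headers. The function name udp_fragments is hypothetical.

```shell
# Estimate the number of IPv4 fragments needed for one UDP message.
# Usage: udp_fragments PAYLOAD_BYTES MTU_BYTES
udp_fragments() {
    payload=$1                     # message size, e.g. one 8192-byte block
    mtu=$2                         # interface MTU, e.g. 1500 or 9000
    # Each fragment carries (MTU - 20) bytes of data, rounded down to a
    # multiple of 8 because fragment offsets are counted in 8-byte units.
    per_fragment=$(( (mtu - 20) / 8 * 8 ))
    datagram=$(( payload + 8 ))    # the UDP header is part of the fragmented data
    # Ceiling division: number of fragments needed to carry the datagram.
    echo "$(( (datagram + per_fragment - 1) / per_fragment ))"
}

udp_fragments 8192 1500   # prints 6
udp_fragments 8192 9000   # prints 1
```

Under these assumptions, a single 8 KB block is split into six fragments at the default MTU but fits in one frame with jumbo frames, which is the reduction the paragraph above describes. The MTU must be raised consistently on every NIC and switch port in the path for this to hold.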
In addition to the tuning options, you have the opportunity to implement faster hardware such as InfiniBand or 10 Gigabit Ethernet (10 GigE). InfiniBand is available and supported with two options. The Reliable Datagram Sockets (RDS) protocol is the preferred option, because it offers up to 30 times the bandwidth of Gigabit Ethernet and a latency reduction of up to 30 times. IP over InfiniBand (IPoIB) is the other option. It does not perform as well as RDS, since it uses standard UDP or TCP, but it still provides much better bandwidth and much lower latency than Gigabit Ethernet.
Another option to increase the throughput of your interconnect is the implementation of 10 GigE technology, which represents the next level of Ethernet. Although it is becoming increasingly common, note that 10 GigE does require specific certification on a platform-by-platform basis, and as of the writing of this book, it was not yet certified on all platforms. Check with Oracle Support to resolve any certification questions that you may have on your platform.