In Part 1 of this blog series, SecureTransport Cluster Models Overview, I covered the reasons to consider clustering a SecureTransport Managed File Transfer (MFT) environment and gave an overview of the different clustering models. In Part 2, we will dig into the Standard Cluster model and the three modes in which it can be configured:
- Active-active
- Active-passive
- Active-passive (legacy)
Active-Active Standard Cluster
In this mode, the cluster operates in a hub-spoke pattern with one server as the primary server and up to two secondary servers. The primary server serves as the dispatcher for all server-initiated jobs by assigning jobs from the event queue to the different servers in the cluster. The primary server also owns the Scheduler and Folder Monitors.
If the primary server goes down, one of the remaining secondary servers becomes the primary server, and its Transaction Manager unsuspends its own Scheduler and Folder Monitors. Note that the primary server and the order of the secondary servers are defined in the configuration file <FILEDRIVEHOME>/lib/admin/config/servers. This “new” primary server then takes over as dispatcher. This failover happens automatically and does not require any human intervention. Jobs in the event queue are not lost; the new primary server starts dispatching them.
However, any protocol sessions initiated by a user (a partner external to SecureTransport) to the SecureTransport server that went down will be lost, and the partner will need to restart the session. In-flight protocol sessions initiated by the SecureTransport server that went down will also fail, but the normal retry capabilities of SecureTransport will retry the protocol session on another server.
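The takeover rule can be pictured with a minimal sketch. It assumes that the configured order of servers decides which surviving server becomes primary; the article does not describe the file format or the internal election mechanism, so the host names and the elect_primary function below are hypothetical and only illustrate the idea.

```python
from typing import Optional, Sequence, Set


def elect_primary(server_order: Sequence[str], down: Set[str]) -> Optional[str]:
    """Return the first server in the configured order that is still up.

    Assumption: the configured order determines who takes over.
    """
    for name in server_order:
        if name not in down:
            return name
    return None  # every server in the cluster is down


# Hypothetical host names, listed in their configured order.
order = ["st-node-1", "st-node-2", "st-node-3"]

print(elect_primary(order, down=set()))                       # st-node-1
print(elect_primary(order, down={"st-node-1"}))               # st-node-2
print(elect_primary(order, down={"st-node-1", "st-node-2"}))  # st-node-3
```

In the real product this happens automatically; the sketch is only meant to show why the order of the entries matters.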
Figure 1. Active-Active Standard Cluster
Primary benefits of this mode are:
- Server-originated jobs are automatically balanced by SecureTransport between the servers
- Secondary server(s) are active, meaning that they not only process existing load but are also available to assume more load and to take over as the primary if the primary server fails.
There are a few things to be aware of with the operation of the active-active Standard Cluster model:
- The primary server does not take into consideration the load on the servers when dispatching work. The secondary servers also cannot refuse work dispatched to them by the primary server. This can create situations where a secondary server is overwhelmed by jobs, potentially resulting in performance or stability issues.
- Generally, there will not be an even split of the load between the servers in the Standard Cluster. Under light or medium load conditions, the primary server will tend to absorb a greater proportion of the load because the primary server's Transaction Manager can pull and process events off its Event Queue faster than the Dispatcher can dispatch jobs to the secondary server(s) over the TCP connection. As the cluster nears its peak capacity, the load will become more evenly split. In fact, if you are seeing an even or almost even split of the load across the servers in your Standard Cluster, then you may be close to the maximum capacity of the cluster.
- Tip: With an active-active Standard Cluster, to have a more evenly distributed load across the cluster, you can configure your load-balancer to route a larger proportion of the incoming client protocol sessions to the secondary server(s) since the primary server will typically be processing a greater portion of the server-initiated jobs, as well as running the Scheduler and Folder Monitors.
- An active-active Standard Cluster is limited to a maximum of three servers. Beyond three servers, the additional overhead of another server outweighs its benefit. This is due to the overhead required for the synchronization of the MySQL databases.
- The consolidated file transfer log is only stored on the primary server. The Transfer Status Manager Service is responsible for consolidating the file transfer activity of every server into the consolidated transfer log on the primary server. If the primary server fails and a secondary server becomes the new primary, the new primary will write the tracking table entries to its own local MySQL Tracking table for the period it is primary. The Tracking table data is NOT synchronized between the servers, even if the cluster is restarted and synchronized. This is necessary for performance reasons, but it has some ramifications:
- File Tracking data will be spread across multiple servers in the cluster if there has been a failover of the primary to a secondary server. This means:
- To get a comprehensive report of all File Transfer tracking data, you may need to run the report on each server.
- Searching for a particular file transfer for troubleshooting or retry purposes may also require searching each server.
- Similarly, REST API calls will also need to be executed against each server of the cluster to ensure all File Transfer activity is accounted for or to find a particular file transfer for troubleshooting, retry, etc.
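Because tracking data can end up on more than one server after a failover, a report or search usually has to be assembled from every server. The sketch below shows one way to merge the results. The fetch callable is a placeholder for whatever query you run against each server (a report export or a REST call); its name, the record fields (id, file, time) and the host names are all assumptions made for illustration, not part of the SecureTransport API.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]


def collect_transfers(servers: Iterable[str],
                      fetch: Callable[[str], List[Record]]) -> List[Record]:
    """Query every server and merge the results, newest first."""
    merged: Dict[str, Record] = {}
    for server in servers:
        for record in fetch(server):
            # Keep the first copy of each transfer and note which server
            # held it, so a retry or follow-up can go to the right place.
            merged.setdefault(record["id"], {**record, "server": server})
    return sorted(merged.values(), key=lambda r: r["time"], reverse=True)


# Stub data standing in for per-server query results (hypothetical).
fake_logs = {
    "st-node-1": [{"id": "t1", "file": "a.csv", "time": "2024-01-01T10:00:00"}],
    "st-node-2": [{"id": "t2", "file": "b.csv", "time": "2024-01-01T11:00:00"}],
}

report = collect_transfers(fake_logs, lambda server: fake_logs[server])
for row in report:
    print(row["time"], row["server"], row["file"])
```

Recording the originating server alongside each entry is a design choice: it tells you where to go if the transfer needs to be researched or resubmitted.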
Active-Passive Standard Cluster
The active-passive Standard Cluster is essentially an active-active Standard Cluster in which only the primary server is processing jobs. The Transaction Manager is suspended on the secondary server, and no inbound communication sessions are routed to the secondary server (this is configured at the load balancer or network level). The same information is still synchronized between the primary server and the secondary server, so the secondary server is active but not processing anything.
In the event of the primary server going down, the secondary server becomes the primary server and unsuspends its own Transaction Manager, Scheduler and Folder Monitors. Just as with the active-active Standard Cluster, this failover happens automatically and does not require any human intervention. Any jobs in the event queue will not be lost and will start being dispatched by the new primary server.
However, any protocol sessions initiated by a user (a partner external to SecureTransport) to the SecureTransport server that went down will be lost, and the partner will need to restart the session. In-flight protocol sessions initiated by the SecureTransport server that went down will also fail, but the normal retry capabilities of SecureTransport will retry the protocol session on the new primary server.
Figure 2. Active-Passive Standard Cluster
Primary benefits of this mode are:
- Failover from the primary to the secondary server is automatic and transparent
- If Sentinel is being used, this mode is required for an active-passive configuration because it reports the file transfer states to Sentinel consistently, which allows Sentinel to link them.
- The secondary server is active, meaning that although it is not processing load, it is already running and available to take over as the primary if the primary server fails.
Some considerations with the active-passive Standard Cluster are:
- The cluster is limited to two servers—one active and one passive.
- Hence, this mode addresses availability but not scalability. The only way to scale in this mode is to add capacity to the existing server or to switch to either the active-active Standard Cluster or the Enterprise Cluster model.
- As with the active-active Standard Cluster, File Tracking data can become split across the primary and secondary servers if a failover occurs. The ramifications are the same: you may have to go to each server for a complete report of file transfer activity, or to locate, research and resubmit a particular file transfer.
Primary reasons to consider an active-passive Standard Cluster versus an active-active cluster are:
- Lack of a performant shared file storage system.
- Note that with the active-passive model, shared file storage for the file payload data is still required, but because only one server is actively writing data to the filesystem, there will be less load on it.
- Lack of expertise for managing and tuning the shared file storage system.
- Lack of internal or external load-balancers or support for sticky sessions.
Active-Passive Standard Cluster (Legacy mode)
This is the oldest active-passive clustering model. In this mode only the configuration data is synchronized between the primary and secondary servers—the Event Queue and Sentinel link data are not synchronized.
Synchronizing only the configuration data, and not the Event Queue or Sentinel link data, greatly reduces the amount of data that needs to be exchanged between the servers, because synchronization of those tables typically accounts for 60 to 80% of the chatter between the servers. This has both positive and negative ramifications.
Primary benefits of this mode are:
- This mode can be easily used for Disaster Recovery in addition to availability by placing the secondary server in a remote disaster recovery data center.
- Only synchronizing the configuration data makes this possible. This mode is the only Standard Cluster mode that inherently supports Disaster Recovery.
Some considerations with this mode include:
- When the primary server fails and either a manual or scripted failover to the secondary server occurs, any jobs that were in the Event Queue will be lost. This includes any server-initiated file transfers that were in progress or in a retry state.
- The failover to the secondary server is not automated out of the box. However, with some scripting work, it can be automated.
- A stable connection is required for the synchronization of the configuration data—this is particularly important if the secondary server is at a remote Disaster Recovery site.
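As an illustration of the scripting work mentioned above, here is a minimal sketch of a watchdog that promotes the secondary only after several consecutive failed health checks. Everything in it is an assumption: the article does not describe how to detect a failed primary or what promoting the secondary involves, so promote_secondary is a hypothetical hook you would supply yourself, and tcp_probe is just one possible health check against a port of your choosing.

```python
import socket
import time
from typing import Callable


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def primary_is_down(probe: Callable[[], bool], attempts: int = 3,
                    delay: float = 5.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Treat the primary as down only if every consecutive probe fails."""
    for i in range(attempts):
        if probe():
            return False
        if i < attempts - 1:
            sleep(delay)  # wait before re-checking to ride out short blips
    return True


def run_failover_check(probe: Callable[[], bool],
                       promote_secondary: Callable[[], None],
                       attempts: int = 3, delay: float = 5.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call the (hypothetical) promotion hook if the primary looks down."""
    if primary_is_down(probe, attempts, delay, sleep):
        promote_secondary()
        return True
    return False


# Demo with a stub probe that always fails, so no network access is needed.
run_failover_check(lambda: False, lambda: print("promoting secondary"), delay=0)
```

Requiring several failed probes in a row reduces the chance of failing over on a brief network interruption, which matters in this mode because jobs in the Event Queue are lost when the failover happens.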
Standard Cluster Synchronization
All Standard Cluster modes require synchronization of data between the primary server and the secondary server(s). For the active-active and active-passive modes, configuration data, event data, and Sentinel link data are all synchronized. For the active-passive (legacy) mode, only configuration data is synchronized. The synchronization process propagates configuration data from the primary server to the secondary server(s).
Tip: Configuration can be performed from any server in the cluster. However, if it is done on a secondary server, the changes are first replicated to the primary server and then pushed back out to the secondary server(s) resulting in a double-hop and additional load. Therefore, whenever possible, perform configuration activities on the primary node so that the data only needs to be pushed out to the secondary node(s).
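The double-hop can be made concrete with a small sketch that lists the server-to-server transfers for a single configuration change. It assumes that the primary pushes the change to every secondary, including the one where the change originated; the function and host names are hypothetical.

```python
from typing import List, Tuple


def replication_path(origin: str, primary: str,
                     secondaries: List[str]) -> List[Tuple[str, str]]:
    """List the (source, destination) transfers for one config change."""
    hops: List[Tuple[str, str]] = []
    if origin != primary:
        # A change made on a secondary first travels to the primary.
        hops.append((origin, primary))
    # The primary then pushes the change out to the secondaries
    # (assumed here to include the originating server).
    hops.extend((primary, s) for s in secondaries)
    return hops


secondaries = ["st-node-2", "st-node-3"]

print(len(replication_path("st-node-1", "st-node-1", secondaries)))  # 2
print(len(replication_path("st-node-2", "st-node-1", secondaries)))  # 3
```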
During normal operations, the synchronization is handled automatically by SecureTransport. However, there are some circumstances when a manual synchronization is required including:
- Restarting the whole cluster
- Restoring a server to the cluster
Manual synchronization can take significant time, and dynamic updates are suspended while it runs. Therefore, it is recommended that the system be under light load during the synchronization process.
In Part 3 of this blog series, I will cover our flagship clustering model, the Enterprise Cluster, along with Disaster Recovery considerations for the different clustering models and a tabular summary of the different clustering models.
Read part one in this series: MFT SecureTransport Cluster Models.