HA-Cluster with loadbalancing for Zope (and Plone)

This How-to applies to: Any version.
This How-to is intended for: Server Administrators, Site Administrators

This document describes an HA-Cluster solution for Zope and Plone with load-balancing over two physical machines, based on ZEO. We assume NO single point of failure, and use NO commercial software.

Contents

  • Introduction
  • Assumptions/prerequisites
  • Setup
  • Configuration
  • Use cases
  • Alternatives
  • Resources

Introduction

An HA-Cluster assumes (almost) continuous availability, even in case of hardware failure. Note, however, that to attain real high availability, or in marketing terms an uptime of 99.999% (see for example Wikipedia), your system can be down for users only about 5 minutes per year. It should be obvious that this leaves very little time for fixing problems... Also, to be truly Highly Available, you not only need more than one machine to cope with hardware failure, but you also need geographic redundancy, to cope with failure of the data center or the data backbone. Geographical redundancy and covering for marketing managers are not covered here though. We leave this to the reader, as an exercise... ;)
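The "5 minutes per year" figure follows from simple arithmetic. As a minimal sketch (the function name is ours, and we assume an average year of 365.25 days), the downtime budget for a given availability percentage can be computed like this:

```python
# Average year length in minutes, counting leap years (an assumption of this sketch).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(availability_percent):
    """Return the yearly downtime budget, in minutes, for a given availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


# Print the budget for "three nines", "four nines" and "five nines".
for percent in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(percent)
    print(f"{percent}% uptime allows {minutes:.1f} minutes of downtime per year")
```

Each extra "nine" divides the budget by ten, which is why five nines leaves only a few minutes per year.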

To be able to cope with soft- and hardware failure, you'll need a setup with at least two machines, where each machine is able to perform the same services for the end user, without the user noticing failure. For the end user, the services are thus Highly Available. Please note that some end users may still notice some failures: if the web server stops or a CPU breaks down in the middle of a request, that request may fail, but the user can resume operations straight away.

We assume a simple setup with two machines in one physical location, although this is not a practical limitation of our setup, only for purposes of clarity. For a thorough account of clustering and load-balancing techniques involving other setups, see http://www.ultramonkey.org/.


Assumptions/prerequisites

We have used a setup including the following components:

  • Apache2
  • Heartbeat
  • NFS
  • mod_proxy_balancer (optional, but more efficient in terms of hardware use)
  • Squid (optional. Squid configuration for Zope is not further described in this document.)
  • ZEO
  • Our own syncer script (currently named syncPozo)

On most Linux distributions, these are readily available as packages, except for our synchronization scripts.


Setup

The setup consists of two machines, each having two network interfaces. The machines are linked to each other over one serial cable, and one cross-cable for ethernet, and linked to the internet with the other ethernet interface. This is a typical setup for HA clustering with heartbeat. Both machines have Apache2 installed, as well as a Zope instance using ZEO. The machines use heartbeat to determine whether the cluster is still in normal operation mode (master is alive). The machines are available over the internet both with their own IP-addresses, and with a floating IP-address that might point to the master, or when the master is down, to the slave. More details on this configuration can be found in the man pages for heartbeat.

The setup can be graphically depicted as follows (figure not reproduced here):



The general idea is that the HA Cluster as a whole is contacted over the floating IP-address. This is configured to be on the master for normal conditions. The master server handles all requests on port 80, and uses mod_proxy_balancer to distribute requests over two machines, either to Squid, or directly to Zope clients. Note that load balancing is not an essential part of the setup, but makes more efficient use of your hardware. The Zope clients use a ZEO server for their data back-end. This server runs on the master, and is contacted over the floating IP-address by the clients. Data from the master (the Data.fs) is synchronized to the slave server. On the slave, the ZEO server is not running.

In case of failover, heartbeat takes care of assigning the floating IP-address to the slave and starts the ZEO server on the slave. This server will now be the one contacted by the Zope clients, which detect the change automatically. If possible, the ZEO server on the master will be stopped. Also, the syncing process on the slave is stopped, to prevent writing erroneous data to the slave database.

Recovery from a failover is not automated, due to the high risk of errors in this procedure. Recovery includes:

  • check on integrity of data on the slave
  • copy back slave data to master
  • stop ZEO server on slave
  • start ZEO server on master
  • start syncer on slave
  • start heartbeat on master

If you wish, you can automate recovery as well, but we have chosen manual intervention, to make sure that the master is thoroughly checked for the cause of the failure before recovery. In the process of recovery, there might be a very short interval where clients will notice unavailability, since at some stage the ZEO service on the slave machine needs to be stopped before copying the Data.fs back to the master, to ensure that all changes are copied back. You may consider not stopping the ZEO server during copying, and only stopping it after the master has taken over again.

Configuration

Apache

Apache is configured to load the modules for proxy, proxy_balancer and proxy_http at least. Roughly, balancing is achieved by the following statements, for example within a virtual host declaration:

<Proxy balancer://lb>
   BalancerMember http://192.168.1.10:8080
   BalancerMember http://192.168.1.11:8080
</Proxy>

...

ProxyPass / balancer://lb/VirtualHostBase/http/somesite.foo.bar:80/ploneinstance/VirtualHostRoot/

assuming you have two nodes running on port 8080, with IP-addresses 192.168.1.10 and 192.168.1.11, your Plone instance is called 'ploneinstance', and you use the Virtual Host Monster to map somesite.foo.bar to the proper Plone instance.
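The ProxyPass target above follows a fixed pattern that the Virtual Host Monster understands. As a small illustration (the helper name and its parameters are ours, not part of Apache or Zope, and the code is written for Python 3), the target could be assembled like this:

```python
def vhm_proxy_target(balancer, host, instance, scheme="http", port=80):
    """Build a ProxyPass target for Zope's Virtual Host Monster.

    balancer: name used in the <Proxy balancer://...> block
    host:     public host name of the site
    instance: id of the Plone instance in the Zope root
    """
    return (
        f"balancer://{balancer}/VirtualHostBase/{scheme}/{host}:{port}"
        f"/{instance}/VirtualHostRoot/"
    )


# Reproduce the example directive from the text.
print("ProxyPass /", vhm_proxy_target("lb", "somesite.foo.bar", "ploneinstance"))
```

Generating the directive this way may be convenient when several sites share the same balancer and only the host name and instance id differ.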


Heartbeat

The heartbeat process on the slave continually checks if the master server is still up. The heartbeat on the slave can start automatically, so you can add links to the start-stop scripts in all runlevel init directories.
The heartbeat process on the master is not automatically (re)started (so the floating IP address won't switch back to the master automatically) due to our need for manual failover recovery. Remove start/stop links to heartbeat from all runlevels in /etc/rc.<x> and start manually using /etc/init.d/heartbeat.

Configure heartbeat according to your hardware setup, preferably using at least two communication channels for checking cluster status. A serial and an ethernet interface between both machines is a common setup. Preferably your machines have two ethernet interfaces, one for external communication, and one for heartbeat. The heartbeat configuration on the slave needs to contain the directive for starting the ZEO cluster and stopping syncing in case of failover. The configuration on the master doesn't need to do that, but should stop the ZEO cluster.

Add the following directive to the /etc/heartbeat/haresources file:

<master>    <floating IP address>/24/<ethernet interface> zeo

where master is to be replaced by the name of your master node on both machines, available in the /etc/hosts file.
The identifier 'zeo' is arbitrary, but should be the name of a script available in the directory /etc/heartbeat/resource.d that takes care of stopping and starting your ZEO cluster, and of stopping syncing on the slave. Check the attached zeo files for an example setup on master and slave. Both scripts assume an instance location of /opt/zope/instance0, but obviously this can be whatever you like.
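To give an idea of what such a resource script does on the slave, here is a rough sketch in Python 3. It is not one of the attached zeo files. We assume that heartbeat calls the script with an action such as start or stop as its first argument, and that a zeoctl script exists in the bin directory of the instance; both the ZEOCTL path and the handle function are hypothetical names chosen for this illustration.

```python
import subprocess

# Assumption: zeoctl lives in the bin directory of the instance used above.
ZEOCTL = "/opt/zope/instance0/bin/zeoctl"
# Status file read by the syncer (see the Syncing section).
STATUS_FILE = "/var/run/syncPozo.status"


def handle(action, run=subprocess.call, status_file=STATUS_FILE):
    """Handle one heartbeat action on the slave and return an exit code.

    'run' is injectable so the logic can be exercised without a real ZEO.
    """
    if action == "start":
        # Failover: switch syncing off first, so that an old master Data.fs
        # can never overwrite the slave's Data.fs once the slave is live.
        with open(status_file, "w") as fh:
            fh.write("SYNCING=0\n")
        return run([ZEOCTL, "start"])
    if action == "stop":
        return run([ZEOCTL, "stop"])
    # Other actions (for example status) are left out of this sketch.
    return 2
```

A real script would pass its first command-line argument to handle and exit with the returned code. Disabling syncing before starting ZEO is a deliberate ordering: if the ZEO start fails, the slave data is still protected.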

NFS

The master shares the directory containing the Data.fs, and the slave mounts this directory. On the master, add the following line to your /etc/exports file:

/data0          192.168.80.1(ro,sync)

assuming that your data on the master is on the /data0 partition, and the IP-address of your slave (preferably on a second ethernet interface) is 192.168.80.1. Also make sure that the slave server can actually mount the NFS share. You might want to read the manual for NFS and Portmap.

Now add the following line to the /etc/fstab file on the slave:

192.168.80.2:/data0     /mnt/master/data0 nfs     defaults        0       0

assuming that the IP address of the master is 192.168.80.2, the share is /data0, and the mount point on the slave is /mnt/master/data0.

Syncing

The synchronization mechanism consists of two scripts: a controller script, and a script that performs the actual syncing of the Data.fs. Syncing is triggered by some scheduler, like cron. We call the syncData.sh script every minute from cron, like so:

* * * * * . .profile; $HOME/bin/syncData.sh <MASTER DATA DIR> <SLAVE DATA DIR> >> $HOME/var/log/syncData.log 2>&1 
where master data dir points to the mounted NFS partition directory where the Data.fs resides, and slave data dir points to the directory where the slave Data.fs is. The user profile contains some settings, for example for the PYTHONPATH.

Whether or not syncing needs to be done at all, is registered in a run file syncPozo.status. In case of failover to the slave, syncing is disabled. The file contains one single line:

SYNCING=[0|1]

A value of 0 means syncing is off.
Sources for both syncPozo.py (the actual syncer) and syncData.sh (the controller script) are attached.
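The status file is simple enough to read in a few lines. The following is a sketch of how a controller could check it, not the attached code; the function name is ours, and treating a missing or unreadable file as "syncing off" is our own fail-safe assumption.

```python
def syncing_enabled(status_path):
    """Return True only if the status file contains the line SYNCING=1."""
    try:
        with open(status_path) as fh:
            for line in fh:
                key, sep, value = line.strip().partition("=")
                if sep and key == "SYNCING":
                    return value == "1"
    except OSError:
        pass
    # Assumption: without a readable SYNCING line we do not sync at all,
    # which is the safe choice after a failover.
    return False
```

A controller would call syncing_enabled("/var/run/syncPozo.status") at the start of every run and skip the sync when it returns False.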

Given the call from cron, logging is done to $HOME/var/log/syncData.log. You may see the following messages:

  • Start syncing
    Script has started synchronization.
  • missing or empty .dat file (full backup)
    No Data.fs has been found on the slave, a full backup is made.
  • No handlers could be found for logger "ZODB.FileStorage"
    You didn't set the PYTHONPATH variable to contain the Python libraries before calling the syncPozo.py script. The path to your Zope libraries is <ZOPE INSTANCE>/lib/python.
  • Slave  (backup) file has grown since last syncing...
    The slave has been changed apart from syncing from the master. Consider removing the slave Data.fs altogether, and forcing a full backup.
  • NOT SYNCING FROM MASTER
    SYNCING is set to 0 in syncPozo.status.
  • Finished syncing
    That's all for this round, folks!

Note: make sure that the python libraries for Zope are on your PYTHONPATH environment variable.

Use cases

NORMAL OPERATION (master + slave are up)

Only ZEO on the master is running, and connected to Data.fs located on the master. Apache dispatches requests to clients on both master and slave, that use the ZEO cluster on the master, using the floating IP address.
The slave continually (at a given interval) syncs the Data.fs from the master (local mount with NFS) to itself using the syncPozo.py script (with this script only the changes are synchronized, which is very fast).
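To illustrate why syncing only the changes is fast, here is a deliberately simplified sketch in Python 3. It is not the syncPozo.py script. It assumes that the master Data.fs only grows by appending between two runs, and it ignores transaction boundaries and packing, both of which a real syncer has to handle. The function name and messages are hypothetical.

```python
import os
import shutil


def sync_appended(master_path, slave_path, chunk_size=1024 * 1024):
    """Copy only the bytes appended to the master file since the last run.

    Returns the number of bytes copied. Raises RuntimeError when the slave
    copy is larger than the master, which means the files have diverged.
    """
    master_size = os.path.getsize(master_path)
    if not os.path.exists(slave_path) or os.path.getsize(slave_path) == 0:
        # Nothing usable on the slave yet: make a full copy.
        shutil.copyfile(master_path, slave_path)
        return os.path.getsize(slave_path)
    slave_size = os.path.getsize(slave_path)
    if slave_size > master_size:
        raise RuntimeError("slave file is larger than master; not syncing")
    copied = 0
    remaining = master_size - slave_size
    with open(master_path, "rb") as src, open(slave_path, "ab") as dst:
        src.seek(slave_size)  # skip the part the slave already has
        while remaining > 0:
            chunk = src.read(min(chunk_size, remaining))
            if not chunk:
                break
            dst.write(chunk)
            copied += len(chunk)
            remaining -= len(chunk)
    return copied
```

Note that this sketch never checks that the part already on the slave is identical to the start of the master file; it only compares sizes.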

MASTER IS GOING DOWN

Heartbeat on the slave detects that the master server has gone down. The following actions are executed:

  • automatic syncing of the Data.fs is stopped (permanently, until it is manually started again). This ensures that after the master has gone down and come up again, the (possibly) old master Data.fs will not overwrite the newer Data.fs on the slave!
  • the slave heartbeat resource writes "SYNCING=0" to the file /var/run/syncPozo.status
  • floating IP address is taken over from master
  • ZEO on the slave is started

(ZOPE2, the Zope client on the slave, automatically reconnects to ZEO2, the ZEO server on the slave)

SLAVE IS GOING DOWN

nothing happens

MASTER UP AFTER BEING DOWN

no automatic actions are executed (ZEO1, the ZEO server on the master, remains down, and the slave keeps the floating IP address!)

manual actions need to be undertaken (PER site!)

1. stop ZEO2 (site is now down!)

2. copy Data.fs from slave to master

3. start ZEO1

After the previous steps have been done for each site, the following global steps need to be taken:

4. take over floating IP address (site is up again, after ZOPE clients have automatically reconnected to ZEO1)

5. start syncPozo processes (edit the file /var/run/syncPozo.status: SYNCING=1)

6. check syncer logging: if need be, delete the Data.fs on the slave, for full backup from master
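Step 2 above is the step where a mistake is most costly, so it may be worth verifying the copy before starting ZEO1. The sketch below (Python 3, hypothetical function names) copies a file and compares SHA-256 checksums. It assumes that both paths are reachable from the machine where it runs; the read-only NFS export described earlier does not provide that by itself, so in practice the copy may go over another transport.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_back(slave_data_fs, master_data_fs):
    """Copy the slave Data.fs over the master Data.fs and verify the result.

    ZEO2 must already be stopped (step 1), so the source cannot change
    while it is being copied. Returns the checksum of the copied file.
    """
    shutil.copyfile(slave_data_fs, master_data_fs)
    source_digest = sha256_of(slave_data_fs)
    target_digest = sha256_of(master_data_fs)
    if source_digest != target_digest:
        raise RuntimeError("checksum mismatch after copying Data.fs")
    return target_digest
```

Only when copy_back returns without an error would we continue with step 3.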


Alternatives

As usual there's more than one way to achieve similar results. We'll not exhaust ourselves with a comparison here, but at least point to alternative ways:

  • use DRBD to sync the ZODB over TCP/IP;
  • use the commercial ZRS solution of Zope Corporation.


Resources

Attached files

see also:

Sticky sessions and mod_proxy_balancer
This document explains how to enable sticky sessions in a Zope/Plone HA cluster so authenticated users are routed to the same back-end.

by Goldmund, Wyldebeast & Wunderliebe, last modified May 25, 2007 - 09:29. All content is copyright Plone Foundation and the individual contributors.
