Personal tools
You are here: Home Documentation How-tos Coping with a Live Spinning Zope
Support

Get Help

Join our chat rooms or support forums if you have more specific questions.

Plone Training
Learn how to design, build, and deploy a website in Plone through one of the numerous Plone training sessions around the world.
Find Plone training…
 
Document Actions

Coping with a Live Spinning Zope

This How-to applies to: Any version.
This How-to is intended for: Server Administrators

Most people who have been working with Zope and Plone for some time have learned, on one day or another, what it means for Zope to "spin". It incessantly uses 100% cpu, or very close, and, though it accepts HTTP connections, it never responds, requiring a restart. If your live Zope is spinning, you are most assuredly not having a good day, but there are some measures you can take to ease your pain.

Overview

Developers most often encounter this, usually while developing new features, and especially at the lower levels of the Zope and ZODB APIs, though even sometimes, possibly, in simple Archetypes.  In deployment, one hopes never for this to happen, because it is unpredictable, and a production environment is nowhere to test software.  That said, as has happened to me for the past 5-10 days, sometimes it does happen.  An otherwise well-functioning system may spin for no apparent reason at irregular intervals, causing monitoring systems to send out alerts around the world, waking people up on their cell phones, etc..  This is no good. :)

A handy utility for tracking down problems in Spinning Zopes is the DeadlockDebugger Product which, along with the threadframe python module, will dump a traceback from all running threads at an HTTP URL.  I have noticed, on some occasions, that one, two, or a handful of threads may hang, while the others are still available to serve up requests to DeadlockDebugger.  Unfortunately, when Zope fails to respond to requests at all, it tends also not to serve info from DD, making it difficult to leverage as a tool for tracking down problems which happen in production and seemingly at random.

In hopes of seeing to it that my system admin and other dedicated, mostly volunteer, team members, can get some sleep, I've devised a simple means of monitoring and restarting Zope based on what I have observed in our recent failures.  I am using Dan J. Bernstein's daemontools package in my setup for various reasons which I will outline in the next section, but will first present the simplest solution, a shell script which runs every minute from cron.  If your site is dying right now, drop this in place while you sort out the rest - I wrote it feverishly for just that purpose, knowing that I wanted to use daemontools and to send mails.  This has been in place for a couple of months now and from a particular zope, I still get a restart message many times a day.  This is much nicer than receiving e-mails from frustrated users. :)

Quick-'n-Dirty

I'll start you off with the simplest version of the script, something which Just Works(tm):

#!/bin/sh

$INSTANCE_HOME=/home/zope/instance/zope-emergency
$ZOPE_ROOT_URL=http://127.0.0.1:8080/

echo "--- checking zope.. ---"

if ! wget --spider -t 1 -T 30 $ZOPE_ROOT_URL 2>&1; then
  echo "restarting zope..";
  $INSTANCE_HOME/bin/zopectl restart;
fi

The --spider option to wget tells it not to try and download a file, the -t option specifies the number of tries / retries, and -T is a universal timeout value, which I've considered replacing with connection-only timeout.  Notice that I am only checking the root Zope page, largely because it does not load when Zope spins, and it is served more quickly than a Plone page after a restart.  Save this script as something like 'checkzope' in a handy location such as $INSTANCE_HOME/bin or /home/zope/bin, and make it executable.  Add an entry to your zope user's crontab like the following, and you should be set:

* * * * * /home/zope/bin/checkzope >> /home/zope/auto-restart-log

Now the file /home/zope/auto-restart-log will contain the status of each timestamped wget check and a message whenever a restart is required, i.e.:

--- checking zope.. ---
--13:17:01-- http://127.0.0.1:8080/
=> `index.html'
Connecting to 127.0.0.1:8080... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28 [text/html]
200 OK

--- checking zope.. ---
--13:18:01-- http://127.0.0.1:8080/
=> `index.html'
Connecting to 127.0.0.1:8080... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28 [text/html]
200 OK

This should be enough to give you peace of mind and at least some information about how often the problem is occurring, all in one tidy package.  I decided that this was OK, but I wanted more.  For one, I want to see some DeadlockDebugger output from just before Zope dies entirely, and it probably isn't worth my trouble to try and reload the page in my browser all day. ;)

Notification, Debugging

I also decided that, as my teammates and I had been e-mailing each other when a restart was required, so that we all had a clear idea of whether or not this problem was still occurring, the computer should follow suit.  So, this next version of the script logs output from DeadlockDebugger instead of boring wget status info, and sends an e-mail as well as writing to the log.  I wish I knew something like 'tee' which would split the output to another program instead of a file, and even thought of setting up a filesystem pipe for this purpose, though I think it's a bit much for this tutorial.

#!/bin/sh

$INSTANCE_HOME=/home/zope/instance/zope-emergency
$TEST_URL=http://127.0.0.1:8080/manage_debug_threads?password
$NOTIFY_EMAIL=webmaster@mysite.com

if ! wget -O - -T 30 $TEST_URL; then
  echo "restarting zope..";
  echo "restarting zope.." | mail -s "restarting zope.." $NOTIFY_EMAIL;
  $INSTANCE_HOME/bin/zopectl restart;
fi

This version of the script is still very simple and compact, but will now log debug information from Zope which may be useful in tracking down the source of our site's instability, and it will notify people via e-mail so long as the server is correctly configured to do so.

Your crontab line need not change from above to use this version of the script, but you might be wondering some things like, "Will the DeadlockDebugger output, or even the simpler wget output, start to fill up my drive very fast?", or even, "Hm, I'd like to keep two copies of this script from running at once".  I've chosen to set my timeout and run interval in such a way that *should* prevent the latter from happening, but who's to say? ;)

Daemontools

Installing

If you don't already have daemontools installed, I strongly suggest that you hop on over to http://cr.yp.to/daemontools.html and grab it.  The installation is very straightforward and works on every UNIX system I have tried it on, though GNU/Linux users will need to grab a simple patch to make up for the fact that our errno.h is missing.  Once you've got daemontools installed, or while it's installing, you can move on.

Why daemontools?

As I mentioned, I am using DJB's daemontools package, which I already use for simple service monitoring of my Zope and Squid processes.  The simplest and most useful thing that Daemontools does for a service is to make sure that it restarts immediately if the process exits.  Unfortunately, that won't do us any good for a spinning Zope, but we can still leverage this powerful and simple tool to make our auto-restart script a bit more robust.  Again, this is optional, it should work fine as is, and you might just want to manually delete or relocate the log every few days, which will also work fine.  For me, it's not enough.

Other than simple service monitoring, daemontools also provides automatic log rotation, and I believe the way I am using it will prevent two copies of my script from running at the same time, though I have not tested.  It probably won't hurt if this script runs twice, but it really doesn't need to.  It's also possible to use user and group ownership and permissions to manage who can restart a service, which I find very handy, though probably not necessary, as my script and server are already running as the same user.  In any case, it's time for a crash course in daemontools.

Creating a checkzope service

As I mentioned, daemontools is a service monitor, and is a very simple, small piece of code which works based on a very simple directory layout, called service directories.  By default in /service, you want to create or symlink service directories which only, in most cases, contain a run script which logs to stdout and a log/main hierarchy where a program called multilog manages and auto-rotates filesystem logs of the stream your script sends.  So, here's how I set up a new daemontools service for my checkservice script:

mkdir -p /var/services/checkzope/log/main
touch /var/services/checkzope/down
cp /home/zope/bin/checkzope /var/services/checkzope/run
chmod +x /var/services/checkzope/run
cp /service/squid/log/run /var/services/checkzope/log/run
chmod +x /var/services/checkzope/log/run

I've copied our checkzope script directly to the 'run' script which supervise looks for, and it may work, but I've edited mine to be more daemontools-y::

#!/bin/sh
exec 2>&1
$INSTANCE_HOME=/home/zope/instance/zope-emergency
$TEST_URL=http://127.0.0.1:8080/manage_debug_threads?password
$NOTIFY_EMAIL=webmaster@mysite.com
$ZOPE_USER=zope

if ! wget -O - -T 30 $TEST_URL; then
  echo "restarting zope..";
  echo "restarting zope.." | mail -s "restarting zope.." $NOTIFY_EMAIL;
  /command/setuidgid zope $INSTANCE_HOME/bin/zopectl restart;
  # or this for zope running under daemontools:
  # svc -t /service/zope
fi

The "exec 2>&1" bit at the top sends the entire script's stderr fd into stdout, which makes it easier to log everything into one file, which is, by the way, done by a script which looks almost identical for every daemontools service, like so:

#!/bin/sh
/command/setuidgid zope /usr/local/bin/multilog t ./main

Logging

In a nutshell, daemontools will just pipe the output of your script into this script, and as mentioned before, multilog will auto-rotate your logs.  To view a given service's log, you want to run something like:

tail -f /service/zope/log/main/current | tai64nlocal

The "tai64nlocal" bit is an application which converts special binary time formats in multilog files into nice textual strings for examination.  You will also want to pipe output through this program when you do a directory listing of the log/main directory in order to make sense of the rotated log names.  The current log will always be named 'current'.

Enabling

I got a bit ahead of myself in the last part, however, as I recall our service still living only at /var/services/checkzope.  As I mentioned, daemontools will look in the /service directory for service configurations to monitor, so we'll create a symlink.  This is handy because if your service breaks, you can break the symlink, kill it's supervise processes, and fix it at will.  So, let's show our new service to daemontools:

ln -s /var/services/checkzope /service

By design, daemontools will pick up my new service quickly, starting two supervise processes - one for the service itself, and one for logging.  If not for the empty 'down' file I created in the root of the service folder, it would also start it immediately by executing the 'run' script.

Checking periodically

For our checkzope service, we don't want daemontools to make sure it is always running, because that may place more load on our server than we want, we just want to execute it once in a while from a scheduler such as cron, and we can do this using the once, or -o command, to the daemontools service control program, 'svc'::

* * * * * svc -o /service/checkzope

Why?

Now our auto-restart / monitoring script will execute via daemontools once per minute, piping log data which will be auto-rotated, and if we like, even pruned to not grow past a given size.  Still, as with the last version, when Zope needs to be restarted, a notification e-mail will be sent.  I think that, for now, this is pretty good, but you may have noticed above that my Zope instance is called 'zope-emergency'.  Just under two weeks ago, I got home from a long ten-day trip to a convention, and more or less when I was on the plane home, our Zope started to die.  For a couple of days we all sat around our computers with coffee, waiting for alerts to come in from a third-party system which checks from eight or so locations around the world to see that our site is up, responds, and that the response contains a given string.

It would be handy if it were possible for DeadlockDebugger to hook into Zope's core and auto-dump to a log periodically, but for now, I'm going to try the next best thing and hope that I catch a handful of my ten threads spinning during the last few once-per-minute checks just before a restart.  In time, perhaps I can even devise a means of automatically parsing the output.

I have had this in production for some time on two machines whose Zope / Plone sites need to be upgraded badly, and I have not entirely surmised the problems occuring on either.  I have fixed a couple of sites simply by upgrading to the latest code, but it's often a challenge to move a legacy application, esp one which you did not develop, to new upstream code, and we have received no complaints since this system was put in place.  It's even a bit Googleian. ;)

by justizin — last modified December 1, 2006 - 11:49
Contributors: Kenneth Bauer
All content is copyright Plone Foundation and the individual contributors.

For any issues with the web site functionality, please file a ticket.

Please consult the policy on plone.org content if you want your content published on this site.

Servers and hosting by