Coping with a Live Spinning Zope
This How-to applies to:
Any version.
This How-to is intended for:
Server Administrators
Overview
Developers most often encounter this, usually while developing new features, and especially at the lower levels of the Zope and ZODB APIs, though even sometimes, possibly, in simple Archetypes. In deployment, one hopes never for this to happen, because it is unpredictable, and a production environment is nowhere to test software. That said, as has happened to me for the past 5-10 days, sometimes it does happen. An otherwise well-functioning system may spin for no apparent reason at irregular intervals, causing monitoring systems to send out alerts around the world, waking people up on their cell phones, etc.. This is no good. :)
A handy utility for tracking down problems in Spinning Zopes is the DeadlockDebugger Product which, along with the threadframe python module, will dump a traceback from all running threads at an HTTP URL. I have noticed, on some occasions, that one, two, or a handful of threads may hang, while the others are still available to serve up requests to DeadlockDebugger. Unfortunately, when Zope fails to respond to requests at all, it tends also not to serve info from DD, making it difficult to leverage as a tool for tracking down problems which happen in production and seemingly at random.
In hopes of seeing to it that my system admin and other dedicated,
mostly volunteer, team members, can get some sleep, I've devised a
simple means of monitoring and restarting Zope based on what I have
observed in our recent failures. I am using Dan J. Bernstein's
daemontools package in my setup for various reasons which I will
outline in the next section, but will first present the simplest
solution, a shell script which runs every minute from cron. If your site is dying right now, drop this in place while you sort out the rest - I wrote it feverishly for just that purpose, knowing that I wanted to use daemontools and to send mails. This has been in place for a couple of months now and from a particular zope, I still get a restart message many times a day. This is much nicer than receiving e-mails from frustrated users. :)
Quick-'n-Dirty
I'll start you off with the simplest version of the script, something which Just Works(tm):
#!/bin/sh
$INSTANCE_HOME=/home/zope/instance/zope-emergency
$ZOPE_ROOT_URL=http://127.0.0.1:8080/
echo "--- checking zope.. ---"
if ! wget --spider -t 1 -T 30 $ZOPE_ROOT_URL 2>&1; then
echo "restarting zope..";
$INSTANCE_HOME/bin/zopectl restart;
fi
The --spider option to wget tells it not to try and download a file,
the -t option specifies the number of tries / retries, and -T is a
universal timeout value, which I've considered replacing with
connection-only timeout. Notice that I am only checking the root Zope
page, largely because it does not load when Zope spins, and it is
served more quickly than a Plone page after a restart. Save this
script as something like 'checkzope' in a handy location such as
$INSTANCE_HOME/bin or /home/zope/bin, and make it executable. Add an
entry to your zope user's crontab like the following, and you should be
set:
* * * * * /home/zope/bin/checkzope >> /home/zope/auto-restart-log
Now the file /home/zope/auto-restart-log will contain the status of each timestamped wget check and a message whenever a restart is required, i.e.:
--- checking zope.. ---
--13:17:01-- http://127.0.0.1:8080/
=> `index.html'
Connecting to 127.0.0.1:8080... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28 [text/html]
200 OK
--- checking zope.. ---
--13:18:01-- http://127.0.0.1:8080/
=> `index.html'
Connecting to 127.0.0.1:8080... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28 [text/html]
200 OK
This should be enough to give you peace of mind and at least some information about how often the problem is occurring, all in one tidy package. I decided that this was OK, but I wanted more. For one, I want to see some DeadlockDebugger output from just before Zope dies entirely, and it probably isn't worth my trouble to try and reload the page in my browser all day. ;)
Notification, Debugging
I also decided that, as my teammates and I had been e-mailing each
other when a restart was required, so that we all had a clear idea of
whether or not this problem was still occurring, the computer should
follow suit. So, this next version of the script logs output from
DeadlockDebugger instead of boring wget status info, and sends an
e-mail as well as writing to the log. I wish I knew something like
'tee' which would split the output to another program instead of a
file, and even thought of setting up a filesystem pipe for this
purpose, though I think it's a bit much for this tutorial.
#!/bin/sh
$INSTANCE_HOME=/home/zope/instance/zope-emergency
$TEST_URL=http://127.0.0.1:8080/manage_debug_threads?password
$NOTIFY_EMAIL=webmaster@mysite.com
if ! wget -O - -T 30 $TEST_URL; then
echo "restarting zope..";
echo "restarting zope.." | mail -s "restarting zope.." $NOTIFY_EMAIL;
$INSTANCE_HOME/bin/zopectl restart;
fi
This version of the script is still very simple and compact, but will now log debug information from Zope which may be useful in tracking down the source of our site's instability, and it will notify people via e-mail so long as the server is correctly configured to do so.
Your crontab line need not change from above to use this version of the script, but you might be wondering some things like, "Will the DeadlockDebugger output, or even the simpler wget output, start to fill up my drive very fast?", or even, "Hm, I'd like to keep two copies of this script from running at once". I've chosen to set my timeout and run interval in such a way that *should* prevent the latter from happening, but who's to say? ;)
Daemontools
Installing
If you don't already have daemontools installed, I strongly suggest that you hop on over to http://cr.yp.to/daemontools.html and grab it. The installation is very straightforward and works on every UNIX system I have tried it on, though GNU/Linux users will need to grab a simple patch to make up for the fact that our errno.h is missing. Once you've got daemontools installed, or while it's installing, you can move on.
Why daemontools?
As I mentioned, I am using DJB's daemontools package, which I already use for simple service monitoring of my Zope and Squid processes. The simplest and most useful thing that Daemontools does for a service is to make sure that it restarts immediately if the process exits. Unfortunately, that won't do us any good for a spinning Zope, but we can still leverage this powerful and simple tool to make our auto-restart script a bit more robust. Again, this is optional, it should work fine as is, and you might just want to manually delete or relocate the log every few days, which will also work fine. For me, it's not enough.
Other than simple service monitoring, daemontools also provides automatic log rotation, and I believe the way I am using it will prevent two copies of my script from running at the same time, though I have not tested. It probably won't hurt if this script runs twice, but it really doesn't need to. It's also possible to use user and group ownership and permissions to manage who can restart a service, which I find very handy, though probably not necessary, as my script and server are already running as the same user. In any case, it's time for a crash course in daemontools.
Creating a checkzope service
As I mentioned, daemontools is a service monitor, and is a very simple,
small piece of code which works based on a very simple directory
layout, called service directories. By default in /service, you want
to create or symlink service directories which only, in most cases,
contain a run script which logs to stdout and a log/main hierarchy
where a program called multilog manages and auto-rotates filesystem
logs of the stream your script sends. So, here's how I set up a new
daemontools service for my checkservice script:
mkdir -p /var/services/checkzope/log/main
touch /var/services/checkzope/down
cp /home/zope/bin/checkzope /var/services/checkzope/run
chmod +x /var/services/checkzope/run
cp /service/squid/log/run /var/services/checkzope/log/run
chmod +x /var/services/checkzope/log/run
I've copied our checkzope script directly to the 'run' script which
supervise looks for, and it may work, but I've edited mine to be more
daemontools-y::
#!/bin/sh
exec 2>&1
$INSTANCE_HOME=/home/zope/instance/zope-emergency
$TEST_URL=http://127.0.0.1:8080/manage_debug_threads?password
$NOTIFY_EMAIL=webmaster@mysite.com
$ZOPE_USER=zope
if ! wget -O - -T 30 $TEST_URL; then
echo "restarting zope..";
echo "restarting zope.." | mail -s "restarting zope.." $NOTIFY_EMAIL;
/command/setuidgid zope $INSTANCE_HOME/bin/zopectl restart;
# or this for zope running under daemontools:
# svc -t /service/zope
fi
The "exec 2>&1" bit at the top sends the entire script's stderr
fd into stdout, which makes it easier to log everything into one file,
which is, by the way, done by a script which looks almost identical for
every daemontools service, like so:
#!/bin/sh
/command/setuidgid zope /usr/local/bin/multilog t ./main
Logging
In a nutshell, daemontools will just pipe the output of your script into
this script, and as mentioned before, multilog will auto-rotate your
logs. To view a given service's log, you want to run something like:
tail -f /service/zope/log/main/current | tai64nlocal
The "tai64nlocal" bit is an application which converts special binary time formats in multilog files into nice textual strings for examination. You will also want to pipe output through this program when you do a directory listing of the log/main directory in order to make sense of the rotated log names. The current log will always be named 'current'.
Enabling
I got a bit ahead of myself in the last part, however, as I recall our
service still living only at /var/services/checkzope. As I mentioned,
daemontools will look in the /service directory for service
configurations to monitor, so we'll create a symlink. This is handy
because if your service breaks, you can break the symlink, kill it's
supervise processes, and fix it at will. So, let's show our new
service to daemontools:
ln -s /var/services/checkzope /service
By design, daemontools will pick up my new service quickly, starting two supervise processes - one for the service itself, and one for logging. If not for the empty 'down' file I created in the root of the service folder, it would also start it immediately by executing the 'run' script.
Checking periodically
For our checkzope service, we don't want daemontools to make sure it is
always running, because that may place more load on our server than we
want, we just want to execute it once in a while from a scheduler such
as cron, and we can do this using the once, or -o command, to the
daemontools service control program, 'svc'::
* * * * * svc -o /service/checkzope
Why?
Now our auto-restart / monitoring script will execute via daemontools once per minute, piping log data which will be auto-rotated, and if we like, even pruned to not grow past a given size. Still, as with the last version, when Zope needs to be restarted, a notification e-mail will be sent. I think that, for now, this is pretty good, but you may have noticed above that my Zope instance is called 'zope-emergency'. Just under two weeks ago, I got home from a long ten-day trip to a convention, and more or less when I was on the plane home, our Zope started to die. For a couple of days we all sat around our computers with coffee, waiting for alerts to come in from a third-party system which checks from eight or so locations around the world to see that our site is up, responds, and that the response contains a given string.
It would be handy if it were possible for DeadlockDebugger to hook into Zope's core and auto-dump to a log periodically, but for now, I'm going to try the next best thing and hope that I catch a handful of my ten threads spinning during the last few once-per-minute checks just before a restart. In time, perhaps I can even devise a means of automatically parsing the output.
I have had this in production for some time on two machines whose Zope / Plone sites need to be upgraded badly, and I have not entirely surmised the problems occuring on either. I have fixed a couple of sites simply by upgrading to the latest code, but it's often a challenge to move a legacy application, esp one which you did not develop, to new upstream code, and we have received no complaints since this system was put in place. It's even a bit Googleian. ;)