Debugging tips and tricks (a real-life example)
Follow a programmer through the process of debugging a real-life bug. Watch and learn how to use a few of the tips and tricks, in the context of a full solution.
Introduction
The components and setup for server-side debugging.
A Comprehensive Example
There are a lot of good examples and how-tos on Plone.org about debugging. This story will demonstrate how to use several different tools to debug in Plone. My fellow developers and integrators: if you've ever wondered, "Where IS that bug?" or "What's wrong with the data?"--This may be for you. (It is a true story, although some of the server names have been changed to protect the innocent.)
Server-Side Debugging
This tutorial addresses the mystery that we know as "server-side debugging." Therefore, this story will take place largely in a terminal window on the server.
This tutorial does not cover client-side debugging, including CSS issues and other things that are better suited to Firebug.
System Environment and Setup
Here are the specifics for the Plone/Zope installation used for this tutorial. If you try these steps on a different machine with a different configuration, Your Mileage May Vary.
- Centos 5 server
- Plone 3.1.4
- Zope 2.10.6-final
- python 2.4.5
- configured via buildout
- all files located under /var/Plone
The Major Components
We'll use several pieces of Plone and the operating system to debug the problem. These include:
- Zope running in the foreground
- the Python debugger (Pdb)
- source code editing with a text editor
- the command shell ("command prompt" in Windows)
The Problem
An error occurred while attempting to pack the ZODB.
All I Wanted Was a Backup
We needed to backup our Data.fs nightly. An easy way to do this without taking down the server is to pack the database, then backup the pre-packed "Data.fs.old." One solution suggests using "wget" to run the "pack" command via HTTP. A more up-to-date solution recommends running zeopack.py. Here's a script that attempts to do the latter:
#!/bin/bash # NOTE: You'll have to adjust the paths, of course. # PORT comes from the buildout.cfg -- it's the zeo port number. export PLONE=/var/Plone export PORT=8100 export PYTHONPATH=$PLONE/parts/zope2/lib/python $PLONE/parts/Python-2.4.5/bin/python $PLONE/parts/zope2/utilities/ZODBTools/zeopack.py -p $PORT -d 7
Bad Behavior and Bad Error Messages
After running about 10 minutes, the script failed with this message:
Odd-length string
That's it. No more detail, just that. What's worse, Google search revealed no useful information about this string. Without any stack trace information, we're still in the dark. (We'll turn on a light in a moment.)
Where is the Bug?
But first, let's ponder the source of the problem. There are three possibilities:
- Bad Data - Perhaps the data got mangled and needs to be corrected. (This can happen, for example, when you migrate content improperly.)
- Bad Configuration - Perhaps my Plone configuration and/or my operating system are not setup correctly. ("Cannot adapt" is a huge signal that this is occurring.)
- Bad Code - Though much less likely, it is possible that the source code has a bug.
As you read stack traces, log files, and other diagnostic information, be aware that most of the problems you will face are either bad data or bad configuration. This insight has been very helpful to me.
Foreground Execution
Using "instance fg" for more error details.
Check the log files
When Zope runs as an operating system service, it runs silently. Services use log files to record significant events. However, not all events are recorded in the log. I tried checking both log files, but found nothing mentioned about this "Odd-length string" thing.
$ cd /var/Plone/var/logs $ tail instance1.log -n 40 $ tail instance1-Z2.log -n 40Advanced Developers: You may have picked up on the fact that the script accesses Zope directly, not via HTTP. Therefore, it's not really interacting with the operating system service, and would not be tracked in these system logs.
Honest Mistake #1: I did not notice the zeoserver.log file.
So, what now?
Run Zope in the Foreground
When you can't see what's going on via the logs, you can run Zope in the foreground, so you can watch it react to events. You'll have to stop the service, then restart it in foreground mode:
$ service plone stop $ /var/Plone/instance fg
Zope will start up, issue a lot of "Deprecation Warnings", and then end with this:
2009-01-12 11:35:35 INFO Zope Ready to handle requests
From this point on, you can watch Zope's reaction to HTTP requests.
Watching Zope
Zope reacts to this bug via a web request.
Catching the Bug
In order to see this bug in our "instance fg" window, the request must come via HTTP. There aforementioned "wget" solution does this at the command-line. It can also be done manually, via a web browser. (Fortunately, the ZMI already has a Database packing tool, so we'll use that.)
Sending the Request via the ZMI
With the terminal window open, so that we can see Zope's reaction, these are the steps:
- Open a web browser to the ZMI.
- Navigate to Control Panel > Database management > main.
- Enter the value 7 and click "Pack."
The Bug, as seen in the ZMI
After about 10 minutes, the ZMI returns with:
Site Error
An error was encountered while publishing this resource.
Error Type: TypeError
Error Value: Odd-length string
The Bug, as seen in on the server:
The terminal window shows much more detail:
2009-01-12 11:51:15 ERROR ZODB.DB packing
Traceback (most recent call last):
File "/var/Plone/eggs/ZODB3-3.8.1b7-py2.4-linux-x86_64.egg/ZODB/DB.py", line 684, in pack
self._storage.pack(t, self.references)
File "/var/Plone/eggs/ZODB3-3.8.1b7-py2.4-linux-x86_64.egg/ZEO/ClientStorage.py", line 831, in pack
return self._server.pack(t, wait)
File "/var/Plone/eggs/ZODB3-3.8.1b7-py2.4-linux-x86_64.egg/ZEO/ServerStub.py", line 166, in pack
self.rpc.call('pack', t, wait)
File "/var/Plone/eggs/ZODB3-3.8.1b7-py2.4-linux-x86_64.egg/ZEO/zrpc/connection.py", line 699, in call
raise inst # error raised by server
TypeError: Odd-length string
2009-01-12 11:51:15 ERROR Zope.SiteErrorLog http://localhost:8080/Control_Panel/Database/main/manage_pack
Traceback (innermost last):
Module ZPublisher.Publish, line 119, in publish
Module ZPublisher.mapply, line 88, in mapply
Module ZPublisher.Publish, line 42, in call_object
Module , line 3, in _facade
Module AccessControl.requestmethod, line 64, in _curried
Module App.ApplicationManager, line 431, in manage_pack
Module ZODB.DB, line 684, in pack
Module ZEO.ClientStorage, line 831, in pack
Module ZEO.ServerStub, line 166, in pack
Module ZEO.zrpc.connection, line 699, in call
TypeError: Odd-length string
Advanced Developers: Now, the log files show this trace.
The key line to note in that is this one:
File "/var/Plone/eggs/ZODB3-3.8.1b7-py2.4-linux-x86_64.egg/ZEO/zrpc/connection.py", line 699, in call
raise inst # error raised by server
What does it mean? Let's dig a little and find out.
Diving into Plone’s guts
Locating the affected code.
Editing the Source Code
The file we need to examine is this:
/var/Plone/eggs/ZODB3-3.8.1b7-py2.4-linux-x86_64.egg/ZEO/zrpc/connection.py
In a text editor (vim, in my case), line 699 is inside this block of code:
def call(self, method, *args):
if self.closed:
raise DisconnectedError()
msgid = self.send_call(method, args, 0)
r_flags, r_args = self.wait(msgid)
if (isinstance(r_args, tuple) and len(r_args) > 1
and type(r_args[0]) == exception_type_type
and issubclass(r_args[0], Exception)):
inst = r_args[1]
raise inst # error raised by server
else:
return r_args
What we want to do is find out what's going on. To do so, let's put a pdb.set_trace() call in.
Where should we put the set_trace?
We could put it right at the top, like this, but then we'd have to deal with cases where it's working correctly:
def call(self, method, *args):
import pdb; pdb.set_trace()
if self.closed:
raise DisconnectedError()
msgid = self.send_call(method, args, 0)
r_flags, r_args = self.wait(msgid)
A better place would be right before the "raise" statement, like this:
if (isinstance(r_args, tuple) and len(r_args) > 1
and type(r_args[0]) == exception_type_type
and issubclass(r_args[0], Exception)):
inst = r_args[1]
import pdb; pdb.set_trace();
raise inst # error raised by server
else:
return r_args
Save the file. Now, we need to restart the Zope instance in the foreground and re-issue the request via the ZMI. (And wait.)
Bonus: How to tell that your changes were included
When it restarts, Zope will recompile the connection.py file. If you're really curious, or want to verify that your changes were compiled, you can list that file's parent directory on the file system. In linux, "ls -lasF" is very nice for that. Look for a connection.pyc file with a file-date that is newer than the connection.py file itself.
Why bother? Sometimes there are multiple versions of the file on the filesystem. For example, Plone downloads the egg for plone.app.blob automatically, but your buildout also downloads your own version of plone.app.blob as an svnproduct. It's an odd situation, I know, but be aware that odd situations sometimes happen around Plone. (That's why it's so flexible.)
Grep is your friend
How to find source code.
Needle in a Haystack
Sometimes, you've just got to sift through massive piles of code. On linux, grep is amazingly useful. (On Windows, use the File Search feature, under the Start menu.) Here's the command to search for our strange error message:$ cd /var/Plone $ grep 'Odd-length string' * -R
Here are the results:
Binary file parts/Python-2.4.5/lib/python2.4/lib-dynload/binascii.so matches Binary file parts/Python-2.4.5/build/lib.linux-x86_64-2.4/binascii.so matches Binary file parts/Python-2.4.5/build/temp.linux-x86_64-2.4/binascii.o matches parts/Python-2.4.5/Modules/binascii.c: PyErr_SetString(PyExc_TypeError, "Odd-length string");
Honestly,I'm not liking these results. It looks like our error is coming from deep in the guts of Python, which is way beyond what I want to document today. So, let's hope we can get some better results with our pdb.set_trace().
Honest Mistake #2: I halted the grep command early, because it was taking a long time to complete.
Using Pdb
Finding the bug via the Pdb prompt.
Ready to Rumble
After about 10 minutes, the terminal window drops into the Pdb prompt (remember, we're watching Zope, using instance fg).> /var/plone/infocenters/eggs/ZODB3-3.8.1b7-py2.4-linux-x86_64.egg/ZEO/zrpc/connection.py(700)call() -> raise inst # error raised by server (Pdb)Type l (short for list) to see the location of the current line, in context:
695 if (isinstance(r_args, tuple) and len(r_args) > 1 696 and type(r_args[0]) == exception_type_type 697 and issubclass(r_args[0], Exception)): 698 inst = r_args[1] 699 import pdb; pdb.set_trace(); 700 -> raise inst # error raised by server 701 else: 702 return r_args 703 704 # For testing purposes, it is useful to begin a synchronous call 705 # but not block waiting for its response.(While this is a useful listing, it's much easier to open the file in an editor, since we already know which file it is.) Since the Pdb prompt will also run python statements, let's examine the environment like so:
(Pdb) dir() ['args', 'inst', 'method', 'msgid', 'pdb', 'r_args', 'r_flags', 'self'] (Pdb) method 'pack' (Pdb) args self = method = pack args = (1231182229.8963089, 1) (Pdb) r_args (, ) (Pdb) r_flags 0 (Pdb) self (Pdb) msgid 1861Unfortunately, that's less than useful. It appears that the error occurs in another part of this asynchronous system, and that this is just the message handler. Even a closer look at the code in connection.py indicates that there's some other part of the system responsible for this. Now, where could it be?
Remove the set_trace
Before you move on, don't forget to take the set_trace out, save the file and restart Zope. Otherwise, it will come back and bite you.
Fumbling in the Dark
As you can see, sometimes debugging Plone is not straight-forward. Still, let's keep going, using every idea that comes to mind. If nothing else, you will learn a lot about how Plone works! :)
Read the Source, Luke
Read the source code to find the bug.
Back to Square 1
We originally saw this error while trying the command-line zeopack.py utility. It is there that be begin our search. Without boring you with a lot of source-code snippets, here's the basic order of execution:
- main() is called (Line 120)
- Based on the command-line switches, we can guess that pack2() is called (Line 116)
- a ClientStorage object is called to pack the database (Line 67)
Now, we need to find the ClientStorage object. In line 33, we have:
from ZEO.ClientStorage import ClientStorage
Therefore, we know that there must be a ClientStorage.py somewhere on the filesystem.
Find and Grep are even better together
Linux users have a sweet combination at their fingertips. If you know part of the filename you want, you can try this trick:
$ find * | grep ClientStorageThe results:
lib/python/ZEO/ClientStorage.pyc lib/python/ZEO/ClientStorage.pyThis is even better than the locate command, because it only returns results below the current directory. The locate command searches the entire filesystem, possibly returning many irrelevant results.
A Brief Dip into the ZEO Library
By this time, we are on a desperate search for the bug. In lib/python/ZEO/ClientStorage.py, the trail goes as follows:
- line 846 - pack() calls self._server.pack().
- lines 541-543 have this comment about when self._server gets set:
# If verify_cache() finishes the cache verification process,
# it should set self._server. If it goes through full cache
# verification, then endVerify() should set self._server.
Incidentally, this is recorded in "instance fg" and also in the server logs with these lines:
2009-01-12 14:32:46 INFO ZEO.ClientStorage (18262) Verifying cache 2009-01-12 14:32:46 INFO ZEO.ClientStorage (18262) endVerify finishing 2009-01-12 14:32:46 INFO ZEO.ClientStorage (18262) endVerify finished
While this tells us what is happening, it's not terribly clear what type of object the server is. Rather than do a lot more grepping, let's try another set_trace.
Oh, server, where art thou?
Using a pdb.set_trace() with a ZEO server.
Find a specific run-time object with a set_trace
In lib/python/ZEO/ClientStorage.py, I've added a set_trace: def pack(self, t=None, referencesf=None, wait=1, days=0):
"""Storage API: pack the storage.
Deviations from the Storage API: the referencesf argument is
ignored; two additional optional arguments wait and days are
provided:
wait -- a flag indicating whether to wait for the pack to
complete; defaults to true.
days -- a number of days to subtract from the pack time;
defaults to zero.
"""
# TODO: Is it okay that read-only connections allow pack()?
# rf argument ignored; server will provide its own implementation
if t is None:
t = time.time()
t = t - (days * 86400)
import pdb; pdb.set_trace()
return self._server.pack(t, wait)
However, this time, the set_trace code is in the ZEO server code, not the Zope instance code. That means, we're going to have to run ZEO in the foreground, as well as Zope. (I hope you brought enough terminal windows!)
Running ZEO in the foreground
This is basically the same thing as running Zope in the foreground, although the actual command is somewhat different:
$ /var/Plone/bin/zeoserver fg
Watching ZEO in the foreground
Now, we're ready to restart both the ZEO server and the Zope instance, then issue the "pack" command via the ZMI.
What I found is that the ZEO server runs silently, even in the foreground.
OK, so this bit didn't work. I'm leaving it in here, so that you can see what I tried. Also, it's a good lesson that you can't always get what you want.
Zero-in on the Code
Finally, the troublesome line.
The zeoserver.log file
After trying (unsuccessfully) to debug the ZEO server using a pdb.set_trace(), I discovered the zeoserver.log file. (I learn something new every day.) Here's the relevant snippet:
(18529/127.0.0.1:51659) pack(time=1231187683.527384) started...
(18529) Error raised in delayed method
Traceback (most recent call last):
File "/var/Plone/eggs/ZODB3-3.8.1b7-py2.4-linux-x86_64.egg/ZEO/StorageServer.py", line 1247, in run
result = self._method(*self._args)
File "/var/Plone/eggs/ZODB3-3.8.1b7-py2.4-linux-x86_64.egg/ZEO/StorageServer.py", line 416, in _pack_impl
self.storage.pack(time, referencesf)
File "/var/Plone/eggs/ZODB3-3.8.1b7-py2.4-linux-x86_64.egg/ZODB/blob.py", line 555, in pack
self._packUndoing(packtime, referencesf)
File "/var/Plone/eggs/ZODB3-3.8.1b7-py2.4-linux-x86_64.egg/ZODB/blob.py", line 497, in _packUndoing
for oid, oid_path in self.fshelper.listOIDs():
File "/var/Plone/eggs/ZODB3-3.8.1b7-py2.4-linux-x86_64.egg/ZODB/blob.py", line 391, in listOIDs
oid = utils.repr_to_oid(candidate)
File "/var/Plone/eggs/ZODB3-3.8.1b7-py2.4-linux-x86_64.egg/ZODB/utils.py", line 142, in repr_to_oid
as_bin = unhexlify(repr)
TypeError: Odd-length string
Obviously, there's some object (repr) that's causing unhexlify() to bomb in line 142 of /var/Plone/eggs/ZODB3-3.8.1b7-py2.4-linux-x86_64.egg/ZODB/utils.py. But if you look a few lines prior to that, there are some promising references to OIDs. (OID means Object ID.) Perhaps we can find out which object is the problem, after all!
Use a Try-Except clause
Here we've again got to be careful. We don't want to drop into a pdb prompt everytime this is called--it's got to be thousands of times. Rather, we just want to find out what's up the one value that's causing an error. In this case, let's wrap just that one line with try and except:
def repr_to_oid(repr):
if repr.startswith("0x"):
repr = repr[2:]
try:
as_bin = unhexlify(repr)
except TypeError:
import pdb; pdb.set_trace()
as_bin = "\x00"*(8-len(as_bin)) + as_bin
return as_bin
Honest Mistake #3: I made this change before running buildout (in the next step). I'll explain more later...
Why did grep not find this error?
When we ran grep, back in the beginning, looking for 'Odd-length string', it took a long time, so I halted the process. That was "Honest Mistake #2." Had I let it run, it would have found:
var/log/zeoserver.log:TypeError: Odd-length stringSometimes, grep will also complain about "Permission denied" when trying to search a file. In that case, you must have root access (via sudo) to search that file.
Simplify the Problem
Modify buildout to run Zope without ZEO.
Temporarily running Zope without ZEO
Debugging ZEO is apparently more than my brain can handle today. Consequently, I'm going to remove it from the problem. Thankfully, I'm using buildout, so this should be fairly painless. Here are the steps:
- Edit buildout.cfg and comment-out the ZEO-specific lines:
[buildout]
parts =
pil
python-ldap
plone
zope2
productdistros
svnproducts
instance1
zopepy
translations
#Don't forget to put this back before instance1!
# zeoserver
...
[instance1]
recipe = plone.recipe.zope2instance==2.4
zope2-location = ${zope2:location}
#Set zeo-client = true to enable ZEO.
zeo-client = false
zeo-address = ${zeoserver:zeo-address}
zodb-cache-size = 15000
zeo-client-cache-size = 500MB
...
- Run buildout and hope it works.
- Restart Zope in foreground mode.
Honest Mistake #3 Explained
When I made the change to the eggs/ZODB3-3.8.1b7-py2.4-linux-x86_64.egg/ZODB/utils.pyfile, it got overwritten when I re-ran buildout. By default, buildout will download a fresh copy of the codebase everytime you run it. So, be careful, and read up on how to make buildout use caches.
Fortunately, this code change was easy to make again.
Nailing down the bug
Finally, we find the root cause.
Source Code snippets
For the sake of continuity, here's the function source code that we're going to look at while debugging. I'm including it here, so that it doesn't get in the way of the debug session.
NOTE: You could also see these snippets interactively, using the Pdb list command. However, I prefer to open the file in a text editor, since it's easier for browsing code.
From /var/plone/infocenters/eggs/ZODB3-3.8.1b7-py2.4-linux-x86_64.egg/ZODB/utils.py:
def repr_to_oid(repr):
if repr.startswith("0x"):
repr = repr[2:]
try:
as_bin = unhexlify(repr)
except TypeError:
import pdb; pdb.set_trace()
as_bin = "\x00"*(8-len(as_bin)) + as_bin
return as_bin
From /var/plone/infocenters/eggs/ZODB3-3.8.1b7-py2.4-linux-x86_64.egg/ZODB/blob.py:
def listOIDs(self):
"""Lists all OIDs and their paths.
"""
for candidate in os.listdir(self.base_dir):
if candidate == 'tmp':
continue
oid = utils.repr_to_oid(candidate)
yield oid, self.getPathForOID(oid)
Also from blob.py:
@non_overridable
def _packUndoing(self, packtime, referencesf):
# Walk over all existing revisions of all blob files and check
# if they are still needed by attempting to load the revision
# of that object from the database. This is maybe the slowest
# possible way to do this, but it's safe.
base_dir = self.fshelper.base_dir
for oid, oid_path in self.fshelper.listOIDs():
files = os.listdir(oid_path)
for filename in files:
filepath = os.path.join(oid_path, filename)
whatever, serial = self.fshelper.splitBlobFilename(filepath)
try:
fn = self.fshelper.getBlobFilename(oid, serial)
self.loadSerial(oid, serial)
except POSKeyError:
remove_committed(filepath)
if not os.listdir(oid_path):
shutil.rmtree(oid_path)
Catching the Bug
After restarting Zope, I using the ZMI to re-run the "pack" command. After a while, it dropped to a Pdb prompt:
> /var/plone/infocenters/eggs/ZODB3-3.8.1b7-py2.4-linux-x86_64.egg/ZODB/utils.py(146)repr_to_oid() -> as_bin = "\x00"*(8-len(as_bin)) + as_binWhat's on the stack?
(Pdb) dir() ['pdb', 'repr']Hmmm. What's repr?
(Pdb) repr '.layout'Interesting. What function called this?
(Pdb) up > /var/Plone/eggs/ZODB3-3.8.1b7-py2.4-linux-x86_64.egg/ZODB/blob.py(391)listOIDs() -> oid = utils.repr_to_oid(candidate)NOTE: We switched to another file: blob.py. So, what's candidate?
(Pdb) candidate '.layout'Well, that's not so helpful. OK, what's the stack hold now?
(Pdb) dir() ['candidate', 'oid', 'self'] (Pdb) oid '\x00\x00\x00\x00\x00\r\xf49' (Pdb) selfHow about up the stack?
(Pdb) up > /var/Plone/eggs/ZODB3-3.8.1b7-py2.4-linux-x86_64.egg/ZODB/blob.py(497)_packUndoing() -> for oid, oid_path in self.fshelper.listOIDs():I wonder, can we find out which directory is being affected?
(Pdb) dir() ['base_dir', 'filename', 'filepath', 'files', 'fn', 'oid', 'oid_path', 'packtime', 'referencesf', 'self', 'serial', 'whatever'] (Pdb) oid_path '/var/Plone/var/blobstorage/0x0df439'Let's take a look at that directory listing:
$ ll /var/Plone/var/blobstorage/0x0df439 total 184 -rwxr-xr-x 1 plone plone 184217 Sep 16 16:21 0x03777fad209a54ee.blobDrat. I don't think we're looking at the right stack. Let's try to get back to the listOIDs() stack frame.
(Pdb) down > /var/Plone/eggs/ZODB3-3.8.1b7-py2.4-linux-x86_64.egg/ZODB/blob.py(391)listOIDs() -> oid = utils.repr_to_oid(candidate)Good. We're here. I wonder what self.base_dir is?
(Pdb) self.base_dir '/var/plone/infocenters/var/blobstorage' (Pdb) candidate '.layout'Oh, wait a moment! Maybe there's a file in the base directory that's causing trouble. Let's see:
$ ll -a | grep layout -rw-r--r-- 1 root root 4 Dec 4 13:28 .layoutYippee! This is the .layout file that's causing trouble. We know from the source code of listOIDs() in blob.py that blob storage is expecting a hex name for everything except tmp. When it tries to convert the value ".layout" from hex, it bombs.
The Right Fix
Determining the correct course of action to solve this problem.
Possible Fixes
We've identified and located the problem. In this case, there are two possible solutions.
Simply delete the file
First, we need to know where this .layout file came from. After searching with both find and grep, I can't find anything interesting in the codebase. (Once I excluded names like "plone.app.layout" and a lot of CSS files, that is.)
The contents of the file is simply one word: "lawn". I'm guessing it's not terribly important. Still, perhaps I should check with the server admins, in case their scripts are responsible.
Patch the blobstorage package to handle unexpected files and folders
Arguably, there shouldn't be any weird files or folders in the blobstorage base directory. (Other than tmp, that is.) However, in the real world, there might just be. Some other time, I might do this. For now, that's enough debugging. :)
Problem Solved!
After deleting the errant file, I was able to run the pack operation again. This time, it completed without any problems.
Cleanup
When you are finished debugging, don't forget to:
- remove Pdb set_trace() statements
- return your buildout to the previous (unsimplified) configuration
- cleanup any other mess you may have made along the way
References & Credits
Where to go from here...and whom to blame.
References
Want more? Try these:
Credits
Thanks to:
- Martin Aspeli for opening my eyes to server-side tricks at Plone Conference 2008.
- The fine people on #plone, who have patiently answered my questions.
- God, for the extra patience to write this all down and the humility to admit my Honest Mistakes. ;)
