Fighting Spam with SpamBayes
On Roundup instances where anyone can create an account, spam easily becomes a problem. This customization example shows one way to deal with it: integrating with SpamBayes, a statistical anti-spam filter.
Requirements
You need access to a SpamBayes XMLRPC server, version 1.1a4 or later. Install the SpamBayes server according to the documentation on http://spambayes.sf.net, and then run it with the XMLRPC plugin loaded. This mailing list post has some details, although in its core_server.py command-line example you need to replace "-m" with "-P", so that the command line looks like this:
BAYESCUSTOMIZE=$SBDIR/bayescustomize.ini core_server.py -P XMLRPCPlugin
Theory of Operation
An auditor is added that fires on set and create actions for the file and msg classes. The auditor contacts the SpamBayes server via XMLRPC, submits the content of the new file or msg instance together with some extra tokens created from msg/file metadata, and gets a score back. This score is stored as a property (spambayes_score) on the msg/file instance. Another property, spambayes_misclassified, is set to False if the msg/file was successfully scored (i.e., if there was no communication error or similar). Otherwise it is set to True, which allows an administrator to search for msg/file instances that have not been classified.
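The scoring step described above can be sketched roughly as follows. This is a minimal illustration, not the code in detectors/spambayes.py: `score_item` and the `score_func` callable are hypothetical names, and `score_func` stands in for whatever XMLRPC call the real auditor makes to the SpamBayes server. Leaving the score unset on failure is also an assumption.

```python
import xmlrpc.client


def score_item(content, tokens, score_func):
    """Return the property values to store on a new msg/file instance.

    score_func(content, tokens) is a hypothetical stand-in for the
    XMLRPC request to the SpamBayes server; it should return a number.
    """
    props = {}
    try:
        props["spambayes_score"] = float(score_func(content, tokens))
        # Scoring worked, so the item does not need manual attention.
        props["spambayes_misclassified"] = False
    except (OSError, ValueError, xmlrpc.client.Error):
        # Communication error or unusable reply: flag the item so an
        # administrator can find it later (assumption: no score is stored).
        props["spambayes_misclassified"] = True
    return props


if __name__ == "__main__":
    # Fake scorer used only for demonstration.
    print(score_item("buy cheap pills", ["author:anonymous"], lambda c, t: 0.98))
```

Passing the scorer in as a callable keeps the sketch independent of the server's actual method names, which are not documented here.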
Roundup's security system is configured to disallow anonymous users from viewing the content and summary properties of file and msg instances that have been classified as spam (this is configurable, of course). This ensures that the Roundup instance can't be used to boost search results for whatever uninteresting content the spammer tries to add. It is also configured to allow users with a special role (Coordinator, in my schema) to classify messages as spam or non-spam (ham) by pressing a button in Roundup. This way, SpamBayes can be trained on your type of data.
Get the Code
Begin by checking out http://svn.python.org/projects/tracker/instances/spambayes_integration :
svn co http://svn.python.org/projects/tracker/instances/spambayes_integration
This gives you two Python files: detectors/spambayes.py and extensions/spambayes.py. The former is the auditor that scores msg and file instances when they are created. The latter is an extension for doing the classification from the web interface.
Symlink these two files into your instance's detectors and extensions directories:
cd /home/of/my/tracker
ln -s /path/to/spambayes_integration/detectors/spambayes.py detectors/spambayes.py
ln -s /path/to/spambayes_integration/extensions/spambayes.py extensions/spambayes.py
Copy /path/to/spambayes_integration/detectors/config.ini.template to detectors/config.ini, and adjust the URI of your SpamBayes server as well as the spam_cutoff value, if needed.
Modify Schema
The schema is modified by adding two properties to each of the file and msg classes. If your schema is based on the classic template, here are your new file and msg definitions:
msg = FileClass(db, "msg",
                author=Link("user", do_journal='no'),
                recipients=Multilink("user", do_journal='no'),
                date=Date(),
                summary=String(),
                files=Multilink("file"),
                messageid=String(),
                inreplyto=String(),
                spambayes_score=Number(),
                spambayes_misclassified=Boolean(),)

file = FileClass(db, "file",
                 name=String(),
                 spambayes_score=Number(),
                 spambayes_misclassified=Boolean(),)
Modify Templates
Now modify your HTML templates. You need to modify html/msg.item.html, html/file.item.html and html/issue.item.html.
Diff for msg.item.html from classic template:
Index: msg.item.html
===================================================================
--- msg.item.html (revision 56578)
+++ msg.item.html (working copy)
@@ -48,12 +48,45 @@
<th i18n:translate="">Date</th>
<td tal:content="context/date"></td>
</tr>
+
+ <tr>
+ <th i18n:translate="">SpamBayes Score</th>
+ <td tal:content="structure context/spambayes_score/plain"></td>
+ </tr>
+
+ <tr>
+ <th i18n:translate="">Marked as misclassified</th>
+ <td tal:content="structure context/spambayes_misclassified/plain"></td>
+ </tr>
+
</table>
+<p tal:condition="python:utils.sb_is_spam(context)" class="error-message">
+ Message has been classified as spam</p>
+
<table class="messages">
<tr><th colspan=2 class="header" i18n:translate="">Content</th></tr>
+ <th class="header" tal:condition="python:request.user.hasPermission('SB: May Classify')">
+ <form method="POST" onSubmit="return submit_once()"
+ enctype="multipart/form-data"
+ tal:attributes="action context/designator">
+
+ <input type="hidden" name="@action" value="spambayes_classify">
+ <input type="submit" name="trainspam" value="Mark as SPAM" i18n:attributes="value">
+ <input type="submit" name="trainham" value="Mark as HAM (not SPAM)" i18n:attributes="value">
+ </form>
+ </th>
<tr>
- <td class="content" colspan=2><pre tal:content="structure context/content/hyperlinked"></pre></td>
+ <td class="content" colspan=2
+ tal:condition="python:context.content.is_view_ok()"><pre
+ tal:content="structure context/content/hyperlinked"></pre></td>
+ <td class="content" colspan=2
+ tal:condition="python:not context.content.is_view_ok()">
+ Message has been classified as spam and is therefore not
+ available to unauthorized users. If you think this is
+ incorrect, please login and report the message as being
+ misclassified.
+ </td>
</tr>
</table>
Diff for file.item.html from classic template:
Index: file.item.html
===================================================================
--- file.item.html (revision 56578)
+++ file.item.html (working copy)
@@ -29,6 +29,16 @@
</tr>
<tr>
+ <th i18n:translate="">SpamBayes Score</th>
+ <td tal:content="structure context/spambayes_score/plain"></td>
+ </tr>
+
+ <tr>
+ <th i18n:translate="">Marked as misclassified</th>
+ <td tal:content="structure context/spambayes_misclassified/plain"></td>
+ </tr>
+
+ <tr>
<td>
<input type="hidden" name="@template" value="item">
@@ -42,10 +52,30 @@
</table>
</form>
-<a tal:condition="python:context.id and context.is_view_ok()"
+<p tal:condition="python:utils.sb_is_spam(context)" class="error-message">
+ File has been classified as spam.</p>
+
+<a tal:condition="python:context.id and context.content.is_view_ok()"
tal:attributes="href string:file${context/id}/${context/name}"
i18n:translate="">download</a>
+<p tal:condition="python:context.id and not context.content.is_view_ok()">
+ Files classified as spam are not available for download by
+ unauthorized users. If you think the file has been misclassified,
+ please login and click on the button for reclassification.
+</p>
+
+
+ <form method="POST" onSubmit="return submit_once()"
+ enctype="multipart/form-data"
+ tal:attributes="action context/designator"
+ tal:condition="python:request.user.hasPermission('SB: May Classify')">
+
+ <input type="hidden" name="@action" value="spambayes_classify">
+ <input type="submit" name="trainspam" value="Mark as SPAM" i18n:attributes="value">
+ <input type="submit" name="trainham" value="Mark as HAM (not SPAM)" i18n:attributes="value">
+ </form>
+
<tal:block tal:condition="context/id" tal:replace="structure context/history" />
</td>
Diff for issue.item.html from classic template:
Index: issue.item.html
===================================================================
--- issue.item.html (revision 56578)
+++ issue.item.html (revision 56595)
@@ -182,7 +182,12 @@
</tr>
<tr>
<td colspan="4" class="content">
- <pre tal:content="structure msg/content/hyperlinked">content</pre>
+ <p class="error-message"
+ tal:condition="python:utils.sb_is_spam(msg)">
+ Message has been classified as spam.
+ </p>
+ <pre tal:condition="python:msg.content.is_view_ok()"
+ tal:content="structure msg/content/hyperlinked">content</pre>
</td>
</tr>
</tal:block>
In summary, the item pages for file and msg are modified so that
they do not display the content when the user is not allowed to view it,
and instead display a message saying that the content has been classified
as spam. There are also buttons for reclassification, shown if the
current user is permitted to reclassify.
The item page for issue is modified in the same way: it does not
display content from msg instances marked as spam to users
without permission to see the content.
Setup Permissions
Last but not least, we need to configure security. This is done in
schema.py as usual.
First, we add a new role, Coordinator. Users with this role are
allowed to reclassify messages, training SpamBayes. Then we create two
new permissions, and assign one of them to the Coordinator role:
db.security.addRole(name='Coordinator', description='A coordinator')
db.security.addPermission(name="SB: May Classify")
db.security.addPermission(name="SB: May Report Misclassified")
db.security.addPermissionToRole('Coordinator', 'SB: May Classify')
Then the security settings for the Anonymous role are configured
as follows:
for cl in 'file', 'msg':
    # Metadata is always visible to anonymous users.
    p = db.security.addPermission(name='View', klass=cl,
        description="allowed to see metadata of file object regardless of spam status",
        properties=('creation', 'activity',
                    'creator', 'actor',
                    'name', 'spambayes_score',
                    'spambayes_misclassified',
                    'author', 'recipients',
                    'date', 'files', 'messageid',
                    'inreplyto', 'type',
                    ))
    db.security.addPermissionToRole('Anonymous', p)

    # Content and summary are visible only if the check function allows it.
    spamcheck = db.security.addPermission(name='View', klass=cl,
        description="allowed to see content of object if not classified as spam",
        properties=('content', 'summary'),
        check=may_view_spam(cl))
    db.security.addPermissionToRole('Anonymous', spamcheck)
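The may_view_spam function used above is not shown on this page; it has to be defined in schema.py before the loop runs. A minimal sketch of what such a function could look like follows, assuming it returns a Roundup permission check with the usual (db, userid, itemid) signature. The cutoff constant, the treatment of unscored items, and the FakeClass/FakeDB helpers are all assumptions made for illustration; a real instance would presumably read the cutoff from detectors/config.ini.

```python
SPAM_CUTOFF = 0.2  # assumed value, for illustration only


def may_view_spam(classname, cutoff=SPAM_CUTOFF):
    """Build a permission check for the given class ('file' or 'msg')."""
    def check(db, userid, itemid):
        score = db.getclass(classname).get(itemid, "spambayes_score")
        if score is None:
            # Assumption: items without a score are treated as viewable.
            return True
        # Items scoring above the cutoff are considered spam and hidden.
        return score <= cutoff
    return check


# Tiny stand-ins for the Roundup database, used only to run the sketch.
class FakeClass:
    def __init__(self, scores):
        self.scores = scores

    def get(self, itemid, propname):
        return self.scores.get(itemid)


class FakeDB:
    def __init__(self, classes):
        self.classes = classes

    def getclass(self, name):
        return self.classes[name]


if __name__ == "__main__":
    db = FakeDB({"msg": FakeClass({"1": 0.05, "2": 0.97})})
    check = may_view_spam("msg")
    print(check(db, "anonymous", "1"))  # True: low score, content visible
    print(check(db, "anonymous", "2"))  # False: high score, content hidden
```

Returning a closure lets the same factory serve both classes in the loop, since each check remembers which class it was created for.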
An Example Instance
The python-dev meta tracker schema is based on the classic template and already has the integration described in this document built in. Check it out as follows:
svn co http://svn.python.org/projects/tracker/instances/meta
Credits
Erik Forsberg (http://efod.se) wrote the original version of the SpamBayes integration as well as this document.
Thanks to Skip Montanaro for answering SpamBayes questions.