<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Sphinn spam &#8211; some solutions</title>
	<atom:link href="http://www.scienceforseo.com/social-networks/sphinn-spam-some-solutions/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.scienceforseo.com/social-networks/sphinn-spam-some-solutions/</link>
	<description>a bridge between worlds</description>
	<lastBuildDate>Sat, 05 Feb 2011 14:02:08 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<xhtml:meta xmlns:xhtml="http://www.w3.org/1999/xhtml" name="robots" content="noindex" />
	<item>
		<title>By: andymurd</title>
		<link>http://www.scienceforseo.com/social-networks/sphinn-spam-some-solutions/comment-page-1/#comment-20</link>
		<dc:creator>andymurd</dc:creator>
		<pubDate>Thu, 13 Nov 2008 17:34:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.scienceforseo.com/wordpress/?p=109#comment-20</guid>
		<description>I didn&#039;t use an SVM but I should definitely look into it. &lt;br/&gt;&lt;br/&gt;I tried to get around the lack of a spam feed by classifying posts by topic and grouping topics into &quot;good&quot; and &quot;bad&quot;. A lot of Sphinn&#039;s spam submissions are well-written web pages, but they&#039;re just not relevant, like posts about holidays to India.&lt;br/&gt;&lt;br/&gt;It was a fun exercise and I&#039;ll pick it up again.</description>
		<content:encoded><![CDATA[<p>I didn&#8217;t use an SVM but I should definitely look into it. </p>
<p>I tried to get around the lack of a spam feed by classifying posts by topic and grouping topics into &#8220;good&#8221; and &#8220;bad&#8221;. A lot of Sphinn&#8217;s spam submissions are well-written web pages, but they&#8217;re just not relevant, like posts about holidays to India.</p>
<p>It was a fun exercise and I&#8217;ll pick it up again.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: CJ</title>
		<link>http://www.scienceforseo.com/social-networks/sphinn-spam-some-solutions/comment-page-1/#comment-19</link>
		<dc:creator>CJ</dc:creator>
		<pubDate>Thu, 13 Nov 2008 17:25:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.scienceforseo.com/wordpress/?p=109#comment-19</guid>
		<description>I think this is an in-house issue, because they have all of the data, and without it you can&#039;t reliably train and test a classifier.  &lt;br/&gt;&lt;br/&gt;Have you tried using an SVM with the Naive Bayes? &lt;br/&gt;&lt;br/&gt;Def write something up and have another go :)</description>
		<content:encoded><![CDATA[<p>I think this is an in-house issue, because they have all of the data, and without it you can&#8217;t reliably train and test a classifier.  </p>
<p>Have you tried using an SVM with the Naive Bayes? </p>
<p>Def write something up and have another go <img src='http://www.scienceforseo.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: andymurd</title>
		<link>http://www.scienceforseo.com/social-networks/sphinn-spam-some-solutions/comment-page-1/#comment-18</link>
		<dc:creator>andymurd</dc:creator>
		<pubDate>Thu, 13 Nov 2008 17:08:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.scienceforseo.com/wordpress/?p=109#comment-18</guid>
		<description>I tried training a Bayesian classifier to determine Sphinn spam a couple of weeks ago. I got a reasonable start - 90% of spam correctly identified, 10% false negative, 1-3% false positive.&lt;br/&gt;&lt;br/&gt;These figures are good but not good enough for a production website. I couldn&#039;t improve them further due to a couple of problems:&lt;br/&gt;&lt;br/&gt;1. There is no publicly available Sphinn spam data. Grabbing the upcoming and hot feeds and assuming that anything dumped within 3 hours is spam is not accurate enough. I&#039;d need an RSS feed of known spam.&lt;br/&gt;&lt;br/&gt;2. Sphinn has a surprisingly wide variety of posts. You&#039;d think that stories about gambling should be off topic for Sphinn, but no, &lt;a HREF=&quot;http://sphinn.com/story/79400&quot; REL=&quot;nofollow&quot;&gt;this went hot&lt;/a&gt;.&lt;br/&gt;&lt;br/&gt;There are a lot of other good factors that could help, such as the age of the poster&#039;s account, number of repeated words in the comment, URL depth, etc.&lt;br/&gt;&lt;br/&gt;I guess I should have another go, or at least write up my findings.</description>
		<content:encoded><![CDATA[<p>I tried training a Bayesian classifier to determine Sphinn spam a couple of weeks ago. I got a reasonable start &#8211; 90% of spam correctly identified, 10% false negative, 1-3% false positive.</p>
<p>These figures are good but not good enough for a production website. I couldn&#8217;t improve them further due to a couple of problems:</p>
<p>1. There is no publicly available Sphinn spam data. Grabbing the upcoming and hot feeds and assuming that anything dumped within 3 hours is spam is not accurate enough. I&#8217;d need an RSS feed of known spam.</p>
<p>2. Sphinn has a surprisingly wide variety of posts. You&#8217;d think that stories about gambling should be off topic for Sphinn, but no, <a HREF="http://sphinn.com/story/79400">this went hot</a>.</p>
<p>There are a lot of other good factors that could help, such as the age of the poster&#8217;s account, number of repeated words in the comment, URL depth, etc.</p>
<p>I guess I should have another go, or at least write up my findings.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

