Semantic web spam: SemSpam

Yes, it does already have a name and we’re already wondering how best to avoid it. Meta-tags have been rendered untrustworthy because of dishonest manipulation (honest manipulation is ok), especially the keywords tag, which is pretty much useless now. Search engines depend on a wide variety of variables collected from web pages and websites, as well as the ecosystem around them. This allows them to gauge realistically how authoritative a document is and what it’s about, amongst other things. The semantic web, however, depends on tags.

In this post we’ll mostly be referring to RDF when referring to semantic web technology because it’s (imho) the strongest contender right now. These tags depend on people (and machines) forming them truthfully and properly, which is sort of wishful thinking, isn’t it?! Many ask “How will we deal with SemSpam?” and others cry out “The SemWeb is doomed!”. I don’t think we need to get dramatic about it, but yes, there are some things to consider.

The facts:

Like it or not, the semantic web is open for business and, as you’ll read there, Yahoo reports a 15% higher click-through rate for richly formatted snippets. BestBuy reports a 30% increase in traffic after using RDFa and also a significant increase in rankings.

This is not all about rankings anymore; it’s about findability across lots of channels. You achieve that by sharing your resources and providing the best site possible for your users. But… there will inevitably be some spamming along the way, so how do we effectively deal with that?

The issue:

There has been mass buying of URLs in the past, both for SEO and for brand protection. This is a sensible thing to do if you have a big company and not at all cheating in any sense. Linked data URIs can be secured in the same sort of way, and for much the same reasons. The idea is that you create a big hub of data that encourages other companies and individuals to build around it. This is already happening: the NY Times, the BBC and other large owners of SemWeb real estate are already in position, so to speak.

The big idea behind the semantic web is that you publish your data in a machine readable format, linking it all together in a sensible way. Then other applications and websites will detect that data and reuse it in whatever way they decide. This creates lots of visibility for your data and also creates lots of links for you. You can start to see how this could easily be abused. In addition to this, the RDF tags need to be accurate and this is a little like the meta-tags issue. RDF depends on a trusted provenance.

Provenance:

The W3C provenance vocabulary allows publishers to use classes and properties to describe the provenance of their data, so it’s metadata about data. Provenance is used to assess the reliability of a data source. If you’re making an application like Foursquare, for example, and you want to use geolocation data, you’d probably want to know who published it in the first place so you can decide whether it’s accurate enough for your purposes.

There are 2 types of provenance data:

  • information recorded by the application that performs the provenance-based evaluation of the data
  • information published by the providers of data or services (because the latter only allows for a small amount of info to be gathered)

Something like this might be created:


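          @prefix rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
          @prefix xsd:      <http://www.w3.org/2001/XMLSchema#> .
          @prefix prv:      <http://purl.org/net/provenance/ns#> .
          @prefix prvTypes: <http://purl.org/net/provenance/types#> .
          <http://example.org/mapping>
          rdf:type prvTypes:TriplifyMapping ;
          prv:createdBy [ prv:performedAt "2008-03-11T12:00:00Z"^^xsd:dateTime ;
                          prv:performedBy <http://example.org/Carol> ] .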

Here we can see, for example, that Carol was in charge of this mapping. With all of the social media involvement we have, and the fact that most (if not all) of us who are likely to publish data have a Facebook account or a Twitter account or whatever, it might be useful to assess the reliability of each individual that way. If I know through XFN or FOAF or the Participation Ontology that Carol is a member of the W3C, an MIT alumna, works at IBM and knows my mate Chris, I’ll probably trust her data. This raises the whole issue of privacy in some respects, because it becomes harder to publish things anonymously.
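To make that idea a bit more concrete, here’s a rough Python sketch of how an application might turn what it knows about a publisher into a trust score. Everything in it is made up for illustration: the `Publisher` class, the `ORG_POINTS` weights and the `trust_score` function are my assumptions, not part of FOAF, XFN or any provenance vocabulary, and a real system would read these facts from published profile data rather than hard-coding them.

```python
from dataclasses import dataclass, field

# Hypothetical weights (points out of 100) that *I* assign to affiliations.
ORG_POINTS = {"W3C": 30, "MIT": 20, "IBM": 20}
# Hypothetical bonus when the publisher knows someone I already know.
SHARED_CONTACT_POINTS = 30


@dataclass
class Publisher:
    """What we happen to know about the person behind a dataset."""
    uri: str
    affiliations: set = field(default_factory=set)
    knows: set = field(default_factory=set)


def trust_score(publisher, my_contacts):
    """Return a score from 0 to 100; higher means I'm more inclined to trust."""
    score = sum(ORG_POINTS.get(org, 0) for org in publisher.affiliations)
    if publisher.knows & my_contacts:  # at least one contact in common
        score += SHARED_CONTACT_POINTS
    return min(score, 100)


my_contacts = {"http://example.org/Chris"}
carol = Publisher(
    uri="http://example.org/Carol",
    affiliations={"W3C", "MIT", "IBM"},
    knows={"http://example.org/Chris"},
)
anonymous = Publisher(uri="http://example.org/unknown")

print(trust_score(carol, my_contacts))      # 100
print(trust_score(anonymous, my_contacts))  # 0
```

Notice that the anonymous publisher scores zero simply because we know nothing about them, which is the privacy trade-off in a nutshell.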

Olaf Hartig and Jun Zhao have been looking at how to solve the data quality conundrum. Their work focuses on adding further checks to the provenance process:

“Our provenance model introduces a new dimension of provenance information, i.e. the provenance of data access, to the existing provenance research. We are gathering feedback to our model from different communities and we foresee continuing development of our provenance vocabulary driven by well-defined use cases. In this paper, we demonstrate assessing the timeliness of data on the Web using our method. We plan to implement this method as part of a Web data publication framework in the near future and to apply this method to the assessment of other quality criteria, such as accuracy.”

This is an example of how scientists are currently looking at ways to establish the credibility of web resources. You’ll find quite a lot more about this sort of thing on the SWPM 2009 wiki.
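To give a feel for what “assessing timeliness” could look like in practice, here’s a small Python sketch. It is not Hartig and Zhao’s actual method: I’m simply assuming a linear decay between the creation time recorded in the provenance data and a shelf life that the consuming application picks for itself. The function names `parse_xsd_datetime` and `timeliness`, and the 180-day shelf life, are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_xsd_datetime(text):
    """Parse a UTC xsd:dateTime literal such as "2008-03-11T12:00:00Z"."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def timeliness(created_at, accessed_at, shelf_life):
    """Linear decay (an assumption): 1.0 when brand new, 0.0 past the shelf life."""
    age = accessed_at - created_at
    if age < timedelta(0):
        raise ValueError("data cannot be accessed before it was created")
    return max(0.0, 1.0 - age / shelf_life)


# Creation time taken from the prv:performedAt value in the example above.
created = parse_xsd_datetime("2008-03-11T12:00:00Z")
shelf_life = timedelta(days=180)  # hypothetical: chosen by the consuming app

print(timeliness(created, created + timedelta(days=90), shelf_life))   # 0.5
print(timeliness(created, created + timedelta(days=400), shelf_life))  # 0.0
```

Geolocation data for a shop might deserve a short shelf life, while a list of chemical elements could have a very long one, so the shelf life really belongs to the application rather than the data.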

Greater problems:

The idea of the semantic web is that you have machines that can infer things from the data:

X = Y according to Z

We run into some significant problems when other sources disagree or have been tagged up “creatively” (look at Del.icio.us tags for example). At the moment “Z” is the source of the data (e.g. BBC), but as the semantic web grows we run into other problems too, like there being many ways to describe the same thing, and context changing over time. If we add to this the fact that most people (myself included) don’t really have to tag up a load of data in our day to day lives, where does that leave us?

Solutions:

There are quite a few tools that can automate the creation of tags (this blog uses OpenCalais), so for me the human labour aspect isn’t an issue right now. I can see there being problems with cheating on the tags and also creating huge hubs of misleading stuff, but if we rely on our individual identities to bring authority to our data, do we have a solution?

There are lots of issues I can think of just off the top of my head with this. We can create fake identities, hijack other people’s and so on, leading us to a really big tangle.

No question about it, order must come to the web. Using provenance data and trust factors from different resources seems sensible to me. Ian Davis has a lovely post explaining how spam could infiltrate the semantic web and gives some good examples. He explains how we can avoid that too.
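Here’s a rough Python sketch of how trust factors could help with the “X = Y according to Z” problem when sources disagree. It’s only an illustration under my own assumptions: the trust points, the spam hub names and the `resolve` and `majority` functions are all hypothetical, and the claims are toy data where X is “the capital of Australia”.

```python
from collections import defaultdict

# Hypothetical trust points per source; unknown sources get very little.
SOURCE_TRUST = {"bbc.co.uk": 90, "nytimes.com": 85}
UNKNOWN_SOURCE_TRUST = 1


def resolve(claims):
    """Pick the value for X with the most trust-weighted support.

    `claims` is a list of (value, source) pairs,
    i.e. "X = value according to source".
    """
    support = defaultdict(int)
    for value, source in claims:
        support[value] += SOURCE_TRUST.get(source, UNKNOWN_SOURCE_TRUST)
    return max(support, key=support.get)


def majority(claims):
    """Naive alternative: one source, one vote."""
    votes = defaultdict(int)
    for value, _source in claims:
        votes[value] += 1
    return max(votes, key=votes.get)


claims = [
    ("Canberra", "bbc.co.uk"),
    ("Sydney", "spam-hub-1.example"),
    ("Sydney", "spam-hub-2.example"),
    ("Sydney", "spam-hub-3.example"),
]

print(majority(claims))  # Sydney
print(resolve(claims))   # Canberra
```

A plain head count lets three throwaway hubs outvote one reputable source, whereas weighting by trust makes that kind of SemSpam much more expensive to pull off. It doesn’t solve fake or hijacked identities, of course.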


5 Comments

  1.

    Interesting post, CJ. Microformats, etc., will no doubt be used and mis-used, and Google & co will no doubt find ways to more-or-less filter the spam.

    I think the really interesting thing you introduced here was the impact of individual trustworthiness based on social media accounts, etc. As this is much harder to manipulate effectively (i.e. it’s harder to sneak social media spam past the guardians of social media: its users), I think this sort of signal has far greater potential, and that it will definitely happen (if it’s not happening already).

    I’ve spoken to our Canuck friend, Dave ‘the Gypsy’ Harry, and read his writings on the matter, and he’s convinced it’s quite a way off. Too hard to filter the noise. What are your thoughts?

  2.

    The spam problem is indeed a powerful issue for the semantic web, but there’s another issue that I think is just as important: content scrapers. The semantic web makes it easier for all machines to understand data better, including malicious content scrapers. It’ll make the content theft industry all that much more attractive.

  3. CJ

    Hey Glenn,

    I think that there are a lot of bright and stunningly creative people working in semantic web technology so I figure they will know what to do. I don’t see this stuff being far off because it’s already being tested :)

    Hey Barry,

    there are innumerable copies of my posts on loads of spam sites, and my content has also been hand-lifted and inserted around other content, no credit given of course. I don’t think the semantic web is going to make that any easier or harder. I think that this is a different kind of problem. I mean, if someone really wants to steal your work, they will. Credit cards and bank accounts get fleeced regularly, and databases get infiltrated too.

    I do unfortunately believe that there will be abuse. In some respects that’s good, it drives innovation :)

  4.

    I hate to say this, but I added microformats to my jewelry site a couple weeks ago.. The day after Google made their little announcement about them.. They now have a hopefully measurable ROI for the effort in reverse engineering a shopping cart to make them work..

    Will they be manipulated and abused?? You bet.. Just like everything else out there, there are plenty of people that have little care for how their actions affect others.. Am I worried?? Nope.. I have more important things on my plate than to worry about what a spammer or scraper will do..

    What I do see as a problem though is all of the people that will be left behind due to a lack of technical knowledge.. There is already a widening gap between mom and pop sites and “professional” sites, and this will just push that divide even further.. It will further add to the segmenting of the web and make it harder and harder to find those hidden gems simply because they don’t have the technical skills to make them more visible..

  5. CJ

    You make a really important and interesting point Steve. I’ll raise it at the Semantic Web Group meeting next week and get back to you.



© 2009-2012 Science for SEO. All Rights Reserved.