March 14th, 2010 - 5:32 pm § in Semantic web

Semantic web spam: SemSpam

Yes, it already has a name, and we’re already wondering how best to avoid it. Meta-tags have been rendered untrustworthy by dishonest manipulation (honest manipulation is ok), especially the keywords tag, which is pretty much useless now. Search engines depend on a wide variety of variables collected from web pages and websites, as well as the ecosystem around them. This allows them to gauge realistically how authoritative a document is and what it’s about, amongst other things. The semantic web, however, depends on tags.

In this post we’ll mostly mean RDF when we talk about semantic web technology, because it’s (imho) the strongest contender right now. These tags depend on people (and machines) forming them truthfully and properly, which is sort of wishful thinking, isn’t it?! Many ask “How will we deal with SemSpam?” and others cry out “The SemWeb is doomed!”. I don’t think we need to get dramatic about it, but yes, there are some things to consider.

The facts:

Like it or not, the semantic web is open for business and, as you’ll read there, Yahoo reports a 15% higher click-through rate for richly formatted snippets. BestBuy reports a 30% increase in traffic after using RDFa, and also a significant increase in rankings.

This is not all about rankings anymore; it’s about findability across lots of channels. You achieve that by sharing your resources and providing the best site possible for your users. But… there will inevitably be some spamming along the way, so how do we effectively deal with that?

The issue:

In the past there has been mass buying of URLs for the purposes of SEO and also for brand protection. This is a sensible thing to do if you have a big company, and not at all cheating in any sense. Linked data URIs can be secured in the same sort of way, and for much the same reasons. The idea is that you create a big hub of data that encourages other companies and individuals to build around it. This is already happening: the NY Times, the BBC and other large SemWeb real estate areas are already in position, so to speak.

The big idea behind the semantic web is that you publish your data in a machine-readable format, linking it all together in a sensible way. Then other applications and websites detect that data and reuse it in whatever way they decide. This creates lots of visibility for your data and also creates lots of links for you. You can start to see how this could easily be abused. In addition, the RDF tags need to be accurate, which is a little like the meta-tags issue. RDF depends on a trusted provenance.

Provenance:

The W3C provenance vocabulary allows publishers to use classes and properties to describe the provenance of their data, so it’s metadata about data. Provenance is used to assess the reliability of a data source. If you’re making an application like Foursquare, for example, and you want to use geolocation data, you’d probably want to know who published it in the first place so you can decide whether it’s accurate enough for your purposes.

There are two types of provenance data:

  • information recorded by the application that performs the provenance-based evaluation of the data
  • information published by the providers of data or services (because the latter only allows for a small amount of info to be gathered)

Something like this might be created:


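          @prefix rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
          @prefix xsd:      <http://www.w3.org/2001/XMLSchema#> .
          @prefix prv:      <http://purl.org/net/provenance/ns#> .
          @prefix prvTypes: <http://purl.org/net/provenance/types#> .
          <http://example.org/mapping>
          rdf:type prvTypes:TriplifyMapping ;
          prv:createdBy [ prv:performedAt "2008-03-11T12:00:00Z"^^xsd:dateTime ;
                          prv:performedBy <http://example.org/Carol> ] .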

Here we can see, for example, that Carol was in charge of this mapping. With all of the social media involvement we have, and the fact that most (if not all) of us who are likely to publish data have a Facebook account or a Twitter account or whatever, it might be useful to assess the reliability of each individual that way. If I know through XFN or FOAF or the Participation Ontology that Carol is a member of the W3C, an MIT alumna, works at IBM and knows my mate Chris, I’ll probably trust her data. This brings up the whole issue of privacy in some respects, because it becomes harder to publish things anonymously.
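To make the "do I trust Carol?" idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the fact labels, the weights and the threshold are invented, and the facts are hard-coded rather than read from real XFN or FOAF data.

```python
# Hypothetical weights: how many "trust points" each known fact about a
# publisher is worth. The labels and numbers are made up for this sketch.
TRUST_WEIGHTS = {
    "member_of:W3C": 3,
    "alumni_of:MIT": 2,
    "works_at:IBM": 2,
    "knows:Chris": 3,
}

def trust_score(known_facts):
    """Add up the points for every fact we recognise about a publisher."""
    return sum(TRUST_WEIGHTS.get(fact, 0) for fact in known_facts)

def is_trusted(known_facts, threshold=5):
    """Trust the publisher if the score reaches the (assumed) threshold."""
    return trust_score(known_facts) >= threshold

carol = {"member_of:W3C", "alumni_of:MIT", "works_at:IBM", "knows:Chris"}
stranger = {"works_at:IBM"}

print(is_trusted(carol))     # Carol has plenty of recognised facts
print(is_trusted(stranger))  # a single fact is not enough here
```

In a real application the facts would come from published profile data, and anyone can publish a profile, so the profile itself would need checking too.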

Olaf Hartig and Jun Zhao have been looking at how to solve the data quality conundrum. Their work focuses on adding further checks to the provenance process:

“Our provenance model introduces a new dimension of provenance information, i.e. the provenance of data access, to the existing provenance research. We are gathering feedback to our model from different communities and we foresee continuing development of our provenance vocabulary driven by well-defined use cases. In this paper, we demonstrate assessing the timeliness of data on the Web using our method. We plan to implement this method as part of a Web data publication framework in the near future and to apply this method to the assessment of other quality criteria, such as accuracy.”

This is an example of how scientists are currently trying to establish the credibility of web resources. You’ll find quite a lot more about this sort of thing on the SWPM 2009 wiki.
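As a rough illustration of what "assessing timeliness" from provenance could look like, here is a small Python sketch. It is not the method from Hartig and Zhao’s paper: the linear decay and the one-year cut-off are my own assumptions, and the function names are hypothetical. It only uses the kind of creation timestamp shown in the mapping example above.

```python
from datetime import datetime, timedelta, timezone

def parse_xsd_datetime(value):
    """Parse a UTC xsd:dateTime string such as '2008-03-11T12:00:00Z'."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)

def timeliness(performed_at, now, max_age=timedelta(days=365)):
    """Score from 1.0 (brand new) down to 0.0 (max_age old or older).

    Assumption: freshness decays linearly with age.
    """
    age = now - parse_xsd_datetime(performed_at)
    if age <= timedelta(0):
        return 1.0  # timestamp is not in the past, treat as fresh
    return max(0.0, 1.0 - age / max_age)

# Fixed "now" so the example always gives the same result.
now = datetime(2010, 3, 14, tzinfo=timezone.utc)

print(timeliness("2008-03-11T12:00:00Z", now))  # about two years old
print(timeliness("2010-03-13T00:00:00Z", now))  # one day old
```

An application could combine a score like this with what it knows about the publisher before deciding whether to reuse the data.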

Greater problems:

The idea of the semantic web is that you have machines that can infer things from the data:

X = Y according to Z

We run into some significant problems when other sources disagree or have been tagged up “creatively” (look at Del.icio.us tags for example). At the moment “Z” is the source of the data (e.g. BBC), but as the semantic web grows we run into other problems too, like there being many ways to describe the same thing, and context that changes over time. If we add to this the fact that most people (myself included) don’t really have to tag up a load of data in our day to day lives, where does that leave us?
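Here is one way to picture “X = Y according to Z” when sources disagree, as a small Python sketch. It is only an illustration under stated assumptions: the claims, the source names and the trust values are all invented, and a trust-weighted vote is just one possible way to settle a conflict.

```python
from collections import defaultdict

# Each claim is (X, property, Y, Z): X has value Y according to source Z.
# All of this data is invented for the sketch.
claims = [
    ("ex:EiffelTower", "ex:locatedIn", "Paris", "trusted-news.example"),
    ("ex:EiffelTower", "ex:locatedIn", "Paris", "big-paper.example"),
    ("ex:EiffelTower", "ex:locatedIn", "Las Vegas", "spam-hub.example"),
]

# Hypothetical trust per source, e.g. the result of provenance checks.
source_trust = {
    "trusted-news.example": 0.9,
    "big-paper.example": 0.9,
    "spam-hub.example": 0.1,
}

def resolve(claims, source_trust, subject, prop):
    """Return the best-supported value and the sources that back it."""
    support = defaultdict(float)
    sources = defaultdict(list)
    for s, p, value, source in claims:
        if (s, p) == (subject, prop):
            # Unknown sources contribute nothing.
            support[value] += source_trust.get(source, 0.0)
            sources[value].append(source)
    if not support:
        return None  # nobody says anything about this
    best = max(support, key=support.get)
    return best, sources[best]

print(resolve(claims, source_trust, "ex:EiffelTower", "ex:locatedIn"))
```

Keeping the sources next to the winning value means the answer still reads as “according to Z”, so an application can show where it came from or re-check it later.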

Solutions:

There are quite a few tools that can automate the creation of tags (this blog uses OpenCalais), so for me the human labour aspect isn’t an issue right now. I can see there being problems with cheating on the tags and also creating huge hubs of misleading stuff, but if we rely on our individual identities to bring authority to our data, do we have a solution?

There are lots of issues I can think of just off the top of my head with this. We can create fake identities, hijack other people’s and so on, leading us to a really big tangle.

No question about it, order must come to the web. Using provenance data and trust factors from different resources seems sensible to me. Ian Davis has a lovely post explaining how spam could infiltrate the semantic web and gives some good examples. He explains how we can avoid that too.

© 2009-2010 Science for SEO All Rights Reserved -- Copyright notice by Blog Copyright