Yes, it does already have a name and we’re already wondering how best to avoid it. Meta-tags have been rendered untrustworthy by dishonest manipulation (honest manipulation is fine), especially the keywords tag, which is pretty much useless now. Search engines depend on a wide variety of signals collected from web pages and websites, as well as the ecosystem around them. This allows them to gauge realistically how authoritative a document is and what it’s about, among other things. The semantic web, however, depends on tags.
In this post we’ll mostly be referring to RDF when we talk about semantic web technology, because it’s (imho) the strongest contender right now. These tags depend on people (and machines) forming them truthfully and properly, which is sort of wishful thinking, isn’t it?! Many ask “How will we deal with SemSpam?” and others cry out “The SemWeb is doomed!”. I don’t think we need to get dramatic about it, but yes, there are some things to consider.
Like it or not, the semantic web is open for business, and as you’ll read there, Yahoo reports a 15% higher click-through rate for richly formatted snippets. BestBuy reports a 30% increase in traffic after using RDFa, along with a significant boost in rankings.
This is not all about rankings anymore, this is about findability across lots of channels. You achieve that by sharing your resources and providing the best site possible for your users. But…there will inevitably be some spamming along the way, so how do we effectively deal with that?
There has been mass buying of URLs for SEO purposes and for brand protection in the past. This is a sensible thing to do if you run a big company, and not cheating in any sense. Linked data URIs can be secured in much the same way, and largely for the same reasons. The idea is that you create a big hub of data that encourages other companies and individuals to build around it. This is already happening: the NY Times, the BBC and other large SemWeb real estate areas are already in position, so to speak.
The big idea behind the semantic web is that you publish your data in a machine readable format, linking it all together in a sensible way. Then other applications and websites will detect that data and reuse it in whatever way they decide. This creates lots of visibility for your data and also creates lots of links for you. You can start to see how this could easily be abused. In addition to this, the RDF tags need to be accurate and this is a little like the meta-tags issue. RDF depends on a trusted provenance.
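As a sketch of what that publishing looks like in practice, here is a made-up product description in Turtle (all the `example.org` URIs are illustrative; only the vocabulary namespaces are real):

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# A hypothetical resource, linked outward so other applications can reuse it
<http://example.org/products/widget-42>
    rdfs:label   "Widget 42" ;
    foaf:maker   <http://example.org/company> ;
    owl:sameAs   <http://dbpedia.org/resource/Widget> ;   # link into the wider data web
    rdfs:seeAlso <http://example.org/products/widget-42.rdf> .
```

Anything harvesting this can follow the `owl:sameAs` and `rdfs:seeAlso` links to pull in more data — and of course a spammer can publish exactly the same structure with misleading links, which is the problem in a nutshell.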
The W3C provenance vocabulary allows publishers to use classes and properties to describe the provenance of their data, so it’s metadata about data. Provenance is used to assess the reliability of a data source. If you’re making an application like Foursquare, for example, and you want to use geolocation data, you’d probably want to know who published it in the first place so you can decide whether you think it’s accurate enough for your purposes.
There are 2 types of provenance data:
- information recorded by the application that performs the provenance-based evaluation of the data
- information published by the providers of data or services (needed because the first kind alone only captures a small amount of information)
Something like this might be created:
@prefix rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd:      <http://www.w3.org/2001/XMLSchema#> .
@prefix prv:      <http://purl.org/net/provenance/ns#> .
@prefix prvTypes: <http://purl.org/net/provenance/types#> .

<http://example.org/mapping>
    rdf:type prvTypes:TriplifyMapping ;
    prv:createdBy [
        prv:performedAt "2008-03-11T12:00:00Z"^^xsd:dateTime ;
        prv:performedBy <http://example.org/Carol>
    ] .
Here we can see that Carol was in charge of this mapping, for example. With all of the social media involvement we have, and the fact that most (if not all) of us who are likely to publish data have a Facebook account or a Twitter account or whatever, it might be useful to assess the reliability of each individual that way. If I know through XFN or FOAF or the Participation Ontology that Carol is a member of the W3C, an MIT alumna, works at IBM and knows my mate Chris, I’ll probably trust her data. This raises the whole issue of privacy in some respects, because it becomes harder to publish things anonymously.
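In FOAF terms, the sort of profile you’d check before trusting Carol might look like this (a sketch — the `example.org` URIs are invented, and a real profile would carry much more detail):

```turtle
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://example.org/Carol>
    a foaf:Person ;
    foaf:name "Carol" ;
    foaf:workplaceHomepage <http://www.ibm.com/> ;   # works at IBM
    foaf:schoolHomepage    <http://www.mit.edu/> ;   # MIT connection
    foaf:knows <http://example.org/Chris> .          # knows my mate Chris
```

An application could crawl this social graph and use the affiliations and `foaf:knows` links as rough trust signals before accepting Carol’s published data.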
Olaf Hartig and Jun Zhao have been looking at how to solve the data quality conundrum. Their work focuses on adding further checks to the provenance process:
“Our provenance model introduces a new dimension of provenance information, i.e. the provenance of data access, to the existing provenance research. We are gathering feedback to our model from different communities and we foresee continuing development of our provenance vocabulary driven by well-defined use cases. In this paper, we demonstrate assessing the timeliness of data on the Web using our method. We plan to implement this method as part of a Web data publication framework in the near future and to apply this method to the assessment of other quality criteria, such as accuracy.”
This is an example of how scientists are currently looking at ways to establish the credibility of web resources. You’ll find quite a lot more about this sort of thing on the SWPM 2009 wiki.
The idea of the semantic web is that you have machines that can infer things from the data:
X = Y according to Z
We run into some significant problems when other sources disagree or have been tagged up “creatively” (look at Del.icio.us tags, for example). At the moment “Z” is the source of the data (e.g. the BBC), but as the semantic web grows we run into other problems: there are many ways to describe the same thing, and context changes over time. If we add the fact that most people (myself included) don’t really tag up loads of data in our day-to-day lives, where does that leave us?
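One way to keep “Z” attached to every claim is named graphs, where each source’s statements live in their own graph that can be attributed to a publisher. A TriG sketch (all URIs here are illustrative) of two sources disagreeing:

```trig
@prefix ex:  <http://example.org/> .
@prefix dbp: <http://dbpedia.org/resource/> .

# Each source's claims live in their own named graph
ex:bbcGraph {
    dbp:London ex:population "8900000" .
}
ex:randomBlogGraph {
    dbp:London ex:population "15000000" .
}

# Provenance statements attribute each graph to its publisher
ex:bbcGraph        ex:publishedBy <http://www.bbc.co.uk/> .
ex:randomBlogGraph ex:publishedBy ex:someAnonymousBlog .
```

A consuming application can then weight the conflicting population figures by how much it trusts each publisher, rather than treating every triple on the web as equally true.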
There are quite a few tools that can automate the creation of tags (this blog uses OpenCalais), so for me the human labour aspect isn’t an issue right now. I can see there being problems with cheating on the tags and also creating huge hubs of misleading stuff, but if we rely on our individual identities to bring authority to our data, do we have a solution?
There are lots of issues I can think of with this just off the top of my head. We can create fake identities, hijack other people’s, and so on, leading us into a really big tangle.
No question about it: order must come to the web. Using provenance data and trust factors from different resources seems sensible to me. Ian Davis has a lovely post explaining how spam could infiltrate the semantic web, with some good examples, and he explains how we can avoid it too.