I wanted to talk about how computers deal with text, or rather how they deal with what text means. Todd Mintz, for example, posted about Google returning something other than the meta description he had supplied. In an ideal world there would be no need for meta tags, because machines could understand the content of a site from the text itself. In fact, this is already in the process of being accomplished. The goal of this post is to give some idea of what is involved, in its most basic form. It's not a tutorial on how to do it or anything like that.
Who should read this?
SEO readers who want to broaden their horizons a bit and understand what’s involved in moving away from meta-tags, and what a few of the options are for search engines.
If you’re on board, let’s get moving…
We have one big problem as a search engine:
There is a lot of legitimate SEO and a lot of well-constructed websites out there. They rank well because they have all the right elements and they play by the "rules". What happens to all the sites that provide excellent information and resources but are built with no SEO in mind? The way the engines work, they don't pick them up as well, so they don't rank, which makes them harder for people to find.
They do, however, deserve to be found. The SEO'd sites are therefore causing a problem without really meaning to. SEOs do competitor analysis based on who is in the rankings, not who isn't, so these little guys aren't considered competition. Turn things around, though, and I think a lot of them are. I know of sites that have been tough to find and that I consider authorities on particular topics.
That’s what information retrieval is all about – finding things. And finding particular things is still difficult.
Relying on link structures, keywords and meta data feels like using crutches. I don't for a minute claim to have made an alternative system able to discard them. I am saying that we are, and should be, investigating that. There are tons of different disciplines that come into play, but here I just want to look at text, words, knowledge, strings…
The following two areas of research are interesting, and intensely complicated too. I'm not going to go into any depth, just introduce them briefly, because the idea is to think about new things rather than learn about those specific areas.
Natural language understanding (NLU):
This area of computer science aims to take text and put it through a process that converts it into a format that allows a machine to understand what it means. We don't even want to begin to get into what the definition of "understand" is. To be practical, we'll assume it means allowing a machine to do something based on its interpretation of the text. Understanding is defined as a psychological process, and we still can't say a machine has a psyche. Yet. I don't think…
NLU approaches are often symbolic: symbols are created and defined meanings are attached to them. Full parsing is one such example.
We can part-of-speech tag the text to understand what function the words have: SEO/NNP is/VBZ fun/NN ./.
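To make the idea concrete, here is a minimal sketch of part-of-speech tagging. A real tagger is statistical and trained on annotated text; this toy version just uses a hand-written lookup table (the `TAG_LEXICON` here is invented for illustration), but it produces the same `word/TAG` notation as above.

```python
# Toy part-of-speech tagger: a hypothetical lookup table stands in for a
# real statistical tagger trained on annotated corpora.
TAG_LEXICON = {
    "SEO": "NNP",   # proper noun
    "is": "VBZ",    # verb, 3rd person singular present
    "fun": "NN",    # common noun
    ".": ".",       # sentence-final punctuation
}

def pos_tag(tokens):
    """Return (token, tag) pairs; unknown words default to NN."""
    return [(tok, TAG_LEXICON.get(tok, "NN")) for tok in tokens]

tagged = pos_tag(["SEO", "is", "fun", "."])
print(" ".join(f"{tok}/{tag}" for tok, tag in tagged))
# → SEO/NNP is/VBZ fun/NN ./.
```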
The parse looks like this:
(NP (NNP SEO))
(VP (VBZ is)
(NP (NN fun)))
[Image: the same parse drawn as a tree]
This gives the machine some symbols attached to the words and also some information on how the whole thing hangs together: the syntactic structure of the sentence (on a basic level). Now the machine needs to know what the words mean. This can be done in lots of different ways.
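A bracketed parse like the one above is itself machine-readable. As a sketch of how a program turns it into a structure it can walk, here is a tiny reader that converts the bracket notation (wrapped in a root S node for completeness) into nested lists:

```python
def parse_sexpr(text):
    """Parse a Penn-Treebank-style bracketed parse into nested lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        # Each node starts with "(" followed by its label, e.g. NP.
        node = [tokens[pos + 1]]
        pos += 2
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                child, pos = read(pos)   # recurse into a sub-phrase
                node.append(child)
            else:
                node.append(tokens[pos])  # a leaf word
                pos += 1
        return node, pos + 1

    tree, _ = read(0)
    return tree

tree = parse_sexpr("(S (NP (NNP SEO)) (VP (VBZ is) (NP (NN fun))))")
print(tree)
# → ['S', ['NP', ['NNP', 'SEO']], ['VP', ['VBZ', 'is'], ['NP', ['NN', 'fun']]]]
```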
We can then use machine-readable dictionaries like WordNet to see how the words fit into the "world": seo is fun. You'll notice that there is no entry for "SEO", which means one has to be created. Another step, of course, is figuring out which sense of each word is the correct one, but that's beyond the scope of this post.
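As a rough sketch of what such a lookup involves, here is a miniature hand-built stand-in for a machine-readable dictionary. The entries are invented for illustration, not taken from WordNet, but they show the two points above: each word maps to a gloss plus a more general parent concept, and a missing word like "SEO" simply returns nothing until an entry is created.

```python
# Miniature stand-in for a machine-readable dictionary like WordNet.
# Each word maps to (gloss, hypernym); the data is hand-written for
# illustration only.
LEXICON = {
    "fun": ("activities that are enjoyable", "activity"),
    "activity": ("any specific behavior", "act"),
}

def lookup(word):
    """Return (gloss, hypernym) for a word, or None if there is no entry."""
    return LEXICON.get(word.lower())

print(lookup("fun"))
print(lookup("SEO"))  # no entry yet, so None: one has to be created
LEXICON["seo"] = ("search engine optimization", "marketing")
print(lookup("SEO"))
```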
Once you put together the syntactic and semantic information, you have a representation of the sentence that a machine can use (theoretically, though not yet in practice). We could then say we had everything we needed for "understanding" to take place… but not quite.
Frame semantics adds a whole new level of complexity. Fillmore's point, applied here, is that you can't understand a word like "SEO" unless you know the essential background knowledge that relates to it. You'd have to know about the internet, the web, marketing, online advertising and so on, otherwise the word makes no sense. Fine. So how does that work then?
Using FrameNet we can look at example frames to get an idea of what is involved. Take the frame "Building":
The red arrows mean that there is inheritance from one bubble to the next. We can see in this frame that “building” can be an object or a verb for example.
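That inheritance between bubbles can be sketched as a data structure. This is a toy frame network in the spirit of FrameNet, not its actual data: the frame names and elements below are simplified for illustration. The point is that a frame's full meaning is collected by walking up the inheritance chain.

```python
# Toy frame network: each frame has a parent (inheritance arrow) and its
# own frame elements. Names and elements are simplified for illustration.
FRAMES = {
    "Intentionally_create": {"parent": None,
                             "elements": {"Creator", "Created_entity"}},
    "Building":             {"parent": "Intentionally_create",
                             "elements": {"Agent", "Components"}},
}

def frame_elements(name):
    """Collect a frame's elements, following inheritance up to the root."""
    elements = set()
    while name is not None:
        frame = FRAMES[name]
        elements |= frame["elements"]
        name = frame["parent"]  # follow the red arrow
    return elements

print(sorted(frame_elements("Building")))
# → ['Agent', 'Components', 'Created_entity', 'Creator']
```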
On the road to understanding…
If we carry on down this route, we are thrown into a world that quickly becomes very, very complicated, but one that shows great promise. We are effectively getting semantic and syntactic information on sentences, setting that within a "knowledge space" (the frames, for example), and then applying the whole process to the full text of a page.
From there we’d have to find more patterns in the overall strutures of the text that we can now see through the previous processes. These show us what is or is not important in the text. Then…we can say that the machine knows what we’re on about (maybe).
If, using a far more complex and effective method, we could indeed make it so that a machine could easily understand text and relate it to other texts, then we wouldn't need meta data anymore.
But the semantic web is about adding meta-data:
We are in the dark ages here: we have no way of putting information into a machine unless we actually code it in. The markup languages used for the semantic web, like RDFa, help machines gather extra information about a resource. "Semantic Annotation, Indexing, and Retrieval" is a good read if you're interested in that side of things.
<meta name="keywords" content="seo,fun,amusing,whatever">
<meta name="description" content="seo is fun">
<meta name="author" content="Miss Brackets">
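This is exactly the kind of hand-coded information a crawler can pull out mechanically. A minimal sketch, using only Python's standard library (the class name is mine):

```python
from html.parser import HTMLParser

# Minimal sketch of how a crawler might collect meta tags from a page.
class MetaExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if "name" in attrs and "content" in attrs:
                self.meta[attrs["name"]] = attrs["content"]

page = """<head>
<meta name="keywords" content="seo,fun,amusing,whatever">
<meta name="description" content="seo is fun">
<meta name="author" content="Miss Brackets">
</head>"""

extractor = MetaExtractor()
extractor.feed(page)
print(extractor.meta["description"])  # → seo is fun
```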
The fun stuff:
<span property="dc:creator">Miss Brackets</span>
tells you why seo is fun,
for all sorts of different reasons.
The post is due to be published in
<span property="dc:date" content="2009-04-01">April 2009</span>.
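A machine reading that snippet can lift out the property/value pairs without understanding the prose at all. A rough sketch (simplified, and the class name is mine): in RDFa, a `content` attribute overrides the element's visible text, which is why the date comes out as "2009-04-01" rather than "April 2009".

```python
from html.parser import HTMLParser

# Simplified sketch of extracting RDFa property/value pairs.
class RDFaExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.props = {}
        self._prop = None
        self._done = False  # True if a content attribute already won

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "property" in attrs:
            self._prop = attrs["property"]
            if "content" in attrs:
                self.props[self._prop] = attrs["content"]
                self._done = True  # content attribute overrides the text

    def handle_data(self, data):
        if self._prop and not self._done:
            self.props[self._prop] = data

    def handle_endtag(self, tag):
        self._prop = None
        self._done = False

html = ('<span property="dc:creator">Miss Brackets</span> tells you why '
        'seo is fun. The post is due to be published in '
        '<span property="dc:date" content="2009-04-01">April 2009</span>.')

extractor = RDFaExtractor()
extractor.feed(html)
print(extractor.props)
# → {'dc:creator': 'Miss Brackets', 'dc:date': '2009-04-01'}
```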
The semantic web is not the same discipline as NLU, but NLU can use the semantic web's markup languages and the ontologies being created to add meaning to text. This is one of the reasons the semantic web is so important: it is allowing other techniques to be developed further. And like NLU, it shares the common trait of a search for meaning.
You can find more information about the semantic web and its relevance to SEO in these other posts.
Why should you care:
We know about all the work being done for the semantic web, and putting it into perspective shows how it affects other areas of computing. Areas like NLU are sometimes directly related to the web and to the work of information retrieval. If we know what documents mean, then we can make a system that finds them based on meaning rather than on variables like links. This isn't due to happen tomorrow, but it will at some point.
Working with the meaning of documents allows the sites that are not receiving the SEO treatment to be included as well. Now that should matter to you.