I thought this paper was particularly interesting for SEO people and also exciting as a study. “The Web Changes Everything: Understanding the Dynamics of Web Content” by Adar, Teevan, Dumais and Elsas is fascinating on several levels. It’s freely available to all, so I’ll just give a quick run down.
This experiment used a multi-week web crawl of 55,000 Web pages, selected to represent different user visitation patterns, to study how the web changes. The authors looked at “term staying power” within documents over time and found a bimodal distribution: terms were either very stable or changed rapidly. Stable terms tended to be either a document’s ongoing central topic or common function words. Terms with high staying power appear across multiple crawls of the same page.
“As we saw with the longevity plots, terms with high staying power are either descriptive of the document’s ongoing central topic, represent common words, or are navigation elements. In an attempt to distinguish the set of vocabulary that is potentially more informative of the document’s central topic as well as having a strong staying power, we looked at the divergence or clarity of these terms with regard to the collection as a whole.”
They made “term lifespan plots” which showed the dynamics of vocabulary change over time. To assess this they adapted an algorithm usually used to determine query difficulty over a set of documents, or to assess how different subsets of documents are from each other. They used it here to discover which terms distinguish the language of a single document from that of the collection. This method gives a metric called divergence. For details, check the paper.
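The paper doesn’t spell the computation out in this summary, but the underlying idea is a KL-divergence-style “clarity” score: how far a document’s term distribution sits from the collection’s. Here is a minimal sketch of that idea in Python — the function name, smoothing constant, and log base are my own choices, not the paper’s exact algorithm:

```python
from collections import Counter
import math

def clarity(doc_terms, collection_terms, smoothing=1e-9):
    """KL-divergence-style score of a document's term distribution
    against the whole collection's. Higher means the document's
    vocabulary diverges more from the background language."""
    doc = Counter(doc_terms)
    coll = Counter(collection_terms)
    doc_total = sum(doc.values())
    coll_total = sum(coll.values())
    score = 0.0
    for term, count in doc.items():
        p_doc = count / doc_total
        # Smooth terms unseen in the collection to avoid division by zero.
        p_coll = coll.get(term, 0) / coll_total or smoothing
        score += p_doc * math.log2(p_doc / p_coll)
    return score
```

A document dominated by distinctive topic terms will score higher than one made up of common function words, which is exactly the distinction the authors use to separate “central topic” terms from mere boilerplate.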
To determine how likely a term is to appear over time in a web page, they used a metric called “staying power”. This is the “likelihood of observing a word in document D at two different timestamps, t and t + a, where P(t) and P(a) are sampled uniformly”.
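Given that definition, a simple way to estimate staying power from a series of crawl snapshots is to ask: over all pairs of timestamps, how often does the term appear in both? This is a simplified estimator of the quantity quoted above, not the paper’s exact formula:

```python
from itertools import combinations

def staying_power(term, snapshots):
    """Estimate a term's staying power as the fraction of timestamp
    pairs (t, t') in which the term appears in both crawls.
    `snapshots` is a list of term sets, one per crawl, in time order."""
    pairs = list(combinations(snapshots, 2))
    if not pairs:
        return 0.0
    hits = sum(1 for a, b in pairs if term in a and term in b)
    return hits / len(pairs)
```

A term present in every crawl scores 1.0; a term that appeared in a single snapshot and then vanished scores 0.0.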
Pages were also broken down by visitor population, and popular pages turned out to change more frequently. Educational and government domains don’t change very much. Pages with deep URL structure change less often than those at the root. “This may be because top level pages are used as jumping points into the site, with a high navigational to content ratio.” Internal pages with a low navigational-to-content ratio do not change often, but when they do, the change is often drastic.
They also look at structural changes: “Metrics for structural changes are frequently task specific. By creating a flexible serialized processing scheme we are able to rapidly test and measure structural change in different ways.”
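The paper doesn’t detail its serialization scheme here, but the general approach — flatten a page into a token sequence, then compare sequences across crawls — can be sketched quickly. This toy version uses a crude regex tokenizer and `difflib`; a real crawler would parse the DOM properly:

```python
import difflib
import re

def serialize(html):
    """Flatten a page into a sequence of tags and text tokens
    (a deliberately crude serialization for illustration)."""
    return re.findall(r"<[^>]+>|[^<\s]+", html)

def structural_similarity(old_html, new_html):
    """Similarity ratio between two serialized versions of a page;
    1.0 means the serializations are identical."""
    return difflib.SequenceMatcher(
        None, serialize(old_html), serialize(new_html)
    ).ratio()
```

Because the measurement lives on the serialized sequence rather than the raw bytes, you can swap in different serializations (tags only, text only, links only) to measure different flavours of structural change — which is the flexibility the quote is describing.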
They show that their work is of great value for crawlers, ranking, and user interaction. I believe it also offers an interesting metric for keyword research.
Determining how much staying power a term has is relevant to the SEO professional, as it is to the PPC professional. We often choose terms to optimise for because they are highly searched, for example, but knowing whether they will stay important for a site is just as useful. Some very large sites can be ambiguous, and being able to track which terms to optimise for on these would help. If you knew how the content evolved, it would allow for more effective keyword research. You could also find out how prominent term change is across a group of sites, such as the competition. The main topic terms are going to stay, but which ones are fluid in a particular site’s content structure? I’m sure you can think of other ways to use these metrics.