I came across a very cool paper from SIGKDD 2008 called “Blogosphere: Research Issues, Tools, and Applications” by Nitin Agarwal and Huan Liu from Arizona State University. It’s a long read but an easy one: written for the geek, yet quite happily understood by the layman. I’ve pulled out some things that I thought were interesting and given you a short taster here, but I urge you to read the paper. It’s brilliant.
There is a model of the web, called the webgraph, where each webpage is a node and each hyperlink is an edge. It provides a visual model of the web which can be used for many things. Search engines, for example, use this graph for ranking documents.
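To make the webgraph idea concrete, here is a minimal sketch in Python. The four-page web and the `pagerank` helper are my own illustration, not something taken from the paper, and the damping factor of 0.85 is just the commonly quoted default. It shows how a ranking can fall out of nothing more than the link structure.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping page -> list of linked pages.

    Assumes every link target also appears as a key in `graph`.
    """
    pages = list(graph)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, links in graph.items():
            if links:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(links)
                for target in links:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: each key is a node, each list holds its outgoing edges.
webgraph = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}

ranks = pagerank(webgraph)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

The page that nobody links to ends up at the bottom, which is exactly why a sparse, short-lived link structure like the blogosphere's is a problem for this kind of model.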
We can’t map the blogosphere in the same way because links are sparse, and blog posts are dynamic and quite often short-lived. Also, the comment structure, which provides for interaction, does not exist in the webgraph model. The webgraph assumes that sites build up links over time, and this isn’t so in the blogosphere. In short, we cannot use a static graph like the webgraph.
One way to model the blogosphere is to gather data on link density, how often people create blog posts, burstiness and popularity, and how these blog posts are linked. It’s also possible to use blogrolls to find similar blogs. This is what Leskovec et al. did, using a cascade model of the kind usually used in epidemiology:
“This way any randomly picked blog can infect its uninfected immediate neighbors probabilistically, which repeats the same process until no node remains uninfected. In the end, this gives a blog network.”
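The infection process in that quote can be sketched roughly as below. This is a toy version under my own assumptions, not the authors' actual model: the `blogroll` graph, the single fixed `infect_prob`, and the rule of restarting from a fresh random blog whenever the spread dies out are all made up for illustration.

```python
import random


def simulate_cascade(neighbors, infect_prob, rng):
    """Spread an 'infection' until no blog is left uninfected.

    Returns the list of (source, target) infection edges, which together
    form the generated blog network.
    """
    uninfected = set(neighbors)
    edges = []
    frontier = []
    while uninfected:
        if not frontier:
            # Start (or restart) from a randomly picked uninfected blog.
            seed = rng.choice(sorted(uninfected))
            uninfected.discard(seed)
            frontier = [seed]
        next_frontier = []
        for blog in frontier:
            for other in neighbors[blog]:
                # Each uninfected immediate neighbor is infected with a fixed probability.
                if other in uninfected and rng.random() < infect_prob:
                    uninfected.discard(other)
                    edges.append((blog, other))
                    next_frontier.append(other)
        frontier = next_frontier
    return edges


# Hypothetical neighborhood structure (for example, taken from blogrolls).
blogroll = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c"],
}

edges = simulate_cascade(blogroll, infect_prob=0.5, rng=random.Random(42))
print(edges)
```

Passing in a seeded `random.Random` keeps a run repeatable, which is handy if you want to compare different infection probabilities on the same graph.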
Brooks and Montanez used tf-idf to find the top 3 words in every post and then computed blog similarity based on those words, which meant that they could cluster the blogs.
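Here is a small sketch of the top-3-words idea. The three sample posts are invented, posts are assumed to be non-empty lists of tokens, and I've used Jaccard overlap between keyword sets as a simple stand-in for the similarity measure, so treat it as an illustration rather than a reproduction of what Brooks and Montanez did.

```python
import math
from collections import Counter


def top_keywords(posts, k=3):
    """Return the set of the k highest tf-idf words for each post."""
    n = len(posts)
    # Document frequency: in how many posts does each word appear?
    doc_freq = Counter(word for post in posts for word in set(post))
    result = []
    for post in posts:
        counts = Counter(post)
        scores = {
            word: (count / len(post)) * math.log(n / doc_freq[word])
            for word, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda word: (-scores[word], word))
        result.append(set(ranked[:k]))
    return result


def jaccard(a, b):
    """Overlap between two keyword sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical, already tokenized posts.
posts = [
    "python code python tutorial the".split(),
    "python code review the tips".split(),
    "garden soil tomato the plants".split(),
]

keywords = top_keywords(posts)
print(keywords)
print(jaccard(keywords[0], keywords[1]), jaccard(keywords[0], keywords[2]))
```

Notice that "the" never makes the cut: it appears in every post, so its idf is zero. With only three words per post, though, most pairs of posts share nothing at all, which is the sparsity issue in miniature.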
The problem is that these methods are keyword-based clustering, and therefore have high-dimensionality and sparsity issues. You could reduce this by using latent semantic indexing (LSI), but the results still aren’t so good.
Many companies have already seen the usefulness of blogs for sentiment analysis, trend tracking and reputation management. Some systems start from sentences manually tagged as positive or negative, then use a naive Bayes classifier to label the rest until everything has been classified.
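A bare-bones version of that approach might look like the following. The `NaiveBayes` class, the four hand-tagged sentences and the add-one smoothing are my own choices for the sake of a runnable example. Real systems would use far more training data and proper tokenization.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial naive Bayes text classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()            # label -> number of sentences
        self.vocab = set()

    def train(self, sentence, label):
        words = sentence.lower().split()
        self.word_counts[label].update(words)
        self.label_counts[label] += 1
        self.vocab.update(words)

    def classify(self, sentence):
        words = sentence.lower().split()
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # Log prior plus the log likelihood of each word given the label.
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in words:
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical manually tagged sentences.
tagged = [
    ("I love this phone", "positive"),
    ("great battery and great screen", "positive"),
    ("I hate this phone", "negative"),
    ("terrible battery and poor screen", "negative"),
]

classifier = NaiveBayes()
for sentence, label in tagged:
    classifier.train(sentence, label)

print(classifier.classify("great phone"))
print(classifier.classify("poor screen"))
```

Once trained on the tagged sentences, the same `classify` call can be run over every untagged sentence in turn.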
Another way of finding the edges of the graph is to take the topic similarity between 2 blogs. This is a good idea, but the method is still being researched and is very difficult to get right.
One tool, a “blog epidemic analyzer”, predicts whether 2 blogs should be linked (BlogPulse uses this). It looks for “infection” (how the information is propagated), and the aim is to find the blog responsible for the epidemic. These blogs belong to the authority bloggers, the influential ones in the blogosphere. It’s good news when you find these bloggers because you can use them for word-of-mouth marketing, as it were. They provide valuable information that companies may be interested in. A company might even employ the blogger, for example, because s/he gives people brilliant information about its products.
Another method to infer this has been to predict the odds of a page being copied or read, and also to look at topic stickiness. The most influential node is chosen with each iteration. It apparently outperforms both PageRank and HITS for this task.
Splogs (spam blogs) are the equivalent of link spam in search engines. On the web, spam-detection algorithms use variables such as keyword frequency, tokenized URLs, word length, anchor text and more. PageRank computes a score that is used to identify spam. Unsurprisingly, this doesn’t work on blogs, because they are too dynamic for spam filters to be effective. This issue hasn’t been resolved as yet, although there is research in this area and things are improving.
Link analysis is also used to find patterns. The text around the links is used, and hubs and authorities are found based on those links. You could use comments as links between blogs. An influence score could be determined by taking into consideration inbound links, comments, length of posts, and outbound links.
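As a rough idea of what such an influence score could look like, here is a sketch that combines the four signals in a weighted sum. The `Post` fields, the weights, and the decision to count outbound links against a post are all assumptions of mine for illustration. The paper's actual formulation may differ.

```python
from dataclasses import dataclass


@dataclass
class Post:
    title: str
    inbound_links: int
    comments: int
    length_words: int
    outbound_links: int


def influence_score(post, w_in=1.0, w_comments=0.5, w_length=0.001, w_out=0.25):
    """Weighted sum of the four signals; all weights are hypothetical.

    Outbound links are subtracted here on the assumption that a post which
    mostly points elsewhere is passing influence on rather than creating it.
    """
    return (
        w_in * post.inbound_links
        + w_comments * post.comments
        + w_length * post.length_words
        - w_out * post.outbound_links
    )


# Hypothetical posts.
posts = [
    Post("Long original essay", inbound_links=10, comments=4, length_words=1000, outbound_links=4),
    Post("Quick link roundup", inbound_links=1, comments=0, length_words=200, outbound_links=8),
]

for post in sorted(posts, key=influence_score, reverse=True):
    print(f"{post.title}: {influence_score(post):.2f}")
```

In practice the weights would need tuning against bloggers you already know to be influential.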
This is a fun and really interesting area of research, so keep an eye on new things emerging from this research community.