“Remembering what we like: Toward an agent-based model of Web traffic” by Bruno Gonçalves & Mark R. Meiss (Indiana University) and José J. Ramasco (CNLL, ISI Foundation, Turin) is a paper due to be presented at WSDM 2009.
The PageRank model is “Markovian”, meaning it is a rule-based model that evolves step by step: the transition from the current state to the next is governed by a probability distribution, and future states are independent of past ones. The “random surfer” model means the process moves from the current page to any of its neighbours with equal probability.
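To make the Markov property concrete, here is a minimal Python sketch (my own illustration, not from the paper) of a random surfer: at each step the next page depends only on the current page, chosen uniformly among its out-links, with an occasional uniform jump to a random page. The tiny graph and the 0.15 jump probability are illustrative assumptions.

```python
import random

# Toy web graph: each page maps to its out-links (illustrative only).
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def random_surfer(graph, steps=100_000, teleport=0.15):
    """Simulate a memoryless random surfer and return visit frequencies."""
    counts = {page: 0 for page in graph}
    page = random.choice(list(graph))
    for _ in range(steps):
        counts[page] += 1
        if random.random() < teleport or not graph[page]:
            page = random.choice(list(graph))   # jump anywhere, uniformly
        else:
            page = random.choice(graph[page])   # follow a random out-link
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}

print(random_surfer(graph))  # visit frequencies approximate PageRank scores
```

Note that nothing in the loop depends on where the surfer has been before; that memorylessness is exactly what the paper argues real users violate.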
The motivation behind this work is that the PageRank model is a poor description of how people actually navigate the web. It has been extremely useful as an aggregate model of how people interact with the web, but to use this kind of data effectively it needs to be more focused on individual users. The “random walk” assumption does not hold for everyone; not all neighbours are equally likely to be visited, which means the PageRank algorithm isn’t effective at predicting web traffic. The authors look at this particular flaw and work to also include individual traffic patterns.
“Using the empirical traffic patterns generated by a thousand users over the course of two months, we characterize the properties of Web traffic that cannot be reproduced by Markovian models, in which destinations are independent of past decisions. In particular, we show that the diversity of sites visited by individual users is smaller and more broadly distributed than predicted by the PageRank model; that link traffic is more broadly distributed than predicted; and that the time between consecutive visits to the same site by a user is less broadly distributed than predicted.”
They propose to use an “agent-based” model which “carries state information and can account for both individual and aggregate traffic patterns observed in real-world data”. An agent-based model simulates the actions of different individuals: each agent makes up its own mind about how to deal with situations and makes decisions based on a set of given rules, and the behaviour of these agents over a whole network is analysed so that future actions of this kind can be predicted. The model consists of a number of these agents and the relationships between them. This way we can observe even very complex behavioural patterns.
They collected individual web traffic data and found that hyperlink traffic is more broadly distributed than PageRank makes out; that the empirical diversity of sites users visited was both smaller and more broadly distributed than PageRank makes out; and that the distribution of times between consecutive visits to the same site is narrower than expected.
Their model is called BookRank. Each agent maintains a list of bookmarks, along with the number of previous visits to each, and these bookmarks are used as teleportation targets. The agents navigate the web according to a number of rules: for example, a site is added to the bookmark list when it has not been previously visited, and the bookmarks are ranked from most to least visited. They use 1000 agents, so the pattern analysis was rather complex.
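As a rough illustration (my own simplified sketch, not the authors’ exact parameterization), a BookRank-style agent might look like the following: it keeps a bookmark list with visit counts and, at each step, either follows a random out-link or teleports to a bookmark, preferring frequently visited ones. The toy graph, the 0.3 teleport probability, and the visit-count weighting are all assumptions made for illustration.

```python
import random

# Toy web graph for the simulation (illustrative only).
graph = {
    "home": ["news", "mail"],
    "news": ["sports", "home"],
    "sports": ["news"],
    "mail": ["home"],
}

class Agent:
    """A BookRank-style agent with memory of previously visited sites."""

    def __init__(self, start):
        self.current = start
        self.visits = {start: 1}   # bookmark list with visit counts

    def step(self, p_teleport=0.3):
        if random.random() < p_teleport:
            # Teleport to a bookmark, weighted by how often it was visited
            # (an assumed weighting scheme, not the paper's exact rule).
            pages = list(self.visits)
            weights = [self.visits[p] for p in pages]
            self.current = random.choices(pages, weights=weights)[0]
        else:
            # Otherwise follow a random out-link from the current page.
            self.current = random.choice(graph[self.current])
        # New sites join the bookmark list; revisits raise their rank.
        self.visits[self.current] = self.visits.get(self.current, 0) + 1

agent = Agent("home")
for _ in range(10_000):
    agent.step()
print(sorted(agent.visits.items(), key=lambda kv: -kv[1]))
```

The key difference from the random surfer above is the `visits` dictionary: where the agent goes next depends on its own history, which is what lets a population of such agents reproduce both individual and aggregate traffic patterns.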
The BookRank model did better than PageRank at approximating link traffic values, for example, but as the authors say, it still suffers from limitations: the way it models the probability of ending a session (and hence session lengths), the fact that it doesn’t take into account that users often have many tabs open (like I do), and the whole host of unpredictable behaviours users have. They are going to look at topic-dependent nodes as well, so that a user only has a limited set of particular nodes to jump to. They also rightly mention that not all pages are equally likely to be the end node of a session.
The best way forward, I think, is more personalised results, so that each of us has a set of nodes we are likely to transition to, rather than taking the entire web into the equation. There are a huge number of sites that I am never going to visit, so these sites should have a much lower probability (even 0) of appearing. I am unlikely to visit a site about “Salmon Breeding” or “pygmy hippos” - wait, I just did. Still, given my searching habits over the last 5 years, those sites are very unlikely to show up in my web graph. The useful thing about narrowing down the probabilities is that I can get higher precision, but recall may suffer unless it’s a flexible model.
Why should you care?
It’s just another interesting ranking algorithm, and it shows you that there is still quite a lot of research going into this area; in fact, it is more important than ever to be able to access information properly. It gives a different perspective on the problem than the PageRank model does.