I’m excited when I come across news like this because it’s my area of research. MIT covered this and presented TextRunner, a system capable of extracting meaning from billions of documents. It is not new, but it is working a lot better now, which is why it’s exciting. It’s actually a University of Washington project, part of the Turing Center’s KnowItAll project.
“TextRunner searches hundreds of millions of assertions extracted from 500 million high-quality Web pages.” – It is capable of picking out dependencies between words and analyzing basic relationships between them.
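To make "assertions" a bit more concrete, here is a minimal sketch of what pulling a (first argument, relation, second argument) triple out of a sentence could look like. This is not how TextRunner works internally: the regex, the hand-picked verb list and the function name extract_assertion are all my own assumptions for illustration, and the whole point of the real system is that it learns relations without such a hand-written list.

```python
import re

# Hypothetical, hand-picked relation verbs. A real open extractor would not
# need this list; it is only here to keep the toy example small.
RELATION_VERBS = ("built", "invented", "discovered", "wrote")

# Toy pattern: "<arg1> <relation verb> <arg2>", with an optional final period.
PATTERN = re.compile(
    r"^(?P<arg1>.+?)\s+(?P<rel>" + "|".join(RELATION_VERBS) + r")\s+(?P<arg2>.+?)\.?$"
)


def extract_assertion(sentence):
    """Return (arg1, relation, arg2), or None if the toy pattern does not match."""
    match = PATTERN.match(sentence.strip())
    if match is None:
        return None
    return (match.group("arg1"), match.group("rel"), match.group("arg2"))


if __name__ == "__main__":
    for text in ("The Egyptians built the pyramids.", "Hello world"):
        print(text, "->", extract_assertion(text))
```

Each triple is one assertion; collect enough of them from enough pages and you have something you can search by relation rather than by keyword.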
“The significance of TextRunner is that it is scalable because it is unsupervised,” says Peter Norvig, director of research at Google, which donated the database of Web pages that TextRunner analyzes. “It can discover and learn millions of relations, not just one at a time. With TextRunner, there is no human in the loop: it just finds relations on its own.”
Norvig did remind us all that Google have also been working on this sort of thing for quite a while now. The hard part is the “unsupervised” part. Unsupervised machine learning algorithms are given no labelled examples; they are left to learn from their own experience and then deal with new experiences accordingly. The system parameters are altered according to the input and the pre-specified internal rules that the system is given. This means that you can launch such a system on a dataset of any size without someone having to label it by hand first.
This does change how search engines function quite significantly, and it has been in the pipeline for a long time. Finally we appear to be getting positive results.
If you look at the actual demo online, you’ll see that you have to enter either a question (3 words) or some arguments. You’ll notice that the question is in natural language as well. If you look at the “Who built the pyramids?” example, you will notice that you get results in a long list of connected elements:
For example, if I grab a few:
There is a different layout than the usual search engine would offer you. The list of 10 most relevant cannot work for much longer. This is because it’s not how language works: you don’t have 10 most relevant documents for a query when someone types in a word or three. You need to disambiguate first, which is what this does.
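As a rough sketch of why grouped answers differ from a ranked list of ten documents, here is a toy lookup over a handful of assertions. The data is made up by me for illustration (it is not TextRunner output), and the names ASSERTIONS and answer are hypothetical. The idea is only that a question like "Who built the pyramids?" becomes a relation plus one known argument, and the distinct candidates for the missing argument come back with a count of how often each was asserted.

```python
from collections import Counter

# Made-up example assertions as (arg1, relation, arg2) triples.
# Assumption: these stand in for what an extractor might have collected.
ASSERTIONS = [
    ("the Egyptians", "built", "the pyramids"),
    ("the Maya", "built", "the pyramids"),
    ("the Egyptians", "built", "the pyramids"),
    ("the Romans", "built", "the aqueducts"),
]


def answer(relation, arg2, assertions):
    """Group matching assertions by their first argument, most frequent first."""
    counts = Counter(
        first for first, rel, second in assertions
        if rel == relation and second == arg2
    )
    return counts.most_common()


if __name__ == "__main__":
    # "Who built the pyramids?" -> relation "built", known argument "the pyramids"
    for candidate, support in answer("built", "the pyramids", ASSERTIONS):
        print(candidate, support)
```

Two different answers surface side by side (which pyramids did you mean?), and that is the disambiguation step a flat list of ten links skips.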
No, you are not going to be using this instead of Google or whatever else you use right now, but it is a nice taste of what is to come, and it is coming sooner than you may think. How to show up in those results? Well, that’s a whole new experimental post, isn’t it?