Text mine: analyse your content

You might (rightly) ask why you should analyse your content and what you can gain from it.  Many SEO specialists have used methods such as “keyword density”, “readability”, “word frequency”, and in some cases even LSA (latent semantic analysis) to find similarities between webpages.

These methods are not wrong, but used in isolation they are not very effective.  Keyword density is a valid method, but only if you have a huge amount of data; otherwise it will tell you what is basically visible to the naked eye! Used on its own, something like keyword density will also only show you one side of the story.
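To make that concrete, here is a minimal sketch of word frequency and keyword density in Python, using only the standard library. The function names (`tokenize`, `keyword_density`), the sample text and the very simple tokenising rule are my own assumptions for illustration, not anything a particular SEO tool does.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens.

    Assumption: a 'word' is any run of ASCII letters, digits or apostrophes.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def keyword_density(text, top_n=5):
    """Return (word, count, share of all tokens) for the top_n most frequent words."""
    tokens = tokenize(text)
    total = len(tokens)
    if total == 0:
        return []
    counts = Counter(tokens)
    return [(word, count, count / total) for word, count in counts.most_common(top_n)]


# Hypothetical sample text, far too short to tell you anything useful.
sample = "Text mining finds patterns in text. Mining text takes time."
for word, count, density in keyword_density(sample, top_n=3):
    print(f"{word}: {count} ({density:.1%})")
```

On a snippet this small the top words are exactly the ones you would spot by eye, which is the point made above: the numbers only start to earn their keep on a large body of text.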

I would propose that all these tools be used on long articles and large blog posts, not on the small amounts of text often found on websites.  The reason is that such a small amount of data won’t yield anything very interesting.  On large articles or a series of blog posts (as in part 1, 2, 3, 4…) it can be difficult to gauge exactly what’s going on, and this is the time to roll out the tools from the broom cupboard.  Use them only when you really need to, because otherwise you’re wasting time gathering data that won’t tell you very much.

(I have been guilty of this too, collecting vast amounts of data that were actually useless and took ages to compile.  I carried out a full analysis of named entities and where they occurred in the text, and extracted recurring patterns.  After all that I wondered why I’d bothered!)

OK, so here are a few different things you can look at.  It’s usually good to treat them as toys until you see for yourself exactly what you can use them for.  Some of these take a bit of time to get used to, but I never said it was going to be a walk in the park.  The idea is that you can now see there is far more data available to you than you perhaps thought.

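Several of these are open source or free to use, but not all of them (WordSmith Tools and Collocate are sold commercially), so check each tool's licence: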

 - WordSmith Tools – This is concordance software.  A “concordance” is an index of all the words in a text, each listed with the context it appears in. It does other things too.

  - Collocate - this one finds collocations in text.  “Collocations” are words that commonly occur next to or near each other. This lets you see not only which words are often found together but also the relationships between them.

  - GATE - this is a full text-mining IDE (integrated development environment), and it allows you to identify and manipulate patterns and relationships in text very easily.  It is used by huge companies such as GlaxoSmithKline.  In addition to all that, it also does information extraction for you.

 - SharpNLP - this is more for those of you who would like to tag up text for further analysis: parse it, tokenize it and extract named entities.  It also deals with co-reference and plugs into WordNet, a machine-readable dictionary that can churn out a load of synonyms for you, amongst other things.

- WordNet - this is already known to the SEO community and, as far as I can tell, has been used to find synonyms.  WordNet does however have a load of extensions too.
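If you want a feel for what a concordance and a collocation list actually look like before installing anything, here is a rough Python sketch using only the standard library. It is a toy under stated assumptions, not how WordSmith Tools or Collocate work internally: the function names and sample text are made up, the concordance is a plain keyword-in-context listing, and the "collocations" are just adjacent word pairs ranked by raw count, whereas dedicated tools typically use statistical association measures.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (assumed rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def concordance(tokens, keyword, window=3):
    """List every occurrence of keyword with up to `window` tokens of context on each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


def collocations(tokens, min_count=2):
    """Count adjacent word pairs and keep those seen at least min_count times."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return [(pair, n) for pair, n in pairs.most_common() if n >= min_count]


# Hypothetical sample text for illustration only.
sample = ("Search engines index web pages. Good web pages earn links, "
          "and search engines follow those links.")
tokens = tokenize(sample)

# Keyword-in-context view of the word "pages".
for left, word, right in concordance(tokens, "pages", window=2):
    print(f"{left:>20} [{word}] {right}")

# Word pairs that occur together at least twice.
for pair, count in collocations(tokens):
    print(" ".join(pair), count)
```

Note that this sketch counts pairs across sentence boundaries and ignores very common words like "the", both of which a real tool would let you control.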

 Why should you bother with them?

Well, if you’re perfectly happy with the data you currently collect, then don’t; that’s fine too!  If you want to explore what else is available to you and have a wee ponder on how you could use it, then enjoy.  Try not to waste too much time using these tools for useless things: stay focused, this is not a sweet shop and you are not a child!

2 Comments

  1. Very interesting! But in my current overwhelmed state, I’m not real sure where to start with those tools. I’d love to see a post from you on each? ;-)

  2. CJ: Hello Glenn, I shall see what I can do for you!



© 2009-2013 Science for SEO All Rights Reserved
