Stemmer: the Ruby version

[Image: UEA logo]

For those of you interested in NLP, you might know about the UEA-Lite stemmer that I made back in 2005. Jason Adams from the “Mendicant Bug” has made a Ruby port, which I’m really pleased about. Now we have my original Perl version, a Java version and a Ruby version :)

Thanks Jason for doing that.

Some background:

“Similar to other stemmers, UEA-Lite operates on a set of rules which are used as steps. There are two groups of rules: the first to clean the tokens, and the second to alter suffixes.

The first group of rules first avoids a small list of six frequent problem words. An improvement to the stemmer would be to expand this list by adding other problem words which the second rule set cannot deal with. Second, possessive apostrophes are removed and contractions are expanded. All hyphens are removed and tokens containing digits are left untouched. Strings which are all upper case and digits are left untouched unless there is a lower case terminal ‘s’ (i.e. transforming plural forms of acronyms to singular forms).

Proper nouns should not usually be stemmed, except to remove possessives; our implementation will respect PoS tags if they are present. If the text is untagged the stemmer uses the simple heuristic that any capitalized token not preceded by sentence breaking punctuation is a proper noun.

Many texts, particularly scientific papers, contain sequences of digits, single letters, and other non-word tokens. Our implementation ignores tokens containing digits, single-letter tokens, and tokens with embedded punctuation.

The second group of rules contains 139 suffix rules, each testing for a specific type of suffix. The rules are set in a particular order so that the longest suffix applicable is used rather than a shorter one which could lead to nonsense words and more words not stemmed entirely to their root form.”
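To make the two rule groups a bit more concrete, here is a minimal Python sketch of the idea. It is not the real UEA-Lite code: the problem-word list and the five suffix rules are illustrative stand-ins chosen for this example (the actual stemmer has 139 suffix rules), and the names `PROBLEM_WORDS`, `SUFFIX_RULES` and `stem` are hypothetical. Contraction expansion and the proper-noun heuristic are left out to keep it short.

```python
import re

# Assumed, illustrative list of frequent problem words to skip entirely.
PROBLEM_WORDS = {"is", "as", "this", "has", "was", "during"}

# Assumed, illustrative suffix rules as (suffix, replacement) pairs.
# Order matters: longer suffixes come first so they win over shorter ones.
SUFFIX_RULES = [
    ("nesses", ""),
    ("ness", ""),
    ("ies", "y"),
    ("ing", ""),
    ("s", ""),
]


def stem(token):
    """Apply the cleaning rules, then the first matching suffix rule."""
    # Group 1: cleaning rules.
    if token.lower() in PROBLEM_WORDS or len(token) < 2:
        return token  # problem words and single letters are left alone
    token = re.sub(r"['’]s?$", "", token)  # drop a possessive ending
    if re.fullmatch(r"[A-Z0-9]+s", token):
        return token[:-1]  # plural acronym to singular
    if re.fullmatch(r"[A-Z0-9]+", token) or any(ch.isdigit() for ch in token):
        return token  # acronyms and tokens with digits stay untouched
    token = token.replace("-", "")  # remove hyphens

    # Group 2: suffix rules, tried in order; the first match is applied.
    for suffix, replacement in SUFFIX_RULES:
        # Require a little stem to be left so very short words survive.
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            return token[: -len(suffix)] + replacement
    return token


if __name__ == "__main__":
    for word in ["kindnesses", "ponies", "PCs", "NATO", "dog's", "this", "3rd"]:
        print(word, "->", stem(word))
```

The rule ordering is the point of the sketch: if the bare "s" rule were tried before "nesses", "kindnesses" would come out as the nonsense word "kindnesse" rather than "kind".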

If you do use it and run some tests, let me know what you come up with. I’d like to know what you’re up to and what you’re thinking. Jason has the idea of turning all of the rules into finite state transducers, for example. Great idea!

(BTW, I played ball and used the new University logo but let it be known that I preferred the retro one we used to have! Having been at the University for 10 years I feel that I should have been consulted about this matter. It was a terrible shock :s)

© 2009-2013 Science for SEO. All Rights Reserved.
