Friday, October 8, 2010

JavaScript Stemmer for French Language

About a month ago, I wrote a JavaScript port of the Porter French stemming algorithm from Snowball. The algorithm is clearly specified, so it took just a day of work :) I did this port as a requirement of the Google Summer of Code DocBook WebHelp project, which I worked on over the last few months.

If you are not familiar with what a stemmer is, here's a brief introduction :) A stemmer reduces a given word to its root form (its stem). Stemmers are very useful for search engines: users can enter a search query in any inflected form and still see content for the root word, which is probably what they meant. (Google does this ;) ) The following example shows what a stem is:
Playing     =
Played      ====> Play
Plays       =
Play        =
-----  
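To make the idea concrete, here is a deliberately naive sketch that reproduces the mapping above by stripping a few English suffixes. The function name toyStem and its suffix list are my own illustrative assumptions; this is not the Porter/Snowball algorithm, which uses far more careful rules.

```javascript
// Toy illustration only: NOT the Snowball/Porter algorithm.
// toyStem and its suffix list are hypothetical, chosen to match the example above.
function toyStem(word) {
  var w = word.toLowerCase();
  var suffixes = ["ing", "ed", "s"];
  for (var i = 0; i < suffixes.length; i++) {
    var s = suffixes[i];
    // Keep at least three letters so that short words are left alone.
    if (w.length - s.length >= 3 && w.slice(-s.length) === s) {
      return w.slice(0, w.length - s.length);
    }
  }
  return w;
}

// All four forms from the example reduce to the same stem.
var forms = ["Playing", "Played", "Plays", "Play"];
for (var i = 0; i < forms.length; i++) {
  console.log(forms[i] + " -> " + toyStem(forms[i]));
}
```

A rule set this simple breaks quickly on real vocabulary, which is why the real algorithms are considerably longer.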

Because human languages are very complex, it is really difficult to devise an algorithm that extracts the exact root. Therefore, for some words the extracted stem may not be the exact root but a slightly different form. For computational purposes and use in applications, that is sufficient :) This issue is not specific to French; it is common to stemmers for all languages.

This is the stemmer for French. It has now been added to the Snowball site of Porter, who wrote the algorithms and maintains them along with other contributors. Download the stemmer from:
http://snowball.tartarus.org/otherlangs/index.html
The Stemmer: http://snowball.tartarus.org/otherlangs/french_javascript.txt
To invoke the stemmer, call the stemmer function with the word as a string:
stemmer(word);
For example: var stem = stemmer("foobar");
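As a rough sketch of how this could be used for search, the helpers below stem both the query and the document text before comparing them. The names stemWords and matchesQuery are hypothetical (they are not part of the downloaded file), and the stem function is passed in as a parameter so the snippet does not assume anything about the stemmer beyond the stemmer(word) call shown above.

```javascript
// Hypothetical helper: lowercases a string, splits it on whitespace,
// and applies the given stem function to every word.
function stemWords(text, stemFn) {
  return text
    .toLowerCase()
    .split(/\s+/)
    .filter(function (w) { return w.length > 0; })
    .map(function (w) { return stemFn(w); });
}

// Hypothetical helper: true when every stemmed query word also appears
// among the stemmed words of the document text.
function matchesQuery(query, documentText, stemFn) {
  var docStems = stemWords(documentText, stemFn);
  return stemWords(query, stemFn).every(function (q) {
    return docStems.indexOf(q) !== -1;
  });
}

// Usage, assuming french_javascript.txt has been loaded and defines stemmer(word):
// var found = matchesQuery("chanter", "nous chantons ensemble", stemmer);
```

Splitting on whitespace is a simplification; real text would also need punctuation handling before stemming.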
I ran the provided test cases to verify the accuracy of the implementation. It correctly stemmed nearly 19,500 of the 21,000 words, an accuracy of more than 90%.
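A check like this can be sketched as follows, assuming the test cases are available as pairs of an input word and its expected stem. The measureAccuracy name and the pair format are my assumptions for illustration, not the exact script I used. With the figures above, 19,500 / 21,000 works out to roughly 93%.

```javascript
// Hypothetical helper: counts how many [word, expectedStem] pairs the
// given stem function reproduces exactly.
function measureAccuracy(pairs, stemFn) {
  var correct = 0;
  for (var i = 0; i < pairs.length; i++) {
    if (stemFn(pairs[i][0]) === pairs[i][1]) {
      correct++;
    }
  }
  return {
    correct: correct,
    total: pairs.length,
    // Guard against an empty test set to avoid dividing by zero.
    ratio: pairs.length > 0 ? correct / pairs.length : 0
  };
}

// Usage, assuming stemmer(word) is loaded and testPairs holds the test data:
// var result = measureAccuracy(testPairs, stemmer);
// console.log(result.correct + " / " + result.total);
```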

An English stemmer is already available on the Porter site. You can view all the existing stemmer implementations at the Snowball site. Many Java and C++ implementations are available, but a JavaScript port was lacking, which was my main reason for writing this one. Hope you all find this useful! I should write a French version of this post too... :)