
tamilspellchecker team mailing list archive

Re: code updated to launchpad and mit demo

 

2009/2/2 S.Selvam Siva <s.selvamsiva@xxxxxxxxx>

>
> What we need is one Python file calling tamilpsell.py with some Tamil text
> as an argument. Though I have little knowledge of the OpenOffice plugin
> mechanism, adding it to OpenOffice will require an OpenOffice-specific
> module (PyUNO, I guess). Our first aim should be to develop a powerful
> spell checking engine, so our plugin may not depend on hunspell.
>

You are the technical person, so you have to figure that out. As far as I
know, OOo uses hunspell, so the files may have to be converted to hunspell
format if they are to be used there. Mozilla, too, has hunspell as the
preferred spellchecker.

> As of now, I just maintain a list of Tamil words (one per line) and compare
> against it to find misspelled words (this is the starting point of our
> project).
>

That is what the existing Tamil checkers do, though I think you have a more
extensive list of words, which is good.
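
For reference, the word-list comparison described above can be sketched in a
few lines of Python. This is only a minimal illustration, assuming a plain
list with one word per line and whitespace-separated input text. The function
names load_words and find_misspelled are made up for this example and are not
taken from the project code.

```python
def load_words(lines):
    """Build a set of known words from an iterable of lines (one word per line)."""
    return {line.strip() for line in lines if line.strip()}


def find_misspelled(text, known_words):
    """Return the whitespace-separated tokens of text that are not in the word list."""
    return [token for token in text.split() if token not in known_words]


# Hypothetical usage: in practice the lines would come from the word-list file.
known = load_words(["கொடு\n", "படி\n", "எடு\n"])
unknown = find_misspelled("கொடு படு படி", known)
```

Using a set keeps each lookup fast even when the list grows large, but every
inflected form still has to be present in the list for this to work.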


> Affix rules seem to be a critical part of Tamil spell checking, about which
> I have not got any clue so far, except that AU-KBC has developed
> morphological analysis and released a software (Acharam.exe, written in
> Java). We will be really happy if you can help us on affix rules.
>

Well, without the affix file, you'd have a really huge word list with
probably about a million or so possible words. For example, take the root
verb 'kodu' (கொடு). Now, we have to identify all the ways it could be
modified by a suffix. For example:

கொடு-க்கிறேன்
கொடு-க்கின்றேன்
கொடு-த்தேன்
கொடு-ப்பேன்
கொடு-க்கின்றேனா
கொடு-த்தேனா
கொடு-ப்பேனா
கொடு-க்கிறான்
கொடு-க்கின்றான்
கொடு-த்தான்
கொடு-ப்பான்
கொடு-க்கின்றானா
கொடு-த்தானா
கொடு-ப்பானா
etc...

Other root verbs that may function like kodu would be...
படு
எடு
நடி
படி
etc...

So, we may classify the above words as, say, Class A, and link them to all
the rules that would be applicable to them. In that way, we could prevent a
lot of unnecessary repetition.
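
To make the idea concrete, here is a rough Python sketch of the class-based
approach: each root is tagged with a class, and each class is linked to one
shared list of suffixes. The class name "A", the table names, and the
function expand are assumptions for illustration only. This is not hunspell's
affix file format, and the suffix list covers only the forms listed in this
mail.

```python
# Suffixes shared by every root in the (hypothetical) class "A".
SUFFIX_CLASSES = {
    "A": [
        "க்கிறேன்", "க்கின்றேன்", "த்தேன்", "ப்பேன்",
        "க்கின்றேனா", "த்தேனா", "ப்பேனா",
        "க்கிறான்", "க்கின்றான்", "த்தான்", "ப்பான்",
        "க்கின்றானா", "த்தானா", "ப்பானா",
    ],
}

# Each root is stored once, together with the class whose rules apply to it.
ROOT_CLASSES = {
    "கொடு": "A",
    "படு": "A",
    "எடு": "A",
    "நடி": "A",
    "படி": "A",
}


def expand(root):
    """Return every inflected form of root allowed by its class."""
    suffixes = SUFFIX_CLASSES[ROOT_CLASSES[root]]
    return [root + suffix for suffix in suffixes]


def all_forms():
    """Return the set of all inflected forms for all known roots."""
    forms = set()
    for root in ROOT_CLASSES:
        forms.update(expand(root))
    return forms
```

With this layout, adding a new verb of the same kind means adding one line to
the root table rather than one line per inflected form.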

Do you have a sample affix file from another language? Say, Spanish or
French?

Regards,
Elan
