
tamilspellchecker team mailing list archive

Suggestion generation - performance issue

 

Hi all,

I am pasting below my conversation with people on the python mailing list for
your reference.
-----------------------------------------------------------------------------------------
Hi all,

I need some help.
I am trying to find the top n (e.g. 5) most similar words for a given word, from
a dictionary of 50,000 words.
I used the python-Levenshtein module, and sample code is as follows.

import Levenshtein

def foo(searchword, dictionary_words):
    # score every dictionary word against the search word (0.0 to 1.0)
    disdict = {}
    for word in dictionary_words:
        disdict[word] = Levenshtein.ratio(searchword, word)
    # sort the words by score in descending order and keep the top 5
    similarwords = sorted(disdict, key=disdict.__getitem__, reverse=True)
    return similarwords[:5]

foo() takes a search word, compares it against the 50,000-word dictionary, and
assigns each word a score (between 0 and 1).
Then, after sorting in descending order, it returns the top 5 similar words.

The problem is that it takes a long time to process (and I need to pass many
search words in a loop), so I guess the code could be improved to work more
efficiently. Your suggestions are welcome...
-- 
Yours,
S.Selvam

------------------------------------------------------------

You may replace the last steps (sort + slice top 5) with heapq.nlargest - at
least you won't waste time sorting 49,995 irrelevant words...
Anyway, you should measure the time taken by the first part (Levenshtein); it
may be the most demanding. I think there is a C extension for this, which
should be much faster than pure Python calculations.

-- 
Gabriel Genellina

--------------------------------------------------------------
There is also no need to build the 50,000-entry word-distance dictionary.

import heapq, functools
import Levenshtein

def foo(searchword, words, n):
    # partial() fixes the search word, so each call scores one dictionary word
    distance = functools.partial(Levenshtein.ratio, searchword)
    return heapq.nlargest(n, words, key=distance)

If the distances are wanted along with the similar words, I strongly suspect
that it would be faster to recalculate a small number than to generate the
dict of 50000 pairs.
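
A rough sketch of that idea (the function name foo_with_scores is only
illustrative):

import heapq, functools
import Levenshtein

def foo_with_scores(searchword, words, n):
    # score once to pick the winners, then recompute just those n ratios
    distance = functools.partial(Levenshtein.ratio, searchword)
    best = heapq.nlargest(n, words, key=distance)
    return [(word, distance(word)) for word in best]

Recomputing n ratios is cheap next to building and storing 50,000 pairs.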


> Anyway, you should measure the time taken by the first part (Levenshtein);
> it may be the most demanding. I think there is a C extension for this,
> which should be much faster than pure Python calculations.


And such could be dropped into the code above.

Terry Jan Reedy
------------------------------------------------------------------

subdist: http://pypi.python.org/pypi/subdist/0.2.1
It uses a modified "fuzzy" version of the Levenshtein algorithm, which
I found more useful than the strict version. The only quirk to it is
that it accepts nothing but unicode. Other than that, it's a keeper.
It is extremely fast.

Cheers,
-Basilisk96
--------------------------------------------------------------------

You can use the distance instead of the ratio and put the words into bins of
the same length. Then, if you find enough words with a distance <= 1 in the
bin with the same length as the search word, you can stop looking.

You might be able to generalize this approach to other properties that are
fast to calculate and guarantee a minimum distance, e.g. set(word).
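
A minimal sketch of that length-binning idea (the helper names, the n=5
default and the distance threshold are only illustrative):

import collections
import Levenshtein

def build_bins(dictionary_words):
    # group the 50,000 words by length once, up front
    bins = collections.defaultdict(list)
    for word in dictionary_words:
        bins[len(word)].append(word)
    return bins

def close_words(searchword, bins, n=5, max_distance=1):
    # words whose length differs by more than max_distance cannot be within
    # max_distance edits, so only the nearby bins need to be checked
    found = []
    for length in sorted(bins, key=lambda l: abs(l - len(searchword))):
        if abs(length - len(searchword)) > max_distance:
            break
        for word in bins[length]:
            if Levenshtein.distance(searchword, word) <= max_distance:
                found.append(word)
                if len(found) >= n:
                    return found
    # the same early exit works for any cheap property that bounds the
    # distance from below, e.g. comparing set(word) with set(searchword)
    return found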

Peter
---------------------------------------------------------------------

-- 
Yours,
S.Selvam