The nltk.probability.FreqDist class is used in many classes throughout NLTK for storing and managing frequency distributions. It’s quite useful, but it’s all in-memory, and doesn’t provide a way to persist the data. A single FreqDist is also not accessible to multiple processes. All that can be changed by building a FreqDist on top of Redis.
What is Redis?
- Redis is a data structure server that is one of the more popular NoSQL databases.
- Among other things, it provides a network-accessible database for storing dictionaries (also known as hash maps).
- Building a FreqDist interface to a Redis hash map will allow us to create a persistent FreqDist that is accessible to multiple local and remote processes at the same time.
Installation:
- Install both Redis and redis-py. The Redis website is at http://redis.io/ and includes many documentation resources.
- To use hash maps, install the latest version, which at the time of this writing is 2.8.9.
- The Redis Python driver, redis-py, can be installed using pip install redis or easy_install redis. The latest version at this time is 2.9.1.
- The redis-py home page is at http://github.com/andymccurdy/redis-py/.
- Once both are installed and a redis-server process is running, you’re ready to go. Let’s assume redis-server is running on localhost on port 6379 (the default host and port).
How does it work?
- The FreqDist class extends the standard library collections.Counter class, which makes a FreqDist a small wrapper with a few extra methods, such as N().
- The N() method returns the number of sample outcomes, which is the sum of all the values in the frequency distribution.
- An API-compatible class is created on top of Redis by extending a RedisHashMap and then implementing the N() method.
- The RedisHashFreqDist class (defined in redisprob.py) sums all the values in the hash map for the N() method.
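To make the relationship between Counter and N() concrete, here is a minimal sketch using only the standard library. The class name TinyFreqDist is hypothetical and this is not NLTK's actual implementation; it only illustrates the idea that N() is the sum of all the counts.

```python
from collections import Counter


class TinyFreqDist(Counter):
    """Hypothetical stand-in for FreqDist: a Counter plus N()."""

    def N(self):
        # Total number of sample outcomes = sum of all the counts.
        return sum(self.values())


fd = TinyFreqDist()
for word in ["foo", "bar", "foo"]:
    fd[word] += 1  # Counter yields 0 for missing keys, so += works

print(fd["foo"])  # 2
print(fd["baz"])  # 0 (a sample that was never seen)
print(fd.N())     # 3
```

The Redis-backed class has to reproduce both behaviours by hand: a count of zero for unseen samples, and an N() that sums the stored values.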
Code: Explaining how it works

    from rediscollections import RedisHashMap


    class RedisHashFreqDist(RedisHashMap):
        def N(self):
            # Total number of sample outcomes: the sum of all stored counts.
            return int(sum(self.values()))

        def __missing__(self, key):
            # Samples that were never counted have a frequency of 0.
            return 0

        def __getitem__(self, key):
            # Values come back from Redis as strings (or nothing at all),
            # so fall back to 0 and convert to int.
            return int(RedisHashMap.__getitem__(self, key) or 0)

        def values(self):
            return [int(v) for v in RedisHashMap.values(self)]

        def items(self):
            return [(k, int(v)) for (k, v) in RedisHashMap.items(self)]

This class can be used just like a FreqDist. To instantiate it, pass a Redis connection and the name of our hash map. The name should be a unique reference to this particular FreqDist so that it doesn’t clash with any other keys in Redis.
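Code:

    from redis import Redis
    from redisprob import RedisHashFreqDist

    r = Redis()  # connects to localhost:6379 by default
    rhfd = RedisHashFreqDist(r, 'test')
    print(len(rhfd))  # assumes the 'test' hash does not exist yet
    rhfd['foo'] += 1
    print(rhfd['foo'])
    rhfd.items()  # returns the (sample, count) pairs; not printed here
    print(len(rhfd))
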
Output:

    0
    1
    1

Most of the work is done in the RedisHashMap class, which extends collections.abc.MutableMapping (available as collections.MutableMapping before Python 3.10) and then overrides all methods that require Redis-specific commands. The following outlines each method that uses a specific Redis command:
- __len__(): This uses the hlen command to get the number of elements in the hash map
- __contains__(): This uses the hexists command to check if an element exists in the hash map
- __getitem__(): This uses the hget command to get a value from the hash map
- __setitem__(): This uses the hset command to set a value in the hash map
- __delitem__(): This uses the hdel command to remove a value from the hash map
- keys(): This uses the hkeys command to get all the keys in the hash map
- values(): This uses the hvals command to get all the values in the hash map
- items(): This uses the hgetall command to get a dictionary containing all the keys and values in the hash map
- clear(): This uses the delete command to remove the entire hash map from Redis
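
The method-to-command mapping above can be sketched as follows. This is an illustration under stated assumptions, not the actual rediscollections.py: SketchRedisHashMap is a hypothetical name, and FakeRedis is a hypothetical in-memory stand-in (not part of redis-py) that mimics only the hash commands listed, so the example runs without a server. With a real redis-py client in its place, the same method names (hlen, hexists, hget, hset, hdel, hkeys, hvals, hgetall, delete) would be called.

```python
from collections.abc import MutableMapping


class FakeRedis:
    """Hypothetical in-memory stand-in for a Redis client (illustration only)."""

    def __init__(self):
        self._hashes = {}  # hash name -> {field: value stored as a string}

    def hlen(self, name):
        return len(self._hashes.get(name, {}))

    def hexists(self, name, key):
        return key in self._hashes.get(name, {})

    def hget(self, name, key):
        return self._hashes.get(name, {}).get(key)

    def hset(self, name, key, value):
        self._hashes.setdefault(name, {})[key] = str(value)

    def hdel(self, name, key):
        self._hashes.get(name, {}).pop(key, None)

    def hkeys(self, name):
        return list(self._hashes.get(name, {}))

    def hvals(self, name):
        return list(self._hashes.get(name, {}).values())

    def hgetall(self, name):
        return dict(self._hashes.get(name, {}))

    def delete(self, name):
        self._hashes.pop(name, None)


class SketchRedisHashMap(MutableMapping):
    """Dict-like view over a single Redis hash: one command per method."""

    def __init__(self, r, name):
        self._r = r        # client object exposing the hash commands
        self._name = name  # key under which the hash is stored

    def __len__(self):
        return self._r.hlen(self._name)

    def __contains__(self, key):
        return self._r.hexists(self._name, key)

    def __getitem__(self, key):
        return self._r.hget(self._name, key)  # None if the field is absent

    def __setitem__(self, key, value):
        self._r.hset(self._name, key, value)

    def __delitem__(self, key):
        self._r.hdel(self._name, key)

    def __iter__(self):
        return iter(self.keys())

    def keys(self):
        return self._r.hkeys(self._name)

    def values(self):
        return self._r.hvals(self._name)

    def items(self):
        return list(self._r.hgetall(self._name).items())

    def clear(self):
        self._r.delete(self._name)


m = SketchRedisHashMap(FakeRedis(), "test")
m["foo"] = 1
print(len(m), "foo" in m, repr(m["foo"]))  # the value comes back as a string
m.clear()
print(len(m))
```

Because the stand-in stores every value as a string, the sketch also shows why RedisHashFreqDist wraps its results in int(). A real redis-py client returns bytes unless it is created with decode_responses=True; int() accepts either form.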