A computer engineer poking at your cerebral cortex.

Python Dictonary Speed

Many times I get data from someone and they want the unique count of something in a bunch of text files. I want to test to see how fast python dictonaries are so I wrote some code. If the code runs under 5 min I'm pretty happy. When it starts exceeding that I look for ways to optimize. How far can dictionaries scale? I'm running this on Intel(R) Core(TM) i7-4750HQ, 16GB of RAM, Samsung 840 SSD, with python 2.7.6.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (c) 2014, Dan Sheffner http://digisoftinc.org/
# All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the
# "Software"), to deal in the Software without restriction, including
# without limitation the rights to use, copy, modify, merge, publish, dis-
# tribute, sublicense, and/or sell copies of the Software, and to permit
# persons to whom the Software is furnished to do so, subject to the fol-
# lowing conditions:
#
# The above copyright notice and this permission notice shall be included
# in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
# OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABIL-
# ITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT
# SHALL THE AUTHOR BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
# WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
# IN THE SOFTWARE.

import random                                                                   
import string                                                                   
import time                                                                     
                                                                                
                                                                                
def randomword(length):                                                         
    return ''.join(random.choice(string.lowercase) for i in range(length))      
                                                                                
    
def checkSpeed(howMany, variance):                                               
    ipDict = {}                                                                  
                                                                                 
    # set time                                                                   
    t0 = time.time()                                                             
    for x in range(0, howMany):                                                  
        randWord = randomword(variance)                                          
                                                                                 
        # randWord = randWord.ljust(4, '0')                                      
                                                                                 
        if randWord in ipDict:                                                   
            ipDict[randWord] += 1                                                
        else:                                                                    
            ipDict[randWord] = 0                                                 
                                                                                 
    # get time                                                                   
    t1 = time.time()                                                             
    total = t1-t0                                                                
                                                                                 
    print "Length of dictionary is: " + str(len(ipDict))                         
    print "Varience of dictionary is: " + \                                      
        str(float(len(ipDict)) / float(howMany))                                 
    print "Total time in seconds: " + str(total)                                 
                                                                                 
                                                                                 
checkSpeed(100, 8)
Lets run it:

python test.py 
Length of dictionary is: 100
Varience of dictionary is: 1.0
Total time in seconds: 0.000412940979004
So what is this doing? I'm looping through 0 to 100 and generating a random string of 8 characters. Then I take that 8 character random string and put it into a dictionary as the key. If it is already there I increment the value. I then print out the length of the dictionary, the varience of the dictionary, and the time it took to go through that many items. Your machine will vary on speed depending your hardware. So 100 is super fast. Lets trying 10,000.

python test.py 
Length of dictionary is: 10000
Varience of dictionary is: 1.0
Total time in seconds: 0.0422868728638
Still damn fast. Lets scale farther to 10,000,000:

python test.py 
Length of dictionary is: 9999781
Varience of dictionary is: 0.9999781
Total time in seconds: 42.8735089302
Interesting varience is no longer 1.0 which means we started seeing duplicates with 8 character random strings. Lets try 100 million:

python test.py 
Length of dictionary is: 99975959
Varience of dictionary is: 0.99975959
Total time in seconds: 450.787559986
So ok 100 million goes over the 5 min threshold. 7.5 minutes total. So now lets play with the varience. The reason I want to adjust varience is you are taking a giant list and finding unique values. Chances are your data has a vairience allot closer to .005 then .99

.99 basically means almost every value is unique. This is useless test if all your data is unique. I will pad the string to always have 8 characters but now decrease variance. At this point you should be thinking map reduce. I'm going to still test python and see what it can do. Python update:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (c) 2014, Dan Sheffner http://digisoftinc.org/
# All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the
# "Software"), to deal in the Software without restriction, including
# without limitation the rights to use, copy, modify, merge, publish, dis-
# tribute, sublicense, and/or sell copies of the Software, and to permit
# persons to whom the Software is furnished to do so, subject to the fol-
# lowing conditions:
#
# The above copyright notice and this permission notice shall be included
# in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
# OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABIL-
# ITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT
# SHALL THE AUTHOR BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
# WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
# IN THE SOFTWARE.

import random                                                                   
import string                                                                   
import time                                                                     
                                                                                
                                                                                
def randomword(length):                                                         
    return ''.join(random.choice(string.lowercase) for i in range(length))      
                                                                                
    
def checkSpeed(howMany, variance):                                               
    ipDict = {}                                                                  
                                                                                 
    # set time                                                                   
    t0 = time.time()                                                             
    for x in range(0, howMany):                                                  
        randWord = randomword(variance)                                          
                                                                                 
        randWord = randWord.ljust(4, '0')                                      
                                                                                 
        if randWord in ipDict:                                                   
            ipDict[randWord] += 1                                                
        else:                                                                    
            ipDict[randWord] = 0                                                 
                                                                                 
    # get time                                                                   
    t1 = time.time()                                                             
    total = t1-t0                                                                
                                                                                 
    print "Length of dictionary is: " + str(len(ipDict))                         
    print "Varience of dictionary is: " + \                                      
        str(float(len(ipDict)) / float(howMany))                                 
    print "Total time in seconds: " + str(total)                                 
                                                                                 
                                                                                 
checkSpeed(100, 4)
Lets test 100 million with 4 characters variance vs 8 to produce more duplicates.

python test.py 
Length of dictionary is: 456976
Varience of dictionary is: 0.00456976
Total time in seconds: 298.798802853
Right under our 5 min threshold for 100 million checks with allot better varience. So we can basically scale to 100 million and still be in 5 min threshold. Lets try to speed it up with pypy.

pypy -v
Python 2.7.3 (2.2.1+dfsg-1, Nov 28 2013, 05:13:10)
[PyPy 2.2.1 with GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
And now for something completely different: ``we have no anti-impossible stick
that makes sure that all your programs halt''
>>>>
Lets try some pypy with 100 million and 4 variance.

pypy test.py 
Length of dictionary is: 456976
Varience of dictionary is: 0.00456976
Total time in seconds: 138.739334106
Sweet we get almost a 50% speed up with pypy. Lets try 2x now for 200 million with pypy:

pypy test.py 
Length of dictionary is: 456976
Varience of dictionary is: 0.00228488
Total time in seconds: 287.588469028
Still under 5 min which is good. So I can pretty much support any data structure under 200 million records in 5 min. Lets do some math to see if we can get close to 5 min as possible.

You take (200 million * 300) / 287 = X

Which would be 209,059,233.45 rounded to two decimal points. Lets try it and see how close we are with the whole integer 209,059,233 entries.

pypy test.py 
Length of dictionary is: 456976
Varience of dictionary is: 0.00218586853803
Total time in seconds: 296.72205019
Wow pretty damn close to 300 seconds or 5 min. Lets now say we extend our 5 min threshold to 30 min. Lets do some more math and see what we can scale to:

It says we can scale to: 1,254,355,400 which is 1.25 billion records.

pypy test.py 
Length of dictionary is: 456976
Varience of dictionary is: 0.000364311422425
Total time in seconds: 1772.08403897
Wow pretty close to 1800 seconds or 30 min. Linear scaling :)

You never want a single sample size so lets run it again:

100 variance 8

Run 1:
python test.py 
Length of dictionary is: 100
Varience of dictionary is: 1.0
Total time in seconds: 0.000412940979004

Run 2:
python test.py
Length of dictionary is: 100
Varience of dictionary is: 1.0
Total time in seconds: 0.000426054000854
10,000 variance 8

Run1:
python test.py 
Length of dictionary is: 10000
Varience of dictionary is: 1.0
Total time in seconds: 0.0422868728638

Run 2:
python test.py
Length of dictionary is: 10000
Varience of dictionary is: 1.0
Total time in seconds: 0.0424108505249
10,000,000 variance 8

Run 1:
python test.py 
Length of dictionary is: 9999781
Varience of dictionary is: 0.9999781
Total time in seconds: 42.8735089302

Run 2:
python test.py 
Length of dictionary is: 9999755
Varience of dictionary is: 0.9999755
Total time in seconds: 42.176858902
100 million variance 8

Run 1:
python test.py 
Length of dictionary is: 99975959
Varience of dictionary is: 0.99975959
Total time in seconds: 450.787559986

Run 2:
python test.py 
Length of dictionary is: 99975915
Varience of dictionary is: 0.99975915
Total time in seconds: 481.742855072
Now I'm going to turn down variance.

100 million 4 variance

Run 1:
python test.py 
Length of dictionary is: 456976
Varience of dictionary is: 0.00456976
Total time in seconds: 298.798802853

Run 2:
python test.py
Length of dictionary is: 456976
Varience of dictionary is: 0.00456976
Total time in seconds: 289.573727846
Now we start using pypy

100 million 4 variance

Run 1:
pypy test.py 
Length of dictionary is: 456976
Varience of dictionary is: 0.00456976
Total time in seconds: 138.739334106

Run2 2:
pypy test.py
Length of dictionary is: 456976
Varience of dictionary is: 0.00456976
Total time in seconds: 143.796292067
200 million 4 variance with pypy

Run 1:
pypy test.py 
Length of dictionary is: 456976
Varience of dictionary is: 0.00228488
Total time in seconds: 287.588469028

Run 2:
pypy test.py 
Length of dictionary is: 456976
Varience of dictionary is: 0.00228488
Total time in seconds: 289.412295818
Close to 5 min as possible with calculation using 209,059,233 variance 4.

Run 1:
pypy test.py 
Length of dictionary is: 456976
Varience of dictionary is: 0.00218586853803
Total time in seconds: 296.72205019

Run 2:
pypy test.py 
Length of dictionary is: 456976
Varience of dictionary is: 0.00218586853803
Total time in seconds: 299.482395172
Now we want to test 30 min with 1.25 billion records variance 4

Run 1:
pypy test.py 
Length of dictionary is: 456976
Varience of dictionary is: 0.000364311422425
Total time in seconds: 1772.08403897

Run 2:
Length of dictionary is: 456976
Varience of dictionary is: 0.000364311422425
Total time in seconds: 1842.9967351
For the hell of I test python with the giant test. It just killed itself. Haha

python test.py
Killed
Chances are you don't have my exact machine consider it is a custom built laptop. Also you should be testing on production servers vs local laptops. Sometimes local is faster then production servers due to cost. Going to test on Ubuntu 14.04 64 bit, r3.xlarge EC2 instance, with 4 vCPUs, 30.5 RAM, and 80 GB SSD.

100 and 8 variance.
python test.py 
Length of dictionary is: 100
Varience of dictionary is: 1.0
Total time in seconds: 0.000587940216064
Length of dictionary is: 100
Varience of dictionary is: 1.0
Total time in seconds: 0.000612020492554
Length of dictionary is: 100
Varience of dictionary is: 1.0
Total time in seconds: 0.00058913230896
10,000 and 8 variance

python test.py 
Length of dictionary is: 10000
Varience of dictionary is: 1.0
Total time in seconds: 0.0571658611298
Length of dictionary is: 10000
Varience of dictionary is: 1.0
Total time in seconds: 0.0567030906677
Length of dictionary is: 10000
Varience of dictionary is: 1.0
Total time in seconds: 0.0561099052429
10,000,000 and 8 variance

python test.py 
Length of dictionary is: 9999734
Varience of dictionary is: 0.9999734
Total time in seconds: 58.6800808907
Length of dictionary is: 9999746
Varience of dictionary is: 0.9999746
Total time in seconds: 57.5085248947
Length of dictionary is: 9999772
Varience of dictionary is: 0.9999772
Total time in seconds: 57.4015569687
100,000,000 and 8 variance

python test.py 
Length of dictionary is: 99976136
Varience of dictionary is: 0.99976136
Total time in seconds: 586.001760006
Length of dictionary is: 99975995
Varience of dictionary is: 0.99975995
Total time in seconds: 584.615675926
Length of dictionary is: 99975999
Varience of dictionary is: 0.99975999
Total time in seconds: 585.366222858
100,000,000 and 4 variance

python test.py 
Length of dictionary is: 456976
Varience of dictionary is: 0.00456976
Total time in seconds: 421.710411072
Length of dictionary is: 456976
Varience of dictionary is: 0.00456976
Total time in seconds: 425.94264698
Length of dictionary is: 456976
Varience of dictionary is: 0.00456976
Total time in seconds: 425.917401791
100,000,000 and 4 variance with pypy

pypy test.py 
Length of dictionary is: 456976
Varience of dictionary is: 0.00456976
Total time in seconds: 167.102699995
Length of dictionary is: 456976
Varience of dictionary is: 0.00456976
Total time in seconds: 163.064245939
Length of dictionary is: 456976
Varience of dictionary is: 0.00456976
Total time in seconds: 163.33949995
200,000,000 and 4 variance with pypy

pypy test.py 
Length of dictionary is: 456976
Varience of dictionary is: 0.00228488
Total time in seconds: 325.355190039
Length of dictionary is: 456976
Varience of dictionary is: 0.00228488
Total time in seconds: 322.537681103
Length of dictionary is: 456976
Varience of dictionary is: 0.00228488
Total time in seconds: 321.350142002
The 5 min interval at 209,059,233 and variance 4.

pypy test.py 
Length of dictionary is: 456976
Varience of dictionary is: 0.00218586853803
Total time in seconds: 346.197941065
Length of dictionary is: 456976
Varience of dictionary is: 0.00218586853803
Total time in seconds: 338.920826912
Length of dictionary is: 456976
Varience of dictionary is: 0.00218586853803
Total time in seconds: 340.864919186
1.25 billion with 4 variance and pypy

pypy test.py 
Length of dictionary is: 456976
Varience of dictionary is: 0.000364311422425
Total time in seconds: 2011.19430304
Length of dictionary is: 456976
Varience of dictionary is: 0.000364311422425
Total time in seconds: 2010.20652199
Length of dictionary is: 456976
Varience of dictionary is: 0.000364311422425
Total time in seconds: 2009.70536113
What about jython with the 5 min test of 209,059,233 and variance 4. Um Ok....

java -jar jython.jar ~/test.py 
Traceback (most recent call last):
  File "/home/ubuntu/test2.py", line 35, in <module>
    checkSpeed(209059233, 4)
  File "/home/ubuntu/test2.py", line 15, in checkSpeed
    for x in range(0, howMany):
java.lang.OutOfMemoryError: Java heap space
	at java.lang.Object.clone(Native Method)
	at java.util.Arrays$ArrayList.toArray(Arrays.java:2851)
	at java.util.ArrayList.<init>(ArrayList.java:164)
	at org.python.core.PyList.<init>(PyList.java:52)
	at org.python.core.PyList.<init>(PyList.java:64)
	at org.python.core.__builtin__.range(__builtin__.java:886)
	at org.python.core.__builtin__.range(__builtin__.java:894)
	at org.python.core.BuiltinFunctions.__call__(__builtin__.java:136)
	at org.python.core.PyObject.__call__(PyObject.java:408)
	at org.python.pycode._pyx0.checkSpeed$3(/home/ubuntu/test2.py:32)
	at org.python.pycode._pyx0.call_function(/home/ubuntu/test2.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:165)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:149)
	at org.python.core.PyFunction.__call__(PyFunction.java:327)
	at org.python.pycode._pyx0.f$0(/home/ubuntu/test2.py:37)
	at org.python.pycode._pyx0.call_function(/home/ubuntu/test2.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:165)
	at org.python.core.PyCode.call(PyCode.java:18)
	at org.python.core.Py.runCode(Py.java:1275)
	at org.python.util.PythonInterpreter.execfile(PythonInterpreter.java:235)
	at org.python.util.jython.run(jython.java:247)
	at org.python.util.jython.main(jython.java:129)

java.lang.OutOfMemoryError: java.lang.OutOfMemoryError: Java heap space
I tried with 4GB and 20GB of ram to the JVM. It didn't help. Max I saw was about 12GB of RAM before the JVM just crashes.

Then I came across:
from collections import defaultdict 
ipDict = {}                                                                 
ipDict = defaultdict(int)
Then you don't have to an if else statement but I'm not getting any better peformance with pypy.
pypy test.py 
Length of dictionary is: 456976
Varience of dictionary is: 0.00218586853803
Total time in seconds: 293.383059978
Length of dictionary is: 456976
Varience of dictionary is: 0.00218586853803
Total time in seconds: 299.771702051
Length of dictionary is: 456976
Varience of dictionary is: 0.00218586853803
Total time in seconds: 293.569992065
Lets make sure the python version is not faster with this collections module

python test.py 
Length of dictionary is: 456976
Varience of dictionary is: 0.00218586853803
Total time in seconds: 608.467285872
There are a bunch of sites telling me to use xrange vs. range but whatever pypy is doing under the hood is making it already as fast as possible with range. Here are the results from the 5 min test with xrange vs range. No performance gains.
pypy test.py 
Length of dictionary is: 456976
Varience of dictionary is: 0.00218586853803
Total time in seconds: 299.924518824
Length of dictionary is: 456976
Varience of dictionary is: 0.00218586853803
Total time in seconds: 307.82821703
Length of dictionary is: 456976
Varience of dictionary is: 0.00218586853803
Total time in seconds: 281.461761236
There is a branch of pypy where it has the GIL removed. Python-stm

I compiled python-stm from source:

sudo apt-get install libbz2-1.0 libbz2-dev libbz2-ocaml libbz2-ocaml-dev mercurial
hg clone https://bitbucket.org/pypy/pypy/src/stmgc-c7/
rpython/bin/rpython -Ojit pypy/goal/targetpypystandalone.py
Now the results

./pypy-c /home/thesheff17/test.py  
Length of dictionary is: 456976
Varience of dictionary is: 0.00218586853803
Total time in seconds: 268.066026926
Length of dictionary is: 456976
Varience of dictionary is: 0.00218586853803
Total time in seconds: 262.150718927
Length of dictionary is: 456976f
Varience of dictionary is: 0.00218586853803
Total time in seconds: 264.11204195
They seem to be a little bit faster. I'm not sure I'm using python-stm correctly or not but have to read more about it. I don't see all the cores firing on my laptop so maybe my code is still not atomic or I'm doing something wrong. I also trust the regular pypy and python version more. This is more of an experimenetal version that might eventually replace the regular pypy.

Conclusion:

Dict{} are pretty damn fast and even faster with pypy. I don't face many data problems in the billions or even millions. I'm sure people that deal with math, scientific, and finacial data do all the time. I was thinking of testing python 3.x and see if the speeds are any better. What do you use for your fastest key:value system? If you made it this far thanks for reading.