# Python Dictonary Speed

Many times I get data from someone and they want the unique count of something in a bunch of text files. I want to test to see how fast python dictonaries are so I wrote some code. If the code runs under 5 min I'm pretty happy. When it starts exceeding that I look for ways to optimize. How far can dictionaries scale? I'm running this on Intel(R) Core(TM) i7-4750HQ, 16GB of RAM, Samsung 840 SSD, with python 2.7.6.```
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (c) 2014, Dan Sheffner http://digisoftinc.org/
# All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the
# "Software"), to deal in the Software without restriction, including
# without limitation the rights to use, copy, modify, merge, publish, dis-
# tribute, sublicense, and/or sell copies of the Software, and to permit
# persons to whom the Software is furnished to do so, subject to the fol-
# lowing conditions:
#
# The above copyright notice and this permission notice shall be included
# in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
# OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABIL-
# ITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT
# SHALL THE AUTHOR BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
# WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
# IN THE SOFTWARE.
import random
import string
import time
def randomword(length):
return ''.join(random.choice(string.lowercase) for i in range(length))
def checkSpeed(howMany, variance):
ipDict = {}
# set time
t0 = time.time()
for x in range(0, howMany):
randWord = randomword(variance)
# randWord = randWord.ljust(4, '0')
if randWord in ipDict:
ipDict[randWord] += 1
else:
ipDict[randWord] = 0
# get time
t1 = time.time()
total = t1-t0
print "Length of dictionary is: " + str(len(ipDict))
print "Varience of dictionary is: " + \
str(float(len(ipDict)) / float(howMany))
print "Total time in seconds: " + str(total)
checkSpeed(100, 8)
```

```
python test.py
Length of dictionary is: 100
Varience of dictionary is: 1.0
Total time in seconds: 0.000412940979004
```

```
python test.py
Length of dictionary is: 10000
Varience of dictionary is: 1.0
Total time in seconds: 0.0422868728638
```

```
python test.py
Length of dictionary is: 9999781
Varience of dictionary is: 0.9999781
Total time in seconds: 42.8735089302
```

```
python test.py
Length of dictionary is: 99975959
Varience of dictionary is: 0.99975959
Total time in seconds: 450.787559986
```

.99 basically means almost every value is unique. This is useless test if all your data is unique. I will pad the string to always have 8 characters but now decrease variance. At this point you should be thinking map reduce. I'm going to still test python and see what it can do. Python update:

```
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (c) 2014, Dan Sheffner http://digisoftinc.org/
# All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the
# "Software"), to deal in the Software without restriction, including
# without limitation the rights to use, copy, modify, merge, publish, dis-
# tribute, sublicense, and/or sell copies of the Software, and to permit
# persons to whom the Software is furnished to do so, subject to the fol-
# lowing conditions:
#
# The above copyright notice and this permission notice shall be included
# in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
# OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABIL-
# ITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT
# SHALL THE AUTHOR BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
# WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
# IN THE SOFTWARE.
import random
import string
import time
def randomword(length):
return ''.join(random.choice(string.lowercase) for i in range(length))
def checkSpeed(howMany, variance):
ipDict = {}
# set time
t0 = time.time()
for x in range(0, howMany):
randWord = randomword(variance)
randWord = randWord.ljust(4, '0')
if randWord in ipDict:
ipDict[randWord] += 1
else:
ipDict[randWord] = 0
# get time
t1 = time.time()
total = t1-t0
print "Length of dictionary is: " + str(len(ipDict))
print "Varience of dictionary is: " + \
str(float(len(ipDict)) / float(howMany))
print "Total time in seconds: " + str(total)
checkSpeed(100, 4)
```

```
python test.py
Length of dictionary is: 456976
Varience of dictionary is: 0.00456976
Total time in seconds: 298.798802853
```

```
pypy -v
Python 2.7.3 (2.2.1+dfsg-1, Nov 28 2013, 05:13:10)
[PyPy 2.2.1 with GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
And now for something completely different: ``we have no anti-impossible stick
that makes sure that all your programs halt''
>>>>
```

```
pypy test.py
Length of dictionary is: 456976
Varience of dictionary is: 0.00456976
Total time in seconds: 138.739334106
```

```
pypy test.py
Length of dictionary is: 456976
Varience of dictionary is: 0.00228488
Total time in seconds: 287.588469028
```

You take (200 million * 300) / 287 = X

Which would be 209,059,233.45 rounded to two decimal points. Lets try it and see how close we are with the whole integer 209,059,233 entries.

```
pypy test.py
Length of dictionary is: 456976
Varience of dictionary is: 0.00218586853803
Total time in seconds: 296.72205019
```

It says we can scale to: 1,254,355,400 which is 1.25 billion records.

```
pypy test.py
Length of dictionary is: 456976
Varience of dictionary is: 0.000364311422425
Total time in seconds: 1772.08403897
```

You never want a single sample size so lets run it again:

100 variance 8

```
Run 1:
python test.py
Length of dictionary is: 100
Varience of dictionary is: 1.0
Total time in seconds: 0.000412940979004
Run 2:
python test.py
Length of dictionary is: 100
Varience of dictionary is: 1.0
Total time in seconds: 0.000426054000854
```

```
Run1:
python test.py
Length of dictionary is: 10000
Varience of dictionary is: 1.0
Total time in seconds: 0.0422868728638
Run 2:
python test.py
Length of dictionary is: 10000
Varience of dictionary is: 1.0
Total time in seconds: 0.0424108505249
```

```
Run 1:
python test.py
Length of dictionary is: 9999781
Varience of dictionary is: 0.9999781
Total time in seconds: 42.8735089302
Run 2:
python test.py
Length of dictionary is: 9999755
Varience of dictionary is: 0.9999755
Total time in seconds: 42.176858902
```

```
Run 1:
python test.py
Length of dictionary is: 99975959
Varience of dictionary is: 0.99975959
Total time in seconds: 450.787559986
Run 2:
python test.py
Length of dictionary is: 99975915
Varience of dictionary is: 0.99975915
Total time in seconds: 481.742855072
```

100 million 4 variance

```
Run 1:
python test.py
Length of dictionary is: 456976
Varience of dictionary is: 0.00456976
Total time in seconds: 298.798802853
Run 2:
python test.py
Length of dictionary is: 456976
Varience of dictionary is: 0.00456976
Total time in seconds: 289.573727846
```

100 million 4 variance

```
Run 1:
pypy test.py
Length of dictionary is: 456976
Varience of dictionary is: 0.00456976
Total time in seconds: 138.739334106
Run2 2:
pypy test.py
Length of dictionary is: 456976
Varience of dictionary is: 0.00456976
Total time in seconds: 143.796292067
```

```
Run 1:
pypy test.py
Length of dictionary is: 456976
Varience of dictionary is: 0.00228488
Total time in seconds: 287.588469028
Run 2:
pypy test.py
Length of dictionary is: 456976
Varience of dictionary is: 0.00228488
Total time in seconds: 289.412295818
```

```
Run 1:
pypy test.py
Length of dictionary is: 456976
Varience of dictionary is: 0.00218586853803
Total time in seconds: 296.72205019
Run 2:
pypy test.py
Length of dictionary is: 456976
Varience of dictionary is: 0.00218586853803
Total time in seconds: 299.482395172
```

```
Run 1:
pypy test.py
Length of dictionary is: 456976
Varience of dictionary is: 0.000364311422425
Total time in seconds: 1772.08403897
Run 2:
Length of dictionary is: 456976
Varience of dictionary is: 0.000364311422425
Total time in seconds: 1842.9967351
```

```
python test.py
Killed
```

100 and 8 variance.

```
python test.py
Length of dictionary is: 100
Varience of dictionary is: 1.0
Total time in seconds: 0.000587940216064
Length of dictionary is: 100
Varience of dictionary is: 1.0
Total time in seconds: 0.000612020492554
Length of dictionary is: 100
Varience of dictionary is: 1.0
Total time in seconds: 0.00058913230896
```

```
python test.py
Length of dictionary is: 10000
Varience of dictionary is: 1.0
Total time in seconds: 0.0571658611298
Length of dictionary is: 10000
Varience of dictionary is: 1.0
Total time in seconds: 0.0567030906677
Length of dictionary is: 10000
Varience of dictionary is: 1.0
Total time in seconds: 0.0561099052429
```

```
python test.py
Length of dictionary is: 9999734
Varience of dictionary is: 0.9999734
Total time in seconds: 58.6800808907
Length of dictionary is: 9999746
Varience of dictionary is: 0.9999746
Total time in seconds: 57.5085248947
Length of dictionary is: 9999772
Varience of dictionary is: 0.9999772
Total time in seconds: 57.4015569687
```

```
python test.py
Length of dictionary is: 99976136
Varience of dictionary is: 0.99976136
Total time in seconds: 586.001760006
Length of dictionary is: 99975995
Varience of dictionary is: 0.99975995
Total time in seconds: 584.615675926
Length of dictionary is: 99975999
Varience of dictionary is: 0.99975999
Total time in seconds: 585.366222858
```

```
python test.py
Length of dictionary is: 456976
Varience of dictionary is: 0.00456976
Total time in seconds: 421.710411072
Length of dictionary is: 456976
Varience of dictionary is: 0.00456976
Total time in seconds: 425.94264698
Length of dictionary is: 456976
Varience of dictionary is: 0.00456976
Total time in seconds: 425.917401791
```

```
pypy test.py
Length of dictionary is: 456976
Varience of dictionary is: 0.00456976
Total time in seconds: 167.102699995
Length of dictionary is: 456976
Varience of dictionary is: 0.00456976
Total time in seconds: 163.064245939
Length of dictionary is: 456976
Varience of dictionary is: 0.00456976
Total time in seconds: 163.33949995
```

```
pypy test.py
Length of dictionary is: 456976
Varience of dictionary is: 0.00228488
Total time in seconds: 325.355190039
Length of dictionary is: 456976
Varience of dictionary is: 0.00228488
Total time in seconds: 322.537681103
Length of dictionary is: 456976
Varience of dictionary is: 0.00228488
Total time in seconds: 321.350142002
```

```
pypy test.py
Length of dictionary is: 456976
Varience of dictionary is: 0.00218586853803
Total time in seconds: 346.197941065
Length of dictionary is: 456976
Varience of dictionary is: 0.00218586853803
Total time in seconds: 338.920826912
Length of dictionary is: 456976
Varience of dictionary is: 0.00218586853803
Total time in seconds: 340.864919186
```

```
pypy test.py
Length of dictionary is: 456976
Varience of dictionary is: 0.000364311422425
Total time in seconds: 2011.19430304
Length of dictionary is: 456976
Varience of dictionary is: 0.000364311422425
Total time in seconds: 2010.20652199
Length of dictionary is: 456976
Varience of dictionary is: 0.000364311422425
Total time in seconds: 2009.70536113
```

```
java -jar jython.jar ~/test.py
Traceback (most recent call last):
File "/home/ubuntu/test2.py", line 35, in <module>
checkSpeed(209059233, 4)
File "/home/ubuntu/test2.py", line 15, in checkSpeed
for x in range(0, howMany):
java.lang.OutOfMemoryError: Java heap space
at java.lang.Object.clone(Native Method)
at java.util.Arrays$ArrayList.toArray(Arrays.java:2851)
at java.util.ArrayList.<init>(ArrayList.java:164)
at org.python.core.PyList.<init>(PyList.java:52)
at org.python.core.PyList.<init>(PyList.java:64)
at org.python.core.__builtin__.range(__builtin__.java:886)
at org.python.core.__builtin__.range(__builtin__.java:894)
at org.python.core.BuiltinFunctions.__call__(__builtin__.java:136)
at org.python.core.PyObject.__call__(PyObject.java:408)
at org.python.pycode._pyx0.checkSpeed$3(/home/ubuntu/test2.py:32)
at org.python.pycode._pyx0.call_function(/home/ubuntu/test2.py)
at org.python.core.PyTableCode.call(PyTableCode.java:165)
at org.python.core.PyBaseCode.call(PyBaseCode.java:149)
at org.python.core.PyFunction.__call__(PyFunction.java:327)
at org.python.pycode._pyx0.f$0(/home/ubuntu/test2.py:37)
at org.python.pycode._pyx0.call_function(/home/ubuntu/test2.py)
at org.python.core.PyTableCode.call(PyTableCode.java:165)
at org.python.core.PyCode.call(PyCode.java:18)
at org.python.core.Py.runCode(Py.java:1275)
at org.python.util.PythonInterpreter.execfile(PythonInterpreter.java:235)
at org.python.util.jython.run(jython.java:247)
at org.python.util.jython.main(jython.java:129)
java.lang.OutOfMemoryError: java.lang.OutOfMemoryError: Java heap space
```

Then I came across:

```
from collections import defaultdict
ipDict = {}
ipDict = defaultdict(int)
```

```
pypy test.py
Length of dictionary is: 456976
Varience of dictionary is: 0.00218586853803
Total time in seconds: 293.383059978
Length of dictionary is: 456976
Varience of dictionary is: 0.00218586853803
Total time in seconds: 299.771702051
Length of dictionary is: 456976
Varience of dictionary is: 0.00218586853803
Total time in seconds: 293.569992065
```

```
python test.py
Length of dictionary is: 456976
Varience of dictionary is: 0.00218586853803
Total time in seconds: 608.467285872
```

```
pypy test.py
Length of dictionary is: 456976
Varience of dictionary is: 0.00218586853803
Total time in seconds: 299.924518824
Length of dictionary is: 456976
Varience of dictionary is: 0.00218586853803
Total time in seconds: 307.82821703
Length of dictionary is: 456976
Varience of dictionary is: 0.00218586853803
Total time in seconds: 281.461761236
```

I compiled python-stm from source:

```
sudo apt-get install libbz2-1.0 libbz2-dev libbz2-ocaml libbz2-ocaml-dev mercurial
hg clone https://bitbucket.org/pypy/pypy/src/stmgc-c7/
rpython/bin/rpython -Ojit pypy/goal/targetpypystandalone.py
```

```
./pypy-c /home/thesheff17/test.py
Length of dictionary is: 456976
Varience of dictionary is: 0.00218586853803
Total time in seconds: 268.066026926
Length of dictionary is: 456976
Varience of dictionary is: 0.00218586853803
Total time in seconds: 262.150718927
Length of dictionary is: 456976f
Varience of dictionary is: 0.00218586853803
Total time in seconds: 264.11204195
```

Conclusion:

Dict{} are pretty damn fast and even faster with pypy. I don't face many data problems in the billions or even millions. I'm sure people that deal with math, scientific, and finacial data do all the time. I was thinking of testing python 3.x and see if the speeds are any better. What do you use for your fastest key:value system? If you made it this far thanks for reading.