Zunayed Ali Morsalin Home

# Bloom filters and you

### Use cases

Bloom filters are a probabilistic data structure that allows you to check for set membership. Figuring out if a item is in set is a `0(1)` operation. You might be thinking big deal I can do that with a hashtable but with a hash table you’re space complexity is `0(n)` and will grow quite massive for large lists. In a bloom filter you are storing hashes of your values as bits in a bitarray. You don’t actually store the data itself only information that will tell you if the item is in your list. Some use cases are a spell checker or a banned ip list. In fact chrome at one pointed shipped a bloom filter that would check if the website you’re trying to visit is a malicious site. A bloom filter storing info on a 100,000 sites with a .001 probability of false positives will only take up 1,437,759 bits (175.51 KB). Such a small footprint means you can ship the entire bitarray to the client to do client side checks and avoid server processing.

### Python example

Here’s my attempt at recreating a bloom filter in python

``````import math
import hashlib

import bitarray
import mmh3

def calc_optimal_hash_func(lenght_of_entries):
m = (-lenght_of_entries * math.log(0.01)) / ((math.log(2)) ** 2)
k = (m / lenght_of_entries) * math.log(2)

return int(m), int(math.ceil(k))

def lookup(string, bit_array, seeds):
for seed in range(seeds):
result = mmh3.hash(string, seed) % len(bit_array)
# print "seed", seed, "pos", result, "->", bit_array[result]
if bit_array[result] == False:
return string, "Def not in dictionary"

return string, "Probably in here"

data        = []
word_loc    = '/usr/share/dict/words'

with open(word_loc, 'r') as f:
data.append(word.strip())

return data

def get_bit_array():
w_length            = len(words)
array_length, seeds  = calc_optimal_hash_func(w_length)
bit_array           = array_length * bitarray.bitarray('0')

for word in words:
try:
for seed in range(seeds):
# print "word", word, "seed", seed, "pos", result, "->", bit_array[result]
pos = mmh3.hash(word, seed) % array_length
bit_array[pos] = True
except:
pass

return seeds, bit_array

if __name__ == '__main__':
seeds, ba = get_bit_array()