Bloom filters are a probabilistic data structure that allows you to check for set membership. Figuring out if a item is in set is a 0(1)
operation. You might be thinking big deal I can do that with a hashtable but with a hash table you’re space complexity is 0(n)
and will grow quite massive for large lists. In a bloom filter you are storing hashes of your values as bits in a bitarray. You don’t actually store the data itself only information that will tell you if the item is in your list. Some use cases are a spell checker or a banned ip list. In fact chrome at one pointed shipped a bloom filter that would check if the website you’re trying to visit is a malicious site. A bloom filter storing info on a 100,000 sites with a .001 probability of false positives will only take up 1,437,759 bits (175.51 KB). Such a small footprint means you can ship the entire bitarray to the client to do client side checks and avoid server processing.
Here’s my attempt at recreating a bloom filter in python
import math
import hashlib
import bitarray
import mmh3
def calc_optimal_hash_func(lenght_of_entries):
m = (-lenght_of_entries * math.log(0.01)) / ((math.log(2)) ** 2)
k = (m / lenght_of_entries) * math.log(2)
return int(m), int(math.ceil(k))
def lookup(string, bit_array, seeds):
for seed in range(seeds):
result = mmh3.hash(string, seed) % len(bit_array)
# print "seed", seed, "pos", result, "->", bit_array[result]
if bit_array[result] == False:
return string, "Def not in dictionary"
return string, "Probably in here"
def load_words():
data = []
word_loc = '/usr/share/dict/words'
with open(word_loc, 'r') as f:
for word in f.readlines():
data.append(word.strip())
return data
def get_bit_array():
words = load_words()
w_length = len(words)
array_length, seeds = calc_optimal_hash_func(w_length)
bit_array = array_length * bitarray.bitarray('0')
for word in words:
try:
for seed in range(seeds):
# print "word", word, "seed", seed, "pos", result, "->", bit_array[result]
pos = mmh3.hash(word, seed) % array_length
bit_array[pos] = True
except:
pass
return seeds, bit_array
if __name__ == '__main__':
seeds, ba = get_bit_array()
print(lookup('badwordforsure', ba, seeds))
print(lookup('cat', ba, seeds))
print(lookup('hello', ba, seeds))
print(lookup('jsalj', ba, seeds))