Answer from [this link][2]
From [Wikipedia][1]:
> Bloom filters have a strong space advantage over other data structures for representing sets, such as self-balancing binary search trees, tries, hash tables, or simple arrays or linked lists of the entries. Most of these require storing at least the data items themselves, which can require anywhere from a small number of bits, for small integers, to an arbitrary number of bits, such as for strings (tries are an exception, since they can share storage between elements with equal prefixes). Linked structures incur an additional linear space overhead for pointers. A Bloom filter with 1% error and an optimal value of *k*, on the other hand, requires only about 9.6 bits per element, regardless of the size of the elements. This advantage comes partly from its compactness, inherited from arrays, and partly from its probabilistic nature. If a 1% false positive rate seems too high, each time we add about 4.8 bits per element we decrease it by ten times.
Pretty clear to me.
A Bloom filter doesn't store the elements themselves; that is the crucial point. You don't use a Bloom filter to test whether an element is present; you use it to test whether an element is certainly **not** present, since it guarantees no false negatives. This lets you skip extra work for elements that don't exist in the set (such as disk I/O to look them up).
All of this comes in significantly less space than something like a hash table (which, for a large data set, is likely to live partially on disk). You may use a Bloom filter *in conjunction* with a structure like a hash table, once you're certain the element has a chance of being present.
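To make that concrete, here is a minimal sketch in Python. The class name, the salted-SHA-256 hashing scheme, and the bit-array layout are illustrative choices, not a canonical implementation:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray(m // 8 + 1)   # m bits, all initially 0

    def _positions(self, item):
        # Derive k bit positions by salting one strong hash with an index.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h, "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False -> definitely not in the set (no false negatives).
        # True  -> probably in the set (false positives possible).
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Note that `add` only ever sets bits and `might_contain` only reads them; no element is stored anywhere, which is exactly why the size of the elements doesn't matter.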
So an example usage pattern might be:
You've got a lot of data, on disk. You decide on the error bound you want (e.g. 1%); together with the expected number of elements *n*, that prescribes the number of bits *m*. Then the optimal *k* is determined (from the formula given in the article). You populate your filter from this disk-bound data once.
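The sizing works out like this (the element count here is an assumed example; the formulas are the standard ones from the Wikipedia article):

```python
import math

n = 1_000_000   # expected number of elements (illustrative)
p = 0.01        # target false-positive rate: the 1% above

m = math.ceil(-n * math.log(p) / math.log(2) ** 2)  # optimal bit count
k = round((m / n) * math.log(2))                    # optimal hash count

print(m / n)    # ~9.59 bits per element, the "about 9.6" from the quote
print(k)        # 7 hash functions
# Each extra factor of 10 in accuracy costs ln(10) / ln(2)^2 ~ 4.79 more
# bits per element (the "about 4.8 bits" figure from the quote).
```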
Now you have the filter in RAM. When you need to process some element, you query your filter to see if it stands a chance of existing in your data set. If it doesn't, no extra work is done: no disk reads, etc. (which you would have to pay for with a hash table or a tree).
Otherwise, if the filter says "Yes, it's in there", there's a 1% chance that it's wrong, so you do the necessary work to find out. 99% of the time, it really *will* be there, so the work was not for naught.
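Sketched end to end, reusing the `BloomFilter` class and the *m* and *k* from above (`load_keys_from_disk` and `fetch_record_from_disk` are hypothetical stand-ins for your actual storage layer):

```python
bf = BloomFilter(m, k)
for key in load_keys_from_disk():    # one-time pass over the on-disk data
    bf.add(key)

def lookup(key):
    if not bf.might_contain(key):
        return None                  # definitely absent: zero disk reads
    # "Maybe present": at a 1% false-positive rate, ~99% of the lookups
    # that reach this point find a real record, so the read is rarely wasted.
    return fetch_record_from_disk(key)
```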
[1]: http://en.wikipedia.org/wiki/Bloom_filter
[2]: https://stackoverflow.com/a/4282445/10997525