Fast Statistical Spam Filter by Approximate Classifications

Authors: Kang Li and Zhenyu Zhong, University of Georgia. Reader: Deke Guo.

Outline • Motivations of this paper • The concrete problems • Basic idea and solutions • Questions to clarify

Motivations • Speed up the classification process to defend against spam quickly and to improve system throughput. • Improve the scalability of statistical classification methods. • Keep classification accuracy high.

The background and concrete problem • Background • Statistical Bayesian filters and their variants are used to block spam. • The statistical value of each individual token is stored in a dictionary. • A decision is made by combining the values of many tokens. • Problems to investigate • How to speed up value retrieval for each individual token (motivations 1 and 2). • The solution should not significantly reduce classification accuracy (motivation 3).

Basic idea and solutions (1) • A straightforward idea • Use a Bloom filter to store the values of tokens, and retrieve the value of any token on demand. • The first obstacle • A standard Bloom filter answers only membership queries, so how can it be extended to return a value?

[Figure: a standard Bloom filter. A hash function family maps the elements x, y (data set A) and a, b, c, d (data set B) to positions in a bit vector indexed 0 to m-1.]

[Figure: the extended Bloom filter. A hash function family maps token1 to token4 (test set B, inside the token universe that also contains x and y) to locations in the first dimension; each location holds a multi-bit vector indexed 0 to q-1 in the second dimension. A bit-wise AND over a token's locations gives the output value.]

Basic idea and solutions (2) • Replace the bit vector with a two-dimensional vector of size m × q. • The first dimension gives the hash locations of each token in an m-bit vector, the same as in the standard Bloom filter. • The second dimension at each hash location encodes the value of the token, one bit per distinct value. • The second obstacle • The value universe is usually large, even huge, so it is impossible to allocate a bit in the second dimension for every element of the value universe.
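The two-dimensional layout described on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name `ExtendedBloomFilter`, the salted SHA-256 hash family, and the use of one Python integer per hash location as a q-bit vector are all assumptions made for the example.

```python
import hashlib


class ExtendedBloomFilter:
    """m hash locations, each holding a q-bit entry (one bit per encoded value)."""

    def __init__(self, m, q, h):
        self.m, self.q, self.h = m, q, h
        self.rows = [0] * m  # each int is used as a q-bit vector

    def _locations(self, token):
        # Hypothetical hash family: h salted SHA-256 digests reduced modulo m.
        locations = []
        for i in range(self.h):
            digest = hashlib.sha256(f"{i}:{token}".encode()).digest()
            locations.append(int.from_bytes(digest[:8], "big") % self.m)
        return locations

    def put(self, token, level):
        """Store an encoded value (a level in 0..q-1) for a token."""
        if not 0 <= level < self.q:
            raise ValueError("level must be in range(q)")
        for loc in self._locations(token):
            self.rows[loc] |= 1 << level

    def query(self, token):
        """Bit-wise AND the h entries and return the list of levels still set."""
        entry = (1 << self.q) - 1
        for loc in self._locations(token):
            entry &= self.rows[loc]
        return [v for v in range(self.q) if (entry >> v) & 1]
```

A query returns an empty list for an unknown token, a single level in the normal case, and several levels when other tokens have set conflicting bits at the same locations.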

Basic idea and solutions (3) • Encode • In this field, the value universe ranges from 0 to 1. • This paper does not propose a new encoding method; it uses an algorithm taken from paper [20]. • Choose and tune the parameter q, which denotes the number of possible values produced by the encoding algorithm.
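To make the role of q concrete, here is a small sketch that maps a value in [0, 1] to one of q levels and back. It uses plain uniform quantization as a stand-in; the actual algorithm from [20] is not described on the slide, so `encode` and `decode` below are hypothetical helpers and not the paper's method.

```python
def encode(value, q):
    """Map a value in [0, 1] to one of q levels (uniform buckets, an assumption)."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must lie in [0, 1]")
    # value == 1.0 would give level q, so clamp it into the last bucket.
    return min(int(value * q), q - 1)


def decode(level, q):
    """Return the midpoint of the bucket; the error is at most 1 / (2 * q)."""
    return (level + 0.5) / q
```

With this scheme a larger q lowers the decoding deviation but widens every entry of the second dimension, which is the trade-off behind tuning q.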

Why can the idea meet motivations one and two? • Space (for the set of (token, value) pairs) • Storing them in the extended Bloom filter needs less space than other structures: K bits for each token. • Given the same allocated memory, the solution can store more (token, value) pairs than others. • Time • The extended Bloom filter is small enough to be held in memory, so no other I/O operations are needed. • The response delay of a query is constant for any input, no matter how many pairs have been stored. • In the same time slot, the solution can retrieve the values of more tokens than previous solutions.

The negative effects on the classification accuracy (1) • A query on the extended Bloom filter may produce two kinds of mistakes. • A query for a token outside the test data set may return a seemingly valid output entry (exactly one bit set to 1). • A query for a token inside the test data set may return a conflicting output entry (more than one bit set to 1). • For any token, the decoded result usually does not equal the real statistical value.

[Figure: a query on the extended Bloom filter. A hash function family maps x and y (token set A) and token1 to token4 (token set B) to locations in the first dimension; each location holds a multi-bit vector indexed 0 to q-1 in the second dimension. A bit-wise AND over the locations gives the output value.]

The negative effects on the classification accuracy (2) • The misclassification • The former error affects the combined value of a message and may influence the decision. • For a multi-bit error, choose the smallest value. If the choice is wrong, the error only makes the message look less like spam and may result in a false negative, which can be tolerated. • The decoding deviation • It cannot be avoided; it can only be reduced by designing better encoding algorithms and/or selecting the parameters carefully.
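The "choose the smallest value" rule for a multi-bit entry might look like the sketch below. The function name `resolve` and the convention that a lower level means "less spam-like" are assumptions for illustration.

```python
def resolve(levels, default=None):
    """Pick one level from a query result.

    levels  -- the value bits set in the output entry (possibly several)
    default -- returned when no bit is set, i.e. the token is unknown
    """
    if not levels:
        return default
    # On a conflict, take the smallest level so that a wrong choice can only
    # push the message toward "not spam" (a false negative at worst).
    return min(levels)
```

For example, `resolve([6, 2])` returns `2`, so a conflicting entry never raises a message's spam score above what the stored values justify.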

Questions to clarify (1) • For a query output entry, the probability that a single bit of the output entry is zero is $P_{m,n,h}(0) = 1 - P_{m,n,h}(\mathrm{fpos}) = 1 - \left(1 - (1 - 1/m)^{nh}\right)^{h}$. • For a query output entry, the paper gives the probability of the former case as $P_{m,n,h,q}(\mathrm{fpos}) = 1 - \left(P_{m,n,h}(0)\right)^{q}$ (6). • It gives the probability of the latter case as $P_{m,n,h,q}(\mathrm{multi}) = 1 - \left(P_{m,n,h}(0)\right)^{q} - q\left(1 - P_{m,n,h}(0)\right)^{q-1}$ (7).

Questions to clarify (1) • Formulas 6 and 7 are wrong, or at least not consistent with the error definitions. • The first error is the event that exactly one bit of the output entry is set to 1. • The second error is the event that more than one bit of the output entry is set to 1; its probability is one minus the probability that all bits are 0, minus the probability that exactly one bit is 1.
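As a worked form of the two events described above, write $p_0 = P_{m,n,h}(0)$ for the probability that a single bit of the output entry is zero. Under the added assumption that the q bits of an entry are set independently (an assumption of this sketch, not a statement from the paper), the probabilities would be:

```latex
\begin{align*}
P(\text{exactly one bit is 1}) &= q\,(1 - p_0)\,p_0^{\,q-1} \\
P(\text{more than one bit is 1}) &= 1 - p_0^{\,q} - q\,(1 - p_0)\,p_0^{\,q-1}
\end{align*}
```

The second line is exactly "one minus the probability that all bits are 0, minus the probability that exactly one bit is 1", and the three events together sum to 1.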

Questions to clarify (2) • Can this idea serve as a general way to extend the standard Bloom filter so that it stores and retrieves values? • The size of the value universe. • The multi-bit output error. • Deletion of (key, value) pairs.

Questions and Answers

Beyond Bloom Filters: From Approximate Membership Checks to Approximate State Machines. Authors: Flavio Bonomi, Michael Mitzenmacher, Rina Panigrahy. SIGCOMM 2006. Reader: Deke Guo.

Questions • How can each network device track the simultaneous state of a large number of connections? • The tracking structure should be small enough to fit in on-chip memory.

Solution (1) • Uses standard Bloom filters to summarize the simultaneous state of a large number of connections. • Looks up the state of each connection in that summary. • Introduces a new error type, “don’t know”, in addition to false positives and false negatives.

Solution (1) • Introduces a timing-based deletion mechanism to deal with ill-behaved or non-terminating connections. • Operations: • Put (id, state) • Lookup (id) or Lookup (id, state) • Delete (id, state) • Update (id, old state, new state) • Ill-behaved or attacking connections may cause false negative errors.

[Figure: a standard Bloom filter with hash functions h1, h2, h3, ..., hk applied to x and y (data set A); a, b, c, d form data set B.] • x doesn’t belong to set B, yet all of its bits have been set to 1. • y doesn’t belong to set B, and its bits aren’t all 1. • a belongs to set B, and its bits are all 1.

[Figure: the same Bloom filter after x has been deleted.] • a belongs to set B, but its bits are no longer all 1 after the false deletion of x. • A false positive error may result in at most k false negatives.
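The Put, Lookup, Delete and Update operations of this first solution can be sketched as below. This is a hedged illustration, assuming that each (id, state) pair is inserted as one key into a counting Bloom filter and that Lookup (id) probes every possible state; the class name, the `STATES` tuple and the hash family are hypothetical, and timing-based deletion is left out.

```python
import hashlib

STATES = ("SYN", "ESTABLISHED", "FIN")  # hypothetical state set
DONT_KNOW = "don't know"


class DirectStateFilter:
    def __init__(self, m=4096, k=3):
        self.m, self.k = m, k
        self.counters = [0] * m  # counters instead of bits, so Delete is possible

    def _cells(self, conn_id, state):
        key = f"{conn_id}|{state}"
        cells = []
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            cells.append(int.from_bytes(digest[:8], "big") % self.m)
        return cells

    def put(self, conn_id, state):
        for c in self._cells(conn_id, state):
            self.counters[c] += 1

    def delete(self, conn_id, state):
        # Deleting a pair that was never inserted (a false positive) lowers
        # counters shared with real pairs and can cause false negatives.
        for c in self._cells(conn_id, state):
            if self.counters[c] > 0:
                self.counters[c] -= 1

    def update(self, conn_id, old_state, new_state):
        self.delete(conn_id, old_state)
        self.put(conn_id, new_state)

    def lookup(self, conn_id):
        hits = [s for s in STATES
                if all(self.counters[c] > 0 for c in self._cells(conn_id, s))]
        if not hits:
            return None  # not tracked (or a false negative)
        return hits[0] if len(hits) == 1 else DONT_KNOW
```

When more than one state appears present for the same id, the lookup cannot tell which is real, which is where the “don’t know” answer comes from.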

Solution (2) • Introduces the Stateful Bloom Filter (SBF) approach. • Replaces the bit vector used by standard Bloom filters with a vector of cells. • Its false positive rate is lower than that of standard Bloom filters. Note that the two filters do not use the same amount of storage, so the comparison needs more care.

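[Figure: a Stateful Bloom Filter whose cells hold values such as 0, 1, 2 and 3, with hash functions h1, h2, h3, ..., hk applied to x; x and y are in data set A, and a, b, c, d are in data set B.] • x doesn’t belong to set B, and the lookup based on the filter also makes the right judgment.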
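A cell-based filter of this kind can be sketched as follows. The insert, delete and lookup rules here are a reconstruction under stated assumptions (each cell holds a state value plus a counter, and conflicting inserts turn the value into “don’t know”); they are not taken verbatim from the paper, and `DONT_KNOW = -1` is a hypothetical marker.

```python
import hashlib

DONT_KNOW = -1  # hypothetical marker for a cell shared by conflicting states


class StatefulBloomFilter:
    def __init__(self, m=4096, k=3):
        self.m, self.k = m, k
        self.value = [0] * m  # 0 means empty; states are positive integers
        self.count = [0] * m  # number of insertions that hit the cell

    def _cells(self, conn_id):
        # A set, so an id that hashes twice to one cell touches it only once.
        cells = set()
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{conn_id}".encode()).digest()
            cells.add(int.from_bytes(digest[:8], "big") % self.m)
        return sorted(cells)

    def put(self, conn_id, state):
        for c in self._cells(conn_id):
            if self.count[c] == 0:
                self.value[c] = state
            elif self.value[c] != state:
                self.value[c] = DONT_KNOW
            self.count[c] += 1

    def delete(self, conn_id):
        for c in self._cells(conn_id):
            if self.count[c] > 0:
                self.count[c] -= 1
                if self.count[c] == 0:
                    self.value[c] = 0  # a "don't know" cell only clears here

    def lookup(self, conn_id):
        cells = self._cells(conn_id)
        if any(self.count[c] == 0 for c in cells):
            return None  # at least one empty cell: not in the set
        seen = {self.value[c] for c in cells} - {DONT_KNOW}
        if not seen:
            return DONT_KNOW
        # Cells that disagree on the state are treated as "not in the set".
        return seen.pop() if len(seen) == 1 else None
```

Note that a cell marked “don’t know” stays that way until its counter returns to 0, even if only one connection still maps to it.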

Solution (3) • An approach using d-left hashing • The authors do not explain, through formal comparison and analysis, why it is the best of the three solutions. • The simulation tries to show this, but it is not strong enough; in particular, it does not compare the solutions under the same space budget.

[Figure: a d-left hash table whose cells hold values such as 0, 1, 2 and 3; x and y are in data set A, and a, b, c, d are in data set B.]
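The d-left idea can be illustrated with the simplified sketch below: d sub-tables, one candidate bucket per sub-table, and each (fingerprint, state) pair placed in the least loaded candidate with ties going to the left. Bucket capacity is unbounded here and the hashing is a plain salted SHA-256, so this is an assumption-laden toy rather than the paper's construction; all names are hypothetical, and states are assumed never to be `None`.

```python
import hashlib


class DLeftStateTable:
    def __init__(self, d=4, buckets=256, fp_bits=16):
        self.d, self.buckets, self.fp_bits = d, buckets, fp_bits
        # d sub-tables; each bucket maps fingerprint -> state
        self.tables = [[{} for _ in range(buckets)] for _ in range(d)]

    def _hash(self, salt, conn_id):
        digest = hashlib.sha256(f"{salt}:{conn_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def _choices(self, conn_id):
        fp = self._hash("fp", conn_id) % (1 << self.fp_bits)
        choices = [self.tables[i][self._hash(i, conn_id) % self.buckets]
                   for i in range(self.d)]
        return fp, choices

    def put(self, conn_id, state):
        fp, choices = self._choices(conn_id)
        for bucket in choices:
            if fp in bucket:  # already stored: overwrite the state in place
                bucket[fp] = state
                return
        # min() returns the first minimum, so ties go to the leftmost sub-table.
        min(choices, key=len)[fp] = state

    def lookup(self, conn_id):
        fp, choices = self._choices(conn_id)
        for bucket in choices:
            if fp in bucket:
                return bucket[fp]
        return None

    def delete(self, conn_id):
        fp, choices = self._choices(conn_id)
        for bucket in choices:
            if bucket.pop(fp, None) is not None:
                return True
        return False
```

Because only a short fingerprint is stored, two connections with the same fingerprint and a shared candidate bucket are indistinguishable, which is the source of false positives in this kind of table.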

Questions to analyze • Analyze the relationship between false positives and false negatives, and try to give a formula. • If the old value of a cell was “don’t know”, the cell keeps that value until its register becomes 0. • Analyze the fraction of cells whose value is “don’t know”, and compute the rate of this error. • If the register drops to 1 from a larger value, the value “don’t know” should become a definite value, but SBF can’t support this transformation. • If the idea of SBF is used to redesign the standard Bloom filter, can it bring benefits such as a lower false positive rate?
