
Signature Files PowerPoint PPT Presentation



Signature Files. Information Retrieval: Data Structures and Algorithms, W.B. Frakes and R. Baeza-Yates (Eds.), Englewood Cliffs, NJ: Prentice Hall, 1992 (Chapter 4). Characteristics: word-oriented index structures based on hashing.


Presentation Transcript



Signature Files

Information Retrieval: Data Structures and Algorithms

by W.B. Frakes and R. Baeza-Yates (Eds.) Englewood Cliffs, NJ: Prentice Hall, 1992.

(Chapter 4)



Signature Files

  • Characteristics

    • Word-oriented index structures based on hashing

    • Low overhead (10%~20% over the text size) at the cost of forcing a sequential search over the index

    • Suitable for not very large texts

    • Inverted files outperform signature files for most applications



Structure

  • Use superimposed coding to create the signature.

  • Each text is divided into logical blocks.

  • A block contains n distinct non-common words.

  • Each word yields a “word signature”.

  • A word signature is a B-bit pattern with m bits set to 1.

    • Each word is divided into successive, overlapping triplets, e.g. free --> _fr, fre, ree, ee_ (where _ marks the blank at a word boundary).

    • Each such triplet is hashed to a bit position.

  • The word signatures are OR’ed to form the block signature.

  • Block signatures are concatenated to form the document signature.
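The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: the MD5-based `bit_position` hash and the helper names are assumptions made for the example, and because several triplets may hash to the same position a word can end up with fewer than m bits set.

```python
import hashlib

B = 12  # signature width in bits (the value used in the slides' example)


def triplets(word):
    """Successive, overlapping triplets of the blank-padded word."""
    padded = f" {word} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def bit_position(triplet, width=B):
    """Hypothetical hash function: map a triplet to a position in [0, width)."""
    digest = hashlib.md5(triplet.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % width


def word_signature(word, width=B):
    """Set one bit per triplet; collisions simply set the same bit twice."""
    sig = 0
    for t in triplets(word):
        sig |= 1 << bit_position(t, width)
    return sig


def block_signature(words, width=B):
    """Superimposed coding: OR the word signatures of the block together."""
    sig = 0
    for w in words:
        sig |= word_signature(w, width)
    return sig
```

A document signature would then be the concatenation of the signatures of its blocks.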



Example

  • Example (n=2, B=12, m=4)

      word              signature
      free              001000110010
      text              000010101001
      block signature   001010111011

  • Search

    • Use hash function to determine the m 1-bit positions.

    • Examine each block signature for 1s in the bit positions where the signature of the search word has a 1.
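The search test is a bitwise subset check. The sketch below uses the signatures from the example (free, text); the third signature, `OTHER`, is a made-up value chosen to show how a word that is not in the block can still pass the test.

```python
FREE = 0b001000110010
TEXT = 0b000010101001
BLOCK = FREE | TEXT  # block signature: OR of the word signatures

# Hypothetical signature of a word that is NOT in the block.
OTHER = 0b001010001001


def qualifies(block_sig, word_sig):
    """True if every 1 bit of the word signature is also 1 in the block."""
    return (block_sig & word_sig) == word_sig
```

`qualifies(BLOCK, OTHER)` is true even though the block holds only "free" and "text": all four of its bits happen to be covered by the other two words, so the block would be retrieved and then rejected when the text itself is checked.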



False Drop

  • False alarm (false hit, or false drop) probability $F_d$: the probability that a block signature seems to qualify, given that the block does not actually qualify.

    $F_d = \mathrm{Prob}\{\text{signature qualifies} \mid \text{block does not}\}$

  • For a given value of B, the value of m that minimizes the false drop probability is such that each row of the matrix contains “1”s with probability 0.5:

    $F_d = 2^{-m}, \qquad m = \frac{B \ln 2}{n}$
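As a quick numeric check of these two formulas, the following sketch evaluates them for the parameters of the earlier example (B=12, n=2). The function names are illustrative only.

```python
import math


def optimal_m(B, n):
    """Number of 1 bits per word that minimizes the false drop probability."""
    return B * math.log(2) / n


def false_drop(m):
    """False drop probability at the optimum, Fd = 2^(-m)."""
    return 2.0 ** (-m)


m_real = optimal_m(12, 2)   # about 4.16
m_used = round(m_real)      # 4, the value used in the example
fd = false_drop(m_used)     # 0.0625
```

With such a short signature roughly one non-qualifying block in sixteen would still pass the signature test, which is why realistic signatures are much longer.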



Sequential Signature File (SSF)

[Figure: sequential signature file layout; axis label: documents]

  • Assume documents span exactly one logical block.

  • Then the size of the document signature F equals the size of the block signature B.



Classification of Signature-Based Methods

  • Compression: If the signature matrix is deliberately sparse, it can be compressed.

  • Vertical partitioning: Storing the signature matrix column-wise improves the response time at the expense of insertion time.

  • Horizontal partitioning: Grouping similar signatures together and/or providing an index on the signature matrix may result in better-than-linear search.



Classification of Signature-Based Methods

  • Sequential storage of the signature matrix

    • without compression: sequential signature files (SSF)

    • with compression: bit-block compression (BC), variable bit-block compression (VBC)

  • Vertical partitioning

    • without compression: bit-sliced signature files (BSSF, B’SSF), frame-sliced (FSSF), generalized frame-sliced (GFSSF)



Classification of Signature-Based Methods (Continued)

  • Vertical partitioning (continued)

    • with compression: compressed bit slices (CBS), doubly compressed bit slices (DCBS), no-false-drop method (NFD)

  • Horizontal partitioning

    • data-independent partitioning: Gustafson’s method, partitioned signature files

    • data-dependent partitioning: 2-level signature files, S-trees



    Criteria

    • the storage overhead

    • the response time on single word queries

    • the performance on insertion, as well as whether the insertion maintains the “append-only” property



    Compression

    • idea

      • Create sparse document signatures on purpose.

      • Compress them before storing them sequentially.

    • Method

      • Use B-bit vector, where B is large.

      • Hash each word into one (or k) bit position(s).

      • Use run-length encoding (McIlroy 1982).



    Compression using run-length encoding

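    word              signature
    data              0000 0000 0000 0010 0000
    base              0000 0001 0000 0000 0000
    management        0000 1000 0000 0000 0000
    system            0000 0000 0000 0000 1000
    block signature   0000 1001 0000 0010 1000

    L1, L2, L3, L4 and L5 are the lengths of the successive runs of 0s in the block signature (here 4, 2, 6, 1 and 3). The signature is stored as

    [L1] [L2] [L3] [L4] [L5]

    where [x] is the encoded value of x.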

    search: Decode the encoded lengths of all the preceding intervals

    example: search “data”

    (1) data ==> 0000 0000 0000 0010 0000

    (2) decode [L1]=0000, decode [L2]=00, decode [L3]=000000

    disadvantage: search becomes slow
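A small sketch of this scheme, assuming the run lengths are kept as plain integers (the slides do not spell out the actual encoding [x], so it is left out here). Looking up one bit means adding up the preceding run lengths, which is the reason search slows down.

```python
SIGNATURE = "00001001000000101000"  # block signature from the example


def zero_runs(bits):
    """Lengths of the runs of 0s; each run but the last is ended by a 1."""
    runs, current = [], 0
    for b in bits:
        if b == "1":
            runs.append(current)
            current = 0
        else:
            current += 1
    runs.append(current)  # trailing zeros after the last 1
    return runs


def contains_bit(runs, position):
    """True if the bit at 0-based `position` is 1, decoding left to right."""
    index = -1
    for run in runs[:-1]:      # the last run has no 1 after it
        index += run + 1       # index of the 1 that ends this run
        if index == position:
            return True
        if index > position:
            return False
    return False


runs = zero_runs(SIGNATURE)    # [4, 2, 6, 1, 3]
```

The word "data" sets the 15th bit (0-based index 14), so `contains_bit(runs, 14)` has to decode the first three runs before it can answer.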



    Bit-block Compression (BC)

    Data structure:

    (1) The sparse vector is divided into groups of consecutive bits (bit-blocks).

    (2) Each bit-block is encoded individually.

    Algorithm (each bit-block signature has up to three parts):

    Part I. It is one bit long, and it indicates whether there are any “1”s in the bit-block (1) or the bit-block is empty (0). In the latter case, the bit-block signature stops here.

    0000 1001 0000 0010 1000
    0    1    0    1    1

    Part II. It indicates the number s of “1”s in the bit-block. It consists of s-1 “1”s and a terminating zero.

    10 0 0

    Part III. It contains the offsets of the “1”s from the beginning of the bit-block.

    00 11 10 00

    Note: with 4-bit blocks, the offsets 0, 1, 2, 3 are encoded as 00, 01, 10, 11.

    block signature: 01011 | 10 0 0 | 00 11 10 00
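The three parts can be produced with a short encoder. This is a sketch under the assumptions of the example (bit-blocks of 4 bits, fixed-width binary offsets); the function name `bc_encode` is not from the source.

```python
def bc_encode(bits, block_size=4):
    """Return (Part I, Part II, Part III) of the bit-block compressed vector."""
    part1, part2, part3 = [], [], []
    offset_width = (block_size - 1).bit_length()  # 2 bits for 4-bit blocks
    for start in range(0, len(bits), block_size):
        block = bits[start:start + block_size]
        offsets = [i for i, b in enumerate(block) if b == "1"]
        if not offsets:
            part1.append("0")        # empty bit-block: nothing else is stored
            continue
        part1.append("1")
        part2.append("1" * (len(offsets) - 1) + "0")   # s-1 ones, then a zero
        part3.extend(format(o, f"0{offset_width}b") for o in offsets)
    return "".join(part1), "".join(part2), "".join(part3)


encoded = bc_encode("00001001000000101000")
```

For the example vector this gives the parts 01011, 1000 and 00111000, i.e. the same bits as the block signature shown on the slide.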



    Bit-block Compression (BC)

    (Continued)

    Search “data”

    (1) data ==> 0000 0000 0000 0010 0000

    (2) Check the 4th bit-block of the signature 01011 | 10 0 0 | 00 11 10 00

    (3) OK, Part I shows at least one bit set in the 4th bit-block.

    (4) Check further. The “0” in Part II tells us there is only one bit set in the 4th bit-block. Is it the 3rd bit?

    (5) Yes, the offset “10” in Part III confirms the result.

    Discussion:

    (1) Bit-block compression requires less space than the Sequential Signature File for the same false drop probability.

    (2) The response time of bit-block compression is slightly less than that of the Sequential Signature File.
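The lookup walked through above can be written as a small function over the three parts. This is an illustrative sketch assuming 4-bit blocks and 2-bit offsets as in the example; `bc_contains` and its arguments are hypothetical names.

```python
def bc_contains(parts, position, block_size=4):
    """True if the bit at 0-based `position` is set in the compressed vector."""
    part1, part2, part3 = parts
    offset_width = (block_size - 1).bit_length()
    target_block, target_offset = divmod(position, block_size)
    if part1[target_block] == "0":
        return False                      # Part I: the bit-block is empty
    p2 = 0                                # cursor into Part II
    p3 = 0                                # cursor into Part III
    for flag in part1[:target_block]:     # skip earlier non-empty bit-blocks
        if flag == "1":
            end = part2.index("0", p2)
            count = end - p2 + 1          # number of 1s in that bit-block
            p2 = end + 1
            p3 += count * offset_width
    end = part2.index("0", p2)
    count = end - p2 + 1                  # number of 1s in the target block
    for i in range(count):
        chunk = part3[p3 + i * offset_width:p3 + (i + 1) * offset_width]
        if int(chunk, 2) == target_offset:
            return True
    return False


PARTS = ("01011", "1000", "00111000")     # signature from the example
```

For "data" the bit of interest is the 3rd bit of the 4th bit-block, that is 0-based position 14, so the query is `bc_contains(PARTS, 14)`.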



    Vertical Partitioning

    • idea: avoid bringing useless portions of the document signature into main memory

    • methods

      • store the signature file in a bit-sliced form or in a frame-sliced form

      • store the signature matrix column-wise to improve the response time at the expense of insertion time



    Bit-Sliced Signature Files (BSSF)

    [Figure: transposed bit matrix; the matrix of document signatures (axis label: documents) and its transpose]



    [Figure: F bit-files; axis label: documents]

    search:

    (1) Retrieve m bit-files. E.g., the word signature of “free” is 001 000 110 010, so a document that contains “free” has its 3rd, 7th, 8th and 11th bits set; i.e., only the 3rd, 7th, 8th and 11th bit-files are examined.

    (2) “AND” these vectors. The 1s in the resulting N-bit vector denote the qualifying logical blocks (documents).

    (3) Retrieve the text file through the pointer file.

    insertion: requires F disk accesses for a new logical block (document), one for each bit-file, but no rewriting
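A minimal in-memory sketch of the bit-sliced search, assuming each bit-file is held as a Python integer used as a bit mask over documents. The two document signatures are built from the earlier example (document 0 contains "free" and "text", document 1 only "text"); the function names are illustrative.

```python
def to_bit_slices(doc_signatures, F):
    """Transpose the matrix: slice j holds bit j of every document signature."""
    slices = [0] * F
    for doc_id, sig in enumerate(doc_signatures):
        for j, bit in enumerate(sig):
            if bit == "1":
                slices[j] |= 1 << doc_id
    return slices


def search(slices, word_sig, n_docs):
    """AND the slices selected by the word signature; return qualifying docs."""
    result = (1 << n_docs) - 1            # start with every document
    for j, bit in enumerate(word_sig):
        if bit == "1":
            result &= slices[j]           # only these bit-files are read
    return [d for d in range(n_docs) if (result >> d) & 1]


DOCS = ["001010111011", "000010101001"]
SLICES = to_bit_slices(DOCS, F=12)
```

Searching for "free" (`001000110010`) touches only four of the twelve slices and returns document 0; searching for "text" returns both documents.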



    Frame-Sliced Signature File (FSSF)

    • Ideas

      • random disk accesses are more expensive than sequential ones

      • force each word to hash into bit positions that are closer to each other in the document signature

      • these bit files are stored together and can be retrieved with a few random accesses

    • Procedures

      • The document signature (F bits long) is divided into k frames of s consecutive bits each.

      • For each word in the document, one of the k frames will be chosen by a hash function.

      • Using another hash function, the word sets m bits in that frame.



    Frame-Sliced Signature File (Cont.)

    [Figure: signature matrix divided into frames; axis labels: documents, frames]

    Each frame will be kept in consecutive disk blocks.



    FSSF (Continued)

    • Example (n=2, B=12, s=6, k=2, m=3)

        word             signature
        free             000000 110010
        text             010110 000000
        doc. signature   010110 110010

    • Search

      • Only one frame has to be retrieved for a single-word query, i.e., only one random disk access is required. E.g., to search for documents that contain the word “free”: because the word signature of “free” is placed in the 2nd frame, only the 2nd frame has to be examined.

      • For a multi-word query, at most one frame per query word has to be scanned.

    • Insertion

      • Only k frames have to be accessed instead of F bit-slices.
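The frame-sliced procedure can be sketched as follows. The two hash functions are stand-ins built on SHA-256 (an assumption for illustration; the slides do not name the hash functions), so the resulting bit patterns will differ from the ones in the example table.

```python
import hashlib


def _h(text, salt):
    """Hypothetical salted hash returning a non-negative integer."""
    digest = hashlib.sha256(f"{salt}:{text}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fssf_word_signature(word, k=2, s=6, m=3):
    """Pick one of k frames, then set m distinct bits inside that frame."""
    frame = _h(word, "frame") % k         # first hash: choose the frame
    bits, i = set(), 0
    while len(bits) < m:                  # second hash: m positions (m <= s)
        bits.add(_h(word, f"bit{i}") % s)
        i += 1
    sig = ["0"] * (k * s)
    for b in bits:
        sig[frame * s + b] = "1"
    return frame, "".join(sig)


def fssf_document_signature(words, k=2, s=6, m=3):
    """OR the word signatures; the result is k frames of s bits each."""
    total = 0
    for w in words:
        _, bits = fssf_word_signature(w, k, s, m)
        total |= int(bits, 2)
    return format(total, f"0{k * s}b")
```

Because all m bits of a word fall in one frame, a single-word query needs only the frame returned by the first hash.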



    Vertical Partitioning with Compression

    • idea

      • create a very sparse signature matrix

      • store it in a bit-sliced form

      • compress each bit slice by storing the position of the 1s in the slice.



    Compressed Bit Slices (CBS)

    • Room for improvement over BSSF

      • Searching

        • Each search word requires the retrieval of m bit files.

        • The search time could be improved if m were forced to be 1.

      • Insertion

        • Requires too many disk accesses (equal to F, which is typically 600-1000).



    Compressed Bit Slices (CBS)(Continued)

    • Let m=1. To maintain the same false drop probability, F has to be increased.

    • To compress each bit file, we store only the positions of the “1”s.

    • Because the number of “1”s in a bit file is unpredictable, we store the positions in buckets of size Bp.

    [Figure: sparse bit matrix; labels: documents, size of a signature]



    • Differences from inversion

      • The directory (hash table) is sparse.

      • The actual word is stored nowhere.

      • Simple structure

    • Search

      • Hash a word to obtain the bucket address, e.g. h(“base”)=30.

      • Obtain the pointers to the relevant documents from the buckets.
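A toy version of this structure, with m=1 so that each word maps to exactly one directory entry. It is a simplified sketch: the table size `S`, the CRC32-based hash `h`, and the use of plain Python lists instead of fixed-size buckets of Bp entries are all assumptions made for the example.

```python
import zlib

S = 64  # hypothetical directory (hash table) size


def h(word):
    """Hypothetical hash: word -> directory slot."""
    return zlib.crc32(word.encode("utf-8")) % S


class CompressedBitSlices:
    def __init__(self):
        # Sparse directory: most slots stay None. No word is stored anywhere.
        self.directory = [None] * S

    def insert(self, doc_id, words):
        for w in set(words):
            slot = h(w)
            if self.directory[slot] is None:
                self.directory[slot] = []
            if doc_id not in self.directory[slot]:
                self.directory[slot].append(doc_id)   # position of a "1"

    def search(self, word):
        """Pointers to candidate documents (may contain false drops)."""
        return list(self.directory[h(word)] or [])
```

Two different words that hash to the same slot share one posting list, which is where false drops come from in this scheme.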



    Doubly Compressed Bit Slices

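    Idea: compress the sparse directory.

    • When S becomes smaller, the chance of collisions grows, so intermediate buckets are used.

    • To distinguish true collisions from false ones, an additional hash function is introduced.

    • Distinguish synonyms partially, e.g. h1(“base”)=30, h2(“base”)=011.

    • Follow the pointers of the posting buckets to retrieve the qualifying documents.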
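The two-hash idea can be sketched with a small directory whose entries act as intermediate buckets keyed by a second hash code. Everything concrete here is an assumption for illustration: the sizes `S` and `H2_BITS`, the CRC32 and Adler-32 based hash functions, and the use of a dictionary as the intermediate bucket.

```python
import zlib

S = 8         # hypothetical, deliberately small directory
H2_BITS = 3   # hypothetical width of the second hash code


def h1(word):
    """First hash: directory slot."""
    return zlib.crc32(word.encode("utf-8")) % S


def h2(word):
    """Second hash: short code that tells most synonyms apart."""
    return zlib.adler32(word.encode("utf-8")) % (1 << H2_BITS)


class DoublyCompressedBitSlices:
    def __init__(self):
        # Each slot is an intermediate bucket: h2 code -> posting list.
        self.directory = [dict() for _ in range(S)]

    def insert(self, doc_id, words):
        for w in set(words):
            postings = self.directory[h1(w)].setdefault(h2(w), [])
            if doc_id not in postings:
                postings.append(doc_id)

    def search(self, word):
        """Follow the entry matching both h1 and h2 to its posting list."""
        return list(self.directory[h1(word)].get(h2(word), []))
```

Words that collide on both hash values still share a posting list, so synonyms are only partially distinguished.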



    No False Drops Method

    • Goal: distinguish between synonyms completely.

    • Method: use a pointer to the word in the text file.



    Horizontal Partitioning

    1. Goal: group the signatures into sets, partitioning the signature matrix horizontally.

    2. Grouping criterion

    [Figure: horizontally partitioned signature matrix; axis label: documents]



    Partitioned Signature Files

    • Using a portion of a document signature as a signature key to partition the signature file.

    • All signatures with the same key will be grouped into a so-called “module”.

    • When a query signature arrives,

      • examine its signature key and look for the corresponding modules

      • scan all the signatures within those modules that have been selected
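A rough sketch of this partitioning, assuming the signature key is simply the first `KEY_BITS` bits of the signature and that a module is selected when its key has a 1 wherever the query's key does. Both choices are assumptions for illustration; the slides do not fix how the key is taken or matched.

```python
from collections import defaultdict

KEY_BITS = 4  # hypothetical key length


def signature_key(sig):
    """Use a fixed portion (here the prefix) of the signature as the key."""
    return sig[:KEY_BITS]


def covers(sig, query):
    """True if `sig` has a 1 in every position where `query` has a 1."""
    return all(s == "1" for s, q in zip(sig, query) if q == "1")


def build_modules(doc_signatures):
    """Group signatures with the same key into one module."""
    modules = defaultdict(list)
    for doc_id, sig in enumerate(doc_signatures):
        modules[signature_key(sig)].append((doc_id, sig))
    return modules


def search(modules, query_sig):
    """Scan only the modules whose key is compatible with the query's key."""
    query_key = signature_key(query_sig)
    hits = []
    for key, members in modules.items():
        if covers(key, query_key):
            hits.extend(d for d, sig in members if covers(sig, query_sig))
    return sorted(hits)


DOCS = ["001010111011", "000010101001", "110000000011"]
MODULES = build_modules(DOCS)
```

With these three signatures a query for `001000110010` scans only the module with key 0010, whereas a query whose key is all zeros cannot rule out any module.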

