SMILES Multigram Compression

Roger Sayle (1) and Jack Delany (2)
(1) Metaphorics LLC, Santa Fe, New Mexico
(2) Daylight CIS, Santa Fe, New Mexico
Mug01, 6-9 March 2001, Santa Fe, New Mexico, USA.

Introduction
One of the major benefits of line notations such as SMILES over traditional connection tables is their compact representation. For NCI95, the SMILES average 33 bytes per molecule, whereas an MDL .mol file, for example, is over 1400 bytes. This advantage has enabled Daylight’s software to hold even the largest chemical databases in memory since the early 1980s, and to access and search this data much faster than disk-based systems.

Multigram Encoding
• Only 70 different characters can occur in a valid SMILES string:
    #%()*+-./0123456789:=>@
    ABCDEFGHIKLMNOPRSTUVWXYZ[\]
    abcdefghiklmnoprstuy
• Allowing for a (null) terminator character, 185 of the 256 possible byte values cannot normally occur in a SMILES.
• Multigram compression uses these unused values to represent commonly occurring SMILES substrings.
• Compression occurs because the entire substring (or multigram) is encoded as a single byte.
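
The count of 185 free byte values follows from 256 - 70 - 1. As a small check, the sketch below marks the 70 listed characters and the null terminator as used and counts what is left. The names smiles_alphabet and count_free_codes are illustrative and do not come from the original software.

```c
#include <string.h>

/* The 70 characters listed on the slide (assumed complete as given). */
static const char smiles_alphabet[] =
    "#%()*+-./0123456789:=>@"
    "ABCDEFGHIKLMNOPRSTUVWXYZ[\\]"
    "abcdefghiklmnoprstuy";

/* Count the byte values that never occur in a SMILES string and are
   therefore available as multigram codes. */
int count_free_codes(void)
{
    int used[256] = {0};
    int free_codes = 0;
    size_t i, n = strlen(smiles_alphabet);
    int b;

    used[0] = 1;                       /* reserved for the null terminator */
    for (i = 0; i < n; i++)
        used[(unsigned char)smiles_alphabet[i]] = 1;
    for (b = 0; b < 256; b++)
        if (!used[b])
            free_codes++;
    return free_codes;
}
```

Which of the free values are actually assigned to multigrams is a design choice the slides do not specify.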

Advantages of Multigrams
• Conceptually very simple to implement.
• Extremely fast data decompression.
• Each SMILES decompresses independently.
• Domain-specific ‘a priori’ statistical model.
• Guaranteed worst-case performance.
• Uncompressed data is treated identically.
• Efficient compression implementation.
• Processing of the compressed form is possible.

Examples of Multigrams
• Canonical SMILES
    [nH] c1ccccc1 [N+](=O)[O-] S(=O)(=O) c1ccc(cc1) [N+] Cl [n+] (C) [O+] C=C C(=O)N
• Isomeric SMILES
    [C@@H] [C@H] [C@@] [C@] /C=C/ /C=C\
• Reaction SMILES
    [cH: [CH :1 [CH2: [c: [O:

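Multigram Decompression
• Decompression of multigram-encoded SMILES is almost trivial:

    extern char *MultiGram[256];   /* expansion string for each byte value */

    dst = outp;
    for( i=0; inp[i]; i++ ) {
        /* index as unsigned: codes above 127 are negative where char is signed */
        src = MultiGram[(unsigned char)inp[i]];
        while( *src )
            *dst++ = *src++;
    }
    *dst = '\0';                   /* terminate the expanded SMILES */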
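
For readers who want to try the loop above, here is a self-contained sketch with bounds checking. The table contents are hypothetical: the codes 0x80 and 0x81 and the helper names init_example_table and smi_expand are chosen for illustration only, and a NULL table entry is assumed to mean "this byte stands for itself".

```c
#include <stddef.h>

/* Hypothetical expansion table; NULL means the byte is copied unchanged. */
static const char *multigram_table[256];

void init_example_table(void)
{
    multigram_table[0x80] = "c1ccccc1";   /* illustrative assignment */
    multigram_table[0x81] = "C(=O)N";     /* illustrative assignment */
}

/* Expand inp into outp (capacity outsize, including the terminator).
   Returns the expanded length, or -1 if outp is too small. */
int smi_expand(const char *inp, char *outp, size_t outsize)
{
    size_t n = 0;

    if (outsize == 0)
        return -1;
    for (; *inp; inp++) {
        unsigned char code = (unsigned char)*inp;
        const char *src = multigram_table[code];
        if (src == NULL) {                 /* ordinary SMILES character */
            if (n + 1 >= outsize)
                return -1;
            outp[n++] = (char)code;
        } else {                           /* multigram code: copy its expansion */
            while (*src) {
                if (n + 1 >= outsize)
                    return -1;
                outp[n++] = *src++;
            }
        }
    }
    outp[n] = '\0';
    return (int)n;
}
```

With this table, the two-byte input "\x80" "O" expands to the nine-character SMILES c1ccccc1O, and a string containing no multigram codes passes through unchanged.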

Multigram Compression
• Efficient compression is trickier…

Given a simple alphabet of only “A” and “B”, with the set of multigrams “A”, “B”, “AB” and “BAA”, encode the string “ABAA”.
The greedy solution uses 3 bytes: “AB”, “A” and “A”.
An optimal solution uses only 2 bytes: “A” and “BAA”.
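
The greedy result can be reproduced with a longest-match-first encoder. This is a minimal sketch assuming "greedy" means taking the longest matching multigram at each position; the function name greedy_code_count is illustrative, and the linear scan over the multigram list is chosen for brevity rather than speed.

```c
#include <string.h>

/* Greedy tiling: at each position take the longest multigram that matches.
   Returns the number of codes emitted, or -1 if no multigram matches. */
int greedy_code_count(const char *s, const char *const grams[], int ngrams)
{
    int count = 0;

    while (*s) {
        size_t best = 0;
        int g;
        for (g = 0; g < ngrams; g++) {
            size_t len = strlen(grams[g]);
            if (len > best && strncmp(s, grams[g], len) == 0)
                best = len;
        }
        if (best == 0)
            return -1;          /* stuck: nothing matches here */
        s += best;
        count++;
    }
    return count;
}
```

Called with the multigrams “A”, “B”, “AB”, “BAA” and the string “ABAA”, it picks “AB”, then “A”, then “A”, for three codes.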

Dynamic Programming
• The computer science solution to such 1D tiling problems is a two-pass algorithm called “Dynamic Programming”.
• For each prefix, the optimal encoded length is one more than the smallest optimal length of any sub-prefix that is followed by a valid suffix multigram.

To encode the string “ABAA” (the empty prefix costs 0, so a multigram covering a whole prefix costs 1):
    encode(“A”) = 1
    encode(“AB”) = min(encode(“A”)+1, 1) = 1
    encode(“ABA”) = encode(“AB”)+1 = 2
    encode(“ABAA”) = min(encode(“ABA”)+1, encode(“A”)+1) = 2
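
The recurrence above can be sketched in C as follows. This is an illustration under stated assumptions, not the smizip implementation: the two passes are interpreted as a forward pass that fills in the cost of every prefix and a backward pass that recovers the chosen multigrams, and the names optimal_encode and MAX_SMILES are hypothetical. Suffix matching uses a plain scan over the multigram list.

```c
#include <limits.h>
#include <string.h>

#define MAX_SMILES 1024   /* illustrative limit on input length */

/* Returns the minimum number of codes needed for s, or -1 if s cannot be
   tiled. On success, out[0..result-1] holds the indices of the chosen
   multigrams in left-to-right order. */
int optimal_encode(const char *s, const char *const grams[], int ngrams,
                   int out[])
{
    int cost[MAX_SMILES + 1];   /* cost[i]: fewest codes for the first i chars */
    int last[MAX_SMILES + 1];   /* last[i]: multigram ending the best tiling */
    size_t n = strlen(s);
    size_t i;
    int g, k;

    if (n > MAX_SMILES)
        return -1;

    /* Pass 1: forward, compute the optimal cost of every prefix. */
    cost[0] = 0;
    for (i = 1; i <= n; i++) {
        cost[i] = INT_MAX;
        last[i] = -1;
        for (g = 0; g < ngrams; g++) {
            size_t len = strlen(grams[g]);
            if (len == 0 || len > i || cost[i - len] == INT_MAX)
                continue;
            if (memcmp(s + i - len, grams[g], len) != 0)
                continue;       /* not a suffix of this prefix */
            if (cost[i - len] + 1 < cost[i]) {
                cost[i] = cost[i - len] + 1;
                last[i] = g;
            }
        }
    }
    if (cost[n] == INT_MAX)
        return -1;

    /* Pass 2: backward, recover the multigrams on the optimal path. */
    k = cost[n];
    for (i = n; i > 0; i -= strlen(grams[last[i]]))
        out[--k] = last[i];
    return cost[n];
}
```

For the multigrams “A”, “B”, “AB”, “BAA” (indices 0 to 3) and the string “ABAA”, this returns 2 with out = {0, 3}, that is “A” followed by “BAA”.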

Trie Construction

FSM Construction

Multigram Training Sets
• train.smi: 36311 SMILES from WDI, NCI, ACD and SYNTH, totaling 1762189 bytes [48.5 bytes/mol].
• train.ism: 27451 isomeric SMILES from WDI and ACD, totaling 2555727 bytes [93.1 bytes/mol].
• train.rism: 17159 reaction SMILES from SYNTH, totaling 4927586 bytes [287.2 bytes/mol].

Training Set Performance

    File        Method           Compressed/original (bytes)   Ratio
    train.smi   smizip           580818/1762189                33.0%
                smizip (renum)   546094/1762189                31.0%
                gzip -9          465891/1762189                26.4%
    train.ism   smizip           654737/2540347                25.8%
                smizip (renum)   610254/2540347                24.0%
                gzip -9          514941/2540347                20.3%
    train.rism  smizip           1425113/4918376               29.0%
                smizip (renum)   1397922/4918376               28.4%
                gzip -9          1071673/4918376               21.8%

Multigram Cross-Validation
Smi+ism is a combination of the 155 best absolute SMILES multigrams and the 30 best isomeric SMILES multigrams.

General Results
• Chemical Database Results

    Database   Contents          Compressed/original (bytes)   Ratio
    ACD002     254865 SMILES     3719633/10277254              36.2%
    NCI00      162148 SMILES     2389223/6132973               38.9%
    WDI011     28298 ISOMERS     893902/3064043                29.2%
    SYNTH97    102934 ISORXNS    8592831/29397074              29.3%

• Oracle Cartridge Results
    • No measurable effect on index creation/insertion time.
    • Cartridge index data is 20% smaller for NCI00.
    • Fingertest, Tanimoto and Tversky are 5-15% faster.
    • Contains and Matches (with triage) are 0-1% slower.

Bibliography
• A. Aho and M. Corasick, "Efficient String Matching: An Aid to Bibliographic Search", Communications of the ACM, Vol. 18, pp. 333-340, 1975.
• Wai-Hong Leung and Steven S. Skiena, "Inducing Codes from Examples", Proceedings of the 1991 Data Compression Conference (DCC91), Eds. James A. Storer and John H. Reif, Snowbird, Utah, Extended Abstract, pp. 267-276, April 1991.
• R.A. Wagner, "Common Phrases and Minimum-Space Text Storage", Communications of the ACM, Vol. 16, pp. 148-152, 1973.
