Modeling Regular Replacement for String Constraints: Addressing Input Vulnerabilities
This presentation describes how to model regular replacement semantics in string constraints, with the goal of automatically discovering vulnerabilities in how servers handle input. It shows how malicious scripts exploit a common error in text sanitization, and it uses finite state transducers (FSTs) to model the alternative replacement semantics precisely. The contributions include an analysis of greedy and reluctant semantics, a compact FST representation, and an efficient method for solving atomic replacement constraints on string inputs.
Presentation Transcript
Modeling Regular Replacement for String Constraint Solving
Xiang Fu, Hofstra University
Chung-Chih Li, Illinois State University
NFM 2010
Background
[Figure: a Hacker sends malicious scripts to a Server; speech bubble "Cool page!"]
• Problem? Lack of sufficient sanitization of text inputs
One Typical Error
    <?php
    $msg = $_POST["msg"];
    $sanitized = preg_replace(
        "/\<script.*?\>.*?\<\/script.*?\>/i",  // .*? is the reluctant Kleene star
        "",
        $msg);
    savetodb($sanitized);
    ?>
• Reluctant Kleene star: .*? matches as few characters as possible
• Attacker's input: <<script></script>script>alert('a')</script>
• After sanitization: <script>alert('a')</script>
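The bypass on this slide can be reproduced outside PHP. The following is a minimal Python sketch, assuming that Python's `re.sub` behaves like `preg_replace` for this pattern; the name `sanitize` is illustrative and does not come from the slides.

```python
import re

# Same pattern as the PHP filter; .*? is the reluctant Kleene star.
SCRIPT_TAG = re.compile(r"<script.*?>.*?</script.*?>", re.IGNORECASE)

def sanitize(msg: str) -> str:
    """Delete every left-most, reluctant match of the script pattern in one pass."""
    return SCRIPT_TAG.sub("", msg)

attack = "<<script></script>script>alert('a')</script>"
# The filter deletes the inner "<script></script>", and the leftover
# pieces "<" and "script>..." join into a fresh script element.
print(sanitize(attack))  # <script>alert('a')</script>
```

The filter is applied once, so text that only becomes a script tag after a deletion is never re-examined.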
Bigger Picture
• Objective: Automatic Discovery of Vulnerabilities
[Figure: tool pipeline with the labels Bytecode, Symbolic Execution, Attack Pattern, SUSHI String Constraint Solver, Test Replayer]
Our Contribution
• Atomic Replacement Constraints
• Consider Two Semantics
  • Greedy
  • Reluctant
• Modeling Using Finite State Transducer (FST)
• Compact Representation of FST
• Security Analysis
Finite State Transducer
• Accepts Regular Relation
• Closed under Union, Concat, Composition
• Not closed under Intersection, Complement
• Used for Modeling Rewriting Rules [Kaplan94, Karttunen96]
[Figure: transducer A with states 1, 2, 3, 4 and transitions labeled ε:1, a:2, b:3; (ab,123) ∈ L(A)]
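To make the pair (ab,123) concrete, here is a minimal sketch of a transducer as a list of labeled transitions, with a membership check for input/output pairs. The `FST` class and its method names are illustrative assumptions. So is the order of the three transitions (1 to 2 to 3 to 4), which the flattened figure does not preserve; the order was chosen to be consistent with (ab,123) ∈ L(A).

```python
from collections import deque

class FST:
    """Finite state transducer; the empty string "" plays the role of ε."""

    def __init__(self, start, finals, transitions):
        self.start = start
        self.finals = set(finals)
        # Each transition is (source, input_label, output_label, target).
        self.transitions = list(transitions)

    def accepts(self, inp, out):
        """Return True if the pair (inp, out) belongs to the relation."""
        seen = set()
        queue = deque([(self.start, 0, 0)])  # (state, input pos, output pos)
        while queue:
            config = queue.popleft()
            if config in seen:
                continue
            seen.add(config)
            state, i, j = config
            if state in self.finals and i == len(inp) and j == len(out):
                return True
            for src, a, b, dst in self.transitions:
                if src == state and inp.startswith(a, i) and out.startswith(b, j):
                    queue.append((dst, i + len(a), j + len(b)))
        return False

# Assumed reading of the figure: 1 --ε:1--> 2 --a:2--> 3 --b:3--> 4
A = FST(start=1, finals={4},
        transitions=[(1, "", "1", 2), (2, "a", "2", 3), (3, "b", "3", 4)])
print(A.accepts("ab", "123"))  # True
```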
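Hierarchical FST & Modeling Declarative Semantics
• Goal: [formula not preserved; its callouts label the regular search pattern r and the replacement ω]
• Id(Σ* - Σ* r Σ*): identity relation on any string not containing pattern r
[Figure: hierarchical FST with states 1, 2, 3, 4 and transitions labeled Id(Σ* - Σ* r Σ*), r : ω, Id(Σ* - Σ* r Σ*), ε:ε]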
Modeling Reluctant Semantics
• Goal: [formula not preserved]
• 2 Steps
  1. Mark the beginning of pattern
  2. Do the replacement
• Key: Left-Most Matching
• Input Word: aabbcdabcabd
• Search Pattern: a+b+c, replaced by x
• After Begin Marker: #a#abbcd#abcabd
• After Replacement: xdxabd
[Figure: FST with states s1, s2, f1 and transitions labeled reluc(r)#' : ω, #:ε, ε:ε, Id(Σ)]
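The two steps on this slide can be imitated directly on strings. The sketch below is not the FST construction itself: it uses Python's `re` module as a stand-in, the function names `mark_begins` and `replace_marked` are illustrative, and it assumes the search pattern never matches the empty string.

```python
import re

def mark_begins(word, pattern, marker="#"):
    """Step 1: put a marker before every position where some match begins."""
    rx = re.compile(pattern)
    out = []
    for i, ch in enumerate(word):
        if rx.match(word, i):
            out.append(marker)
        out.append(ch)
    return "".join(out)

def replace_marked(marked, pattern, omega, marker="#"):
    """Step 2: at a marker, replace the shortest match; drop other markers."""
    rx = re.compile(pattern)
    out, i = [], 0
    while i < len(marked):
        if marked[i] != marker:
            out.append(marked[i])
            i += 1
            continue
        rest = marked[i + 1:].replace(marker, "")
        # Shortest match length; a marker guarantees that one exists.
        k = next(j for j in range(1, len(rest) + 1) if rx.fullmatch(rest, 0, j))
        out.append(omega)
        i += 1
        while k > 0:  # skip the matched letters and any markers inside them
            if marked[i] != marker:
                k -= 1
            i += 1
    return "".join(out)

marked = mark_begins("aabbcdabcabd", "a+b+c")
print(marked)                               # #a#abbcd#abcabd
print(replace_marked(marked, "a+b+c", "x"))  # xdxabd
```

The marker at position 1 lies inside the first match, so step 2 consumes it silently, which corresponds to the #:ε transition in the figure.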
The Challenge: Begin Marker
• Input Word: aabbcdabcabd
• Search Pattern: a+b+c, replaced by x
[Figure: # markers placed in the input word]
• Look-ahead Capability?
• Non-determinism
• 3 Steps:
  1. End marker
  2. Generic end marker
  3. Begin marker
Preliminary End Marker
• Search Pattern: a+b+c, replaced by x
• Reversed Pattern: cb+a+
• Idea: Start with End Marker for Reverse of Search Pattern
• Problem: Input tape accepts cb+a+ only!
[Figure: FST A1 with states 1, 2, 3, 4, 5 and transitions labeled c:c, b:b, b:b, a:a, a:a, ε:$]
Generic End Marker
• Pattern: cb+a+
• Deterministic!
• Input Word: ccbaa
• Output Word: ccba$a$
[Figure: FST A2 with states 1, 2, 3, 4, 5 and (2,1), (3,1), (4,1), (5,1); transitions labeled a:a, b:b, c:c, ε:$]
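The effect of the generic end marker can be checked with a direct string computation. This sketch does not build the transducer A2; it only reproduces its input/output behavior by brute force. The function name `mark_ends` is illustrative.

```python
import re

def mark_ends(word, pattern, marker="$"):
    """Insert a marker after every position where some match of pattern ends."""
    rx = re.compile(pattern)
    out = []
    for j in range(1, len(word) + 1):
        out.append(word[j - 1])
        # Does any substring word[i:j] match the whole pattern?
        if any(rx.fullmatch(word, i, j) for i in range(j)):
            out.append(marker)
    return "".join(out)

print(mark_ends("ccbaa", "cb+a+"))  # ccba$a$
```

Unlike the preliminary end marker, the input may be any word, not only a word in cb+a+.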
Finally, the Begin Marker
• Search Pattern: a+b+c, replaced by x
[Figure: FST A3 with states 0, 1, 2, 3, 4, 5 and (2,1), (3,1), (4,1), (5,1); transitions labeled a:a, b:b, c:c, ε:ε, ε:#]
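The relationship between end markers and begin markers can be illustrated on strings: marking the ends of matches of the reversed pattern in the reversed word, then reversing the result, marks the beginnings of matches of the original pattern. The sketch below assumes this is the idea behind A3, as the "Idea" on the Preliminary End Marker slide suggests; the function names are illustrative, and the reversed pattern is supplied by hand.

```python
import re

def mark_ends(word, pattern, marker):
    """Insert a marker after every position where some match of pattern ends."""
    rx = re.compile(pattern)
    out = []
    for j in range(1, len(word) + 1):
        out.append(word[j - 1])
        if any(rx.fullmatch(word, i, j) for i in range(j)):
            out.append(marker)
    return "".join(out)

def mark_begins_by_reversal(word, reversed_pattern, marker="#"):
    """Mark match beginnings by marking ends on the reversed word."""
    return mark_ends(word[::-1], reversed_pattern, marker)[::-1]

# cb+a+ is the reverse of the search pattern a+b+c.
print(mark_begins_by_reversal("aabbcdabcabd", "cb+a+"))  # #a#abbcd#abcabd
```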
Greedy Semantics
• Goal: greedy replacement [formula not preserved]
• Challenge: Look-ahead for the longest match
• Search Pattern: a+, replaced by x
• Input Word: aabab
• Step 1: Begin Marker: #a#ab#ab
• Step 2: ND End Marker: #a#ab#a$b, #a$#a$b#a$b, #a#a$b#a$b, #a#a$b#ab
• Step 3: Pairing Markers: #aa$b#a$b, #aaba$b
• Step 4: Checking Match: #a$#a$b#a$b
• Step 5: Check Longest
• Step 6: Replacement: xbxb
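The end result of the six steps can be cross-checked with a brute-force left-most, longest-match replacement. This sketch is a stand-in for the marker-based FST pipeline, not an implementation of it. The function `replace_all` and its `longest` flag are illustrative, and the pattern is assumed never to match the empty string.

```python
import re

def replace_all(word, pattern, omega, longest=True):
    """Left-most replacement choosing the longest (greedy) or the
    shortest (reluctant) match at each position."""
    rx = re.compile(pattern)
    out, i = [], 0
    while i < len(word):
        # All end positions j such that word[i:j] matches the pattern.
        ends = [j for j in range(i + 1, len(word) + 1)
                if rx.fullmatch(word, i, j)]
        if ends:
            out.append(omega)
            i = max(ends) if longest else min(ends)
        else:
            out.append(word[i])
            i += 1
    return "".join(out)

print(replace_all("aabab", "a+", "x", longest=True))   # xbxb
print(replace_all("aabab", "a+", "x", longest=False))  # xxbxb
```

The two results differ only in how the leading "aa" is handled, which is exactly where the longest-match look-ahead matters.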
Applications
• Solve String Constraints
• Login Servlet
  • Input: user name
  • After filtering single quotes and applying the length restriction [constraint not preserved]
Solving Atomic Constraint
• Goal: [formula not preserved]
[Figure: A1 and Id(P), then Project to Input Tape, then Solution]
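An atomic replacement constraint asks for an input whose sanitized form falls inside an attack pattern P. The sketch below answers that question by bounded enumeration, which is a deliberately naive stand-in for the FST-based solution on the slide. The names `solve`, `sanitize`, and `TOKENS`, the token alphabet, and the attack pattern are all illustrative assumptions.

```python
import re
from itertools import product

SCRIPT_TAG = re.compile(r"<script.*?>.*?</script.*?>", re.IGNORECASE)

def sanitize(msg):
    """The reluctant script filter from the 'One Typical Error' slide."""
    return SCRIPT_TAG.sub("", msg)

def solve(replace, attack, tokens, max_tokens):
    """Return the first token sequence x with replace(x) containing attack,
    or None if no sequence of at most max_tokens tokens works."""
    attack_rx = re.compile(attack)
    for n in range(max_tokens + 1):
        for combo in product(tokens, repeat=n):
            candidate = "".join(combo)
            if attack_rx.search(replace(candidate)):
                return candidate
    return None

# Hypothetical token alphabet; real inputs range over all of Unicode.
TOKENS = ["<", "script>", "<script>", "</script>"]
solution = solve(sanitize, r"<script>.*</script>", TOKENS, max_tokens=5)
print(repr(solution), "->", repr(sanitize(solution)))
```

Enumeration grows exponentially with the length bound, which is one reason to solve the constraint symbolically on automata instead.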
SUSHI Constraint Solver
• Solves Simple Linear String Constraints (SISE)
• Relies on
  • dk.brics.automaton for FSA operations
  • Self-made Java package for FST operations
• Supports 16-bit Unicode
• Compact Transition Representation
[Figure: transition types Type I, Type II, Type III; combinations (I,I), (II,I), (III,II)]
Efficiency of Solver
• Benchmark Equations
• Login Servlet: 1.4 seconds on a 2 GHz PC
• Flex SDK XSS Attack: equation size 565; 74 seconds; shorter than Security Track #1022748
Related Work
• Forward String Analysis
  • Christensen & Møller [SAS'03]
  • Wassermann & Su [PLDI'07, ICSE'08]
  • Bjørner & Tillmann [TACAS'09]
• Backward String Analysis
  • Kiezun & Ganesh [ISSTA'09]
  • Yu & Bultan [SPIN'08, ASE'09]
  • Fu [COMPSAC'07, TAVWEB'08]
• Natural Language Processing
  • Kaplan and Kay [CL'1994]
• Our Contribution: Precise Modeling of Various Regular Substitution Semantics
Limitations
• SISE String Constraints
  • All Variables Appear on LHS (Once)
  • No Easy Solution for Equation System Yet
  • No string length
• Future Directions
  • Encoding string length in automata
  • Finite model on bit-vector
Questions?