
Using Data Mining Techniques to Learn Layouts of Flat-File Biological Datasets



Presentation Transcript


  1. Using Data Mining Techniques to Learn Layouts of Flat-File Biological Datasets Kaushik Sinha, Xuan Zhang, Ruoming Jin, Gagan Agrawal

  2. Overall Goal • Informatics tools for biological data integration driven by: • Data explosion • Data size & number of data sources • New analysis tools • Autonomous resources • Heterogeneous data representation & various interfaces • Frequent Updates • Common Situations: • Flat-file datasets • Ad-hoc sharing of data

  3. Current Approaches • Manually written wrappers • Problems • O(N²) wrappers needed, O(N) rewritten for a single update • Mediator-based integration systems • Problems • Need a common intermediate format • Unnecessary data transformations • Integration using web/grid services • Needs all tools to be web services (all data in XML?)

  4. Our Approach • Automatically generate wrappers • Transform data in files of arbitrary formats • No domain- or format-specific heuristics • Layout information provided by users • Help biologists write layout descriptors using data mining techniques

  5. Our Approach: Challenges • Description language • Format and logical view of data in flat files • Easy to interpret and write • Wrapper generation and execution • Correspondence between data items • Separating wrapper analysis and execution • Interactive tools for writing layout descriptors • What data mining techniques to use?

  6. Wrapper Generation System Overview • Components: Layout Descriptor, Schema Descriptors, Parser, Mapping Generator, Data Entry Representation, Schema Mapping, Application Analyzer, WRAPINFO, Source Dataset, Target Dataset, DataReader, DataWriter, Synchronizer

  7. Key Open Questions • How hard is it to write layout descriptors? • Given a flat file, how hard is it to learn its layout? • Can we make the process semi-automatic?

  8. Learning Layout of a Flat-File • In general – intractable • Try to learn the layout, then have a domain expert verify it • Key issue: what delimiters are being used?

  9. Finding Delimiters • Difficult problem • Some knowledge from a domain expert is required (semi-automatic) • Naïve approaches • Frequency Counting • Counts frequently occurring single tokens (words separated by spaces) • Sequence Mining • Counts frequently occurring sequences of tokens

  10. Frequency Counting • Problems • Some tokens appear very frequently but are not delimiters • A delimiter could be a sequence of tokens rather than a single token • Possible Solution • Use knowledge of the frequency of a token sequence and all its subsequences to decide on possible delimiter sequences
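The counting step described above can be sketched in a few lines; `token_seq_frequencies` is a hypothetical helper name, and tokens are taken to be whitespace-separated words:

```python
from collections import Counter

def token_seq_frequencies(lines, max_len=3):
    """Count every contiguous token sequence of length 1..max_len.
    Comparing a sequence's count with its subsequences' counts is the
    basis for picking delimiter candidates."""
    freq = Counter()
    for line in lines:
        tokens = line.split()
        for j in range(1, max_len + 1):
            for i in range(len(tokens) - j + 1):
                freq[tuple(tokens[i:i + j])] += 1
    return freq

lines = ["AC P12345", "AC Q67890", "DE some protein"]
freq = token_seq_frequencies(lines)
# freq[("AC",)] == 2 and freq[("AC", "P12345")] == 1
```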

  11. Sequence Mining Example • For any sequence of tokens s, f(s) represents the frequency of s • Let's say A, B, C are tokens • Case 1: • f(ABC)=10, f(AB)=10, f(BC)=10, f(CA)=10 • The information about AB, BC, CA is already embedded in ABC • ABC is a possible delimiter but AB, BC, CA are not • Case 2: • f(ABC)=10, f(AB)=20, f(BC)=10, f(CA)=10 • BC and CA occur less frequently than AB • ABC cannot be a delimiter • AB is a possible delimiter
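The two cases can be turned into a small candidate filter. This is a sketch of the reasoning, not the paper's exact algorithm: a sequence survives only if every shorter counted subsequence occurs exactly as often as it does (ruling out Case 2's ABC), and shorter sequences contained in a surviving longer candidate are dropped (Case 1's AB and BC):

```python
def delimiter_candidates(freq):
    """Sketch of the Case 1 / Case 2 reasoning on a dict mapping token
    tuples to frequencies."""
    def subseqs(s):
        return [s[i:i + m] for m in range(1, len(s))
                for i in range(len(s) - m + 1)]

    def contained(s, t):
        return len(t) > len(s) and any(
            t[i:i + len(s)] == s for i in range(len(t) - len(s) + 1))

    # Case 2 rule: a subsequence with a different count rules a sequence out.
    uniform = {s for s in freq
               if all(freq.get(t, freq[s]) == freq[s] for t in subseqs(s))}
    # Case 1 rule: information about AB, BC is already embedded in ABC.
    return {s for s in uniform
            if not any(contained(s, t) for t in uniform)}

case1 = {("A", "B", "C"): 10, ("A", "B"): 10, ("B", "C"): 10}
case2 = {("A", "B", "C"): 10, ("A", "B"): 20, ("B", "C"): 10}
# case1 keeps only ("A","B","C"); case2 keeps ("A","B") but not ("A","B","C")
```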

  12. Limitations of Sequence Mining • Does not work very well if token frequencies are distributed in a skewed manner • An example where it does not work (Pfam dataset): • \n, #=GF, AC are tokens with • f(\n, #=GF) >> f(#=GF, AC) • f(\n, #=GF) >> f(\n, #=GF, AC) • \n #=GF is concluded to be a possible delimiter • In reality, \n #=GF AC is a delimiter

  13. Can we do better? • Biological datasets are written for humans to read • It is very unlikely that delimiters will be scattered all around, in different places in a line • Position of the possible delimiters might provide useful information • Combination of positional and frequency information might be a better choice

  14. Positional Weight • Let P be the set of different positions in a line where a token can appear • For each position i ∈ P, tot_seq(i,j) represents the total # of token sequences of length j starting at position i • For each position i ∈ P, tot_unique_seq(i,j) represents the total # of unique token sequences of length j starting at position i • For any tuple (i,j), p_ratio(i,j) = tot_seq(i,j) / tot_unique_seq(i,j): a high ratio means the same few sequences keep repeating at that position • p_ratio(i,j) can be log normalized to get the positional weight p_wt(i,j), with the property p_wt(i,j) ∈ (0,1)
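A minimal sketch of the positional weight, assuming p_ratio(i,j) = tot_seq(i,j) / tot_unique_seq(i,j) (the slide's formula image did not survive in the transcript, so the exact form of the ratio and of the log normalisation are assumptions):

```python
import math
from collections import defaultdict

def positional_weights(lines, j):
    """p_ratio(i, j) = tot_seq / tot_unique_seq at each start position i,
    then log-normalised against the maximum ratio so weights land in (0, 1]."""
    tot = defaultdict(int)
    uniq = defaultdict(set)
    for line in lines:
        toks = line.split()
        for i in range(len(toks) - j + 1):
            tot[i] += 1
            uniq[i].add(tuple(toks[i:i + j]))
    ratio = {i: tot[i] / len(uniq[i]) for i in tot}
    r_max = max(ratio.values())
    return {i: math.log(1 + r) / math.log(1 + r_max)
            for i, r in ratio.items()}

lines = ["AC alpha", "AC beta", "AC gamma", "ID x1"]
p_wt = positional_weights(lines, j=1)
# position 0 (only "AC" or "ID" ever appears) outweighs position 1
```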

  15. Delimiter score (d_score) • The frequency weight for any token sequence s(i,j) of length j starting at position i, f_wt(s(i,j)), is obtained by log normalizing its frequency f(s(i,j)) • Obviously, f_wt(s(i,j)) ∈ (0,1) • The positional and frequency weights can now be combined to get the d_score as follows: • d_score(s(i,j)) = α · p_wt(i,j) + (1−α) · f_wt(s(i,j)), where α ∈ (0,1) • Thus d_score has the following two properties: • d_score(s(i,j)) ∈ (0,1) • d_score(s(i,j)) > d_score(s(k,j)) implies s(i,j) is more likely to be a delimiter than s(k,j)
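The combination can be written down directly from the slide's formula; the particular log-normalisation of f against a maximum frequency f_max is an assumption about how f_wt is obtained:

```python
import math

def d_score(p_wt, f, f_max, alpha=0.5):
    """d_score = alpha * p_wt + (1 - alpha) * f_wt, where f_wt is the
    log-normalised frequency of the token sequence."""
    f_wt = math.log(1 + f) / math.log(1 + f_max)
    return alpha * p_wt + (1 - alpha) * f_wt
```

A higher d_score marks a more likely delimiter; α trades positional evidence against frequency evidence.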

  16. Finding delimiters using d_score • Since the delimiter sequence length is not known in advance, an iterative algorithm is used to get a superset S of potential delimiters, where S = ∪i Ni • At any iteration i, ci represents the cut-off value, which is determined by observing a substantial difference in the sorted d_score values • All token sequences with d_score above ci form the set Ni
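One iteration of the cut-off step might look like this sketch: sort the d_scores and cut at the largest drop (the "substantial difference") in the sorted values:

```python
def above_cutoff(scores):
    """Return the token sequences above the biggest gap in the sorted
    d_score values -- the set N_i for one iteration."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    vals = [v for _, v in ranked]
    gaps = [vals[k] - vals[k + 1] for k in range(len(vals) - 1)]
    cut = gaps.index(max(gaps))          # position of the largest drop
    return {s for s, _ in ranked[:cut + 1]}

scores = {("\\n",): 0.91, ("#=GF",): 0.88, ("the",): 0.35, ("of",): 0.30}
# the largest drop is after 0.88, so the first two sequences are kept
```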

  17. Generating layout descriptor • Once the delimiters are identified, an NFA can be built by scanning the whole dataset, with the delimiters as the states of the NFA • This NFA can be used to generate a layout descriptor, since it nicely represents optional and repeating states • The slide shows an example NFA where A, B, C, D, and E are delimiters, with B being an optional delimiter and C D being a repeating pair
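The construction can be sketched by recording transitions between consecutive delimiters in each record; the delimiter names A–E follow the slide's example:

```python
def nfa_transitions(records):
    """Each record is the ordered list of delimiters found in it; the NFA
    states are the delimiters, and the edges are the observed transitions.
    Skip edges reveal optional delimiters, back edges reveal repeats."""
    edges = set()
    for seq in records:
        for a, b in zip(seq, seq[1:]):
            edges.add((a, b))
    return edges

# B is optional (skipped in the second record); C D repeats there.
records = [["A", "B", "C", "D", "E"],
           ["A", "C", "D", "C", "D", "E"]]
edges = nfa_transitions(records)
# ("A", "C") shows B is optional; ("D", "C") is the C-D repeat loop
```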

  18. Realistic Situation • The task of identifying the complete list of correct delimiters is difficult • Most likely we will end up with an incomplete list of delimiters • The delimiters which do not appear in every data record (the optional ones) are the most likely to be missed

  19. Identifying Optional Delimiters • Given an incomplete list of delimiters, how can we identify optional delimiters, if any? • Build an NFA based on the given incomplete information • Perform clustering to identify possible crucial delimiters • Perform contrast analysis

  20. Crucial delimiter • A delimiter is considered crucial if a missing delimiter may appear immediately following it • The goal is to create two clusters: • one containing delimiters which are not crucial • the other containing crucial delimiters

  21. Identifying crucial delimiters: a few definitions • Succ(X): set of delimiters that can immediately follow X • Dist_App: # of groups of occurrences of X, based on the # of text lines between X and the immediately next delimiter • Info_Tuple (nXi, fXi, tXi): information for each Dist_App • Info_Tuple_List LX: for any X, the list of all possible Info_Tuples

  22. Metric for clustering • For each delimiter X, a frequency ratio rXf is computed from its Info_Tuple_List; rXf is likely to be low if an optional delimiter appears immediately after X, and high otherwise • Choose a suitable cut-off value rc and assign delimiters to the two groups as follows: • If rXf < rc, assign X to the group of possible crucial delimiters • Else assign X to the group of non-crucial delimiters
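The clustering step then reduces to thresholding the ratio; the rXf values below are made up for illustration:

```python
def split_by_ratio(ratios, r_cut):
    """Delimiters whose ratio falls below r_cut go into the 'possibly
    crucial' cluster; the rest are non-crucial and can be pruned."""
    crucial = {x for x, r in ratios.items() if r < r_cut}
    return crucial, set(ratios) - crucial

ratios = {"#=GF": 0.2, "\\n\\n": 0.9, "AC": 0.85}  # hypothetical values
crucial, non_crucial = split_by_ratio(ratios, r_cut=0.5)
# crucial == {"#=GF"}
```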

  23. Observations and Facts • Missing optional delimiters can appear immediately after crucial delimiters ONLY • Non-crucial delimiters can therefore be pruned away • Consider two Info_Tuples (nX1, fX1, tX1) and (nX2, fX2, tX2) in LX • If a missing delimiter appears immediately after the appearances corresponding to the first tuple but not the second: • nX1 > nX2 • The missing delimiter will appear in tX1 but not in tX2

  24. A hypothetical example illustrating Contrast Analysis • Suppose X is a crucial delimiter having 2 Info_Tuples, L1 and L2, as follows: • L1 = (50, 20, l1.txt) • L2 = (20, 12, l2.txt) • Sequence mining on l1.txt and l2.txt yields two sets of frequently occurring sequences, S1 and S2: • S1 = {f1, f5, f6, f8, f13, f21} • S2 = {f1, f4, f6, f7, f8, f10, f13, f21} • Since f5 ∈ S1 but f5 ∉ S2, f5 is a possible missing delimiter • f5 is a missing delimiter only if it has a high d_score or is verified by a domain expert as a valid delimiter
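The comparison in the example (the transcript dropped the membership symbols) is just a set difference; S1 and S2 below are the slide's frequent-sequence sets:

```python
def possible_missing(fs_big, fs_small):
    """Frequent sequences mined from the larger-n_X text but absent from
    the smaller one are candidate missing delimiters."""
    return fs_big - fs_small

S1 = {"f1", "f5", "f6", "f8", "f13", "f21"}
S2 = {"f1", "f4", "f6", "f7", "f8", "f10", "f13", "f21"}
# possible_missing(S1, S2) == {"f5"}
```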

  25. Contrast Analysis • For any i, j with nXi > nXj, look for frequently occurring sequences in tXi and tXj; call them fsXi and fsXj respectively • If there exists a frequent sequence fs such that fs ∈ fsXi but fs ∉ fsXj, then fs is quite likely to be a possible delimiter • If fs has a fairly high d_score or is identified by a domain expert as a valid delimiter, add it to the incomplete list as a newly found delimiter

  26. Generalized Contrast Analysis • In the case of more than two Info_Tuples, compute the mean of all nXi values • Form one group by appending the text from all Info_Tuples where nXi is greater than the mean • Form another group by appending the text from all Info_Tuples where nXi is at most the mean • Perform contrast analysis among all such possible groups

  27. Another example illustrating Generalized Contrast Analysis • Suppose X is a crucial delimiter having 3 Info_Tuples, L1, L2, L3, as follows: • L1 = (50, 20, l1.txt) • L2 = (20, 12, l2.txt) • L3 = (15, 10, l3.txt) • Mean number of lines = (50 + 20 + 15) / 3 ≈ 28.3 • Append l2.txt and l3.txt (both below the mean); call the result t2.txt • Sequence mining on l1.txt and t2.txt yields two sets of frequently occurring sequences, S1 and S2: • S1 = {f1, f5, f6, f8, f13, f21} • S2 = {f1, f4, f6, f7, f8, f10, f13, f21} • Since f5 ∈ S1 but f5 ∉ S2, f5 is a possible missing delimiter • f5 is a missing delimiter only if it has a high d_score or is verified by a domain expert as a valid delimiter
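The grouping by the mean of the n values can be sketched directly from the example's numbers:

```python
def group_by_mean(info_tuples):
    """Split Info_Tuples (n, f, text) at the mean of their n values:
    above-mean texts form one group, the rest are appended into the other."""
    mean = sum(n for n, _, _ in info_tuples) / len(info_tuples)
    above = [t for n, _, t in info_tuples if n > mean]
    below = [t for n, _, t in info_tuples if n <= mean]
    return above, below

tuples = [(50, 20, "l1.txt"), (20, 12, "l2.txt"), (15, 10, "l3.txt")]
above, below = group_by_mean(tuples)
# mean ≈ 28.3, so above == ["l1.txt"] and below == ["l2.txt", "l3.txt"]
```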

  28. Overall Algorithms

  29. Results: Optional delimiters • % Pruning=

  30. Results: Non-optional Missing delimiters • Even though it is designed for finding optional delimiters, our algorithm works, in some cases, for missing non-optional delimiters too • If a missing non-optional delimiter appears in exactly the same location in each record, then our algorithm fails • If a non-optional delimiter has a backward edge coming from a delimiter that appears later in a topologically sorted NFA, then our algorithm works

  31. Summary • Semi-automatic tool for learning the layout of a flat-file dataset • Mechanism for identifying missing optional delimiters • Automatic tool for wrapper generation • Once the layout descriptor is known • Can ease integration of new/updated sources

  32. Questions..
