REU Summer 2009 Project

REU Summer 2009 Project: Association Rule Preprocessing. By: Walter Garcia, University of Houston - Downtown

Project Goals
• Convert Heartfelt Study data set into MAFIA format.
• Run the converted data set through the MAFIA program to find maximal frequent item sets.
• Convert the MAFIA output into SemAna format.
• Run the converted data set through the SemAna program to find unknown and correct relations.
• Use our method to validate other studies that have been performed on the Heartfelt Study.
• Find a relation that is interesting, useful, and correct that has not yet been discovered.

The Heartfelt Study examined 383 children aged 11-16 years: 140 African-American, 117 Hispanic, and 126 Non-Hispanic White. The original Heartfelt data set contains 16911 unique transactions, and each transaction contains 101 different attributes (items) such as heart rate, age, posture, BMI, and obesity. Here is a screenshot of the file.

MAFIA is an acronym for MAximal Frequent Itemset Algorithm. It finds the maximal frequent itemsets (subsets) in a transactional dataset. MAFIA accepts input in the format below. As you can see, every transaction is a set of integers. However, the original dataset includes integers, real numbers, and “?” characters that represent missing data. Figures: MAFIA Format, Original Format.

I used WEKA, a program from the University of Waikato, to analyze and discretize the values of each attribute into 10 or fewer unique items. This assigned a unique integer to each item, as required by the MAFIA program.
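A minimal sketch of the discretization idea, assuming equal-width bins with right-closed intervals. The slides do not say which WEKA filter or settings were used, so the function name `equal_width_bin`, the attribute range, and the bin count below are illustrative assumptions, not the actual configuration.

```python
import math

def equal_width_bin(value, lo, hi, bins=10):
    """Return the 0-based index of the equal-width bin that holds value.

    Bins are right-closed intervals (a, b]. A "?" marks missing data and
    yields None. lo and hi are the assumed minimum and maximum of the attribute.
    """
    if value == "?":
        return None
    x = float(value)
    if hi <= lo:
        return 0  # constant attribute: everything falls into a single bin
    width = (hi - lo) / bins
    index = math.ceil((x - lo) / width) - 1
    # Clamp so the minimum lands in the first bin and the maximum in the last.
    return min(max(index, 0), bins - 1)

# Hypothetical attribute ranging from 1 to 5, split into 5 bins of width 0.8.
for raw in ["1", "2.0", "3.0", "?"]:
    print(raw, "->", equal_width_bin(raw, lo=1.0, hi=5.0, bins=5))
```

Each (attribute, bin) pair would then need its own integer so that no two items share an ID; how those integers were chosen in this project is recorded only in the cross-reference spreadsheet.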

After discretizing the items in each attribute, I saved the results in an Excel file for cross-referencing later. Here is a screenshot:

My program converts the original itemset file into MAFIA format by performing the following actions:
• Reads a transaction as a STRING
• Converts the STRING into a character array
• Tokenizes the char array into multiple character arrays and, as it tokenizes, writes the matching integer value for each token to an outputset.ascii file
• Repeats until the end of the file
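The conversion steps above can be sketched as follows. This is not the original program: the comma separator, the `lookup` callback (standing in for the cross-reference table), and the choice to skip missing values are all assumptions made for illustration.

```python
def to_mafia_line(line, lookup, sep=","):
    """Convert one raw transaction into a space-separated line of item IDs.

    lookup(position, token) returns the integer item ID for the token found
    at that attribute position, or None when no item should be emitted
    (assumed here for missing values marked "?").
    """
    ids = []
    for position, token in enumerate(line.strip().split(sep)):
        item = lookup(position, token.strip())
        if item is not None:
            ids.append(str(item))
    return " ".join(ids)

def convert(lines, lookup, sep=","):
    """Convert every non-empty transaction line; returns a list of strings."""
    return [to_mafia_line(line, lookup, sep) for line in lines if line.strip()]

# Toy lookup table with made-up item IDs (not taken from the real data set).
table = {(0, "1"): 11, (1, "0"): 20}
lookup = lambda position, token: table.get((position, token))

print(convert(["1,0,?\n"], lookup))  # ['11 20']
```

Writing the returned lines to a file, one per row, would give the one-transaction-per-line integer layout described for the MAFIA input.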

Once the program completes the conversion, the outputset file looks like this: Ready for MAFIA!

When the outputset file is loaded into the MAFIA program, we get the following output:

What does the output mean? In the previous example we ran MAFIA with the following parameters:
mafia -mfi .7 -ascii outputset.ascii mfi.txt
This means that MAFIA reads the input file in ASCII format and finds the maximal frequent subsets of the dataset with a minimum support of 70%, that is, subsets found in at least 11838 of the 16911 transactions. The results are written to mfi.txt.
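The 11838 figure follows from the 70% threshold and the 16911 transactions. A small sketch of that arithmetic, assuming the threshold is rounded up to the next whole transaction (the helper name `min_support_count` is made up for this example):

```python
import math

def min_support_count(fraction, n_transactions):
    """Smallest number of transactions that satisfies the support fraction."""
    # The tiny epsilon guards against floating-point noise pushing an exact
    # integer product just above itself before rounding up.
    return math.ceil(fraction * n_transactions - 1e-9)

print(min_support_count(0.7, 16911))  # 11838, since 0.7 * 16911 = 11837.7
```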

What does the output mean? If we take one line from the MAFIA output (MFI) file, we can find out what it means by cross-referencing it with the Excel file. For example, we examine the line below. The number in parentheses means that the subset {351, 314, 239, 136} was found 11874 times in the dataset.
351 314 239 136 (11874)
By looking at the Excel file:
351 means a RELAX1 selection of 1 (Child was relaxed)
314 means a TAXHYN selection of 0 (Anger Traits were High)
239 means an AGE2 selection of <= 14 (Age less than or equal to 14 years)
136 means a RAW.S.AN selection of <= 113 (Raw Trait Anger score less than or equal to 113)
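The cross-referencing described on this slide can be automated. The sketch below parses one line in the layout shown (item IDs followed by the count in parentheses) and looks each ID up in a dictionary. The four labels are the ones given on the slide; the function name `parse_mfi_line` and the dictionary itself are illustrative stand-ins for the Excel table.

```python
import re

# Labels copied from the slide; a real table would cover every item ID.
LABELS = {
    351: "RELAX1 = 1",
    314: "TAXHYN = 0",
    239: "AGE2 <= 14",
    136: "RAW.S.AN <= 113",
}

def parse_mfi_line(line):
    """Split '351 314 239 136 (11874)' into ([351, 314, 239, 136], 11874)."""
    match = re.fullmatch(r"\s*((?:\d+\s+)+)\((\d+)\)\s*", line)
    if match is None:
        raise ValueError(f"unexpected line layout: {line!r}")
    items = [int(token) for token in match.group(1).split()]
    return items, int(match.group(2))

items, support = parse_mfi_line("351 314 239 136 (11874)")
for item in items:
    # Unknown IDs are reported as such instead of raising an error.
    print(item, "->", LABELS.get(item, "unknown item"))
print("found in", support, "transactions")
```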

What does the output mean? One problem with the MAFIA output that we saw in the previous slide is that MAFIA reports every frequent subset it finds, including subsets that are trivial or incorrect. What we need now is a way to filter the MAFIA output to find subsets that are interesting, useful, unknown, and correct. For this we use a program called SemAna (Semantic Analyzer). When the MAFIA output is converted and processed through SemAna, it places all of the trivial subsets in a file called trivial.rule and all of the unknown and correct subsets in a file called UnKnownCorrect.rule. Here is a screenshot of the unknown/correct file.

Validating our Method To validate our findings (frequent subsets), I am comparing them with studies that other scientists have already performed on the Heartfelt Study to see whether they match.

Validating our Method The first study I analyzed was “Blood Pressure and Sexual Maturity in Adolescents” found in the American Journal of Human Biology (2001). This study found that Systolic Blood Pressure in adolescents increases as their Sexual Maturity increases.

Validating our Method Using our method I found the subsets below. This shows that as the TANNER (Sexual Maturity Measurement) increases Systolic Blood Pressure also increases.
TANNER='(1.8-2.6]' MATURE=0 SBP='(102.75-119]' ZHTCM='(-1.96-.7792]'
TANNER='(1.8-2.6]' MATURE=0 SBP='(102.75-119]' OBESITY=0
TANNER='(2.6-3.4]' MATURE=0 SBP='(102.75-119]' WHRATIO='(.775-.85]'
TANNER='(2.6-3.4]' SBP='(119-135.25]' MATURE=0
TANNER='(3.4-4.2]' SBP='(119-135.25]' APWAIST='(64.88-78.06]' MATURE=1 OBESITY=0
TANNER='(3.4-4.2]' SBP='(119-135.25]' MAP='(80-94]' MATURE=1
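One way to read the trend out of the subsets on this slide is to group the SBP intervals by TANNER interval. The sketch below does that for the six subsets, assuming each subset begins at a TANNER item; the function names are made up for this example, and it only regroups the listed subsets without performing any statistical test.

```python
import re
from collections import defaultdict

SUBSETS = [
    "TANNER='(1.8-2.6]' MATURE=0 SBP='(102.75-119]' ZHTCM='(-1.96-.7792]'",
    "TANNER='(1.8-2.6]' MATURE=0 SBP='(102.75-119]' OBESITY=0",
    "TANNER='(2.6-3.4]' MATURE=0 SBP='(102.75-119]' WHRATIO='(.775-.85]'",
    "TANNER='(2.6-3.4]' SBP='(119-135.25]' MATURE=0",
    "TANNER='(3.4-4.2]' SBP='(119-135.25]' APWAIST='(64.88-78.06]' MATURE=1 OBESITY=0",
    "TANNER='(3.4-4.2]' SBP='(119-135.25]' MAP='(80-94]' MATURE=1",
]

def parse_subset(line):
    """Turn "NAME='(a-b]' OTHER=0" into {'NAME': '(a-b]', 'OTHER': '0'}."""
    pairs = re.findall(r"([\w.]+)=('[^']*'|\S+)", line)
    return {name: value.strip("'") for name, value in pairs}

def sbp_by_tanner(lines):
    """Map each TANNER interval to the sorted SBP intervals seen with it."""
    groups = defaultdict(set)
    for line in lines:
        items = parse_subset(line)
        if "TANNER" in items and "SBP" in items:
            groups[items["TANNER"]].add(items["SBP"])
    return {tanner: sorted(sbp) for tanner, sbp in groups.items()}

for tanner, sbp in sbp_by_tanner(SUBSETS).items():
    print(tanner, "->", sbp)
```

In this grouping the lowest TANNER interval appears only with the lower SBP interval, the middle one with both, and the highest only with the higher SBP interval.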

Future Work
• I will continue to validate more studies using our method.
• Find a relation that is interesting, useful, and correct that has not yet been discovered.
