
REU Summer 2009 Project

REU Summer 2009 Project. Association Rule Preprocessing By: Walter Garcia University of Houston - Downtown. Project Goals. Convert Heartfelt Study data set into MAFIA format. Run the converted data set through the MAFIA program to find maximal frequent item sets.


Presentation Transcript


  1. REU Summer 2009 Project Association Rule Preprocessing By: Walter Garcia University of Houston - Downtown

  2. Project Goals • Convert Heartfelt Study data set into MAFIA format. • Run the converted data set through the MAFIA program to find maximal frequent item sets. • Convert the MAFIA output into SemAna format. • Run the converted data set through the SemAna program to find unknown and correct relations. • Use our method to validate other studies that have been performed on the Heartfelt Study. • Find a relation that is interesting, useful, and correct that has not yet been discovered.

  3. The Heartfelt Study examined 383 children aged 11-16 years: 140 African-American, 117 Hispanic, and 126 Non-Hispanic White. The original Heartfelt itemset contains 16911 unique transactions, and each transaction contains 101 different attributes (items) such as heart rate, age, posture, BMI, obesity, etc. Here is a screenshot of the file.

  4. MAFIA is an acronym for MAximal Frequent Itemset Algorithm. It finds the maximal frequent itemsets in a transactional dataset, that is, the frequent itemsets that have no frequent superset. MAFIA accepts input in the format below. As you can see, every transaction is a set of integers. However, the original dataset includes integers, real numbers, and “?” characters that represent missing data. (Screenshots: MAFIA format and original format.)
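To make "maximal frequent itemset" concrete, here is a minimal brute-force sketch on a toy dataset. It only illustrates the definition; it is not how MAFIA itself searches, and the function name and the toy transactions are made up for this example.

```python
from itertools import combinations

def maximal_frequent_itemsets(transactions, min_count):
    """Brute-force illustration of the definition (not MAFIA's search).

    transactions: list of frozensets of integer items
    min_count:    minimum number of transactions an itemset must appear in
    """
    items = sorted(set().union(*transactions))
    frequent = []
    # Enumerate every non-empty candidate itemset and count its support.
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            count = sum(1 for t in transactions if cset <= t)
            if count >= min_count:
                frequent.append(cset)
    # An itemset is maximal if no other frequent itemset strictly contains it.
    return [s for s in frequent if not any(s < other for other in frequent)]

# Toy data (hypothetical): 5 transactions over items 1, 2, 3.
toy = [frozenset(t) for t in ([1, 2, 3], [1, 2, 3], [1, 2], [2, 3], [1, 2, 3])]
mfi = maximal_frequent_itemsets(toy, min_count=4)
```

With a threshold of 4 out of 5 transactions, {1, 2} and {2, 3} are frequent and have no frequent superset, so they are the maximal frequent itemsets; {1}, {2}, and {3} are frequent but not maximal.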

  5. I used WEKA, a program from the University of Waikato, to analyze and discretize the values of each attribute into 10 or fewer unique items. This assigned a unique integer to each item, as required by the MAFIA program.
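The discretization idea can be sketched as follows. This is a minimal illustration that assumes equal-width bins with right-closed intervals such as '(1.8-2.6]'; the slides do not say which WEKA filter or settings were used, and the attribute ranges and the numbering scheme below are hypothetical.

```python
import math

def equal_width_bin(value, lo, hi, bins=10):
    """Return the 0-based index of the right-closed bin (a, b] holding value.

    Assumes equal-width bins over [lo, hi]; values at or below lo fall in
    bin 0 and values above hi are clamped into the last bin.
    """
    if hi == lo:
        return 0
    width = (hi - lo) / bins
    idx = math.ceil((value - lo) / width) - 1
    return min(max(idx, 0), bins - 1)

def build_item_ids(attribute_bins):
    """Assign consecutive integers, starting at 1, to every (attribute, bin)."""
    ids = {}
    next_id = 1
    for attr, n_bins in attribute_bins:
        for b in range(n_bins):
            ids[(attr, b)] = next_id
            next_id += 1
    return ids

# Hypothetical example: a score on [1, 5] cut into 5 bins of width 0.8,
# and a binary attribute with 2 values.
item_ids = build_item_ids([("TANNER", 5), ("OBESITY", 2)])
tanner_bin = equal_width_bin(2.0, lo=1, hi=5, bins=5)   # falls in (1.8, 2.6]
tanner_item = item_ids[("TANNER", tanner_bin)]
```

Numbering items consecutively across attributes guarantees that no two (attribute, bin) pairs share an integer, which is what a transactional format of plain integers needs.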

  6. After discretizing the items in each attribute, I saved the results in an Excel file for cross-referencing later. Here is a screenshot:

  7. My program converts the original itemset file into MAFIA format by performing the following actions: • Read a transaction as a string • Convert the string into a character array • Tokenize the character array into separate tokens and, for each token, write the matching integer value to the outputset.ascii file • Repeat until the end of the file
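The conversion steps above can be sketched in a few lines. This is not the author's program: the comma delimiter, the attribute order, the lookup table, and the choice to skip "?" values are all assumptions made for illustration.

```python
def convert_line(line, attributes, lookup, sep=","):
    """Convert one transaction line into a space-separated list of item IDs.

    attributes: attribute names in column order (assumed known)
    lookup:     dict mapping (attribute, discretized value) -> integer item
    Missing values ("?") are skipped here; the original program may have
    handled them differently.
    """
    tokens = [tok.strip() for tok in line.strip().split(sep)]
    item_ids = []
    for attr, tok in zip(attributes, tokens):
        if tok == "?":
            continue
        item_ids.append(lookup[(attr, tok)])
    return " ".join(str(i) for i in item_ids)

def convert_file(in_path, out_path, attributes, lookup, sep=","):
    """Convert every non-empty line of in_path and write it to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.strip():
                dst.write(convert_line(line, attributes, lookup, sep) + "\n")

# Hypothetical three-attribute example.
attributes = ["RELAX1", "TAXHYN", "AGE2"]
lookup = {("RELAX1", "1"): 351, ("TAXHYN", "0"): 314}
converted = convert_line("1,0,?", attributes, lookup)
```

Each output line is one transaction written as integers separated by spaces, matching the "set of integers" input format described for MAFIA.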

  8. Once the program completes the conversion, the outputset file looks like this: Ready for MAFIA!

  9. When the outputset file is loaded into the MAFIA program, we get the following output:

  10. What does the output mean? In the previous example we ran MAFIA with the following parameters: mafia -mfi .7 -ascii outputset.ascii mfi.txt This means that MAFIA reads the input file in ASCII format and finds the maximal frequent itemsets with a minimum support of 70%, that is, itemsets found in at least 11838 of the 16911 transactions.
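The 11838 figure follows from rounding 70% of 16911 up to the next whole transaction. A small sketch of that arithmetic (the function name is made up for this example):

```python
import math
from fractions import Fraction

def min_support_count(n_transactions, min_support):
    """Smallest whole number of transactions that meets the support fraction.

    Fraction(str(...)) keeps the arithmetic exact, so a product that lands
    exactly on an integer is not pushed up by floating-point error.
    """
    return math.ceil(Fraction(str(min_support)) * n_transactions)

threshold = min_support_count(16911, 0.7)   # 0.7 * 16911 = 11837.7, rounded up
```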

  11. What does the output mean? If we take one line from the MAFIA output MFI file, we can find out what it means by cross-referencing it with the Excel file. For example, we examine the line below. The number in parentheses means that the subset {351, 314, 239, 136} was found 11874 times in the dataset. 351 314 239 136 (11874) By looking at the Excel file: • 351 means a RELAX1 selection of 1 (Child was relaxed) • 314 means a TAXHYN selection of 0 (Anger Traits were High) • 239 means an AGE2 selection of <= 14 (Age less than or equal to 14 years) • 136 means a RAW.S.AN selection of <= 113 (Raw Trait Anger score less than or equal to 113)
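The cross-referencing step can be automated once the Excel mapping is available as a lookup table. The sketch below assumes each MFI line has the form shown on this slide (item IDs followed by the count in parentheses) and uses a hand-typed dictionary in place of the Excel file; only the four codes from the slide are included.

```python
import re

# Item IDs separated by whitespace, then the support count in parentheses.
MFI_LINE = re.compile(r"^\s*((?:\d+\s+)+)\((\d+)\)\s*$")

def parse_mfi_line(line):
    """Split an MFI output line into (list of item IDs, support count)."""
    match = MFI_LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognized MFI line: {line!r}")
    items = [int(tok) for tok in match.group(1).split()]
    return items, int(match.group(2))

# Stand-in for the Excel cross-reference file (codes taken from the slide).
codebook = {
    351: "RELAX1 = 1",
    314: "TAXHYN = 0",
    239: "AGE2 <= 14",
    136: "RAW.S.AN <= 113",
}

items, count = parse_mfi_line("351 314 239 136 (11874)")
labels = [codebook[i] for i in items]
support = count / 16911   # fraction of the 16911 transactions
```

For the example line, the support works out to roughly 70.2% of the transactions, just above the 70% threshold used in the MAFIA run.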

  12. What does the output mean? One problem with the MAFIA output that we saw in the previous slide is that MAFIA will find every single frequent subset, including subsets that are trivial or incorrect. What we need now is a way to filter the MAFIA output to find subsets that are interesting, useful, unknown, and correct. For this we use a program called SemAna (Semantic Analyzer). When the MAFIA output set is converted and processed through the SemAna program, it places all of the trivial subsets in a file called trivial.rule and all of the unknown and correct subsets in a file called UnKnownCorrect.rule. Here is a screenshot of the unknown/correct file.

  13. Validating our Method In order to validate our findings (frequent subsets) I am comparing our results to studies that have already been performed by other scientists on the Heartfelt Study to see if they match.

  14. Validating our Method The first study I analyzed was “Blood Pressure and Sexual Maturity in Adolescents” found in the American Journal of Human Biology (2001). This study found that Systolic Blood Pressure in adolescents increases as their Sexual Maturity increases.

  15. Validating our Method Using our method I found the subsets below. This shows that as TANNER (Sexual Maturity Measurement) increases, Systolic Blood Pressure also increases. • TANNER='(1.8-2.6]' MATURE=0 SBP='(102.75-119]' ZHTCM='(-1.96-.7792]' • TANNER='(1.8-2.6]' MATURE=0 SBP='(102.75-119]' OBESITY=0 • TANNER='(2.6-3.4]' MATURE=0 SBP='(102.75-119]' WHRATIO='(.775-.85]' • TANNER='(2.6-3.4]' SBP='(119-135.25]' MATURE=0 • TANNER='(3.4-4.2]' SBP='(119-135.25]' APWAIST='(64.88-78.06]' MATURE=1 OBESITY=0 • TANNER='(3.4-4.2]' SBP='(119-135.25]' MAP='(80-94]' MATURE=1

  16. Future Work • I will continue to validate more studies using our method. • Find a relation that is interesting, useful, and correct that has not yet been discovered.
