
Patients.txt



Presentation Transcript


  1. Patients.txt

     Variable Name   Description                Type        Valid Values
     PATNO           Patient Number             Character   Numerals
     GENDER          Gender                     Character   'M' or 'F'
     VISIT           Visit Date                 MMDDYY10    Any valid date
     HR              Heart Rate                 Numeric     40 to 100
     SBP             Systolic Blood Pressure    Numeric     80 to 200
     DBP             Diastolic Blood Pressure   Numeric     60 to 120
     DX              Diagnosis Code             Character   1 to 3 digits
     AE              Adverse Event              Character   '0' or '1'

  2. Patients.txt

  3. Patients.txt • Same variable table as slide 1.

  4. Distribution

  5. Some invalid values

  6. HR - Heart Rate (between 40 and 100) • SBP - Systolic Blood Pressure (between 80 and 200) • DBP - Diastolic Blood Pressure (between 60 and 120)

  7. DBP - Diastolic Blood Pressure (between 60 and 120)

  8. DBP - Diastolic Blood Pressure (between 60 and 120)

  9. SBP - Systolic Blood Pressure (between 80 and 200)

  10. SBP - Systolic Blood Pressure (between 80 and 200)

  11. HR - Heart Rate (between 40 and 100)
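
The range checks on the slides above can be sketched in a few lines of Python. This is only an illustration: the valid ranges come from the variable table, while the sample records, the `out_of_range` helper, and the decision to treat a missing value as invalid are assumptions made for the example.

```python
# Valid ranges taken from the slides; everything else here is hypothetical.
VALID_RANGES = {"HR": (40, 100), "SBP": (80, 200), "DBP": (60, 120)}


def out_of_range(record):
    """Return the names of fields that are missing or outside their valid range."""
    bad = []
    for name, (low, high) in VALID_RANGES.items():
        value = record.get(name)
        # Assumption: a missing value is reported together with out-of-range values.
        if value is None or not (low <= value <= high):
            bad.append(name)
    return bad


# Made-up sample rows, not data from Patients.txt.
patients = [
    {"PATNO": "001", "HR": 88, "SBP": 140, "DBP": 80},
    {"PATNO": "002", "HR": 210, "SBP": 120, "DBP": 70},
    {"PATNO": "003", "HR": 68, "SBP": 300, "DBP": 20},
]

for row in patients:
    problems = out_of_range(row)
    if problems:
        print(row["PATNO"], "invalid:", ", ".join(problems))
```

Keeping the limits in one dictionary means a new numeric variable can be checked by adding a single entry.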

  12. Data integration • Combining (merging) data from heterogeneous data sources. • It is the process of combining data that reside in different sources (internal and external data sources) • and providing the user with a unified view of these data.

  13. SCHEMA INTEGRATION • Different sources may use different schema representations or definitions while referring to the same information. • This is known as the entity identification problem.

  14. For example • How can we identify that customer_id in one data set and customer_no in another refer to the same entity?
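
One simple heuristic for this question is to compare column names token by token. The sketch below is an assumption for illustration, not the method used in the presentation: the helpers `name_tokens`, `name_similarity`, and `best_match`, and all column names other than `customer_id` and `customer_no`, are hypothetical. Name similarity alone cannot confirm that two columns describe the same entity; it only proposes candidates for review.

```python
def name_tokens(column_name):
    """Split a column name such as 'customer_id' into lowercase tokens."""
    return set(column_name.lower().split("_"))


def name_similarity(a, b):
    """Jaccard similarity of the token sets of two column names (0.0 to 1.0)."""
    tokens_a, tokens_b = name_tokens(a), name_tokens(b)
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def best_match(column, candidates):
    """Return the candidate whose name is most similar to `column`."""
    return max(candidates, key=lambda c: name_similarity(column, c))


# Hypothetical schemas of two data sets to be integrated.
source_columns = ["customer_id", "order_date"]
target_columns = ["customer_no", "date_of_order", "amount"]

for col in source_columns:
    print(col, "->", best_match(col, target_columns))
```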

  15. Schema matching • Currently, most schema matching is done manually, which is: • tedious, • time-consuming, • error-prone.

  16. We need automated support for schema matching that is: • faster, • less error-prone, and • less labor-intensive.

  17. A mapping between Global Schema and Local Schema

  18. The architecture for data integration

  19. Correlation Analysis • Redundancy • Redundant attributes can be detected by applying correlation analysis.

  20. Correlation Analysis • Given two attributes (X1, X2), • measure how strongly one attribute (X1) is correlated with the other (X2).

  21. Correlation Analysis

  22. Correlation Analysis

  23. Correlation Analysis • Table 2 is generated using the following criteria: • i) Number of bytes in an attribute: if the total number of bytes is less than or equal to 8, we record 1; otherwise 0. • ii) Access frequency: we sum the access frequencies, (6 + 1 + 2) = 9, so the average access frequency is 9 / 3 = 3. Any frequency below the average is converted to 0; otherwise it is 1.

  24. Correlation Analysis

  25. Correlation Analysis • We apply correlation analysis to find which pairs of attributes are redundant.

  26. Correlation Analysis

  27. Correlation Analysis

  28. Correlation Analysis

  29. Correlation Analysis • If the resulting value is greater than 0, then X2 and X3 are positively correlated. • The higher the value (approaching 1), the more each attribute implies the other. • Therefore, either X2 or X3 may be removed, since the two variables are redundant.
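
The transcript does not preserve the formula or the tables from the preceding slides, so the following is a minimal sketch assuming the Pearson correlation coefficient is the measure intended. The 0/1 values for X2 and X3 are made up for illustration and are not the values from the presentation's tables.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant attribute has no defined correlation.
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / sqrt(var_x * var_y)


# Hypothetical 0/1 encodings of two attributes.
x2 = [1, 0, 1, 1, 0]
x3 = [1, 0, 1, 0, 0]

r = pearson(x2, x3)
print(f"r(X2, X3) = {r:.3f}")
if r > 0:
    print("X2 and X3 are positively correlated")
```

A value near 1 would suggest that one of the two attributes is a candidate for removal; how close to 1 counts as redundant is a threshold the analyst has to choose.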

  30. Clustering • To explain how we apply a clustering algorithm to generate clusters, • we assume that a relation has 10 attributes involved in query processing. • Furthermore, one disk page can hold fewer than 100 bytes.

  31. Clustering • Table 6.1 shows the length of each attribute. • We use a frequent access table to keep track of the number of times users access each attribute of a particular relation, as shown in Table 6.1. • Whenever users access the relation, the frequent access table is updated. The frequent access table also shows the length of each attribute.

  32. Clustering

  33. Clustering • From Table 6.1, we would like to convert the numeric figures into Y or N conditions based on some criteria. • We propose the following conversion scheme: • Number of bytes in an attribute: if the total number of bytes is less than what one fetch cycle can read, say 100 bytes, we record Y; otherwise N. • Access frequency: we sum the access frequencies of the attributes, (7 + 2 + 4 + 3 + 2 + 8 + 5 + 4 + 9 + 3) = 47. • The average access frequency = 47 / 10 = 4.7. • Any frequency below the average is converted to N; otherwise it is Y.
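
The conversion scheme above can be worked through in Python. The access frequencies are the ten values quoted on the slide. The attribute lengths in Table 6.1 are not preserved in this transcript, so `length_flag` is shown only as a hypothetical helper with the 100-byte limit as a parameter.

```python
# Access frequencies of the 10 attributes, as listed on the slide.
frequencies = [7, 2, 4, 3, 2, 8, 5, 4, 9, 3]

total = sum(frequencies)            # 47
average = total / len(frequencies)  # 4.7

# Below the average becomes "N"; everything else becomes "Y".
frequency_flags = ["Y" if f >= average else "N" for f in frequencies]


def length_flag(n_bytes, page_limit=100):
    """Hypothetical helper: 'Y' if the attribute fits in one fetch, else 'N'."""
    return "Y" if n_bytes < page_limit else "N"


print("total =", total, "average =", average)
print(frequency_flags)
```

Only the attributes accessed 7, 8, 5, and 9 times exceed the average of 4.7, so four of the ten attributes are flagged Y on the frequency criterion.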

  34. Clustering

  35. Clustering

  36. DATA TRANSFORMATION • In metadata, a data transformation converts data from a source data format into a destination data format.
