Course 2: Data Preprocessing


Presentation Transcript


1. Course 2: Data Preprocessing

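2. Outline: Why preprocess the data?; Descriptive data summarization; Data cleaning; Data integration and transformation; Data reduction; Discretization and concept hierarchy generation; Summary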

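3. Why preprocess the data? Real-world data is dirty. Incomplete: missing attribute values, missing attributes of interest, or containing only aggregate data, e.g., occupation=“ ”, ????=20 (???????). Noisy: containing errors or outliers, e.g., salary=“-10”. Inconsistent: containing discrepancies in codes or names, e.g., age=“42” but birthday=“03/07/1997”; e.g., ratings that were “1,2,3” are now “A, B, C”; e.g., inconsistent names for the same attribute: Customer_ID, Customer_Num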

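4. Why is data dirty? Incomplete data may come from: values that were not applicable or not available when the data was collected; different considerations between the time the data was collected and the time it is analyzed; human, hardware, or software problems. Noisy data (incorrect values) may come from: faulty data collection instruments; human or computer errors at data entry; errors in data transmission. Inconsistent data may come from: different data sources; functional dependency violations (e.g., modifying some linked data). Duplicate records also need data cleaning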

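5. Major tasks in data preprocessing. Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies. Data integration: integrate multiple databases, data cubes, or files. Data transformation: normalization and aggregation. Data reduction: obtain a representation that is much smaller in volume yet produces the same or similar analytical results. Data discretization: part of data reduction, but of particular importance, especially for numerical data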

6. ???????

7. Why is data preprocessing important? No quality data, no quality mining results! Quality decisions must be based on quality data; e.g., duplicate or missing data may cause incorrect or even misleading statistics. ?????????????????????

8. Multi-Dimensional Measure of Data Quality. A well-accepted multidimensional view: accuracy; completeness; consistency; timeliness; believability; value added; interpretability; accessibility

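9. Outline: Why preprocess the data?; Descriptive data summarization; Data cleaning; Data integration and transformation; Data reduction; Discretization and concept hierarchy generation; Summary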

10. ??????? ??:????????? ????????? ????????? ???????? ????????? ?????????

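11. Measuring central tendency. Mean (an algebraic measure): sample mean x̄ = (1/n) Σ x_i; weighted mean x̄ = (Σ w_i x_i) / (Σ w_i); trimmed mean: the mean after chopping off extreme values. Median: the middle value if the number of values is odd, otherwise the average of the two middle values; for grouped data it can be estimated by interpolation. Mode: the value that occurs most frequently; data may be unimodal, bimodal, or trimodal. Empirical formula (Pearson’s method): mean - mode ≈ 3 × (mean - median)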
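
A minimal Python sketch of the central-tendency measures named on this slide (mean, trimmed mean, median, mode), using only the standard library. The salary figures and the helper names `trimmed_mean` and `modes` are illustrative assumptions, not taken from the slides.

```python
from collections import Counter
from statistics import mean, median

def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values from each end (hypothetical helper)."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)      # number of values cut from each end
    return mean(ordered[k:len(ordered) - k])

def modes(values):
    """All values that occur most often: one for unimodal data, two for bimodal, and so on."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

# Illustrative data (assumed): salaries in thousands.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(mean(salaries))                # arithmetic mean
print(trimmed_mean(salaries, 0.1))   # mean without the single lowest and highest value
print(median(salaries))              # average of the two middle values (even count)
print(modes(salaries))               # this data set is bimodal
```

The extreme value 110 pulls the plain mean above the median, which is why the trimmed mean and the median are often preferred for skewed data.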

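13. Measuring the dispersion of data. Quartiles, outliers, and boxplots. Quartiles: Q1 (25th percentile), Q3 (75th percentile). Interquartile range: IQR = Q3 - Q1. Five-number summary: min, Q1, M, Q3, max. Boxplot: the ends of the box are the quartiles, the median is marked inside the box, whiskers extend outward, and outliers are plotted individually. Outlier: usually a value more than 1.5 x IQR above Q3 or below Q1. Variance and standard deviation (sample: s, population: σ). Variance: an algebraic, scalable computation. The standard deviation s (or σ) is the square root of the variance s² (or σ²)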
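
The five-number summary and the 1.5 x IQR outlier rule can be sketched in Python as follows. This is a sketch under assumptions: the data is illustrative, the helper names are hypothetical, and the "inclusive" quartile convention of `statistics.quantiles` is only one of several in common use, so other tools may report slightly different Q1 and Q3.

```python
from statistics import quantiles, stdev, pstdev

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile convention."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)

def iqr_outliers(values):
    """Values lying more than 1.5 x IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

# Illustrative data (assumed): salaries in thousands.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(five_number_summary(salaries))
print(iqr_outliers(salaries))
print(stdev(salaries))    # sample standard deviation s (divides by n - 1)
print(pstdev(salaries))   # population standard deviation sigma (divides by n)
```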

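14. Boxplot analysis. Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum. Boxplot: the data is represented with a box; the ends of the box are at the first and third quartiles, so the height of the box is the IQR; the median is marked by a line within the box; whiskers: two lines outside the box extend to the Minimum and Maximum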

15. ???????: ?????

17. ???????????? ????? ????? ??????? ?????????,??????????????????????????

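18. Quantile plot. Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences). Plots quantile information: for data x_i sorted in increasing order, f_i indicates that approximately 100·f_i% of the data are less than or equal to x_i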

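19. Quantile-quantile (Q-Q) plot. Graphs the quantiles of one univariate distribution against the corresponding quantiles of another. Lets the user see whether there is a shift in going from one distribution to the other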

20. ??? ?????????????????????? ?????????????,????????????

23. ?????? ?????????????,???????????? ??????????,??????????:?????,?????????????

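24. Outline: Why preprocess the data?; Descriptive data summarization; Data cleaning; Data integration and transformation; Data reduction; Discretization and concept hierarchy generation; Summary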

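25. Data cleaning. Importance: “Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball); “Data cleaning is the number one problem in data warehousing” (DCI survey). Data cleaning tasks: fill in missing values; identify outliers and smooth out noisy data; correct inconsistent data; resolve redundancy caused by data integration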

26. Missing data. Data is not always available: e.g., many tuples have no recorded value for several attributes, such as customer income in sales data. Missing data may be due to: equipment malfunction; values that were inconsistent with other recorded data and thus deleted; data not entered due to misunderstanding; data not considered important at the time of entry

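27. How to handle missing data? Ignore the tuple: usually done when the class label is missing (assuming a classification task); not effective when the percentage of missing values per attribute varies considerably. Fill in the missing value manually: tedious and often infeasible. Fill it in automatically with: a global constant, e.g., “unknown” (which may effectively become a new class!); the attribute mean; the attribute mean for all samples belonging to the same class (smarter); the most probable value, inferred with methods such as a Bayesian formula or a decision tree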
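
A minimal sketch of three of the automatic fill-in strategies: a global constant, the attribute mean, and the class-wise attribute mean. The record layout (a list of dicts with `None` for missing values), the attribute names, and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

def fill_with_constant(rows, attr, constant="unknown"):
    """Replace missing values of `attr` with a global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]

def fill_with_mean(rows, attr):
    """Replace missing values of `attr` with the mean of the known values."""
    fill = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in rows]

def fill_with_class_mean(rows, attr, label):
    """Replace missing values with the mean of `attr` within the same class.

    Assumes every class has at least one known value of `attr`.
    """
    known = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            known[r[label]].append(r[attr])
    class_means = {c: mean(vals) for c, vals in known.items()}
    return [{**r, attr: class_means[r[label]] if r[attr] is None else r[attr]}
            for r in rows]

# Illustrative records (assumed): income in thousands, with a class label.
rows = [
    {"income": 40, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 60, "risk": "low"},
    {"income": 20, "risk": "high"},
    {"income": None, "risk": "high"},
]

print([r["income"] for r in fill_with_constant(rows, "income")])
print([r["income"] for r in fill_with_mean(rows, "income")])
print([r["income"] for r in fill_with_class_mean(rows, "income", "risk")])
```

The class-wise mean keeps the filled values closer to what is typical for each class, which is why the slide calls it the smarter option.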

28. Noisy data. Noise: random error or variance in a measured variable. Incorrect attribute values may be due to: faulty data collection instruments; data entry problems; data transmission problems

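29. How to handle noisy data? Binning: first sort the data and partition it into (equal-frequency) bins; then smooth by bin means, bin medians, or bin boundaries. Regression: smooth by fitting the data to regression functions. Clustering: detect and remove outliers. Combined computer and human inspection: detect suspicious values automatically and have a human check them (e.g., to deal with possible outliers)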

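30. Simple discretization methods: binning. Binning smooths a sorted data value by consulting its “neighborhood”, that is, the values around it. Equal-width (distance) partitioning: divides the range into N intervals of equal size (a uniform grid); if A and B are the lowest and highest values of the attribute, the interval width is W = (B - A)/N; this is the most straightforward approach, but outliers may dominate the presentation and skewed data is not handled well. Equal-depth (frequency) partitioning: divides the range into N intervals, each containing approximately the same number of samples; gives good data scaling; managing categorical attributes can be tricky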
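
The two partitioning schemes, followed by smoothing by bin means, can be sketched as below. The price list and the function names are illustrative assumptions; the equal-width version uses W = (B - A)/N and assumes the values are not all identical.

```python
from statistics import mean

def equal_width_bins(values, n_bins):
    """Partition into n_bins intervals of width W = (B - A) / N."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins            # assumes hi > lo
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        idx = min(int((v - lo) / width), n_bins - 1)   # the maximum goes in the last bin
        bins[idx].append(v)
    return bins

def equal_depth_bins(values, n_bins):
    """Partition the sorted values into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

def smooth_by_bin_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[mean(b)] * len(b) for b in bins if b]

# Illustrative data (assumed): sorted prices.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

print(equal_width_bins(prices, 3))
print(equal_depth_bins(prices, 3))
print(smooth_by_bin_means(equal_depth_bins(prices, 3)))
```

On this data the equal-width bins hold 3, 3, and 6 values, while the equal-depth bins hold 4 each, which illustrates why equal-width partitioning handles skewed data poorly.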

31. ????????????

32. ??

33. ????

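34. Data cleaning as a process. Data discrepancy detection: use metadata (e.g., domain, range, dependency, distribution); check field overloading; check the uniqueness rule, consecutive rule, and null rule; use commercial tools. Data scrubbing: use simple domain knowledge (e.g., postal codes, spell-checking) to detect errors and make corrections. Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., use correlation and clustering to find outliers). Data migration and integration: data migration tools allow transformations to be specified; ETL (Extraction/Transformation/Loading) tools let users specify transformations through a graphical user interface. Integration of the two processes: iterative and interactive (e.g., Potter’s Wheel)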

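35. Outline: Why preprocess the data?; Descriptive data summarization; Data cleaning; Data integration and transformation; Data reduction; Discretization and concept hierarchy generation; Summary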

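36. Data integration. Data integration combines data from multiple sources into a coherent store. Issues to consider: schema integration; redundancy; detection and resolution of data value conflicts. Careful integration of data from multiple sources can help reduce or avoid redundancies and inconsistencies and improve mining speed and quality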

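37. Schema integration: integrate metadata from different sources, e.g., A.cust-id ≡ B.cust-#. Entity identification problem: identify real-world entities across multiple data sources, e.g., Bill Clinton = William Clinton. Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources differ. Possible reasons: different representations or different scales, e.g., metric vs. British units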

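38. Handling redundancy in data integration. Redundant data often occur when multiple databases are integrated. Object identification: the same attribute or object may have different names in different databases. Derivable data: one attribute may be derived from attributes in another table, e.g., annual revenue. Redundant attributes may be detected by correlation analysis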

39. Correlation analysis (numerical data). Correlation coefficient (also called Pearson’s product moment coefficient)
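
A minimal sketch of Pearson’s product moment coefficient, computed as the sum of products of deviations divided by the square root of the product of the two sums of squared deviations. The function name and the sample data are assumptions for illustration; a value near +1 or -1 suggests that one attribute is largely redundant given the other.

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson correlation between two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(ss_a * ss_b)

# Illustrative attribute values (assumed).
a = [1, 2, 3, 4, 5]
b = [2, 1, 4, 3, 5]

print(pearson_r(a, b))                    # positive correlation
print(pearson_r(a, [2 * x for x in a]))   # perfectly correlated, so redundant
```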

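40. Correlation analysis (categorical data). χ² (chi-square) test. The larger the χ² value, the more likely the variables are related. The cells that contribute most to the χ² value are those whose actual count differs greatly from the expected count. Correlation does not imply causality: e.g., the number of hospitals and the number of car thefts in a city are correlated, but both are causally linked to a third variable (population)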

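41. Chi-square calculation: an example. χ² (chi-square) calculation (numbers in parentheses are expected counts, calculated from the data distribution in the two categories). The result shows that like_science_fiction and play_chess are correlated in the group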
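
The χ² statistic for a contingency table can be sketched as follows. The observed counts below are assumed for illustration only (rows: like_science_fiction yes/no; columns: play_chess yes/no); the expected count of each cell is computed as row total × column total / grand total.

```python
def chi_square(observed):
    """Chi-square statistic of a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (obs - expected) ** 2 / expected
    return chi2

# Assumed counts: rows = like_science_fiction (yes, no),
# columns = play_chess (yes, no).
observed = [
    [250, 200],
    [50, 1000],
]

print(round(chi_square(observed), 2))
```

With these assumed counts the statistic is far above the critical value for one degree of freedom at common significance levels (3.84 at 0.05), so independence of the two attributes would be rejected.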

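42. Data transformation. Smoothing: remove noise from the data. Aggregation: summarization, data cube construction. Generalization: concept hierarchy climbing. Normalization: scale values to fall within a small, specified range, using min-max normalization, z-score normalization, or normalization by decimal scaling. Attribute construction: new attributes are constructed from the given ones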

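43. Data transformation: normalization. Min-max normalization to [new_minA, new_maxA]: v' = (v - minA) / (maxA - minA) × (new_maxA - new_minA) + new_minA. Ex.: let the income range $12,000 to $98,000 be normalized to [0.0, 1.0]; then $73,000 is mapped to (73,000 - 12,000) / (98,000 - 12,000) ≈ 0.709. Z-score normalization (µ: mean, σ: standard deviation): v' = (v - µ) / σ. Ex.: let µ = 54,000 and σ = 16,000; then $73,000 is mapped to (73,000 - 54,000) / 16,000 ≈ 1.19. Normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1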
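
The three normalization methods can be sketched in a few lines of Python. The function names are hypothetical; the income figures are the ones used in the slide’s example, and the values passed to `decimal_scaling` are assumed for illustration.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    """Number of standard deviations that v lies from the mean."""
    return (v - mu) / sigma

def decimal_scaling(values):
    """Divide by the smallest power of ten that brings every |value| below 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(round(min_max(73_000, 12_000, 98_000), 3))   # income range mapped onto [0, 1]
print(z_score(73_000, 54_000, 16_000))             # mu = 54,000 and sigma = 16,000
print(decimal_scaling([-986, 917]))                # assumed values, divided by 10**3
```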

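44. Outline: Why preprocess the data?; Descriptive data summarization; Data cleaning; Data integration and transformation; Data reduction; Discretization and concept hierarchy generation; Summary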

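45. Data reduction strategies. Why data reduction? A database or data warehouse may store terabytes of data, and complex analysis or mining may take a very long time to run on the complete data set. Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results. Strategies: data cube aggregation; dimensionality reduction, e.g., removing unimportant attributes; data compression; numerosity reduction, e.g., fitting data into models; discretization and concept hierarchy generation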

46. ?????? (Data Cube Aggregation) ??????? (?????) ????????? ???????? ????????????? ?????? ?????????????

47. ?????? ??????????????????,??????????????????????????? ???? (?., ???????): ??????????,????????????????????????????? ???????????,??????????????? ??? (n??????2n?????): ?????? ?????? ???????????? ?????

49. Data compression. Data encoding or transformations are applied to obtain a reduced or compressed representation of the original data. Lossless: the original data can be reconstructed from the compressed data without any loss of information. Lossy: only an approximation of the original data can be reconstructed

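50. Dimensionality reduction: wavelet transformation. Discrete wavelet transform (DWT): linear signal processing, multi-resolution analysis. Compressed approximation: store only a small fraction of the strongest wavelet coefficients. Similar to the discrete Fourier transform (DFT), but with better lossy compression and localized in space. Method: the length L must be an integer power of 2 (pad with 0s when necessary); each transform has two functions, smoothing and difference; they are applied to pairs of data, producing two sets of data of length L/2; the two functions are applied recursively until the desired length is reached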
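
The pairwise smoothing and difference steps can be sketched with the simplest wavelet, the Haar wavelet. This sketch assumes the unnormalized variant (pairwise average and half-difference) and an input whose length is already a power of 2; the function name and the sample signal are illustrative.

```python
def haar_dwt(values):
    """Haar wavelet transform: overall average first, then detail coefficients
    from the coarsest level to the finest."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2 (pad with 0s if needed)")
    smooth = list(values)
    details = []
    while len(smooth) > 1:
        pairs = list(zip(smooth[0::2], smooth[1::2]))
        level_details = [(a - b) / 2 for a, b in pairs]   # difference function
        smooth = [(a + b) / 2 for a, b in pairs]          # smoothing function
        details = level_details + details                 # coarser levels go in front
    return smooth + details

# Illustrative signal (assumed), length 8 = 2**3.
signal = [2, 2, 0, 2, 3, 5, 4, 4]
print(haar_dwt(signal))
```

Lossy compression then keeps only the largest-magnitude coefficients and treats the rest as 0.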

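51. Dimensionality reduction: principal component analysis (PCA). Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data. Steps: normalize the input data so that each attribute falls within the same range; compute k orthonormal (unit) vectors, i.e., the principal components; each input vector is a linear combination of the k principal component vectors; the principal components are sorted in order of decreasing significance; because they are sorted, the data size can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest components, a good approximation of the original data can be reconstructed). Works for numeric data only. Used when the number of dimensions is large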

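52. Numerosity reduction. Reduce data volume by choosing alternative, smaller forms of data representation. Parametric methods: assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers); example: log-linear models, which obtain the value at a point in m-dimensional space as a product over appropriate marginal subspaces. Non-parametric methods: do not assume models; major families: histograms, clustering, sampling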

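53. Data reduction method (1): regression and log-linear models. Linear regression: Y = w X + b; the two regression coefficients, w and b, specify the line and are estimated from the data at hand by applying the least-squares criterion to the known values Y1, Y2, …, X1, X2, …. Multiple regression: Y = b0 + b1 X1 + b2 X2; many nonlinear functions can be transformed into this form. Log-linear models: the multi-way table of joint probabilities is approximated by a product of lower-order tables, e.g., p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd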
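
A minimal least-squares fit of the line Y = w X + b, using the closed-form solution for a single predictor. The function name and the data points are assumptions for illustration; once w and b are stored, the original points can be discarded, which is the sense in which regression reduces data.

```python
def fit_line(xs, ys):
    """Least-squares estimates (w, b) for Y = w X + b.

    Assumes at least two distinct x values.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b

# Illustrative points (assumed) lying exactly on Y = 2 X + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

w, b = fit_line(xs, ys)
print(w, b)
print(w * 10 + b)   # approximate a value that was never stored
```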

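54. Data reduction method (2): histograms. Divide the data into buckets and store the average (or sum) for each bucket. Partitioning rules: equal-width: equal bucket ranges; equal-frequency (equal-depth): each bucket holds roughly the same number of values; V-optimal: the histogram with the least variance (a weighted sum of the original values that each bucket represents); MaxDiff: bucket boundaries are placed between the pairs of adjacent values with the β-1 largest differences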

55. ???? (3): ?? ???????????????? (?.,?????) ?????????????, ?????????? ?????????????????????? T????????????????? ???????????

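56. Data reduction method (4): sampling. Sampling: obtain a small sample s to represent the whole data set N. ?????????? ????????,????????????? ???????? ????????? ????????? ???? Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database. ?????????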
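
Simple random sampling without replacement and stratified sampling can be sketched as below. The record layout, the function names, and the 90/10 class split are assumptions for illustration; a fixed seed is used only to make the run repeatable.

```python
import random
from collections import Counter, defaultdict

def simple_random_sample(records, n, seed=0):
    """Draw n records without replacement, each equally likely."""
    return random.Random(seed).sample(records, n)

def stratified_sample(records, key, fraction, seed=0):
    """Sample the same fraction from every stratum (at least one record each)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Illustrative skewed data (assumed): 90 "young" customers and 10 "senior" ones.
customers = [("young", i) for i in range(90)] + [("senior", i) for i in range(10)]

srs = simple_random_sample(customers, 10)
strat = stratified_sample(customers, key=lambda r: r[0], fraction=0.1)

print(Counter(group for group, _ in srs))     # may miss the small class entirely
print(Counter(group for group, _ in strat))   # keeps the 9:1 class proportion
```

With skewed data a simple random sample can under-represent or omit the rare class, while the stratified sample preserves each class’s share by construction.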

58. ??: ???????

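59. Outline: Why preprocess the data?; Descriptive data summarization; Data cleaning; Data integration and transformation; Data reduction; Discretization and concept hierarchy generation; Summary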

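60. Discretization and concept hierarchy generation. Discretization: reduce the number of values of a continuous attribute by dividing its range into intervals; interval labels can then replace the actual data values; supervised vs. unsupervised; split (top-down) vs. merge (bottom-up); discretization can be applied recursively to an attribute. Concept hierarchy formation: recursively reduce the data by collecting low-level concepts (such as numeric values for age) and replacing them with higher-level concepts (such as young, middle-aged, or senior)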

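62. Discretization and concept hierarchy generation for numeric data. Typical methods (all can be applied recursively): binning: top-down split, unsupervised; histogram analysis: top-down split, unsupervised; clustering analysis: top-down split or bottom-up merge, unsupervised; entropy-based discretization: supervised, top-down split; interval merging by χ² analysis: supervised, bottom-up merge; segmentation by natural partitioning: top-down split, unsupervised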

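63. Entropy-based discretization. Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the expected information after partitioning is I(S, T) = (|S1|/|S|) Entropy(S1) + (|S2|/|S|) Entropy(S2). Entropy is calculated from the class distribution of the samples in the set: given m classes, Entropy(S1) = -Σ_{i=1..m} p_i log2(p_i), where p_i is the probability of class i in S1. The boundary that minimizes I(S, T) over all possible boundaries is selected as a binary discretization. The process is applied recursively to the resulting partitions until a stopping criterion is met. Such boundaries may reduce data size and improve classification accuracy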
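
One step of entropy-based discretization, choosing the boundary T that minimizes the weighted entropy of the two resulting intervals, can be sketched as follows. The function names, the use of midpoints between adjacent distinct values as candidate boundaries, and the sample data are assumptions for illustration; a full discretizer would apply `best_split` recursively until a stopping criterion is met.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a class distribution: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (T, I(S, T)) for the boundary with the lowest weighted entropy.

    Candidates are midpoints between adjacent distinct values; returns None
    if all values are identical.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        info = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if best is None or info < best[1]:
            best = (t, info)
    return best

# Illustrative samples (assumed): ages with a class label.
ages = [1, 2, 3, 10, 11, 12]
classes = ["a", "a", "a", "b", "b", "b"]

print(best_split(ages, classes))   # the boundary that separates the two classes
```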

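64. Interval merging by χ² analysis. Merging-based (bottom-up) vs. splitting-based methods. Merge: find the best neighboring intervals and merge them to form larger intervals, recursively. ChiMerge [Kerber AAAI 1992, See also Liu et al. DMKD 2002]: initially, each distinct value of a numeric attribute A is considered to be one interval; χ² tests are performed for every pair of adjacent intervals; adjacent intervals with the lowest χ² values are merged, since a low χ² value for a pair indicates similar class distributions; the merging proceeds recursively until a predefined stopping criterion is met (such as significance level, maximum number of intervals, or maximum inconsistency)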

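65. Segmentation by natural partitioning. A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, “natural” intervals. If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 intervals; if it covers 2, 4, or 8 distinct values, partition the range into 4 equal-width intervals; if it covers 1, 5, or 10 distinct values, partition the range into 5 equal-width intervals. The rule can be applied recursively to each interval. Because extreme values can distort a partition based on the minimum and maximum, the top-level partition can instead be based on the range covering the majority of the data, e.g., the 5th to 95th percentile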

66. 3-4-5 ????

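67. Concept hierarchy generation for categorical data. Specification of a partial or total ordering of attributes explicitly at the schema level by users or experts: street < city < state < country. Specification of a hierarchy for a set of values by explicit data grouping: {Urbana, Champaign, Chicago} < Illinois. Specification of only a partial set of attributes: e.g., only street < city, not the others. Automatic generation of hierarchies (or attribute levels) by analyzing the number of distinct values: e.g., for the set of attributes {street, city, state, country}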

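68. Automatic concept hierarchy generation. Some hierarchies can be generated automatically by analyzing the number of distinct values per attribute in the data set: the attribute with the most distinct values is placed at the lowest level of the hierarchy. Exceptions exist, e.g., weekday, month, quarter, year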

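69. Summary. Data preparation, or preprocessing, is a major issue for both data warehousing and data mining. Descriptive data summarization is needed for quality data preprocessing. Data preparation includes: data cleaning and data integration; data reduction and feature selection; discretization. Many methods have been developed, but data preprocessing remains an active area of research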
