Flood Classification Based on Improved Principal Component Analysis and Hierarchical Cluster Analysis
HUI GE
College of Hydrology and Water Resources, Hohai University
[email protected]

Contents
Introduction
Principal Component Analysis
Hierarchical Cluster Analysis
Case Study
Principal Component Analysis
Hierarchical Cluster Analysis
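If there are p indexes and n flood processes, the matrix of flood samples is

$$X = (x_{ij})_{n \times p} = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix},$$

where $x_{ij}$ is the value of the j-th index for the i-th flood process.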
The original matrix carries two kinds of information:
variance: the degree of variation of each index;
correlation coefficient matrix: the degree of interaction between indexes.
Standardization makes the variance of each index equal to 1, which eliminates the differences in variation between indexes.
So the dimensionless method applied to the original matrix must be improved; the mean value method is one of the better choices.
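Mean value processing divides every value by the mean of its index, giving the matrix $Y = (y_{ij})_{n \times p}$ with

$$y_{ij} = \frac{x_{ij}}{\bar{x}_j}, \qquad \bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}.$$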
Mean value processing does not change the correlation coefficients between indexes, so all the information in the correlation coefficient matrix is reflected in the corresponding covariance matrix.
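The covariance matrix of the mean-value-processed data $Y$ (each value divided by the mean of its index, $y_{ki} = x_{ki} / \bar{x}_i$) is $U = (u_{ij})_{p \times p}$, where

$$u_{ij} = \frac{1}{n-1} \sum_{k=1}^{n} (y_{ki} - \bar{y}_i)(y_{kj} - \bar{y}_j), \qquad i, j = 1, 2, \ldots, p.$$

The mean value of each index in $Y$ is 1; as a result,

$$u_{ij} = \frac{1}{n-1} \sum_{k=1}^{n} (y_{ki} - 1)(y_{kj} - 1) = \frac{s_{ij}}{\bar{x}_i \bar{x}_j}, \qquad u_{ii} = \left( \frac{s_i}{\bar{x}_i} \right)^2,$$

where $s_{ij}$ is the covariance of indexes i and j in the original data and $s_i$ is the standard deviation of index i.
That is, the principal diagonal elements of the covariance matrix of the mean-value-processed data are the squared coefficients of variation of the indexes.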
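The two properties stated above (correlations are unchanged, and each variance becomes a squared coefficient of variation) can be checked numerically. The following is a minimal sketch using only the Python standard library; the sample values and the helper names `covariance`, `correlation`, and `mean_value_process` are hypothetical and are not taken from the case study, and the sample covariance is assumed to use the n - 1 denominator.

```python
from statistics import mean, stdev


def covariance(a, b):
    """Sample covariance of two equally long sequences (n - 1 denominator)."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def correlation(a, b):
    """Pearson correlation coefficient built from the sample covariance."""
    return covariance(a, b) / (covariance(a, a) * covariance(b, b)) ** 0.5


def mean_value_process(column):
    """Divide every value of one index by the mean of that index."""
    m = mean(column)
    return [x / m for x in column]


# Hypothetical values of two indexes over five floods (illustration only).
peak_flow = [52000.0, 61000.0, 47000.0, 70000.0, 58000.0]
flood_level = [51.2, 53.0, 50.1, 54.6, 52.4]

y_flow = mean_value_process(peak_flow)
y_level = mean_value_process(flood_level)

# Coefficient of variation of the original index.
cv_flow = stdev(peak_flow) / mean(peak_flow)
# Diagonal element of the covariance matrix after mean value processing.
var_y_flow = covariance(y_flow, y_flow)

print(var_y_flow, cv_flow ** 2)  # the two values agree up to rounding
print(correlation(peak_flow, flood_level), correlation(y_flow, y_level))
```

Unlike standardization, which would set both variances to 1, the processed indexes keep their relative spread, so the peak flow in this made-up sample still varies more than the flood level.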
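Then the eigenvalues and eigenvectors of $U_{p \times p}$ are calculated. If the eigenvalues are arranged in descending order, that is, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$, and the corresponding eigenvectors are $a_1, a_2, \ldots, a_p$ with $a_i = (a_{1i}, a_{2i}, \ldots, a_{pi})^T$, consider the linear combinations

$$F_i = a_{1i} y_1 + a_{2i} y_2 + \cdots + a_{pi} y_p, \qquad i = 1, 2, \ldots, k,$$

where $y_1, \ldots, y_p$ are the mean-value-processed indexes and $F_1, F_2, \ldots, F_k$ are respectively known as the first principal component, second principal component, …, and kth principal component. The first principal component is the linear combination with maximum variance.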
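The slides do not say how the eigenvalues and eigenvectors are computed. As an illustration only, the sketch below uses power iteration, a simple technique swapped in here, to approximate the largest eigenvalue and its eigenvector for a small symmetric matrix. The function names and the example matrix are hypothetical, and the method assumes the starting vector is not orthogonal to the leading eigenvector.

```python
def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def leading_eigenpair(matrix, iterations=200):
    """Approximate the largest eigenvalue and a unit eigenvector by power iteration."""
    p = len(matrix)
    v = [1.0 / p ** 0.5] * p  # assumed start: equal loading on every index
    for _ in range(iterations):
        w = mat_vec(matrix, v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v estimates the eigenvalue.
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(matrix, v)))
    return eigenvalue, v


# Hypothetical 2 x 2 covariance matrix with eigenvalues 3 and 1.
u = [[2.0, 1.0], [1.0, 2.0]]
lam1, a1 = leading_eigenpair(u)
print(lam1, a1)
```

The loadings in `a1` are the coefficients of the first principal component; in practice a library eigen-solver would return all p eigenpairs at once.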
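The number of principal components k is determined by the accumulative percentage of explained variance E, namely the smallest k for which

$$E = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{p} \lambda_i}$$

reaches the chosen threshold, where $\lambda_i$ is the i-th largest eigenvalue of the covariance matrix.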
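The selection rule above can be sketched as a short function. The eigenvalues below are made up, the name `choose_k` is hypothetical, and the threshold of 0.85 is only a commonly used value assumed for the example, not one stated in the slides.

```python
def choose_k(eigenvalues, threshold):
    """Return the smallest k whose cumulative explained variance reaches threshold.

    Also returns the cumulative proportion E reached at that k.
    """
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k, cumulative / total
    # Reached only if rounding keeps the ratio just below the threshold.
    return len(ordered), 1.0


# Hypothetical eigenvalues of a 5 x 5 covariance matrix (illustration only).
eigenvalues = [4.0, 0.75, 0.15, 0.075, 0.025]
k, explained = choose_k(eigenvalues, threshold=0.85)  # assumed threshold
print(k, explained)
```

With these made-up values the first component explains 80% of the variance and the first two explain 95%, so two components are kept at the assumed threshold.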
Then these k components can “replace” the original p variables without much loss of information.
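The total component score is the weighted sum of the component scores of the k principal components:

$$F = \sum_{i=1}^{k} w_i F_i,$$

where $F_i$ is the score on the i-th principal component.
The weight of each principal component is its variance contribution rate,

$$w_i = \frac{\lambda_i}{\sum_{j=1}^{p} \lambda_j},$$

where $\lambda_i$ is the variance (eigenvalue) of the i-th principal component.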
The flood intensity can be evaluated according to the total score.
The weight of a principal component is the proportion of the total variance explained by that component.
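A minimal sketch of the total score follows, assuming the weights are each retained component's share of the total variance over all p eigenvalues. The function name `total_scores`, the eigenvalues, and the component scores are hypothetical values chosen for illustration.

```python
def total_scores(component_scores, eigenvalues):
    """Weighted sum of the first k component scores for every flood.

    component_scores: one row per flood, holding its k component scores.
    eigenvalues: all p eigenvalues in descending order; the first k of them
    give the weights of the k retained components.
    """
    total_variance = sum(eigenvalues)
    k = len(component_scores[0])
    weights = [lam / total_variance for lam in eigenvalues[:k]]
    return [sum(w * f for w, f in zip(weights, row)) for row in component_scores]


# Hypothetical data: three floods, two retained components, three eigenvalues.
eigenvalues = [4.0, 0.75, 0.25]
scores = [[1.0, 2.0], [-0.5, 0.4], [0.0, -1.0]]

totals = total_scores(scores, eigenvalues)
# Rank the floods from highest to lowest total score.
ranking = sorted(range(len(totals)), key=lambda i: totals[i], reverse=True)
print(totals, ranking)
```

Sorting floods by total score gives one possible ordering by intensity, in line with the statement that flood intensity can be evaluated from the total score.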
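Squared Euclidean Distance
The squared Euclidean distance between samples i and j, each described by m variables, is

$$d_{ij}^2 = \sum_{k=1}^{m} (x_{ik} - x_{jk})^2.$$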
Ward considered hierarchical clustering procedures based on minimizing the “loss of information” from joining two groups.
This method is usually implemented with loss of information taken to be an increase in an error sum of squares criterion.
The error sum of squares of a cluster is the sum, over every item in the cluster, of the squared Euclidean distance between the item and the cluster mean.
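The merging criterion described above can be sketched in a few lines: compute the error sum of squares (ESS) of each cluster and the increase in ESS that a merge would cause. The clusters and function names below are hypothetical. The closed form in the last comment is the standard identity for Ward's criterion and is included only as a cross-check.

```python
def centroid(cluster):
    """Mean point of a cluster given as a list of equal-length points."""
    n = len(cluster)
    return [sum(point[d] for point in cluster) / n for d in range(len(cluster[0]))]


def squared_euclidean(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def error_sum_of_squares(cluster):
    """Sum of squared Euclidean distances of every item from the cluster mean."""
    c = centroid(cluster)
    return sum(squared_euclidean(point, c) for point in cluster)


def ward_increase(a, b):
    """Increase in the total ESS caused by merging clusters a and b."""
    return error_sum_of_squares(a + b) - error_sum_of_squares(a) - error_sum_of_squares(b)


# Hypothetical clusters of principal component scores (illustration only).
cluster_a = [[1.0, 2.0], [2.0, 1.0]]
cluster_b = [[5.0, 5.0], [6.0, 6.0], [7.0, 7.0]]
cluster_c = [[1.5, 2.5]]

inc_ab = ward_increase(cluster_a, cluster_b)
inc_ac = ward_increase(cluster_a, cluster_c)
print(inc_ab, inc_ac)

# Cross-check: the increase equals n_a * n_b / (n_a + n_b) times the squared
# Euclidean distance between the two cluster means.
closed_form_ab = (2 * 3 / (2 + 3)) * squared_euclidean(centroid(cluster_a), centroid(cluster_b))
```

At each step Ward's method merges the pair with the smallest increase, so in this made-up example clusters a and c would be joined before either is joined with b.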
Table 3 The result of flood classification
Table 1 Historical flood processes of Yichang station
Table 2 Variance explained
Note: Hm, Qm, W3d, W7d, and W15d respectively represent the maximum flood level, the peak flow, and the 3-day, 7-day, and 15-day flood volumes.
Note: Prin1 and Prin2 respectively represent the first principal component and the second principal component.
Fig. 1 Scatter diagram of principal component scores
2. Improve and apply
The flood samples have a significant influence on the classification. Because the flood samples were limited, this paper presents only a preliminary analysis. With better samples, the classification and description of flood types can become more accurate and effective.
The choice of indexes for flood classification also deserves further study and discussion.
The mean value method overcomes the disadvantages of traditional PCA and effectively improves the dimensionless method.
The proposed model is general and can be applied to other similar systems.
The IPCA-HCA model fully considers the influence of multiple factors on flood classification, has a clear principle and simple calculation, and can be regarded as a new approach to flood classification.