Intelligent Disease Identification based on Discriminant Analysis of Clinical Data

Discriminant analysis was applied as a method of disease identification, using data obtained from blood analysis of several patients. The investigated compounds in human blood samples were organic compounds of clinical interest (glucose, triglycerides, cholesterol, creatinine and urea), inorganic compounds (Na, K, Ca, Mg and Fe) and enzymes (Lactate Dehydrogenase (LDH), Alanine Transaminase (ALT), Aspartate Aminotransferase (AST), Alkaline Phosphatase (ALP) and Gamma Glutamyltransferase (GGT)). According to their concentration levels, the following diseases were selected for study: hydroelectric disorders, hepatic diseases, lipid disorders, diabetes and renal disorders. Some patients were found to be healthy. Discriminant analysis was used not only to classify the patients according to their disease but also to detect the most important variables that discriminate between the groups. For example, the greatest contribution to the discriminatory power of the model was found to come from glucose (λ* = 0.263; F = 44). The results confirm that clinical analysis combined with the multidimensional interpretation of data offers an interesting and very useful approach to disease correlation, interpretation, problem solving and cost effectiveness.

Modern science and technology have revolutionized the possibilities for rational and objective diagnosis and treatment of all kinds of diseases. Together with ultrasonic, X-ray and nuclear magnetic resonance-based methods, clinical laboratory tests provide a sensitive and objective indicator of the patient's condition. Since laboratory tests provide quantitative, reproducible and specific results, they are, and must be, an integral part of the diagnosis, therapy control and management of a patient's disease [1][2][3][4][5][6][7].
The primary goal of a clinical chemistry laboratory is to correctly perform analytical procedures that yield accurate and precise information to aid in patient diagnosis. In order to achieve reliable results, laboratory staff must be able to use basic supplies and equipment correctly and must understand the fundamental concepts critical to any chemical test [8].
The volume of data generated by the clinical chemistry laboratory is enormous and must be summarized to be most useful to the analyst and clinician. Consequently, multivariate chemometric methods become necessary as a way to handle very large and complex data sets [9][10][11]. Principal Components Analysis [12,13], Cluster Analysis [14] and Discriminant Analysis [14][15][16][17][18] are the most widely applied multivariate methods for data processing and maximum information extraction.
In this paper, discriminant analysis was applied as a method of disease identification, using data obtained from blood analysis of several patients. The investigated compounds in human blood samples were organic compounds of clinical interest (glucose, triglycerides, cholesterol, creatinine and urea), inorganic compounds (Na, K, Ca, Mg and Fe) and enzymes (Lactate Dehydrogenase (LDH), Alanine Transaminase (ALT), Aspartate Aminotransferase (AST), Alkaline Phosphatase (ALP) and Gamma Glutamyltransferase (GGT)). According to their concentration levels, the following diseases were selected for study: hydroelectric disorders, hepatic diseases, lipid disorders, diabetes and renal disorders. Some patients were found to be healthy.
Dedicated to Prof. P.T. Frangopol on the occasion of his 75th anniversary.

Discriminant function analysis
Discriminant function analysis, or simply discriminant analysis (DA), is based on the extraction of linear discriminant functions of several quantitative independent variables, with the groups defined by a qualitative dependent variable [9,10,15,16]. DA can be formulated as follows: let $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^p$ be a finite set of characteristic vectors, where $n$ is the number of samples (measurements) and $p$ is the number of original variables (predictors), and let $y$ be a nominal characteristic (grouping variable) with $k$ values, each of which characterizes one of the $k$ groups composing the partition of the given data set. The partition of $X$ into $k$ groups is computationally very similar to analysis of variance (ANOVA/MANOVA), sharing many of the same assumptions and tests; the most important variables are selected, and variables contributing only marginally to the discrimination of groups are removed. As in principal component analysis [9][10][11][12][13], the total variance/covariance matrix is calculated first, according to the expression
$$V = X^{T} D X, \tag{1}$$
where $X$ is the centered data matrix, $X^{T}$ is its transpose and $D$ is a diagonal matrix (in most cases the unity matrix).
Considering a new characteristic defined as $c = Xu$, one can calculate its variance by applying the relation
$$\|c\|^{2} = c^{T} D c = u^{T} X^{T} D X u = u^{T} V u. \tag{2}$$
The total variance $V$ may be decomposed into two components, the between-group variance $B$ and the within-group variance $W$, namely
$$V = B + W, \tag{3}$$
and, as a consequence, the variance of the characteristic $c$ becomes
$$\|c\|^{2} = u^{T} V u = u^{T} B u + u^{T} W u. \tag{4}$$
Dividing by $u^{T} V u$, equation (4) can be rewritten in the form
$$\frac{u^{T} B u}{u^{T} V u} + \frac{u^{T} W u}{u^{T} V u} = 1, \tag{5}$$
and, because both terms on the left-hand side are non-negative and sum to one, maximizing the first ratio gives the same result as minimizing the second.
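The decomposition of the total variance into between-group and within-group parts can be checked numerically. The sketch below is only an illustration: it assumes D is the unity matrix, uses synthetic data rather than the patient data of this study, and the function name `scatter_matrices` is hypothetical.

```python
import numpy as np

def scatter_matrices(X, y):
    """Return total (V), between-group (B) and within-group (W) matrices.

    Assumes D is the unity matrix, so V = X^T X for the centered data.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    Xc = X - X.mean(axis=0)              # centered data matrix
    V = Xc.T @ Xc                        # total variance, equation (1)
    B = np.zeros_like(V)
    W = np.zeros_like(V)
    for g in np.unique(y):
        Xg = Xc[y == g]
        m = Xg.mean(axis=0)              # group centroid relative to the grand mean
        B += len(Xg) * np.outer(m, m)    # between-group contribution
        R = Xg - m                       # deviations from the group centroid
        W += R.T @ R                     # within-group contribution
    return V, B, W

# Synthetic example: 3 groups, 4 variables (illustrative only)
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 20)
X = rng.normal(size=(60, 4)) + y[:, None] * np.array([2.0, 0.0, 1.0, 0.0])
V, B, W = scatter_matrices(X, y)
print(np.allclose(V, B + W))             # numerical check of V = B + W
```

The cross terms between centroids and within-group deviations vanish for each group, which is why the two parts add up exactly to the total.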
In practice, the first ratio in equation (5) is maximized,
$$\max_{u} \frac{u^{T} B u}{u^{T} V u}, \tag{6}$$
which, in a different form, is a generalized eigenvalue problem:
$$B u = \lambda V u. \tag{7}$$
Let us recall that the matrix $V$ of the total variance is symmetric and positive definite. As such, this equation may be rewritten as a matrix equation similar to that obtained in principal component analysis,
$$V^{-1} B u = \lambda u, \tag{8}$$
where $\lambda$ and $u$ represent the eigenvalues (also known as characteristic roots) and eigenvectors of the matrix $V^{-1}B$. The vector $u_1$, named the first discriminant factor, corresponds to the highest value of $\lambda$; the higher this value, the higher the discriminant power of this factor. After obtaining the first discriminant characteristic $c_1 = Xu_1$, the discriminant characteristic $c_2 = Xu_2$, uncorrelated with the first, can be obtained in a similar way, and so on. The eigenvectors of the matrix $V^{-1}B$, namely $u_1, u_2, \ldots, u_{k-1}$, ranked in decreasing order of the positive values $\lambda_1, \lambda_2, \ldots, \lambda_{k-1}$, are the successive solutions of the above matrix equation.
If the vector of the discriminant function is $u = (u_1, u_2, \ldots, u_p)$, then the projection of sample $x_i$ on this axis represents its distance to the origin:
$$c_i = x_i^{T} u = \sum_{j=1}^{p} u_j x_{ij}. \tag{9}$$
The vectors $u$ are called the discriminant factors and the vectors $c$ represent the discriminant scores. The linear function described by equation (9) is called the discrimination function.
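A minimal sketch of how the discriminant factors and scores could be computed is given below. It assumes D is the unity matrix and a positive definite V, and it solves the generalized eigenvalue problem through a Cholesky factorization of V, which is one common numerical route and not necessarily the one used by the software in this study. The function name `discriminant_factors` and the synthetic data are hypothetical.

```python
import numpy as np

def discriminant_factors(X, y):
    """Solve B u = lambda V u; return eigenvalues, factors U and scores C = X U."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    groups = np.unique(y)
    Xc = X - X.mean(axis=0)                      # centered data matrix
    V = Xc.T @ Xc                                # total variance (D = unity matrix)
    B = np.zeros_like(V)
    for g in groups:
        Xg = Xc[y == g]
        m = Xg.mean(axis=0)
        B += len(Xg) * np.outer(m, m)            # between-group variance
    # V = L L^T turns the problem into a symmetric one:
    # (L^-1 B L^-T) z = lambda z, with u = L^-T z.
    L = np.linalg.cholesky(V)
    A = np.linalg.solve(L, np.linalg.solve(L, B).T)
    lam, Z = np.linalg.eigh(A)                   # eigenvalues in ascending order
    n_factors = min(len(groups) - 1, X.shape[1])
    order = np.argsort(lam)[::-1][:n_factors]    # keep the largest k-1 roots
    lam = lam[order]
    U = np.linalg.solve(L.T, Z[:, order])        # discriminant factors
    return lam, U, Xc @ U                        # scores follow the projection c = X u

# Synthetic example: 3 groups, 4 variables (illustrative only)
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 20)
X = rng.normal(size=(60, 4)) + y[:, None] * np.array([2.0, 0.0, 1.0, 0.0])
lam, U, C = discriminant_factors(X, y)
print(lam)                                       # roots in decreasing order
print(np.allclose(C.T @ C, np.eye(len(lam))))    # scores are mutually uncorrelated
```

With this normalization each factor satisfies $u^{T} V u = 1$, so the score vectors come out uncorrelated, and every root of $V^{-1}B$ lies between 0 and 1.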
Finally, it should be emphasized that, although the power of discrimination does not depend on standardization of the data, standardized data are generally used.
The quality of discrimination and the selection of the most discriminant independent variables can be evaluated by applying different criteria. The Wilks' lambda F test is used to test whether the discriminant model as a whole is significant; the smaller the lambda, the more likely it is significant. The λ* statistic defined by equation (10) can be used for the same purpose. The smaller the value of λ*, the more discriminating the model.
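For orientation, the standard Wilks' lambda of the whole model can be computed as the ratio of the determinants of the within-group and total matrices, det(W)/det(V). The sketch below shows this textbook form on synthetic data; it is not a reconstruction of the λ* statistic of equation (10), and the function name `wilks_lambda` is hypothetical.

```python
import numpy as np

def wilks_lambda(X, y):
    """Standard Wilks' lambda, det(W) / det(V), with D taken as the unity matrix."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    Xc = X - X.mean(axis=0)
    V = Xc.T @ Xc                               # total variance
    W = np.zeros_like(V)
    for g in np.unique(y):
        R = X[y == g] - X[y == g].mean(axis=0)  # deviations from the group centroid
        W += R.T @ R                            # within-group variance
    return np.linalg.det(W) / np.linalg.det(V)

# Synthetic example: well separated groups versus randomly shuffled labels
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 20)
X = rng.normal(size=(60, 4)) + y[:, None] * np.array([2.0, 0.0, 1.0, 0.0])
lam_true = wilks_lambda(X, y)
lam_shuffled = wilks_lambda(X, rng.permutation(y))
print(lam_true, lam_shuffled)
```

The value obtained with the true grouping is much closer to 0 than the value obtained with shuffled labels, which stays close to 1, in line with the interpretation that smaller values indicate better discrimination.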
The contribution of the independent variables to the discrimination of groups can be assessed either by testing the homogeneity of the classes with the F statistic, as in the ANOVA/MANOVA method, or by using Wilks' lambda for each variable. Wilks' lambda is the standard statistic used to express the significance of the overall discriminatory power of the variables in the model. A value of 1 indicates no discriminatory power at all, whereas 0 indicates perfect discriminatory power. The partial Wilks' lambda describes the unique contribution of each variable to the discriminatory power of the model. The closer the partial lambda is to 0, the better the discriminatory force of the variable. In addition, the tolerance value gives information on the redundancy of the respective variable in the model, and is computed as 1 minus the R-square of the respective variable with all other variables included in the model. In other words, it is the proportion of the variance that is unique to the respective variable. If a variable is completely redundant, its tolerance value approaches zero.
This kind of information can also be obtained from the values of the discriminant coefficients associated with the p descriptive variables, and from the correlation coefficients between each variable and the score vector. The larger the discriminant coefficient and the closer the correlation coefficient is to 1, the greater the importance of the variable for separating the samples into the defined groups. In addition, the standardized discriminant coefficients, like beta weights in regression, are used to assess the relative classifying importance of the independent variables.
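The tolerance described above can be sketched as follows. This example regresses each variable on all the others by ordinary least squares on the raw data matrix, which is an assumption: statistical packages may compute the R-square from the pooled within-group matrix instead. The function name `tolerance` and the synthetic data are hypothetical.

```python
import numpy as np

def tolerance(X):
    """Tolerance (1 - R^2) of each column regressed on all the other columns."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # centering removes the intercept
    tol = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        others = np.delete(Xc, j, axis=1)
        coef, *_ = np.linalg.lstsq(others, Xc[:, j], rcond=None)
        resid = Xc[:, j] - others @ coef
        tol[j] = (resid @ resid) / (Xc[:, j] @ Xc[:, j])   # residual share = 1 - R^2
    return tol

# Synthetic example: column 2 is almost a sum of columns 0 and 1 (redundant),
# column 3 is independent of the rest.
rng = np.random.default_rng(1)
a, b, d = rng.normal(size=(3, 200))
c = a + b + 0.01 * rng.normal(size=200)
X = np.column_stack([a, b, c, d])
tol = tolerance(X)
print(np.round(tol, 4))
```

The redundant column receives a tolerance close to zero, while the independent column keeps a tolerance close to one.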

Experimental part
The studied parameters were analyzed by spectrophotometric and electrochemical methods using the Johnson & Johnson complex chemical system. Sodium and potassium were measured by a potentiometric method using an ion-selective electrode. Atomic Absorption Spectrometry was used to measure the concentrations of Ca, Mg and Fe. The Automatic Clinical Analyzer uses complexometric titration with methylthymol blue and a chelating agent to remove Ca interference in its Mg analysis. The concentration of Ca was determined using the same principle as for Mg. Many automated methods for total calcium are based on the complexometric reaction between Ca and orthocresolphthalein complexone, often with 8-hydroxyquinoline added to prevent Mg interference.
Molecular Absorption Spectrometry was used to measure the concentrations of glucose, urea, creatinine, cholesterol, triglycerides, and enzymes.

Results and Discussions
Discriminant Analysis was applied as a method of disease identification using data obtained from blood analysis of several patients. The compounds determined and used to establish the diagnosis are presented in Table 1, together with the normal range of their concentrations in human blood.
The training data set consisted of 100 patients and the following 16 characteristics: age, Na, K, Ca, Mg, Fe, glucose, creatinine, urea, cholesterol, triglycerides, ALT, AST, LDH, GGT and ALP. Statistical information on the 16 measured variables is presented in Table 2.
After application of the standard DA to the data matrix, the variables presented in Table 3 were retained in the model. The statistics in this table illustrate the contribution of the components present in human blood to the discrimination, according to different parameters.
A canonical correlation analysis was carried out to determine the successive functions and canonical roots. The maximum number of functions is equal to the number of groups minus one or to the number of variables in the analysis, whichever is smaller. The eigenvalues (characteristic roots) and the corresponding standardized canonical discriminant function coefficients are shown in Table 4.

Table 4 STANDARDIZED COEFFICIENTS FOR CANONICAL VARIABLES
The first function presented a relatively high eigenvalue (6.351). The eigenvalue drops to 5.442 for the second axis, and further to 3.896 for the third axis.
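As a small worked example, the maximum number of discriminant functions and the canonical correlations associated with the three quoted roots can be computed as below. Two assumptions are made here and are not stated in the text: that the six classes and the 16 measured characteristics enter the count (the number of variables retained in Table 3 would give the same result as long as it is at least five), and that the reported roots follow the convention, common in statistical packages, of eigenvalues of $W^{-1}B$, since roots larger than 1 cannot be eigenvalues of $V^{-1}B$.

```python
import math

n_groups = 6       # Healthy, Hydroelectric, Hepatic, Lipid, Diabetes, Renal
n_variables = 16   # measured characteristics in the training set
max_functions = min(n_groups - 1, n_variables)

# First three roots quoted in the text (assumed to be eigenvalues of W^-1 B)
reported_roots = [6.351, 5.442, 3.896]

def canonical_correlation(root):
    """Canonical correlation for a root mu of W^-1 B: sqrt(mu / (1 + mu))."""
    return math.sqrt(root / (1.0 + root))

print(max_functions)
for mu in reported_roots:
    # mu / (1 + mu) is the matching eigenvalue of V^-1 B under this assumption
    print(mu, round(mu / (1.0 + mu), 3), round(canonical_correlation(mu), 3))
```

Under this assumption, all three roots correspond to canonical correlations of roughly 0.9, which is in line with the good separation reported for the first three roots.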
A few remarks are in order. We notice an excellent discriminating power for the first root. A projection of all the data on the axis of root 1 shows a clear separation of the Renal and Lipid classes, with all the others grouped together. Similarly, root 2 discriminates between (a) Hepatic, (b) Lipid and Renal, and (c) the remaining classes. Root 3 shows a clear separation of the Diabetes class and a good separation of the Hepatic class, with all the others grouped together.
On the other hand, combinations of two roots discriminate better than the individual roots by themselves. For instance, the projection on roots 1 and 3 shows a clear separation of four of the six classes (Renal, Lipid, Diabetes and Hepatic), with a good separation of the remaining two classes (Healthy and Hydroelectric).
From a different perspective, the apparent intertwining of the Healthy and Hydroelectric classes is consistent with the observation that hydroelectric disorders are regularly found in people considered generally healthy.
The result of this procedure is the classification matrix, which shows the number of cases that were correctly classified (on the diagonal of the matrix) and those that were misclassified.
The classification matrix presented in Table 5 indicates a satisfactory separation of the patients, in good agreement with their actual groups. For example, 100% of the patients in the group with lipid disorders were correctly classified. The group with renal disorders also showed a good separation (90%). The poorest classification was obtained for the patients with hydroelectric disorders (70%). Other representative results can be obtained from the case classification table, which describes the group membership of the cases.
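The construction of a classification matrix and of the per-group percentages of correct classification can be sketched as follows. The labels and predictions below are made up for illustration and do not reproduce Table 5; the function names are hypothetical.

```python
def classification_matrix(actual, predicted, labels):
    """Rows are actual groups, columns are predicted groups."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def percent_correct(matrix):
    """Percentage of correctly classified cases per actual group (diagonal / row total)."""
    return [100.0 * row[i] / sum(row) for i, row in enumerate(matrix)]

# Illustrative data only: 12 hypothetical cases in three groups
labels = ["Healthy", "Lipid", "Renal"]
actual = ["Lipid"] * 4 + ["Renal"] * 5 + ["Healthy"] * 3
predicted = ["Lipid"] * 4 + ["Renal"] * 4 + ["Healthy"] + ["Healthy", "Healthy", "Lipid"]

matrix = classification_matrix(actual, predicted, labels)
for label, row, pct in zip(labels, matrix, percent_correct(matrix)):
    print(label, row, round(pct, 1))
```

Correctly classified cases accumulate on the diagonal, so each group's percentage is its diagonal entry divided by its row total.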

Conclusions
The results confirm that clinical analysis combined with the multidimensional interpretation of data offers an interesting and very useful approach to disease correlation, interpretation, problem solving and cost effectiveness.
It is worth pointing out the role that isolated data points play in the method. A future study should address robust methods of Discriminant Analysis based on weighted contributions of the data samples, possibly through the use of fuzzy set theory.

Table 1
NORMAL VALUES OF THE STUDIED COMPOUNDS IN HUMAN BLOOD

Table 2
STATISTICS OF THE MEASURED VARIABLES