Variable selection of near-infrared spectroscopy for measuring wheat protein based on MC-LPG
-
Abstract
To realize nondestructive determination of protein content in wheat, simplify the prediction model of portable wheat protein detection devices, and improve model prediction accuracy, the near-infrared diffuse transmission-reflectance spectra of wheat were measured from 950 to 1690 nm. Wavelength variables were selected by combining Monte Carlo Sampling (MCS) with the Latent Projective Graph (LPG) method. The LPG is another expression of the principal component projective graph, a technique developed in Chemical Factor Analysis (CFA) for investigating the nature of hyphenated data. Principal Component Analysis (PCA) yields the latent variables (loadings) of a data matrix and the projections of the objects onto those latent variables (scores). The nature of the data matrix can then be analyzed from the loading and score plots, because the latent variables are linear combinations of the measured variables and the projection uniquely defines the sample relations in the reduced variable space spanned by the latent variables. The LPG is therefore adopted for wavelength selection in Near-Infrared (NIR) spectral analysis: the loading matrix is used to describe the relationships among samples, and the score matrix is used to select the wavelength variables. In Model Population Analysis (MPA), sub-datasets are first drawn by MCS, and a sub-model is then built for each sub-dataset. Finally, the parameters that contribute to sub-model building are analyzed statistically across the sample space, variable space, parameter space, and model space. Following MPA, 500 sub-datasets of samples were generated by MCS. In each sub-dataset, the ratio of calibration to prediction samples is 2:1, with 61 kinds of wheat used for calibration and 32 kinds for prediction.
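The Monte Carlo sampling step described above can be illustrated with a minimal sketch. It assumes that each sub-dataset is a random partition, without replacement, of 93 samples into 61 calibration and 32 prediction samples; the function name `monte_carlo_splits`, the fixed seed, and the partition scheme are illustrative assumptions, not details given in the abstract.

```python
import random


def monte_carlo_splits(n_samples=93, n_calibration=61, n_splits=500, seed=0):
    """Draw random calibration/prediction partitions of the sample indices.

    Assumption: each split is a shuffle of all indices, with the first
    n_calibration indices used for calibration and the rest for prediction.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    splits = []
    for _ in range(n_splits):
        shuffled = indices[:]          # copy so the base order is untouched
        rng.shuffle(shuffled)
        calibration = sorted(shuffled[:n_calibration])
        prediction = sorted(shuffled[n_calibration:])
        splits.append((calibration, prediction))
    return splits


splits = monte_carlo_splits()
first_calibration, first_prediction = splits[0]
```

Each element of `splits` holds the sample indices for one sub-dataset, so a sub-model can be calibrated on the first list and evaluated on the second.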
The LPG was obtained by PCA. Assuming that spectral variables that are linearly arranged in the LPG contribute equally to modeling, a small number of wavelength variables were selected to build 500 predictive sub-models, and the 458 sub-models with a root mean square error of prediction (RMSEP) below 0.55 were retained. The selection frequency of each variable across the 458 sub-models was analyzed statistically, and the 12 wavelengths with the highest frequency were chosen as the influential variables (IVs): 1060, 1094, 1403, 1494, 1511, 1521, 1545, 1551, 1607, 1612, 1620, and 1630 nm. With the new model built on the IVs, the RMSEP of the prediction model decreased from 0.5245 to 0.2548 and the RPD value increased from 1.7496 to 3.3985. Variable selection combining Monte Carlo Sampling with the Latent Projective Graph method (MC-LPG) is therefore feasible for improving the precision of the prediction model.
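The filtering and frequency-counting step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the helper names `rmsep` and `influential_variables`, the toy input, and the tie-breaking rule (lower wavelength first) are hypothetical, and the LPG-based selection inside each sub-model is not reproduced here.

```python
import math
from collections import Counter


def rmsep(y_true, y_pred):
    """Root mean square error of prediction over paired values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def influential_variables(sub_models, rmsep_threshold=0.55, n_top=12):
    """Rank wavelengths by how often retained sub-models selected them.

    sub_models: list of (rmsep_value, selected_wavelengths) pairs.
    Returns the n_top most frequent wavelengths (sorted ascending) and
    the number of sub-models retained below the RMSEP threshold.
    """
    retained = [wl for err, wl in sub_models if err < rmsep_threshold]
    # Count each wavelength at most once per retained sub-model.
    counts = Counter(w for wl in retained for w in set(wl))
    # Assumed tie-break: higher count first, then lower wavelength.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return sorted(w for w, _ in ranked[:n_top]), len(retained)


# Toy input (made-up values for illustration only).
toy = [
    (0.40, [1060, 1094, 1403]),
    (0.50, [1060, 1403, 1511]),
    (0.70, [1200, 1300, 1400]),   # rejected: RMSEP is not below 0.55
    (0.30, [1060, 1094, 1620]),
]
ivs, n_retained = influential_variables(toy, n_top=2)
```

With the defaults `rmsep_threshold=0.55` and `n_top=12`, the same function mirrors the thresholds quoted in the abstract, given one `(RMSEP, wavelengths)` pair per sub-model.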