Application of Entropy Theory and Principal Component Analysis to Determine Input Variables for Estimating Solar Radiation using Machine Learning Algorithms

Document Type : Full length article

Authors

1 Department of Water Sciences and Engineering, Faculty of Agriculture and Natural Resources, Ardakan University, Ardakan, Iran

2 Department of Computer Engineering, Faculty of Engineering, Ardakan University, Ardakan, Iran

10.22059/jphgr.2025.386916.1007858

Abstract

Solar radiation is a key input to energy balance models and plant growth simulations. This research evaluates Principal Component Analysis (PCA) and Shannon Entropy Theory (ENT) as methods for determining the inputs of six machine learning models, Random Forest (RF), Linear Regression (LR), Support Vector Regression (SVR), K-Nearest Neighbors (KNN), Decision Tree (DT), and XGBoost (XGB), used to estimate solar radiation at the Yazd synoptic station between 2006 and 2023. Daily data for average temperature, minimum temperature, maximum temperature, sunshine hours, relative humidity, and solar radiation were obtained from the Meteorological Organization. Extraterrestrial radiation, the relative Earth-Sun distance, the solar declination angle, and maximum sunshine hours were calculated using existing formulas and used as inputs to the pre-processing methods. The machine learning algorithms estimated solar radiation with acceptable accuracy. Reducing the dimensionality of the input data with PCA increased model accuracy: the PCA-SVR model gave the best result at the Yazd station, with a coefficient of determination of 0.923 and an accuracy of 92.84%. The Shannon entropy method, by contrast, did not improve the modeling results relative to modeling without pre-processing. These results show that dimensionality reduction combined with an appropriate model can yield greater accuracy and lower computational complexity in prediction problems, but the pre-processing method for the initial data must be chosen with care.
Extended Abstract
Introduction
Meteorological and hydrological systems are complex: not all influential parameters can be identified, and statistical records are often incomplete, so these systems cannot be modeled completely. Under such conditions, system modeling based on mathematical relationships is an attractive alternative. Solar radiation is one of the most important meteorological variables for estimating evapotranspiration and plant water requirements, and it is the energy source for all atmospheric and surface processes. Although this variable has been measured in Iran for a relatively long time, measuring instruments are expensive, so many stations in the country lack a radiometer or pyranometer or suffer from problems such as poor calibration and the accumulation of water and dust on the sensor. Even at weather stations that measure radiation, there are days when no data are recorded or when equipment malfunctions or other issues produce unrealistic values outside the expected range. Moreover, so many factors affect solar radiation that not all of them can be included in the relevant equations; as a result, empirical and semi-empirical equations use only a limited number of these variables. In recent years, many researchers have therefore turned to data mining methods and mathematical modeling to estimate solar radiation.
 
Methodology
The data used in this research are daily climatic variables measured at the Yazd synoptic station from 2006 to 2023. The station is located at 31.8974° N, 54.3569° E, at an altitude of 1216 m above sea level. The average solar and extraterrestrial radiation at the station are 19.35 and 32 MJ m⁻² day⁻¹, respectively. The ratio of sunshine hours to maximum possible sunshine hours is 0.75, the average relative humidity is 27%, and the average temperature is 28°C. Data from 2006 to 2014 were used for calibration, and data from 2015 to 2023 were used for evaluating the results. Extraterrestrial radiation and maximum daily sunshine hours, which depend on the geographical latitude and the day number in the Gregorian calendar, were calculated using the relationships presented by Duffie and Beckman (1991). The study evaluates Principal Component Analysis (PCA) and Shannon Entropy (ENT) for determining the input variables of the Random Forest (RF), Linear Regression (LR), Support Vector Regression (SVR), K-Nearest Neighbors (KNN), Decision Tree (DT), and XGBoost (XGB) machine learning models in estimating solar radiation. Daily data for mean temperature, minimum temperature, maximum temperature, sunshine hours, relative humidity, and solar radiation were obtained from the Meteorological Organization. Extraterrestrial radiation, the relative Earth-Sun distance, the solar declination angle, and maximum sunshine hours were calculated using existing relationships and used as inputs to the pre-processing methods.
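
The astronomical inputs named above can be computed from latitude and day number alone. The sketch below is illustrative only: it uses the widely cited FAO-56 forms of these relations, which may differ in detail from the Duffie and Beckman (1991) formulation used in the study. The function name solar_geometry, the constant GSC, and the choice of day 172 are assumptions made for this example.

```python
import math

# Solar constant in MJ m^-2 min^-1 (FAO-56 value; an assumption for this sketch)
GSC = 0.0820


def solar_geometry(lat_deg, day_of_year):
    """Return (dr, delta, n_max, ra) for a latitude in degrees and a day number.

    dr     relative Earth-Sun distance (dimensionless)
    delta  solar declination angle (radians)
    n_max  maximum possible sunshine hours (hours)
    ra     extraterrestrial radiation (MJ m^-2 day^-1)
    """
    phi = math.radians(lat_deg)
    angle = 2.0 * math.pi * day_of_year / 365.0

    dr = 1.0 + 0.033 * math.cos(angle)
    delta = 0.409 * math.sin(angle - 1.39)

    # Sunset hour angle; the clamp keeps acos defined at very high latitudes
    cos_ws = max(-1.0, min(1.0, -math.tan(phi) * math.tan(delta)))
    ws = math.acos(cos_ws)

    n_max = 24.0 / math.pi * ws
    ra = (24.0 * 60.0 / math.pi) * GSC * dr * (
        ws * math.sin(phi) * math.sin(delta)
        + math.cos(phi) * math.cos(delta) * math.sin(ws)
    )
    return dr, delta, n_max, ra


# Latitude of the Yazd station as given in the text; day 172 is near the June solstice
YAZD_LAT = 31.8974
dr, delta, n_max, ra = solar_geometry(YAZD_LAT, 172)
print(f"dr={dr:.3f}, delta={delta:.3f} rad, N={n_max:.1f} h, Ra={ra:.1f} MJ/m2/day")
```

Dividing measured sunshine hours by n_max gives the sunshine ratio reported for the station, and ra serves as an upper bound when screening measured radiation for unrealistic values.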
 
Results and discussion
In the training phase, all of the models trained well and gave acceptable results. In the testing phase, modeling with the raw input data (without pre-processing) also gave satisfactory results for every model: the coefficient of determination ranged from 0.790 for KNN to 0.893 for SVR. Considering all evaluation metrics, Support Vector Regression (SVR) predicted solar radiation better than the other models, with RMSE = 1.732, MSE = 0.003, MAE = 0.826, R² = 0.893, and an accuracy of 90.75%. When Principal Component Analysis (PCA) was used for dimensionality reduction, the first principal component accounted for approximately 49% of the variance and the second for approximately 36%. Together, the first two principal components captured over 85% of the variability in the original data, so these two components were used as inputs to the predictive models. In training, the PCA-DT and ENT-DT models gave the best fit at the Yazd station, with zero mean squared error, zero mean absolute percentage error, and a coefficient of determination of 1.00. A perfect training fit of this kind is typical of a decision tree that reproduces its training data and does not by itself indicate good performance on unseen data. In testing, the PCA-SVR model outperformed the other methods, with a coefficient of determination of 0.923, an accuracy of 92.84%, and the lowest error metrics among the models at the Yazd station. The ENT-DT model, with a coefficient of determination of 0.535 and an accuracy of 79.34%, gave weaker results than the other models used at the Yazd station.
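
As a rough illustration of the quantities reported above, the following sketch defines RMSE, MAE, and the coefficient of determination, and shows how a cumulative-variance threshold picks the number of principal components. The function names, the 0.85 threshold, and the variance ratios are hypothetical: only the first two ratios are chosen to be consistent with the approximate 49% and 36% reported in the text, and the abstract does not state which definition of R² or of accuracy was used.

```python
import math


def rmse(obs, pred):
    """Root mean squared error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def r_squared(obs, pred):
    """Coefficient of determination as 1 - SS_res / SS_tot (one common definition)."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def components_needed(explained_ratios, threshold=0.85):
    """Smallest number of leading components whose cumulative variance reaches threshold."""
    total = 0.0
    for k, ratio in enumerate(explained_ratios, start=1):
        total += ratio
        if total >= threshold:
            return k
    return len(explained_ratios)


# Hypothetical explained-variance ratios, sorted from largest to smallest
ratios = [0.493, 0.362, 0.080, 0.040, 0.025]
print("components kept:", components_needed(ratios))

# Hypothetical observed and predicted radiation values (MJ m^-2 day^-1)
observed = [18.2, 21.5, 25.1, 27.3, 12.4]
predicted = [17.9, 22.0, 24.6, 26.8, 13.5]
print(rmse(observed, predicted), mae(observed, predicted), r_squared(observed, predicted))
```

Computing the same metrics separately on the training and testing periods makes a gap such as the one seen for the decision tree models easy to spot.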
 
Conclusion
Given the importance of accurate solar radiation estimation in hydrological studies and the need for advanced estimation methods, this research used Principal Component Analysis (PCA) and entropy theory for data pre-processing. The inputs of the estimation models were identified using these two methods, and modeling was performed with the Random Forest (RF), Linear Regression (LR), Support Vector Regression (SVR), K-Nearest Neighbors (KNN), Decision Tree (DT), and XGBoost (XGB) models. The entropy theory results indicated that, at the Yazd station, solar declination angle, minimum temperature, minimum relative humidity, and average relative humidity were the effective variables for estimating solar radiation. PCA reduced the input variables to two principal components, and modeling was performed using these two derived inputs. Overall, the PCA-SVR model outperformed the other models in estimating solar radiation, and PCA pre-processing provided better inputs for the estimation models. The Shannon entropy method, however, did not improve the modeling results compared with modeling without pre-processing. These results show that dimensionality reduction combined with an appropriate model can lead to higher accuracy and lower computational complexity in prediction problems, but care must be taken when selecting the pre-processing method for the initial data. Similar research using new data or different geographical conditions could help further validate the results.
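
The abstract does not state exactly how entropy theory was used to rank the candidate variables. One common approach, sketched here purely as an assumption, is to discretize each variable into equal-width bins, compute Shannon entropy, and rank inputs by the information they share with the target (mutual information). The function names, the bin count, and the toy series are all hypothetical.

```python
import math
from collections import Counter


def discretize(values, bins=10):
    """Map continuous values to equal-width bin indices in [0, bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # a constant series falls in a single bin
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def entropy(labels):
    """Shannon entropy in bits of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mutual_information(x, y, bins=10):
    """H(X) + H(Y) - H(X, Y) after equal-width binning of both series."""
    xb, yb = discretize(x, bins), discretize(y, bins)
    return entropy(xb) + entropy(yb) - entropy(list(zip(xb, yb)))


# Toy data: one input that fully determines the target and one that carries no information
target = [float(i % 4) for i in range(100)]
candidates = {
    "informative": [2.0 * t for t in target],
    "constant": [1.0] * len(target),
}
scores = {name: mutual_information(series, target, bins=4)
          for name, series in candidates.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

Unlike PCA, a ranking of this kind keeps original variables rather than combining them, so correlated inputs can all score highly while adding little new information to the model.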
 
Funding
There is no funding support.
 
Authors’ Contribution
In this study, the authors' contributions are as follows: Somayeh Soltani-Gardfaramarzi was responsible for the study design, data collection, analysis, writing the initial draft, and final editing of the article, and Mojgan Askarizadeh was responsible for modeling and results.
 
Conflict of Interest
The authors declare no conflict of interest.
 
Acknowledgments
We are grateful to all the scientific consultants of this paper.


  1. Abdelhafidi, N., Bachari, N.E.I., & Abdelhafidi, Z. (2021). Estimation of solar radiation using stepwise multiple linear regression with principal component analysis in Algeria. Meteorology and Atmospheric Physics, 133(2), 205-216. DOI: 10.1007/s00703-020-00739-0.
  2. Avazpour, S., Bakhtiari, B., & Qaderi, K. (2019). Performance evaluation of Neural Network and Multivariate Regression Methods for Estimation of Total Solar Radiation at several stations in Arid and Semi-Arid Climates. Iranian Journal of Soil and Water Research, 50(8), 1855-1869. DOR: 20.1001.1.24234931.1399.7.2.3.2. [In Persian]
  3. Boroughani, M., Soltani, S., Ghezelseflu, N., & Pazhouhan, I. (2022). A comparative assessment between artificial neural network, neuro-fuzzy, and support vector machine models in splash erosion modelling under simulation circumstances. Folia Oecologica, 49(1), 23-34. DOI: 10.2478/foecol-2022-0003.
  4. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32. DOI: 10.1023/A:1010933404324.
  5. Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21-27. DOI: 10.1109/TIT.1967.1053964.
  6. Demir, V., & Citakoglu, H. (2023). Forecasting of solar radiation using different machine learning approaches. Neural Computing and Applications, 35(1), 887-906. DOI: 10.1007/s00521-022-07831-z.
  7. Djeldjeli, Y., Taouaf, L., Alqahtani, S., Mokaddem, A., Alshammari, B.M., Menni, Y., & Kolsi, L. (2024). Enhancing solar power forecasting with machine learning using principal component analysis and diverse statistical indicators. Case Studies in Thermal Engineering, 61, 104924. DOI: 10.1016/j.csite.2024.104924.
  8. Draper, N. R., & Smith, H. (1998). Applied regression analysis. Wiley-Interscience.
  9. Drucker, H., Burges, C. J. C., Kaufman, L., Smola, A., & Vapnik, V. (1997). Support vector regression machines. Advances in Neural Information Processing Systems, 9, 155-161. https://www.researchgate.net/publication/309185766 Support vector regression machines.
  10. Duffie, J.A., & Beckman, W.A. (1991). Solar Engineering of Thermal Processes. Wiley, New York.
  11. Hunt, L.A., Kuchar, L., & Swanton, C.J. (1998). Estimation of solar radiation for use in crop modeling. Agricultural and Forest Meteorology, 91, 293–300. DOI: 10.1016/S0168-1923(98)00085-4.
  12. Liu, C.W., Lin, K.H., & Kuo, Y.M. (2003). Application of factor analysis in the assessment of groundwater quality in a blackfoot disease area in Taiwan. Science of the Total Environment, 313, 77-89. DOI: 10.1016/S0048-9697(02)00683-6.
  13. Meenal, R., & Selvakumar, A.I. (2018). Assessment of SVM, empirical and ANN based solar radiation prediction models with most influencing input parameters. Renewable Energy, 121, 324-343. DOI: 10.1016/j.renene.2017.12.005.
  14. Mohammadi, B., Aghashariatmadari, Z., & Moazenzadeh, R. (2019). Determination of Input Variables to Estimate Solar Radiation Using Entropy Theory and Principal Component Analysis. Iranian Journal of Soil and Water Research, 50(3), 625-639. DOI: 10.22059/ijswr.2018.257150.667906. [In Persian]
  15. Mohammadi, B., & Emamgholizadeh, S. (2017). Using principal component analysis to inputs the effective rainfall estimates based on entries to help support vector machine and artificial neural network. Journal of Rainwater Catchment Systems, 4(4), 67-75. DOR: 20.1001.1.24235970.1395.4.4.6.9. [In Persian]
  16. Olalekan, S., Abdullahi, M. I., & Olabisi, A. (2018). Modeling of Solar Radiation Using Artificial Neural Network for Renewable Energy Application. Journal of Applied Physics, 10(2), 6-12. DOI: 10.9790/4861-1002030612.
  17. Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81-106. DOI: 10.1007/BF00116251.
  18. Radosevic, N., Duckham, M., Liu, G.J., & Sun, Q. (2020). Solar radiation modeling with KNIME and Solar Analyst: Increasing environmental model reproducibility using scientific workflows. Environmental Modelling & Software, 132, 104780. DOI: 10.1016/j.envsoft.2020.104780.
  19. Rahimikhoob, A. (2010). Estimating global solar radiation using artificial neural network and air temperature data in a semi-arid environment. Renewable Energy, 35, 2131-2135. DOI: 10.1016/j.renene.2010.01.013.
  20. Sabziparvar, A.A., & Shetaee, H. (2007). Estimation of global solar radiation in arid and semi-arid climates of East and West Iran. Energy, 32, 649–655. DOI: 10.1016/j.energy.2006.06.006.
  21. Saffaripour, M., & Mehrabian, M. (2009). Predicting the total amount of solar radiation in Kerman using geometric, astronomical, geographical and meteorological characteristics. Sharif, 51(1), 3-13. [In Persian]
  22. Saraswat, R., Jhanwar, D., & Gupta, M. (2024). Enhanced Solar Power Forecasting Using XG Boost and PCA-Based Sky Image Analysis. Traitement du Signal, 41(1). DOI: 10.18280/ts.410104.
  23. Shannon, C.E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27, 379–423, 623–656. DOI: 10.1002/j.1538-7305.1948.tb01338.x.
  24. Sheikholeslami, N., Ghahraman, B., Mosaedi, A., Davari, K., & Mohajeri, M. (2014). Estimating Reference Evapotranspiration by Using Principal Component Analysis (PCA) and The Development of a Regression Model (MLR-PCA) (Case Study: Mashhad Station). Water and Soil, 28(2), 420-429. DOI: 10.22067/jsw.v0i0.25711. [In Persian]
  25. Soltani-Gerdefaramarzi, S., Taghizadeh-Mehrjerdi, R., & Ghasemi, M. (2015). Prediction of Longitudinal Dispersion Coefficient in Natural Streams using Soft Computing Techniques. Iranian Journal of Soil and Water Research, 46(3), 385-394. DOI: 10.22059/ijswr.2015.56728. [In Persian]
  26. Soltani-Gerdefaramarzi, S., & Momeni, H. (2023). Application of machine learning algorithms to estimate solar radiation (case study: arid and semi-arid climate). Iranian Journal of Geophysics, 17(4), 25-39. DOI: 10.30499/ijg.2023.393259.1512. [In Persian]
  27. Soltani-Gerdefaramarzi, S. (2023). Prediction of solar radiation intensity in Yazd station by using regression model based on principal components (PCR). Journal of Agricultural Meteorology, 11(1), 6-16. DOI: 10.22125/agmj.2023.352446.1140. [In Persian]
  28. Xu, H., Xu, C.Y., Sælthun, N.R., Xu, Y., Zhou, B., & Chen, H. (2015). Entropy theory based multicriteria resampling of rain gauge networks for hydrological modelling – A case study of humid area in southern China. Journal of Hydrology, 525, 138-151. DOI: 10.1016/j.jhydrol.2015.03.034.
  29. Yadav, A. K., & Chandel, S. S. (2015). Solar energy potential assessment of western Himalayan Indian state of Himachal Pradesh using J48 algorithm of WEKA in ANN based prediction model. Renewable Energy, 75, 675-693. DOI: 10.1016/j.renene.2014.10.045.