Application of Entropy Theory and Principal Component Analysis to Determine Input Variables for Estimating Solar Radiation with Machine Learning Algorithms

Article type: Full paper

Authors

1 Department of Water Sciences and Engineering, Faculty of Agriculture and Natural Resources, Ardakan University, Ardakan, Iran

2 Department of Computer Engineering, Faculty of Engineering, Ardakan University, Ardakan, Iran

10.22059/jphgr.2025.386916.1007858

Abstract

Solar radiation is one of the key variables in energy balance models and plant growth simulation. This study examined the performance of principal component analysis (PCA) and Shannon entropy theory (ENT) in determining the inputs of the machine learning models random forest (RF), linear regression (LR), support vector regression (SVR), k-nearest neighbors (KNN), decision tree (DT), and XGBoost (XGB) for estimating solar radiation at the Yazd synoptic station over the period 2006 to 2023. Daily mean temperature, minimum temperature, maximum temperature, sunshine hours, relative humidity, and solar radiation were obtained from the Meteorological Organization. Extraterrestrial radiation, relative Earth-Sun distance, solar declination angle, and maximum sunshine hours were calculated with existing relationships, and all of these variables were selected as inputs to the pre-processing methods. The machine learning algorithms estimated solar radiation with acceptable accuracy. When the dimensionality of the input data was reduced, PCA increased model accuracy, and among the models used, PCA-SVR gave the best result at the Yazd station, with a coefficient of determination of 0.923 and an accuracy of 92.84%. The Shannon entropy method did not improve the modeling results relative to the approach without pre-processing. This analysis shows that dimensionality reduction techniques and appropriate model selection can lead to higher accuracy and lower computational complexity in prediction problems, although the pre-processing method for the initial data must be chosen with care.


Article title [English]

Application of Entropy Theory and Principal Component Analysis to Determine Input Variables for Estimating Solar Radiation using Machine Learning Algorithms

Authors [English]

  • Somayeh Soltani-Gerdefaramarzi 1
  • Mozhgan Askarizadeh 2
1 Department of Water Sciences and Engineering, Faculty of Agriculture and Natural Resources, Ardakan University, Ardakan, Iran
2 Department of Computer Engineering, Faculty of Engineering, Ardakan University, Ardakan, Iran
Abstract [English]

ABSTRACT
Solar radiation is crucial in energy balance models and plant growth simulations. This research investigates the performance of Principal Component Analysis (PCA) and Shannon Entropy Theory (ENT) in determining the input for machine learning models – Random Forest (RF), Linear Regression (LR), Support Vector Regression (SVR), K-Nearest Neighbors (KNN), Decision Tree (DT), and XGBoost (XGB) – for estimating solar radiation at the Yazd synoptic station between 2006 and 2023. Daily data for average temperature, minimum temperature, maximum temperature, sunshine hours, relative humidity, and solar radiation were obtained from the Meteorological Organization. Extraterrestrial radiation, the relative Earth-Sun distance, solar declination angle, and maximum sunshine hours were calculated using existing formulas and selected as inputs for the pre-processing methods. The results of machine learning algorithms indicated their acceptable accuracy in estimating solar radiation. By reducing the dimensionality of the input data to the machine learning algorithms, the results showed that the Principal Component Analysis (PCA) method increased the model's accuracy. Among the models used, the PCA-SVR model showed the best result at the Yazd station with a coefficient of determination of 0.923 and an accuracy of 92.84%. It is worth mentioning that the Shannon entropy theory method failed to improve the modeling results compared to the method without initial pre-processing. This analysis shows that using dimensionality reduction techniques and selecting appropriate models can lead to greater accuracy and less computational complexity in prediction problems. However, sufficient care should be taken when selecting a pre-processing model for the initial data.
Extended Abstract
Introduction
Meteorological and hydrological systems are complex: not all influential parameters can be selected, and statistical records are often incomplete, so these systems cannot be modeled completely. Under such conditions, system modeling based on mathematical relationships is of interest. Solar radiation is one of the most important meteorological variables for estimating evapotranspiration and crop water requirements, and it is the energy source for all atmospheric and surface processes. Although this variable has been measured in Iran for a relatively long time, the high cost of instruments means that many stations in the country lack a radiometer or pyranometer, or suffer from calibration problems and the accumulation of water and dust on the sensor. Even at weather stations that measure radiation, there are days with no recorded data, or with unrealistic out-of-range values caused by equipment malfunctions or other issues. Moreover, because so many factors affect solar radiation, not all of them can be included in the relevant equations; empirical and semi-empirical equations therefore rely on only a limited number of variables. In recent years, many researchers have turned to data mining methods and mathematical modeling to estimate solar radiation.
 
Methodology
The data used in this research are daily climatic variables measured at the Yazd synoptic station from 2006 to 2023. The station is located at 31.8974° N latitude and 54.3569° E longitude, at an altitude of 1216 meters above sea level. At this station, the average solar radiation and average extraterrestrial radiation are 19.35 and 32 MJ m⁻² day⁻¹, respectively. The ratio of sunshine hours to maximum possible sunshine hours is 0.75, the average relative humidity is 27%, and the average temperature is 28°C. Data from 2006 to 2014 were used for calibration, and data from 2015 to 2023 were used for evaluating the results. Extraterrestrial radiation and maximum daily sunshine hours, which depend on geographical latitude and the day number in the Gregorian calendar, were calculated using the relationships presented by Duffie and Beckman (1991). Principal Component Analysis (PCA) and Shannon Entropy (ENT) were then evaluated as methods for determining the input variables of the Random Forest (RF), Linear Regression (LR), Support Vector Regression (SVR), K-Nearest Neighbors (KNN), Decision Tree (DT), and XGBoost (XGB) models used to estimate solar radiation. Daily mean temperature, minimum temperature, maximum temperature, sunshine hours, relative humidity, and solar radiation were obtained from the Meteorological Organization. Extraterrestrial radiation, relative Earth-Sun distance, solar declination angle, and maximum sunshine hours were calculated using existing relationships, and all of these variables were selected as inputs for the pre-processing methods.
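
The calculated astronomical inputs can be illustrated with a short Python sketch. It uses the widely used FAO-56 form of the equations, which is an assumption here: the paper cites Duffie and Beckman (1991) but does not reproduce the formulas, so the exact expressions and constants used by the authors may differ. The function name `solar_geometry` and the dictionary keys are hypothetical.

```python
import math

# Solar constant in MJ m^-2 min^-1 (FAO-56 value; an assumption, not from the paper)
GSC = 0.0820


def solar_geometry(day_of_year: int, latitude_deg: float) -> dict:
    """Return the astronomical inputs for one day at one latitude.

    Keys: dr (relative Earth-Sun distance, dimensionless),
    declination (rad), ra (extraterrestrial radiation, MJ m^-2 day^-1),
    n_max (maximum possible sunshine hours).
    """
    phi = math.radians(latitude_deg)
    angle = 2.0 * math.pi * day_of_year / 365.0

    dr = 1.0 + 0.033 * math.cos(angle)           # inverse relative distance
    declination = 0.409 * math.sin(angle - 1.39)  # solar declination

    # Sunset hour angle; the argument is clamped so acos stays defined
    # at high latitudes (polar day or night).
    x = -math.tan(phi) * math.tan(declination)
    omega_s = math.acos(max(-1.0, min(1.0, x)))

    ra = (24.0 * 60.0 / math.pi) * GSC * dr * (
        omega_s * math.sin(phi) * math.sin(declination)
        + math.cos(phi) * math.cos(declination) * math.sin(omega_s)
    )
    n_max = 24.0 / math.pi * omega_s
    return {"dr": dr, "declination": declination, "ra": ra, "n_max": n_max}


# Example call with the Yazd latitude reported above (day 172 is near the June solstice)
yazd_midsummer = solar_geometry(172, 31.8974)
```

Dividing measured sunshine hours by `n_max` gives the sunshine ratio mentioned above, and `ra` sets the upper bound against which measured radiation can be screened for unrealistic values.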
 
Results and discussion
In the training phase, all models were well trained and gave acceptable results. In the testing phase, modeling with the raw input data (without pre-processing) also gave satisfactory results for all models: the coefficient of determination ranged from 0.790 for the KNN model to 0.893 for the SVR model. Considering all evaluation metrics, Support Vector Regression (SVR) predicted solar radiation better than the other models, with RMSE = 1.732, MSE = 0.003, MAE = 0.826, R² = 0.893, and an accuracy of 90.75%. When Principal Component Analysis (PCA) was used for dimensionality reduction, the first principal component accounted for approximately 49% of the variance and the second for approximately 36%. Together, the first two components explained over 85% of the variability in the original data, so they were used as inputs to the predictive models. In training, the PCA-DT and ENT-DT models fit the Yazd data best, with zero mean squared error, zero mean absolute percentage error, and a coefficient of determination of 1.00. In testing, however, the PCA-SVR model outperformed the other methods: with a coefficient of determination of 0.923 and an accuracy of 92.84%, it achieved the best results and the lowest error metrics among the models at the Yazd station. The ENT-DT model, with a coefficient of determination of 0.535 and an accuracy of 79.34%, performed poorly relative to the other models.
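
The component-selection step described above (keep the leading components until the cumulative explained variance reaches about 85%) can be sketched with NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the 0.85 threshold argument, and the synthetic demonstration data are all assumptions, and the station data are not reproduced here.

```python
import numpy as np


def pca_reduce(X, variance_threshold=0.85):
    """Standardize X (rows = days, columns = variables) and keep the fewest
    principal components whose cumulative explained variance reaches the
    threshold. Returns (scores, explained_ratio, k)."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # SVD of the standardized data: rows of Vt are the component loadings,
    # and the squared singular values are proportional to the variances.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    cumulative = np.cumsum(explained)

    # First index where the cumulative share reaches the threshold.
    k = min(int(np.searchsorted(cumulative, variance_threshold)) + 1, len(s))
    scores = Z @ Vt[:k].T
    return scores, explained, k


def make_demo_data(n_days=500, seed=0):
    """Synthetic stand-in data (hypothetical): eight variables driven by two
    independent latent factors plus small noise."""
    rng = np.random.default_rng(seed)
    f1 = rng.normal(size=n_days)
    f2 = rng.normal(size=n_days)
    cols = [f1 + 0.1 * rng.normal(size=n_days) for _ in range(4)]
    cols += [f2 + 0.1 * rng.normal(size=n_days) for _ in range(4)]
    return np.column_stack(cols)


scores, explained, k = pca_reduce(make_demo_data())
```

The resulting `scores` columns would then replace the original variables as model inputs. In a calibration and evaluation setup like the one described in the Methodology, the means, standard deviations, and loadings would normally be estimated on the calibration period only and then applied unchanged to the evaluation period.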
 
Conclusion
Given the importance of accurate solar radiation estimation in hydrological phenomena and the need for advanced estimation methods, this research used Principal Component Analysis (PCA) and entropy theory for data pre-processing. The inputs of the estimation models were identified with these two methods, and modeling was performed with Random Forest (RF), Linear Regression (LR), Support Vector Regression (SVR), K-Nearest Neighbors (KNN), Decision Tree (DT), and XGBoost (XGB). Entropy theory indicated that, at the Yazd station, solar declination angle, minimum temperature, minimum relative humidity, and average relative humidity were the effective variables for estimating solar radiation. PCA reduced the input variables to two principal components, and modeling was performed with these two derived inputs. Overall, the PCA-SVR model outperformed the other models in estimating solar radiation, and PCA pre-processing provided better inputs for the estimation models. The Shannon entropy method did not improve the modeling results compared to modeling without pre-processing. This analysis shows that dimensionality reduction techniques and appropriate model selection can lead to higher accuracy and lower computational complexity in prediction problems. However, the pre-processing method for the initial data must be chosen with care. Similar research using new data or different geographical conditions could help further validate the results.
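
One common way to rank candidate inputs with Shannon entropy is sketched below. The text here does not specify how entropy was applied, so this is an assumption: each variable is discretized into equal-width bins, and the mutual information (transinformation) between each candidate input and the target is used as the ranking score. All function names and the bin count are hypothetical.

```python
import math
from collections import Counter


def discretize(values, bins=10):
    """Map continuous values to equal-width bin indices 0..bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant series: everything falls in one bin
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def entropy(labels):
    """Shannon entropy in bits: H = -sum(p * log2(p)) over observed labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mutual_information(x, y, bins=10):
    """T(X;Y) = H(X) + H(Y) - H(X,Y), estimated from binned data."""
    bx, by = discretize(x, bins), discretize(y, bins)
    return entropy(bx) + entropy(by) - entropy(list(zip(bx, by)))


def rank_inputs(candidates, target, bins=10):
    """Sort candidate inputs (name -> series) by information shared with target."""
    scored = {name: mutual_information(vals, target, bins)
              for name, vals in candidates.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)
```

A call such as `rank_inputs({"declination": delta_series, "t_min": tmin_series}, radiation_series)` (series names hypothetical) would return the candidates ordered from most to least informative. Note that binned estimates of this kind are sensitive to the number of bins and to sample size, which is one reason entropy-based selection can behave differently from PCA on the same data.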
 
Funding
This research received no funding support.
 
Authors’ Contribution
The authors contributed to this study as follows: Somayeh Soltani-Gerdefaramarzi was responsible for the study design, data collection, analysis, writing the initial draft, and final editing of the article, and Mozhgan Askarizadeh was responsible for modeling and results.
 
Conflict of Interest
The authors declare no conflict of interest.
 
Acknowledgments
We are grateful to all the scientific consultants of this paper.

Keywords [English]

  • Geometric Specifications
  • Machine Learning
  • Solar Zenith Angle
  • Radiation
  • Yazd
  1. Soltani-Gerdefaramarzi, S., Taghizadeh, R., & Ghasemi, M. (1394 SH / 2015). Estimating the longitudinal dispersion coefficient of rivers using various data mining methods. Iranian Journal of Soil and Water Research, 46(3), 385-394. doi: 10.22059/ijswr.2015.56728 [In Persian]
  2. Soltani-Gerdefaramarzi, S. (1402 SH / 2023). Predicting solar radiation at the Yazd station using a regression model based on principal components (PCR). Journal of Agricultural Meteorology, 11(1), 6-16. doi: 10.22125/agmj.2023.352446.1140 [In Persian]
  3. Soltani-Gerdefaramarzi, S., & Momeni, H. (1402 SH / 2023). Application of machine learning algorithms to estimate solar radiation (case study: arid and semi-arid climate). Iranian Journal of Geophysics, 17(4), 29-35. doi: 10.30499/ijg.2023.393259.1512 [In Persian]
  4. Sheikholeslami, N., Ghahraman, B., Mosaedi, A., Davari, K., & Mohajerpour, M. (1393 SH / 2014). Predicting reference crop evapotranspiration (ETO) using principal component analysis (PCA) and development of a multiple linear regression model (MLR-PCA) (case study: Mashhad station). Water and Soil, 28(2), 420-429. doi: 10.22067/jsw.v0i0.25711 [In Persian]
  5. Saffaripour, M. H., & Mehrabian, M. A. (1388 SH / 2009). Predicting the total amount of solar radiation in Kerman using geometric, astronomical, geographical, and meteorological characteristics. Sharif, 51, 3-13. [In Persian]
  6. Mohammadi, B., & Emamgholizadeh, S. (1395 SH / 2017). Using principal component analysis to determine the inputs affecting rainfall estimation with artificial neural networks and support vector machines. Journal of Rainwater Catchment Systems, 4(13), 67-75. doi: 20.1001.1.24235970.1395.4.4.6.9 [In Persian]
  7. Avazpour, S., Bakhtiari, B., & Qaderi, K. (1398 SH / 2019). Evaluating the performance of neural network and multivariate regression methods in estimating total solar radiation at several stations representing arid and semi-arid climates. Iranian Journal of Soil and Water Research, 50(139), 1855-1869. doi: 20.1001.1.24234931.1399.7.2.3.2 [In Persian]
  8. Mohammadi, B., Aghashariatmadari, Z., & Moazenzadeh, R. (1398 SH / 2019). Determination of input variables for estimating solar radiation using entropy theory and principal component analysis. Iranian Journal of Soil and Water Research, 50(3), 626-639. doi: 10.22059/ijswr.2018.257150.667906 [In Persian]
  9. Abdelhafidi, N., Bachari, N.E.I., & Abdelhafidi, Z. (2021). Estimation of solar radiation using stepwise multiple linear regression with principal component analysis in Algeria. Meteorology and Atmospheric Physics, 133(2), 205-216. doi: 10.1007/s00703-020-00739-0
  10. Avazpour, S., Bakhtiari, B., & Qaderi, K. (2019). Performance evaluation of Neural Network and Multivariate Regression Methods for Estimation of Total Solar Radiation at several stations in Arid and Semi-Arid Climates. Iranian Journal of Soil and Water Research, 50(8), 1855-1869. doi: 20.1001.1.24234931.1399.7.2.3.2 [In Persian]
  11. Boroughani, M., Soltani, S., Ghezelseflu, N., & Pazhouhan, I. (2022). A comparative assessment between artificial neural network, neuro-fuzzy, and support vector machine models in splash erosion modelling under simulation circumstances. Folia Oecologica, 49(1), 23-34. doi: 10.2478/foecol-2022-0003
  12. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32. doi: 10.1023/A:1010933404324
  13. Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21-27. doi: 10.1109/TIT.1967.1053964
  14. Demir, V., & Citakoglu, H. (2023). Forecasting of solar radiation using different machine learning approaches. Neural Computing and Applications, 35(1), 887-906. doi: 10.1007/s00521-022-07831-z
  15. Djeldjeli, Y., Taouaf, L., Alqahtani, S., Mokaddem, A., Alshammari, B.M., Menni, Y., & Kolsi, L. (2024). Enhancing solar power forecasting with machine learning using principal component analysis and diverse statistical indicators. Case Studies in Thermal Engineering, 61, 104924. doi: 10.1016/j.csite.2024.104924
  16. Draper, N. R., & Smith, H. (1998). Applied regression analysis. Wiley-Interscience.
  17. Drucker, H., Burges, C. J. C., Kaufman, L., Smola, A., & Vapnik, V. (1997). Support vector regression machines. Advances in Neural Information Processing Systems, 9, 155-161. https://www.researchgate.net/publication/309185766 Support vector regression machines.
  18. Duffie, J.A., & Beckman, W.A. (1991). Solar Engineering of Thermal Processes. Wiley, New York.
  19. Hunt, L.A., Kuchar, L., & Swanton, C.J. (1998). Estimation of solar radiation for use in crop modeling. Agricultural and Forest Meteorology, 91, 293-300. doi: 10.1016/S0168-1923(98)00085-4
  20. Liu, C.W., Lin, K.H., & Kuo, Y.M. (2003). Application of factor analysis in the assessment of groundwater quality in a blackfoot disease area in Taiwan. Science of the Total Environment, 313, 77-89. doi: 10.1016/S0048-9697(02)00683-6
  21. Meenal, R., & Selvakumar, A.I. (2018). Assessment of SVM, empirical and ANN based solar radiation prediction models with most influencing input parameters. Renewable Energy, 121, 324-343. doi: 10.1016/j.renene.2017.12.005
  22. Mohammadi, B., Aghashariatmadari, Z., & Moazenzadeh, R. (2019). Determination of Input Variables to Estimate Solar Radiation Using Entropy Theory and Principal Component Analysis. Iranian Journal of Soil and Water Research, 50(3), 625-639. doi: 10.22059/ijswr.2018.257150.667906 [In Persian]
  23. Mohammadi, B., & Emamgholizadeh, S. (2017). Using principal component analysis to inputs the effective rainfall estimates based on entries to help support vector machine and artificial neural network. Journal of Rainwater Catchment Systems, 4(4), 67-75. doi: 20.1001.1.24235970.1395.4.4.6.9 [In Persian]
  24. Olalekan, S., Abdullahi, M. I., & Olabisi, A. (2018). Modeling of Solar Radiation Using Artificial Neural Network for Renewable Energy Application. Journal of Applied Physics, 10(2), 6-12. http://dx.doi.org/10.9790/4861-1002030612.
  25. Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81-106. doi: 10.1007/BF00116251
  26. Radosevic, N., Duckham, M., Liu, G.J., & Sun, Q. (2020). Solar radiation modeling with KNIME and Solar Analyst: Increasing environmental model reproducibility using scientific workflows. Environmental Modelling & Software, 132, 104780. doi: 10.1016/j.envsoft.2020.104780
  27. Rahimikhoob, A. (2010). Estimating global solar radiation using artificial neural network and air temperature data in a semi-arid environment. Renewable Energy, 35, 2131-2135. doi: 10.1016/j.renene.2010.01.013
  28. Sabziparvar, A.A., & Shetaee, H. (2007). Estimation of global solar radiation in arid and semi-arid climates of East and West Iran. Energy, 32, 649-655. doi: 10.1016/j.energy.2006.06.006
  29. Saffaripour, M., & Mehrabian, M. (2009). Predicting the total amount of solar radiation in Kerman using geometric, astronomical, geographical and meteorological characteristics. Sharif, 51(1), 3-13. [In Persian]
  30. Saraswat, R., Jhanwar, D., & Gupta, M. (2024). Enhanced Solar Power Forecasting Using XG Boost and PCA-Based Sky Image Analysis. Traitement du Signal, 41(1). doi: 10.18280/ts.410104
  31. Shannon, C.E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379-423; 27(4), 623-656. doi: 10.1002/j.1538-7305.1948.tb01338.x
  32. Sheikholeslami, N., Ghahraman, B., Mosaedi, A., Davari, K., & Mohajeri, M. (2014). Estimating Reference Evapotranspiration by Using Principal Component Analysis (PCA) and The Development of a Regression Model (MLR-PCA) (Case Study: Mashhad Station). Water and Soil, 28(2), 420-429. doi: 10.22067/jsw.v0i0.25711 [In Persian]
  33. Soltani-Gerdefaramarzi, S., Taghizadeh-Mehrjerdi, R., & Ghasemi, M. (2015). Prediction of Longitudinal Dispersion Coefficient in Natural Streams using Soft Computing Techniques. Iranian Journal of Soil and Water Research, 46(3), 385-394. doi: 10.22059/ijswr.2015.56728 [In Persian]
  34. Soltani-Gerdefaramarzi, S., & Momeni, H. (2023). Application of machine learning algorithms to estimate solar radiation (case study: arid and semi-arid climate). Iranian Journal of Geophysics, 17(4), 25-39. doi: 10.30499/ijg.2023.393259.1512 [In Persian]
  35. Soltani-Gerdefaramarzi, S. (2023). Prediction of solar radiation intensity in Yazd station by using regression model based on principal components (PCR). Journal of Agricultural Meteorology, 11(1), 6-16. doi: 10.22125/agmj.2023.352446.1140 [In Persian]
  36. Xu, H., Xu, C.-Y., Sælthun, N. R., Xu, Y., Zhou, B., & Chen, H. (2015). Entropy theory based multicriteria resampling of rain gauge networks for hydrological modelling: A case study of humid area in southern China. Journal of Hydrology, 525, 138-151. doi: 10.1016/j.jhydrol.2015.03.034
  37. Yadav, A. K., & Chandel, S. S. (2015). Solar energy potential assessment of western Himalayan Indian state of Himachal Pradesh using J48 algorithm of WEKA in ANN based prediction model. Renewable Energy, 75, 675-693. doi: 10.1016/j.renene.2014.10.045