Detection of Spam Pages Using XGBoost Algorithm
Reyhane Rashidpour¹ (Dept. of Comp. Eng., Yazd University, Yazd, Iran)
Ali-Mohammad Zareh-Bidoki² (Dept. of Comp. Eng., Yazd University, Yazd, Iran)
Keywords: Web spam, XGBoost classification algorithm, data balancing, machine learning.
Abstract:
Today, search engines are the gateway to the web. As the web has grown in popularity, efforts to exploit it for commercial, social, and political purposes have grown as well, making it harder for search engines to distinguish good content from spam. The concept of web spam was first introduced in 1996 and was quickly recognized as one of the key challenges for the search engine industry. Spam arises primarily because a significant portion of web page visits comes from search engines, and users tend to check only the first few search results. The goal of identifying spam pages is to ensure that they cannot achieve high rankings through deceptive strategies. We aim to provide an effective method for identifying spam pages, thereby reducing the presence of spam in the top search results. This article proposes two methods for combating web spam. The first, called XGspam, identifies spam pages using the XGBoost learning algorithm with an accuracy of 94.27%. The second, named XGSspam, addresses the challenge of imbalanced web data by combining the SMOTE oversampling algorithm with the XGBoost classification model, achieving an accuracy of 95.44% in identifying spam pages.
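The balancing step that XGSspam relies on can be illustrated with a minimal sketch of SMOTE-style oversampling. This is not the authors' implementation: the function name `smote`, its parameters, and the toy feature vectors are assumptions made for illustration, and only the Python standard library is used. Each synthetic spam sample is placed at a random point on the segment between a minority sample and one of its k nearest minority neighbors.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points interpolated between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (excluding itself)
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Toy, made-up feature vectors: 8 "normal" pages and 3 "spam" pages.
normal = [(0.1 * i, 0.2 * i) for i in range(8)]
spam = [(5.0, 5.0), (5.5, 4.8), (6.0, 5.2)]

# Oversample the spam class until both classes have the same size.
spam_balanced = spam + smote(spam, n_new=len(normal) - len(spam), k=2)
print(len(normal), len(spam_balanced))
```

In a full pipeline, the balanced training set would then be passed to a gradient-boosted tree classifier such as XGBoost. That step is omitted here, and oversampling would normally be applied to the training split only, so that the test data keeps its original class distribution.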
[1] E. Convey, "Porn sneaks way back on web," The Boston Herald, vol. 28, 1996.
[2] M. De Kunder, The Size of the World Wide Web (The Internet), https://www.worldwidewebsize.com, Retrieved 2024.
[3] A. Shahzad, N. M. Nawi, M. Z. Rehman, and A. Khan, "An improved framework for content‐and link‐based web‐spam detection: a combined approach," Complexity, vol. 2021, Article ID: 6625739, 18 pp., 2021.
[4] C. Castillo, Web Spam Collections, https://chato.cl/webspam/datasets/uk2007, Retrieved 2024.
[5] T. Chen and C. Guestrin, "XGBoost: a scalable tree boosting system," in Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 785-794, San Francisco, CA, USA, 13-17 Aug. 2016.
[6] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, no. 1, pp. 321-357, Jan. 2002.
[7] J. Liu, Y. Su, S. Lv, and C. Huang, "Detecting web spam based on novel features from web page source code," Security and Communication Networks, vol. 2020, Article ID: 6662166, 14 pp., 2020.
[8] F. Asdaghi and A. Soleimani, "An effective feature selection method for web spam detection," Knowledge-Based Systems, vol. 166, pp. 198-206, Feb. 2019.
[9] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly, "Detecting spam web pages through content analysis," in Proc. World Wide Web, pp. 83-92, Edinburgh, Scotland, 23-26 May 2006.
[10] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates, "Using rank propagation and probabilistic counting for link-based spam detection," in Proc. the WebKDD, 10 pp., 2006.
[11] R. Baeza-Yates, P. Boldi, and C. Castillo, "Generalizing PageRank: damping functions for link-based ranking algorithms," in Proc. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 308-315, Seattle, WA, USA, 6-11 Aug. 2006.
[12] M. Yu, J. Zhang, J. Wang, J. Gao, T. Xu, and R. Yu, "The research of spam web page detection method based on web page differentiation and concrete clusters centers," in Proc. Int. Conf. on Wireless Algorithms, Systems, and Applications, pp. 820-826, Tianjin, China, 20-22 Jun. 2018.
[13] J. J. Whang, Y. S. Jeong, I. Dhillon, S. Kang, and J. Lee, "Fast asynchronous Anti-TrustRank for web spam detection," in Proc. WSDM Workshop on Misinformation and Misbehavior Mining on the Web, 4 pp., Marina Del Rey, CA, USA, 5-9 Feb. 2018.
[14] Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen, "Combating web spam with TrustRank," in Proc. Very Large Data Bases, vol. 30, pp. 576-587, Toronto, Canada, 31 Aug.-3 Sept. 2004.
[15] M. Sobek, PR0 - Google's PageRank 0 Penalty, http://pr.efactory.de/e-pr0.shtml, Retrieved 2024.
[16] D. Liu and J. Lee, "CNN based malicious website detection by invalidating multiple web spams," IEEE Access, vol. 8, pp. 97258-97266, 2020.
[17] X. Zhuang, Y. Zhu, Q. Peng, and F. Khurshid, "Using deep belief network to demote web spam," Future Generation Computer Systems, vol. 118, pp. 94-106, May 2021.
[18] C. Wei, Y. Liu, M. Zhang, S. Ma, L. Ru, and K. Zhang, "Fighting against web spam: a novel propagation method based on click-through data," in Proc. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 395-404, Portland, OR, USA, 12-16 Aug. 2012.
[19] A. Heydari, M. A. Tavakoli, N. Salim, and Z. Heydari, "Detection of review spam: a survey," Expert Systems with Applications, vol. 42, no. 7, pp. 3634-3642, May 2015.
[20] S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine," Computer Networks and ISDN Systems, vol. 30, no. 1-7, pp. 107-117, Apr. 1998.
[21] D. Sculley, Kaggle: Your Machine Learning and Data Science Community, https://www.kaggle.com, Retrieved 2024.
[22] X. Ren, Knowledge Discovery in Data and Data-Mining, https://kdd.org/, Retrieved 2024.
[23] T. Wongvorachan, S. He, and O. Bulut, "A comparison of undersampling, oversampling, and SMOTE methods for dealing with imbalanced classification in educational data mining," Information, vol. 14, no. 1, Article ID: 54, 2023.
[24] Y. Zhang, L. Deng, and B. Wei, "Imbalanced data classification based on improved random-SMOTE and feature standard deviation," Mathematics, vol. 12, no. 11, Article ID: 1709, 2024.