Extraction of Long-Term Deep Features for Video Classification
Abbas Hamdouni Asli 1 (Jahad Daneshgahi Institute of Higher Education of Hamedan, Hamedan, Iran)
Shima Javidani 2 (Jahad Daneshgahi Institute of Higher Education of Hamedan, Hamedan, Iran)
Ali Javidani 3 (Department of Computer Engineering, Faculty of Engineering, Bu-Ali Sina University, Hamedan, Iran)
Keywords: video classification, human action recognition, deep learning, convolutional neural networks, recurrent neural networks, long short-term memory (LSTM)
Abstract:
This paper presents a novel approach for recognizing ongoing actions in segmented videos, with the main focus on extracting long-term features for effective classification. First, optical-flow images between consecutive frames are computed and described by a pretrained convolutional neural network. To reduce feature-space complexity and simplify training of the temporal model, PCA is applied to the optical-flow descriptors. Next, a lightweight channel-attention module is applied to the low-dimensional PCA features at each time step to strengthen informative components and suppress weak ones. The descriptors of each video are then aligned and tracked over time, forming a multi-channel 1D time series from which long-term patterns are learned by training a two-layer stacked LSTM. After the LSTM, a temporal-attention module performs time-aware aggregation: by weighting time steps in a data-driven way, it highlights informative moments and produces a coherent context vector for classification. Experimental results show that combining PCA with channel and temporal attention improves classification accuracy on both the public UCF11 and jHMDB datasets while keeping the model lightweight, outperforming the reference methods. The code is openly available at: https://github.com/alijavidani/Video_Classification_LSTM
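The first stage of the pipeline can be sketched directly from the description above: compute dense optical flow between consecutive frames, then describe each flow image with a pretrained CNN. The abstract does not fix the flow algorithm or the backbone, so Farneback flow (OpenCV) and an ImageNet ResNet-18, along with the flow-to-3-channel conversion, are illustrative assumptions here, not the authors' exact choices; the released code is the authoritative reference.

```python
import cv2
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T

def flow_images(video_path):
    """Yield dense optical-flow fields between consecutive grayscale frames."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Farneback flow is an assumption; the paper only says "optical flow".
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        yield flow  # H x W x 2 (dx, dy)
        prev = gray
    cap.release()

# Pretrained backbone with the classifier head removed, used as a fixed
# per-frame feature extractor (ResNet-18 is an illustrative choice).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([T.ToTensor(), T.Resize((224, 224))])

@torch.no_grad()
def describe(flow):
    # Map the 2-channel flow to 3 channels (magnitude as the third channel)
    # so it can be fed to an RGB-pretrained network.
    mag = np.linalg.norm(flow, axis=2, keepdims=True)
    img = np.concatenate([flow, mag], axis=2).astype(np.float32)
    x = preprocess(img).unsqueeze(0)       # 1 x 3 x 224 x 224
    return backbone(x).squeeze(0).numpy()  # 512-dim flow descriptor
```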
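The PCA step then pools descriptors across all training frames, fits a single projection, and maps each video to a low-dimensional multi-channel 1D time series. A minimal scikit-learn sketch, where `all_video_descriptors` is a hypothetical stand-in for the per-video descriptor stacks and the target dimension of 64 is an illustrative choice, not the paper's value:

```python
from sklearn.decomposition import PCA
import numpy as np

# Hypothetical stand-in: one (num_frames, 512) array per training video.
all_video_descriptors = [np.random.randn(np.random.randint(30, 80), 512)
                         for _ in range(10)]

# Fit PCA on descriptors pooled over all training frames, then project each
# video into a low-dimensional multi-channel 1D time series.
pca = PCA(n_components=64)
pca.fit(np.vstack(all_video_descriptors))
video_series = [pca.transform(d) for d in all_video_descriptors]  # (T_i, 64)
```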
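The lightweight channel attention can be read as a squeeze-and-excitation-style gate applied to each low-dimensional PCA vector: a small bottleneck MLP emits per-channel weights in (0, 1) that amplify informative components and damp weak ones. A minimal sketch; the layer sizes and reduction ratio are assumptions:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Per-step gating of PCA channels via a small bottleneck MLP."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )

    def forward(self, x):            # x: (batch, T, dim)
        return x * self.gate(x)      # per-channel gates in (0, 1)

attn = ChannelAttention(dim=64)
x = torch.randn(8, 40, 64)   # batch of 8 videos, 40 steps, 64 PCA channels
y = attn(x)                  # same shape, channel-wise re-weighted
```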
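Finally, the temporal model: a two-layer stacked LSTM over the gated series, followed by temporal attention that scores every time step and pools the hidden states into one context vector for the classifier. Only the two-layer stacking and the attention-weighted pooling follow the abstract; the hidden size and head shapes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LSTMWithTemporalAttention(nn.Module):
    """Two-layer stacked LSTM with attention-weighted temporal pooling."""
    def __init__(self, in_dim=64, hidden=128, num_classes=11):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=2, batch_first=True)
        self.score = nn.Linear(hidden, 1)        # one scalar score per step
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, x):                        # x: (batch, T, in_dim)
        h, _ = self.lstm(x)                      # (batch, T, hidden)
        w = torch.softmax(self.score(h), dim=1)  # (batch, T, 1) step weights
        context = (w * h).sum(dim=1)             # time-aware aggregation
        return self.classifier(context)          # (batch, num_classes)

model = LSTMWithTemporalAttention(in_dim=64, hidden=128, num_classes=11)
logits = model(torch.randn(8, 40, 64))           # (8, 11)
```

UCF11 has 11 action classes, which motivates `num_classes=11` above; the jHMDB setting would only change `num_classes` to its 21 categories, with training done by standard cross-entropy over the per-video series.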