Improving Register File Latency Tolerance in GPUs through Re-production of Intermediate Values

Subject areas: electrical and computer engineering

Rahil Barati ¹ *, Mohammad Sadrosadati ², Hamid Sarbazi-Azad ³

1 - Sharif University of Technology, Department of Computer Engineering
2 - Institute for Research in Fundamental Sciences (IPM), Electrical Engineering Group
3 - Sharif University of Technology, Department of Computer Engineering

Received: 1399/12/05 Accepted: 1400/12/08 Published: 1401/03/30 (Solar Hijri calendar)

Keywords: GPUs, register file, value re-production, execution units

Abstract:

A large register file in a GPU improves thread-level parallelism and thereby reduces the number of memory accesses. The LTRF method was previously proposed to increase register file capacity at an acceptable power and area overhead. The LTRF register file architecture has two levels: a register cache and a main register file. A warp's registers are prefetched into the register cache before the warp executes. To prefetch registers, the compiler partitions the program's control flow graph into subgraphs called register-intervals. One overhead of LTRF is the register prefetch operation itself, which leaves the warp idle for the duration of the prefetch; reducing the number of register-intervals lowers this overhead considerably. However, the number of registers usable in each register-interval is limited, and raising that limit increases both prefetch traffic and the required register cache capacity, so it is not a suitable way to reduce the number of register-intervals. In this work, we re-produce intermediate values at compile time to reduce the number of registers needed in each register-interval. Simulation results show that the proposed method improves the tolerance of LTRF to register file access latency by 29 percent. Moreover, with a register file built from DWM memory cells, the proposed architecture increases the performance of an LTRF-equipped GPU by 18 percent on average (about 30 percent over the baseline GPU architecture), while reducing energy and power consumption by 38 and 15 percent, respectively.

English abstract:

Large register files reduce the performance and energy overhead of memory accesses by improving thread-level parallelism and reducing the number of data movements from off-chip memory. The latency-tolerant register file (LTRF) was recently proposed to enable high-capacity register files at low power and area cost. LTRF is a two-level register file in which the first level is a small, fast register cache and the second level is a large, slow main register file. LTRF uses a near-perfect register prefetching mechanism: a warp's registers are prefetched from the main register file into the register cache before the warp is scheduled, and the prefetching latency is hidden by executing other active warps. LTRF identifies the working set of each warp by partitioning the control flow graph (CFG) into several prefetch subgraphs, called register-intervals. LTRF imposes some performance overhead because a warp stalls while its registers are prefetched. Reducing the number of register-intervals can greatly mitigate this overhead and improve the effectiveness of LTRF. A register-interval is a subgraph of the CFG that must satisfy two constraints: it has a single entry, and it uses a limited number of registers. We observe that the second constraint contributes more to limiting the size of register-intervals. Raising the register limit of a register-interval does not solve this problem, because it imposes a large performance and power overhead during register prefetching. In this paper, we propose a register-interval-aware re-production mechanism that operates at compile time to increase register-interval size without increasing the number of registers inside the interval. Our experimental results show that the proposal improves the effectiveness of LTRF by 29% and the performance of LTRF by about 18% (about 30% over the baseline GPU architecture). Moreover, it reduces GPU energy and power consumption by 38% and 15%, respectively, on average.
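The interaction between the register limit and re-production can be illustrated with a toy model. The sketch below is not the LTRF register-interval algorithm or the paper's compiler pass. It assumes a straight-line chain of basic blocks (so the single-entry constraint holds trivially), a hypothetical limit of four registers per interval, and a made-up three-block program. The names `form_intervals`, `block_registers`, and `run` are illustrative only.

```python
from typing import Dict, List, Set, Tuple

# An instruction is (opcode, destination register, two source registers).
Instr = Tuple[str, str, Tuple[str, str]]
Block = List[Instr]


def block_registers(block: Block) -> Set[str]:
    """Return every register that a basic block reads or writes."""
    regs: Set[str] = set()
    for _op, dest, srcs in block:
        regs.add(dest)
        regs.update(srcs)
    return regs


def form_intervals(blocks: List[Block], max_regs: int) -> List[List[int]]:
    """Greedily pack consecutive blocks into intervals of at most max_regs registers."""
    intervals: List[List[int]] = []
    current: List[int] = []
    current_regs: Set[str] = set()
    for index, block in enumerate(blocks):
        regs = block_registers(block)
        if len(regs) > max_regs:
            raise ValueError(f"block {index} alone needs {len(regs)} registers")
        # Close the current interval when adding this block would exceed the limit.
        if current and len(current_regs | regs) > max_regs:
            intervals.append(current)
            current, current_regs = [], set()
        current.append(index)
        current_regs |= regs
    if current:
        intervals.append(current)
    return intervals


def run(blocks: List[Block], inputs: Dict[str, int]) -> Dict[str, int]:
    """Execute the blocks in order and return the final register values."""
    regs = dict(inputs)
    for block in blocks:
        for op, dest, (a, b) in block:
            regs[dest] = regs[a] + regs[b] if op == "add" else regs[a] * regs[b]
    return regs


# Hypothetical program: r2 = r0 + r1 stays live from block 0 to block 2,
# so block 1 needs a fifth register (r4) for its own temporary.
before: List[Block] = [
    [("add", "r2", ("r0", "r1"))],
    [("mul", "r3", ("r0", "r0")), ("add", "r4", ("r3", "r1"))],
    [("mul", "r3", ("r2", "r4"))],
]

# Same computation, but r0 + r1 is re-produced in block 2. That frees r2
# inside block 1, so the whole program touches only four registers.
# (Block 0 is assumed to consume r2 itself, for example by storing it.)
after: List[Block] = [
    [("add", "r2", ("r0", "r1"))],
    [("mul", "r3", ("r0", "r0")), ("add", "r2", ("r3", "r1"))],
    [("add", "r3", ("r0", "r1")), ("mul", "r3", ("r3", "r2"))],
]

inputs = {"r0": 3, "r1": 5}
print(form_intervals(before, max_regs=4))  # [[0], [1], [2]]
print(form_intervals(after, max_regs=4))   # [[0, 1, 2]]
print(run(before, inputs)["r3"], run(after, inputs)["r3"])  # 112 112
```

In this toy case one extra `add` instruction merges three intervals into one, so the warp would pay for a single prefetch instead of three. The trade-off in a real compiler depends on the cost of the recomputed instructions relative to the prefetch stall they avoid.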

