Programming Massively Parallel Processors

A Hands-on Approach

1st Edition - January 22, 2010
Authors: David B. Kirk, Wen-mei W. Hwu
Language: English
eBook ISBN: 978-0-12-381473-9

Programming Massively Parallel Processors discusses the basic concepts of parallel programming and GPU architecture. Various techniques for constructing parallel programs are explored in detail. Case studies demonstrate the development process, which begins with computational thinking and ends with effective and efficient parallel programs.

This book describes computational thinking techniques that will enable students to think about problems in ways that are amenable to high-performance parallel computing. It utilizes CUDA (Compute Unified Device Architecture), NVIDIA's software development tool created specifically for massively parallel environments. Students learn how to achieve both high performance and high reliability using the CUDA programming model as well as OpenCL.

This book is recommended for advanced students, software engineers, programmers, and hardware engineers.

Preface
Acknowledgments
Dedication
Chapter 1 Introduction
  1.1 GPUs as Parallel Computers
  1.2 Architecture of a Modern GPU
  1.3 Why More Speed or Parallelism?
  1.4 Parallel Programming Languages and Models
  1.5 Overarching Goals
  1.6 Organization of the Book
Chapter 2 History of GPU Computing
  2.1 Evolution of Graphics Pipelines
    2.1.1 The Era of Fixed-Function Graphics Pipelines
    2.1.2 Evolution of Programmable Real-Time Graphics
    2.1.3 Unified Graphics and Computing Processors
    2.1.4 GPGPU: An Intermediate Step
  2.2 GPU Computing
    2.2.1 Scalable GPUs
    2.2.2 Recent Developments
  2.3 Future Trends
Chapter 3 Introduction to CUDA
  3.1 Data Parallelism
  3.2 CUDA Program Structure
  3.3 A Matrix–Matrix Multiplication Example
  3.4 Device Memories and Data Transfer
  3.5 Kernel Functions and Threading
  3.6 Summary
    3.6.1 Function declarations
    3.6.2 Kernel launch
    3.6.3 Predefined variables
    3.6.4 Runtime API
Chapter 4 CUDA Threads
  4.1 CUDA Thread Organization
  4.2 Using blockIdx and threadIdx
  4.3 Synchronization and Transparent Scalability
  4.4 Thread Assignment
  4.5 Thread Scheduling and Latency Tolerance
  4.6 Summary
  4.7 Exercises
Chapter 5 CUDA™ Memories
  5.1 Importance of Memory Access Efficiency
  5.2 CUDA Device Memory Types
  5.3 A Strategy for Reducing Global Memory Traffic
  5.4 Memory as a Limiting Factor to Parallelism
  5.5 Summary
  5.6 Exercises
Chapter 6 Performance Considerations
  6.1 More on Thread Execution
  6.2 Global Memory Bandwidth
  6.3 Dynamic Partitioning of SM Resources
  6.4 Data Prefetching
  6.5 Instruction Mix
  6.6 Thread Granularity
  6.7 Measured Performance and Summary
  6.8 Exercises
Chapter 7 Floating Point Considerations
  7.1 Floating-Point Format
    7.1.1 Normalized Representation of M
    7.1.2 Excess Encoding of E
  7.2 Representable Numbers
  7.3 Special Bit Patterns and Precision
  7.4 Arithmetic Accuracy and Rounding
  7.5 Algorithm Considerations
  7.6 Summary
  7.7 Exercises
Chapter 8 Application Case Study: Advanced MRI Reconstruction
  8.1 Application Background
  8.2 Iterative Reconstruction
  8.3 Computing FHd
    Step 1. Determine the Kernel Parallelism Structure
    Step 2. Getting Around the Memory Bandwidth Limitation
    Step 3. Using Hardware Trigonometry Functions
    Step 4. Experimental Performance Tuning
  8.4 Final Evaluation
  8.5 Exercises
Chapter 9 Application Case Study: Molecular Visualization and Analysis
  9.1 Application Background
  9.2 A Simple Kernel Implementation
  9.3 Instruction Execution Efficiency
  9.4 Memory Coalescing
  9.5 Additional Performance Comparisons
  9.6 Using Multiple GPUs
  9.7 Exercises
Chapter 10 Parallel Programming and Computational Thinking
  10.1 Goals of Parallel Programming
  10.2 Problem Decomposition
  10.3 Algorithm Selection
  10.4 Computational Thinking
  10.5 Exercises
Chapter 11 A Brief Introduction to OpenCL™
  11.1 Background
  11.2 Data Parallelism Model
  11.3 Device Architecture
  11.4 Kernel Functions
  11.5 Device Management and Kernel Launch
  11.6 Electrostatic Potential Map in OpenCL
  11.7 Summary
  11.8 Exercises
Chapter 12 Conclusion and Future Outlook
  12.1 Goals Revisited
  12.2 Memory Architecture Evolution
    12.2.1 Large Virtual and Physical Address Spaces
    12.2.2 Unified Device Memory Space
    12.2.3 Configurable Caching and Scratch Pad
    12.2.4 Enhanced Atomic Operations
    12.2.5 Enhanced Global Memory Access
  12.3 Kernel Execution Control Evolution
    12.3.1 Function Calls within Kernel Functions
    12.3.2 Exception Handling in Kernel Functions
    12.3.3 Simultaneous Execution of Multiple Kernels
    12.3.4 Interruptible Kernels
  12.4 Core Performance
    12.4.1 Double-Precision Speed
    12.4.2 Better Control Flow Efficiency
  12.5 Programming Environment
  12.6 A Bright Outlook
Appendix A Matrix Multiplication Host-Only Version Source Code
  A.1 matrixmul.cu
  A.2 matrixmul_gold.cpp
  A.3 matrixmul.h
  A.4 assist.h
  A.5 Expected Output
Appendix B GPU Compute Capabilities
  B.1 GPU Compute Capability Tables
  B.2 Memory Coalescing Variations
Index

David B. Kirk

David B. Kirk is well recognized for his contributions to graphics hardware and algorithm research. By the time he began his studies at Caltech, he had already earned B.S. and M.S. degrees in mechanical engineering from MIT and worked as an engineer for Raster Technologies and Hewlett-Packard's Apollo Systems Division. After receiving his doctorate, he joined Crystal Dynamics, a video-game manufacturing company, as chief scientist and head of technology. In 1997, he took the position of Chief Scientist at NVIDIA, a leader in visual computing technologies, and he is currently an NVIDIA Fellow.

At NVIDIA, Kirk led graphics-technology development for some of today's most popular consumer-entertainment platforms, playing a key role in providing mass-market graphics capabilities previously available only on workstations costing hundreds of thousands of dollars. For his role in bringing high-performance graphics to personal computers, Kirk received the 2002 Computer Graphics Achievement Award from the Association for Computing Machinery's Special Interest Group on Computer Graphics and Interactive Techniques (ACM SIGGRAPH). In 2006, he was elected to the National Academy of Engineering, one of the highest professional distinctions for engineers.

Kirk holds 50 patents and patent applications relating to graphics design and has published more than 50 articles on graphics technology, won several best-paper awards, and edited the book Graphics Gems III. A technological "evangelist" who cares deeply about education, he has supported new curriculum initiatives at Caltech and has been a frequent university lecturer and conference keynote speaker worldwide.

Affiliations and expertise

NVIDIA Fellow

Wen-mei W. Hwu

Wen-mei W. Hwu is a Professor and holds the Sanders-AMD Endowed Chair in the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign. His research interests are in the areas of architecture, implementation, compilation, and algorithms for parallel computing. He is the chief scientist of the Parallel Computing Institute and director of the IMPACT research group (www.impact.crhc.illinois.edu). He is a co-founder and CTO of MulticoreWare. For his contributions in research and teaching, he received the ACM SIGARCH Maurice Wilkes Award, the ACM Grace Murray Hopper Award, the Tau Beta Pi Daniel C. Drucker Eminent Faculty Award, the ISCA Influential Paper Award, the IEEE Computer Society B. R. Rau Award, and the Distinguished Alumni Award in Computer Science of the University of California, Berkeley. He is a fellow of IEEE and ACM. He directs the UIUC CUDA Center of Excellence and serves as one of the principal investigators of the NSF Blue Waters petascale computer project. Dr. Hwu received his Ph.D. degree in Computer Science from the University of California, Berkeley.

Affiliations and expertise

CTO, MulticoreWare and professor specializing in compiler design, computer architecture, microarchitecture, and parallel processing, University of Illinois at Urbana-Champaign, USA