Stanford CS149 I 2023 I Lecture 3 - Multi-core Arch Part II + ISPC Programming Abstractions

14 Sep 2024

Hardware Multi-threading

Processor Utilization and Thread Count

Cache and Thread Count

  • A large data cache reduces cache misses, lowering average memory access time and reducing the number of threads needed to hide latency. Conversely, removing the data cache increases misses, so more threads are needed to keep the processor busy. (00:14:50)
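A back-of-envelope sketch (not from the lecture; the cycle counts are hypothetical) of why fewer misses mean fewer threads: if a thread computes for C cycles and then stalls for L cycles on a memory access, roughly 1 + ⌈L/C⌉ threads are needed so that other threads' compute covers each stall. A cache that shortens the average stall shrinks that count.

```python
import math

def threads_to_hide_latency(compute_cycles, stall_cycles):
    # While one thread stalls, the others must supply enough compute
    # to cover the stall: 1 + ceil(stall / compute) threads suffice.
    return 1 + math.ceil(stall_cycles / compute_cycles)

# With a data cache: short average stall (hypothetical numbers)
print(threads_to_hide_latency(compute_cycles=10, stall_cycles=30))   # 4 threads

# Without a cache: every access pays the full DRAM latency
print(threads_to_hide_latency(compute_cycles=10, stall_cycles=200))  # 21 threads
```

The exact numbers are illustrative; the point is that stall length, not compute, drives the thread count a core needs.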

Parallelism and Execution Units

  • A processor with 16 cores, each with four-way multi-threading and eight-wide vector execution, offers a peak throughput of 16 × 8 = 128 vector lanes per cycle. However, to fully hide latency and reach that peak, it needs 16 × 4 × 8 = 512 independent pieces of work to keep every execution context occupied. (00:18:10)
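The arithmetic above can be written out as a quick check, using the machine parameters quoted in the lecture:

```python
cores = 16            # cores on the chip
threads_per_core = 4  # hardware execution contexts per core
vector_width = 8      # SIMD lanes per vector unit

# Peak throughput: each core retires one 8-wide vector op per cycle
peak_lanes = cores * vector_width
print(peak_lanes)  # 128 lanes of peak throughput

# To hide latency, every hardware thread on every core needs a
# full vector's worth of independent work items
independent_tasks = cores * threads_per_core * vector_width
print(independent_tasks)  # 512 independent pieces of work
```

Note the gap: the chip can only *execute* 128 lanes at once, but it must be *supplied* with 4× that much independent work so stalled threads always have runnable replacements.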

Myth Machine

Instruction Execution and Multi-threading

Modern CPUs and GPUs

Superscalar Processing

Modern Intel Chip Architecture

GPU Vector Processing

Thread Scheduling

Parallelism and Vectorization

GPU Performance and Data Requirements

Latency and Throughput

Throughput and Bottlenecks

Memory Bandwidth Bottleneck

ISPC Programming Language
