Effective Bandwidth in CUDA [closed]
When computing effective bandwidth in CUDA, do I count the number of reads/writes to shared memory? Example code is given below.
__global__ void kernel(float *a, float *b, float *c, int num){
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    __shared__ float a_shared[NUM];   // stage the inputs in shared memory
    __shared__ float b_shared[NUM];
    if (i < NUM){
        a_shared[i] = a[i];
        b_shared[i] = b[i];
        c[i] = a_shared[i] + b_shared[i];
    }
}
c++ cuda gpu gpu-shared-memory
asked Nov 23 '18 at 4:43
BIala T. A.
closed as unclear what you're asking by Gábor Bakos, talonmies, Ander Biguri, Rob, ead Nov 26 '18 at 8:17
docs.nvidia.com/cuda/cuda-c-best-practices-guide/… . This is documented in section 8.2.2 of the NVIDIA documentation.
– BIala T. A.
Nov 23 '18 at 4:55
A careful read of that section suggests to me that global loads/stores are in view, where the backing store for the global traffic is DRAM. Therefore shared activity is not in view and should not be included.
– Robert Crovella
Nov 23 '18 at 4:59
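For reference, the effective bandwidth calculation that section of the Best Practices Guide defines is:

    $$ \text{Effective bandwidth (GB/s)} = \frac{B_r + B_w}{10^{9} \times t} $$

where $B_r$ and $B_w$ are the bytes read and written by the kernel and $t$ is its execution time in seconds. Whether the shared-memory transfers contribute to $B_r$ and $B_w$ is exactly the question; the answer below argues they do not.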
1 Answer
With respect to the section of the Best Practices Guide that you indicate in the comments above, I would say the answer is no: shared memory traffic should not be included.
How do we know this?
- A principal purpose of the calculation of effective bandwidth is to compare it to theoretical bandwidth:
To measure performance accurately, it is useful to calculate theoretical and effective bandwidth. When the latter is much lower than the former, design or implementation details are likely to reduce bandwidth, and it should be the primary goal of subsequent optimization efforts to increase it.
However, the theoretical bandwidth calculation includes only global memory traffic to DRAM:
Using these data items, the peak theoretical memory bandwidth of the NVIDIA Tesla M2090 is 177.6 GB/s:
That number is a DRAM bandwidth; it does not include shared memory bandwidth. (The arithmetic behind that figure is reproduced after this list.)
- The references for profiler measurements all pertain to global memory traffic, not shared memory:
Requested Global Load Throughput
Requested Global Store Throughput
Global Load Throughput
Global Store Throughput
DRAM Read Throughput
DRAM Write Throughput
- The method for calculating theoretical shared memory bandwidth is not documented anywhere in the formal CUDA documentation that I am aware of, so it could not be included in a theoretical bandwidth calculation. Including a measurement of shared memory bandwidth would therefore make no sense for comparison purposes.
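As promised above, the arithmetic behind the 177.6 GB/s figure, exactly as the guide computes it (1.85 GHz memory clock, 384-bit memory interface, double data rate):

    $$ \frac{1.85 \times 10^{9} \times (384/8) \times 2}{10^{9}} = 177.6\ \text{GB/s} $$

Every factor is a DRAM property (clock, bus width, data rate per clock); nothing in the expression describes on-chip shared memory.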
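To make the byte accounting concrete, here is a minimal timing sketch around the kernel from the question. The launch configuration (a single block of NUM = 1024 threads, which matches the kernel's use of the global index into the shared arrays) and the cudaEvent timing are my assumptions for illustration, not part of the original post. The point is the byte count: it covers only the global loads of a and b and the global store of c, and deliberately excludes the shared-memory copies.

    #include <cstdio>
    #include <cuda_runtime.h>

    #define NUM 1024   // assumed problem size; one block of NUM threads

    __global__ void kernel(float *a, float *b, float *c, int num){
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        __shared__ float a_shared[NUM];
        __shared__ float b_shared[NUM];
        if (i < NUM){
            a_shared[i] = a[i];   // global load of a[i]: counted; shared store: not counted
            b_shared[i] = b[i];   // global load of b[i]: counted; shared store: not counted
            c[i] = a_shared[i] + b_shared[i];   // shared loads: not counted; global store of c[i]: counted
        }
    }

    int main(){
        float *a, *b, *c;   // device pointers (error checks omitted for brevity)
        cudaMalloc(&a, NUM * sizeof(float));
        cudaMalloc(&b, NUM * sizeof(float));
        cudaMalloc(&c, NUM * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        kernel<<<1, NUM>>>(a, b, c, NUM);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds

        // Global traffic only: 2 float loads (a, b) + 1 float store (c) per element.
        double bytes = 3.0 * NUM * sizeof(float);
        printf("Effective bandwidth: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));
        return 0;
    }

A launch this small is dominated by overhead, so the printed number will be far below any real measurement; in practice NUM and the grid would be scaled up, which would also require indexing the shared arrays by threadIdx.x rather than the global index. The accounting is the point: adding the shared-memory copies to the byte total would inflate it and make the comparison against the DRAM-based theoretical figure meaningless.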
edited Nov 23 '18 at 13:15
answered Nov 23 '18 at 5:11
Robert Crovella