Effective Bandwidth in CUDA [closed]


























When computing effective bandwidth in CUDA, do I count the reads and writes to shared memory? An example kernel is given below.



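#define NUM 256  // shared array size per block; 256 is an assumed value, the original does not define NUM

__global__ void kernel(const float *a, const float *b, float *c, int num){
    int i = threadIdx.x + blockIdx.x*blockDim.x;   // global element index
    __shared__ float a_shared[NUM];                // element type was missing
    __shared__ float b_shared[NUM];
    if (i < num){
        // Shared arrays are per block, so index them with threadIdx.x
        // (launch with blockDim.x <= NUM).
        a_shared[threadIdx.x] = a[i];              // global load, shared store
        b_shared[threadIdx.x] = b[i];              // global load, shared store
        c[i] = a_shared[threadIdx.x] + b_shared[threadIdx.x];  // shared loads, global store
    }
}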









closed as unclear what you're asking by Gábor Bakos, talonmies, Ander Biguri, Rob, ead Nov 26 '18 at 8:17


Please clarify your specific problem or add additional details to highlight exactly what you need. As it's currently written, it’s hard to tell exactly what you're asking. See the How to Ask page for help clarifying this question. If this question can be reworded to fit the rules in the help center, please edit the question.















  • docs.nvidia.com/cuda/cuda-c-best-practices-guide/… . This is documented in section 8.2.2 of the NVIDIA documentation.
    – BIala T. A.
    Nov 23 '18 at 4:55












  • A careful read of that section suggests to me that global loads/stores are in view, where the backing store for the global traffic is DRAM. Therefore shared activity is not in view and should not be included.
    – Robert Crovella
    Nov 23 '18 at 4:59
















c++ cuda gpu gpu-shared-memory






asked Nov 23 '18 at 4:43









BIala T. A.

1 Answer














With respect to the section of the best practices guide that you indicate in the comments above, I would say the answer is no, shared traffic should not be included.



How do we know this?




  1. A principal purpose of the calculation of effective bandwidth is to compare it to theoretical bandwidth:

     "To measure performance accurately, it is useful to calculate theoretical and effective bandwidth. When the latter is much lower than the former, design or implementation details are likely to reduce bandwidth, and it should be the primary goal of subsequent optimization efforts to increase it."

     However, the theoretical bandwidth calculation only includes global memory traffic to the DRAM:

     "Using these data items, the peak theoretical memory bandwidth of the NVIDIA Tesla M2090 is 177.6 GB/s:"

     That number is a DRAM bandwidth. It does not include shared memory bandwidth.




  2. The references for profiler measurements all pertain to global memory traffic, not shared memory:

     • Requested Global Load Throughput
     • Requested Global Store Throughput
     • Global Load Throughput
     • Global Store Throughput
     • DRAM Read Throughput
     • DRAM Write Throughput





  3. The method for calculating theoretical shared memory bandwidth is not documented anywhere in the formal CUDA documentation that I am aware of, so it could not be included in a theoretical bandwidth calculation. Including a measurement of shared memory bandwidth would therefore not make sense for comparison purposes.
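To make the three points above concrete, here is a minimal host-side sketch of both calculations. It assumes the formulas from the Best Practices Guide section referenced in the comments: theoretical bandwidth as memory clock × bus width in bytes × data-rate factor, and effective bandwidth as ((bytes read + bytes written) ÷ 10^9) ÷ time. The function names and the per-element accounting for the question's kernel are illustrative assumptions, not part of any CUDA API. Only the global loads of a[i] and b[i] and the global store to c[i] are counted; the shared memory accesses are left out.

```cpp
#include <cassert>
#include <cstddef>

// Peak theoretical DRAM bandwidth in GB/s (hypothetical helper):
// memory clock (Hz) x bus width (bytes) x transfers per clock / 1e9.
double theoretical_bandwidth_gbs(double mem_clock_hz, int bus_width_bits,
                                 int transfers_per_clock) {
    assert(mem_clock_hz > 0.0 && bus_width_bits > 0 && transfers_per_clock > 0);
    return mem_clock_hz * (bus_width_bits / 8.0) * transfers_per_clock / 1e9;
}

// Effective bandwidth in GB/s (hypothetical helper):
// (bytes read + bytes written) / 1e9 / elapsed seconds.
// Only global-memory traffic should be passed in here.
double effective_bandwidth_gbs(std::size_t bytes_read, std::size_t bytes_written,
                               double seconds) {
    assert(seconds > 0.0);
    return (static_cast<double>(bytes_read) +
            static_cast<double>(bytes_written)) / 1e9 / seconds;
}

// Global traffic of the question's kernel for n elements:
// two float loads (a[i], b[i]) and one float store (c[i]) per element.
// The shared-memory reads and writes are deliberately not counted.
double vector_add_effective_gbs(std::size_t n, double seconds) {
    const std::size_t bytes_read = 2 * n * sizeof(float);
    const std::size_t bytes_written = n * sizeof(float);
    return effective_bandwidth_gbs(bytes_read, bytes_written, seconds);
}
```

With the Tesla M2090 figures quoted above (1.85 GHz memory clock, 384-bit bus, double data rate), `theoretical_bandwidth_gbs(1.85e9, 384, 2)` gives 177.6 GB/s. For the kernel, the elapsed time would have to be measured, for example with CUDA events; with a made-up 1 ms for 2^20 elements, `vector_add_effective_gbs(1 << 20, 1e-3)` comes to about 12.58 GB/s, which is the number to compare against the DRAM figure.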






answered Nov 23 '18 at 5:11, edited Nov 23 '18 at 13:15

Robert Crovella
