Effective Bandwidth in CUDA [closed]
When computing effective bandwidth in CUDA, do I count the number of reads/writes to shared memory? Example code is given below.
__global__ void kernel(float *a, float *b, float *c, int num){
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    __shared__ float a_shared[NUM];   // stage the inputs in shared memory
    __shared__ float b_shared[NUM];
    if (i < NUM){
        a_shared[i] = a[i];
        b_shared[i] = b[i];
        c[i] = a_shared[i] + b_shared[i];
    }
}
c++ cuda gpu gpu-shared-memory
asked Nov 23 '18 at 4:43
BIala T. A.
closed as unclear what you're asking by Gábor Bakos, talonmies, Ander Biguri, Rob, ead Nov 26 '18 at 8:17
docs.nvidia.com/cuda/cuda-c-best-practices-guide/… . This is documented in section 8.2.2 of the NVIDIA documentation.
– BIala T. A.
Nov 23 '18 at 4:55
A careful read of that section suggests to me that global loads/stores are in view, where the backing store for the global traffic is DRAM. Therefore shared activity is not in view and should not be included.
– Robert Crovella
Nov 23 '18 at 4:59
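For reference, the effective bandwidth calculation that section of the Best Practices Guide defines is:

    $$ \text{Effective bandwidth (GB/s)} = \frac{B_r + B_w}{10^{9} \times t} $$

where $B_r$ and $B_w$ are the bytes read and written by the kernel and $t$ is its execution time in seconds. Whether the shared-memory transfers contribute to $B_r$ and $B_w$ is exactly the question; the answer below argues they do not.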
1 Answer
With respect to the section of the Best Practices Guide that you indicate in the comments above, I would say the answer is no: shared memory traffic should not be included.
How do we know this?
- A principal purpose of the calculation of effective bandwidth is to compare it to theoretical bandwidth:
To measure performance accurately, it is useful to calculate theoretical and effective bandwidth. When the latter is much lower than the former, design or implementation details are likely to reduce bandwidth, and it should be the primary goal of subsequent optimization efforts to increase it.
However, the theoretical bandwidth calculation includes only global memory traffic to DRAM:
Using these data items, the peak theoretical memory bandwidth of the NVIDIA Tesla M2090 is 177.6 GB/s:
That number is a DRAM bandwidth; it does not include shared memory bandwidth. (The arithmetic behind that figure is reproduced after this list.)
- The references for profiler measurements all pertain to global memory traffic, not shared memory:
Requested Global Load Throughput
Requested Global Store Throughput
Global Load Throughput
Global Store Throughput
DRAM Read Throughput
DRAM Write Throughput
- The method for calculating theoretical shared memory bandwidth is not documented anywhere in the formal CUDA documentation that I am aware of, so it could not be included in a theoretical bandwidth calculation. Including a measurement of shared memory bandwidth would therefore make no sense for comparison purposes.
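As promised above, the arithmetic behind the 177.6 GB/s figure, exactly as the guide computes it (1.85 GHz memory clock, 384-bit memory interface, double data rate):

    $$ \frac{1.85 \times 10^{9} \times (384/8) \times 2}{10^{9}} = 177.6\ \text{GB/s} $$

Every factor is a DRAM property (clock, bus width, data rate per clock); nothing in the expression describes on-chip shared memory.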
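To make the byte accounting concrete, here is a minimal timing sketch around the kernel from the question. The launch configuration (a single block of NUM = 1024 threads, which matches the kernel's use of the global index into the shared arrays) and the cudaEvent timing are my assumptions for illustration, not part of the original post. The point is the byte count: it covers only the global loads of a and b and the global store of c, and deliberately excludes the shared-memory copies.

    #include <cstdio>
    #include <cuda_runtime.h>

    #define NUM 1024   // assumed problem size; one block of NUM threads

    __global__ void kernel(float *a, float *b, float *c, int num){
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        __shared__ float a_shared[NUM];
        __shared__ float b_shared[NUM];
        if (i < NUM){
            a_shared[i] = a[i];   // global load of a[i]: counted; shared store: not counted
            b_shared[i] = b[i];   // global load of b[i]: counted; shared store: not counted
            c[i] = a_shared[i] + b_shared[i];   // shared loads: not counted; global store of c[i]: counted
        }
    }

    int main(){
        float *a, *b, *c;   // device pointers (error checks omitted for brevity)
        cudaMalloc(&a, NUM * sizeof(float));
        cudaMalloc(&b, NUM * sizeof(float));
        cudaMalloc(&c, NUM * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        kernel<<<1, NUM>>>(a, b, c, NUM);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds

        // Global traffic only: 2 float loads (a, b) + 1 float store (c) per element.
        double bytes = 3.0 * NUM * sizeof(float);
        printf("Effective bandwidth: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));
        return 0;
    }

A launch this small is dominated by overhead, so the printed number will be far below any real measurement; in practice NUM and the grid would be scaled up, which would also require indexing the shared arrays by threadIdx.x rather than the global index. The accounting is the point: adding the shared-memory copies to the byte total would inflate it and make the comparison against the DRAM-based theoretical figure meaningless.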
edited Nov 23 '18 at 13:15
answered Nov 23 '18 at 5:11
Robert Crovella