Are write-combining buffers used for normal writes to WB memory regions on Intel?
up vote
3
down vote
favorite
Write-combining buffers have been a feature of Intel CPUs going back to at least the Pentium 4 and probably before. The basic idea is that these cache-line sized buffers collect writes to the same cache line so they can be handled as a unit. As an example of their implications for software performance, if you don't write the full cache line, you may experience reduced performance.
For example, in Intel 64 and IA-32 Architectures Optimization Reference Manual section "3.6.10 Write Combining" starts with the following description (emphasis added):
Write combining (WC) improves performance in two ways:
• On a write
miss to the first-level cache, it allows multiple stores to the same
cache line to occur before that cache line is read for ownership (RFO)
from further out in the cache/memory hierarchy. Then the rest of line
is read, and the bytes that have not been written are combined with
the unmodified bytes in the returned line.
• Write combining allows
multiple writes to be assembled and written further out in the cache
hierarchy as a unit. This saves port and bus traffic. Saving traffic
is particularly important for avoiding partial writes to uncached
memory.
There are six write-combining buffers (on Pentium 4 and Intel
Xeon processors with a CPUID signature of family encoding 15, model
encoding 3; there are 8 write-combining buffers). Two of these buffers
may be written out to higher cache levels and freed up for use on
other write misses. Only four write- combining buffers are guaranteed
to be available for simultaneous use. Write combining applies to
memory type WC; it does not apply to memory type UC.
There are six
write-combining buffers in each processor core in Intel Core Duo and
Intel Core Solo processors. Processors based on Intel Core
microarchitecture have eight write-combining buffers in each core.
Starting with Intel microarchitecture code name Nehalem, there are 10
buffers available for write- combining.
Write combining buffers
are used for stores of all memory types. They are particularly
important for writes to uncached memory ...
My question is whether write combining applies to WB memory regions (that's the "normal" memory you are using 99.99% of the time in user programs), when using normal stores (that's anything other than non-temporal stores, i.e., the stores you are using 99.99% of the time).
The text above is hard to interpret exactly, and since not to have been updated since the Core Duo era. You have the part that says write combing "applies to WC memory but not UC", but of course that leaves out all the other types, like WB. Later you have that "[WC is] particularly important for writes to uncached memory", seemly contradicting the "doesn't apply to UC part".
So are write combining buffers used on modern Intel chips for normal stores to WB memory?
performance x86 intel cpu-architecture
add a comment |
up vote
3
down vote
favorite
Write-combining buffers have been a feature of Intel CPUs going back to at least the Pentium 4 and probably before. The basic idea is that these cache-line sized buffers collect writes to the same cache line so they can be handled as a unit. As an example of their implications for software performance, if you don't write the full cache line, you may experience reduced performance.
For example, in Intel 64 and IA-32 Architectures Optimization Reference Manual section "3.6.10 Write Combining" starts with the following description (emphasis added):
Write combining (WC) improves performance in two ways:
• On a write
miss to the first-level cache, it allows multiple stores to the same
cache line to occur before that cache line is read for ownership (RFO)
from further out in the cache/memory hierarchy. Then the rest of line
is read, and the bytes that have not been written are combined with
the unmodified bytes in the returned line.
• Write combining allows
multiple writes to be assembled and written further out in the cache
hierarchy as a unit. This saves port and bus traffic. Saving traffic
is particularly important for avoiding partial writes to uncached
memory.
There are six write-combining buffers (on Pentium 4 and Intel
Xeon processors with a CPUID signature of family encoding 15, model
encoding 3; there are 8 write-combining buffers). Two of these buffers
may be written out to higher cache levels and freed up for use on
other write misses. Only four write- combining buffers are guaranteed
to be available for simultaneous use. Write combining applies to
memory type WC; it does not apply to memory type UC.
There are six
write-combining buffers in each processor core in Intel Core Duo and
Intel Core Solo processors. Processors based on Intel Core
microarchitecture have eight write-combining buffers in each core.
Starting with Intel microarchitecture code name Nehalem, there are 10
buffers available for write- combining.
Write combining buffers
are used for stores of all memory types. They are particularly
important for writes to uncached memory ...
My question is whether write combining applies to WB memory regions (that's the "normal" memory you are using 99.99% of the time in user programs), when using normal stores (that's anything other than non-temporal stores, i.e., the stores you are using 99.99% of the time).
The text above is hard to interpret exactly, and since not to have been updated since the Core Duo era. You have the part that says write combing "applies to WC memory but not UC", but of course that leaves out all the other types, like WB. Later you have that "[WC is] particularly important for writes to uncached memory", seemly contradicting the "doesn't apply to UC part".
So are write combining buffers used on modern Intel chips for normal stores to WB memory?
performance x86 intel cpu-architecture
IIRC, I think I read somewhere that cache-miss stores (to WB memory) can commit into the LFB that's waiting for the data for that line to arrive. Or waiting for the RFO. But I might be mis-remembering, because I'm not sure that would let the core snoop those stores efficiently for store-forwarding.
– Peter Cordes
Nov 22 at 21:17
@PeterCordes that might also complicate memory ordering, since normal stores have to be strong ordered, so stores to different lines get combined into different in-flight buffers, it puts some strong restrictions about in what order the respective lines can be invalidated/made visible later. Perhaps other ordering concerns already imply this, I'm not sure.
– BeeOnRope
Nov 22 at 22:11
add a comment |
up vote
3
down vote
favorite
up vote
3
down vote
favorite
Write-combining buffers have been a feature of Intel CPUs going back to at least the Pentium 4 and probably before. The basic idea is that these cache-line sized buffers collect writes to the same cache line so they can be handled as a unit. As an example of their implications for software performance, if you don't write the full cache line, you may experience reduced performance.
For example, in Intel 64 and IA-32 Architectures Optimization Reference Manual section "3.6.10 Write Combining" starts with the following description (emphasis added):
Write combining (WC) improves performance in two ways:
• On a write
miss to the first-level cache, it allows multiple stores to the same
cache line to occur before that cache line is read for ownership (RFO)
from further out in the cache/memory hierarchy. Then the rest of line
is read, and the bytes that have not been written are combined with
the unmodified bytes in the returned line.
• Write combining allows
multiple writes to be assembled and written further out in the cache
hierarchy as a unit. This saves port and bus traffic. Saving traffic
is particularly important for avoiding partial writes to uncached
memory.
There are six write-combining buffers (on Pentium 4 and Intel
Xeon processors with a CPUID signature of family encoding 15, model
encoding 3; there are 8 write-combining buffers). Two of these buffers
may be written out to higher cache levels and freed up for use on
other write misses. Only four write- combining buffers are guaranteed
to be available for simultaneous use. Write combining applies to
memory type WC; it does not apply to memory type UC.
There are six
write-combining buffers in each processor core in Intel Core Duo and
Intel Core Solo processors. Processors based on Intel Core
microarchitecture have eight write-combining buffers in each core.
Starting with Intel microarchitecture code name Nehalem, there are 10
buffers available for write- combining.
Write combining buffers
are used for stores of all memory types. They are particularly
important for writes to uncached memory ...
My question is whether write combining applies to WB memory regions (that's the "normal" memory you are using 99.99% of the time in user programs), when using normal stores (that's anything other than non-temporal stores, i.e., the stores you are using 99.99% of the time).
The text above is hard to interpret exactly, and since not to have been updated since the Core Duo era. You have the part that says write combing "applies to WC memory but not UC", but of course that leaves out all the other types, like WB. Later you have that "[WC is] particularly important for writes to uncached memory", seemly contradicting the "doesn't apply to UC part".
So are write combining buffers used on modern Intel chips for normal stores to WB memory?
performance x86 intel cpu-architecture
Write-combining buffers have been a feature of Intel CPUs going back to at least the Pentium 4 and probably before. The basic idea is that these cache-line sized buffers collect writes to the same cache line so they can be handled as a unit. As an example of their implications for software performance, if you don't write the full cache line, you may experience reduced performance.
For example, in Intel 64 and IA-32 Architectures Optimization Reference Manual section "3.6.10 Write Combining" starts with the following description (emphasis added):
Write combining (WC) improves performance in two ways:
• On a write
miss to the first-level cache, it allows multiple stores to the same
cache line to occur before that cache line is read for ownership (RFO)
from further out in the cache/memory hierarchy. Then the rest of line
is read, and the bytes that have not been written are combined with
the unmodified bytes in the returned line.
• Write combining allows
multiple writes to be assembled and written further out in the cache
hierarchy as a unit. This saves port and bus traffic. Saving traffic
is particularly important for avoiding partial writes to uncached
memory.
There are six write-combining buffers (on Pentium 4 and Intel
Xeon processors with a CPUID signature of family encoding 15, model
encoding 3; there are 8 write-combining buffers). Two of these buffers
may be written out to higher cache levels and freed up for use on
other write misses. Only four write- combining buffers are guaranteed
to be available for simultaneous use. Write combining applies to
memory type WC; it does not apply to memory type UC.
There are six
write-combining buffers in each processor core in Intel Core Duo and
Intel Core Solo processors. Processors based on Intel Core
microarchitecture have eight write-combining buffers in each core.
Starting with Intel microarchitecture code name Nehalem, there are 10
buffers available for write- combining.
Write combining buffers
are used for stores of all memory types. They are particularly
important for writes to uncached memory ...
My question is whether write combining applies to WB memory regions (that's the "normal" memory you are using 99.99% of the time in user programs), when using normal stores (that's anything other than non-temporal stores, i.e., the stores you are using 99.99% of the time).
The text above is hard to interpret exactly, and since not to have been updated since the Core Duo era. You have the part that says write combing "applies to WC memory but not UC", but of course that leaves out all the other types, like WB. Later you have that "[WC is] particularly important for writes to uncached memory", seemly contradicting the "doesn't apply to UC part".
So are write combining buffers used on modern Intel chips for normal stores to WB memory?
performance x86 intel cpu-architecture
performance x86 intel cpu-architecture
edited Nov 22 at 21:31
asked Nov 22 at 17:09
BeeOnRope
24.7k874170
24.7k874170
IIRC, I think I read somewhere that cache-miss stores (to WB memory) can commit into the LFB that's waiting for the data for that line to arrive. Or waiting for the RFO. But I might be mis-remembering, because I'm not sure that would let the core snoop those stores efficiently for store-forwarding.
– Peter Cordes
Nov 22 at 21:17
@PeterCordes that might also complicate memory ordering, since normal stores have to be strong ordered, so stores to different lines get combined into different in-flight buffers, it puts some strong restrictions about in what order the respective lines can be invalidated/made visible later. Perhaps other ordering concerns already imply this, I'm not sure.
– BeeOnRope
Nov 22 at 22:11
add a comment |
IIRC, I think I read somewhere that cache-miss stores (to WB memory) can commit into the LFB that's waiting for the data for that line to arrive. Or waiting for the RFO. But I might be mis-remembering, because I'm not sure that would let the core snoop those stores efficiently for store-forwarding.
– Peter Cordes
Nov 22 at 21:17
@PeterCordes that might also complicate memory ordering, since normal stores have to be strong ordered, so stores to different lines get combined into different in-flight buffers, it puts some strong restrictions about in what order the respective lines can be invalidated/made visible later. Perhaps other ordering concerns already imply this, I'm not sure.
– BeeOnRope
Nov 22 at 22:11
IIRC, I think I read somewhere that cache-miss stores (to WB memory) can commit into the LFB that's waiting for the data for that line to arrive. Or waiting for the RFO. But I might be mis-remembering, because I'm not sure that would let the core snoop those stores efficiently for store-forwarding.
– Peter Cordes
Nov 22 at 21:17
IIRC, I think I read somewhere that cache-miss stores (to WB memory) can commit into the LFB that's waiting for the data for that line to arrive. Or waiting for the RFO. But I might be mis-remembering, because I'm not sure that would let the core snoop those stores efficiently for store-forwarding.
– Peter Cordes
Nov 22 at 21:17
@PeterCordes that might also complicate memory ordering, since normal stores have to be strong ordered, so stores to different lines get combined into different in-flight buffers, it puts some strong restrictions about in what order the respective lines can be invalidated/made visible later. Perhaps other ordering concerns already imply this, I'm not sure.
– BeeOnRope
Nov 22 at 22:11
@PeterCordes that might also complicate memory ordering, since normal stores have to be strong ordered, so stores to different lines get combined into different in-flight buffers, it puts some strong restrictions about in what order the respective lines can be invalidated/made visible later. Perhaps other ordering concerns already imply this, I'm not sure.
– BeeOnRope
Nov 22 at 22:11
add a comment |
1 Answer
1
active
oldest
votes
up vote
4
down vote
Yes, the write combining and coalescing properties of the LFBs support all memory types except the UC type. You can observe their impact experimentally using the following program. It takes two parameters as input:
STORE_COUNT
: the number of 8-byte stores to perform sequentially.
INCREMENT
: the stride between consecutive stores.
There are 4 different values of INCREMENT
that are particularly interesting:
- 64: All stores are performed on unique cache lines. Write combining and coalescing will not take an effect.
- 0: All stores are to the same cache line and the same location within that line. Write coalescing takes effect in this case.
- 8: Every 8 consecutive stores are to the same cache line, but different locations within that line. Write combining takes effect in this case.
- 4: The target locations of consecutive stores overlap within the same cache line. Some stores might cross two cache lines (depending on
STORE_COUNT
). Both write combining and coalescing will take an effect.
There is another parameter, ITERATIONS
, which is used to repeat the same experiment many times to make reliable measurements. You can keep it at 1000.
%define ITERATIONS 1000
BITS 64
DEFAULT REL
section .bss
align 64
bufsrc: resb STORE_COUNT*64
section .text
global _start
_start:
mov ecx, ITERATIONS
.loop:
; Flush all the cache lines to make sure that it takes a substantial amount of time to fetch them.
lea rsi, [bufsrc]
mov edx, STORE_COUNT
.flush:
clflush [rsi]
sfence
lfence
add rsi, 64
sub edx, 1
jnz .flush
; This is the main loop where the stores are issued sequentially.
lea rsi, [bufsrc]
mov edx, STORE_COUNT
.inner:
mov [rsi], rdx
sfence ; Prevents potential combining in the store buffer.
add rsi, INCREMENT
sub edx, 1
jnz .inner
; Spend sometime doing nothing so that all the LFBs become free for the next iteration.
mov edx, 100000
.wait:
lfence
sub edx, 1
jnz .wait
sub ecx, 1
jnz .loop
; Exit.
xor edi,edi
mov eax,231
syscall
I recommend the following setup:
- Disable all hardware prefetchers using
sudo wrmsr -a 0x1A4 0xf
. This ensures that they will not interfere (or have minimal interference) with the experiments. - Set the CPU frequency to the maximum. This increases the probability that the main loop will be fully executed before the first cache line reaches the L1 and causes an LFB to be freed.
- Disable hyperthreading because the LFBs are shared (at least since Sandy Bridge, but not on all microarchitectures).
The L1D_PEND_MISS.FB_FULL
performance counter enables us to capture the effect of write combining regarding how it impacts the availability of LFBs. It is supported on Intel Core and later. It is described as follows:
Number of times a request needed a FB (Fill Buffer) entry but there
was no entry available for it. A request includes
cacheable/uncacheable demands that are load, store or SW prefetch
instructions.
First run the code without the inner loop and make sure that L1D_PEND_MISS.FB_FULL
is zero, which means the the flush loop has no impact on the event count.
The following figure plots STORE_COUNT
against total L1D_PEND_MISS.FB_FULL
divided by ITERATIONS
.
We can observe the following:
- It's clear that there are exactly 10 LFBs.
- When write combining or coalescing is possible,
L1D_PEND_MISS.FB_FULL
is zero for any number of stores. - When the stride is 64 bytes,
L1D_PEND_MISS.FB_FULL
is larger than zero when the number of stores is larger than 10.
Later you have that "[WC is] particularly important for writes to
uncached memory", seemly contradicting the "doesn't apply to UC part".
Both WC and UC are classified as uncachable. So you can put the two statements together to deduce that WC is particularly important for writes to WC memory.
See also: Where is the Write-Combining Buffer located? x86.
Interesting tests. However, I don't think the results support the conclusion. Why would theINCREMENT
0, 4 and 8 all also have an "elbow" at exactly 10? You say It appears that write combining or coalescing cannot be performed without some penalty. An LFB seems to be reserved for every issued store until it is determined that it can be merged within an already allocated LFB - but this seems like an unlikely mechanism: allocating an LFB, realizing the mistake, then deallocating it and coalescing the load? Seems prone to races. Lets say that was the mechanism, however...
– BeeOnRope
Nov 22 at 22:23
... in that case why would they all show different behavior at 10? One would expect this to resolve itself before filling all the buffers. I guess it might have to do with yoursfence
: perhaps thesfence
forces all the stores to get their own LFB. BTW, it's a shame that thel1d_pend_miss.pending
andl1d_pend_miss.pending_cycles
events don't count LFBs allocated for stores (or that there is not similar events for stores).
– BeeOnRope
Nov 22 at 22:24
Note that these measurements are taken over the outer loop. Then I'm dividing byITERATIONS
. So I'm not sure whether the elbow at 10 is due to the flush loop, the inner loop, or both. Is there an easy way to measure over only the inner loop so we can know for sure?
– Hadi Brais
Nov 22 at 22:33
1
I think the graph can be explained by an observation you already made: This means that LFBs are becoming available much earlier when write combining or coalescing is possible. You are begging the question there: I think you are right that the indication is that more lines become free available sooner for the lower increments, but can't this simply be explained by it taking less time to return 1 line from memory (the 0, 4 increment cases) or 2 lines (the 8 case) than 10 lines (the 64 case)? You don't necessarily need to invoke coalescing.
– BeeOnRope
Nov 22 at 22:34
1
Now the graph looks like I would expect it. Isn't this just telling us that storing to 10+ cache lines (the increment 64 case) in rapid succession exceeds the 10 LFBs, whereas storing to 1 or 2 (the other cases), doesn't? I'm actually starting to worry my question is not well-formed. I expected that a given LFB absorb all later read or store requests to the same line, and I think that's what your graph shows. Does that make it "write combining" in the sense of the Intel manual though? Perhaps I didn't do a good job distinguishing the two.
– BeeOnRope
Nov 23 at 3:11
|
show 8 more comments
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53435632%2fare-write-combining-buffers-used-for-normal-writes-to-wb-memory-regions-on-intel%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
4
down vote
Yes, the write combining and coalescing properties of the LFBs support all memory types except the UC type. You can observe their impact experimentally using the following program. It takes two parameters as input:
STORE_COUNT
: the number of 8-byte stores to perform sequentially.
INCREMENT
: the stride between consecutive stores.
There are 4 different values of INCREMENT
that are particularly interesting:
- 64: All stores are performed on unique cache lines. Write combining and coalescing will not take an effect.
- 0: All stores are to the same cache line and the same location within that line. Write coalescing takes effect in this case.
- 8: Every 8 consecutive stores are to the same cache line, but different locations within that line. Write combining takes effect in this case.
- 4: The target locations of consecutive stores overlap within the same cache line. Some stores might cross two cache lines (depending on
STORE_COUNT
). Both write combining and coalescing will take an effect.
There is another parameter, ITERATIONS
, which is used to repeat the same experiment many times to make reliable measurements. You can keep it at 1000.
%define ITERATIONS 1000
BITS 64
DEFAULT REL
section .bss
align 64
bufsrc: resb STORE_COUNT*64
section .text
global _start
_start:
mov ecx, ITERATIONS
.loop:
; Flush all the cache lines to make sure that it takes a substantial amount of time to fetch them.
lea rsi, [bufsrc]
mov edx, STORE_COUNT
.flush:
clflush [rsi]
sfence
lfence
add rsi, 64
sub edx, 1
jnz .flush
; This is the main loop where the stores are issued sequentially.
lea rsi, [bufsrc]
mov edx, STORE_COUNT
.inner:
mov [rsi], rdx
sfence ; Prevents potential combining in the store buffer.
add rsi, INCREMENT
sub edx, 1
jnz .inner
; Spend sometime doing nothing so that all the LFBs become free for the next iteration.
mov edx, 100000
.wait:
lfence
sub edx, 1
jnz .wait
sub ecx, 1
jnz .loop
; Exit.
xor edi,edi
mov eax,231
syscall
I recommend the following setup:
- Disable all hardware prefetchers using
sudo wrmsr -a 0x1A4 0xf
. This ensures that they will not interfere (or have minimal interference) with the experiments. - Set the CPU frequency to the maximum. This increases the probability that the main loop will be fully executed before the first cache line reaches the L1 and causes an LFB to be freed.
- Disable hyperthreading because the LFBs are shared (at least since Sandy Bridge, but not on all microarchitectures).
The L1D_PEND_MISS.FB_FULL
performance counter enables us to capture the effect of write combining regarding how it impacts the availability of LFBs. It is supported on Intel Core and later. It is described as follows:
Number of times a request needed a FB (Fill Buffer) entry but there
was no entry available for it. A request includes
cacheable/uncacheable demands that are load, store or SW prefetch
instructions.
First run the code without the inner loop and make sure that L1D_PEND_MISS.FB_FULL
is zero, which means the the flush loop has no impact on the event count.
The following figure plots STORE_COUNT
against total L1D_PEND_MISS.FB_FULL
divided by ITERATIONS
.
We can observe the following:
- It's clear that there are exactly 10 LFBs.
- When write combining or coalescing is possible,
L1D_PEND_MISS.FB_FULL
is zero for any number of stores. - When the stride is 64 bytes,
L1D_PEND_MISS.FB_FULL
is larger than zero when the number of stores is larger than 10.
Later you have that "[WC is] particularly important for writes to
uncached memory", seemly contradicting the "doesn't apply to UC part".
Both WC and UC are classified as uncachable. So you can put the two statements together to deduce that WC is particularly important for writes to WC memory.
See also: Where is the Write-Combining Buffer located? x86.
Interesting tests. However, I don't think the results support the conclusion. Why would theINCREMENT
0, 4 and 8 all also have an "elbow" at exactly 10? You say It appears that write combining or coalescing cannot be performed without some penalty. An LFB seems to be reserved for every issued store until it is determined that it can be merged within an already allocated LFB - but this seems like an unlikely mechanism: allocating an LFB, realizing the mistake, then deallocating it and coalescing the load? Seems prone to races. Lets say that was the mechanism, however...
– BeeOnRope
Nov 22 at 22:23
... in that case why would they all show different behavior at 10? One would expect this to resolve itself before filling all the buffers. I guess it might have to do with yoursfence
: perhaps thesfence
forces all the stores to get their own LFB. BTW, it's a shame that thel1d_pend_miss.pending
andl1d_pend_miss.pending_cycles
events don't count LFBs allocated for stores (or that there is not similar events for stores).
– BeeOnRope
Nov 22 at 22:24
Note that these measurements are taken over the outer loop. Then I'm dividing byITERATIONS
. So I'm not sure whether the elbow at 10 is due to the flush loop, the inner loop, or both. Is there an easy way to measure over only the inner loop so we can know for sure?
– Hadi Brais
Nov 22 at 22:33
1
I think the graph can be explained by an observation you already made: This means that LFBs are becoming available much earlier when write combining or coalescing is possible. You are begging the question there: I think you are right that the indication is that more lines become free available sooner for the lower increments, but can't this simply be explained by it taking less time to return 1 line from memory (the 0, 4 increment cases) or 2 lines (the 8 case) than 10 lines (the 64 case)? You don't necessarily need to invoke coalescing.
– BeeOnRope
Nov 22 at 22:34
1
Now the graph looks like I would expect it. Isn't this just telling us that storing to 10+ cache lines (the increment 64 case) in rapid succession exceeds the 10 LFBs, whereas storing to 1 or 2 (the other cases), doesn't? I'm actually starting to worry my question is not well-formed. I expected that a given LFB absorb all later read or store requests to the same line, and I think that's what your graph shows. Does that make it "write combining" in the sense of the Intel manual though? Perhaps I didn't do a good job distinguishing the two.
– BeeOnRope
Nov 23 at 3:11
|
show 8 more comments
up vote
4
down vote
Yes, the write combining and coalescing properties of the LFBs support all memory types except the UC type. You can observe their impact experimentally using the following program. It takes two parameters as input:
STORE_COUNT
: the number of 8-byte stores to perform sequentially.
INCREMENT
: the stride between consecutive stores.
There are 4 different values of INCREMENT
that are particularly interesting:
- 64: All stores are performed on unique cache lines. Write combining and coalescing will not take an effect.
- 0: All stores are to the same cache line and the same location within that line. Write coalescing takes effect in this case.
- 8: Every 8 consecutive stores are to the same cache line, but different locations within that line. Write combining takes effect in this case.
- 4: The target locations of consecutive stores overlap within the same cache line. Some stores might cross two cache lines (depending on
STORE_COUNT
). Both write combining and coalescing will take an effect.
There is another parameter, ITERATIONS
, which is used to repeat the same experiment many times to make reliable measurements. You can keep it at 1000.
%define ITERATIONS 1000
BITS 64
DEFAULT REL
section .bss
align 64
bufsrc: resb STORE_COUNT*64
section .text
global _start
_start:
mov ecx, ITERATIONS
.loop:
; Flush all the cache lines to make sure that it takes a substantial amount of time to fetch them.
lea rsi, [bufsrc]
mov edx, STORE_COUNT
.flush:
clflush [rsi]
sfence
lfence
add rsi, 64
sub edx, 1
jnz .flush
; This is the main loop where the stores are issued sequentially.
lea rsi, [bufsrc]
mov edx, STORE_COUNT
.inner:
mov [rsi], rdx
sfence ; Prevents potential combining in the store buffer.
add rsi, INCREMENT
sub edx, 1
jnz .inner
; Spend sometime doing nothing so that all the LFBs become free for the next iteration.
mov edx, 100000
.wait:
lfence
sub edx, 1
jnz .wait
sub ecx, 1
jnz .loop
; Exit.
xor edi,edi
mov eax,231
syscall
I recommend the following setup:
- Disable all hardware prefetchers using
sudo wrmsr -a 0x1A4 0xf
. This ensures that they will not interfere (or have minimal interference) with the experiments. - Set the CPU frequency to the maximum. This increases the probability that the main loop will be fully executed before the first cache line reaches the L1 and causes an LFB to be freed.
- Disable hyperthreading because the LFBs are shared (at least since Sandy Bridge, but not on all microarchitectures).
The L1D_PEND_MISS.FB_FULL
performance counter enables us to capture the effect of write combining regarding how it impacts the availability of LFBs. It is supported on Intel Core and later. It is described as follows:
Number of times a request needed a FB (Fill Buffer) entry but there
was no entry available for it. A request includes
cacheable/uncacheable demands that are load, store or SW prefetch
instructions.
First run the code without the inner loop and make sure that L1D_PEND_MISS.FB_FULL
is zero, which means the the flush loop has no impact on the event count.
The following figure plots STORE_COUNT
against total L1D_PEND_MISS.FB_FULL
divided by ITERATIONS
.
We can observe the following:
- It's clear that there are exactly 10 LFBs.
- When write combining or coalescing is possible,
L1D_PEND_MISS.FB_FULL
is zero for any number of stores. - When the stride is 64 bytes,
L1D_PEND_MISS.FB_FULL
is larger than zero when the number of stores is larger than 10.
Later you have that "[WC is] particularly important for writes to
uncached memory", seemly contradicting the "doesn't apply to UC part".
Both WC and UC are classified as uncachable. So you can put the two statements together to deduce that WC is particularly important for writes to WC memory.
See also: Where is the Write-Combining Buffer located? x86.
Interesting tests. However, I don't think the results support the conclusion. Why would theINCREMENT
0, 4 and 8 all also have an "elbow" at exactly 10? You say It appears that write combining or coalescing cannot be performed without some penalty. An LFB seems to be reserved for every issued store until it is determined that it can be merged within an already allocated LFB - but this seems like an unlikely mechanism: allocating an LFB, realizing the mistake, then deallocating it and coalescing the load? Seems prone to races. Lets say that was the mechanism, however...
– BeeOnRope
Nov 22 at 22:23
... in that case why would they all show different behavior at 10? One would expect this to resolve itself before filling all the buffers. I guess it might have to do with yoursfence
: perhaps thesfence
forces all the stores to get their own LFB. BTW, it's a shame that thel1d_pend_miss.pending
andl1d_pend_miss.pending_cycles
events don't count LFBs allocated for stores (or that there is not similar events for stores).
– BeeOnRope
Nov 22 at 22:24
Note that these measurements are taken over the outer loop. Then I'm dividing byITERATIONS
. So I'm not sure whether the elbow at 10 is due to the flush loop, the inner loop, or both. Is there an easy way to measure over only the inner loop so we can know for sure?
– Hadi Brais
Nov 22 at 22:33
1
I think the graph can be explained by an observation you already made: This means that LFBs are becoming available much earlier when write combining or coalescing is possible. You are begging the question there: I think you are right that the indication is that more lines become free available sooner for the lower increments, but can't this simply be explained by it taking less time to return 1 line from memory (the 0, 4 increment cases) or 2 lines (the 8 case) than 10 lines (the 64 case)? You don't necessarily need to invoke coalescing.
– BeeOnRope
Nov 22 at 22:34
1
Now the graph looks like I would expect it. Isn't this just telling us that storing to 10+ cache lines (the increment 64 case) in rapid succession exceeds the 10 LFBs, whereas storing to 1 or 2 (the other cases), doesn't? I'm actually starting to worry my question is not well-formed. I expected that a given LFB absorb all later read or store requests to the same line, and I think that's what your graph shows. Does that make it "write combining" in the sense of the Intel manual though? Perhaps I didn't do a good job distinguishing the two.
– BeeOnRope
Nov 23 at 3:11
|
show 8 more comments
up vote
4
down vote
up vote
4
down vote
Yes, the write combining and coalescing properties of the LFBs support all memory types except the UC type. You can observe their impact experimentally using the following program. It takes two parameters as input:
STORE_COUNT
: the number of 8-byte stores to perform sequentially.
INCREMENT
: the stride between consecutive stores.
There are 4 different values of INCREMENT
that are particularly interesting:
- 64: All stores are performed on unique cache lines. Write combining and coalescing will not take an effect.
- 0: All stores are to the same cache line and the same location within that line. Write coalescing takes effect in this case.
- 8: Every 8 consecutive stores are to the same cache line, but different locations within that line. Write combining takes effect in this case.
- 4: The target locations of consecutive stores overlap within the same cache line. Some stores might cross two cache lines (depending on
STORE_COUNT
). Both write combining and coalescing will take an effect.
There is another parameter, ITERATIONS
, which is used to repeat the same experiment many times to make reliable measurements. You can keep it at 1000.
%define ITERATIONS 1000
BITS 64
DEFAULT REL
section .bss
align 64
bufsrc: resb STORE_COUNT*64
section .text
global _start
_start:
mov ecx, ITERATIONS
.loop:
; Flush all the cache lines to make sure that it takes a substantial amount of time to fetch them.
lea rsi, [bufsrc]
mov edx, STORE_COUNT
.flush:
clflush [rsi]
sfence
lfence
add rsi, 64
sub edx, 1
jnz .flush
; This is the main loop where the stores are issued sequentially.
lea rsi, [bufsrc]
mov edx, STORE_COUNT
.inner:
mov [rsi], rdx
sfence ; Prevents potential combining in the store buffer.
add rsi, INCREMENT
sub edx, 1
jnz .inner
; Spend sometime doing nothing so that all the LFBs become free for the next iteration.
mov edx, 100000
.wait:
lfence
sub edx, 1
jnz .wait
sub ecx, 1
jnz .loop
; Exit.
xor edi,edi
mov eax,231
syscall
I recommend the following setup:
- Disable all hardware prefetchers using
sudo wrmsr -a 0x1A4 0xf
. This ensures that they will not interfere (or have minimal interference) with the experiments. - Set the CPU frequency to the maximum. This increases the probability that the main loop will be fully executed before the first cache line reaches the L1 and causes an LFB to be freed.
- Disable hyperthreading because the LFBs are shared (at least since Sandy Bridge, but not on all microarchitectures).
The L1D_PEND_MISS.FB_FULL
performance counter enables us to capture the effect of write combining regarding how it impacts the availability of LFBs. It is supported on Intel Core and later. It is described as follows:
Number of times a request needed a FB (Fill Buffer) entry but there
was no entry available for it. A request includes
cacheable/uncacheable demands that are load, store or SW prefetch
instructions.
First run the code without the inner loop and make sure that L1D_PEND_MISS.FB_FULL
is zero, which means the the flush loop has no impact on the event count.
The following figure plots STORE_COUNT
against total L1D_PEND_MISS.FB_FULL
divided by ITERATIONS
.
We can observe the following:
- It's clear that there are exactly 10 LFBs.
- When write combining or coalescing is possible,
L1D_PEND_MISS.FB_FULL
is zero for any number of stores. - When the stride is 64 bytes,
L1D_PEND_MISS.FB_FULL
is larger than zero when the number of stores is larger than 10.
Later you have that "[WC is] particularly important for writes to
uncached memory", seemly contradicting the "doesn't apply to UC part".
Both WC and UC are classified as uncachable. So you can put the two statements together to deduce that WC is particularly important for writes to WC memory.
See also: Where is the Write-Combining Buffer located? x86.
Yes, the write combining and coalescing properties of the LFBs support all memory types except the UC type. You can observe their impact experimentally using the following program. It takes two parameters as input:
STORE_COUNT
: the number of 8-byte stores to perform sequentially.
INCREMENT
: the stride between consecutive stores.
There are 4 different values of INCREMENT
that are particularly interesting:
- 64: All stores are performed on unique cache lines. Write combining and coalescing will not take an effect.
- 0: All stores are to the same cache line and the same location within that line. Write coalescing takes effect in this case.
- 8: Every 8 consecutive stores are to the same cache line, but different locations within that line. Write combining takes effect in this case.
- 4: The target locations of consecutive stores overlap within the same cache line. Some stores might cross two cache lines (depending on
STORE_COUNT
). Both write combining and coalescing will take an effect.
There is another parameter, ITERATIONS
, which is used to repeat the same experiment many times to make reliable measurements. You can keep it at 1000.
%define ITERATIONS 1000
BITS 64
DEFAULT REL
section .bss
align 64
bufsrc: resb STORE_COUNT*64
section .text
global _start
_start:
mov ecx, ITERATIONS
.loop:
; Flush all the cache lines to make sure that it takes a substantial amount of time to fetch them.
lea rsi, [bufsrc]
mov edx, STORE_COUNT
.flush:
clflush [rsi]
sfence
lfence
add rsi, 64
sub edx, 1
jnz .flush
; This is the main loop where the stores are issued sequentially.
lea rsi, [bufsrc]
mov edx, STORE_COUNT
.inner:
mov [rsi], rdx
sfence ; Prevents potential combining in the store buffer.
add rsi, INCREMENT
sub edx, 1
jnz .inner
; Spend sometime doing nothing so that all the LFBs become free for the next iteration.
mov edx, 100000
.wait:
lfence
sub edx, 1
jnz .wait
sub ecx, 1
jnz .loop
; Exit.
xor edi,edi
mov eax,231
syscall
I recommend the following setup:
- Disable all hardware prefetchers using
sudo wrmsr -a 0x1A4 0xf
. This ensures that they will not interfere (or have minimal interference) with the experiments. - Set the CPU frequency to the maximum. This increases the probability that the main loop will be fully executed before the first cache line reaches the L1 and causes an LFB to be freed.
- Disable hyperthreading because the LFBs are shared (at least since Sandy Bridge, but not on all microarchitectures).
The L1D_PEND_MISS.FB_FULL
performance counter enables us to capture the effect of write combining regarding how it impacts the availability of LFBs. It is supported on Intel Core and later. It is described as follows:
Number of times a request needed a FB (Fill Buffer) entry but there
was no entry available for it. A request includes
cacheable/uncacheable demands that are load, store or SW prefetch
instructions.
First run the code without the inner loop and make sure that L1D_PEND_MISS.FB_FULL
is zero, which means the the flush loop has no impact on the event count.
The following figure plots STORE_COUNT
against total L1D_PEND_MISS.FB_FULL
divided by ITERATIONS
.
We can observe the following:
- It's clear that there are exactly 10 LFBs.
- When write combining or coalescing is possible,
L1D_PEND_MISS.FB_FULL
is zero for any number of stores. - When the stride is 64 bytes,
L1D_PEND_MISS.FB_FULL
is larger than zero when the number of stores is larger than 10.
Later you have that "[WC is] particularly important for writes to
uncached memory", seemly contradicting the "doesn't apply to UC part".
Both WC and UC are classified as uncachable. So you can put the two statements together to deduce that WC is particularly important for writes to WC memory.
See also: Where is the Write-Combining Buffer located? x86.
edited Nov 23 at 2:13
answered Nov 22 at 21:35
Hadi Brais
9,06711737
9,06711737
Interesting tests. However, I don't think the results support the conclusion. Why would theINCREMENT
0, 4 and 8 all also have an "elbow" at exactly 10? You say It appears that write combining or coalescing cannot be performed without some penalty. An LFB seems to be reserved for every issued store until it is determined that it can be merged within an already allocated LFB - but this seems like an unlikely mechanism: allocating an LFB, realizing the mistake, then deallocating it and coalescing the load? Seems prone to races. Lets say that was the mechanism, however...
– BeeOnRope
Nov 22 at 22:23
... in that case why would they all show different behavior at 10? One would expect this to resolve itself before filling all the buffers. I guess it might have to do with yoursfence
: perhaps thesfence
forces all the stores to get their own LFB. BTW, it's a shame that thel1d_pend_miss.pending
andl1d_pend_miss.pending_cycles
events don't count LFBs allocated for stores (or that there is not similar events for stores).
– BeeOnRope
Nov 22 at 22:24
Note that these measurements are taken over the outer loop. Then I'm dividing byITERATIONS
. So I'm not sure whether the elbow at 10 is due to the flush loop, the inner loop, or both. Is there an easy way to measure over only the inner loop so we can know for sure?
– Hadi Brais
Nov 22 at 22:33
1
I think the graph can be explained by an observation you already made: This means that LFBs are becoming available much earlier when write combining or coalescing is possible. You are begging the question there: I think you are right that the indication is that more lines become free available sooner for the lower increments, but can't this simply be explained by it taking less time to return 1 line from memory (the 0, 4 increment cases) or 2 lines (the 8 case) than 10 lines (the 64 case)? You don't necessarily need to invoke coalescing.
– BeeOnRope
Nov 22 at 22:34
1
Now the graph looks like I would expect it. Isn't this just telling us that storing to 10+ cache lines (the increment 64 case) in rapid succession exceeds the 10 LFBs, whereas storing to 1 or 2 (the other cases), doesn't? I'm actually starting to worry my question is not well-formed. I expected that a given LFB absorb all later read or store requests to the same line, and I think that's what your graph shows. Does that make it "write combining" in the sense of the Intel manual though? Perhaps I didn't do a good job distinguishing the two.
– BeeOnRope
Nov 23 at 3:11
|
show 8 more comments
Interesting tests. However, I don't think the results support the conclusion. Why would theINCREMENT
0, 4 and 8 all also have an "elbow" at exactly 10? You say It appears that write combining or coalescing cannot be performed without some penalty. An LFB seems to be reserved for every issued store until it is determined that it can be merged within an already allocated LFB - but this seems like an unlikely mechanism: allocating an LFB, realizing the mistake, then deallocating it and coalescing the load? Seems prone to races. Lets say that was the mechanism, however...
– BeeOnRope
Nov 22 at 22:23
... in that case why would they all show different behavior at 10? One would expect this to resolve itself before filling all the buffers. I guess it might have to do with yoursfence
: perhaps thesfence
forces all the stores to get their own LFB. BTW, it's a shame that thel1d_pend_miss.pending
andl1d_pend_miss.pending_cycles
events don't count LFBs allocated for stores (or that there is not similar events for stores).
– BeeOnRope
Nov 22 at 22:24
Note that these measurements are taken over the outer loop. Then I'm dividing byITERATIONS
. So I'm not sure whether the elbow at 10 is due to the flush loop, the inner loop, or both. Is there an easy way to measure over only the inner loop so we can know for sure?
– Hadi Brais
Nov 22 at 22:33
1
I think the graph can be explained by an observation you already made: This means that LFBs are becoming available much earlier when write combining or coalescing is possible. You are begging the question there: I think you are right that the indication is that more lines become free available sooner for the lower increments, but can't this simply be explained by it taking less time to return 1 line from memory (the 0, 4 increment cases) or 2 lines (the 8 case) than 10 lines (the 64 case)? You don't necessarily need to invoke coalescing.
– BeeOnRope
Nov 22 at 22:34
1
Now the graph looks like I would expect it. Isn't this just telling us that storing to 10+ cache lines (the increment 64 case) in rapid succession exceeds the 10 LFBs, whereas storing to 1 or 2 (the other cases), doesn't? I'm actually starting to worry my question is not well-formed. I expected that a given LFB absorb all later read or store requests to the same line, and I think that's what your graph shows. Does that make it "write combining" in the sense of the Intel manual though? Perhaps I didn't do a good job distinguishing the two.
– BeeOnRope
Nov 23 at 3:11
Interesting tests. However, I don't think the results support the conclusion. Why would the
INCREMENT
0, 4 and 8 all also have an "elbow" at exactly 10? You say It appears that write combining or coalescing cannot be performed without some penalty. An LFB seems to be reserved for every issued store until it is determined that it can be merged within an already allocated LFB - but this seems like an unlikely mechanism: allocating an LFB, realizing the mistake, then deallocating it and coalescing the load? Seems prone to races. Lets say that was the mechanism, however...– BeeOnRope
Nov 22 at 22:23
Interesting tests. However, I don't think the results support the conclusion. Why would the
INCREMENT
0, 4 and 8 all also have an "elbow" at exactly 10? You say It appears that write combining or coalescing cannot be performed without some penalty. An LFB seems to be reserved for every issued store until it is determined that it can be merged within an already allocated LFB - but this seems like an unlikely mechanism: allocating an LFB, realizing the mistake, then deallocating it and coalescing the load? Seems prone to races. Lets say that was the mechanism, however...– BeeOnRope
Nov 22 at 22:23
... in that case why would they all show different behavior at 10? One would expect this to resolve itself before filling all the buffers. I guess it might have to do with your
sfence
: perhaps the sfence
forces all the stores to get their own LFB. BTW, it's a shame that the l1d_pend_miss.pending
and l1d_pend_miss.pending_cycles
events don't count LFBs allocated for stores (or that there is not similar events for stores).– BeeOnRope
Nov 22 at 22:24
... in that case why would they all show different behavior at 10? One would expect this to resolve itself before filling all the buffers. I guess it might have to do with your
sfence
: perhaps the sfence
forces all the stores to get their own LFB. BTW, it's a shame that the l1d_pend_miss.pending
and l1d_pend_miss.pending_cycles
events don't count LFBs allocated for stores (or that there is not similar events for stores).– BeeOnRope
Nov 22 at 22:24
Note that these measurements are taken over the outer loop. Then I'm dividing by
ITERATIONS
. So I'm not sure whether the elbow at 10 is due to the flush loop, the inner loop, or both. Is there an easy way to measure over only the inner loop so we can know for sure?– Hadi Brais
Nov 22 at 22:33
Note that these measurements are taken over the outer loop. Then I'm dividing by
ITERATIONS
. So I'm not sure whether the elbow at 10 is due to the flush loop, the inner loop, or both. Is there an easy way to measure over only the inner loop so we can know for sure?– Hadi Brais
Nov 22 at 22:33
1
1
I think the graph can be explained by an observation you already made: This means that LFBs are becoming available much earlier when write combining or coalescing is possible. You are begging the question there: I think you are right that the indication is that more lines become free available sooner for the lower increments, but can't this simply be explained by it taking less time to return 1 line from memory (the 0, 4 increment cases) or 2 lines (the 8 case) than 10 lines (the 64 case)? You don't necessarily need to invoke coalescing.
– BeeOnRope
Nov 22 at 22:34
I think the graph can be explained by an observation you already made: This means that LFBs are becoming available much earlier when write combining or coalescing is possible. You are begging the question there: I think you are right that the indication is that more lines become free available sooner for the lower increments, but can't this simply be explained by it taking less time to return 1 line from memory (the 0, 4 increment cases) or 2 lines (the 8 case) than 10 lines (the 64 case)? You don't necessarily need to invoke coalescing.
– BeeOnRope
Nov 22 at 22:34
1
1
Now the graph looks like I would expect it. Isn't this just telling us that storing to 10+ cache lines (the increment 64 case) in rapid succession exceeds the 10 LFBs, whereas storing to 1 or 2 (the other cases), doesn't? I'm actually starting to worry my question is not well-formed. I expected that a given LFB absorb all later read or store requests to the same line, and I think that's what your graph shows. Does that make it "write combining" in the sense of the Intel manual though? Perhaps I didn't do a good job distinguishing the two.
– BeeOnRope
Nov 23 at 3:11
Now the graph looks like I would expect it. Isn't this just telling us that storing to 10+ cache lines (the increment 64 case) in rapid succession exceeds the 10 LFBs, whereas storing to 1 or 2 (the other cases), doesn't? I'm actually starting to worry my question is not well-formed. I expected that a given LFB absorb all later read or store requests to the same line, and I think that's what your graph shows. Does that make it "write combining" in the sense of the Intel manual though? Perhaps I didn't do a good job distinguishing the two.
– BeeOnRope
Nov 23 at 3:11
|
show 8 more comments
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53435632%2fare-write-combining-buffers-used-for-normal-writes-to-wb-memory-regions-on-intel%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
IIRC, I think I read somewhere that cache-miss stores (to WB memory) can commit into the LFB that's waiting for the data for that line to arrive. Or waiting for the RFO. But I might be mis-remembering, because I'm not sure that would let the core snoop those stores efficiently for store-forwarding.
– Peter Cordes
Nov 22 at 21:17
@PeterCordes that might also complicate memory ordering, since normal stores have to be strong ordered, so stores to different lines get combined into different in-flight buffers, it puts some strong restrictions about in what order the respective lines can be invalidated/made visible later. Perhaps other ordering concerns already imply this, I'm not sure.
– BeeOnRope
Nov 22 at 22:11