improve atomic read from InterlockedCompareExchange()











Assume the architecture is ARM64 or x86-64.

I want to know whether these two are equivalent:

  1. a = _InterlockedCompareExchange64((__int64*)p, 0, 0);

  2. MyBarrier(); a = *(volatile __int64*)p; MyBarrier();

Here MyBarrier() is a compiler-level memory barrier (hint), such as __asm__ __volatile__ ("" ::: "memory").
Method 2 is therefore expected to be faster than method 1.
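
For readers comparing the two methods, method 2 can be sketched in standard C++ roughly as follows. This is only an illustration under assumptions: read_with_compiler_barriers is a hypothetical name, and std::atomic_signal_fence stands in for MyBarrier() as a compiler-only barrier (it constrains compiler reordering and does not issue a hardware fence). Standard C++ does not promise that a plain volatile 64-bit read is atomic; that part relies on the platform, as the question assumes.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper illustrating method 2: a volatile read surrounded by
// compiler-only barriers. No hardware fence instruction is requested here.
inline std::int64_t read_with_compiler_barriers(const volatile std::int64_t* p) {
    assert(p != nullptr);                                  // precondition of this sketch
    std::atomic_signal_fence(std::memory_order_seq_cst);  // plays the role of MyBarrier()
    std::int64_t a = *p;                                   // volatile load; atomic only by platform guarantee
    std::atomic_signal_fence(std::memory_order_seq_cst);  // plays the role of MyBarrier()
    return a;
}
```

Because nothing here orders the load at the hardware level, this sketch is not a drop-in replacement for the interlocked call on weakly ordered CPUs such as ARM64.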



I have heard that the _Interlocked() functions also imply a memory barrier at both the compiler and the hardware level.

I have also heard that reading properly aligned intrinsic data is atomic on these architectures, but I am not sure whether method 2 can be used widely.

(P.S. I assume the CPU handles data dependencies automatically, so I have not given hardware barriers much consideration here.)

Thank you for any advice or correction on this.





Here are some benchmarks on Ivy Bridge (i5 laptop).



(1E+006 loops: 27ms):



; __int64 a = _InterlockedCompareExchange64((__int64*)p, 0, 0);
xor eax, eax
lock cmpxchg QWORD PTR val$[rsp], rbx


(1E+006 loops: 27ms):



; __faststorefence(); __int64 a = *(volatile __int64*)p;
lock or DWORD PTR [rsp], 0
mov rcx, QWORD PTR val$[rsp]


(1E+006 loops: 7ms):



; _mm_sfence(); __int64 a = *(volatile __int64*)p;
sfence
mov rcx, QWORD PTR val$[rsp]


(1E+006 loops: 1.26ms, not synchronized?):



; __int64 a = *(volatile __int64*)p;
mov rcx, QWORD PTR val$[rsp]









  • It is just not equivalent. sfence ensures that the store is visible but doesn't make sure that the load is fresh. So no atomic read at all. mfence is equivalent, good odds that it won't make any difference anymore. Maybe you meant lfence, hard to tell.
    – Hans Passant
    Nov 23 at 10:43

















c++ multithreading 64bit atomicity interlocked






edited Nov 23 at 10:51

asked Nov 22 at 7:56

cozmoz

143












1 Answer

For the second version to be functionally equivalent, you obviously need atomic 64-bit reads, which is true on your platform.



However, _MemoryBarrier() is not a "hint to the compiler". On x86, _MemoryBarrier() prevents both compiler and CPU reordering, and also ensures global visibility after the write. You probably only need the first _MemoryBarrier(); the second one could be replaced with a _ReadWriteBarrier() unless a is also a shared variable. You don't even need that, though, since you are reading through a volatile pointer, which will prevent any compiler reordering in MSVC.

With this replacement, you end up with pretty much the same result:



// a = _InterlockedCompareExchange64((__int64*)&val, 0, 0);
xor eax, eax
lock cmpxchg QWORD PTR __int64 val, r8 ; val

// _MemoryBarrier(); a = *(volatile __int64*)&val;
lock or DWORD PTR [rsp], r8d
mov rax, QWORD PTR __int64 val ; val


Running these two in a loop, on my i7 Ivy Bridge laptop, gives equal results, within 2-3%.



However, with two memory barriers, the "optimized version" is actually around 2x slower.



So the better question is: why are you using _InterlockedCompareExchange64 at all? If you need atomic access to a variable, use std::atomic. An optimizing compiler should compile it to the most efficient code for your architecture and add all the barriers needed to prevent reordering and ensure cache coherency.
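
As a rough illustration of the std::atomic suggestion, here is a minimal sketch of an atomic 64-bit read and write. The names shared_value, write_value, read_value, read_value_relaxed and demo are made up for this example and do not come from the question. On x86-64, mainstream compilers commonly compile such a load to a plain mov without a lock prefix, but that is a property of the implementation, not something the standard promises.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical shared variable standing in for the question's *p.
std::atomic<std::int64_t> shared_value{0};

// Atomic write; the default ordering is memory_order_seq_cst.
void write_value(std::int64_t v) {
    shared_value.store(v);
}

// Atomic read with the default (sequentially consistent) ordering.
std::int64_t read_value() {
    return shared_value.load();
}

// Atomic read that guarantees atomicity only, with no ordering
// relative to other memory operations.
std::int64_t read_value_relaxed() {
    return shared_value.load(std::memory_order_relaxed);
}

// Small single-threaded usage check.
void demo() {
    write_value(42);
    assert(read_value() == 42);
    assert(read_value_relaxed() == 42);
}
```

Choosing the ordering explicitly is the main design point: the relaxed load states "atomicity only", while the default load additionally takes part in a single global order of sequentially consistent operations.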






  • And btw, __int64? You should stick to standard typedefs from stdint.h/cstdint.
    – Groo
    Nov 22 at 11:25










  • I am sorry, I previously wrote the misleading _MemoryBarrier() instead of MyBarrier(). I am not using Microsoft's MemoryBarrier() macro. So the updated asm code for the 2nd version (the "optimized version") should not include lock or DWORD PTR [rsp], r8d, which is emitted by MemoryBarrier().
    – cozmoz
    Nov 22 at 11:26










  • Interlocked functions are easy to understand. And I personally hate using std::atomic, which is too complex for me.
    – cozmoz
    Nov 22 at 11:31










  • @cozmoz: in that case, the resulting code will not guarantee that other threads will see values being updated in the program order. Anyway, as a C++ programmer, you should really take a moment of your time and read the docs for std::atomic. It's standard, it works, and, most of all, it lets you convey your intents explicitly. Do you only need an atomic read? Use memory_order_relaxed. Do you need to publish the changes across all threads with sequential consistency? Use memory_order_seq_cst. Right now, you are placing performance optimizations above code correctness and clarity.
    – Groo
    Nov 22 at 12:41












  • Thanks. I always use interlocked functions to modify shared variables, so there is no problem in the producer threads. Since atomicity is never a problem on ARM64/x86-64, the only requirement is to read the true value in consumer threads. The question is: if the variable is modified by some interlocked function in a producer thread, is the updated value immediately visible in another thread through a simple volatile read?
    – cozmoz
    Nov 23 at 9:07
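
The visibility question in the comment above can be framed with std::atomic as a publish/consume pattern. The sketch below is an illustration, not the poster's code: payload, ready, producer and consumer are made-up names. A release store paired with an acquire load guarantees that once the consumer observes the flag, it also observes the data written before the flag was set. It makes no promise about how quickly the store becomes visible.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical names, for illustration only.
std::int64_t payload = 0;          // ordinary, non-atomic data
std::atomic<bool> ready{false};    // flag that publishes the payload

// Producer side: write the data, then publish it with a release store.
void producer(std::int64_t v) {
    payload = v;
    ready.store(true, std::memory_order_release);
}

// Consumer side: wait for the flag with an acquire load, then read the data.
// The acquire load pairs with the release store in producer().
std::int64_t consumer() {
    while (!ready.load(std::memory_order_acquire)) {
        // busy-wait; a real program would likely back off or block
    }
    return payload;
}
```

In real use producer() and consumer() would run on different threads, for example via std::thread.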











edited Nov 22 at 10:44

answered Nov 22 at 10:38

Groo

34.8k











