On x86-64, this could only happen if the compiler chose to reorder at compile time. The x86 memory model is program order plus a store buffer with store forwarding, which only allows [StoreLoad][1] reordering, plus the quirk of a thread seeing its own stores before they're globally visible. (Neither thread does that here.) https://preshing.com/20121019/this-is-why-they-call-it-a-weakly-ordered-cpu/ has some practical reordering tests on ARM, but not this one.
I think it's theoretically possible on ARM / AArch64, but only with an extremely slow response to a MESI share request for `x`. Perhaps multiple other cores spamming atomic RMWs on it, like `x.fetch_add(0)` in other threads? (Beware of clang optimizing no-op RMWs to just memory barriers without touching the cache line). Or incomplete LDXR / STXR transactions that don't actually modify it? Maybe with hand-written asm with `ldxr` / `clrex` if that's a thing in 64-bit mode.
Or more simply, a 3rd and 4th thread spamming stores (and maybe loads) on a different part of the same cache line holding `x`.
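Something like this, perhaps (untested sketch; `Contended`, `noise`, and `cacheline_hammer` are names I made up for illustration). In a real test this struct would replace the standalone definition of `x`, keeping `y` on its own cache line as discussed below:

```cpp
// Sketch: extra threads hammer a different word of the cache line holding x,
// hoping to delay thread 2's x.load() by keeping the line bouncing between cores.
#include <atomic>

struct alignas(64) Contended {
    std::atomic<int> x{0};      // the variable under test
    std::atomic<int> noise{0};  // different word, same 64-byte line as x
};
Contended line;

void cacheline_hammer(const std::atomic<bool>& keep_running) {
    while (keep_running.load(std::memory_order_relaxed))
        line.noise.store(1, std::memory_order_relaxed);  // keep taking ownership of the line
}
```

Run two or more of these hammer threads on other cores for the duration of the test.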
---
Putting the variables in separate cache lines is necessary (so use `alignas(64) std::atomic<int> x {0};` etc.). We need the `x.load` to miss in cache so badly that it doesn't get a value until after a round trip to the other thread. And probably thread 2 shouldn't even send out (to other cores) a request to read that cache line until long after its `y.store` has retired and committed to L1d cache. Otherwise the request will go out before thread 1 has even dirtied the cache line holding `x`, so the data will just come from DRAM, or from the main thread's store since it's a local on the stack.
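For concreteness, here's roughly the shape of test I'm assuming throughout this answer (the A/B/C/D labels match how I refer to the operations below; the question's actual code may differ in details):

```cpp
// Assumed load-buffering litmus test; reordering is observed on a run
// where r1 == 1 && r2 == 1.
#include <atomic>

alignas(64) std::atomic<int> x{0};   // separate cache line for x
alignas(64) std::atomic<int> y{0};   // separate cache line for y

int r1, r2;

void thread1_func() {
    r1 = y.load(std::memory_order_relaxed);   // A
    x.store(1, std::memory_order_relaxed);    // B
}

void thread2_func() {
    r2 = x.load(std::memory_order_relaxed);   // C
    y.store(1, std::memory_order_relaxed);    // D
}
```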
A long dependency chain for the load address like Nate tried won't help, unfortunately. That would work if you could delay the load that way without delaying visibility of the store that follows it, but that's not the case due to the way out-of-order exec works. (In-order exec of course automatically delays all later instructions when earlier instructions are slow.)
Stores can't become visible to other cores (commit from the store buffer to L1d cache) until after they retire from out-of-order exec, because they're still speculative until then. And speculation needs to be kept private within one core, so only that core has to roll back on detection of mis-speculation. See [Can a speculatively executed CPU branch contain opcodes that access RAM?](https://stackoverflow.com/questions/64141366/can-a-speculatively-executed-cpu-branch-contain-opcodes-that-access-ram) (Speculation isn't only about branches: any instruction that could potentially fault is executed speculatively until it retires.)
So the store `D` can't retire until after the load address for `C` is ready and the load itself sent to an execution unit, unless the compiler statically (in the asm) reorders the store before the load. It's allowed to do that, but it won't do so for no reason. And compilers don't (currently) optimize atomics, so an earlier store of `y` to get the compiler to do dead-store elimination won't work, nor will other operations on `x`.
---
**If the core running thread 2 sends out a MESI share request due to `x.load()` at about the same time it sends out an RFO (Read For Ownership) due to `y.store()`, it's going to get replies to both within a pretty narrow window.** That's far too little time for another core to have read the stored `y` value and modified the cache line holding `x` before receiving and responding to the share request.
At least under normal conditions; with heavy contention for the cache line holding `x`, the reply could be delayed a lot, with some other threads getting a turn (giving time for `y.store` to be seen by `y.load` in thread 1).
But the `x.store` in thread 1 (B) won't happen until one hop of inter-thread latency after the `y.store` in thread 2 (D). And the `x.load` in thread 2 (C) will have had all that time to get its turn to read that cache line.
Perhaps it could help to have a lot of previous cache misses (to scattered addresses) outstanding in thread 2, in hopes that the load could still be sent to an execution unit and retire before there was a free buffer to track an off-core MESI request for it. If those loads were all `ldapr` acquire loads, it would make sure they had to finish before the relaxed `ldr`.
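A sketch of that idea for thread 2, reusing `x`, `y`, and `r2` from the assumed test above (`cold` and `STRIDE` are made-up names; whether this actually widens the window is the speculative part):

```cpp
// Speculative: queue up scattered cache-miss acquire loads ahead of the relaxed
// x.load, hoping its off-core request has to wait behind them while the y.store
// still retires and commits.
#include <atomic>
#include <cstddef>

constexpr std::size_t STRIDE = 1024;          // 1024 ints = 4 KiB between touched elements
std::atomic<int> cold[64 * STRIDE];           // large array, hopefully mostly not cached

void thread2_with_misses() {
    int sink = 0;
    for (std::size_t i = 0; i < 64; ++i)
        sink += cold[i * STRIDE].load(std::memory_order_acquire);  // ldapr / ldar on AArch64
    r2 = x.load(std::memory_order_relaxed);   // C
    y.store(1, std::memory_order_relaxed);    // D
    (void)sink;                               // the atomic loads themselves won't be optimized away
}
```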
A non-x86 ISA can retire a load from the out-of-order back-end before the data arrives, as long as the address is known and it's verified that it's non-faulting. (i.e. TLB hit and valid translation.) This is how LoadStore reordering is possible at all.
---
### Other notes on testing methodology:
You need to compile with optimization enabled, otherwise GCC doesn't inline `x.store` and so on, and a run-time-variable `memory_order` gets treated as `seq_cst` instead of branching to maybe skip a fence. https://godbolt.org/z/6rz43a716
Thread startup may take longer than the reordering window; as Nate described in comments, you want to run multiple tests after starting a thread.
Perhaps have thread 1 spin until it sees a non-`0` value, then store that. That removes the need for a timing coincidence, i.e. for thread 1 to happen to run at just the right moment. Preshing's [x86 StoreLoad test](https://preshing.com/20120515/memory-reordering-caught-in-the-act) has a sync and a random delay before each single test, to maximize the chance that both threads are really running at the same time, rather than one finishing before the OS can even get the 2nd one started on a different core.
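i.e. something like this for thread 1, with the variables from the sketch near the top (reordering is then observed on any run where `r2 == 1`):

```cpp
// Spin-until-seen version of thread 1: no timing coincidence needed for A to
// happen to run just after D becomes visible.
void thread1_spin() {
    int seen;
    do {
        seen = y.load(std::memory_order_relaxed);   // A: wait until thread 2's y.store (D) is visible
    } while (seen == 0);
    x.store(seen, std::memory_order_relaxed);       // B: store the value we just saw
    r1 = seen;
}
```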
[1]: https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/
Your bullet points of assumptions all look correct to me, except that you could build a uarch where loads can retire from the OoO core after merely checking the TLB (permissions and translation) to make sure the load can definitely happen. There could be OoO exec CPUs that do that (update: apparently there are).
I think x86 CPUs require loads to actually have the data arrive before they can retire, but their strong memory model doesn't allow LoadStore reordering anyway. So ARM certainly could be different.
You're right that stores can't be made visible to any other cores before retirement. That way lies madness. Even on an [SMT core][1] (multiple logical threads on one physical core), it would link speculation on the two logical threads together, requiring both to roll back if either one detected mis-speculation. That would defeat the purpose of SMT: letting one logical thread take advantage of stalls in the others.
(Related: Making retired but not yet committed (to L1d) stores visible to other logical threads on the same core is how some real PowerPC implementations make it possible for threads to disagree on the global order of stores. https://stackoverflow.com/questions/27807118/will-two-atomic-writes-to-different-locations-in-different-threads-always-be-see/50679223#50679223)
---
**CPUs with in-order execution can start a load (check the TLB and write a load-buffer entry) and only stall if an instruction tries to use the result before it's ready. Then later instructions, including stores, can run normally**. This is basically required for non-terrible performance in an in-order pipeline; stalling on every cache miss (or even just L1d latency) would be unacceptable. Memory parallelism is a thing even on in-order CPUs; they can have multiple load buffers that track multiple outstanding cache misses. High(ish) performance in-order ARM cores like [Cortex-A53][3] are still widely used in [modern smartphones][4], and scheduling loads well ahead of when the result register is used is a well-known important optimization for looping over an array. (Unrolling or even software pipelining.)
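(As an aside, that load-scheduling point looks something like this at the source level, in a hypothetical array-sum loop not taken from the question: the next element's load is started a full iteration before its value is consumed, so an in-order core has other work to overlap with the load latency.)

```cpp
// Illustration: software-pipelining a load so an in-order core doesn't stall
// on load-use latency or a cache miss every iteration.
int sum_with_early_loads(const int* a, int n) {
    if (n <= 0) return 0;
    int sum = 0;
    int cur = a[0];                 // this load's result isn't needed until inside the loop
    for (int i = 0; i + 1 < n; ++i) {
        int next = a[i + 1];        // start the next load early...
        sum += cur;                 // ...while consuming the previously-loaded value
        cur = next;
    }
    return sum + cur;
}
```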
So if the load misses in cache but the store hits (and commits to L1d before earlier cache-miss loads get their data), you can get LoadStore reordering. ([Jeff Preshing's intro to memory reordering][2] uses that example for LoadStore, but doesn't get into uarch details at all.)
**A load can't fault after you've checked the TLB and / or whatever memory-region stuff for it**. That part has to be complete before it retires, or before it reaches the end of an in-order pipeline. Just like a retired store sitting in the store buffer waiting to commit, a retired load sitting in a load buffer is definitely happening at some point.
So the sequence on an in-order pipeline is:
* `lw r0, [r1]` TLB hit, but misses in L1d cache. Load execution unit writes the address (`r1`) into a load buffer. Any later instruction that tries to read `r0` will stall, but we know for sure that the load didn't fault.
With `r0` tied to waiting for that load buffer to be ready, the `lw` instruction itself can leave the pipeline (retire), and so can later instructions.
* any number of other instructions that don't read `r0`. (An instruction that did read `r0` would stall the in-order pipeline until the load data arrives.)
* `sw r2, [r3]` store execution unit writes address + data to the store buffer / queue. Then this instruction can retire.
**Probing the load buffers finds that this store doesn't overlap with the pending load, so it can commit to L1d.** (If it *had* overlapped, you couldn't commit it until a MESI RFO completed anyway, and fast restart would forward the incoming data to the load buffer. So it might not be too complicated to handle that case without even probing on every store, but let's only look at the separate-cache-line case where we can get LoadStore reordering)
Committing to L1d = becoming globally visible. This can happen while the earlier load is still waiting for the cache line to arrive.
----
For OoO CPUs, you'd need some way to tie load completion back into the OoO core for instructions waiting on the load result. I guess that's possible, but it means that the architectural/retirement value of a register might not be stored anywhere in the core. Pipeline flushes and other rollbacks from mis-speculation would have to hang on to that association between an incoming load and a physical and architectural register. (Not flushing store buffers on pipeline rollbacks is already a thing that CPUs have to do, though. Retired but not yet committed stores sitting in the store buffer have no way to be rolled back.)
That could be a good design idea for uarches whose OoO window is too small to come close to hiding a cache miss. (Which, to be fair, is every high-performance OoO exec CPU: memory latency is usually too high to fully hide.)
---
We have experimental evidence of LoadStore reordering on an OoO ARM: section 7.1 of https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf shows non-zero counts for "load buffering" on [Tegra 2][5], which is based on the out-of-order [Cortex-A9 uarch][6]. I didn't look up all the others, but I did rewrite the answer to suggest that this is the likely mechanism for out-of-order CPUs, too. I don't know for sure if that's the case, though.
[1]: https://en.wikipedia.org/wiki/Simultaneous_multithreading
[2]: http://preshing.com/20120710/memory-barriers-are-like-source-control-operations/
[3]: https://en.wikipedia.org/wiki/ARM_Cortex-A53
[4]: https://en.wikipedia.org/wiki/List_of_Qualcomm_Snapdragon_systems-on-chip#Snapdragon_429,_439_and_450_(2017/18)
[5]: https://en.wikipedia.org/wiki/Tegra#Tegra_2
[6]: https://en.wikipedia.org/wiki/ARM_Cortex-A9