akira@ubuntu:~/false_sharing$ perf c2c record -F 10000 ./false_sharing_struct
cache false sharing: 7394 ms
[ perf record: Woken up 55 times to write data ]
[ perf record: Captured and wrote 13.716 MB perf.data (177504 samples) ]
akira@ubuntu:~/false_sharing$ perf c2c report --stats
=================================================
            Trace Event Information
=================================================
  Total records                     :     177504
  Locked Load/Store Operations      :          1
  Load Operations                   :      76801
  Loads - uncacheable               :          0
  Loads - IO                        :          0
  Loads - Miss                      :          0
  Loads - no mapping                :          0
  Load Fill Buffer Hit              :      12598
  Load L1D hit                      :      64188
  Load L2D hit                      :          3
  Load LLC hit                      :         11
  Load Local HITM                   :          5
  Load Remote HITM                  :          0
  Load Remote HIT                   :          0
  Load Local DRAM                   :          1
  Load Remote DRAM                  :          0
  Load MESI State Exclusive         :          1
  Load MESI State Shared            :          0
  Load LLC Misses                   :          1
  LLC Misses to Local DRAM          :     100.0%
  LLC Misses to Remote DRAM         :       0.0%
  LLC Misses to Remote cache (HIT)  :       0.0%
  LLC Misses to Remote cache (HITM) :       0.0%
  Store Operations                  :     100703
  Store - uncacheable               :          0
  Store - no mapping                :          2
  Store L1D Hit                     :      94191
  Store L1D Miss                    :       6510
  No Page Map Rejects               :        182
  Unable to parse data source       :          0
=================================================
    Global Shared Cache Line Event Information
=================================================
  Total Shared Cache Lines          :          1
  Load HITs on shared lines         :      42461
  Fill Buffer Hits on shared lines  :      12591
  L1D hits on shared lines          :      29862
  L2D hits on shared lines          :          3
  LLC hits on shared lines          :          5
  Locked Access on shared lines     :          0
  Store HITs on shared lines        :      38640
  Store L1D hits on shared lines    :      32262
  Total Merged records              :      38645
=================================================
                 c2c details
=================================================
  Events                            : cpu/mem-loads,ldlat=30/P
                                    : cpu/mem-stores/P
  Cachelines sort on                : Total HITMs
  Cacheline data grouping           : offset,pid,iaddr
A nonzero HITM count is a strong indicator of false sharing.
Load HITs on shared lines and Store HITs on shared lines count the loads and stores that hit cache lines shared between cores; in this situation the MESI
coherence traffic on the bus increases sharply, which has a significant performance impact.
akira@ubuntu:~/false_sharing$ perf c2c record -F 10000 ./false_sharing_struct
cache without false sharing: 2284 ms
[ perf record: Woken up 27 times to write data ]
[ perf record: Captured and wrote 6.766 MB perf.data (88391 samples) ]
akira@ubuntu:~/false_sharing$ perf c2c report --stats
=================================================
            Trace Event Information
=================================================
  Total records                     :      88391
  Locked Load/Store Operations      :          1
  Load Operations                   :      43362
  Loads - uncacheable               :          0
  Loads - IO                        :          0
  Loads - Miss                      :          0
  Loads - no mapping                :          0
  Load Fill Buffer Hit              :          5
  Load L1D hit                      :      43356
  Load L2D hit                      :          0
  Load LLC hit                      :          0
  Load Local HITM                   :          0
  Load Remote HITM                  :          0
  Load Remote HIT                   :          0
  Load Local DRAM                   :          1
  Load Remote DRAM                  :          0
  Load MESI State Exclusive         :          1
  Load MESI State Shared            :          0
  Load LLC Misses                   :          1
  LLC Misses to Local DRAM          :     100.0%
  LLC Misses to Remote DRAM         :       0.0%
  LLC Misses to Remote cache (HIT)  :       0.0%
  LLC Misses to Remote cache (HITM) :       0.0%
  Store Operations                  :      45029
  Store - uncacheable               :          0
  Store - no mapping                :          0
  Store L1D Hit                     :      45006
  Store L1D Miss                    :         23
  No Page Map Rejects               :         39
  Unable to parse data source       :          0
=================================================
    Global Shared Cache Line Event Information
=================================================
  Total Shared Cache Lines          :          0
  Load HITs on shared lines         :          0
  Fill Buffer Hits on shared lines  :          0
  L1D hits on shared lines          :          0
  L2D hits on shared lines          :          0
  LLC hits on shared lines          :          0
  Locked Access on shared lines     :          0
  Store HITs on shared lines        :          0
  Store L1D hits on shared lines    :          0
  Total Merged records              :          0
=================================================
                 c2c details
=================================================
  Events                            : cpu/mem-loads,ldlat=30/P
                                    : cpu/mem-stores/P
  Cachelines sort on                : Total HITMs
  Cacheline data grouping           : offset,pid,iaddr
Here we can see that without cache-line contention, HITM, Load HITs on shared lines, and Store HITs on shared lines are all 0.
2. Step 1: CPU-0 reads data.x → load miss

CPU-0 cache                          L3 / DRAM
----------------------------------------------
I     Read ------------------------->
      <---------- return clean copy
E (clean, exclusive)

The state becomes E (Exclusive): since no other cache holds a copy, the line is handed over directly.
3. Step 2: CPU-1 reads data.y → load miss
CPU-0 cache      CPU-1 cache                     L3
---------------------------------------------------
E                I     Read -------------------->
                       <----- forward clean copy
S                S

Both copies become S (Shared): the contents are clean, so both cores may read but neither may write.
4. Step 3: CPU-0 writes data.x → store miss / write-invalidate
CPU-0 cache                      CPU-1 cache                L3
--------------------------------------------------------------
S                                S
wants to write:
send Invalidate --------------> receives it, state becomes I
<----------------------------- ACK
E → M (Modified: dirty, exclusive)

The core MESI rule: a writer must first gain exclusive ownership of the line.
The invalidate message is broadcast, the remote copy is dropped immediately, and the local line transitions to M.
5. Step 4: CPU-1 writes data.y → another store miss / read-invalidate
CPU-0 cache                        CPU-1 cache               L3
---------------------------------------------------------------
M (dirty)                          I
receives Inv <-------------------- sends Read-Invalidate
must write the dirty data
back to L3 / memory
sends dirty copy ----------------> forward latest copy
state becomes I                    E → M
Because the line is Modified in the remote cache, three steps are required:
write the dirty data back to L3/memory;
invalidate the local copy;
forward the latest data to the requester.
This round trip takes roughly 60–200 ns, which is the HITM latency reported by perf.
6. Amplification: a loop of 1 billion iterations means up to 1 billion invalidate/reload cycles.
If two threads take turns writing the same cache line, every write triggers an invalidate, a dirty write-back, and a wait on the remote core, so the latency accumulates dramatically.