Adding Selective Replay and Bank Conflict Prediction to the Out-of-Order CPU model in Gem5

Pradeep Kumar

Nikhita Kunati



### Life of an instruction in the O3 CPU

Current O3 model doesn't allow speculative scheduling of load dependent instructions and on a memory order violation it squashes all the instructions in the pipeline and re-fetches from the violating PC IDEA – Allocate a token to every load and every dependent instruction inherits the token id in a dependence vector



SELECTIV E TOKEN BASED REPLAY

| Architecte<br>d Register | Physical<br>Register | Dependence<br>vector |
|--------------------------|----------------------|----------------------|
| R2                       | p2                   | 0000000              |
| R1                       | р9                   | 0000001              |
| R4                       | р7                   | 0000001              |
| R3                       | p4                   | 00000011             |
| R5                       | P11                  | 00000011             |

| LOAD R1, [R2] | → Token ID 1             |
|---------------|--------------------------|
| ADD R4,R1,R2  |                          |
| LOAD R3,[R4]  | $\rightarrow$ Token ID 2 |
| SUB R5,R3,R4  |                          |

Vhile renaming dest regs once mapping is done

f (inst->isLoad) { depVector[renamedDestReg] = depVector[renamedDestReg] |(1 << (tokenID -1) terate(renamedSrcRegs) { depVector[renamedDestReg] = dep[renamedDestReg] | lepVector[renamedSrcReg] Changes made to O3 source code

- rename\_impl.hh, rename.hh -> tokenID allocation/deallocation, dependence vector calculation, clearing dependence vector
- In the inst\_queue\_impl.hh instead of popping entries from the instList during commit we added a logic to only pop entries with dependence vector as all zeroes. We maintain a replayQueue which has instructions with non-zero Dependence vectors and remove insts when the commit.
- In lsq\_unit\_impl.hh when we detect a memory order violation we replay->notify(tokenID). And in the next cycle in issue stage we reissue instructions that match the tokenID from replayQueue.
- Changes made to Fetch and decode and ROB to not squash instructions.



| OoO Execution             | 8-wide fetch/issue/commit, 256 ROB,<br>128 LSQ, 128 issue queue entries                                           |
|---------------------------|-------------------------------------------------------------------------------------------------------------------|
| Functional Units(Latency) | 8 INT ALUs(1), 4 FP ALUs(2), 4 INT<br>MULT/DIV(3/20), 4 FP MULT/DIV(4/24),<br>4 general memory ports              |
| Branch Prediction         | Tournament Predictor                                                                                              |
| Memory System(Latency)    | 32KB 2-way 64B line L1I(2), 32KB 4-way<br>64B line L1D(2), 512KB 4-way 64B line<br>L2(8), 8193MB main memory(1ns) |

#### Number of squashed instructions issued



Number of squashed instructions issued

#### memOrderViolationEvents



memOrderViolationEvents



% of instructions not dependent on the violating loads



### Number of squashed instructions that are issued

Squashing Selective Reply

### BANK CONFLICT PREDICTION

MOTIVATION – To allow more than one concurrent memory reference Multi-Banked caches were proposed as a less expensive option compared to multiported cache however possible bank conflicts may pose as a limitation



IDEA – Use Bank conflict prediction to intelligently schedule load instructions that have less chance of conflicting.

### Previous Work on Bank Prediction

Previously proposed technique involves predicting a bank for each load which will improve scheduling of loads as well as simplify the memory execution pipeline





| Ready Queue |  |  |  |  |
|-------------|--|--|--|--|
| Load        |  |  |  |  |
| Sub         |  |  |  |  |
| Add         |  |  |  |  |
| Store       |  |  |  |  |
| Load        |  |  |  |  |
| Add         |  |  |  |  |
| Add         |  |  |  |  |
| Mov         |  |  |  |  |
|             |  |  |  |  |

| OoO Execution          |                          |                                                                                                                | 8-wide fetch/issue/commit, 256 ROB, 128 LSQ, 128 issue queue entries |                        |     |                         |                   |           |          |
|------------------------|--------------------------|----------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------|------------------------|-----|-------------------------|-------------------|-----------|----------|
| Memory System(Latency) |                          | 32KB 2-way 64B line L1I(2), 32KB 4-way 64B line L1D(2),<br>512KB 4-way 64B line L2(8), 8193MB main memory(1ns) |                                                                      |                        |     |                         |                   |           |          |
| Benchmark              | Total number<br>of loads | Total number<br>of bank<br>conflicting<br>Loads                                                                | 18 -<br>16 -                                                         | % of conflicting loads |     |                         |                   |           |          |
| Leslie3d               | 32534                    | 4348                                                                                                           | 14 -<br>12 -                                                         |                        | _   |                         |                   |           |          |
| Gcc                    | 25648                    | 3816                                                                                                           | 12 -                                                                 |                        |     |                         |                   |           |          |
| Milc                   | 23012                    | 2470                                                                                                           | 8 -                                                                  |                        |     |                         |                   |           |          |
| hmmer                  | 29049                    | 3389                                                                                                           | 6 -                                                                  |                        |     |                         |                   |           |          |
| calculix               | 35507                    | 5949                                                                                                           | 2 -                                                                  |                        |     |                         |                   |           |          |
| Xalancbmk              | 27936                    | 3814                                                                                                           | 0 —                                                                  | leslie3d               | gcc | milc<br>% of conflictin | hmmer<br>ng loads | xalancbmk | calculix |

# Changes in Gem5 Source

Made a record of the load instructions that got scheduled at the same cycle.

Prediction for each load was made based on their PC Address.

If there are more than one conflicting predictions then Scheduling logic pushes the load instruction back to the ready queue for the next cycle.

### In the IEW Unit,

When the address for all the same cycle dispatched load instructions was calculated, we can easily find out the number of actually conflicted loads and thus update the conflict predictor table along with the hashed PC address.(This is how we train the predictor).

Changes in IQ::**scheduleReadyInsts** ()

Changes in IEW::**ExecuteLoad()** 

## Difficulty faced while implementing in Gem5

No L1D cache banks option available in Gem5. L2 cache banking is implemented in Ruby cache simulator.(Part of Gem5) But very hard to replicate the same thing for L1D. We tried but failed to do so.

A common Data structure was required for Scheduling logic to read while predicting and executeLoad() function to update it.

Both were happening in different cycle for the same load. We spent a lot time in figuring out and then moved on to generate Trace.

We generated traces (PC address + effective mem address) of all the load instructions that got issued in the same cycle by making some changes in the execute stage of gem5.

Used These traces for an offline predictor that we wrote in Perl.

FUTURE WORK

- Improvise token-ID allocation to loads with confidence estimator
- Add speculative scheduling of dependent instructions to see more benfits of selective replay.
- Handle inflight instructions in the O3 cpu model cleanely on the event of a memory order violation

# References

[1] I. Kim and M. Lipasti. "Understanding Scheduling Replay Schemes." In Proc. 10th International Symposium on High Performance Computer Architecture, Feb. 2004.

 [2] A. Yoaz, E. Mattan, R. Ronen, and S. Jourdan.: Speculation Techniques for Improving Load Related Instruction Scheduling. Proc. of 26th ISCA (1999) 42– 53

[3] <u>The gem5 Simulator</u>. Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. May 2011, ACM SIGARCH Computer Architecture News.