
How to accurately simulate the concurrent data transfer characteristics of multi-channel DDR? #274

Open
LujhCoconut opened this issue Dec 5, 2024 · 1 comment


@LujhCoconut

I tried two approaches to utilize multiple channels.

(1) The first was simply setting `controllers = 4` in the cfg file:

```
mem = {
    controllers = 4;
    type = "DDR";
    ranksPerChannel = 4;
    banksPerRank = 8;
    tech = "DDR4-3200-CL22";
};
```

Compared to `controllers = 1`, there was no significant difference. The metric I focus on is the per-core IPC reported in the output file `zsim.out`.

```
mem = {
    controllers = 1;
    type = "DDR";
    ranksPerChannel = 4;
    banksPerRank = 8;
    tech = "DDR4-3200-CL22";
};
```

(2) The second approach draws inspiration from the implementation of [banshee](https://github.com/yxymit/banshee). It creates four DDR channels in an array to form a multi-channel DRAM (`mcdram`), and distributes memory requests across channels by taking the address modulo the number of channels to pick the channel that handles each request.

```cpp
_mcdram = (MemObject**) gm_malloc(sizeof(MemObject*) * _mcdram_per_mc);
for (uint32_t i = 0; i < _mcdram_per_mc; i++) {
    g_string mcdram_name = _name + g_string("-mc-") + g_string(to_string(i).c_str());
    // ...
    } else if (_mcdram_type == "DDR") {
        // XXX HACK: tBL for mcdram is 1, so data accesses should multiply
        // by 2 and tad accesses by 3.
        _mcdram[i] = BuildDDRMemory(config, frequency, domain, mcdram_name,
                                    "sys.mem.mcdram.", 1, timing_scale);
    }
    // ...

// Channel selection and address compaction:
Address address = req.lineAddr;
uint32_t mcdram_select = (address / 64) % _mcdram_per_mc;
Address mc_address = (address / 64 / _mcdram_per_mc * 64) | (address % 64);
// ...
if (_scheme == CacheOnly) {
    req.lineAddr = mc_address;
    req.cycle = _mcdram[mcdram_select]->access(req, 0, 4);
    req.lineAddr = address;
    _numLoadHit.inc();
    futex_unlock(&_lock);
    return req.cycle;
}
// ...
```
//...

Unfortunately, I still observed almost identical performance (IPC) compared to the single-channel DDR setup with `controllers = 1`.

To dig deeper, I went through several past issues. For instance, I experimented with modifying tCK to increase bandwidth and with adjusting tBL; these changes had some effect, but the improvements were not significant. I also examined the [zsim-ndp](https://github.com/CriusT/zsim-ndp) implementation of [MemChannel](https://github.com/CriusT/zsim-ndp/blob/master/src/mem_channel.cpp), but ran into the same performance plateau. Changing the memory-interleaving scheme did not help either.

I added debug output to the **trySchedule** function of **ddr_mem.cpp**. Comparing the traces, the two multi-channel constructions above produced almost identical `r->arrivalCycle` sequences. Modifying timing parameters such as tBL only shifted the numbers; the overall pattern remained largely the same.

```cpp
uint64_t DDRMemory::trySchedule(uint64_t curCycle, uint64_t sysCycle) {
    // ...
    std::cout << curCycle << " Found ready request 0x" << std::hex << r->addr
              << std::dec << "   r->arrivalCycle = " << r->arrivalCycle << std::endl;
    // ...
}
```

I encountered a similar issue when using the gem5 simulator. This raises the question: are these discrete-event simulators inherently limited in modeling the parallelism of multi-channel memory systems, particularly their ability to exploit high bandwidth through concurrency?

Thank you for any useful suggestions!

@berkan-sahin

There is a highly accurate memory simulator called Ramulator, which you can find here: https://github.com/CMU-SAFARI/ramulator2.
The README shows how to connect it to the gem5 simulator (https://github.com/CMU-SAFARI/ramulator2?tab=readme-ov-file#using-ramulator-20-as-a-library-gem5-example). I suppose it could also be connected to ZSim with some effort; a previous version of Ramulator was connected to ZSim in this project: https://github.com/CMU-SAFARI/DAMOV.
