
How to accurately simulate the concurrent data transfer characteristics of multi-channel DDR? #274

Open
LujhCoconut opened this issue Dec 5, 2024 · 1 comment


@LujhCoconut

I tried two approaches to utilize multiple channels.

(1) The first was simply setting `controllers = 4` in the cfg file:

```
mem = {
    controllers = 4;
    type = "DDR";
    ranksPerChannel = 4;
    banksPerRank = 8;
    tech = "DDR4-3200-CL22";
};
```

Compared to `controllers = 1`, there was no significant difference. The metric I focus on is the per-core IPC reported in the output file `zsim.out`.

```
mem = {
    controllers = 1;
    type = "DDR";
    ranksPerChannel = 4;
    banksPerRank = 8;
    tech = "DDR4-3200-CL22";
};
```

(2) The second approach draws inspiration from the implementation of [banshee](https://github.com/yxymit/banshee). It creates four DDR channels in an array to form a multi-channel DRAM (`mcdram`), and distributes memory requests across channels by taking the address modulo the number of channels to pick the channel that handles each request.

```cpp
_mcdram = (MemObject**) gm_malloc(sizeof(MemObject*) * _mcdram_per_mc);
for (uint32_t i = 0; i < _mcdram_per_mc; i++) {
    g_string mcdram_name = _name + g_string("-mc-") + g_string(to_string(i).c_str());
    // ...
    } else if (_mcdram_type == "DDR") {
        // XXX HACK: tBL for mcdram is 1, so data accesses should multiply
        // by 2 and tad accesses by 3.
        _mcdram[i] = BuildDDRMemory(config, frequency, domain, mcdram_name,
                                    "sys.mem.mcdram.", 1, timing_scale);
    }
    // ...

// Channel selection and address compaction:
Address address = req.lineAddr;
uint32_t mcdram_select = (address / 64) % _mcdram_per_mc;
Address mc_address = (address / 64 / _mcdram_per_mc * 64) | (address % 64);
// ...
if (_scheme == CacheOnly) {
    req.lineAddr = mc_address;
    req.cycle = _mcdram[mcdram_select]->access(req, 0, 4);
    req.lineAddr = address;
    _numLoadHit.inc();
    futex_unlock(&_lock);
    return req.cycle;
}
// ...
```
//...

Unfortunately, I still observed almost identical performance (IPC) compared to the single-channel DDR setup with `controllers = 1`.

To dig deeper, I went through several past issues. For instance, I experimented with modifying tCK to increase bandwidth and with adjusting tBL; these changes had some effect, but the improvements were not significant. I also examined the [zsim-ndp](https://github.com/CriusT/zsim-ndp) implementation of [MemChannel](https://github.com/CriusT/zsim-ndp/blob/master/src/mem_channel.cpp), but ran into the same performance plateau. Changing the memory-interleaving scheme did not help either.

I added debug output to the **trySchedule** function of **ddr_mem.cpp**. Comparing the traces, the two multi-channel constructions above produced almost identical `r->arrivalCycle` sequences. Modifying timing parameters such as tBL only shifted the numbers; the overall pattern remained largely the same.

```cpp
uint64_t DDRMemory::trySchedule(uint64_t curCycle, uint64_t sysCycle) {
    // ...
    std::cout << curCycle << " Found ready request 0x" << std::hex << r->addr
              << std::dec << "   r->arrivalCycle = " << r->arrivalCycle << std::endl;
    // ...
}
```

I encountered a similar issue when using the gem5 simulator. This raises the question: are these discrete-event simulators inherently limited in modeling the parallelism of multi-channel memory systems, particularly their ability to exploit high bandwidth through concurrency?

Thank you for any useful suggestions!

@berkan-sahin

There is a highly accurate memory simulator called Ramulator, which you can find here: https://github.com/CMU-SAFARI/ramulator2.
The README shows how to connect it to the gem5 simulator (https://github.com/CMU-SAFARI/ramulator2?tab=readme-ov-file#using-ramulator-20-as-a-library-gem5-example). I suppose it could also be connected to ZSim with some effort; a previous version of Ramulator was connected to ZSim in this project: https://github.com/CMU-SAFARI/DAMOV.
