gem5 Report

Author: Stefanos Fragoulis
Date: January 2025

Introduction & Preface

gem5 is an open-source, modular simulation platform widely used in computer architecture research and development. It allows researchers and engineers to model and analyze the behavior of computer systems, including CPUs, memory hierarchies, caches, and interconnects. By providing a flexible and extensible framework, gem5 supports multiple processor architectures such as x86, ARM, and RISC-V, as well as various system configurations.

The platform is instrumental for performance analysis, enabling the evaluation and optimization of both hardware components and software applications. It serves as a valuable tool for research and development, allowing experimentation with new architectural features, memory systems, and caching strategies. Additionally, gem5 is utilized for educational purposes, helping students and professionals understand the intricacies of computer architecture and system design. By combining different simulation techniques, including cycle-accurate and system-level modeling, gem5 offers detailed insights into the interactions between hardware and software. Its versatility and comprehensive capabilities make gem5 an essential tool for advancing computer architecture innovations and optimizing system performance.

gem5 was originally conceived for computer architecture research in academia, but it has grown to be used in computer system design by academia, industry for research, and in teaching.

The purpose of this project was to familiarize the students with the gem5 tool, as it is one of the most widely-used tools in the field of Computer Architecture.

Environment Setup

For this project, a Virtual Machine with gem5 installed was utilized to avoid unnecessary installation and dependency issues. The VM was provided by the assignment PDF and was executed on VirtualBox with 12 processors and 16GB of RAM.

Exercises

The first exercise aims at setting the foundations for understanding the functionality and operation of the gem5 simulation tool. This includes an initial read of the gem5 documentation, a fundamental exploration of the various CPU models, and an evaluation of the significance of the stats.txt file in performance analysis. Moreover, the exercise encourages students to experiment with different memory configurations, CPU architectures, and diverse CPU frequencies to gain practical insights into their impact on system behavior and performance.

First Exercise

First Question

"Open the starter_se.py that was utilized in the prior "Hello World" example and try to understand the basic parameters that gem5 uses regarding the emulated system. Write down basic system characteristics such as CPU type, operation frequency, basic units, caches, memory, etc."

CPU Type:
The script uses a dictionary cpu_types alongside a --cpu argument. The cpu_types dictionary is the following:

cpu_types = {
    "atomic": (AtomicSimpleCPU, None, None, None, None),
    "minor": (MinorCPU,
              devices.L1I, devices.L1D,
              devices.WalkCache,
              devices.L2),
    "hpi": (HPI.HPI,
            HPI.HPI_ICache, HPI.HPI_DCache,
            HPI.HPI_WalkCache,
            HPI.HPI_L2)
}

It is clear that it involves three different CPU models, some of which will be further discussed later. The desired CPU type can be set by including a --cpu argument in the bash command. For example:

/build/ARM/gem5.opt -d hello_result configs/example/arm/starter_se.py --cpu="minor" "tests/test-progs/hello/bin/arm/linux/hello"

In this provided example, the desired CPU type is Minor.

Caches:
Caches are part of the cpu_types configuration. In the "Hello World" example, caches are added as follows:

if self.cpu_cluster.memoryMode() == "timing":
    self.cpu_cluster.addL1()
    self.cpu_cluster.addL2(self.cpu_cluster.clk_domain)

Basic Units:
Defined inside the SimpleSeSystem class as follows:

self.voltage_domain = VoltageDomain(voltage="3.3V")
self.clk_domain = SrcClockDomain(clock="1GHz",
                                 voltage_domain=self.voltage_domain)

CPU Frequency:
Defined by the --cpu-freq argument, as this line dictates:
```
parser.add_argument("--cpu-freq", type=str, default="1GHz")
```
In our example, the default 1GHz frequency was kept, as no --cpu-freq argument was used.

Memory Type:
Specified by the --mem-type argument, as the following line dictates:

parser.add_argument("--mem-type", default="DDR3_1600_8x8",
                    choices=ObjectList.mem_list.get_names(),
                    help="type of memory to use")

And the memory configuration is applied in this line:

MemConfig.config_mem(args, system)

In the "Hello World" example, the default value was kept.

Second Question

"Besides the output file stats.txt generated at the end of the simulation, gem5 also produces config.ini and config.json files that provide information about the simulated system. a. Use these files to verify your answer to the first question by including the relevant fields. b. What are sim_seconds, sim_insts, and host_inst_rate? c. What is the total number of committed instructions, and why does this number differ from the statistics presented by gem5? d. How many times was the L2 cache accessed, and how could you calculate these accesses if they were not provided by the simulator?"

Using the config.ini, the above information can be verified as follows:

CPU Type:
```
[system.cpu_cluster.cpus]
type=MinorCPU
```

Cache Configuration:

[system.cpu_cluster.cpus.icache]
type=Cache
children=replacement_policy tags
addr_ranges=0:18446744073709551615
assoc=3

Memory Controller Configuration:

The [system] section includes a memory attribute which points to the following sections:
```
[system.mem_ctrls0]
type=DRAMCtrl

[system.mem_ctrls1]
type=DRAMCtrl
```

Operating Frequency:

[system.clk_domain]
type=SrcClockDomain
clock=1000

Using the config.json, the above information can be verified as follows:

CPU Type:

"system": {
    "cpu_cluster": {
        "cpus": [
            {
                "cxx_class": "MinorCPU"
            }
        ]
    }
}

Cache Configuration:

"system": {
    "cpu_cluster": {
        "cpus": [
            {
                "icache": {
                    "type": "Cache"
                }
            }
        ]
    }
}

Operating Frequency:

"clk_domain": {
    "name": "clk_domain",
    "clock": [
        1000
    ],
}

Memory Configuration:

"membus": {
    "point_of_coherency": true,
    "system": "system",
    "response_latency": 2,
    "cxx_class": "CoherentXBar",
    "max_routing_table_size": 512,
    "forward_latency": 4,
    "clk_domain": "system.clk_domain",
    "max_outstanding_snoops": 512,
    "point_of_unification": true,
    "width": 16,
}

Additional Metrics:

sim_seconds: Represents the real time that was simulated by gem5.
sim_inst: Represents the number of instructions that were executed in total from the simulated system during the simulation.
host_inst_rate: Measures the simulation speed in terms of instructions per second executed by the simulated system on the host computer.

The total number of committed instructions can be found inside the stats.txt file under:

system.cpu_cluster.cpus.committedInsts

and is equal to 5027.

This number differs from fields like Committed ops, which is equal to 5831:

3789 int ALU
1085 memread
4 int mult
3 cmdfloatmisc
950 memwrite

The L2 cache total data accesses can be found inside the stats.txt file from:

system.cpu_cluster.l2.overall_accesses::total          474

where the total L2 accesses are 474.

In the case where said statistic was not provided, the total L2 accesses could be calculated through other statistics such as:

system.cpu_cluster.cpus.dcache.overall_misses::total          177
system.cpu_cluster.cpus.icache.overall_misses::total          327

By adding those numbers, the number 504 comes up. However, there is another interesting statistic:

system.cpu_cluster.cpus.dcache.overall_mshr_hits::.cpu_cluster.cpus.data           30

which clarifies the reason why 504 came up. The MSHR (Miss Status Holding Register) is used in cache hierarchies to track outstanding memory requests (cache misses) that are in progress. When a new cache request arrives and matches an ongoing MSHR entry, it is considered an MSHR Hit. Thus, MSHR hits do not generate new L2 accesses because the L1 cache reuses the same outstanding request already being processed.

If 30 is subtracted from 504, the number that comes up (474) matches the total L2 accesses.

Some Information on Different CPU Models

The gem5 simulator offers various in-order CPU models, each with specific characteristics and use cases:

SimpleCPU:
A basic, functional in-order CPU model, suitable for scenarios where detailed simulation is not required. It is usually used for the initial phase of program development. It is divided into three subcategories:
- AtomicSimpleCPU: Focuses on simplicity and speed.
- TimingSimpleCPU: Offers more realistic simulation of memory delays.
MinorCPU:
A more detailed in-order CPU model with a fixed pipeline. It allows custom data structures and custom execution behavior. It is perfectly suited for simulations that require higher accuracy and more processing power without losing the in-order execution element.

Writing and Executing a C Program in gem5

"Write a program in C that implements the Fibonacci sequence and execute it in gem5 using different CPU models while keeping all other parameters the same. Use the TimingSimpleCPU and MinorCPU models."

It is known that the statistics that are related to execution time are the following.

Execution Time Metrics Comparison with Default Settings

Metric	Value (MinorCPU)	Value (TimingSimpleCPU)	Difference
final_tick	44,194,000	58,425,000	-14,231,000
sim_seconds	0.000044	0.000058	-0.000014
sim_ticks	44,194,000	58,425,000	-14,231,000

Table 1: Comparison of execution time metrics between MinorCPU and TimingSimpleCPU.

Due to the MinorCPU's ability to simulate pipeline processes explicitly, instructions run faster even though the CPU model is more detailed and accurate. TimingSimpleCPU, while simpler, does not include pipeline-level optimizations, leading to longer execution times.

Execution Time Metrics Comparison for TimingSimpleCPU and MinorCPU with DDR4 Memory Technology

Metric	Value (MinorCPU)	Value (TimingSimpleCPU)	Difference
final_tick	43,112,000	57,679,000	-14,567,000
host_seconds	0.04	0.01	0.03
host_tick_rate	1,064,720,573	4,108,076,185	-3,043,355,612
sim_seconds	0.000043	0.000058	-0.000015
sim_ticks	43,112,000	57,679,000	-14,567,000

Table 2: Comparison of execution time metrics between MinorCPU and TimingSimpleCPU with DDR4 memory.

Once again, it is evident that MinorCPU outperforms TimingSimpleCPU due to its detailed pipeline, which improves instruction throughput and handles memory operations and delays more efficiently. Comparing the MinorCPU metrics from Table 1 and Table 2, we observe a slight improvement in performance in Table 2. This improvement can be attributed to the lower latency and higher bandwidth of DDR4 memory compared to DDR3.

Execution Time Metrics Comparison for TimingSimpleCPU and MinorCPU with 2GHz Clock Frequency

Metric	Value (MinorCPU)	Value (TimingSimpleCPU)	Difference
final_tick	41,045,500	55,638,000	-14,592,500
host_seconds	0.04	0.01	0.03
host_tick_rate	1,042,979,640	4,817,522,837	-3,774,543,197
sim_seconds	0.000041	0.000056	-0.000015
sim_ticks	41,045,500	55,638,000	-14,592,500

Table 3: Comparison of execution time metrics between MinorCPU and TimingSimpleCPU at 2GHz.

It is evident that the performance has increased compared to prior simulations. This is due to the higher CPU clock performance. A higher CPU frequency generally results in fewer ticks required for the same operations, leading to faster compilation time. The difference between MinorCPU and TimingSimpleCPU remains due to the inherent pipeline and execution model differences.

Second Exercise

First Questions

"Use your knowledge from the first part and locate in the relevant files the key parameters for the processor simulated by gem5 concerning the memory subsystem. Specifically, find the sizes of the caches (L1 instruction and L1 data caches as well as the L2 cache), the associativity of each of them, and the size of the cache line."

L1 Instruction Cache (I-Cache):

Size: 32 KB
Associativity: 2-way set associative
Cache Line Size: 64 bytes

L1 Data Cache (D-Cache):

Size: 64 KB
Associativity: 2-way set associative
Cache Line Size: 64 bytes

L2 Cache:

Size: 2 MB
Associativity: 8-way set associative
Cache Line Size: 64 bytes

These settings match across all the key files.

Second Question

"Record the results from the different benchmarks. Specifically, keep the following information from each benchmark:
(i) execution time (note: the time required for the program to run on the simulated processor, not the time required by gem5 to perform the simulation),
(ii) CPI (cycles per instruction), and
(iii) overall miss rates for the L1 Data cache, L1 Instruction cache, and L2 cache.
You can obtain this information from the stats.txt files. (Hint: for the first, look for the sim_seconds value, and for the third, search for entries like this: icache.overall_miss_rate::total).
Create graphs to visualize this information for all the benchmarks. What do you observe?"

Benchmark	Execution Time (s)	CPI	L1 I-Cache Miss Rate	L1 D-Cache Miss Rate	L2 Cache Miss Rate
hmmer	0.059396	1.187917	0.000221	0.001637	0.07776
bzip	0.083982	1.67965	0.000077	0.014798	0.282163
libm	0.174671	3.493415	0.000094	0.060972	0.999944
mcf	0.064955	1.299095	0.023612	0.002108	0.055046
jeng	0.513528	10.270554	0.00002	0.121831	0.999972

Third Question

"Run the benchmarks in gem5 again in the same way as before, but this time add the parameters --cpu-clock=1GHz and --cpu-clock=3GHz. Examine the stats.txt files from the three executions of the program (the initial one and the ones with 1GHz and 3GHz) and identify the information about the clock. You will find two entries: one for system.clk_domain.clock and one for cpu_cluster.clk_domain.clock. Can you explain what is clocked at 1GHz/3GHz and what is clocked at the default GHz? Why do you think this happens?
Refer to the config.json file corresponding to the system with 1GHz. By searching for information about the clock, can you provide a clearer answer?
If we add another processor, what do you estimate its frequency will be?
Observe the execution times of the benchmarks for systems with different clocks. Is there perfect scaling? Can you provide an explanation if there is no perfect scaling? "

What is clocked at 1 GHz/3 GHz or default?
- Default: Both system.clk_domain.clock and cpu_cluster.clk_domain.clock follow default settings (assumed 1 GHz unless explicitly set).
- With --sys-clock=1GHz: Both domains clocked at 1 GHz.
- With --sys-clock=3GHz: Both domains clocked at 3 GHz.
What happens if another processor is added?
- It inherits the clock frequency of the system.clk_domain.clock (e.g., 1 GHz or 3 GHz).
Is there perfect scaling?
- No. Execution times show improvement (e.g., 0.16 s at 1 GHz to 0.058 s at 3 GHz) but not a perfect 3x speedup.
Why is scaling not perfect?
- Due to architectural constraints, memory bottlenecks, and fixed simulation overheads unrelated to clock frequency.

Fourth Question

Configuration	Execution Time (s)	CPI	L1 I-Cache Miss Rate	L1 D-Cache Miss Rate	L2 Cache Miss Rate
Default Memory	0.174671	3.493415	0.000094	0.060972	0.999944
DDR3_2133_x64	0.171530	3.430593	0.000094	0.060972	0.999944

Observations:

Execution Time: Slight improvement in execution time with the faster memory.
CPI: Lower CPI with DDR3_2133_x64, indicating better overall processor efficiency due to reduced memory access latency.
Miss Rates:
- L1 I-Cache and L1 D-Cache miss rates remain unchanged, as these are mostly influenced by the cache hierarchy and workload.
- L2 Cache Miss Rate remains extremely high, implying the workload is heavily reliant on main memory and the faster DDR3 helps only marginally.

Faster memory reduces memory latency but does not completely address the workload's bottlenecks. The high L2 cache miss rate indicates frequent accesses to main memory, where faster memory only provides diminishing returns.

Second Step

"Try to find the parameter values from the following list that will maximize the performance of your system for each benchmark: L1 instruction cache size
L1 instruction cache associativity
L1 data cache size
L1 data cache associativity
L2 cache size
L2 cache associativity
Cache line size"

Here are the results for each benchmark simulation. The Cache associativities were not deemed as important as the rest of the metrics so they were kept constant for the sake of time conservation.

HMMER

Benchmark Configuration	CPI	L1 D-Cache Miss Rate	L1 I-Cache Miss Rate	L2 Cache Miss Rate
spechmmer_l1i32kB_l1iassoc2_l1d64kB_l1dassoc2_l22MB_l2assoc4_line64_cpu1GHz	1.185304	0.001629	0.000221	0.077747
spechmmer_l1i32kB_l1iassoc2_l1d64kB_l1dassoc2_l22MB_l2assoc4_line128_cpu1GHz	1.180334	0.000879	0.000344	0.072074
spechmmer_l1i32kB_l1iassoc2_l1d128kB_l1dassoc2_l22MB_l2assoc4_line64_cpu1GHz	1.183580	0.000694	0.000212	0.179969
spechmmer_l1i32kB_l1iassoc2_l1d128kB_l1dassoc2_l22MB_l2assoc4_line128_cpu1GHz	1.178968	0.000386	0.000344	0.146335
spechmmer_l1i64kB_l1iassoc2_l1d64kB_l1dassoc2_l22MB_l2assoc4_line64_cpu1GHz	1.185060	0.001631	0.000087	0.080460
spechmmer_l1i64kB_l1iassoc2_l1d64kB_l1dassoc2_l22MB_l2assoc4_line128_cpu1GHz	1.179780	0.000880	0.000082	0.079899
spechmmer_l1i64kB_l1iassoc2_l1d128kB_l1dassoc2_l22MB_l2assoc4_line64_cpu1GHz	1.183287	0.000695	0.000087	0.193925
spechmmer_l1i64kB_l1iassoc2_l1d128kB_l1dassoc2_l22MB_l2assoc4_line128_cpu1GHz	1.178340	0.000387	0.000082	0.182442
spechmmer_l1i32kB_l1iassoc2_l1d64kB_l1dassoc2_l24MB_l2assoc4_line64_cpu1GHz	1.185304	0.001629	0.000221	0.077747
spechmmer_l1i32kB_l1iassoc2_l1d64kB_l1dassoc2_l24MB_l2assoc4_line128_cpu1GHz	1.180334	0.000879	0.000344	0.072074
spechmmer_l1i32kB_l1iassoc2_l1d128kB_l1dassoc2_l24MB_l2assoc4_line64_cpu1GHz	1.183580	0.000694	0.000212	0.179969
spechmmer_l1i32kB_l1iassoc2_l1d128kB_l1dassoc2_l24MB_l2assoc4_line128_cpu1GHz	1.178968	0.000386	0.000344	0.146335
spechmmer_l1i64kB_l1iassoc2_l1d64kB_l1dassoc2_l24MB_l2assoc4_line64_cpu1GHz	1.185060	0.001631	0.000087	0.080460
spechmmer_l1i64kB_l1iassoc2_l1d64kB_l1dassoc2_l24MB_l2assoc4_line128_cpu1GHz	1.179780	0.000880	0.000082	0.079899
spechmmer_l1i64kB_l1iassoc2_l1d128kB_l1dassoc2_l24MB_l2assoc4_line64_cpu1GHz	1.183287	0.000695	0.000087	0.193925
spechmmer_l1i64kB_l1iassoc2_l1d128kB_l1dassoc2_l24MB_l2assoc4_line128_cpu1GHz	1.178340	0.000387	0.000082	0.182442

Crucial Parameters: - L1 Instruction Cache Size and Associativity: hmmer involves repeated fetching of code for statistical modeling. A larger instruction cache with high associativity reduces instruction fetch stalls. - L2 Cache Size: Bioinformatics workloads process large datasets, and a larger L2 cache can help in reducing memory access penalties.

Why: Efficient instruction caching is essential due to frequent branching and instruction reuse. L2 cache assists with data-intensive operations.

BZIP2

Benchmark Name	CPI	L1 Data Cache Miss Rate	L1 Instruction Cache Miss Rate	L2 Cache Miss Rate
specbzip_l1i64kB_l1iassoc2_l1d128kB_l1dassoc2_l24MB_l2assoc4_line128_cpu1GHz	1.564919	0.010685	0.000056	0.209911
specbzip_l1i64kB_l1iassoc2_l1d64kB_l1dassoc2_l24MB_l2assoc4_line128_cpu1GHz	1.597315	0.014274	0.000056	0.154595
specbzip_l1i64kB_l1iassoc2_l1d128kB_l1dassoc2_l22MB_l2assoc4_line128_cpu1GHz	1.573925	0.010702	0.000056	0.236481
specbzip_l1i64kB_l1iassoc2_l1d64kB_l1dassoc2_l22MB_l2assoc4_line128_cpu1GHz	1.606366	0.014290	0.000056	0.174167
specbzip_l1i64kB_l1iassoc2_l1d128kB_l1dassoc2_l24MB_l2assoc4_line128_cpu1GHz	1.573904	0.010701	0.000067	0.236486
specbzip_l1i64kB_l1iassoc2_l1d128kB_l1dassoc2_l24MB_l2assoc4_line64_cpu1GHz	1.581525	0.011621	0.000077	0.363007
specbzip_l1i64kB_l1iassoc2_l1d64kB_l1dassoc2_l24MB_l2assoc4_line128_cpu1GHz	1.606309	0.014290	0.000067	0.174178
specbzip_l1i64kB_l1iassoc2_l1d64kB_l1dassoc2_l24MB_l2assoc4_line64_cpu1GHz	1.609942	0.014672	0.000077	0.282140

Crucial Parameters: - L1 Data Cache Size and Associativity: bzip2 performs repetitive operations on small chunks of data during compression, which benefits from a well-optimized L1 Data Cache. Insufficient size or associativity can lead to cache misses. - L2 Cache Size: Larger datasets and intermediate results might spill over into L2. The size of the L2 cache helps reduce the performance penalty from accessing main memory.

Why: The benchmark's memory access patterns involve small, localized chunks of data. Optimizing L1 and L2 caches minimizes data latency.

MCF

Configuration	CPI	L1 D-Cache Miss Rate	L1 I-Cache Miss Rate	L2 Cache Miss Rate
l1i32kB_l1d64kB_l22MB_line64_cpu1GHz	1.279422	0.002108	0.023627	0.055046
l1i32kB_l1d64kB_l22MB_line128_cpu1GHz	1.320485	0.001384	0.034839	0.020630
l1i32kB_l1d64kB_l24MB_line64_cpu1GHz	1.279161	0.002108	0.023626	0.054613
l1i32kB_l1d64kB_l24MB_line128_cpu1GHz	1.320286	0.001384	0.034838	0.020419
l1i32kB_l1d128kB_l22MB_line64_cpu1GHz	1.278449	0.001932	0.023627	0.055391
l1i32kB_l1d128kB_l22MB_line128_cpu1GHz	1.319366	0.001180	0.034835	0.020749
l1i32kB_l1d128kB_l24MB_line64_cpu1GHz	1.278180	0.001932	0.023627	0.054963
l1i32kB_l1d128kB_l24MB_line128_cpu1GHz	1.319174	0.001180	0.034834	0.020544
l1i64kB_l1d64kB_l22MB_line64_cpu1GHz	1.139457	0.002108	0.000018	0.711865
l1i64kB_l1d64kB_l22MB_line128_cpu1GHz	1.113017	0.001384	0.000020	0.529297
l1i64kB_l1d64kB_l24MB_line64_cpu1GHz	1.139146	0.002108	0.000018	0.706401
l1i64kB_l1d64kB_l24MB_line128_cpu1GHz	1.112796	0.001384	0.000020	0.524081
l1i64kB_l1d128kB_l22MB_line64_cpu1GHz	1.138112	0.001932	0.000018	0.776058
l1i64kB_l1d128kB_l22MB_line128_cpu1GHz	1.111745	0.001180	0.000020	0.624600
l1i64kB_l1d128kB_l24MB_line64_cpu1GHz	1.137956	0.001932	0.000018	0.770179
l1i64kB_l1d128kB_l24MB_line128_cpu1GHz	1.111593	0.001180	0.000020	0.618629

Crucial Parameters: - L2 Cache Size and Associativity: mcf accesses sparse data structures, leading to irregular memory patterns that frequently bypass L1 caches. A larger and more associative L2 cache can reduce main memory accesses. - L1 Instruction Cache Associativity: While L1 instruction caching is less critical for mcf, the associativity can mitigate conflicts caused by irregular instruction fetching.

Why: The irregular and sparse nature of mcf's memory access patterns makes it heavily dependent on the L2 cache to bridge the gap between L1 misses and memory accesses.

SJENG

Configuration	CPI	L1 D-Cache Miss Rate	L1 I-Cache Miss Rate	L2 Cache Miss Rate
specsjeng_l1i32kB_l1d64kB_l22MB_line64	7.040276	0.121831	0.000020	0.999972
specsjeng_l1i32kB_l1d64kB_l22MB_line128	4.974806	0.060922	0.000015	0.999824
specsjeng_l1i32kB_l1d128kB_l22MB_line64	7.040343	0.121831	0.000020	0.999976
specsjeng_l1i32kB_l1d128kB_l22MB_line128	4.974776	0.060921	0.000015	0.999856
specsjeng_l1i64kB_l1d64kB_l22MB_line64	7.043379	0.121831	0.000019	0.999979
specsjeng_l1i64kB_l1d64kB_l22MB_line128	4.974854	0.060922	0.000013	0.999846
specsjeng_l1i64kB_l1d128kB_l22MB_line64	7.040244	0.121831	0.000019	0.999983
specsjeng_l1i64kB_l1d128kB_l22MB_line128	4.974881	0.060921	0.000013	0.999877
specsjeng_l1i32kB_l1d64kB_l24MB_line64	7.038783	0.121831	0.000020	0.999972
specsjeng_l1i32kB_l1d64kB_l24MB_line128	4.972564	0.060922	0.000015	0.999824
specsjeng_l1i32kB_l1d128kB_l24MB_line64	7.038783	0.121831	0.000020	0.999976
specsjeng_l1i32kB_l1d128kB_l24MB_line128	4.972564	0.060921	0.000015	0.999856
specsjeng_l1i64kB_l1d64kB_l24MB_line64	7.038702	0.121831	0.000019	0.999979
specsjeng_l1i64kB_l1d64kB_l24MB_line128	4.972596	0.060922	0.000013	0.999846
specsjeng_l1i64kB_l1d128kB_l24MB_line64	7.038696	0.121831	0.000019	0.999983
specsjeng_l1i64kB_l1d128kB_l24MB_line128	4.972596	0.060921	0.000013	0.999877

Crucial Parameters: - L1 Instruction Cache Size and Associativity: The chess engine involves significant branching and decision-making, relying on quick instruction fetches to evaluate moves. - L1 Data Cache Associativity: Decision-making benefits from localized data caching, where associativity helps minimize cache conflicts.

Why: The benchmark's workload is instruction-heavy with frequent decision tree evaluations. A well-configured L1 cache ensures smooth instruction flow and data access.

L1 & L2 Cache sizes Constant // Associativity 8 16

Benchmark	CPI	L1 D-Cache Miss Rate	L1 I-Cache Miss Rate	L2 Cache Miss Rate
specsjeng_l1i64kB_l1iassoc4_l1d64kB_l1dassoc4_l24MB_l2assoc8_line128_cpu1GHz_new	4.972548	0.060918	0.000013	0.999973
specsjeng_l1i64kB_l1iassoc4_l1d64kB_l1dassoc4_l24MB_l2assoc8_line256_cpu1GHz_new	3.714674	0.030461	0.000009	0.999950
specsjeng_l1i64kB_l1iassoc4_l1d64kB_l1dassoc4_l24MB_l2assoc16_line128_cpu1GHz_new	4.972063	0.060918	0.000013	0.999973
specsjeng_l1i64kB_l1iassoc4_l1d64kB_l1dassoc4_l24MB_l2assoc16_line256_cpu1GHz_new	3.714674	0.030461	0.000009	0.999950
specsjeng_l1i64kB_l1iassoc4_l1d64kB_l1dassoc4_l24MB_l2assoc16_line_cpu1GHz_new_256cacheline	3.714674	0.030461	0.000009	0.999950
specsjeng_l1i64kB_l1iassoc4_l1d64kB_l1dassoc4_l24MB_l2assoc8_line_cpu1GHz_new_256cacheline	3.714674	0.030461	0.000009	0.999950

LIBM

Benchmark	CPI	L1 D-Cache Miss Rate	L1 I-Cache Miss Rate	L2 Cache Miss Rate
speclibm_l1i32kB_l1d64kB_l22MB_l2assoc4_line64_cpu1GHz	2.623265	0.060971	0.000094	0.999944
speclibm_l1i32kB_l1d64kB_l22MB_l2assoc4_line128_cpu1GHz	1.990458	0.030487	0.000112	0.999799
speclibm_l1i32kB_l1d64kB_l22MB_l2assoc8_line64_cpu1GHz	2.623265	0.060971	0.000094	0.999944
speclibm_l1i32kB_l1d64kB_l22MB_l2assoc8_line128_cpu1GHz	1.990458	0.030487	0.000112	0.999799
speclibm_l1i32kB_l1d64kB_l24MB_l2assoc4_line64_cpu1GHz	2.620756	0.060971	0.000094	0.999944
speclibm_l1i32kB_l1d64kB_l24MB_l2assoc4_line128_cpu1GHz	1.989132	0.030487	0.000112	0.999799
speclibm_l1i32kB_l1d64kB_l24MB_l2assoc8_line64_cpu1GHz	2.620756	0.060971	0.000094	0.999944
speclibm_l1i32kB_l1d64kB_l24MB_l2assoc8_line128_cpu1GHz	1.989132	0.030487	0.000112	0.999799
speclibm_l1i32kB_l1d64kB_l24MB_l2assoc16_line128_cpu1GHz	1.989132	0.030487	0.000112	0.999799
speclibm_l1i32kB_l1d64kB_l24MB_l2assoc16_line256_cpu1GHz	1.653896	0.015244	0.000134	0.999474

Crucial Parameters:
- L1 Data Cache Size: Continuous streaming of data for simulation benefits from a large L1 Data Cache to keep working sets close to the processor.
- L2 Cache Size: Larger datasets frequently spill into L2. A larger cache helps prevent memory bottlenecks.
- Why: lbm requires high memory bandwidth and frequent reuse of floating-point data. A well-optimized L1 and L2 cache hierarchy ensures minimal stalls during data fetches.

Third Exercise

Cost Function.

The cost function selected is as follows:

$$ C = a \cdot S + b \cdot T $$

where:

( S ) is the parameter related to the circuit size.
( T ) is the parameter linked to the system's speed.
( a ) and ( b ) are weights.

In this scenario, ( a ) and ( b ) are both set to 0.5 because the size and speed are deemed equally important.

Expanding the formula for ( S ):

$$ S = k_1 \cdot M_{\text{cache}} + k_2 \cdot A $$

Here, M_cache is further expressed as:

$$ M_cache = w_1 \cdot M_{L1i} + w_2 \cdot M_{L1d} + w_3 \cdot M_{L2} $$

where:

M_L1i: The size of the L1 instruction cache.
M_L1d: The size of the L1 data cache.
M_L2: The size of the L2 cache.
w_1, w_2, and w_3: The weights associated with the respective cache levels.

This formulation balances circuit size and system speed to optimize the cost function (C) based on the given parameters and weights.

Applying the function

The cost function is defined as:

$$ C = 0.7 \cdot \left( 0.4 \cdot M_{L1i} + 0.35 \cdot M_{L1d} + 0.25 \cdot M_{L2} \right) + 0.3 \cdot \left( 0.75 \cdot A_{L1} + 0.25 \cdot A_{L2} \right) $$

0.7: The weight assigned to the miss rate component.
0.3: The weight assigned to the access rate component.
0.4, 0.35, and 0.25: The relative contributions of the L1 instruction cache, L1 data cache, and L2 cache to the miss rate, respectively.
0.75 and 0.25: The relative contributions of the L1 and L2 caches to the access rate, respectively.

The exercise states that it is necessary to define an abstract value for the cost function. This value in this case is omega and it can be seen on this particular code snippet.

 # Normalize cache sizes and associativity based on omega
    df["L1i_size_cost"] = (df["L1i_size_kB"] * 1) / omega  # Assume 1 kB = omega cost
    df["L1d_size_cost"] = (df["L1d_size_kB"] * 1) / omega
    df["L2_size_cost"] = (df["L2_size_MB"] * 1024) / omega  # Convert MB to kB
    
    df["L1i_assoc_cost"] = df["L1i_assoc"] / omega
    df["L2_assoc_cost"] = df["L2_assoc"] / omega
    
    # Calculate the components of the cost function
    df["M_cache"] = 0.4 * df["L1i_size_cost"] + 0.35 * df["L1d_size_cost"] + 0.25 * df["L2_size_cost"]
    df["A_cache"] = 0.75 * df["L1i_assoc_cost"] + 0.25 * df["L2_assoc_cost"]
    df["Performance"] = 0.7 * df["CPI"] + 0.3 * (0.5 * df["L1d_miss_rate"] + 0.3 * df["L1i_miss_rate"] + 0.2 * df["L2_miss_rate"])
    
    # Calculate the total cost
    df["Cost"] = 0.5 * (0.7 * df["M_cache"] + 0.3 * df["A_cache"]) + 0.5 * df["Performance"]

Tables for sjeng, hmmer and bzip2

Benchmarks	CPI	L1d_miss_rate	L1i_miss_rate	L2_miss_rate	L1i_size_kB	L1i_assoc	L1d_size_kB	L1d_assoc	L2_size_MB	L2_assoc	L1i_size_cost	L1d_size_cost	L2_size_cost	L1i_assoc_cost	L2_assoc_cost	M_cache	A_cache	Performance	Cost
specsjeng_l1i32kB_l1iassoc2_l1d64kB_l1dassoc2_l22MB_l2assoc4_line64_cpu1GHz	7.04028	0.121831	2e-05	0.999972	32	2	64	2	2	4	3.2	6.4	204.8	0.2	0.4	54.72	0.25	5.00647	21.6927
specsjeng_l1i32kB_l1iassoc2_l1d64kB_l1dassoc2_l22MB_l2assoc4_line128_cpu1GHz	4.97481	0.060922	1.5e-05	0.999824	32	2	64	2	2	4	3.2	6.4	204.8	0.2	0.4	54.72	0.25	3.55149	20.9652
specsjeng_l1i32kB_l1iassoc2_l1d128kB_l1dassoc2_l22MB_l2assoc4_line64_cpu1GHz	7.04034	0.121831	2e-05	0.999976	32	2	128	2	2	4	3.2	12.8	204.8	0.2	0.4	56.96	0.25	5.00652	22.4768
specsjeng_l1i32kB_l1iassoc2_l1d128kB_l1dassoc2_l22MB_l2assoc4_line128_cpu1GHz	4.97478	0.060921	1.5e-05	0.999856	32	2	128	2	2	4	3.2	12.8	204.8	0.2	0.4	56.96	0.25	3.55147	21.7492
specsjeng_l1i64kB_l1iassoc2_l1d64kB_l1dassoc2_l22MB_l2assoc4_line64_cpu1GHz	7.04338	0.121831	1.9e-05	0.999979	64	2	64	2	2	4	6.4	6.4	204.8	0.2	0.4	56	0.25	5.00864	22.1418
specsjeng_l1i64kB_l1iassoc2_l1d64kB_l1dassoc2_l22MB_l2assoc4_line128_cpu1GHz	4.97485	0.060922	1.3e-05	0.999846	64	2	64	2	2	4	6.4	6.4	204.8	0.2	0.4	56	0.25	3.55153	21.4133
specsjeng_l1i64kB_l1iassoc2_l1d128kB_l1dassoc2_l22MB_l2assoc4_line64_cpu1GHz	7.04024	0.121831	1.9e-05	0.999983	64	2	128	2	2	4	6.4	12.8	204.8	0.2	0.4	58.24	0.25	5.00645	22.9247
specsjeng_l1i64kB_l1iassoc2_l1d128kB_l1dassoc2_l22MB_l2assoc4_line128_cpu1GHz	4.97488	0.060921	1.3e-05	0.999877	64	2	128	2	2	4	6.4	12.8	204.8	0.2	0.4	58.24	0.25	3.55155	22.1973
specsjeng_l1i32kB_l1iassoc2_l1d64kB_l1dassoc2_l24MB_l2assoc4_line64_cpu1GHz	7.03878	0.121831	2e-05	0.999972	32	2	64	2	4	4	3.2	6.4	409.6	0.2	0.4	105.92	0.25	5.00542	39.6122
specsjeng_l1i32kB_l1iassoc2_l1d64kB_l1dassoc2_l24MB_l2assoc4_line128_cpu1GHz	4.97256	0.060922	1.5e-05	0.999824	32	2	64	2	4	4	3.2	6.4	409.6	0.2	0.4	105.92	0.25	3.54992	38.8845
specsjeng_l1i32kB_l1iassoc2_l1d128kB_l1dassoc2_l24MB_l2assoc4_line64_cpu1GHz	7.03878	0.121831	2e-05	0.999976	32	2	128	2	4	4	3.2	12.8	409.6	0.2	0.4	108.16	0.25	5.00542	40.3962
specsjeng_l1i32kB_l1iassoc2_l1d128kB_l1dassoc2_l24MB_l2assoc4_line128_cpu1GHz	4.97256	0.060921	1.5e-05	0.999856	32	2	128	2	4	4	3.2	12.8	409.6	0.2	0.4	108.16	0.25	3.54993	39.6685
specsjeng_l1i64kB_l1iassoc2_l1d64kB_l1dassoc2_l24MB_l2assoc4_line64_cpu1GHz	7.0387	0.121831	1.9e-05	0.999979	64	2	64	2	4	4	6.4	6.4	409.6	0.2	0.4	107.2	0.25	5.00537	40.0602
specsjeng_l1i64kB_l1iassoc2_l1d64kB_l1dassoc2_l24MB_l2assoc4_line128_cpu1GHz	4.9726	0.060922	1.3e-05	0.999846	64	2	64	2	4	4	6.4	6.4	409.6	0.2	0.4	107.2	0.25	3.54995	39.3325
specsjeng_l1i64kB_l1iassoc2_l1d128kB_l1dassoc2_l24MB_l2assoc4_line64_cpu1GHz	7.0387	0.121831	1.9e-05	0.999983	64	2	128	2	4	4	6.4	12.8	409.6	0.2	0.4	109.44	0.25	5.00536	40.8442
specsjeng_l1i64kB_l1iassoc2_l1d128kB_l1dassoc2_l24MB_l2assoc4_line128_cpu1GHz	4.9726	0.060921	1.3e-05	0.999877	64	2	128	2	4	4	6.4	12.8	409.6	0.2	0.4	109.44	0.25	3.54995	40.1165

Benchmarks	CPI	L1d_miss_rate	L1i_miss_rate	L2_miss_rate	L1i_size_kB	L1i_assoc	L1d_size_kB	L1d_assoc	L2_size_MB	L2_assoc	L1i_size_cost	L1d_size_cost	L2_size_cost	L1i_assoc_cost	L2_assoc_cost	M_cache	A_cache	Performance	Cost
spechmmer_l1i32kB_l1iassoc2_l1d64kB_l1dassoc2_l22MB_l2assoc4_line64_cpu1GHz	1.1853	0.001629	0.000221	0.077747	32	2	64	2	2	4	3.2	6.4	204.8	0.2	0.4	54.72	0.25	0.834642	19.6068
spechmmer_l1i32kB_l1iassoc2_l1d64kB_l1dassoc2_l22MB_l2assoc4_line128_cpu1GHz	1.18033	0.000879	0.000344	0.072074	32	2	64	2	2	4	3.2	6.4	204.8	0.2	0.4	54.72	0.25	0.830721	19.6049
spechmmer_l1i32kB_l1iassoc2_l1d128kB_l1dassoc2_l22MB_l2assoc4_line64_cpu1GHz	1.18358	0.000694	0.000212	0.179969	32	2	128	2	2	4	3.2	12.8	204.8	0.2	0.4	56.96	0.25	0.839427	20.3932
spechmmer_l1i32kB_l1iassoc2_l1d128kB_l1dassoc2_l22MB_l2assoc4_line128_cpu1GHz	1.17897	0.000386	0.000344	0.146335	32	2	128	2	2	4	3.2	12.8	204.8	0.2	0.4	56.96	0.25	0.834147	20.3906
spechmmer_l1i64kB_l1iassoc2_l1d64kB_l1dassoc2_l22MB_l2assoc4_line64_cpu1GHz	1.18506	0.001631	8.7e-05	0.08046	64	2	64	2	2	4	6.4	6.4	204.8	0.2	0.4	56	0.25	0.834622	20.0548
spechmmer_l1i64kB_l1iassoc2_l1d64kB_l1dassoc2_l22MB_l2assoc4_line128_cpu1GHz	1.17978	0.00088	8.2e-05	0.079899	64	2	64	2	2	4	6.4	6.4	204.8	0.2	0.4	56	0.25	0.830779	20.0529
spechmmer_l1i64kB_l1iassoc2_l1d128kB_l1dassoc2_l22MB_l2assoc4_line64_cpu1GHz	1.18329	0.000695	8.7e-05	0.193925	64	2	128	2	2	4	6.4	12.8	204.8	0.2	0.4	58.24	0.25	0.840048	20.8415
spechmmer_l1i64kB_l1iassoc2_l1d128kB_l1dassoc2_l22MB_l2assoc4_line128_cpu1GHz	1.17834	0.000387	8.2e-05	0.182442	64	2	128	2	2	4	6.4	12.8	204.8	0.2	0.4	58.24	0.25	0.83585	20.8394
spechmmer_l1i32kB_l1iassoc2_l1d64kB_l1dassoc2_l24MB_l2assoc4_line64_cpu1GHz	1.1853	0.001629	0.000221	0.077747	32	2	64	2	4	4	3.2	6.4	409.6	0.2	0.4	105.92	0.25	0.834642	37.5268
spechmmer_l1i32kB_l1iassoc2_l1d64kB_l1dassoc2_l24MB_l2assoc4_line128_cpu1GHz	1.18033	0.000879	0.000344	0.072074	32	2	64	2	4	4	3.2	6.4	409.6	0.2	0.4	105.92	0.25	0.830721	37.5249
spechmmer_l1i32kB_l1iassoc2_l1d128kB_l1dassoc2_l24MB_l2assoc4_line64_cpu1GHz	1.18358	0.000694	0.000212	0.179969	32	2	128	2	4	4	3.2	12.8	409.6	0.2	0.4	108.16	0.25	0.839427	38.3132
spechmmer_l1i32kB_l1iassoc2_l1d128kB_l1dassoc2_l24MB_l2assoc4_line128_cpu1GHz	1.17897	0.000386	0.000344	0.146335	32	2	128	2	4	4	3.2	12.8	409.6	0.2	0.4	108.16	0.25	0.834147	38.3106
spechmmer_l1i64kB_l1iassoc2_l1d64kB_l1dassoc2_l24MB_l2assoc4_line64_cpu1GHz	1.18506	0.001631	8.7e-05	0.08046	64	2	64	2	4	4	6.4	6.4	409.6	0.2	0.4	107.2	0.25	0.834622	37.9748
spechmmer_l1i64kB_l1iassoc2_l1d64kB_l1dassoc2_l24MB_l2assoc4_line128_cpu1GHz	1.17978	0.00088	8.2e-05	0.079899	64	2	64	2	4	4	6.4	6.4	409.6	0.2	0.4	107.2	0.25	0.830779	37.9729
spechmmer_l1i64kB_l1iassoc2_l1d128kB_l1dassoc2_l24MB_l2assoc4_line64_cpu1GHz	1.18329	0.000695	8.7e-05	0.193925	64	2	128	2	4	4	6.4	12.8	409.6	0.2	0.4	109.44	0.25	0.840048	38.7615
spechmmer_l1i64kB_l1iassoc2_l1d128kB_l1dassoc2_l24MB_l2assoc4_line128_cpu1GHz	1.17834	0.000387	8.2e-05	0.182442	64	2	128	2	4	4	6.4	12.8	409.6	0.2	0.4	109.44	0.25	0.83585	38.7594

Benchmarks	CPI	L1d Miss Rate	L1i Miss Rate	L2 Miss Rate	L1i Size (kB)	L1i Assoc	L1d Size (kB)	L1d Assoc	L2 Size (MB)	L2 Assoc	L1i Size Cost	L1d Size Cost	L2 Size Cost	L1i Assoc Cost	L2 Assoc Cost	M Cache	A Cache	Performance	Cost
specbzip_l1i64kB_l1iassoc2_l1d128kB_l1dassoc2_l24MB_l2assoc4_line128_cpu1GHz	1.564919	0.010685	0.000056	0.209911	64	2	128	2	4	4	6.4	12.8	409.6	0.2	0.4	109.44	0.25	1.109646	38.896323
specbzip_l1i64kB_l1iassoc2_l1d64kB_l1dassoc2_l24MB_l2assoc4_line128_cpu1GHz	1.597315	0.014274	0.000056	0.154595	64	2	64	2	4	4	6.4	6.4	409.6	0.2	0.4	107.20	0.25	1.129542	38.122271
specbzip_l1i64kB_l1iassoc2_l1d128kB_l1dassoc2_l22MB_l2assoc4_line128_cpu1GHz	1.573925	0.010702	0.000056	0.236481	64	2	128	2	2	4	6.4	12.8	204.8	0.2	0.4	58.24	0.25	1.117547	20.980273
specbzip_l1i64kB_l1iassoc2_l1d64kB_l1dassoc2_l22MB_l2assoc4_line128_cpu1GHz	1.606366	0.014290	0.000056	0.174167	64	2	64	2	2	4	6.4	6.4	204.8	0.2	0.4	56.00	0.25	1.137055	20.206027
specbzip_l1i64kB_l1iassoc2_l1d128kB_l1dassoc2_l24MB_l2assoc4_line64_cpu1GHz	1.573904	0.010701	0.000067	0.236486	64	2	128	2	4	4	6.4	12.8	409.6	0.2	0.4	109.44	0.25	1.117533	38.900267
specbzip_l1i64kB_l1iassoc2_l1d128kB_l1dassoc2_l24MB_l2assoc4_line64_cpu1GHz	1.581525	0.011621	0.000077	0.363007	64	2	128	2	4	4	6.4	12.8	409.6	0.2	0.4	109.44	0.25	1.117533	38.900267
specbzip_l1i64kB_l1iassoc2_l1d64kB_l1dassoc2_l24MB_l2assoc4_line64_cpu1GHz	1.609942	0.014672	0.000077	0.282140	64	2	64	2	4	4	6.4	6.4	409.6	0.2	0.4	107.20	0.25	1.129542	38.122271

Name		Name	Last commit message	Last commit date
Latest commit History 65 Commits
Cost Exercise		Cost Exercise
First Exercise/hello_result		First Exercise/hello_result
Second Exercise		Second Exercise
graphs		graphs
README.md		README.md

Steve20144/gem5

Folders and files

Latest commit

History

Repository files navigation

gem5 Report

Table of Contents

Introduction & Preface

Environment Setup

Exercises

First Exercise

First Question

Second Question

Some Information on Different CPU Models

Writing and Executing a C Program in gem5

Execution Time Metrics Comparison with Default Settings

Execution Time Metrics Comparison for TimingSimpleCPU and MinorCPU with DDR4 Memory Technology

Execution Time Metrics Comparison for TimingSimpleCPU and MinorCPU with 2GHz Clock Frequency

Second Exercise

First Questions

L1 Instruction Cache (I-Cache):

L1 Data Cache (D-Cache):

L2 Cache:

Second Question

Third Question

Fourth Question

Observations:

Second Step

HMMER

BZIP2

MCF

SJENG

L1 & L2 Cache sizes Constant // Associativity 8 16

LIBM

Third Exercise

Cost Function.

Applying the function

Tables for sjeng, hmmer and bzip2

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages