- Author: Christopher Batten
- Date: February 1, 2017
This tutorial will discuss the various views that make-up a standard-cell library and then illustrate how to use a set of Synopsys ASIC tools to map an RTL design down to these standard cells and ultimately silicon. The tutorial will discuss the key tools used for synthesis, place-and-route, and power analysis. This tutorial requires entering commands manually for each of the tools to enable students to gain a better understanding of the detailed steps involved in this process. The next tutorial will illustrate how this process can be automated to facilitate rapid design-space exploration. This tutorial assumes you have already completed the tutorials on Linux, Git, PyMTL, and Verilog.
Table of Contents
- Overview of ECE 5745 ASIC Tools
- Synopsys Educational 90nm Standard-Cell Libraries
- PyMTL-Based Testing, Simulation, Translation
- Using Synopsys Design Compiler for Synthesis
- Using Synopsys IC Compiler for Place-and-Route
- Using Synopsys PrimeTime for Power Analysis
- Using Verilog RTL Models
- On Your Own
The following diagram illustrates the four primary tools we will be using in ECE 5745 along with a few smaller secondary tools. Notice that the Synopsys ASIC tools all require various views from the standard-cell library.
-
We use the PyMTL framework to test, verify, and evaluate the execution time (in cycles) of our design. This part of the flow is very similar to the flow used in ECE 4750. Note that we can write our RTL models in either PyMTL or Verilog. Once we are sure our design is working correctly, we can then start to push the design through the flow. The ASIC flow requires Verilog RTL as an input, so we can use PyMTL's automatic translation tool to translate PyMTL RTL models into Verilog RTL. We also generate waveforms in
.vcd
(Verilog Change Dump) format, and we usevcd2saif
to convert these waveforms into per-net average activity factors stored in.saif
format. These activity factors will be used for power analysis. -
We use Synopsys Design Compiler (DC) to synthesize our design, which means to transform the Verilog RTL model into a Verilog gate-level netlist where all of the gates are selected from the standard-cell library. We need to provide Synopsys DC with abstract logical and timing views of the standard-cell library in
.db
format. We also provide provide Synopsys DC with the.saif
file to enable mapping high-level RTL net names into low-level gate-level net names, and this mapping is stored in a.namemap
file. This name mapping will be used for power analysis. In addition to the Verilog gate-level netlist, Synopsys DC can also generate a.ddc
file which contains information about the gate-level netlist and timing, and this.ddc
file can be inspected using Synopsys Design Vision (DV). -
We use Synopsys IC Compiler (ICC) to place-and-route our design, which means to place all of the gates in the gate-level netlist into rows on the chip and then to generate the metal wires that connect all of the gates together. We need to provide Synopsys ICC with the same abstract logical and timing views used in Synopsys DC, but we in addition we need to provide Synopsys ICC with technology information in
.tf
and.tluplus
format and abstract physical views of the standard-cell library in.fr
(Milkyway database) format.w Synopsys ICC will generate an updated Verilog gate-level netlist, a.sbpf
file which contains parasitic resistance/capacitance information about all nets in the design, and a.gds
file which contains the final layout. The.gds
file can be inspected using the open-source Klayout GDS viewer. Synopsys ICC also generates reports which can be used to accurately characterize area and timing. -
We use Synopsys PrimeTime (PT) to perform power-analysis of our design. We need to provide Synopsys PT with the same abstract logical, timing, and power views used in Synopsys DC and ICC, but in addition we need to provide switching activity information for every net in the design (which comes from the
.saif
file), capacitance information for every net in the design (which comes from the.sbpf
file), and a file which maps high-level RTL names to low-level gate-level names (which comes from the.namemap
file). Synopsys PT puts the switching activity, capacitance, clock frequency, and voltage together to estimate the power consumption of every net and thus every module in the design, and these estimates are captured in various reports.
Extensive documentation is provided by Synopsys for Design Compiler, IC Compiler, and PrimeTime. We have organized this documentation and made it available to you on the public course webpage. The username/password was distributed during lecture.
You should start by cloning the tutorial repository from GitHub. Access
an ecelinux
machine and use the following commands:
% source setup-ece5745.sh
% mkdir $HOME/ece5745
% cd $HOME/ece5745
% git clone git@github.com:cornell-ece5745/ece5745-tut5-asic-tools
% cd ece5745-tut5-asic-tools
% TOPDIR=$PWD
A standard-cell library is a collection of combinational and sequential
logic gates that adhere to a standardized set of logical, electrical, and
physical policies. For example, all standard cells are usually the same
height, include pins that align to a predetermined vertical and
horizontal grid, include power/ground rails and nwells in predetermined
locations, and support a predetermined number of drive strengths. A
standard-cell designer will usually create a high-level behavioral
specification (in Verilog), circuit schematics (in a .cdl
Spice-like
format), and the actual layout (in .gds
format) for each logic gate.
Synopsys DC, ICC, and PT do not actually use these low-level
implementations, since they are actually too detailed. Instead these
tools use abstract views of the standard cells, which capture logical
functionality, timing, geometry, and power usage at a much higher level.
Acquiring real standard-cell libraries is a complex and potentially expensive process. It requires gaining access to a specific fabrication technology, negotiating with a company which makes standard cells, and usually signing multiple non-disclosure agreements. In this course, we will be using the Synopsys Educational 90nm (SAED) standard-cell library. This library was specifically designed by Synopsys for teaching. It is based on a "fake" 90nm technology. This means you cannot actually tapeout a design using this standard cell library, but the technology is representative enough to provide reasonable area, energy, and timing estimates for teaching purposes. In this section, we will take a look at both the low-level implementations and high-level views of the SAED standard-cell library.
A standard-cell library distribution can contain gigabytes of data in thousands of files. For example, here is the distribution for the SAED standard-cell library.
% cd $STDCELLS_DIR/dist
To simplify using the SAED standard-cell library in this course, we have created a much smaller set of symlinks which point to just the key files we want to use in this course. Here is the directory which contains these symlinks.
% cd $STDCELLS_DIR
% ls
cells.cdl # circuit schematics for each cell
cells.gds # layout for each cell
cells.v # behavioral specification for each cell
cells.sp # schematics with extracted parasitics for each cell
cells.lib # abstract logical, timing, power view for each cell
cells.lef # abstract physical view for each cell
cells.tf # interconnect technology information
cells-tech.lef # interconnect technology information
cells-max.tluplus # interconnect parasitic resistance/capacitance
cells-min.tluplus # interconnect parasitic resistance/capacitance
cells.db # binary compiled version of .lib file
cells.fr # Milkyway database built from .lef file
cells.pdf # standard-cell library databook
cells.lyp # layer settings for Klayout
Let's begin by looking at the schematic for a 3-input NAND cell (NAND3X0).
% less -p NAND3X0 cells.cdl
.subckt NAND3X0 IN1 IN2 IN3 QN VDD VSS
mmn2 net1 IN2 net2 VSS n12 l = 0.1u w = 0.52u m = 1
mmn3 net2 IN3 VSS VSS n12 l = 0.1u w = 0.52u m = 1
mmn1 QN IN1 net1 VSS n12 l = 0.1u w = 0.52u m = 1
mmp1 QN IN1 VDD VDD p12 l = 0.1u w = 0.58u m = 1
mmp2 QN IN2 VDD VDD p12 l = 0.1u w = 0.58u m = 1
mmp3 QN IN3 VDD VDD p12 l = 0.1u w = 0.58u m = 1
ends NAND3X0
For students with a circuits background, there should be no surprises here, and for those students with less circuits background we will cover basic static CMOS gate design later in the course. Essentially, this schematic includes three NMOS transistors arranged in series in the pull-down network, and three PMOS transistors arranged in parallel in the pull-up network. The PMOS transistors are larger than the NMOS transistors because the mobility of holes is less than the mobility of electrons.
Now let's look at the layout for the 3-input NAND cell using the open-source Klayout GDS viewer.
% klayout -l $STDCELLS_DIR/cells.lyp $STDCELLS_DIR/cells.gds
Note that we are using the .lyp
file which is a predefined layer color
scheme that makes it easier to view GDS files. To view the 3-input NAND
cell, find the NAND3X0 cell in the left-hand cell list, and then choose
Display > Show as New Top from the menu. Here is a picture of the
layout for this cell.
Diffusion is green, polysilicon is red, contacts are solid dark blue, metal 1 (M1) is blue, and the nwell is the large gray rectangle over the top half of the cell. All standard cells will be the same height and have the nwell in the same place. Notice the three NMOS transistors arranged in series in the pull-down network, and three PMOS transistors arranged in parallel in the pull-up network. The standard cell also includes two substrate contacts. The power rail is the horizontal strip of M1 at the top, and the ground rail is the horizontal strip of M1 at the bottom. All standard cells will have the power and ground rails in the same place so they will connect via abutment if these cells are arranged in a row. Although it is difficult to see, the three input pins and one output pin are labeled squares of M1, and these pins are arranged to be on a predetermined grid. Students with a circuits background may recognize that this is not the highest quality layout. A commercial standard-cell library will usually use very aggressive layout to optimize for area, energy, and timing. Having said this, the SAED standard-cell library is reasonable for the purposes of this course.
Now let's look at the Verilog behavior specification for the 3-input NAND cell.
% less -p NAND3X0 $STDCELLS_DIR/cells.cdl
module NAND3X0 (IN1,IN2,IN3,QN);
output QN;
input IN1,IN2,IN3;
nand #1 (QN,IN3,IN2,IN1);
`ifdef functional
`else
specify
specparam in1_lh_qn_hl=20,in1_hl_qn_lh=21,in2_lh_qn_hl=28,
in2_hl_qn_lh=27,in3_lh_qn_hl=29,in3_hl_qn_lh=32;
( IN1 -=> QN) = (in1_hl_qn_lh,in1_lh_qn_hl);
( IN2 -=> QN) = (in2_hl_qn_lh,in2_lh_qn_hl);
( IN3 -=> QN) = (in3_hl_qn_lh,in3_lh_qn_hl);
endspecify
`endif
endmodule
Note that the Verilog implementation of the 3-input NAND cell looks
nothing like the Verilog we used in ECE 4750. This cell is implemented
using a Verilog primitive gate (i.e., nand
), it includes a delay value
of one delay unit (i.e., #1
), and it includes a specify
block which
is used for advanced gate-level simulation with back-annotated delays.
We can use sophisticated tools to extract detailed parasitic resistance and capacitance values from the layout, and then we can add these parasitics to the circuit schematic to create a much more accurate model for experimenting with the circuit timing and power. Let's look at a snippet of the extracted circuit for the 3-input NAND cell:
% less -p NAND3X0 $STDCELLS_DIR/cells.sp
.SUBCKT NAND3X0 VSS VDD IN3 IN1 IN2 QN
...
Cg1 M6:DRN 0 1.83492e-17
Cg2 M4:DRN 0 1.74201e-18
Cg3 M1:DRN 0 1.13867e-17
R1 M6:DRN M5:SRC 0.001
R2 M6:DRN M4:DRN 423.29
R3 M6:DRN QN 10.9403
R4 M4:DRN QN 10.8355
R5 QN M1:DRN 5.66819
...
MM1 M1:DRN M1:GATE M1:SRC M1:BULK n12 ad=0.159p as=0.096p l=0.1u pd=1.65u ps=0.89u w=0.52u
MM2 M2:DRN M2:GATE M2:SRC M2:BULK n12 ad=0.096p as=0.096p l=0.1u pd=0.89u ps=0.89u w=0.52u
MM3 M3:DRN M3:GATE M3:SRC M3:BULK n12 ad=0.096p as=0.156p l=0.1u pd=0.89u ps=1.64u w=0.52u
MM4 M4:DRN M4:GATE M4:SRC M4:BULK p12 ad=0.177p as=0.107p l=0.1u pd=1.77u ps=0.95u w=0.58u
MM5 M5:DRN M5:GATE M5:SRC M5:BULK p12 ad=0.107p as=0.107p l=0.1u pd=0.95u ps=0.95u w=0.58u
MM6 M6:DRN M6:GATE M6:SRC M6:BULK p12 ad=0.107p as=0.174p l=0.1u pd=0.95u ps=1.76u w=0.58u
.ENDS
The full model is a couple of hundred lines long, so you can see how
detailed this model is! The ASIC tools do not really need this much
detail. We can use a special set of tools to create a much higher level
abstract view of the timing and power of this circuit suitable for use by
the ASIC tools. Essentially, these tools run many, many circuit-level
simulations to create characterization data stored in a .lib
(Liberty)
file. Let's look at snippet of the .lib
file for the 3-input NAND cell.
% less -p NAND3X0 $STDCELLS_DIR/cells.lib
cell (NAND3X0) {
cell_footprint : "nand3x0 ";
area : 7.3728 ;
cell_leakage_power : 9.151417e+04;
...
pin (IN1) {
fanout_load : 0.059000;
direction : "input";
fall_capacitance : 2.212771;
capacitance : 2.190745;
rise_capacitance : 2.168719;
...
internal_power () {
when : "!IN2&!IN3";
rise_power ("power_inputs_1") {
/* index_1 = input transition time */
index_1(" 0.0160000, 0.0320000, 0.0640000, 0.1280000, 0.2560000, 0.5120000, 1.0240000");
values ("-1.2575404, -1.2594251, -1.2887053, -1.2413107, -1.2083520, -1.2261536, -1.1689351");
}
fall_power ("power_inputs_1") {
/* index_1 = input transition time */
index_1(" 0.0160000, 0.0320000, 0.0640000, 0.1280000, 0.2560000, 0.5120000, 1.0240000");
values (" 1.9840914, 1.9791286, 2.0696119, 2.0561270, 2.0655292, 2.0637255, 2.0668630");
}
}
...
pin (QN) {
direction : "output";
power_down_function : "!VDD + VSS";
function : "(IN3*IN2*IN1)'";
...
timing () {
related_pin : "IN1";
timing_sense : "negative_unate";
cell_rise ("del_1_7_7") {
/* index_1 = input net transition time */
index_1("0.016, 0.032, 0.064, 0.128, 0.256, 0.512, 1.024");
/* index_2 = total output net capacitance */
index_2("0.1, 3.75, 7.5, 13, 26, 52, 104");
values(
"0.0178632, 0.0275957, 0.0374970, 0.0517788, 0.0856314, 0.1538226, 0.2886417", \
"0.0215562, 0.0316225, 0.0414275, 0.0556431, 0.0895029, 0.1573472, 0.2917113", \
"0.0261721, 0.0387623, 0.0496870, 0.0639978, 0.0973798, 0.1643430, 0.3008298", \
"0.0323952, 0.0479363, 0.0614787, 0.0790854, 0.1144558, 0.1806003, 0.3153037", \
"0.0413278, 0.0605217, 0.0771532, 0.0986336, 0.1423713, 0.2144832, 0.3477775", \
"0.0540991, 0.0787062, 0.0997246, 0.1260550, 0.1797894, 0.2692040, 0.4165220", \
"0.0712053, 0.1041857, 0.1296371, 0.1645895, 0.2318538, 0.3420702, 0.5237414");
}
}
}
...
}
This is just a small subset of the information included in the .lib
file for this cell. We will talk more about the details of such .lib
files later in the course, but you can see that the .lib
file contains
information about area, leakage power, capacitance of each input pin,
internal power, logical functionality, and timing. Units for all data is
provided at the top of the .lib
file. In this snippet you can see that
the area of the cell is 7.4 square micron and the leakage power is 9.2uW.
The capacitance for the input pin IN`` is 2.1fF, although there is additional data that capture how the capacitance changes depending on whether the input is rising or falling. The output pin
QNimplements the logic equation
(IN3IN2IN1)(i.e., a NAND). Data within the
.libfile are often represented using one- or two-dimensional lookup tables (i.e., a
values` table). You can see two such lookup tables in the above
snippet.
Let's start by focusing on the first lookup table. This table captures
the internal power, which is the power consumed within the gate itself,
as a function of one parameter: the input transition time. Longer input
transition times result in larger short-circuit current through the gate.
Note that the characterization data is for a specific configuration of
the input pins (i.e., !IN2&!IN3
). There are other tables for the other
configurations of the input pins. Each entry in the lookup table is
calculated by measuring the current drawn from the power supply during a
detailed spice simulation and subtracting any current used to charge
the output load. In other words all of the energy that is not consumed
charging up the output load is considered internal energy. Note that some
of the internal power is negative. This is simply due to how we account
for energy. We can either assume all energy is consumed only when the
output node is charged and no energy energy is consumed when the output
node is discharged, or we can assume half the energy is consumed when
the output is node is charged and half the energy is consumed when the
output node is discharged. In this case the library is using the second
approach, and the way the total energy is averaged across both
transitions can result in either the rising or falling internal power to
be negative. We will discuss this in greater detail later in the course.
Now let's focus on the second lookup table. This table captures the delay
from input pin IN1
to output pin QN
as a function of two parameters:
the input transition time (horizontal direction in lookup table) and the
load capacitance (vertical direction in lookup table). Gates are slower
when the inputs take longer transition and/or when they are driving large
output loads. Each entry in the lookup table reflects characterization of
one or more detailed circuit-level simulations. So in this example the
delay from input pin IN1
to output pin QN
is 31ps when the input
transition rate is 32ps and the output load is 3.75fF. This level of
detail can enable very accurate static timing analysis of our designs.
Note that the ASIC tools actually do not use the .lib
file directly,
but instead use a pre-compiled binary version of the .lib
file stored
in .db
format. While the .lib
file captures the abstract logical,
timing, and power aspects of the standard-cell library, it does not
capture the physical aspects of the standard-cell library. While the ASIC
tools could potentially use the .gds
file directly, the ASIC tools do
not really need this much detail. We can use a special set of tools to
create a much higher level abstract view of the physical aspects of the
cell suitable for use by the ASIC tools. These tools create .lef
files.
Let's look at snippet of the the .lef
file for the 3-input NAND cell.
% less -p NAND3X0 $STDCELLS_DIR/cells.lef
MACRO NAND3X0
CLASS CORE ;
ORIGIN 0 0 ;
SYMMETRY X Y ;
SITE unit ;
SIZE 2.56 BY 2.88 ;
PIN VDD
DIRECTION INOUT ;
USE POWER ;
PORT
LAYER M1 ;
RECT 0 2.8 2.56 2.96 ;
RECT 1.585 1.805 1.725 2.8 ;
RECT 0.645 1.45 0.785 2.8 ;
RECT 0.23 1.51 0.37 2.8 ;
END
END VDD
PIN QN
DIRECTION OUTPUT ;
USE SIGNAL ;
PORT
LAYER M1 ;
RECT 1.745 1 2.04 1.24 ;
RECT 1.115 1.525 2.2 1.665 ;
...
END
OBS
LAYER PO ;
RECT 1.84 0.1 1.94 2.115 ;
RECT 1.84 2.115 2.07 2.345 ;
RECT 0.9 0.1 1 2.125 ;
...
END
This is just a small subset of the information included in the .lef
file for this cell. You can see the .lef
file includes information on
the dimensions of the cell and the location and dimensions of both
power/ground and signal pins. The file also includes information on
"obstructions" (or blockages). These are regions of the cell which should
not be used for any purpose by the ASIC tools. You can use Klayout to
view .lef
files as well.
% klayout
Choose File > Import > LEF from the menu. Navigate to the cells.lef
file. Here is a picture of the .lef
for this cell.
If you compare the .lef
to the .gds
you can see that the .lef
is a
much simpler representation that only captures the boundary, pins, and
obstructions. The ASIC tools actually do not use the .lef
file
directly, but instead use a pre-generated database version of the .lef
file stored in .fr
(Milkyway) format.
The standard-cell library also includes several files (e.g., .tf
,
-tech.lef
, .tluplus
) that capture information about the metal
interconnect including the wire width/pitch and parasitics.
Finally, a standard-cell library will always include a databook, which is a document that describes the details of every cell in the library. Take a few minutes to browse through the SAED standard-cell library databook.
Our goal in this tutorial is to generate layout for the sort unit from the PyMTL tutorial using the ASIC tools. As a reminder, the sort unit takes as input four integers and a valid bit and outputs those same four integers in increasing order with the valid bit. The sort unit is implemented using a three-stage pipelined, bitonic sorting network and the datapath is shown below.
Let's start by running the tests for the sort unit and note that the
tests for the SortUnitStructRTL
will fail. You can just copy over your
implementation of the MinMaxUnit
from when you completed the PyMTL
tutorial. If you have not completed the PyMTL tutorial then go back and
do that now.
% mkdir $TOPDIR/sim/build
% cd $TOPDIR/sim/build
% py.test ../tut3_pymtl/sort
% py.test ../tut3_pymtl/sort --test-verilog
The --test-verilog
command line option tells the PyMTL framework to
first translate the sort unit into Verilog, and then important it back
into PyMTL to verify that the translated Verilog is itself correct. With
the --test-verilog
command line option, PyMTL will skip tests that are
not for verifying RTL. After running the tests we use the sort unit
simulator to do the final translation into Verilog and to dump the .vcd
(Value Change Dump) file that we want to use for power analysis.
% cd $TOPDIR/sim/build
% ../tut3_pymtl/sort/sort-sim --impl rtl-struct --stats --translate --dump-vcd
num_cycles = 105
num_cycles_per_sort = 1.05
Take a moment to open up the translated Verilog which should be in a file
named SortUnitStructRTL_0x73ab8da9cdd886de.v
. The complicated hash
suffix is used by PyMTL to make this filename unique even for
parameterized modules which are instantiated for a specific set of
parameters. The hash might be different for your design. Try to see how
both the structural composition and the behavioral modeling translates
into Verilog. Here is an example of the translation for the MinMaxUnit
.
Notice how PyMTL will output the source Python embedded as a comment in
the corresponding translated Verilog.
% cd $TOPDIR/sim/build
% less SortUnitStructRTL_0x73ab8da9cdd886de.v
module MinMaxUnit_0x152ab97dfd22b898
(
input wire [ 0:0] clk,
input wire [ 7:0] in0,
input wire [ 7:0] in1,
output reg [ 7:0] out_max,
output reg [ 7:0] out_min,
input wire [ 0:0] reset
);
// PYMTL SOURCE:
//
// @s.combinational
// def block():
//
// if s.in0 >= s.in1:
// s.out_max.value = s.in0
// s.out_min.value = s.in1
// else:
// s.out_max.value = s.in1
// s.out_min.value = s.in0
// logic for block()
always @ (*) begin
if ((in0 >= in1)) begin
out_max = in0;
out_min = in1;
end
else begin
out_max = in1;
out_min = in0;
end
end
endmodule // MinMaxUnit_0x4b8e51bd8055176a
Although we hope students will not need to actually open up this translated Verilog it is occasionally necessary. For example, PyMTL is not perfect and can translate incorrectly which might require looking at the Verilog to see where it went wrong. Other steps in the ASIC flow might refer to an error in the translated Verilog which will also require looking at the Verilog to figure out why the other steps are going wrong. While we try and make things as automated as possible, students will eventually need to dig in and debug some of these steps themselves.
The .vcd
file contains information about the state of every net in the
design on every cycle. This can make these .vcd
files very large and
thus slow to analyze. For average power analysis, we only need to know
the activity factor on each net. We can use the vcd2saif
tool to
convert .vcd
files into .saif
files. An .saif
file only contains a
single average activity factor for every net.
% cd $TOPDIR/sim/build
% vcd2saif -input sort-rtl-struct-random.verilator1.vcd -output sort-rtl-struct-random.saif
We use Synopsys Design Compiler (DC) to synthesize Verilog RTL models
into a gate-level netlist where all of the gates are from the standard
cell library. So Synopsys DC will synthesize the Verilog +
operator
into a specific arithmetic block at the gate-level. Based on various
constraints it may synthesize a ripple-carry adder, a carry-look-ahead
adder, or even more advanced parallel-prefix adders.
We start by creating a subdirectory for our work, and then launching Synopsys DC.
% mkdir -p $TOPDIR/asic/dc-syn
% cd $TOPDIR/asic/dc-syn
% dc_shell-xg-t
To make it easier to copy-and-paste commands from this document, we tell
Synopsys DC to ignore the prefix dc_shell>
using the following:
dc_shell> alias "dc_shell>" ""
There are two important variables we need to set before starting to work
in Synopsys DC. The target_library
variable specifies the standard
cells that Synopsys DC should use when synthesizing the RTL. The
link_library
variable should search the standard cells, but can also
search other cells (e.g., SRAMs) when trying to resolve references in our
design. These other cells are not meant to be available for Synopsys DC
to use during synthesis, but should be used when resolving references.
Including *
in the link_library
variable indicates that Synopsys DC
should also search all cells inside the design itself when resolving
references.
dc_shell> set_app_var target_library "$env(STDCELLS_DIR)/cells.db"
dc_shell> set_app_var link_library "* $env(STDCELLS_DIR)/cells.db"
Note that we can use $env(STDCELLS_DIR)
to get access to the
$STDCELLS_DIR
environment variable which specifies the directory
containing the standard cells, and that we are referencing the abstract
logical and timing views in the .db
format.
The next step is to turn on name mapping. When we do power analysis we
will be coming activity factors from RTL simulation in .saif
format
with gate-level models. There will be many nets from the RTL simulation
that may not exist in the gate-level model, but ideally there will still
be enough nets that are present in both the .saif
file and the
gate-level model to be able to do reasonably accurate power estimation.
To help this process, name mapping will keep track of how names in the
RTL map to names in the gate-level model.
dc_shell> saif_map -start
As an aside, if you want to learn more about any command in any Synopsys
tool, you can simply type man toolname
at the shell prompt. We are now
ready to read in the Verilog file which contains the top-level design and
all referenced modules. We do this with two commands. The analyze
command reads the Verilog RTL into an intermediate internal
representation. The elaborate
command recursively resolves all of the
module references starting from the top-level module, and also infers
various registers and/or advanced data-path components.
dc_shell> analyze -format sverilog ../../sim/build/SortUnitStructRTL_0x73ab8da9cdd886de.v
dc_shell> elaborate SortUnitStructRTL_0x73ab8da9cdd886de
We can use the check_design
command to make sure there are no obvious
errors in our Verilog RTL.
dc_shell> check_design
You should see three warnings related to unconnected clk[0]
and
reset[0]
ports. This is because PyMTL always includes clk
and reset
ports even if they are not actually used in the module. You can safely
ignore these warnings. It is critical that you carefully review all
warnings. There may be many warnings, but you should still skim through
them. Often times there will be something very wrong in your Verilog RTL
which means any results from using the ASIC tools is completely bogus.
Synopsys DC will output a warning, but Synopsys DC will usually just keep
going, potentially producing a completely incorrect gate-level model!
We need to create a clock constraint to tell Synopsys DC what our target
cycle time is. Synopsys DC will not synthesize a design to run "as fast
as possible". Instead, the designer gives Synopsys DC a target cycle time
and the tool will try to meet this constraint while minimizing area and
power. The create_clock
command takes the name of the clock signal in
the Verilog (which in this course will always be clk
), the label to
give this clock (i.e., ideal_clock1
), and the target clock period in
nanoseconds. So in this example, we are asking Synopsys DC to see if it
can synthesize the design to run at 1GHz (i.e., a cycle time of 1ns).
dc_shell> create_clock clk -name ideal_clock1 -period 1
Finally, the compile
command will do the synthesis.
dc_shell> compile
During synthesis, Synopsys DC will display information about its optimization process. It will report on its attempts to map the RTL into standard-cells, optimize the resulting gate-level netlist to improve the delay, and then optimize the final design to save area.
The compile
command does not flatten your design. Flatten means to
remove module hierarchy boundaries; so instead of having module A and
module B within module C, Synopsys DC will take all of the logic in
module A and module B and put it directly in module C. You can enable
flattening with the -ungroup_all
option. Without extra hierarchy
boundaries, Synopsys DC is able to perform more optimizations and
potentially achieve better area, energy, and timing. However, an
unflattened design is much easier to analyze, since if there is a module
A in your RTL design that same module will always be in the synthesized
gate-level netlist.
The compile
command does not perform many optimizations. Synopsys DC
also includes compile_ultra
which does many more optimizations and will
likely produce higher quality of results. Keep in mind that the compile
command will not flatten your design by default, while the
compile_ultra
command will flattened your design by default. You can
turn off flattening by using the -no_autoungroup
option with the
compile_ultra
command. Once you finish this tutorial, feel free to go
back and experiment with the compile_ultra
command.
The next step is to create and then output the .namemap
file which
contains the name mapping we will use in power analysis.
dc_shell> saif_map -create_map \
-input "../../sim/build/sort-rtl-struct-random.saif" \
-source_instance "TOP/v"
dc_shell> saif_map -type ptpx -write_map "post-synth.namemap"
Now that we have synthesized the design, we output the resulting
gate-level netlist in two different file formats: Verilog and .ddc
(which we will use with Synopsys DesignVision).
dc_shell> write -format verilog -hierarchy -output post-synth.v
dc_shell> write -format ddc -hierarchy -output post-synth.ddc
We can use various commands to generate reports about area, energy, and
timing. The report_timing
command will show the critical path through
the design. Part of the report is displayed below.
dc_shell> report_timing -nosplit -transition_time -nets -attributes
...
Point Fanout Trans Incr Path
--------------------------------------------------------------------------
clock ideal_clock1 (rise edge) 0.00 0.00
clock network delay (ideal) 0.00 0.00
elm_S0S1$002/out_reg[1]/CLK (DFFX1) 0.00 0.00 0.00 r
elm_S0S1$002/out_reg[1]/Q (DFFX1) 0.03 0.18 0.18 f
elm_S0S1$002/out[1] (net) 2 0.00 0.18 f
elm_S0S1$002/out[1] (Reg_0x45f1552f10c5f05d_6) 0.00 0.18 f
elm_S0S1$002$out[1] (net) 0.00 0.18 f
minmax1_S1/in0[1] (MinMaxUnit_0x152ab97dfd22b898_0) 0.00 0.18 f
minmax1_S1/in0[1] (net) 0.00 0.18 f
minmax1_S1/U40/ZN (INVX0) 0.03 0.07 0.25 r
minmax1_S1/n15 (net) 1 0.00 0.25 r
minmax1_S1/U5/Q (OA21X1) 0.04 0.11 0.36 r
minmax1_S1/n16 (net) 1 0.00 0.36 r
minmax1_S1/U42/QN (AOI222X1) 0.03 0.16 0.53 f
minmax1_S1/n17 (net) 1 0.00 0.53 f
minmax1_S1/U43/Q (AO221X1) 0.04 0.12 0.64 f
minmax1_S1/n18 (net) 1 0.00 0.64 f
minmax1_S1/U44/Q (OA221X1) 0.04 0.11 0.76 f
minmax1_S1/n19 (net) 1 0.00 0.76 f
minmax1_S1/U45/Q (AO221X1) 0.04 0.12 0.88 f
minmax1_S1/n20 (net) 2 0.00 0.88 f
minmax1_S1/U46/Q (OA221X1) 0.04 0.12 1.00 f
minmax1_S1/n21 (net) 1 0.00 1.00 f
minmax1_S1/U47/Q (OA22X1) 0.04 0.11 1.11 f
minmax1_S1/n23 (net) 2 0.00 1.11 f
minmax1_S1/U23/Q (AO21X1) 0.05 0.12 1.23 f
minmax1_S1/n8 (net) 4 0.00 1.23 f
minmax1_S1/U30/ZN (INVX0) 0.06 0.10 1.33 r
minmax1_S1/n12 (net) 3 0.00 1.33 r
minmax1_S1/U14/Q (AO22X1) 0.04 0.12 1.45 r
minmax1_S1/out_max[7] (net) 1 0.00 1.45 r
minmax1_S1/out_max[7] (MinMaxUnit_0x152ab97dfd22b898_0) 0.00 1.45 r
minmax1_S1$out_max[7] (net) 0.00 1.45 r
elm_S1S2$003/in_[7] (Reg_0x45f1552f10c5f05d_9) 0.00 1.45 r
elm_S1S2$003/in_[7] (net) 0.00 1.45 r
elm_S1S2$003/out_reg[7]/D (DFFX2) 0.04 0.04 1.49 r
data arrival time 1.49
clock ideal_clock1 (rise edge) 1.00 1.00
clock network delay (ideal) 0.00 1.00
elm_S1S2$003/out_reg[7]/CLK (DFFX2) 0.00 1.00 r
library setup time -0.08 0.92
data required time 0.92
--------------------------------------------------------------------------
data required time 0.92
data arrival time -1.49
--------------------------------------------------------------------------
slack (VIOLATED) -0.57
This timing report uses static timing analysis to find the critical
path. Static timing analysis checks the timing across all paths in the
design (regardless of whether these paths can actually be used in
practice) and finds the longest path. For more information about static
timing analysis, consult Chapter 1 of the Synopsys Timing Constraints
and Optimization User
Guide. The
report clearly shows that the critical path starts at a pipeline register
in between the S1 and S2 stages (elm_S0S1[2]
), goes into the first
input of a MinMaxUnit
, comes out the out_max
port of the
MinMaxUnit
, and ends at a pipeline register in between the S2 and S3
stages (elm_S1S2[3]
). The report shows the delay through each logic
gate (e.g., the clk-to-q delay of the initial DFF is 180ps, the
propagation delay of a OA22X1 gate is 130ps) and the total delay for the
critical path which in this case is 1.49ns. Notice how all of the OA22X1
gates do not all have the same propagation delay; this is because the
static timing analysis also factors in input slew rates, rise vs fall
time, and output load when calculating the delay of each gate. We set the
clock constraint to be 1ns, but also notice that the report factors in
the setup time required at the final register. The setup time is 80ps, so
in order to operate the sort unit at 1ns and meet the setup time we would
need the critical path to arrive in 0.92ns.
The difference between the required arrival time and the actual arrival time is called the slack. Positive slack means the path arrived before it needed to while negative slack means the path arrived after it needed to. If you end up with positive slack it means you probably want to decrease your clock constraint to push the tools harder and produce a faster design. Even if you have no slack you still probably want to decrease your clock constraint. This is because the tools rarely leave positive slack preferring instead to take an overly fast design and resynthesize smaller logic to save area and power. In the above example, we have 390ps of negative slack. Note that this does not mean the sort unit will not work. It just means the cycle time would have to be 1.57ns in order for the sort unit to operate correctly. Because in this course we are primarily interested in design-space exploration (as opposed to meeting some kind of arbitrary timing constraint), we suggest adjusting the clock constraint until you end up with about 5-10% negative slack. This will result in a well-optimized design and help identify the "fundamental" performance of the design.
The report_area
command can show how much area each module uses and can
enable detailed area breakdown analysis.
dc_shell> report_area -nosplit -hierarchy
...
Combinational area: 2258.841582
Buf/Inv area: 481.075210
Noncombinational area: 2779.545593
Macro/Black Box area: 0.000000
Net Interconnect area: 226.283256
Total cell area: 5038.387176
Total area: 5264.670431
Global Local
Cell Area Cell Area
---------- ----------------
Hierarchical cell Abs Non Black-
Total % Comb Comb boxes
------------------ ------ ---- ------ ------ ---- -------------------------------
SortUnitStructRTL 5038.3 100 0.0 0.0 0.0 SortUnitStructRTL
elm_S0S1$000 199.0 4.0 0.0 199.0 0.0 Reg_0x45f1552f10c5f05d_8
elm_S0S1$001 199.0 4.0 0.0 199.0 0.0 Reg_0x45f1552f10c5f05d_7
elm_S0S1$002 199.0 4.0 0.0 199.0 0.0 Reg_0x45f1552f10c5f05d_6
elm_S0S1$003 199.0 4.0 0.0 199.0 0.0 Reg_0x45f1552f10c5f05d_5
elm_S1S2$000 211.9 4.2 0.0 211.9 0.0 Reg_0x45f1552f10c5f05d_0
elm_S1S2$001 211.9 4.2 0.0 211.9 0.0 Reg_0x45f1552f10c5f05d_11
elm_S1S2$002 250.6 5.0 0.0 250.6 0.0 Reg_0x45f1552f10c5f05d_10
elm_S1S2$003 231.3 4.6 0.0 231.3 0.0 Reg_0x45f1552f10c5f05d_9
elm_S2S3$000 250.6 5.0 0.0 250.6 0.0 Reg_0x45f1552f10c5f05d_4
elm_S2S3$001 250.6 5.0 0.0 250.6 0.0 Reg_0x45f1552f10c5f05d_3
elm_S2S3$002 250.6 5.0 0.0 250.6 0.0 Reg_0x45f1552f10c5f05d_2
elm_S2S3$003 250.6 5.0 0.0 250.6 0.0 Reg_0x45f1552f10c5f05d_1
minmax0_S1 462.6 9.2 462.6 0.0 0.0 MinMaxUnit_0x152ab97dfd22b898_3
minmax0_S2 407.3 8.1 407.3 0.0 0.0 MinMaxUnit_0x152ab97dfd22b898_4
minmax1_S1 506.8 10.1 506.8 0.0 0.0 MinMaxUnit_0x152ab97dfd22b898_0
minmax1_S2 483.8 9.6 483.8 0.0 0.0 MinMaxUnit_0x152ab97dfd22b898_2
minmax_S3 364.9 7.2 364.9 0.0 0.0 MinMaxUnit_0x152ab97dfd22b898_1
val_S0S1 35.9 0.7 11.0 24.8 0.0 RegRst_0x2ce052f8c32c5c39_1
val_S1S2 35.9 0.7 11.0 24.8 0.0 RegRst_0x2ce052f8c32c5c39_2
val_S2S3 35.9 0.7 11.0 24.8 0.0 RegRst_0x2ce052f8c32c5c39_0
------------------ ------ ---- ------ ------ ---- -------------------------------
Total 2258.8 2779.5 0.0
The units are in square micron. Note that the total cell area is
different from the total area. The total cell area includes just the
standard cells, while the totral area includes filler cells. We will want
to use the total area in our analysis. So we can see that the sort unit
consumes approximately 5,264um^2 of area. We can also see that each
pipeline register consumes about 4-5% of the area, while the
MinMaxUnits
consume about ~40% of the area. This is one reason we try
not to flatten our designs, since the module hierarchy helps us
understand the area breakdowns. If we completely flattened the design
there would only be one line in the above table.
The report_power
command can show how much power each module consumes.
Note that this power analysis is actually not that useful yet, since at
this stage of the flow the power analysis is based purely on statistical
activity factor estimation. Basically, Synopsys DC assumes every net
toggles 10% of the time. This is a pretty poor estimate, so we should
never use this kind of statistical power estimation in this course.
dc_shell> report_power -nosplit -hierarchy
Finally, we go ahead and exit Synopsys DC.
dc_shell> exit
Take a few minutes to examine the resulting Verilog gate-level netlist.
Notice that the module hierarchy is preserved and also notice that the
MinMaxUnit
synthesizes into a large number of basic logic gates.
% cd $TOPDIR/asic/dc-syn
% more post-synth.v
We can use the Synopsys Design Vision (DV) tool for browsing the
resulting gate-level netlist, plotting critical path histograms, and
generally analyzing our design. Start Synopsys DV and setup the
target_library
and link_library
variables as before.
% design_vision-xg
design_vision> set_app_var target_library "$env(STDCELLS_DIR)/cells.db"
design_vision> set_app_var link_library "* $env(STDCELLS_DIR)/cells.db"
You can use the following steps to open the .ddc
file generated during
synthesis.
- Choose File > Read from the menu
- Open the
post-synth.dcc
file
You can use the following steps to view the gate-level schematic for the
MinMaxUnit
:
- Select the
minmax0_S1
module in the Logical Hierarchy panel - Choose Schematic > New Schematic View from the menu
- Double click the box representing the
MinMaxUnit
in the schematic view
This shows you the exact gates used to implement the MinMaxUnit
. You
can use the following steps to view a histogram of path slack, and also
to open a gave-level schematic of just the critical path.
- Choose Timing > Path Slack from the menu
- Click OK in the pop-up window
- Select the left-most bar in the histogram to see list of most critical paths
- Right click first path (the critical path) and choose Path Schematic
This shows you the exact gates that lie on the critical path. Notice that there 11 levels of logic on the critical path. The number of levels of logic on the critical path can provide some very rough first-order intuition on whether or not we might want to explore a more aggressive clock constraint and/or adding more pipeline stages. If there are just a few levels of logic on the critical path then something is probably wrong, while if there are more than 50 levels of logic then there is potentially room for signficant improvement. The following screen capture illutrates using Design Vision to explore the post-synthesis results. While this can be interesting, in this course, we almost always prefer exploring the post-place-and-route results, so we will not really use Synopsys DC that often.
We use Synopsys IC Compiler (ICC) for placing standard cells in rows and then automatically routing all of the nets between these standard cells. We also use Synopsys ICC to route the power and ground rails in a grid and connect this grid to the power and ground pins of each standard cell, and to automatically generate a clock tree to distribute the clock to all sequential state elements with hopefully low skew.
We start by creating a subdirectory for our work, and then launching Synopsys ICC with the GUI.
% mkdir -p $TOPDIR/asic/icc-par
% cd $TOPDIR/asic/icc-par
% icc_shell -gui
As we enter commands we will be able use the GUI to see incremental
progress towards a fully placed-and-routed design. To make it easier to
copy-and-paste commands from this document, we tell Synopsys ICC to
ignore the prefix icc_shell>
using the following:
icc_shell> alias "icc_shell>" ""
We begin by setting the target_library
and link_library
variables as
before.
icc_shell> set_app_var target_library "$env(STDCELLS_DIR)/cells.db"
icc_shell> set_app_var link_library "* $env(STDCELLS_DIR)/cells.db"
ASIC tools need to manipulate increasingly large amounts of design data
and metadata, so most ASIC tools now leverage either proprietary or open
databases for storing and manipulating design data and metadata. Synopsys
ASIC tools use their own Milkyway database format. Synopsys ICC requires
us to create a new Milkway database for this design. The create_mw_lib
command initializes such a database with information about the
interconnect technology (cells.tf
file) and the abstract physical views
of the standard-cell library (cells.fr
file).
icc_shell> create_mw_lib -open \
-tech "$env(STDCELLS_DIR)/cells.tf" \
-mw_reference_library "$env(STDCELLS_DIR)/cells.fr" \
"LIB"
We also need to tell Synopsys ICC more information about the parasitic
resistance and capacitance of each metal layer so it can optimize the
power and signal routing. We use the set_tlu_plus_files
to point
Synopsys ICC to the .tluplus
files.
icc_shell> set_tlu_plus_files \
-max_tluplus "$env(STDCELLS_DIR)/cells-max.tluplus" \
-min_tluplus "$env(STDCELLS_DIR)/cells-min.tluplus" \
-tech2itf_map "$env(STDCELLS_DIR)/tech2itf.map"
We are now ready to import the gate-level Verilog netlist synthesized by Synopsys DC.
icc_shell> import_designs -format verilog "../dc-syn/post-synth.v"
As in Synopsys DC, we must set the clock constraint for timing driven placement and routing, and static timing analysis.
icc_shell> create_clock clk -name ideal_clock1 -period 1
After importing the design you will see all of the standard cells in the
gate-level netlist clustered together in the Synopsys ICC GUI. The first
step is to a floorplan which tells Synopsys ICC how much area we want to
use, how to arrange the rows, where to put the top-level pins, and what
the target utilization should be (higher utilization may result in
smaller area but can also drastically increase tool runtimes). We can use
the create_floorplan
command with most of the default options.
icc_shell> create_floorplan -core_utilization 0.7
The following screen capture illustrates what you should see: a square floorplan and the cells loosely arranged to the right of the floorplan in the Synopsys ICC GUI.
We can use the create_fp_placement
to do a simple placement of the
cells into the square floorplan.
icc_shell> create_fp_placement
Use the "zoom to fit" button in the toolbar to focus on the square floorplan. The following screen capture illustrates what you should see: all of the cells arranged into tight rows within the square floorplan.
As a quick aside, we have provided you with a different color scheme that we personally like better by using the following steps:
- Ensure the View Settings Toolbar is visible by choosing View > Toolbars > View Settings from the menu
- Use the drop-down list next to Preset to choose ece5745
Now that the cells are placed in the floorplan, the next step is power
routing. Recall that each standard cell has internal M1 power and ground
rails which will connect via abutment when the cells are placed into
rows. If we were just to supply power to cells using these rails we would
likely have large IR drop and the cells in the middle of the chip would
effectively be operating at a much lower voltage. During power routing,
we create a grid of power and ground wires on the top metal layers and
then connect this grid down to the M1 power rails in each row. We also
create a power ring around the entire floorplan. Before doing the power
routing, we need to use the derive_pg_connection
command to tell
Synopsys IC compiler which nets are power and which nets are ground
(there are many possible names for power and ground!).
icc_shell> derive_pg_connection \
-power_net "VDD" -power_pin "VDD" \
-ground_net "VSS" -ground_pin "VSS" \
-create_ports top
Now we can use the synthesize_fp_rail
command to create the power grid
and power connections. We need to tell Synopsys ICC what our power budget
is (e.g., 1W), what the target voltage supply is (e.g., 1.2V), and our
target max IR drop (e.g., 250mV).
icc_shell> synthesize_fp_rail \
-power_budget 1000 -voltage_supply 1.2 -target_voltage_drop 250 \
-nets "VDD VSS" \
-create_virtual_rails "M1" \
-synthesize_power_plan -synthesize_power_pads -use_strap_ends_as_pads
The following screen capture illustrates what you should now see: a high-level colormap of the IR drop across the floorplan. You can see that the middle of the floorplan has the highest IR drop (~150mV) while the edge of the floorplan has the lowest IR drop. This specific power routing plan is not the best, and we would probably want to iterate a bit, but it is good enough for first-order area, energy, and timing analysis.
We can go ahead and commit the power routing plan using the
commit_fp_rail
command.
icc_shell> commit_fp_rail
The following screen capture illustrates what you should now see: a top metal-layer grid with horizontal M9 and vertical M8. Synopsys IC will report a bunch of warnings which suggest it is having trouble actually doing the final power routing. Ideally we would need to dig in and fix these warnings but again for now it is probably fine for first-order area, energy, and timing analysis.
Now that the cells are placed in the floorplan, and we have finished
power routing, the next step is clock tree synthesis. We first need to
create a new clock just like we did in Synopsys DC so that Synopsys ICC
knows which net is the clock and the target clock frequency. We can then
use the clock_opt
command to do the actual clock tree synthesis.
icc_shell> clock_opt
You can use the following steps to highlight the clock tree:
- Choose Clock > Color By Clock Trees from the menu
- Select Reload in the sidebar on right
- Click OK in the pop-up window
The following screen capture illustrates what you should now see: an RC-balanced clock tree highlighted in yellow. Notice how the clock enters the floorplan in the lower-left-hand corner and then branches off to create a tree.
The final major step is to route all of the signals in the design. We do
this with the route_opt
command.
icc_shell> route_opt
Zoom into the design to see some of the detailed routing. The following screen capture illustrates what you should now see. All routes on a given metal layer always go in one direction: M2 vertical, M3 horizontal, M4 vertical, M5 horizontal, etc. All routing is on a predefined grid of routing tacks, and all pins are also on the same grid to simplify automatic routing.
The final step is to insert "filler cells" in between the standard cells. Filler cells are like standard cells except they do not contain any logic. The primary purpose of filler cells is to ensure the M1 power and ground rails are connected and that there are no design rule violations with respect to the nwells.
icc_shell> insert_stdcell_filler \
-cell_with_metal "SHFILL128 SHFILL64 SHFILL3 SHFILL2 SHFILL1" \
-connect_to_power "VDD" -connect_to_ground "VSS"
Our design is now on silicon! We can generate the resulting post-place-and-route Verilog gate-level netlist.
icc_shell> write_verilog "post-par.v"
We can also first extract the parasitic resistance and capacitance for
all routes in the design, and then write these parasitics to an .sbpf
file. This parasitic information will be used for power analysis.
icc_shell> extract_rc
icc_shell> write_parasitics -format sbpf -output "post-par.sbpf"
We also need to generate the final layout which is what we might
eventually send off to the foundry for fabrication. We can do this by
first streaming in the .gds
for all of the standard cells and then
streaming out the complete design including all of the instantiated
standard cells.
icc_shell> read_stream -format gds "$env(STDCELLS_DIR)/cells.gds"
icc_shell> write_stream -format gds post-par.gds
As in Synopsys DC, we can use various commands to generate reports about
area, energy, and timing. The report_timing
command will show the
critical path through the design. Part of the report is displayed below.
icc_shell> report_timing -nosplit -transition_time -nets -attributes
...
Point Fanout Trans Incr Path
--------------------------------------------------------------------------
clock ideal_clock1 (rise edge) 0.00 0.00
clock network delay (propagated) 0.00 0.00
elm_S0S1$003/out_reg[1]/CLK (DFFX1) 0.01 0.00 0.00 r
elm_S0S1$003/out_reg[1]/Q (DFFX1) 0.06 0.18 0.18 r
elm_S0S1$003/out[1] (net) 4 0.00 0.18 r
elm_S0S1$003/out[1] (Reg_0x45f1552f10c5f05d_5) 0.00 0.18 r
elm_S0S1$003$out[1] (net) 0.00 0.18 r
minmax1_S1/in1[1] (MinMaxUnit_0x152ab97dfd22b898_0) 0.00 0.18 r
minmax1_S1/in1[1] (net) 0.00 0.18 r
minmax1_S1/U5/Q (OA21X1) 0.04 0.09 0.27 r
minmax1_S1/n16 (net) 1 0.00 0.27 r
minmax1_S1/U42/QN (AOI222X1) 0.03 0.12 0.39 f
minmax1_S1/n17 (net) 1 0.00 0.39 f
minmax1_S1/U43/Q (AO221X1) 0.04 0.06 0.45 f
minmax1_S1/n18 (net) 1 0.00 0.45 f
minmax1_S1/U44/Q (OA221X1) 0.04 0.08 0.52 f
minmax1_S1/n19 (net) 1 0.00 0.52 f
minmax1_S1/U45/Q (AO221X1) 0.05 0.07 0.59 f
minmax1_S1/n20 (net) 2 0.00 0.59 f
minmax1_S1/U46/Q (OA221X1) 0.04 0.08 0.67 f
minmax1_S1/n21 (net) 1 0.00 0.67 f
minmax1_S1/U47/Q (OA22X1) 0.04 0.07 0.74 f
minmax1_S1/n23 (net) 2 0.00 0.74 f
minmax1_S1/U23/QN (AOI21X2) 0.07 0.10 0.85 r
minmax1_S1/n8 (net) 12 0.00 0.85 r
minmax1_S1/U10/Q (AO22X1) 0.05 0.08 0.93 r
minmax1_S1/out_min[3] (net) 1 0.00 0.93 r
minmax1_S1/out_min[3] (MinMaxUnit_0x152ab97dfd22b898_0) 0.00 0.93 r
minmax1_S1$out_min[3] (net) 0.00 0.93 r
elm_S1S2$002/in_[3] (Reg_0x45f1552f10c5f05d_10) 0.00 0.93 r
elm_S1S2$002/in_[3] (net) 0.00 0.93 r
elm_S1S2$002/out_reg[3]/D (DFFX1) 0.05 0.00 0.93 r
data arrival time 0.93
clock ideal_clock1 (rise edge) 1.00 1.00
clock network delay (propagated) 0.00 1.00
elm_S1S2$002/out_reg[3]/CLK (DFFX1) 0.00 1.00 r
library setup time -0.07 0.94
data required time 0.94
--------------------------------------------------------------------------
data required time 0.94
data arrival time -0.93
--------------------------------------------------------------------------
slack (MET) 0.01
Recall that our post-synthesis timing report identified a critical path
starting at the elm_S0S1[2]
register and ending at the elm_S1S2[3]
register, but our post-place-and-route timing report has identified a
different critical path starting at the elm_S0S1[3]
register and ending
at the `elm_S1S2[2] register. In addition, the estimated worst case
critical path delay in the post-synthesis timing report was 1.49ns with
an estimated cycle time of 1.57ns, however our post-place-and-route
timing report now estimates the worst case critical path is only 0.93ns
and the cycle time would actually be faster than the target of 1ns. This
large discrepancy is because the post-synthesis timing report must make
very rough assumptions about placement and thus interconnect delay, while
the post-place-and-route timing report uses the real placement to create
a much more accurate estimate of the interconnect delay. In this case,
the post-synthesis interconnect delay estimate was far too conservative,
but it is also very possible for the post-synthesis interconnect delay
estimate to be far too optimistic. This is why we avoid using
post-synthesis timing reports for meaningful design-space exploration,
and we instead focus on post-place-and-route timing reports.
We can also graphically view how the critical path traverses the final layout using the following steps:
- Choose Timing > New Timing Analysis Window from the menu
- Focus on Select Paths window, click OK
- List of paths should appear
- Select first path (critical path) to see it highlighted in the layout view
The following screen capture shows the critical path for the sort unit on the final layout.
As in Synopsys DC, the report_area
command can show how much area each
module uses and can enable detailed area breakdown analysis.
icc_shell> report_area -nosplit -hierarchy
...
Combinational area: 2210.918383
Buf/Inv area: 392.601609
Noncombinational area: 2463.436769
Macro/Black Box area: 0.000000
Net Interconnect area: 175.165808
Total cell area: 4674.355152
Total area: 4849.520960
Global Local
Cell Area Cell Area
---------- ----------------
Hierarchical cell Abs Non Black-
Total % Comb Comb boxes
------------------ ----- --- ----- ----- --- -------------------------------
SortUnitStructRTL 4674.3 100 0.0 0.0 0.0 SortUnitStructRTL
elm_S0S1$000 205.5 4.4 6.4 199.0 0.0 Reg_0x45f1552f10c5f05d_8
elm_S0S1$001 205.5 4.4 6.4 199.0 0.0 Reg_0x45f1552f10c5f05d_7
elm_S0S1$002 211.9 4.5 12.9 199.0 0.0 Reg_0x45f1552f10c5f05d_6
elm_S0S1$003 199.0 4.3 0.0 199.0 0.0 Reg_0x45f1552f10c5f05d_5
elm_S1S2$000 204.5 4.4 5.5 199.0 0.0 Reg_0x45f1552f10c5f05d_0
elm_S1S2$001 204.5 4.4 5.5 199.0 0.0 Reg_0x45f1552f10c5f05d_11
elm_S1S2$002 199.0 4.3 0.0 199.0 0.0 Reg_0x45f1552f10c5f05d_10
elm_S1S2$003 204.5 4.4 5.5 199.0 0.0 Reg_0x45f1552f10c5f05d_9
elm_S2S3$000 199.0 4.3 0.0 199.0 0.0 Reg_0x45f1552f10c5f05d_4
elm_S2S3$001 199.0 4.3 0.0 199.0 0.0 Reg_0x45f1552f10c5f05d_3
elm_S2S3$002 199.0 4.3 0.0 199.0 0.0 Reg_0x45f1552f10c5f05d_2
elm_S2S3$003 199.0 4.3 0.0 199.0 0.0 Reg_0x45f1552f10c5f05d_1
minmax0_S1 429.4 9.2 429.4 0.0 0.0 MinMaxUnit_0x152ab97dfd22b898_3
minmax0_S2 438.6 9.4 438.6 0.0 0.0 MinMaxUnit_0x152ab97dfd22b898_4
minmax1_S1 434.9 9.3 434.9 0.0 0.0 MinMaxUnit_0x152ab97dfd22b898_0
minmax1_S2 467.2 10.0 467.2 0.0 0.0 MinMaxUnit_0x152ab97dfd22b898_2
minmax_S3 364.9 7.8 364.9 0.0 0.0 MinMaxUnit_0x152ab97dfd22b898_1
val_S0S1 35.9 0.8 11.0 24.8 0.0 RegRst_0x2ce052f8c32c5c39_1
val_S1S2 35.9 0.8 11.0 24.8 0.0 RegRst_0x2ce052f8c32c5c39_2
val_S2S3 35.9 0.8 11.0 24.8 0.0 RegRst_0x2ce052f8c32c5c39_0
------------------ ------ --- ------ ------ --- -------------------------------
Total 2210.9 2463.4 0.0
If we compare our post-synthesis area report to the post-place-and-route area report, we will see they are roughly in the same ballpark, although they are definitely not identical. The post-place-and-route area report is far more accurate since it incorporates the actual placement of the standard cells along with filler cells. In general, we will always be using post-place-and-route area estimates.
We can also graphically view which cells are associated with various modules in the layout using the following steps:
- Choose Placement > Color By Hierarchy from the menu
- select Reload in the sidebar on right
- Select Color hierarchical cells at level in the pop-up window
- Click OK in the pop up
We call the resulting plot an "amoeba plot" because the tool often generates blocks that look like amoebas. The following screen capture shows the amoeba plot for the sort unit.
The sidebar on the right you can see how many cells are in each module,
and you can also see how the critical path physically goes through
minmax1_S1
as indicated in the post-place-and-route timing report.
As in Synopsys DC, we can use the report_power
command can show how
much power each module consumes, but since this estimate will still use
statistical power estimation it is still not of much use.
icc_shell> report_power -nosplit -hierarchy
Finally, we go ahead and save/close the Milkyway databse and exit Synopsys ICC.
icc_shell> close_mw_lib -save
icc_shell> exit
We can now look at the actual .gds
file for our design to see the final
layout including all of the cells and the interconnect using the
open-source Klayout GDS viewer.
% cd $TOPDIR/asic/icc-par
% klayout -l $STDCELLS_DIR/cells.lyp post-par.gds
The following screen capture illutrates using Klayout to view the layout for the entire sort unit.
The following figure shows a zoomed portion of the layout. You can clearly see the active layer inside the standard cells along with the signal routing on the lower metal layers and the power routing on the upper metal layers.
Synopsys PrimeTime (PT) is primarily used for very accurate "sign-off" static timing analysis (more accurate than the analysis performed by Synopsys DC and ICC), but in this course, we will only use Synopsys PT for power analysis. There are many ways to perform power analysis. As mentioned earlier, the post-synthesis and post-place-and-route power reports use statistical power analysis where we simply assume some toggle probability on each net. For more accurate power analysis we need to find out the actual activity for every net for a given experiment. One way to do this is to perform post-place-and-route gate-level simulation. In other words, we can run a benchmark executing on the gate-level netlist generated after place-and-route. These kind of gate-level simulations can be very, very slow and are tedious to setup correctly.
In this course we will use a simpler approach that is slightly less
accurate than gate-level simulation. We will use the per-net activity
factors from the RTL simulation which were stored in an .saif
file
earlier in this tutorial. The challenge is that the activity factors in
the .saif
file are from RTL simulation, and the RTL model is obviously
not a one-to-one match with the gate-level model. The RTL model might
have had a net called result_plus_4
, but the logic may have been
optimized during synthesis and place-and-route such that this net is
simply not in the gate-level model. Similarly, the gate-level model will
have many nets which do not map to corresponding nets in the RTL model.
Fortunately, there will be many nets that do match between the RTL and
gate-level models. For example, if we have a register in our RTL model
then that register is guaranteed to be in the gate-level model. So if we
know the activity factor of each bit in the register from the .saif
file, then we know the activity factor or each corresponding flip-flop in
the gate-level model. Similarly, if we have a module port in our RTL
model and we have not flattened the design, then that module port is
guaranteed to be in the gate-level model (another good reason not to
flatten your design!). Synopsys PT can use sophisticated algorithms
including many tiny little gate-level simulations of just a few gates in
order to estimate the activity factor of all nets downstream from the
valid activity factors included in the .saif
. For more information
about this kind of power analysis, consult Chapter 5 (more specifically,
the section titled "Estimating Non-Annotated Switching Activity" of the
PrimeTime PX User
Guide.
We start by creating a subdirectory for our work, and then launching Synopsys PT.
% mkdir -p $TOPDIR/asic/pt-pwr
% cd $TOPDIR/asic/pt-pwr
% pt_shell
To make it easier to copy-and-paste commands from this document, we tell
Synopsys PT to ignore the prefix pt_shell>
using the following:
pt_shell> alias "pt_shell>" ""
We begin by setting the target_library
and link_library
variables as
before.
pt_shell> set_app_var target_library "$env(STDCELLS_DIR)/cells.db"
pt_shell> set_app_var link_library "* $env(STDCELLS_DIR)/cells.db"
Since Synopsys PT is primarily used for static timing analysis, we need to explicitly tell Synopsys PT that we want to use it for power analysis.
pt_shell> set_app_var power_enable_analysis true
We now read in the gate-levle netlist, tell Synopsys PT we want to do power analysis for the top-level module, and link the design (i.e., recursively resolve all of the module references starting from the top-level module).
pt_shell> read_verilog "../icc-par/post-par.v"
pt_shell> current_design SortUnitStructRTL_0x73ab8da9cdd886de
pt_shell> link_design
In order to do power analysis, Synopsys PT needs to know the clock period. Here we will set the clock frequency to be the same as the initial clock constraint, but note that this is only valid if our design has no negative or positive slack. If our design has negative or positive slack, we want to adjust the clock period to reflect the clock frequency that can actually be achieved by the design.
pt_shell> create_clock clk -name ideal_clock1 -period 1
The key to making this approach to power analysis work, is to ensure as
many nets as possible match between the .saif
generated from RTL and
the gate-level netlist. Synopsys DC will change the names of various nets
as it does synthesis, so if you recall we used some extra commands in
Synopsys DC to generate a .namemap
file. This name mapping file maps
high-level RTL names to low-level gate-level names and will result in a
better match between these two models. So the next step is to read in
this previously generated .namemap
file.
pt_shell> source ../dc-syn/post-synth.namemap
We are now ready to read in the actual activity factors which will be
used for power analysis. The .saif
file comes from a .vcd
file which
in turn came from running a simulation with a test harness. We need to
strip off part of the instance names in the .saif
file since the
gate-level netlist does not have this test harness. Again, the key is to
make sure we do everything we can to ensure as many nets as possible
match between the .saif
generated from RTL and the gate-level netlist.
pt_shell> read_saif "../../sim/build/sort-rtl-struct-random.saif" -strip_path "TOP/v"
The .db
file includes parasitic capacitance estimates for every pin of
every standard cell, but to improve the accuracy of power analysis, we
also need to include parasitic capacitances from the interconnect. Recall
that we used Synopsys ICC to generate exactly this information in a
.sbpf
file. So we now read in these additional parasitic capacitance
values for every net in the gate-level netlist.
pt_shell> read_parasitics -format sbpf "../icc-par/post-par.sbpf.max"
We now have everything we need to perform the power analysis: (1) the
activity factor of a subset set of the nets, (2) the capacitance of every
net/port, (3) the supply voltage, and (4) the clock frequency. We use the
update_power
command to propagate activity factors to unannotated nest
and to estimate the power of our design.
pt_shell> update_power
We can use the report_power
command to show a high-level overview of
how much power the sort unit consumes.
pt_shell> report_power -nosplit
...
Internal Switching Leakage Total
Power Group Power Power Power Power ( %)
-----------------------------------------------------------
clock_network 3.1e-03 0.0 0.0 3.1e-03 (63.27%) i
register -9.1e-05 3.4e-04 1.3e-05 2.6e-04 ( 5.29%)
combinational 9.9e-04 5.6e-04 1.0e-05 1.5e-03 (31.44%)
sequential 0.0 0.0 0.0 0.0 ( 0.00%)
memory 0.0 0.0 0.0 0.0 ( 0.00%)
io_pad 0.0 0.0 0.0 0.0 ( 0.00%)
black_box 0.0 0.0 0.0 0.0 ( 0.00%)
Net Switching Power = 9.049e-04 (18.15%)
Cell Internal Power = 4.057e-03 (81.38%)
Cell Leakage Power = 2.351e-05 ( 0.47%)
---------
Total Power = 4.985e-03 (100.00%)
These numbers are in Watts. We can see that the sort unit consumes ~5mW of power when processing random input data. Power is the rate change of energy (i.e., energy divided by execution time), so the total energy is just the product of the total power, the number of cycles, and the cycle time. When we ran the sort unit simulator at the beginning of the tutorial, we saw that the simulation required 105 cycles. Assuming our sort unit runs as 1ns, this means the total energy is 512pJ. Since we are doing 100 sorts, this corresponds to about 5.12pJ per sort.
The power is broken down into internal, switching, and leakage power. Internal and switching power are both forms of dynamic power, while leakage power is a form of static power. Notice that in this case, the dynamic power is much more significant than the static power. Internal power was described earlier in this tutorial, so you may want to revisit that section. Note that internal power includes short circuit power, but it can also include the local clock power internal to the cell. In this overview, the power is also broken down by the power consumed in the global clock network, registers, and combinational logic. Note that the internal power for the registers is negative. This is normal. As mentioned earlier in this tutorial, this is just an artifact of how we do the accounting. Switching power is the power dissipated by the charging and discharging of the load capacitance at the output of each cell. Leakage power is the constant power due to subthreshold leakage. Sometimes we might want to factor out the static leakage power and focus more on the dynamic energy since including leakage power would mix energy and performance (i.e., using more cycles requires more leakage power even if we are not doing any more work during those cycles).
Although the above breakdown is somewhat useful, it is even more useful
to use the report_power
command to show how much power each module
consumes in the design.
pt_shell> report_power -nosplit -hierarchy
...
Int Switch Leak Total
Hierarchy Power Power Power Power %
---------------------------------------------------------------------
SortUnitStructRTL 4.06e-03 9.05e-04 2.35e-05 4.99e-03 100.0
elm_S1S2$000 (Reg_0) 2.51e-04 3.90e-05 1.09e-06 2.91e-04 5.8
elm_S1S2$001 (Reg_11) 2.62e-04 4.22e-05 1.11e-06 3.05e-04 6.1
elm_S1S2$002 (Reg_10) 2.49e-04 2.90e-05 1.07e-06 2.80e-04 5.6
elm_S1S2$003 (Reg_9) 2.55e-04 2.94e-05 1.11e-06 2.86e-04 5.7
elm_S2S3$000 (Reg_4) 2.07e-04 1.79e-06 1.05e-06 2.09e-04 4.2
elm_S2S3$001 (Reg_3) 2.57e-04 3.14e-05 1.08e-06 2.89e-04 5.8
elm_S2S3$002 (Reg_2) 2.58e-04 3.02e-05 1.08e-06 2.89e-04 5.8
elm_S2S3$003 (Reg_1) 2.20e-04 3.91e-07 1.08e-06 2.21e-04 4.4
val_S0S1 (RegRst_1) 2.31e-05 1.48e-07 1.64e-07 2.34e-05 0.5
val_S2S3 (RegRst_0) 2.34e-05 2.00e-07 1.64e-07 2.38e-05 0.5
minmax0_S1 (MinMaxUnit_3) 1.93e-04 1.13e-04 1.90e-06 3.08e-04 6.2
minmax0_S2 (MinMaxUnit_4) 1.99e-04 1.10e-04 1.99e-06 3.11e-04 6.2
minmax1_S1 (MinMaxUnit_0) 2.03e-04 1.15e-04 1.99e-06 3.20e-04 6.4
minmax1_S2 (MinMaxUnit_2) 2.12e-04 1.23e-04 2.13e-06 3.37e-04 6.8
val_S1S2 (RegRst_2) 2.33e-05 1.49e-07 1.64e-07 2.36e-05 0.5
minmax_S3 (MinMaxUnit_1) 1.69e-04 8.15e-05 1.63e-06 2.52e-04 5.1
elm_S0S1$000 (Reg_8) 2.66e-04 4.29e-05 1.18e-06 3.10e-04 6.2
elm_S0S1$001 (Reg_7) 2.57e-04 3.21e-05 1.18e-06 2.90e-04 5.8
elm_S0S1$002 (Reg_6) 2.75e-04 5.29e-05 1.29e-06 3.29e-04 6.6
elm_S0S1$003 (Reg_5) 2.56e-04 3.08e-05 1.08e-06 2.88e-04 5.8
From this breakdown, you can see a pretty even distribution of the power
across the modules. Note that the power consumed in the registers is
actually comparable or even a bit higher compared to the combinational
logic in each MinMaxUnit
.
Finally, we go ahead and exit Synopsys PT.
pt_shell> exit
Take a few minutes to reflect on all we have accomplished. We have taken an initial RTL design and transformed into detailed layout, but we have also been able to quantitatively analyze the area, energy, and timing of this design using a state-of-the-art ASIC toolflow.
Students are welcome to use Verilog instead of PyMTL to design their RTL models. Having said this, we will still exclusively use PyMTL for all test harnesses, FL/CL models, and simulation drivers. This really simplifies managing the course, and PyMTL is actually a very productive way to test/evaluate your Verilog RTL designs. We use PyMTL's Verilog import feature described in the Verilog tutorial to make all of this work. The following commands will run all of the tests on the Verilog implementation of the sort unit.
% cd $TOPDIR/sim/build
% rm -rf *
% py.test ../tut4_verilog/sort
As before, the tests for the SortUnitStructRTL will fail. You can just copy over your implementation of the MinMaxUnit from when you completed the Verilog tutorial. If you have not completed the Verilog tutorial then go back and do that now. After running the tests we use the sort unit simulator to translate the PyMTL RTL model into Verilog and to dump the VCD file that we want to use for power analysis.
% cd $TOPDIR/sim/build
% ../tut4_verilog/sort/sort-sim --impl rtl-struct --translate --dump-vcd
% vcd2saif -input sort-rtl-struct-random.verilator1.vcd -output sort-rtl-struct-random.saif
Take a moment to open up the translated Verilog which should be in a file
named SortUnitStructRTL_0x73ab8da9cdd886de.v
. You might ask, "Why do we
need to use PyMTL to translate the Verilog if we already have the
Verilog?" PyMTL will take care of preprocessing all of your Verilog RTL
code to ensure it is in a single Verilog file. This greatly simplifies
getting your design into the ASIC flow. This also ensures a one-to-one
match between the Verilog that was used to generate the VCD file and the
Verilog that is used in the ASIC flow.
Once you have tested your design and generated the single Verilog file and the VCD file, you can push the design through the ASIC flow using the exact same steps we used above.
Spend some time experimenting with the flow. Try pushing either the PyMTL or Verilog implementation of the sort unit through the flow with a more aggressive cycle time (e.g., 0.9ns). Can the design operate at this higher clock frequency?
Try flattening the design during synthesis by using this command:
dc_shell> compile -ungroup_all
or try using the compile_ultra
command with (or without) flattening to
see if it improves the quality of results.
dc_shell> compile_ultra -no_autoungroup