-
Workload models.
- Add ResNet-50.
-
Software engineering.
-
Port code to Python 3; drop Python 2 support.
-
Add pylintrc.
-
-
Software models.
- Share the same partition between input layer and external layers. For layered LSTMs, the number of external layers could be quite a few (e.g., 8). The complete combination of all partition choices of these external layers is too high to explore. So we assume all input and external layers share the same partition scheme.
-
Software engineering.
- Allow both relative and absolute overheads in
approx_dividable
.
- Allow both relative and absolute overheads in
- In
LoopBlockingScheme
, put weight-pinning code block after buffer sharing.
-
Hardware models.
-
Access forwarding.
-
Buffer sharing scheme.
- Use
BufShrScheme
class to represent and calculate NoC transfers.
- Use
-
-
Software models.
-
Add
SchedulingConstraint
class to specify loop blocking and partitioning constraints.- Add lazily updated rules to allow refine constraint with previous scheduling results at runtime.
- Add subclass
SchedulingConstraintLayerPipeline
for layer pipelining constraints.
-
Add
InterLayerPipeline
.- Layers are organized into
PipelineSegment
, which are simultaneously mapped on to the resource both spatially and temporally. - Each layer in the segment has a 3-tuple scheduling index including segment index, spatial index, and temporal index.
- Each layer in the segment has its resource allocation and scheduling constraint.
- Use
PipelineSegmentTiming
to capture the timing relation of layers in the segment. - Specify maximum allowed execution time overhead due to layer pipelining
in
Option
. - Specify maximum pipelining degree for layer pipelining in
Option
.
- Layers are organized into
-
Add layer pipelining optimizations.
- Ofmap forwarding: alternate layer loop ordering.
- Ifmap forwarding: sharing the same inputs from memory to multiple regions.
- Support model weight pinning when no resource time-multiplexing.
- Allow disabling optimizations for layer pipelining to fall back to basic pipelining techniques.
-
-
Hardware models.
-
Allow data source/destination regions in
Resource
to be non-DATA type. -
Allow
NodeRegion
to be folded along the w dimension in a zig-zag manner.
-
-
Software models.
-
LoopBlockingScheme
supports access forwarding and buffer sharing. -
LoopBlockingScheme
supports remote node buffers as data regions (non-data type data regions). -
partition
unit number of hops calculation supports access forwarding and buffer sharing. -
DataLayout
supports closest-first forwarding data transfer for access forwarding and buffer sharing. -
Refactor
NNDataflow
andNNDataflowScheme
to incorporate inter-layer pipelining.
-
-
Workload models.
-
Add
EltwiseLayer
. -
Allow only concatenation of layers; summation of layers turns into an additional
EltwiseLayer
. -
Add external layers to networks, which is external directly input data.
-
Add various LSTMs.
-
Add
data_loops
attribute with typeDataDimLoops
to each type of layer.
-
-
Hardware models:
-
Add DRAM region in
Resource
. -
Consider array bus width and its impact on data multicast latency.
-
Consider DRAM access time due to bandwidth limit.
-
-
Software models.
- Add choices for optimization goal: E(nergy), D(elay), or ED.
-
Software engineering.
-
Record search time.
-
Add utility
IntRange
for integer ranges. -
Add
HashableDict
class.
-
-
Hardware models.
-
2D memory type is changed to constant four node on the chip corners.
-
NodeRegion
addsdist
attribute for inter-node distance. -
NodeRegion
renamesDATA
enum toDRAM
. -
Limit to single source/destination data regions in
Resource
.
-
-
Software models.
-
Cost
uses static/idle unit cost for all nodes instead of one node. -
Scheduling
breaks loop/part cost into op/access/noc/static cost. -
Scheduling
breaks cost tie using time, using a compare key function ofSchedulingResult
. -
Add external occupancy to
MapStrategy
and merge intoNestedLoopDesc
; use it for partitioning occupancy. -
FmapRange.beg_end()
returns anIntRange
instance if with a single attribute argument, or a list ofIntRange
otherwise. -
Move partitioning scheme sub-
FmapRange
method, which is used to get partitioned fmap ranges, toPartitionScheme
. -
Move partitioning scheme projection, which is used to generate ofmap layout, to
PartitionScheme
. -
DataLayout
refactored: usePartitionScheme
to replaceFmapRangeMap
. -
partition
module refactored: use newDataLayout
class and newPartitionScheme
methods. -
SchedulingResult
uses a combinedOrderedDict
to replacedict_loop
anddict_part
. -
In partitioning schemes, each partitioning must fully utilize one dimension before starting the other, except for fmap partitioning.
-
-
Software engineering.
-
Change
NNDataflowScheme
node-time product interface to explicitly be static cost. -
Improve method names:
- Remove
DataDimLoops.data_cnt
. - Change
NodeRegion.node_iter
toNodeRegion.iter_node
. - Change
Network
method names to distinguish layer and layer name.
- Remove
-
-
Output dict
PartitionScheme
format fix. -
idivc
with inf arguments. -
Integer division
//
vs./
. -
ITCN access calculation for unit pass in
MapStrategy
. -
FmapRange
comparison. -
Unit nhops calculation for filter data uses DRAM region.
-
Unit nhops calculation considers nodes with both non-empty ifmaps and ofmaps.
-
Replication size when underutilizing PE arrays.
-
Workload models.
-
Network
method to return next layers. -
Network
usesNone
in previous/next layers for the input/output layers. -
Network
methods to return the first/last layers. -
Add batch size argument to layer fmap size methods.
-
Add default filter size to
FCLayer
. -
Add
DataDimLoops
class to denote loops that are dimensions of a data category. -
Add neural neworks: MLP-L/M/S from PRIME ISCA 2016.
-
-
Software models.
-
Add statistic properties to
SchedulingResult
. -
Add
NNDataflowScheme
class for overall NN dataflow.
-
-
Software engineering.
-
Add utilities to
LoopBlockingScheme
class. -
Add negative operation to
PhyDim2
. -
Add default arguments to
Option
.
-
-
Test.
- Add unit tests.
-
Workload models.
-
Relax
__len__
ofNetwork
to work before setting input layer. -
Allow different height and width for filters in
ConvLayer
.
-
-
Hardware models:
-
Upgrade node dimensions to node region in
Resource
. The origins of node region and memory regions are all absolute. -
Add
type
attribute toNodeRegion
to differentiate processing and data node regions inResource
. -
Change default cost of the NoC hop traversal.
-
-
Software models:
-
Add loop index generator to
LoopBlockingScheme
class. -
PE array mapping for
LocalRegionLayer
reduces regfile size. -
Loop blocking scheme result stats change from one node to all nodes.
-
Move partition occupation into
LoopBlockingScheme
constructor. -
Move
LoopBlockingScheme
verification to tests. -
Improve the workload partitioning for loop blocking exhaustive search.
-
Merge
loopcnt
attribute ofNestedLoopDesc
to a tuple. -
Change
LoopBlockingScheme
interface for blocking factors and loop orders. -
Loop blocking exhaustive search introduces regularized schemes and suboptimal schemes, to enable more skips. Also restrict the skips to CONV layer.
-
Refactor loop blocking bypass solvers, and restrict it to CONV layer.
-
Use row-stationary mapping to
LocalRegionLayer
, and merge with that ofConvLayer
. -
Generalize
LoopBlockingScheme
access model for arbitrary data loops. -
Skip equivalence when generating
PartitionScheme
. -
Check ifmap layout against layer parameters in
Scheduling
. -
Add number of nodes to scheduling result.
-
Add
type
attribute toDataLayout
to denote the type of the reside region. -
Add guarantee to generate
PartitionScheme
.
-
-
Software engineering.
-
Lazily evaluate loop blocking stats.
-
Use rich comparison instead of
__cmp__
. -
Convert
RuntimeError
exceptions to assertions. -
Define
__repr__
for class stringify, and removeStringifyClass
. -
Move map strategy class into
NNDataflow
constructor. -
Reorganize package structure.
-
Use lower-case name for all modules.
-
Add local version number to output.
-
-
Output data fetch count.
-
Error types and message typos.
-
FmapRange
comparison: overlapping ranges cannot compare. -
Multiple bugs fixed in
Util
. -
Multiple bugs fixed in
PartitionScheme
. -
Use GBUF unit access for DRAM when bypassing GBUF.
-
Partitioned ifmap range for
LocalRegionLayer
. -
Clarify ITCN accesses to be number of individual transfers to each REGF.
-
Partition
unit number of hops calculation ignores zero-sized data ranges.
-
Software models.
- Partition schemes.
- Input partitioning: partition different input fmaps (channels).
- Partition schemes.
-
Explorers and solvers:
- Loop blocking exhaustive search skips more equivalent schemes.
- Adjacent same loops in different hierarchy levels.
- Loop blocking exhaustive search skips more equivalent schemes.
-
Software engineering
- Verbose mode.
-
Software models:
- Loop blocking.
- Avoid initial zero-value fetch for output data.
- Loop blocking.
-
Software engineering.
-
Use a single global argument parser.
-
Introduce ContentHashClass.
-
-
FmapRange comparison.
-
Map strategy bug when filters are folded.
-
Workload models:
- Support loops: ifmap channel loop, ofmap channel loop, batch loop.
-
Software models:
-
Loop index generator for different loop blocking schemes.
-
Debug mode:
- Verification of the loop blocking access model.
-
-
Explorers and solvers:
- Loop blocking exhaustive search skips equivalent schemes.
-
Software models:
- Loop blocking data buffer and reuse models.
- Loop orders now also consider the order of batch loop.
- Change the model for trivial loops (with blocking factor 1).
- Loop blocking data buffer and reuse models.
-
Explorers and solvers.
- Performance improvements.
- Add loop blocking scheme cache in Scheduling.
- Use a single Scheduling instance for all same layers.
- Performance improvements.
-
Software engineering
-
Class instance used as dict key.
- Add value-based equality and hash to Layer.
- Add value-based equality and hash to PartitionScheme.
-
Add version number to output json dump.
-
-
Explorers and solvers:
- Better formatting of verification results against Eyeriss.
-
Software engineering
-
Replace numpy for better performance.
-
Move multiprocessing into loop blocking exploration for better performance scaling.
-
-
Workload models.
-
Network
: a DAG of layers, rather than a linear pipeline. -
New layer types: pooling layer (local region layer).
-
Enforce layer chaining has matched data size.
-
New neural network: GoogLeNet.
-
-
Hardware models.
NodeRegion
.- Used to denote memory regions, i.e., relative positions and sizes of memories to the computation node NoC.
- Support 2D memories, which are on the edges of the chip.
-
Software models.
-
FmapPosition
andFmapRange
: a position and a range in batched fmaps.FmapRangeMap
: efficient map structure ofFmapPosition
type.
-
DataLayout
: describes the layer i/ofmap data layout.- Use a
FmapRangeMap
to map each data element to the stored node.
- Use a
-
Partition schemes.
- Batch partitioning: partition input data within a batch.
-
-
Explorers and solvers.
-
SchedulingResultDict
: store layer scheduling results of a network. -
More checks to enforce the schedules have the correct number of operations as the given workloads.
-
-
Workload models.
- Update all network structures to include pooling layers.
-
Software models.
- Allow different partitioning factor along height and width of a fmap, i.e., allow different height and width sizes of the partitioned fmap.
-
Explorers and solvers.
NNDataflow
: new top-level class.
-
Software engineering:
-
Significant code refactoring to improve modularity.
- More classes, e.g.,
MapStrategy
,LoopBlockingScheme
,Scheduling
.
- More classes, e.g.,
-
Code style lint.
-
Update option names to be more uniform.
-
Standardize class stringify.
-
- Option
--hybrid_partition2d
, now is--hybrid_partition
.
-
Use of
map()
function inPhyDim2
. -
Name of
namedtuple
sublasses. -
Structure configuration of ResNet152.
-
Workload models:
-
Two layer types: convolutional layer and fully-connected layer.
-
Supported neural neworks: AlexNet, VGG, VGG19, ZFNet, ResNet152.
-
Supported data categories: ifmaps, ofmaps, and weights.
-
-
Hardware models:
-
2D Network-on-Chip (NoC) on the PE array (node) level.
-
2D PE array on the PE level.
-
Memory hierarchy:
- regf: register file in a PE.
- itcn: interconnect between PEs in an array.
- gbuf: global buffer of an array.
- dram: main memory.
-
Cost (energy) of computation operations (MAC, etc.), memory hierarchy accesses, NoC hop traversals, and static leakage.
-
-
Software models:
-
Eyeriss Row-Stationary mapping to PE array (Chen et al., ISCA 2016).
-
Loop blocking schemes over ifmap channel, ofmap channel, and batch loops.
- Loop reordering: exchange loop order.
- Loop blocking: split a loop into multiple ones.
-
Partition schemes to split the workload of a layer to different nodes.
- Fmap partitioning: partition height/width of a fmap.
- Output partitioning: partition different fmaps (channels).
-
-
Explorers and solvers:
-
Per-layer schedule exploration.
-
Exhaustive search loop blocking schemes and partitioning schemes.
-
Analytically solve bypass loop ordering (Gao et al., ASPLOS 2017).
-
Naive partitioning scheme (Kim et al., ISCA 2016).
-
-
Software engineering
- Support multi-process parallel processing.