Implement Matrix class to abstract algorithms away from data storage details #54
Comments
I am in general against writing a matrix class, or using an existing matrix library. I am also a little against Matlab-style implementations. For example, `activations = input.exp();` effectively allocates two arrays, activations and probs, and then discards the temporaries. I do like the idea of separating interface from actual implementations.

Yangqing

On Thu, Jan 23, 2014 at 9:20 PM, kloudkl notifications@github.com wrote:
Thanks for your suggestions! In the larger context of this proposal, I have been wondering for a while what the vision, scope, prioritized dos, and don'ts of Caffe are. If you have a plan that can direct the community toward a shared destination, it would concentrate the limited resources out there and lead to more effective development and wider adoption in the near future.
Closed per #85.
Currently, the algorithm code is quite aware of the memory layout of the underlying data. Adding a Matrix class in between would separate the concerns of different modules, which is good software-engineering practice.
The biggest benefit is simpler code and improved development productivity. It would also make existing and future algorithms easier to understand. As a result, we would see faster development and wider adoption.
The Matrix class is intended to be a view of the 2D array contained in a Blob. Its main functionality is to provide high-level wrappers for the common operations.
This would let us write code like the following snippets.
The convolution:
The fully connected layer:
The ReLU activation:
activation = input.max(0);
The Softmax activation:
As you can see, the API is highly inspired by MATLAB, which also motivated ArrayFire's C++ interface. But of course the snippets are only rough sketches, and many more details need to be considered. For example, if the performance cost of boost move operations is too high, they could be replaced by shared_ptr, which would complicate user code a little. Another question is whether we should pass in a shared_ptr to the result matrix instead of returning it. More importantly, the GPU code may differ greatly from the CPU code, depending on whether CUDA plays well with the proposed API syntax.
Therefore, this issue's scope is limited to implementing the Matrix classes for both kinds of devices. Porting the algorithms should be deferred to independent issues until benchmark results show no performance gap between the low-level API and the proposed high-level API.
Efforts to refine the API and to help implement it are welcome.