
Parallelize convolutional neural network #2341

Closed
wants to merge 11 commits

Conversation

@adiwajshing commented Mar 26, 2020

Hello all

I was playing around with CNNs in mlpack on the MNIST data set, and for some reason only one core of my CPU was being used while training. When I looked into the code, the convolutional and pooling layers were not parallelized, so I parallelized them using OpenMP. I also made a few incremental improvements to the NaiveConvolution class and the Convolution class, and I have successfully run the test suite on my changes.

With these changes, on my dual-core MacBook Air, I could train in about 55% of the time it was taking earlier. Here are some results:

Without the parallelization:
Reading data ...
Training ...
Epoch 0: Training Accuracy = 9.44709%, Validation Accuracy = 9.40476%time taken: 80s
Epoch 1: Training Accuracy = 15.0106%, Validation Accuracy = 14.9286%time taken: 85s
Epoch 2: Training Accuracy = 19.3148%, Validation Accuracy = 19.119%time taken: 107s
Epoch 3: Training Accuracy = 24.0661%, Validation Accuracy = 23.5714%time taken: 94s
Epoch 4: Training Accuracy = 28.6349%, Validation Accuracy = 28.4048%time taken: 88s
Predicting ...
total time taken: 454; avg. epoch duration: 90.8s

With the parallelization:
Reading data ...
Training ...
Epoch 0: Training Accuracy = 9.7672%, Validation Accuracy = 9.83333%time taken: 49s
Epoch 1: Training Accuracy = 14.4127%, Validation Accuracy = 14.619%time taken: 47s
Epoch 2: Training Accuracy = 19.6376%, Validation Accuracy = 19.0952%time taken: 47s
Epoch 3: Training Accuracy = 25.0317%, Validation Accuracy = 24.5%time taken: 47s
Epoch 4: Training Accuracy = 29.8677%, Validation Accuracy = 29.9286%time taken: 47s
Predicting ...
total time taken: 237; avg. epoch duration: 47.4s

Without the parallelization:
Training ...
Epoch 0: Training Accuracy = 82.8704%, Validation Accuracy = 82.1429%time taken: 361s
(didn't bother to run more because it was taking so long :/)

With the parallelization:
Reading data ...
Training ...
Epoch 0: Training Accuracy = 82.4815%, Validation Accuracy = 81.9286%time taken: 225s
Epoch 1: Training Accuracy = 88.8439%, Validation Accuracy = 88.5952%time taken: 229s
Epoch 2: Training Accuracy = 91.6323%, Validation Accuracy = 91.7857%time taken: 220s
Epoch 3: Training Accuracy = 93.2116%, Validation Accuracy = 93.1905%time taken: 221s
Epoch 4: Training Accuracy = 94.3386%, Validation Accuracy = 93.8333%time taken: 220s
Predicting ...
total time taken: 1115; avg. epoch duration: 223s

If this looks good, I could clean up my changes and extend this to AtrousConvolution and the other classes that need changes.

@mlpack-bot commented Mar 26, 2020

Thanks for opening your first pull request in this repository! Someone will review it when they have a chance. In the mean time, please be sure that you've handled the following things, to make the review process quicker and easier:

  • All code should follow the style guide
  • Documentation added for any new functionality
  • Tests added for any new functionality
  • Tests that are added follow the testing guide
  • Headers and license information added to the top of any new code files
  • HISTORY.md updated if the changes are big or user-facing
  • All CI checks should be passing

Thank you again for your contributions! 👍

@adiwajshing changed the title from "Parallelize artificial network" to "Parallelize convolutional neural network" (Mar 26, 2020)
@kartikdutt18 (Member)

Hey @adiwajshing, do you mind running these tests with BLAS or OpenBLAS?
That might help with CPU training, because training mostly consists of Armadillo operations (multiplication, etc.), and those can be parallelized through BLAS or OpenBLAS, which is also the more common approach.

@adiwajshing (Author)

I already did that. The convolution functions don't really involve many Armadillo operations, so using BLAS or OpenBLAS didn't help. The pooling functions, however, involve a combination of for loops and Armadillo operations. That's why just parallelizing the for loops in these operations made such a big difference.
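
Roughly, the kind of change involved is the sketch below: parallelize the per-slice loop with OpenMP, where each thread writes only to its own output slice. This is illustrative only (names, parameters, and the pooling rule are assumed), not the PR's actual code.

#include <armadillo>

// Minimal sketch: max-pool each slice of a cube independently, so the slice
// loop can be split across threads.  Each iteration writes only to its own
// output slice, so there is no data race.
void MaxPool(const arma::cube& input, arma::cube& output,
             const size_t poolSize, const size_t stride)
{
  const size_t outRows = (input.n_rows - poolSize) / stride + 1;
  const size_t outCols = (input.n_cols - poolSize) / stride + 1;
  output.set_size(outRows, outCols, input.n_slices);

  #pragma omp parallel for
  for (long long s = 0; s < (long long) input.n_slices; ++s)
  {
    for (size_t j = 0; j < outCols; ++j)
      for (size_t i = 0; i < outRows; ++i)
        output(i, j, s) = input.slice(s).submat(i * stride, j * stride,
            i * stride + poolSize - 1, j * stride + poolSize - 1).max();
  }
}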

@kartikdutt18 (Member) left a comment

Hi, I haven't gone through the file fully yet; I think I need to pay a bit more attention to why some of the changes were made, so I'll go over them tomorrow.
Until then, I have a couple of questions:

  1. How was this tested?
  2. Can you share the results with BLAS and OpenBLAS?
  3. Do you mind sharing the testing script?

Sorry to bother you with so many questions. I would love to see convolution improve; however, something I have learned is that benchmarking, and especially correct benchmarking, is not so easy.
Some other compiler flags that you might find useful:
-O3 -fopenmp -DNDEBUG -DARMA_NO_DEBUG

Comment on lines 181 to 182
arma::cube inputTemp = arma::cube(const_cast<arma::Mat<eT>&>(input).memptr(),
inputWidth, inputHeight, inSize * batchSize, false, false);
Member

Any particular reason why we need to make this change? I might be missing something. Thanks.

Author

Sorry about that, it's a merge issue. I'll revert that part.

outputTemp.zeros();

for (size_t outMap = 0, outMapIdx = 0, batchCount = 0; outMap <
outSize * batchSize; outMap++)
arma::Cube<eT> inp = (padWLeft | padWRight | padHTop | padHBottom) != 0 ? inputPaddedTemp : inputTemp;
Member

We generally prefer names that are much more intuitive. Let me know what you think.

Author

I needed that variable earlier but don't anymore. I'll remove it.

@kartikdutt18 (Member)

Also, kindly refer to the style guidelines here. Hmm, it looks like the convolution causes an out-of-bounds index error. Let me know what you think.

@adiwajshing (Author)

That's odd, the tests were passing on my machine. I'll implement the style guide, check the tests, and get back to you.

@kartikdutt18 (Member)

Hey @adiwajshing, something I forgot to mention yesterday: size_t causes problems with OpenMP on some platforms, so you should use omp_size_t to fix the build.
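
The pattern looks roughly like this; the exact definition of mlpack's omp_size_t lives in the source and may differ from this sketch, which only illustrates the idea (MSVC's OpenMP 2.0 implementation accepts only signed loop counters, while size_t is unsigned).

#include <armadillo>
#include <cstddef>
#include <cstdint>

// Sketch of the idea (assumed definition, not mlpack's exact one): use a
// signed alias where the OpenMP implementation requires it.
#ifdef _MSC_VER
  typedef std::intmax_t omp_size_t;
#else
  typedef std::size_t omp_size_t;
#endif

void ZeroSlices(arma::cube& c)
{
  // The counter is declared as omp_size_t so the same loop builds everywhere.
  #pragma omp parallel for
  for (omp_size_t s = 0; s < (omp_size_t) c.n_slices; ++s)
    c.slice(s).zeros();
}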

-Using omp_size_t instead of size_t when using OpenMP
-Convolution::Backward() bug fix
-Parallelized Convolution::Gradient()
@adiwajshing (Author)

@kartikdutt18 thanks a lot! I was wondering what problem OpenMP had with Windows. Anyway, I benchmarked everything with the -O3 optimization (the rest I had already done), and the results were quite interesting.

I have been using this mlpack script for my testing.

When the iterations per epoch < 1000, the performance is almost exactly the same, with or without my changes. However, when the iterations per epoch >= 10000, I get at least 2x performance with these changes.

Here are some results with -O3 & 39000 iterations per epoch:

Without parallelization:
Reading data ...
Training on 37800 samples
Epoch 0: Training Accuracy = 79.6667%, Validation Accuracy = 79.4524%time taken: 91s
Epoch 1: Training Accuracy = 87.3677%, Validation Accuracy = 87.6905%time taken: 90s
Epoch 2: Training Accuracy = 90.3651%, Validation Accuracy = 90.5238%time taken: 90s
Epoch 3: Training Accuracy = 92.3307%, Validation Accuracy = 92.1905%time taken: 90s
Epoch 4: Training Accuracy = 93.2513%, Validation Accuracy = 93.119%time taken: 90s
Predicting ...
total time taken: 451; avg. epoch duration: 90.2

With parallelization
Training on 37800 samples
Epoch 0: Training Accuracy = 73.8651%, Validation Accuracy = 73.6667%time taken: 36s
Epoch 1: Training Accuracy = 85.3333%, Validation Accuracy = 84.5476%time taken: 42s
Epoch 2: Training Accuracy = 89.537%, Validation Accuracy = 89.0952%time taken: 35s
Epoch 3: Training Accuracy = 91.3333%, Validation Accuracy = 91.0238%time taken: 35s
Epoch 4: Training Accuracy = 92.5979%, Validation Accuracy = 92.3095%time taken: 34s
Predicting ...
total time taken: 182; avg. epoch duration: 36.4

@@ -55,28 +55,32 @@ class NaiveConvolution
const size_t dW = 1,
const size_t dH = 1,
const size_t dilationW = 1,
const size_t dilationH = 1)
const size_t dilationH = 1, const size_t appending = false)
Author

This adds an appending option; when appending=true, the convolution just adds to the output instead of allocating a new matrix and then adding it. That spares an allocation and some CPU time.
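
In other words, something along these lines (a simplified valid-mode sketch with assumed parameter names, not the PR's exact signature):

#include <armadillo>

// Sketch only: when `appending` is true the caller guarantees `output` is
// already sized, so we accumulate into it directly instead of allocating and
// zeroing a temporary and then adding that temporary to the output.
// (Computes a valid-mode cross-correlation for brevity.)
template<typename eT>
void NaiveValidConvolution(const arma::Mat<eT>& input,
                           const arma::Mat<eT>& filter,
                           arma::Mat<eT>& output,
                           const bool appending = false)
{
  const size_t outRows = input.n_rows - filter.n_rows + 1;
  const size_t outCols = input.n_cols - filter.n_cols + 1;

  if (!appending)
    output.zeros(outRows, outCols);  // allocate/zero only when needed

  for (size_t j = 0; j < outCols; ++j)
    for (size_t i = 0; i < outRows; ++i)
      output(i, j) += arma::accu(filter %
          input.submat(i, j, i + filter.n_rows - 1, j + filter.n_cols - 1));
}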

Member

I think appending should be a bool value then.

Author

Sure, I'll change it to that then.

@adiwajshing (Author)

Hi @kartikdutt18

Would you have any idea why the build is failing now on 'mlpack.mlpack' and the rest is passing? The build section on Azure is just empty.

Thank you

@kartikdutt18 (Member)

Hi, It's unrelated to your PR. Thanks.

@lozhnikov (Contributor) left a comment

Hello, thanks for the contribution. I haven't looked at it in detail yet; I just added some minor comments regarding the style and a couple of questions.

CMakeLists.txt Outdated
@@ -481,14 +481,17 @@ if (OPENMP_FOUND)
add_definitions(-DHAS_OPENMP)
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${OpenMP_C_FLAGS}")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}")
if("${CMAKE_CXX_COMPILER_ID}" STREQUAL "AppleClang")
set(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} /usr/local/Cellar/llvm/9.0.1/lib/libomp.dylib")
Contributor

Probably this was a wrong git add.

Author

So, this line is required when building with Xcode; libomp.dylib has to be linked. In hindsight, it's too specific to the version of LLVM. Maybe you could suggest a better way to do this?

Contributor

Unfortunately, I don't know since I've never used OSX. I think the cmake configuration file shouldn't depend on a particular system.

I think you can solve the issue by means of cmake variables. Try to modify the compiler flags or the environment:

# I am not quite sure which variable you need.
cmake -D CMAKE_SHARED_LINKER_FLAGS=/usr/local/Cellar/llvm/9.0.1/lib/libomp.dylib path/to/sources
# Probably cmake scans the standard variables.
LDFLAGS=/usr/local/Cellar/llvm/9.0.1/lib/libomp.dylib cmake path/to/sources

else ()
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /wd4068")
endif ()
set(OpenMP_CXX_FLAGS "")
set(OpenMP_CXX_FLAGS "")
Contributor

Looks like the previous indentation was correct.

Comment on lines 68 to 83
const eT *kernelPtr = filter.memptr(), *inputPtr;
size_t j, i, kj, ki;
const size_t o_cols = output.n_cols, o_rows = output.n_rows;
const size_t f_cols = filter.n_cols, f_rows = filter.n_rows;

for (size_t j = 0; j < output.n_cols; ++j)
for (j = 0; j < o_cols; ++j)
{
for (size_t i = 0; i < output.n_rows; ++i, outputPtr++)
for (i = 0; i < o_rows; ++i, outputPtr ++)
{
const eT* kernelPtr = filter.memptr();
for (size_t kj = 0; kj < filter.n_cols; ++kj)
for (kj = 0; kj < f_cols; ++kj)
{
const eT* inputPtr = input.colptr(kj * dilationW + j * dW) + i * dH;
for (size_t ki = 0; ki < filter.n_rows; ++ki, ++kernelPtr,
inputPtr += dilationH)
inputPtr = input.colptr(kj * dilationW + j * dW) + i * dH;
for (ki = 0; ki < f_rows; ++ki, ++kernelPtr, inputPtr += dilationH)
*outputPtr += *kernelPtr * (*inputPtr);
}
kernelPtr -= f_rows*f_cols;
Contributor

I didn't get the point. What's the purpose of these changes? Could you elaborate a bit?

Author

I was working with no optimization, so these changes made an incremental difference. However, these are just small optimizations the compiler probably would have done anyway. I can remove them if required.

Author

So, I ran a test with this change. There does not seem to be a difference when optimization is enabled.

However, I then parallelized the function for the case where the image is large enough (a sketch of the idea follows the results). Here are the results:

INPUT SIZE: 16
	FILTER SIZE 3: old: 0.038158s    new: 0.036094s
	FILTER SIZE 5: old: 0.044838s    new: 0.057577s
	FILTER SIZE 7: old: 0.066811s    new: 0.063331s

INPUT SIZE: 32
	FILTER SIZE 3: old: 0.127948s    new: 0.140982s
	FILTER SIZE 5: old: 0.279254s    new: 0.405592s
	FILTER SIZE 7: old: 0.625482s    new: 0.604379s

INPUT SIZE: 48
	FILTER SIZE 3: old: 0.528458s    new: 0.640729s
	FILTER SIZE 5: old: 0.718363s    new: 0.572617s
	FILTER SIZE 7: old: 1.00586s    new: 0.969761s

INPUT SIZE: 64
	FILTER SIZE 3: old: 0.503006s    new: 0.729201s
	FILTER SIZE 5: old: 1.10458s    new: 1.09253s
	FILTER SIZE 7: old: 1.80471s    new: 0.791141s

INPUT SIZE: 80
	FILTER SIZE 3: old: 0.830632s    new: 0.466671s
	FILTER SIZE 5: old: 1.58484s    new: 0.782858s
	FILTER SIZE 7: old: 2.84117s    new: 1.71891s

INPUT SIZE: 96
	FILTER SIZE 3: old: 1.21493s    new: 0.706584s
	FILTER SIZE 5: old: 2.2874s    new: 1.13919s
	FILTER SIZE 7: old: 4.15312s    new: 1.8547s

INPUT SIZE: 112
	FILTER SIZE 3: old: 1.53625s    new: 0.901717s
	FILTER SIZE 5: old: 3.10605s    new: 1.57691s
	FILTER SIZE 7: old: 6.06146s    new: 4.87558s

INPUT SIZE: 128
	FILTER SIZE 3: old: 2.34454s    new: 1.24114s
	FILTER SIZE 5: old: 4.60348s    new: 2.96314s
	FILTER SIZE 7: old: 9.58979s    new: 7.50299s

INPUT SIZE: 144
	FILTER SIZE 3: old: 3.30705s    new: 2.80564s
	FILTER SIZE 5: old: 6.0074s    new: 2.97862s
	FILTER SIZE 7: old: 9.66638s    new: 4.33965s

INPUT SIZE: 160
	FILTER SIZE 3: old: 3.08419s    new: 1.82183s
	FILTER SIZE 5: old: 6.39535s    new: 3.26939s
	FILTER SIZE 7: old: 12.2366s    new: 6.49565s

INPUT SIZE: 176
	FILTER SIZE 3: old: 4.12684s    new: 3.04944s
	FILTER SIZE 5: old: 9.23612s    new: 4.31397s
	FILTER SIZE 7: old: 16.8792s    new: 6.71569s

INPUT SIZE: 192
	FILTER SIZE 3: old: 4.55307s    new: 3.37878s
	FILTER SIZE 5: old: 9.91482s    new: 5.1516s
	FILTER SIZE 7: old: 17.925s    new: 7.8205s

INPUT SIZE: 208
	FILTER SIZE 3: old: 6.43632s    new: 4.4773s
	FILTER SIZE 5: old: 13.3028s    new: 7.28228s
	FILTER SIZE 7: old: 22.9444s    new: 10.3726s

INPUT SIZE: 224
	FILTER SIZE 3: old: 5.9436s    new: 3.52242s
	FILTER SIZE 5: old: 12.6614s    new: 6.46533s
	FILTER SIZE 7: old: 24.3096s    new: 11.6745s

INPUT SIZE: 240
	FILTER SIZE 3: old: 6.78909s    new: 3.9888s
	FILTER SIZE 5: old: 14.444s    new: 7.399s
	FILTER SIZE 7: old: 27.606s    new: 12.273s

INPUT SIZE: 256
	FILTER SIZE 3: old: 8.25704s    new: 5.41013s
	FILTER SIZE 5: old: 17.2394s    new: 9.49333s
	FILTER SIZE 7: old: 32.0725s    new: 14.911s

Here is a link to the testing script
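
The parallelized version is conceptually along these lines; this is a sketch only, with an assumed size threshold, not the exact code behind the numbers above.

#include <armadillo>

// Sketch: split the output columns across threads, but only when the output
// is large enough for the threading overhead to pay off (the 64*64 threshold
// is an assumption for illustration).
template<typename eT>
void ParallelValidConvolution(const arma::Mat<eT>& input,
                              const arma::Mat<eT>& filter,
                              arma::Mat<eT>& output)
{
  const size_t outRows = input.n_rows - filter.n_rows + 1;
  const size_t outCols = input.n_cols - filter.n_cols + 1;
  output.zeros(outRows, outCols);

  // Each (i, j) writes a distinct element, so the loop is race-free; the
  // `if` clause keeps small images sequential.
  #pragma omp parallel for if (outRows * outCols > 64 * 64)
  for (long long j = 0; j < (long long) outCols; ++j)
    for (size_t i = 0; i < outRows; ++i)
      output(i, j) += arma::accu(filter %
          input.submat(i, j, i + filter.n_rows - 1, j + filter.n_cols - 1));
}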

Contributor

That's interesting. I'll look into it. I thought the compiler was able to optimize the old code.

Author

Probably, but it won't be able to parallelize automatically.

Comment on lines 54 to 55
#pragma omp parallel for
for (omp_size_t i = 0; i < sample.n_elem; i++)
Contributor

This loop contains a pretty simple operation. I think in this case the performance depends on the memory clock speed rather than on the CPU. What do you think?

Author

I think it would depend on the number of samples, right? If there are only, say, 10 samples, there may not be a difference in performance. However, as the number increases and the overhead of creating the threads becomes less and less relevant, we would see a greater increase in performance.

Contributor

I deleted the last comment due to wrong values.

@lozhnikov (Contributor) commented Mar 31, 2020

Added another comment. Now the test seems correct.

You didn't take into account that RAM frequency is much lower than CPU frequency, and that the loop requires two new values each iteration. Actually, there are a great number of factors, such as memory frequency, memory bandwidth, CPU frequency, CPU cache size, and so on.

I wrote a simple test which measures the duration of a similar loop.

https://gist.github.com/lozhnikov/3486432717ea04f25a722c97fbd79edd

(Weird, I couldn't upload the file, so I created a gist)

Here are the results:

I used a system with core-i7 2600K and 16GB DDR3 memory.

Look at the "Parallel (s)" and "Sequential (s)" columns.

g++ 9.2.1

g++ -O3 -fopenmp speedup_test.cpp -lgomp -o speedup_test_g++
./speedup_test_g++ 
        Size   Count Par. Parallel (s)   Count Seq. Sequential (s)
          10            6 1.880000e-07            4 5.900000e-08
         100           49 1.600000e-07           49 1.070000e-07
        1000          502 1.101000e-06          500 8.470000e-07
       10000         4964 1.166800e-05         4968 8.230000e-06
      100000        49841 2.085290e-04        49958 8.180200e-05
     1000000       500518 1.397976e-03       499577 1.487266e-03
    10000000      4997717 1.245767e-02      4998914 1.237367e-02
   100000000     49997606 1.240536e-01     49997601 1.196321e-01

clang 9.0.1

clang++ -O3 -fopenmp speedup_test.cpp -lgomp -o speedup_test_clang++
./speedup_test_clang++
        Size   Count Par. Parallel (s)   Count Seq. Sequential (s)
          10            4 4.780000e-07            5 1.270000e-07
         100           47 2.260000e-07           51 7.800000e-08
        1000          520 1.019000e-06          489 2.660000e-07
       10000         4965 1.011800e-05         4993 3.656000e-06
      100000        49904 6.398400e-05        49987 4.442300e-05
     1000000       499302 1.313890e-03       498852 1.307853e-03
    10000000      4999504 1.020465e-02      4998569 1.076408e-02
   100000000     50005009 9.804719e-02     49992123 9.778105e-02

There's hardly any difference.

Author

You're right. I just ran a similar test on my machine:
1.6GHz dual-core Core i5, 4GB DDR3 memory

I underestimated the amount of data needed to see a significant difference. You only see a real difference when the number of items is 10M+.

If the comparison is really quick, then I can remove the parallel part and reduce the overhead. What do you say?

Contributor

Yeah, I think you can remove the parallel part here.

outMapIdx = 0;
}
size_t outMapIdx = (outMap % outSize) * inSize, batchCount = outMap/outSize;
arma::Mat<eT> &curSlice = outputTemp.slice(outMap);
Contributor

Just a tiny issue. According to the style guide we should write references and data types in one word.

Suggested change
arma::Mat<eT> &curSlice = outputTemp.slice(outMap);
arma::Mat<eT>& curSlice = outputTemp.slice(outMap);

Author

Sorry about the style issues; I'll put all these changes in the next commit.

inputHeight, inSize * batchSize, false, false);
arma::cube inputTemp;
if (padWLeft != 0 || padWRight != 0 || padHTop != 0 || padHBottom != 0) {
inputTemp = inputPaddedTemp;
Contributor

Probably I missed something. Where was inputPaddedTemp defined?

Author

Oh, it's a property of the Convolution class

Author

I have just shifted the checking for padding outside the for loop.

{
batchCount++;
outMapIdx = 0;
arma::Mat<eT> &curGradTemp = gradientTemp.slice(outMapIdx+inMap);
Contributor

Just a tiny style issue. There are the same issues below.

Suggested change
arma::Mat<eT> &curGradTemp = gradientTemp.slice(outMapIdx+inMap);
arma::Mat<eT>& curGradTemp = gradientTemp.slice(outMapIdx+inMap);

arma::Mat<eT> output;
GradientConvolutionRule::Convolution(inputSlice, deltaSlice,
output, strideWidth, strideHeight);
output, strideWidth, strideHeight);
Contributor

Actually, according to the style guide the indentation was correct:)

arma::Mat<eT> rotatedFilter;
Rotate180(weight.slice(outMapIdx+inMap), rotatedFilter);
#pragma omp for
for (omp_size_t batchCount = 0; batchCount < batchSize; batchCount++) {
Contributor

Sorry, I didn't look in detail yet. Why did you change the nesting of the loops? I was wondering if it provides any cache optimizations.

Author

Yes, we avoid extra computations. Earlier, we were retrieving and rotating the same weight slice for every batch, but now we only do it once for all batches.
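
Roughly, the restructuring amounts to hoisting the loop-invariant rotation, along these lines (a simplified, self-contained sketch with illustrative names, not the PR's exact code):

#include <armadillo>

// Sketch: the 180-degree filter rotation is invariant across the batch, so it
// is computed once and reused inside the (parallelized) batch loop instead of
// being redone for every batch item.
template<typename eT>
void BackwardOneMap(const arma::Mat<eT>& filter,
                    const arma::Cube<eT>& delta,   // one slice per batch item
                    arma::Cube<eT>& outputDelta)
{
  // Hoisted out of the loop: rotate the filter a single time.
  const arma::Mat<eT> rotatedFilter = arma::fliplr(arma::flipud(filter));

  outputDelta.zeros(delta.n_rows + filter.n_rows - 1,
                    delta.n_cols + filter.n_cols - 1,
                    delta.n_slices);

  // Each batch item writes only to its own slice, so the loop parallelizes safely.
  #pragma omp parallel for
  for (long long b = 0; b < (long long) delta.n_slices; ++b)
    outputDelta.slice(b) = arma::conv2(delta.slice(b), rotatedFilter, "full");
}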

Comment on lines 45 to 46
#pragma omp parallel for
for (omp_size_t i = 0; i < input.n_elem; i++)
Contributor

Again, I think this loop contains a pretty simple operation. I guess in this case the performance depends on the memory clock speed rather than on the CPU.

Author

Again, I think it would depend on the size of the input, right?

-removed parallel loop from bernoulli_distribution & leaky_relu
-fixed data race in convolution_impl
-conform to style guide
-parallelized naive convolution
@mlpack-bot commented May 20, 2020

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍

@mlpack-bot added the s: stale label (May 20, 2020)
@adiwajshing (Author)

Hi, is there any progress on this?

@mlpack-bot removed the s: stale label (May 25, 2020)
@mlpack-bot commented Jun 24, 2020

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍

@mlpack-bot added the s: stale label (Jun 24, 2020)
@mlpack-bot closed this (Jul 1, 2020)
@FabioMBB

Hi Guys, was this ever merged?

@rcurtin (Member) commented Feb 27, 2021

Hey @adiwajshing, sorry, it looks like this one kind of fell off the list a little bit. Are you still interested in it? We can reopen the PR and I can try and review it and get it merged. @FabioMBB said he had some success with it.

@adiwajshing (Author)

@rcurtin sure -- can review

@rcurtin reopened this (Mar 14, 2021)
@mlpack-bot removed the s: stale label (Mar 14, 2021)
@mlpack-bot commented Mar 14, 2021

Thanks for opening your first pull request in this repository! Someone will review it when they have a chance. In the mean time, please be sure that you've handled the following things, to make the review process quicker and easier:

  • All code should follow the style guide
  • Documentation added for any new functionality
  • Tests added for any new functionality
  • Tests that are added follow the testing guide
  • Headers and license information added to the top of any new code files
  • HISTORY.md updated if the changes are big or user-facing
  • All CI checks should be passing

Thank you again for your contributions! 👍

@rcurtin (Member) commented Mar 14, 2021

I tried the PR locally and saw some really nice speedup (3x) on the mnist_cnn example! (See mlpack/examples#144 for a few more details.) However, it seems like there might be a minor correctness issue, since I was getting different results in each case.

I merged master into this branch---hopefully I didn't mess anything up during the merge process. 👍

I think that we should try to figure out what's wrong here and then incorporate it, because based on every benchmark we've seen here it's a nice speedup.

@adiwajshing (Author)

@rcurtin good to hear. I'll see where the issue is over the next few days and try to resolve it. Can you give me a specific test I can try out, and what you expect out of it?

@rcurtin (Member) commented Mar 15, 2021

@adiwajshing yeah, so what I did was build the mnist_cnn example from https://github.com/mlpack/examples, both against master and against this branch. I noticed different output:

Here's master:

$ time LD_LIBRARY_PATH=/home/ryan/src/mlpack/build/lib/ ./mnist_cnn 
Reading data ...
Start training ...
Epoch 1
2150.4680 [====================================================================================================] 100% - ETA: 0s - loss: 2148.47
1080/1080 [====================================================================================================] 100% - 44s 40ms/step - loss: 2150.46
Validation loss: 2.77534e+06.
Epoch 2
309.01780 [====================================================================================================] 100% - ETA: 0s - loss: 308.731
1080/1080 [====================================================================================================] 100% - 45s 41ms/step - loss: 309.017
Validation loss: 1.21627e+06.
Epoch 3
157.82680 [====================================================================================================] 100% - ETA: 0s - loss: 157.68
1080/1080 [====================================================================================================] 100% - 44s 41ms/step - loss: 157.826
Validation loss: 719143.
Epoch 4
96.544480 [====================================================================================================] 100% - ETA: 0s - loss: 96.455
1080/1080 [====================================================================================================] 100% - 45s 41ms/step - loss: 96.5444
Validation loss: 485079.
Epoch 5
65.006180 [====================================================================================================] 100% - ETA: 0s - loss: 64.946
1080/1080 [====================================================================================================] 100% - 44s 41ms/step - loss: 65.0061
Validation loss: 346494.
Epoch 6
46.316680 [====================================================================================================] 100% - ETA: 0s - loss: 46.2738
1080/1080 [====================================================================================================] 100% - 45s 41ms/step - loss: 46.3166
Validation loss: 254303.
Epoch 7
34.012180 [====================================================================================================] 100% - ETA: 0s - loss: 33.9806
1080/1080 [====================================================================================================] 100% - 45s 42ms/step - loss: 34.0121
Validation loss: 190001.
Epoch 8
25.310880 [====================================================================================================] 100% - ETA: 0s - loss: 25.2874
1080/1080 [====================================================================================================] 100% - 46s 42ms/step - loss: 25.3108
Validation loss: 144774.
Epoch 9
19.189280 [====================================================================================================] 100% - ETA: 0s - loss: 19.1715
1080/1080 [====================================================================================================] 100% - 45s 42ms/step - loss: 19.1892
Validation loss: 123542.
Epoch 10
14.814480 [====================================================================================================] 100% - ETA: 0s - loss: 14.8007
1080/1080 [====================================================================================================] 100% - 46s 42ms/step - loss: 14.8144
Validation loss: 99427.4.
Epoch 11
11.806080 [====================================================================================================] 100% - ETA: 0s - loss: 11.795
1080/1080 [====================================================================================================] 100% - 47s 43ms/step - loss: 11.806
Validation loss: 82894.4.
Epoch 12
9.4929180 [====================================================================================================] 100% - ETA: 0s - loss: 9.48413
1080/1080 [====================================================================================================] 100% - 46s 43ms/step - loss: 9.49291
Validation loss: 70650.2.
Epoch 13
7.6873380 [====================================================================================================] 100% - ETA: 0s - loss: 7.68022
1080/1080 [====================================================================================================] 100% - 48s 44ms/step - loss: 7.68733
Validation loss: 60630.2.
Epoch 14
6.4707980 [====================================================================================================] 100% - ETA: 0s - loss: 6.4648
1080/1080 [====================================================================================================] 100% - 48s 44ms/step - loss: 6.47079
Validation loss: 57717.8.
Epoch 15
5.2746980 [====================================================================================================] 100% - ETA: 0s - loss: 5.26981
1080/1080 [====================================================================================================] 100% - 47s 43ms/step - loss: 5.27469
Validation loss: 52386.2.
Epoch 16
4.3709180 [====================================================================================================] 100% - ETA: 0s - loss: 4.36687
1080/1080 [====================================================================================================] 100% - 47s 44ms/step - loss: 4.37091
Validation loss: 50859.3.
Epoch 17
3.7218780 [====================================================================================================] 100% - ETA: 0s - loss: 3.71843
1080/1080 [====================================================================================================] 100% - 48s 45ms/step - loss: 3.72187
Validation loss: 42875.3.
Epoch 18
3.1274680 [====================================================================================================] 100% - ETA: 0s - loss: 3.12457
1080/1080 [====================================================================================================] 100% - 49s 46ms/step - loss: 3.12746
Validation loss: 41909.3.
Epoch 19
2.8877980 [====================================================================================================] 100% - ETA: 0s - loss: 2.88512
1080/1080 [====================================================================================================] 100% - 49s 46ms/step - loss: 2.88779
Validation loss: 41560.
Epoch 20
2.4151180 [====================================================================================================] 100% - ETA: 0s - loss: 2.41287
1080/1080 [====================================================================================================] 100% - 50s 46ms/step - loss: 2.41511
Validation loss: 31542.1.
Epoch 21
2.1022380 [====================================================================================================] 100% - ETA: 0s - loss: 2.10029
1080/1080 [====================================================================================================] 100% - 50s 47ms/step - loss: 2.10223
Validation loss: 34162.1.
Epoch 22
1.9803680 [====================================================================================================] 100% - ETA: 0s - loss: 1.97853
1080/1080 [====================================================================================================] 100% - 49s 46ms/step - loss: 1.98036
Validation loss: 32856.6.
Epoch 23
1.7359580 [====================================================================================================] 100% - ETA: 0s - loss: 1.73434
1080/1080 [====================================================================================================] 100% - 50s 46ms/step - loss: 1.73595
Validation loss: 30949.5.
Epoch 24
1.5832580 [====================================================================================================] 100% - ETA: 0s - loss: 1.58178
1080/1080 [====================================================================================================] 100% - 50s 46ms/step - loss: 1.58325
Validation loss: 28673.1.
Epoch 25
1.3232780 [====================================================================================================] 100% - ETA: 0s - loss: 1.32204
1080/1080 [====================================================================================================] 100% - 49s 46ms/step - loss: 1.32327
Validation loss: 29069.5.
Epoch 26
1.2727880 [====================================================================================================] 100% - ETA: 0s - loss: 1.27165
1080/1080 [====================================================================================================] 100% - 50s 46ms/step - loss: 1.27278
Validation loss: 25303.8.
Epoch 27
1.2116580 [====================================================================================================] 100% - ETA: 0s - loss: 1.21053
1080/1080 [====================================================================================================] 100% - 50s 46ms/step - loss: 1.21165
Validation loss: 26911.2.
Epoch 28
1.0501380 [====================================================================================================] 100% - ETA: 0s - loss: 1.04916
1080/1080 [====================================================================================================] 100% - 50s 46ms/step - loss: 1.05013
Validation loss: 24317.2.
Epoch 29
1.0180980 [====================================================================================================] 100% - ETA: 0s - loss: 1.01715
1080/1080 [====================================================================================================] 100% - 50s 46ms/step - loss: 1.01809
Validation loss: 23769.9.
Epoch 30
0.9273660 [====================================================================================================] 100% - ETA: 0s - loss: 0.926508
1080/1080 [====================================================================================================] 100% - 50s 46ms/step - loss: 0.927366
Validation loss: 22773.5.
Epoch 31
0.8413060 [====================================================================================================] 100% - ETA: 0s - loss: 0.840528
1080/1080 [====================================================================================================] 100% - 49s 46ms/step - loss: 0.841306
Validation loss: 20781.8.
Epoch 32
0.8605250 [====================================================================================================] 100% - ETA: 0s - loss: 0.859729
1080/1080 [====================================================================================================] 100% - 49s 46ms/step - loss: 0.860525
Validation loss: 23822.4.
Epoch 33
0.7307910 [====================================================================================================] 100% - ETA: 0s - loss: 0.730115
1080/1080 [====================================================================================================] 100% - 49s 46ms/step - loss: 0.730791
Validation loss: 21462.2.
Epoch 34
0.7655850 [====================================================================================================] 100% - ETA: 0s - loss: 0.764877
1080/1080 [====================================================================================================] 100% - 49s 45ms/step - loss: 0.765585
Validation loss: 23261.7.
Epoch 35
0.6729320 [====================================================================================================] 100% - ETA: 0s - loss: 0.672309
1080/1080 [====================================================================================================] 100% - 49s 46ms/step - loss: 0.672932
Validation loss: 21602.3.
Epoch 36
0.7120680 [====================================================================================================] 100% - ETA: 0s - loss: 0.711401
1080/1080 [====================================================================================================] 100% - 49s 45ms/step - loss: 0.71206
Validation loss: 19921.5.
Epoch 37
0.6118950 [====================================================================================================] 100% - ETA: 0s - loss: 0.611329
1080/1080 [====================================================================================================] 100% - 49s 46ms/step - loss: 0.611895
Validation loss: 21609.2.
Epoch 38
0.5873770 [====================================================================================================] 100% - ETA: 0s - loss: 0.586833
1080/1080 [====================================================================================================] 100% - 50s 46ms/step - loss: 0.587377
Validation loss: 17931.3.
Epoch 39
0.5226340 [====================================================================================================] 100% - ETA: 0s - loss: 0.522151
1080/1080 [====================================================================================================] 100% - 49s 46ms/step - loss: 0.522634
Validation loss: 18900.3.
Epoch 40
0.5187220 [====================================================================================================] 100% - ETA: 0s - loss: 0.518242
1080/1080 [====================================================================================================] 100% - 49s 46ms/step - loss: 0.518722
Validation loss: 18648.6.
Epoch 41
0.4907320 [====================================================================================================] 100% - ETA: 0s - loss: 0.490278
1080/1080 [====================================================================================================] 100% - 49s 45ms/step - loss: 0.490732
Validation loss: 21183.7.
Epoch 42
0.5101930 [====================================================================================================] 100% - ETA: 0s - loss: 0.509721
1080/1080 [====================================================================================================] 100% - 49s 45ms/step - loss: 0.510193
Validation loss: 18619.4.
Epoch 43
0.4506470 [====================================================================================================] 100% - ETA: 0s - loss: 0.450231
1080/1080 [====================================================================================================] 100% - 49s 46ms/step - loss: 0.450647
Validation loss: 17743.5.
Epoch 44
0.4243330 [====================================================================================================] 100% - ETA: 0s - loss: 0.42394
1080/1080 [====================================================================================================] 100% - 49s 46ms/step - loss: 0.424333
Validation loss: 19425.8.
Epoch 45
0.4857840 [====================================================================================================] 100% - ETA: 0s - loss: 0.485334
1080/1080 [====================================================================================================] 100% - 49s 45ms/step - loss: 0.485784
Validation loss: 18821.4.
Epoch 46
0.4195440 [====================================================================================================] 100% - ETA: 0s - loss: 0.419156
1080/1080 [====================================================================================================] 100% - 49s 46ms/step - loss: 0.419544
Validation loss: 17202.
Epoch 47
0.4419720 [====================================================================================================] 100% - ETA: 0s - loss: 0.441564
1080/1080 [====================================================================================================] 100% - 49s 45ms/step - loss: 0.441972
Validation loss: 18088.7.
Epoch 48
0.3860690 [====================================================================================================] 100% - ETA: 0s - loss: 0.385711
1080/1080 [====================================================================================================] 100% - 49s 45ms/step - loss: 0.386069
Validation loss: 16440.7.
Epoch 49
0.4003420 [====================================================================================================] 100% - ETA: 0s - loss: 0.399971
1080/1080 [====================================================================================================] 100% - 49s 45ms/step - loss: 0.400342
Validation loss: 17862.4.
Epoch 50
0.4180330 [====================================================================================================] 100% - ETA: 0s - loss: 0.417647
1080/1080 [====================================================================================================] 100% - 49s 46ms/step - loss: 0.418033
Validation loss: 18738.5.
Epoch 51
0.4121690 [====================================================================================================] 100% - ETA: 0s - loss: 0.411788
1080/1080 [====================================================================================================] 100% - 49s 45ms/step - loss: 0.412169
Validation loss: 17226.5.
Epoch 52
0.2911080 [====================================================================================================] 100% - ETA: 0s - loss: 0.290731
1080/1080 [====================================================================================================] 100% - 49s 46ms/step - loss: 0.291
Validation loss: 16592.6.
Epoch 53
0.4126380 [====================================================================================================] 100% - ETA: 0s - loss: 0.412248
1080/1080 [====================================================================================================] 100% - 49s 46ms/step - loss: 0.41263
Validation loss: 17579.6.
Epoch 54
0.3190650 [====================================================================================================] 100% - ETA: 0s - loss: 0.31877
1080/1080 [====================================================================================================] 100% - 49s 46ms/step - loss: 0.319065
Validation loss: 18338.1.
Epoch 55
0.3547280 [====================================================================================================] 100% - ETA: 0s - loss: 0.354392
1080/1080 [====================================================================================================] 100% - 49s 46ms/step - loss: 0.35472
Validation loss: 16453.6.
Epoch 56
0.3648910 [====================================================================================================] 100% - ETA: 0s - loss: 0.364554
1080/1080 [====================================================================================================] 100% - 49s 45ms/step - loss: 0.364891
Validation loss: 16569.9.
Epoch 57
0.3383880 [====================================================================================================] 100% - ETA: 0s - loss: 0.338067
1080/1080 [====================================================================================================] 100% - 49s 45ms/step - loss: 0.33838
Validation loss: 16980.7.
Epoch 58
0.3366140 [====================================================================================================] 100% - ETA: 0s - loss: 0.336302
1080/1080 [====================================================================================================] 100% - 49s 45ms/step - loss: 0.336614
Validation loss: 18033.8.
Accuracy: train = 99.5278%,	 valid = 97.65%
Predicting ...
Saving predicted labels to "results.csv."...
Neural network model is saved to "model.bin"
Finished

real	48m49.996s
user	268m9.838s
sys	0m15.430s

and here's this branch:

$ time LD_LIBRARY_PATH=/home/ryan/src/mlpack/build-2341/lib/ ./mnist_cnn 
Reading data ...
Start training ...
Epoch 1
5116.3280 [====================================================================================================] 100% - ETA: 0s - loss: 5111.58
1080/1080 [====================================================================================================] 100% - 9s 8ms/step - loss: 5116.32
Validation loss: 1.56285e+07.
Epoch 2
1995.6280 [====================================================================================================] 100% - ETA: 0s - loss: 1993.78
1080/1080 [====================================================================================================] 100% - 9s 9ms/step - loss: 1995.62
Validation loss: 1.12203e+07.
Epoch 3
1722.3080 [====================================================================================================] 100% - ETA: 0s - loss: 1720.7
1080/1080 [====================================================================================================] 100% - 9s 8ms/step - loss: 1722.3
Validation loss: 1.21165e+07.
Epoch 4
2006.1280 [====================================================================================================] 100% - ETA: 0s - loss: 2004.26
1080/1080 [====================================================================================================] 100% - 9s 8ms/step - loss: 2006.12
Validation loss: 1.53188e+07.
Epoch 5
3868.2280 [====================================================================================================] 100% - ETA: 0s - loss: 3864.65
1080/1080 [====================================================================================================] 100% - 9s 8ms/step - loss: 3868.22
Validation loss: 3.19173e+07.
Epoch 6
7577.0280 [====================================================================================================] 100% - ETA: 0s - loss: 7570.01
1080/1080 [====================================================================================================] 100% - 8s 8ms/step - loss: 7577.02
Validation loss: 6.51282e+07.
Epoch 7
11557.880 [====================================================================================================] 100% - ETA: 0s - loss: 11547.1
1080/1080 [====================================================================================================] 100% - 8s 7ms/step - loss: 11557.8
Validation loss: 1.11472e+08.
Epoch 8
14903.180 [====================================================================================================] 100% - ETA: 0s - loss: 14889.3
1080/1080 [====================================================================================================] 100% - 9s 8ms/step - loss: 14903.1
Validation loss: 1.59785e+08.
Epoch 9
17642.780 [====================================================================================================] 100% - ETA: 0s - loss: 17626.4
1080/1080 [====================================================================================================] 100% - 9s 8ms/step - loss: 17642.7
Validation loss: 1.05952e+08.
Epoch 10
212571080 [====================================================================================================] 100% - ETA: 0s - loss: 21237.4
1080/1080 [====================================================================================================] 100% - 8s 7ms/step - loss: 21257
Validation loss: 1.44208e+08.
Epoch 11
24389.180 [====================================================================================================] 100% - ETA: 0s - loss: 24366.5
1080/1080 [====================================================================================================] 100% - 9s 8ms/step - loss: 24389.1
Validation loss: 1.35117e+08.
Epoch 12
27116.780 [====================================================================================================] 100% - ETA: 0s - loss: 27091.6
1080/1080 [====================================================================================================] 100% - 8s 7ms/step - loss: 27116.7
Validation loss: 1.59217e+08.
Accuracy: train = 89.387%,	 valid = 89.2333%
Predicting ...
Saving predicted labels to "results.csv."...
Neural network model is saved to "model.bin"
Finished

real	2m2.085s
user	44m9.558s
sys	0m50.993s

So you can see that something's different, and my assumption at this point is that there's some small difference in the convolution code somewhere. However, it doesn't seem like there is a failing test case in mlpack_test that might make it easy to reproduce. But I also noticed that the original output you gave has slightly different results vs. the master branch too, so maybe that is an easier thing to reproduce.

If you don't get a chance, I'll try and dig in, but it may be a handful of days before I have the chance. 👍
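
One usual suspect for this kind of mismatch is an accumulation race, where several threads `+=` into the same gradient slice. Below is a hedged sketch of that pattern and a race-free alternative using per-thread buffers; the names and the stand-in contribution are illustrative, and this is not a claim about where the actual bug is.

#include <armadillo>

// Sketch: if several outMap iterations can touch the same gradient slice,
// concurrent `+=` is a data race and the sums can differ from run to run.
// One race-free pattern: accumulate into per-thread buffers, then reduce.
arma::cube AccumulateGradient(const size_t total, const size_t gradSlices,
                              const size_t rows, const size_t cols)
{
  arma::cube gradient(rows, cols, gradSlices, arma::fill::zeros);

  #pragma omp parallel
  {
    arma::cube local(rows, cols, gradSlices, arma::fill::zeros);

    #pragma omp for nowait
    for (long long outMap = 0; outMap < (long long) total; ++outMap)
    {
      // Stand-in for the real per-map contribution.
      local.slice(outMap % gradSlices) += arma::ones(rows, cols);
    }

    // Combine the per-thread partial sums one thread at a time.
    #pragma omp critical
    gradient += local;
  }

  return gradient;
}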

@mlpack-bot commented Apr 14, 2021

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍
