Parallelize convolutional neural network #2341
Conversation
Thanks for opening your first pull request in this repository! Someone will review it when they have a chance. In the meantime, please be sure that you've handled the following things, to make the review process quicker and easier:
Thank you again for your contributions! 👍 |
Hey @adiwajshing, do you mind running these tests with BLAS or OpenBLAS? |
I already did that. The convolution functions don't really involve many Armadillo operations, so using BLAS or OpenBLAS didn't help. The pooling functions, on the other hand, involve a combination of for loops & Armadillo operations, which is why just parallelizing the for loops in those operations made a huge difference. |
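(For context, the kind of change being described looks roughly like the sketch below: the per-slice pooling work is spread across OpenMP threads. The function name, signature, and the no-padding window arithmetic are illustrative assumptions, not the actual mlpack pooling code.)

```cpp
#include <armadillo>
#include <cstddef>

// Sketch: pool each slice of the input cube on its own thread.  The slices
// are independent, so the loop carries no dependencies and can run in
// parallel safely.
void MaxPoolingForwardSketch(const arma::cube& input, arma::cube& output,
                             const size_t kW, const size_t kH,
                             const size_t dW, const size_t dH)
{
  output.set_size((input.n_rows - kH) / dH + 1,
                  (input.n_cols - kW) / dW + 1,
                  input.n_slices);

  // A signed loop index keeps MSVC's OpenMP 2.0 implementation happy.
  #pragma omp parallel for
  for (long long s = 0; s < (long long) input.n_slices; ++s)
  {
    for (size_t j = 0; j < output.n_cols; ++j)
      for (size_t i = 0; i < output.n_rows; ++i)
        // Maximum over the kW x kH window (assumes no padding).
        output(i, j, s) = input.slice(s).submat(
            i * dH, j * dW, i * dH + kH - 1, j * dW + kW - 1).max();
  }
}
```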
Hi, so I haven't gone through the file yet; I think I need to pay a bit more attention to why some of the changes were made. I'll go over them tomorrow.
Until then, I have a couple of questions:
- How was this tested?
- Can you share the results with BLAS and OpenBLAS?
- Do you mind sharing the testing script?
Sorry to bother you with so many questions. I would love to see convolution improve; however, something I've learned is that benchmarking, especially correct benchmarking, is not so easy.
Some other compiler flags that you might find useful:
-O3 -fopenmp -DNDEBUG -DARMA_NO_DEBUG
arma::cube inputTemp = arma::cube(const_cast<arma::Mat<eT>&>(input).memptr(),
    inputWidth, inputHeight, inSize * batchSize, false, false);
Any particular reason why we need to make this change? I might be missing something. Thanks.
Sorry about that, it's a merge issue. I'll revert that part.
outputTemp.zeros();

for (size_t outMap = 0, outMapIdx = 0, batchCount = 0; outMap <
    outSize * batchSize; outMap++)
arma::Cube<eT> inp = (padWLeft | padWRight | padHTop | padHBottom) != 0 ?
    inputPaddedTemp : inputTemp;
We generally prefer names that are much more intuitive. Let me know what you think.
I needed that variable earlier, but don't anymore. Will remove it.
Also, kindly refer to the style guidelines here. Hmm, looks like the convolution causes an out-of-bounds index error. Let me know what you think. |
That's odd. Tests were passing on my machine. I'll implement the style guide, check the tests & get back. |
Hey @adiwajshing, something I forgot to mention yesterday: size_t causes problems with OpenMP on some devices, so you should use omp_size_t instead. |
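(The usual reason behind this: MSVC only implements OpenMP 2.0, which requires a signed loop index, so `size_t` cannot be used directly as the index of a parallel loop on Windows. A sketch of the kind of alias involved is below; the exact typedef mlpack uses may differ.)

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of the workaround: pick a signed type on Windows (OpenMP 2.0 rejects
// unsigned loop indices) and plain size_t everywhere else.
#ifdef _WIN32
  typedef std::intmax_t omp_size_t;  // signed, wide enough for element counts
#else
  typedef std::size_t omp_size_t;
#endif
```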
- Using omp_size_t instead of size_t when using OpenMP
- Convolution::Backward() bug fix
- Parallelized Convolution::Gradient()
@kartikdutt18 thanks a lot! I was wondering what problem OpenMP had on Windows. Anyway, I benchmarked everything with the -O3 optimization (the rest I had already done), and the results were quite interesting. I have been using this mlpack script for my testing. When the iterations per epoch are < 1000, the performance is almost exactly the same, with or without my changes. However, when the iterations per epoch are >= 10000, I get at least 2x performance with these changes. Here are some results with -O3 & 39000 iterations per epoch: Without parallelization: With parallelization |
@@ -55,28 +55,32 @@ class NaiveConvolution
const size_t dW = 1,
const size_t dH = 1,
const size_t dilationW = 1,
const size_t dilationH = 1)
const size_t dilationH = 1, const size_t appending = false)
Added an appending option; when appending=true, the convolution just adds to the output instead of allocating a new matrix and then adding to it. This spares an allocation & some CPU time.
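(A minimal sketch of what the flag changes, assuming a plain valid convolution without stride or dilation; the function name and signature below are illustrative, not the real NaiveConvolution member.)

```cpp
#include <armadillo>

// Sketch: with appending == false the output is sized and zeroed first; with
// appending == true the result is accumulated into the caller's existing
// output, saving one allocation and zero-fill per call.
template<typename eT>
void NaiveValidConvolutionSketch(const arma::Mat<eT>& input,
                                 const arma::Mat<eT>& filter,
                                 arma::Mat<eT>& output,
                                 const bool appending = false)
{
  if (!appending)
    output.zeros(input.n_rows - filter.n_rows + 1,
                 input.n_cols - filter.n_cols + 1);

  for (size_t j = 0; j < output.n_cols; ++j)
    for (size_t i = 0; i < output.n_rows; ++i)
      for (size_t kj = 0; kj < filter.n_cols; ++kj)
        for (size_t ki = 0; ki < filter.n_rows; ++ki)
          output(i, j) += filter(ki, kj) * input(i + ki, j + kj);
}
```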
I think appending should be a bool value then.
sure, will change it to that then.
Would you have any idea why the build is now failing on 'mlpack.mlpack' while the rest is passing? The build section on Azure is just empty. Thank you |
Hi, It's unrelated to your PR. Thanks. |
Hello, thanks for the contribution. I didn't look in detail yet. I just added some minor comments regarding the style and a couple of questions.
CMakeLists.txt
Outdated
@@ -481,14 +481,17 @@ if (OPENMP_FOUND)
add_definitions(-DHAS_OPENMP)
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${OpenMP_C_FLAGS}")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}")
if("${CMAKE_CXX_COMPILER_ID}" STREQUAL "AppleClang")
set(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} /usr/local/Cellar/llvm/9.0.1/lib/libomp.dylib")
Probably this was a wrong git add.
So, this line is required when building with Xcode: libomp.dylib has to be linked. In hindsight, it's too specific to the version of LLVM; maybe you could suggest a better way to do this?
Unfortunately, I don't know since I've never used OSX. I think the cmake configuration file shouldn't depend on a particular system.
I think you can solve the issue by means of cmake variables. Try to modify the compiler flags or the environment:
# I am not quite sure which variable you need.
cmake -D CMAKE_SHARED_LINKER_FLAGS=/usr/local/Cellar/llvm/9.0.1/lib/libomp.dylib path/to/sources
# Probably cmake scans the standard variables.
LDFLAGS=/usr/local/Cellar/llvm/9.0.1/lib/libomp.dylib cmake path/to/sources
else ()
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /wd4068")
endif ()
set(OpenMP_CXX_FLAGS "")
set(OpenMP_CXX_FLAGS "")
Looks like the previous indentation was correct.
const eT *kernelPtr = filter.memptr(), *inputPtr;
size_t j, i, kj, ki;
const size_t o_cols = output.n_cols, o_rows = output.n_rows;
const size_t f_cols = filter.n_cols, f_rows = filter.n_rows;

for (size_t j = 0; j < output.n_cols; ++j)
for (j = 0; j < o_cols; ++j)
{
for (size_t i = 0; i < output.n_rows; ++i, outputPtr++)
for (i = 0; i < o_rows; ++i, outputPtr++)
{
const eT* kernelPtr = filter.memptr();
for (size_t kj = 0; kj < filter.n_cols; ++kj)
for (kj = 0; kj < f_cols; ++kj)
{
const eT* inputPtr = input.colptr(kj * dilationW + j * dW) + i * dH;
for (size_t ki = 0; ki < filter.n_rows; ++ki, ++kernelPtr,
inputPtr += dilationH)
inputPtr = input.colptr(kj * dilationW + j * dW) + i * dH;
for (ki = 0; ki < f_rows; ++ki, ++kernelPtr, inputPtr += dilationH)
*outputPtr += *kernelPtr * (*inputPtr);
}
kernelPtr -= f_rows * f_cols;
I didn't get the point. What's the purpose of these changes? Could you elaborate a bit?
I was working with no optimization, so these changes made an incremental difference. However, these are just small optimizations the compiler probably would have done anyway. I can remove them if required.
So, I ran a test with this change. There does not seem to be a difference when optimization is enabled.
However, I then parallelized the function when the image is large enough. Here are the results:
INPUT SIZE: 16
FILTER SIZE 3: old: 0.038158s new: 0.036094s
FILTER SIZE 5: old: 0.044838s new: 0.057577s
FILTER SIZE 7: old: 0.066811s new: 0.063331s
INPUT SIZE: 32
FILTER SIZE 3: old: 0.127948s new: 0.140982s
FILTER SIZE 5: old: 0.279254s new: 0.405592s
FILTER SIZE 7: old: 0.625482s new: 0.604379s
INPUT SIZE: 48
FILTER SIZE 3: old: 0.528458s new: 0.640729s
FILTER SIZE 5: old: 0.718363s new: 0.572617s
FILTER SIZE 7: old: 1.00586s new: 0.969761s
INPUT SIZE: 64
FILTER SIZE 3: old: 0.503006s new: 0.729201s
FILTER SIZE 5: old: 1.10458s new: 1.09253s
FILTER SIZE 7: old: 1.80471s new: 0.791141s
INPUT SIZE: 80
FILTER SIZE 3: old: 0.830632s new: 0.466671s
FILTER SIZE 5: old: 1.58484s new: 0.782858s
FILTER SIZE 7: old: 2.84117s new: 1.71891s
INPUT SIZE: 96
FILTER SIZE 3: old: 1.21493s new: 0.706584s
FILTER SIZE 5: old: 2.2874s new: 1.13919s
FILTER SIZE 7: old: 4.15312s new: 1.8547s
INPUT SIZE: 112
FILTER SIZE 3: old: 1.53625s new: 0.901717s
FILTER SIZE 5: old: 3.10605s new: 1.57691s
FILTER SIZE 7: old: 6.06146s new: 4.87558s
INPUT SIZE: 128
FILTER SIZE 3: old: 2.34454s new: 1.24114s
FILTER SIZE 5: old: 4.60348s new: 2.96314s
FILTER SIZE 7: old: 9.58979s new: 7.50299s
INPUT SIZE: 144
FILTER SIZE 3: old: 3.30705s new: 2.80564s
FILTER SIZE 5: old: 6.0074s new: 2.97862s
FILTER SIZE 7: old: 9.66638s new: 4.33965s
INPUT SIZE: 160
FILTER SIZE 3: old: 3.08419s new: 1.82183s
FILTER SIZE 5: old: 6.39535s new: 3.26939s
FILTER SIZE 7: old: 12.2366s new: 6.49565s
INPUT SIZE: 176
FILTER SIZE 3: old: 4.12684s new: 3.04944s
FILTER SIZE 5: old: 9.23612s new: 4.31397s
FILTER SIZE 7: old: 16.8792s new: 6.71569s
INPUT SIZE: 192
FILTER SIZE 3: old: 4.55307s new: 3.37878s
FILTER SIZE 5: old: 9.91482s new: 5.1516s
FILTER SIZE 7: old: 17.925s new: 7.8205s
INPUT SIZE: 208
FILTER SIZE 3: old: 6.43632s new: 4.4773s
FILTER SIZE 5: old: 13.3028s new: 7.28228s
FILTER SIZE 7: old: 22.9444s new: 10.3726s
INPUT SIZE: 224
FILTER SIZE 3: old: 5.9436s new: 3.52242s
FILTER SIZE 5: old: 12.6614s new: 6.46533s
FILTER SIZE 7: old: 24.3096s new: 11.6745s
INPUT SIZE: 240
FILTER SIZE 3: old: 6.78909s new: 3.9888s
FILTER SIZE 5: old: 14.444s new: 7.399s
FILTER SIZE 7: old: 27.606s new: 12.273s
INPUT SIZE: 256
FILTER SIZE 3: old: 8.25704s new: 5.41013s
FILTER SIZE 5: old: 17.2394s new: 9.49333s
FILTER SIZE 7: old: 32.0725s new: 14.911s
Here is a link to the testing script
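(The "parallelize only when the image is large enough" idea above can be expressed with OpenMP's `if` clause. The sketch below is illustrative only: the 64 * 64 threshold and the loop shape are assumptions, not the exact code from this PR.)

```cpp
#include <armadillo>

// Sketch: threads are only spawned when the output is big enough to amortise
// the startup cost; otherwise the loop runs sequentially.
template<typename eT>
void ConvolveIfLargeSketch(const arma::Mat<eT>& input,
                           const arma::Mat<eT>& filter,
                           arma::Mat<eT>& output)
{
  output.zeros(input.n_rows - filter.n_rows + 1,
               input.n_cols - filter.n_cols + 1);

  // Each thread writes to its own output columns, so there is no data race.
  #pragma omp parallel for if (output.n_elem > 64 * 64)
  for (long long j = 0; j < (long long) output.n_cols; ++j)
    for (size_t i = 0; i < output.n_rows; ++i)
      for (size_t kj = 0; kj < filter.n_cols; ++kj)
        for (size_t ki = 0; ki < filter.n_rows; ++ki)
          output(i, j) += filter(ki, kj) * input(i + ki, j + kj);
}
```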
That's interesting. I'll look into it. I thought the compiler was able to optimize the old code.
Probably, but it won't be able to automatically parallelize it.
#pragma omp parallel for
for (omp_size_t i = 0; i < sample.n_elem; i++)
This loop contains a pretty simple operation. I think in this case the performance depends on the memory clock speed rather than on the CPU. What do you think?
I think it would depend on the number of samples, right? If there are maybe only 10 samples, there may not be a difference in performance. However, as the number increases & the overhead of creating the threads becomes less & less relevant, we would see a greater increase in performance.
I deleted the last comment due to wrong values.
Added another comment. Now the test seems correct.
You didn't take into account that RAM frequency is much lower than CPU frequency and the loop requires 2 new values each iteration. Actually, there are a great number of factors such as memory frequency, memory bandwidth, CPU frequency, CPU cache size and so on.
I wrote a simple test which measures the duration of a similar loop.
https://gist.github.com/lozhnikov/3486432717ea04f25a722c97fbd79edd
(Weird, I couldn't upload the file, so I created a gist)
Here are the results:
I used a system with core-i7 2600K and 16GB DDR3 memory.
Look at the "Parallel (s)" and "Sequential (s)" columns.
g++ 9.2.1
g++ -O3 -fopenmp speedup_test.cpp -lgomp -o speedup_test_g++
./speedup_test_g++
Size Count Par. Parallel (s) Count Seq. Sequential (s)
10 6 1.880000e-07 4 5.900000e-08
100 49 1.600000e-07 49 1.070000e-07
1000 502 1.101000e-06 500 8.470000e-07
10000 4964 1.166800e-05 4968 8.230000e-06
100000 49841 2.085290e-04 49958 8.180200e-05
1000000 500518 1.397976e-03 499577 1.487266e-03
10000000 4997717 1.245767e-02 4998914 1.237367e-02
100000000 49997606 1.240536e-01 49997601 1.196321e-01
clang 9.0.1
clang++ -O3 -fopenmp speedup_test.cpp -lgomp -o speedup_test_clang++
./speedup_test_clang++
Size Count Par. Parallel (s) Count Seq. Sequential (s)
10 4 4.780000e-07 5 1.270000e-07
100 47 2.260000e-07 51 7.800000e-08
1000 520 1.019000e-06 489 2.660000e-07
10000 4965 1.011800e-05 4993 3.656000e-06
100000 49904 6.398400e-05 49987 4.442300e-05
1000000 499302 1.313890e-03 498852 1.307853e-03
10000000 4999504 1.020465e-02 4998569 1.076408e-02
100000000 50005009 9.804719e-02 49992123 9.778105e-02
There's hardly any difference.
You're right. I just ran a similar test on my machine:
1.6GHz dual-core Core i5, 4GB DDR3 mem
I underestimated the amount of data you would need to see a significant difference. You only see a real difference when the number of items is 10M+
If the comparison is really quick, then I can remove the parallel part, and reduce overhead. What do you say?
Yeah, I think you can remove the parallel part here.
outMapIdx = 0;
}
size_t outMapIdx = (outMap % outSize) * inSize, batchCount = outMap/outSize;
arma::Mat<eT> &curSlice = outputTemp.slice(outMap);
Just a tiny issue. According to the style guide, the reference should be written attached to the data type.
arma::Mat<eT> &curSlice = outputTemp.slice(outMap);
arma::Mat<eT>& curSlice = outputTemp.slice(outMap);
Sorry about the style issues, I'll put in all these changes in the next commit
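(For readers following the diff above: the substantive change is that outMapIdx and batchCount are now derived from outMap on each iteration instead of being carried across iterations, which is what makes the loop safe to parallelize. A hedged sketch of the pattern, with a placeholder body, follows.)

```cpp
#include <cstddef>

// Sketch of why the index computation matters for parallelisation.  The old
// sequential bookkeeping
//
//   if (outMap != 0 && outMap % outSize == 0) { batchCount++; outMapIdx = 0; }
//
// carries state from one iteration to the next, so the loop cannot simply be
// marked "#pragma omp parallel for".  Deriving both values from outMap makes
// every iteration independent.
void ProcessOutputMapsSketch(const std::size_t outSize,
                             const std::size_t inSize,
                             const std::size_t batchSize)
{
  #pragma omp parallel for
  for (long long outMap = 0;
       outMap < (long long) (outSize * batchSize); ++outMap)
  {
    const std::size_t outMapIdx = (outMap % outSize) * inSize;
    const std::size_t batchCount = outMap / outSize;

    // ... convolve the input maps starting at outMapIdx for batch element
    // batchCount, writing only to slice outMap of the output ...
    (void) outMapIdx;
    (void) batchCount;
  }
}
```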
inputHeight, inSize * batchSize, false, false);
arma::cube inputTemp;
if (padWLeft != 0 || padWRight != 0 || padHTop != 0 || padHBottom != 0) {
inputTemp = inputPaddedTemp;
Probably I missed something. Where was inputPaddedTemp defined?
Oh, it's a property of the Convolution class
I have just shifted the checking for padding outside the for loop.
{
batchCount++;
outMapIdx = 0;
arma::Mat<eT> &curGradTemp = gradientTemp.slice(outMapIdx+inMap);
Just a tiny style issue. There are the same issues below.
arma::Mat<eT> &curGradTemp = gradientTemp.slice(outMapIdx+inMap);
arma::Mat<eT>& curGradTemp = gradientTemp.slice(outMapIdx+inMap);
arma::Mat<eT> output;
GradientConvolutionRule::Convolution(inputSlice, deltaSlice,
output, strideWidth, strideHeight);
output, strideWidth, strideHeight);
Actually, according to the style guide the indentation was correct:)
arma::Mat<eT> rotatedFilter;
Rotate180(weight.slice(outMapIdx+inMap), rotatedFilter);
#pragma omp for
for (omp_size_t batchCount = 0; batchCount < batchSize; batchCount++) {
Sorry, I didn't look in detail yet. Why did you change the nesting of the loops? I was wondering if it provides any cache optimizations.
Yes, we avoid extra computations. Earlier, we were retrieving and rotating the same weight slice for every batch, but now we only do it once for all batches.
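(A sketch of the restructuring being described: each weight slice is rotated once and then reused for every batch element inside a parallel loop, instead of being re-rotated per batch. Rotate180Sketch, the slice indexing, and the conv2 call are stand-ins for the real mlpack helpers, not the exact code.)

```cpp
#include <armadillo>

// Stand-in for the layer's Rotate180 helper: rotate a matrix by 180 degrees.
template<typename eT>
void Rotate180Sketch(const arma::Mat<eT>& input, arma::Mat<eT>& output)
{
  output = arma::fliplr(arma::flipud(input));
}

// Sketch of the restructured Backward() loops.
template<typename eT>
void BackwardSketch(const arma::Cube<eT>& weight,
                    const arma::Cube<eT>& delta,
                    arma::Cube<eT>& gTemp,
                    const size_t outSize, const size_t inSize,
                    const size_t batchSize)
{
  for (size_t outMap = 0; outMap < outSize; ++outMap)
  {
    for (size_t inMap = 0; inMap < inSize; ++inMap)
    {
      // Rotate the filter once and reuse it for every batch element.
      arma::Mat<eT> rotatedFilter;
      Rotate180Sketch(weight.slice(outMap * inSize + inMap), rotatedFilter);

      // Each batch element accumulates into its own slice, so the parallel
      // iterations never write to the same memory.
      #pragma omp parallel for
      for (long long batchCount = 0; batchCount < (long long) batchSize;
           ++batchCount)
      {
        gTemp.slice(inMap + batchCount * inSize) +=
            arma::conv2(delta.slice(outMap + batchCount * outSize),
                        rotatedFilter, "full");
      }
    }
  }
}
```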
#pragma omp parallel for
for (omp_size_t i = 0; i < input.n_elem; i++)
Again, I think this loop contains a pretty simple operation. I guess in this case the performance depends on the memory clock speed rather than on the CPU.
Again, I think it would depend on the size of the input, right?
- Removed parallel loop from bernoulli_distribution & leaky_relu
- Fixed data race in convolution_impl
- Conform to style guide
- Parallelized naive convolution
This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍 |
Hi, is there any progress on this? |
This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍 |
Hi Guys, was this ever merged? |
Hey @adiwajshing, sorry, it looks like this one kind of fell off the list a little bit. Are you still interested in it? We can reopen the PR and I can try and review it and get it merged. @FabioMBB said he had some success with it. |
@rcurtin sure -- can review |
I tried the PR locally and saw some really nice speedup (3x). I merged master into this branch---hopefully I didn't mess anything up during the merge process. 👍 I think that we should try to figure out what's wrong here and then incorporate it, because based on every benchmark we've seen here it's a nice speedup. |
@rcurtin good to hear. I'll see where the issue is over the next few days and try and resolve it. Can you give me a specific test I can try out and what you expect out of it? |
@adiwajshing yeah, so what I did was build and run the tests on both branches. Here's master:
and here's this branch:
So you can see that something's different, and my assumption at this point is that there's some small difference in the convolution code somewhere. However, it doesn't seem like there is a failing test case. If you don't get a chance, I'll try and dig in, but it may be a handful of days before I have the chance. 👍 |
This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍 |
Hello all
I was playing around with CNNs in mlpack on the MNIST data set, and for some reason only 1 core of my CPU was being used while training. When I looked into the code, the Convolutional and Pooling layers were not parallelized, so I did exactly that using OpenMP. I also made a few incremental improvements to the NaiveConvolution class & the Convolution class. I have also successfully run the test suite on my changes.
With the changes, on my dual-core MacBook Air, I could train in about 55% of the time it was taking earlier. Here are some results:
Without the parallelization:
Reading data ...
Training ...
Epoch 0: Training Accuracy = 9.44709%, Validation Accuracy = 9.40476%, time taken: 80s
Epoch 1: Training Accuracy = 15.0106%, Validation Accuracy = 14.9286%, time taken: 85s
Epoch 2: Training Accuracy = 19.3148%, Validation Accuracy = 19.119%, time taken: 107s
Epoch 3: Training Accuracy = 24.0661%, Validation Accuracy = 23.5714%, time taken: 94s
Epoch 4: Training Accuracy = 28.6349%, Validation Accuracy = 28.4048%, time taken: 88s
Predicting ...
total time taken: 454; avg. epoch duration: 90.8s
With the parallelization:
Reading data ...
Training ...
Epoch 0: Training Accuracy = 9.7672%, Validation Accuracy = 9.83333%, time taken: 49s
Epoch 1: Training Accuracy = 14.4127%, Validation Accuracy = 14.619%, time taken: 47s
Epoch 2: Training Accuracy = 19.6376%, Validation Accuracy = 19.0952%, time taken: 47s
Epoch 3: Training Accuracy = 25.0317%, Validation Accuracy = 24.5%, time taken: 47s
Epoch 4: Training Accuracy = 29.8677%, Validation Accuracy = 29.9286%, time taken: 47s
Predicting ...
total time taken: 237; avg. epoch duration: 47.4s
Without the parallelization:
Training ...
Epoch 0: Training Accuracy = 82.8704%, Validation Accuracy = 82.1429%, time taken: 361s
(didn't bother to run more because it was taking so long :/)
With the parallelization:
Reading data ...
Training ...
Epoch 0: Training Accuracy = 82.4815%, Validation Accuracy = 81.9286%, time taken: 225s
Epoch 1: Training Accuracy = 88.8439%, Validation Accuracy = 88.5952%, time taken: 229s
Epoch 2: Training Accuracy = 91.6323%, Validation Accuracy = 91.7857%, time taken: 220s
Epoch 3: Training Accuracy = 93.2116%, Validation Accuracy = 93.1905%, time taken: 221s
Epoch 4: Training Accuracy = 94.3386%, Validation Accuracy = 93.8333%, time taken: 220s
Predicting ...
total time taken: 1115; avg. epoch duration: 223s
If this looks good, I could clean up my changes and extend this to the AtrousConvolution and other classes that need changes.