[MLLIB] [spark-2352] Implementation of an Artificial Neural Network (ANN) #1290
Conversation
@bgreeven Please add
Jenkins, test this please.
Jenkins, retest this please.
QA tests have started for PR 1290. This patch merges cleanly.
QA results for PR 1290:
@bgreeven Are you continuing work on this pull request so that it passes all unit tests?
Hi Matthew, sure, I can. I was on holiday during the last two weeks, but I am now back in the office. I'll update the code this week. Best regards, Bert
I updated the two sources to comply with "sbt/sbt scalastyle". Maybe retry the unit tests with the new modifications?
Jenkins, add to whitelist.
Jenkins, test this please.
@bgreeven Jenkins will be automatically triggered for future updates.
@bgreeven The filename
Thanks a lot! I have added the extension now.
Hi Bert, I want to try your ANN on Spark but could not find it in the latest clone. It's probably not there yet, despite the successful tests and merge messages above (10 days ago). How can I get a copy of your ANN code and try it out? Thanks,
Hung,
SteepestDescend should be SteepestDescent!
SteepestDescend -> SteepestDescent can be changed. Thanks for noticing. Hung Pham, did it work out for you now?
Yes, I forked your repository and can see the code now. One question:
The ANN uses the existing GradientDescent from mllib.optimization for back propagation. It uses the gradient from the new LeastSquaresGradientANN class, and updates using the new ANNUpdater class. This line in ANNUpdater.compute is the backbone of the back propagation: brzAxpy(-thisIterStepSize, gradient.toBreeze, brzWeights)
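(Not the PR's actual code, but a minimal sketch of what a plain gradient-step updater built around that axpy call might look like; the class name SimpleANNUpdater is illustrative, and the real ANNUpdater may scale the step size per iteration, which this sketch omits.)

```scala
import breeze.linalg.{axpy => brzAxpy, DenseVector => BDV}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.optimization.Updater

// Hypothetical updater: applies weights := weights - stepSize * gradient,
// with no regularization, mirroring the brzAxpy line quoted above.
class SimpleANNUpdater extends Updater {
  override def compute(
      weightsOld: Vector,
      gradient: Vector,
      stepSize: Double,
      iter: Int,
      regParam: Double): (Vector, Double) = {
    val brzWeights = BDV(weightsOld.toArray.clone())
    brzAxpy(-stepSize, BDV(gradient.toArray), brzWeights)
    (Vectors.dense(brzWeights.data), 0.0)
  }
}
```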
I finally see the backprop code in the two for loops inside. Thanks, Bert.
@bgreeven I've tried to train
@avulanov If it would help speed this up, I can test or benchmark on some EC2 instances we have, which run on Mesos. If you want to suggest a general dataset to use, we could work something out.
@hntd187 It would be good to discuss this. Currently I plan to use mnist8m and a 6-layer network 784-2500-2000-1500-1000-500-10, which is the best fully-connected configuration for MNIST from http://yann.lecun.com/exdb/mnist/. However, I am still looking for a more modern dataset, probably with more features, and a corresponding configuration. Are you aware of any?
@avulanov To be perfectly honest, does the "modern-ness" of the dataset really matter? This dataset has been a standard in this area for a long time, so it seems perfectly reasonable to use it: most people working in this area would recognize the data and roughly know how to compare it to their own implementation.
@hntd187 This is true, however it seems that "modern" datasets tend to have more features, so the 784 features of MNIST might seem too few these days. Anyway, the basic idea of the benchmark is as follows: compare the performance of Caffe and this implementation, both in CPU and GPU mode, with different numbers of nodes (workers) for Spark. Performance should be measured in samples/second processed. Here comes another problem: the data formats supported by Spark and Caffe do not intersect. I can convert mnist8m (libsvm) to HDF5 for Caffe, but it will have a different size, which means Caffe will read a different amount of data from disk. Do you have an idea how to handle this problem?
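(Since the benchmark measures samples/second, a rough timing sketch could look like the following. This is only an illustration: trainData is assumed to be the mnist8m RDD already loaded, the hidden-layer array follows the topology mentioned above, and the train signature (data, hiddenLayerSizes, maxIterations, tolerance) is taken from the call quoted later in this thread.)

```scala
// Illustrative only: time a single training pass and report throughput.
// trainData: RDD of (input, output) vector pairs, assumed to be loaded already.
val numSamples = trainData.count()
val hiddenLayers = Array(2500, 2000, 1500, 1000, 500) // topology from the comment above
val start = System.nanoTime()
val model = ArtificialNeuralNetwork.train(trainData, hiddenLayers, 1, 1e-4)
val elapsedSec = (System.nanoTime() - start) / 1e9
println(f"Throughput: ${numSamples / elapsedSec}%.1f samples/second")
```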
@avulanov Can Spark even read an HDF5 file, or would we have to write that as well? I can't donate any professional time to this conversion problem, but I may be able to assist if we wanted to write a conversion independently. I suppose the problem here is, even if we get HDF5 and run it in Caffe, how would we get Spark to use it? Reading a bit online and looking around, the consensus seems to be to use the pyhdf5 library to read the files in and do a flatMap to RDDs, but that seems horribly inefficient on a large dataset, and we'd be shooting ourselves in the foot trying to make that scale. So I think our best bet, if we want to compare to Caffe, is either to get Caffe to read another format or to add HDF5 reading capability to Spark, either via a hack or an actual contribution. The first is not ideal; the second is obviously more time consuming.
@hntd187 Thanks for the suggestion; it seems that implementing the HDF5 reader for Spark is the most reasonable option. I need to think about what the minimum viable implementation would be. @thvasilo You should consider using the latest version, https://github.com/avulanov/spark/tree/ann-interface-gemm and also the DBN from https://github.com/witgo/spark/tree/ann-interface-gemm-dbn
@avulanov Would you like to split some of this work up or do you want to tackle this alone?
@hntd187 Any help is really appreciated. We can split it into two functions: read and write. A good place to implement them is MLUtils, as saveAsHDF5 and loadHDF5.
@avulanov How about I take the read and you take the write? In an ideal world we should be able to take the implementation from here https://github.com/h5py/h5py and load it into some form of RDD. Here are the Java tools for HDF5 http://www.hdfgroup.org/products/java/release/download.html which are the bindings for the file format; hopefully, given the implementations out there, this should be pretty straightforward.
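(For reference, a hypothetical sketch of the shape the proposed additions could take, signatures only: the names follow the saveAsHDF5/loadHDF5 suggestion above, the object name HDF5Utils is made up, and the actual HDF5 I/O would go through the Java bindings linked here.)

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical API shape only; bodies intentionally left unimplemented.
object HDF5Utils {
  /** Save labeled points to an HDF5 file, e.g. as "label" and "data" datasets. */
  def saveAsHDF5(data: RDD[LabeledPoint], path: String): Unit = ???

  /** Load labeled points from an HDF5 file written by saveAsHDF5. */
  def loadHDF5(sc: SparkContext, path: String): RDD[LabeledPoint] = ???
}
```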
@avulanov Also, we're going to have to add a dependency on the HDF5 library. I think this should be handled the way netlib is handled, with the user having to enable a profile when building Spark. So normally it wouldn't be available, but if you build with it you can use it. I'll update the POM to account for that.
@hntd187 Thanks for the links. I am not sure that the presence of the HDF5 library should be handled at the compilation step, because there will be no fallback for the functions we are going to implement, unlike netlib (which falls back to a Java implementation if you don't include the JNI binaries). Let's continue our discussion here https://issues.apache.org/jira/browse/SPARK-8449
Hi @avulanov, I have forked your repository for ann-benchmark https://github.com/avulanov/ann-benchmark/blob/master/spark/spark.scala . I feel a little confused about the mini-batch training; it seems that
@Myasuka Thank you for your interest in the benchmark. The goal of the benchmark is to measure the scalability of my implementation and to compare its efficiency with other tools, such as Caffe. I measure the time needed for one epoch of batch gradient descent on a large network with ~12M parameters. I don't measure the convergence rate or the accuracy, because they are very use-case specific and don't directly show how scalable a particular machine learning tool is. The benchmark could be improved, though, and I am working on it, so thank you for your suggestion.
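(As a quick sanity check, not from the thread itself: the parameter count for the 784-2500-2000-1500-1000-500-10 topology mentioned earlier, counting weights plus biases for each pair of adjacent layers, indeed comes out to roughly 12M.)

```scala
// Count weights and biases between each pair of adjacent layers.
val layers = Array(784, 2500, 2000, 1500, 1000, 500, 10)
val numParams = layers.sliding(2).map { case Array(in, out) => in * out + out }.sum
println(numParams) // 11972510, i.e. about 12M parameters
```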
@avulanov I tried to run a test on MNIST data with the SGD optimizer, however I cannot reproduce the result in #1290 (comment). I use a topology of (780, 10), and set batchSize to 1000, miniBatchFraction to 1.0, and numIterations to 500; the accuracy is only 75%. If I set miniBatchFraction to 0.1, the accuracy still stays at 75%. Would you please share your training parameters in detail so that I can raise the accuracy to 90%?
@Myasuka LBFGS was used in the mentioned experiment. SGD needs more iterations to converge in this case.
Hey guys, I want to start off by saying thank you for this piece of code. The ANN has been working beautifully so far. I have one question though: when I run the training on a dataset I imported from AWS S3, the logging says "Opening s3://blabla.txt for reading" over and over again. I interpret this as the program opening the S3 file many times, instead of just once. Is this true? Wouldn't it be much faster if the file was only opened once?
@bnoreus Thank you for your feedback. This code does not implement any file-related operations; it works with RDDs only. I assume the logging comes from other code you are using.
This pull request contains the following feature for ML:
- Multilayer Perceptron classifier

This implementation is based on our initial pull request with bgreeven: #1290 and inspired by very insightful suggestions from mengxr and witgo (I would like to thank all other people from the mentioned thread for useful discussions). The original code was extensively tested and benchmarked. Since then, I've addressed two main requirements that prevented the code from merging into the main branch:
- Extensible interface, so it will be easy to implement new types of networks
  - Main building blocks are traits `Layer` and `LayerModel`. They are used for constructing layers of ANN. New layers can be added by extending the `Layer` and `LayerModel` traits. These traits are private in this release in order to save path to improve them based on community feedback
  - Back propagation is implemented in general form, so there is no need to change it (optimization algorithm) when new layers are implemented
- Speed and scalability: this implementation has to be comparable in terms of speed to the state of the art single node implementations.
  - The developed benchmark for large ANN shows that the proposed code is on par with C++ CPU implementation and scales nicely with the number of workers. Details can be found here: https://github.com/avulanov/ann-benchmark
- DBN and RBM by witgo https://github.com/witgo/spark/tree/ann-interface-gemm-dbn
- Dropout https://github.com/avulanov/spark/tree/ann-interface-gemm

mengxr and dbtsai kindly agreed to perform code review.

Author: Alexander Ulanov <nashb@yandex.ru>
Author: Bert Greevenbosch <opensrc@bertgreevenbosch.nl>

Closes #7621 from avulanov/SPARK-2352-ann and squashes the following commits:
4806b6f [Alexander Ulanov] Addressing reviewers comments.
a7e7951 [Alexander Ulanov] Default blockSize: 100. Added documentation to blockSize parameter and DataStacker class
f69bb3d [Alexander Ulanov] Addressing reviewers comments.
374bea6 [Alexander Ulanov] Moving ANN to ML package. GradientDescent constructor is now spark private.
43b0ae2 [Alexander Ulanov] Addressing reviewers comments. Adding multiclass test.
9d18469 [Alexander Ulanov] Addressing reviewers comments: unnecessary copy of data in predict
35125ab [Alexander Ulanov] Style fix in tests
e191301 [Alexander Ulanov] Apache header
a226133 [Alexander Ulanov] Multilayer Perceptron regressor and classifier
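(For readers who want to try the merged result: a minimal usage sketch of the spark.ml classifier introduced by #7621, assuming `train` and `test` are DataFrames with "features" and "label" columns; the layer sizes and iteration count here are arbitrary example values, not from the thread.)

```scala
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

// Layers: input size, one hidden layer, number of output classes (example values).
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(784, 100, 10))
  .setBlockSize(100)   // data stacking block size; 100 is the default mentioned above
  .setMaxIter(100)
  .setSeed(1234L)

val model = mlp.fit(train)
val predictions = model.transform(test)
```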
@bgreeven We recently merged #7621 from @avulanov. Under the hood, it contains the ANN implementation based on this PR. Additional features should come in follow-up PRs. So do you mind closing this PR for now? We can move the discussion to the JIRA page on individual features. Thanks a lot for your contribution and everyone for the discussion!
Merged build finished. Test FAILed.
@bgreeven Hi, I tried to train the model using this implementation and found a weird outcome in the output from LBFGS, as follows. I launch model training with: var model = ArtificialNeuralNetwork.train(trainData, Array(2, 3), 5000, 1e-8) The problem is that the training process iterates only a few steps before returning. Obviously the error on the validation set is too large to satisfy expectations. What is the problem?
I closed this PR. We can use Apache JIRA to continue discussion on individual issues.
The code contains a multi-layer ANN implementation, with a variable number of inputs, outputs and hidden nodes. It takes as input an RDD of vector pairs, corresponding to the training set with inputs and outputs.
In addition to two automated tests, an example program is included, which also contains a graphical representation.
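(As a rough illustration of that interface, here is a hedged sketch of training on XOR data, based on the train call quoted earlier in the thread, ArtificialNeuralNetwork.train(data, hiddenLayerSizes, maxIterations, tolerance); the predict method name and the exact parameter meanings are assumptions, not taken from the merged code.)

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical usage; assumes the PR-branch ArtificialNeuralNetwork API.
def xorExample(sc: SparkContext): Unit = {
  // Training set: (input vector, target output vector) pairs for XOR.
  val samples = Seq(
    (Vectors.dense(0.0, 0.0), Vectors.dense(0.0)),
    (Vectors.dense(0.0, 1.0), Vectors.dense(1.0)),
    (Vectors.dense(1.0, 0.0), Vectors.dense(1.0)),
    (Vectors.dense(1.0, 1.0), Vectors.dense(0.0)))
  val trainData = sc.parallelize(samples)

  // One hidden layer of 5 nodes, up to 1000 iterations, tolerance 1e-4 (assumed meanings).
  val model = ArtificialNeuralNetwork.train(trainData, Array(5), 1000, 1e-4)
  samples.foreach { case (in, _) => println(s"$in -> ${model.predict(in)}") }
}
```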