[MLLIB] [spark-2352] Implementation of an Artificial Neural Network (ANN) #1290
Conversation
@bgreeven Please add
Jenkins, test this please.
Jenkins, retest this please.
QA tests have started for PR 1290. This patch merges cleanly.
QA results for PR 1290:
@bgreeven Are you continuing work on this pull request so that it passes all unit tests?
Hi Matthew, sure, I can. I was on holiday during the last two weeks, but I am now back in the office. I'll update the code this week. Best regards, Bert
I updated the two sources to comply with "sbt/sbt scalastyle". Maybe retry the unit tests with the new modifications?
Jenkins, add to whitelist.
Jenkins, test this please.
@bgreeven Jenkins will be automatically triggered for future updates.
@bgreeven The filename
Thanks a lot! I have added the extension now.
Hi Bert, I want to try your ANN on Spark but could not find it in the latest clone. It's probably not there yet, despite the successful tests and merge messages above (10 days ago). How can I get a copy of your ANN code and try it out? Thanks,
Hung,
SteepestDescend should be SteepestDescent!
SteepestDescend -> SteepestDescent can be changed. Thanks for noticing. Hung Pham, did it work out for you now?
Yes, I forked your repository and can see the code now. One question:
The ANN uses the existing GradientDescent from mllib.optimization for back propagation. It uses the gradient from the new LeastSquaresGradientANN class, and updates using the new ANNUpdater class. This line in ANNUpdater.compute is the backbone of the back propagation: brzAxpy(-thisIterStepSize, gradient.toBreeze, brzWeights)
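(Not the PR's actual code, but a minimal sketch of what a plain gradient-step updater built around that axpy call might look like; the class name SimpleANNUpdater is illustrative, and the real ANNUpdater may scale the step size per iteration, which this sketch omits.)

```scala
import breeze.linalg.{axpy => brzAxpy, DenseVector => BDV}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.optimization.Updater

// Hypothetical updater: applies weights := weights - stepSize * gradient,
// with no regularization, mirroring the brzAxpy line quoted above.
class SimpleANNUpdater extends Updater {
  override def compute(
      weightsOld: Vector,
      gradient: Vector,
      stepSize: Double,
      iter: Int,
      regParam: Double): (Vector, Double) = {
    val brzWeights = BDV(weightsOld.toArray.clone())
    brzAxpy(-stepSize, BDV(gradient.toArray), brzWeights)
    (Vectors.dense(brzWeights.data), 0.0)
  }
}
```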
I finally see the backprop code in the two for loops inside. Thanks, Bert.
@bgreeven I've tried to train
@avulanov If it would help speed this up, I can test or benchmark on some EC2 instances we have, which run on Mesos. If you want to suggest a general dataset to use, we could work something out.
@hntd187 It would be good to discuss this. Currently I plan to use mnist8m and a 6-layer network 784-2500-2000-1500-1000-500-10, which is the best fully-connected configuration for MNIST from http://yann.lecun.com/exdb/mnist/. However, I am still looking for a more modern dataset, probably with more features, and a corresponding configuration. Are you aware of any?
@avulanov To be perfectly honest, does the "modern-ness" of the dataset really matter? This dataset has been a standard in this area for a long time, so it seems perfectly reasonable to use it: most people working in this area would recognize the data and roughly know how to compare it to their own implementation.
@hntd187 This is true, however it seems that "modern" datasets tend to have more features, so the 784 features of MNIST might seem too few these days. Anyway, the basic idea of the benchmark is as follows: compare the performance of Caffe and this implementation, both in CPU and GPU mode, with different numbers of nodes (workers) for Spark. Performance should be measured in samples/second processed. Here comes another problem: the data formats supported by Spark and Caffe do not intersect. I can convert mnist8m (libsvm) to HDF5 for Caffe, but it will have a different size, which means Caffe will read a different amount of data from disk. Do you have an idea how to handle this problem?
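(Since the benchmark measures samples/second, a rough timing sketch could look like the following. This is only an illustration: trainData is assumed to be the mnist8m RDD already loaded, the hidden-layer array follows the topology mentioned above, and the train signature (data, hiddenLayerSizes, maxIterations, tolerance) is taken from the call quoted later in this thread.)

```scala
// Illustrative only: time a single training pass and report throughput.
// trainData: RDD of (input, output) vector pairs, assumed to be loaded already.
val numSamples = trainData.count()
val hiddenLayers = Array(2500, 2000, 1500, 1000, 500) // topology from the comment above
val start = System.nanoTime()
val model = ArtificialNeuralNetwork.train(trainData, hiddenLayers, 1, 1e-4)
val elapsedSec = (System.nanoTime() - start) / 1e9
println(f"Throughput: ${numSamples / elapsedSec}%.1f samples/second")
```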
@avulanov Can Spark even read an HDF5 file, or would we have to write that as well? I can't donate any professional time to this conversion problem, but I may be able to assist if we wanted to write a conversion independently. I suppose the problem here is, even if we get HDF5 and run it in Caffe, how would we get Spark to use it? Reading a bit online and looking around, the consensus seems to be to use the pyhdf5 library to read the files in and do a flatMap to RDDs, but that seems horribly inefficient on a large dataset, and we'd be shooting ourselves in the foot trying to make that scale. So I think our best bet, if we want to compare to Caffe, is either to get Caffe to read another format or to add HDF5 reading capability to Spark, either via a hack or an actual contribution. The first is not ideal; the second is obviously more time consuming.
@hntd187 Thanks for the suggestion; it seems that implementing the HDF5 reader for Spark is the most reasonable option. I need to think about what the minimum viable implementation would be. @thvasilo You should consider using the latest version, https://github.com/avulanov/spark/tree/ann-interface-gemm and also the DBN from https://github.com/witgo/spark/tree/ann-interface-gemm-dbn
@avulanov Would you like to split some of this work up or do you want to tackle this alone?
@hntd187 Any help is really appreciated. We can split it into two functions: read and write. A good place to implement them is MLUtils, as saveAsHDF5 and loadHDF5.
@avulanov How about I take the read and you take the write? In an ideal world we should be able to take the implementation from here https://github.com/h5py/h5py and load it into some form of RDD. Here are the Java tools for HDF5 http://www.hdfgroup.org/products/java/release/download.html which are the bindings for the file format; hopefully, given the implementations out there, this should be pretty straightforward.
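(For reference, a hypothetical sketch of the shape the proposed additions could take, signatures only: the names follow the saveAsHDF5/loadHDF5 suggestion above, the object name HDF5Utils is made up, and the actual HDF5 I/O would go through the Java bindings linked here.)

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical API shape only; bodies intentionally left unimplemented.
object HDF5Utils {
  /** Save labeled points to an HDF5 file, e.g. as "label" and "data" datasets. */
  def saveAsHDF5(data: RDD[LabeledPoint], path: String): Unit = ???

  /** Load labeled points from an HDF5 file written by saveAsHDF5. */
  def loadHDF5(sc: SparkContext, path: String): RDD[LabeledPoint] = ???
}
```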
@avulanov Also, we're going to have to add a dependency on the HDF5 library. I think this should be handled the way netlib is handled, with the user having to enable a profile when building Spark. So normally it wouldn't be available, but if you build with it you can use it. I'll update the POM to account for that.
@hntd187 Thanks for the links. I am not sure that the presence of the HDF5 library should be handled at the compilation step, because there will be no fallback for the functions we are going to implement, unlike netlib (which falls back to a Java implementation if you don't include the JNI binaries). Let's continue our discussion here https://issues.apache.org/jira/browse/SPARK-8449
Hi @avulanov, I have forked your repository for ann-benchmark https://github.com/avulanov/ann-benchmark/blob/master/spark/spark.scala . I feel a little confused about the mini-batch training; it seems that
@Myasuka Thank you for your interest in the benchmark. The goal of the benchmark is to measure the scalability of my implementation and to compare its efficiency with other tools, such as Caffe. I measure the time needed for one epoch of batch gradient descent on a large network with ~12M parameters. I don't measure the convergence rate or the accuracy, because they are very use-case specific and don't directly show how scalable a particular machine learning tool is. The benchmark could be improved, though, and I am working on it, so thank you for your suggestion.
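(As a quick sanity check, not from the thread itself: the parameter count for the 784-2500-2000-1500-1000-500-10 topology mentioned earlier, counting weights plus biases for each pair of adjacent layers, indeed comes out to roughly 12M.)

```scala
// Count weights and biases between each pair of adjacent layers.
val layers = Array(784, 2500, 2000, 1500, 1000, 500, 10)
val numParams = layers.sliding(2).map { case Array(in, out) => in * out + out }.sum
println(numParams) // 11972510, i.e. about 12M parameters
```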
@avulanov I tried to run a test on MNIST data with the SGD optimizer, however I cannot reproduce the result in #1290 (comment). I use a topology of (780, 10), and set batchSize to 1000, miniBatchFraction to 1.0, and numIterations to 500; the accuracy is only 75%. If I set miniBatchFraction to 0.1, the accuracy still stays at 75%. Would you please share your training parameters in detail so that I can raise the accuracy to 90%?
@Myasuka LBFGS was used in the mentioned experiment. SGD needs more iterations to converge in this case.
Hey guys, I want to start off by saying thank you for this piece of code. The ANN has been working beautifully so far. I have one question though: when I run the training on a dataset I imported from AWS S3, the logging says "Opening s3://blabla.txt for reading" over and over again. I interpret this as the program opening the S3 file many times, instead of just once. Is this true? Wouldn't it be much faster if the file was only opened once?
@bnoreus Thank you for your feedback. This code does not implement any file-related operations; it works with RDDs only. I assume the logging comes from other code you are using.
This pull request contains the following feature for ML:
- Multilayer Perceptron classifier

This implementation is based on our initial pull request with bgreeven: #1290 and inspired by very insightful suggestions from mengxr and witgo (I would like to thank all other people from the mentioned thread for useful discussions). The original code was extensively tested and benchmarked. Since then, I've addressed two main requirements that prevented the code from merging into the main branch:
- Extensible interface, so it will be easy to implement new types of networks
  - Main building blocks are traits `Layer` and `LayerModel`. They are used for constructing layers of ANN. New layers can be added by extending the `Layer` and `LayerModel` traits. These traits are private in this release in order to save path to improve them based on community feedback
  - Back propagation is implemented in general form, so there is no need to change it (optimization algorithm) when new layers are implemented
- Speed and scalability: this implementation has to be comparable in terms of speed to the state of the art single node implementations.
  - The developed benchmark for large ANN shows that the proposed code is on par with C++ CPU implementation and scales nicely with the number of workers. Details can be found here: https://github.com/avulanov/ann-benchmark
- DBN and RBM by witgo https://github.com/witgo/spark/tree/ann-interface-gemm-dbn
- Dropout https://github.com/avulanov/spark/tree/ann-interface-gemm

mengxr and dbtsai kindly agreed to perform code review.

Author: Alexander Ulanov <nashb@yandex.ru>
Author: Bert Greevenbosch <opensrc@bertgreevenbosch.nl>

Closes #7621 from avulanov/SPARK-2352-ann and squashes the following commits:
4806b6f [Alexander Ulanov] Addressing reviewers comments.
a7e7951 [Alexander Ulanov] Default blockSize: 100. Added documentation to blockSize parameter and DataStacker class
f69bb3d [Alexander Ulanov] Addressing reviewers comments.
374bea6 [Alexander Ulanov] Moving ANN to ML package. GradientDescent constructor is now spark private.
43b0ae2 [Alexander Ulanov] Addressing reviewers comments. Adding multiclass test.
9d18469 [Alexander Ulanov] Addressing reviewers comments: unnecessary copy of data in predict
35125ab [Alexander Ulanov] Style fix in tests
e191301 [Alexander Ulanov] Apache header
a226133 [Alexander Ulanov] Multilayer Perceptron regressor and classifier
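(For readers who want to try the merged result: a minimal usage sketch of the spark.ml classifier introduced by #7621, assuming `train` and `test` are DataFrames with "features" and "label" columns; the layer sizes and iteration count here are arbitrary example values, not from the thread.)

```scala
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

// Layers: input size, one hidden layer, number of output classes (example values).
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(784, 100, 10))
  .setBlockSize(100)   // data stacking block size; 100 is the default mentioned above
  .setMaxIter(100)
  .setSeed(1234L)

val model = mlp.fit(train)
val predictions = model.transform(test)
```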
@bgreeven We recently merged #7621 from @avulanov. Under the hood, it contains the ANN implementation based on this PR. Additional features should come in follow-up PRs. So do you mind closing this PR for now? We can move the discussion to the JIRA page on individual features. Thanks a lot for your contribution and everyone for the discussion!
Merged build finished. Test FAILed.
@bgreeven Hi, I tried to train the model using this implementation and found a weird outcome in the output from LBFGS, as follows. I launch model training with: var model = ArtificialNeuralNetwork.train(trainData, Array(2, 3), 5000, 1e-8) The problem is that the training process iterates only a few steps before returning. Obviously the error on the validation set is too large to satisfy expectations. What is the problem?
I closed this PR. We can use Apache JIRA to continue discussion on individual issues.
The code contains a multi-layer ANN implementation, with a variable number of inputs, outputs and hidden nodes. It takes as input an RDD of vector pairs, corresponding to the training set with inputs and outputs.
In addition to two automated tests, an example program is included, which also contains a graphical representation.
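(As a rough illustration of that interface, here is a hedged sketch of training on XOR data, based on the train call quoted earlier in the thread, ArtificialNeuralNetwork.train(data, hiddenLayerSizes, maxIterations, tolerance); the predict method name and the exact parameter meanings are assumptions, not taken from the merged code.)

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical usage; assumes the PR-branch ArtificialNeuralNetwork API.
def xorExample(sc: SparkContext): Unit = {
  // Training set: (input vector, target output vector) pairs for XOR.
  val samples = Seq(
    (Vectors.dense(0.0, 0.0), Vectors.dense(0.0)),
    (Vectors.dense(0.0, 1.0), Vectors.dense(1.0)),
    (Vectors.dense(1.0, 0.0), Vectors.dense(1.0)),
    (Vectors.dense(1.0, 1.0), Vectors.dense(0.0)))
  val trainData = sc.parallelize(samples)

  // One hidden layer of 5 nodes, up to 1000 iterations, tolerance 1e-4 (assumed meanings).
  val model = ArtificialNeuralNetwork.train(trainData, Array(5), 1000, 1e-4)
  samples.foreach { case (in, _) => println(s"$in -> ${model.predict(in)}") }
}
```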