Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-1185] Support large array in several operators (part 1) #13418

Merged
merged 13 commits into from
Dec 1, 2018

Conversation

apeforest
Copy link
Contributor

@apeforest apeforest commented Nov 27, 2018

Description

This PR fixed the large array issue (#13036, #13070) in the following operators:
ndarray.ones
ndarray.zeros
ndarray.sum
ndarray.slice
ndarray.random.uniform
ndarray.empty

This PR is only the first effort that addresses some limitations in some basic operators. More tests and fix are coming as we identify more operators. This is part of a large project tracked in JIRA by https://issues.apache.org/jira/browse/MXNET-1184

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Nightly tests are added for complicated/long-running ones
  • Code is well-documented:
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Support large array (> 3billion elements) in certain operators
  • Added nightly tests on arrays of 5 billion elements

Comments

This PR only supports the total number of elements in the array to go beyond 2 billion. The max shape size in any of the dimension is still limited to 2^31 because this PR does not update the C API. Due to the large change required in all language bindings, we decided to break the original PR (#13191) into several smaller parts. And this PR is part 1.

This PR only changes the indexing variable type from int32_t to int64_t, not the data type of each element. No performance impact observed. For future guidance of data types to choose, please refer to https://cwiki.apache.org/confluence/display/MXNET/Large+Tensor+Support

@zheng-da
Copy link
Contributor

The PR looks good to me.

@zheng-da
Copy link
Contributor

thanks for the great work. your code modifies a lot of code and the test doesn't cover all of them.
could you please add more unit tests to make sure the modified code is covered?
Here I tried a bunch of tests. Most of them work. Unfortunately, broadcast div fails. Could you please double check the problem in broadcast div?

import mxnet as mx
import numpy as np
long_dim = 50000000
arr = mx.nd.ones(shape=(long_dim, 200))

deg = mx.nd.ones(shape=(long_dim,1)) * 2
# Test broadcast div
res = arr/deg
assert np.sum(arr[-1].asnumpy() == 0.5) == arr.shape[1]

# Test element-wise
arr2 = mx.nd.ones(shape=(long_dim, 200))
res = arr + arr2
assert np.sum(res[-1].asnumpy() == 2) == arr.shape[1]
res = arr + 1
assert np.sum(res[-1].asnumpy() == 2) == arr.shape[1]
res = mx.nd.sqrt(arr + 3)
assert np.sum(res[-1].asnumpy() == 2) == arr.shape[1]

# Test reduce
assert mx.nd.sum(arr).asnumpy() == arr.shape[0] * arr.shape[1]

# Test dot
weight = mx.nd.ones(shape=(200, 100))
res = mx.nd.dot(arr, weight)
assert np.sum(res[-1].asnumpy() == 200) == weight.shape[1]

# Test FullyConnected
res = mx.nd.FullyConnected(arr, weight, num_hidden=weight.shape[1], no_bias=True)
assert np.sum(res[-1].asnumpy() == 200) == weight.shape[1]

# Test broadcast
range = mx.nd.arange(0, long_dim).reshape(long_dim, 1)
res = mx.nd.broadcast_to(range, shape=(range.shape[0], 200))
assert np.sum(res[-1].asnumpy() == long_dim) == res.shape[1]
res = mx.nd.broadcast_like(range, arr)
assert np.sum(res[-1].asnumpy() == long_dim) == arr.shape[1]

# Test clip
data = res
res = mx.nd.clip(data, a_min=100, a_max=1000)
assert np.sum(res[-1].asnumpy() == 1000) == arr.shape[1]


# Test take
idx = mx.nd.arange(long_dim-1000, long_dim)
res = mx.nd.take(arr, idx)
assert np.sum(res[-1].asnumpy() == 1) == res.shape[1]

# Test slice
res = mx.nd.slice(arr, begin=(long_dim-1000, 1), end=(long_dim, 100))
assert np.sum(res[-1].asnumpy() == 1) == res.shape[1]

# Test slice assign
res = arr.copy()
res[long_dim-1:long_dim] = 1000
assert np.sum(res[-1].asnumpy() == 1000) == arr.shape[1]

# Test expand_dims
res = mx.nd.expand_dims(arr, axis=1)
assert res.shape == (arr.shape[0], 1, arr.shape[1])

# Test squeeze
data = res
res = mx.nd.squeeze(data)
assert sum(res.shape == arr.shape) == 2

@vandanavk
Copy link
Contributor

@mxnet-label-bot add [pr-awaiting-review, NDArray]

@marcoabreu marcoabreu added NDArray pr-awaiting-review PR is waiting for code review labels Nov 27, 2018
Copy link
Contributor

@ChaiBapchya ChaiBapchya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty useful. However, going forward to ensure consistency, could you guide us as to
when to use
size_t, index_t and auto (instead of the usual int)
Thanks!

Copy link
Contributor

@samskalicky samskalicky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. How are the test coming? If you need help with the test creation, can you point out a group of things needing testing and i'll help out.

@apeforest
Copy link
Contributor Author

@zheng-da I have fixed the bug in broadcast_div and added the tests you provided. Thanks for your review and testing!

tests/nightly/test_large_array.py Outdated Show resolved Hide resolved
tests/nightly/test_large_array.py Outdated Show resolved Hide resolved
@zheng-da
Copy link
Contributor

Could you fix this bug as well? thanks.

def run_linear():
    data = mx.nd.ones(shape=(50*1000*1000, 100))
    linear = gluon.nn.Dense(100)
    linear.initialize(ctx=mx.cpu(0))
    res = linear(data)
    res.wait_to_read()

@apeforest
Copy link
Contributor Author

@zheng-da I don't see problem with this test. I have added it to the nightly test script. Please review.

@apeforest
Copy link
Contributor Author

@ChaiBapchya Yes, I will document this in cwiki after this change is merged. In general, index_t should be used for indexing elements in the tensor. size_t is used for returning the size of object or total number of elements. auto is a keyword in C++11 which derives the data type based on the rhs.

I found the Google C++ style guide to be a very good one to follow in programming: https://google.github.io/styleguide/cppguide.html#Integer_Types

@zheng-da
Copy link
Contributor

@apeforest i can't run the test for the dense layer in my C5.18x.

@zheng-da
Copy link
Contributor

do you build your code with mkldnn or without mkldnn?

@zheng-da
Copy link
Contributor

i can verify that the test code for dense works when mxnet isn't built with mkldnn.

@anirudh2290
Copy link
Member

@pengzhao-intel raised some good concerns on #13036 about performance. Also from @wkcn tests I see div op performance affected between int32 and int64. Have we done any micro-benchmarks for the operators affected ?

Copy link
Contributor

@lebeg lebeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

tests/nightly/test_large_array.py Outdated Show resolved Hide resolved
@apeforest
Copy link
Contributor Author

apeforest commented Nov 29, 2018

@anirudh2290 This PR only changes the indexing variable type from int32_t to int64_t, not the data type of each element. So the performance impact is different from what @wkcn test.

I also did a test in broadcast_div operator in MXNet and found no obvious runtime change (int64_t version actually is slightly faster)

import mxnet as mx
import numpy as np
from mxnet.test_utils import check_speed

a = mx.sym.ones(shape=(100000, 10000))
b = mx.sym.ones(shape=(100000, 1)) * 2
c = mx.sym.broadcast_div(a, b)

print(check_speed(c))

Master branch: 15.357867193222045 seconds
PR branch: 12.930877685546875 seconds

Copy link
Member

@yuxihu yuxihu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. You have some good information in the conversation which I suggest to add to the comments section of the PR. It will make things clearer and easier for reviewers to catch up in the first place. I think you should mention: (1) This PR only changes the indexing variable type from int32_t to int64_t, not the data type of each element. No performance impact observed. (2) Provide a cwiki link where you will add guidance like: "In general, index_t should be used for indexing elements in the tensor. size_t is used for returning the size of object or total number of elements. auto is a keyword in C++11 which derives the data type based on the rhs."

@apeforest
Copy link
Contributor Author

@yuxihu Thanks for your nice suggestion. I have updated the PR comment section.

Copy link
Member

@anirudh2290 anirudh2290 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification. Looks good overall. Can we also add a unit test for scatter_nd.

@apeforest
Copy link
Contributor Author

@anirudh2290 Given the overhead of repeating CI, can we get this merged and add unit test for scatter_nd in an second PR? This PR is only part-1 of a larger PR anyways. Thanks.

@pengzhao-intel
Copy link
Contributor

@pengzhao-intel raised some good concerns on #13036 about performance. Also from @wkcn tests I see div op performance affected between int32 and int64. Have we done any micro-benchmarks for the operators affected ?

@anirudh2290 Agree with you.
I see the changes make the INT64 by default in several OPs.
Is it a little aggresive? How about use template to switch the data type?

@zheng-da
Copy link
Contributor

@pengzhao-intel I think the performance impact is minimal. Only the index type is changed. The operations on the index is basically add and comparison. All other computation is much more expensive. I double there is measurable performance difference.

@pengzhao-intel
Copy link
Contributor

@zheng-da @apeforest I am fine with the current PR since it's only the first part of the changes.

I have re-visited the proposal in here and I suggest to add more details and data, such as how CUDA and MKL-DNN supports INT64 type and any potential impacts.

Copy link
Member

@anirudh2290 anirudh2290 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a TODO (JIRA) for the remaining test. Also, makes sense to bring this proposal up for discussion on the dev list.

@apeforest
Copy link
Contributor Author

Thank god the CI tests finally passed!
@anirudh2290 A JIRA epic has already been created to track this project: https://issues.apache.org/jira/browse/MXNET-1184. I will bring this proposal to dev list for discusssion. Since this change should update APIs in all language bindings, help is needed from experts in different language support.

@anirudh2290 anirudh2290 merged commit 0d480fb into apache:master Dec 1, 2018
sergeykolychev pushed a commit that referenced this pull request Dec 5, 2018
…ile (#13478)

* updated to v1.5.0

* Bumped minor version from 1.4.0 to 1.5.0 on master

* added Anirudh as maintainer for R package

... adding something useful and re-trigger PR check

* Updated license file for clojure, onnx-tensorrt, gtest, R-package

* Get the correct include path in pip package (#13452)

* add find_include_path API

* address reviewer comment

* change return type from list to string

* add unit test

* address reviewer comment

* address reviewer comment

* address reviewer comment

* address reviewer comment

* fix include path problem in pip package

* add comment

* fix lint error

* address reviewer comment

* address reviewer comment

* Use ~/.ccache as default ccache directory so is not cache is not erased on reboot (#13431)

* Skip flaky test #13446 (#13480)

* Rewrite dataloader with process pool, improves responsiveness and reliability (#13447)

* fix recordio.py

* rewrite dataloader with pool

* fix batch as tuple

* fix prefetching

* fix pylint

* picklable function

* use pickle

* add missing commit

* Fix errors in docstrings for subgraph op; use code directive (#13463)

* [MXNET-1158] JVM Memory Management Documentation (#13105)

* update train_mnist

* Add documentation for JVM Memory Management

* update doc

* address nit picks

* address nit picks

* Grammar and clarity edits for memory management doc

* Edits for scala memory management

* Update memory-management.md

* Update memory-management.md

* Update memory-management.md

* capitalization fix

* Update row_sparse tutorial (#13414)

Update row_sparse tutorial

* Add resiliency to onnx export code (#13426)

* Added resiliency to onnx export code

- With previous infer-shape implementation, if input shape was list instead of tuple or if extra non-existent parameters were provided, the code would still work. The fixes in this commit make sure that behavior is restored to prevent any compatibility issues with existing export code.

* Fixed name of net in unittest

* Fix pylint

* [MXNET-1185] Support large array in several operators (part 1) (#13418)

* fix a few operators with large arrays (# of elements)

* fix bug in broadcast_div and add tests

* address reviewer comment

* add unit test

* add empty line

* retrigger CI

* [MXNET-1210 ] Gluon Audio - Example (#13325)

* Initialized the example

* Addressed PR comments, about existing synset.txt file - no overwrite

* RST - docstring issues fixed

* added README

* Addressed PR comments

* Addressed PR comments, checking Divide by 0

* Raising error if format is not supported.

* changed a line for ndarray of labels

* Trigger CI

* Trigger CI

* PR comments addressed around skip_header argument

* Addressed PR comments around librosa import

* PR Comments

* Passing lazy=lazy from argument

* Added PR comments, labels to README.MD

* Trigger CI

* Addressing PR Comments in README

* Modified README.md

* Added example under audio folder

* Retrigger CI

* Retrigger CI

* ONNX export: Instance normalization, Shape (#12920)

* ONNX import/export: Make backend_rep common

* ONNX export: Instance Normalization

* ONNX export: Shape operator

* Clarify dependency on OpenCV in CNN Visualization tutorial. (#13495)

* clarify ops faq regarding docs strings (#13492)

* Add graph_compact operator. (#13436)

* add graph_compact.

* fix.

* add doc.

* add tests for graph_compact.

* address comments.

* update docs.

* trigger CI

* Deprecate Jenkinsfile (#13474)

* update github location for sampled_block.py (#13508)

Updated to https://github.com/dmlc/gluon-nlp/blob/master/src/gluonnlp/model/sampled_block.py

* #13453 [Clojure] - Add Spec Validations to the Optimizer namespace (#13499)

* ONNX export: Logical operators (#12852)

* Fix cmake options parsing in dev_menu (#13458)

Add GPU+MKLDNN unittests to dev_menu

* Revert "Manually track num_max_thread (#12380)" (#13501)

This reverts commit 7541021.

* Feature/mkldnn static 2 (#13503)

* build mkldnn as static lib

* update makefile to statically build mkldnn

* build static mkldnn

* fix static name

* fix static name

* update static for mac

* rename mkldnn dep in ci

* remove moving mkldnn dynamic lib

* remove commented code

* remove mkldnn dnaymic for unitest

* force static for mkldnn lib

* remove dynamic mkldnn bind

* only link windows

* add mkldnn.mk

* try force linking

* remove mkldnn dynanmic check

* remove test mkldnn install

* fix spacing

* fix index

* add artifacts

* add comment about windows

* remove static

* update makefile

* fix toctree Sphinx errors (#13489)

* fix toctree errors

* nudging file for CI

* Disabled flaky test test_gluon_data.test_recordimage_dataset_with_data_loader_multiworker (#13527)

* [MXNET-1234] Fix shape inference problems in Activation backward (#13409)

* Provide a failing test for ReLU activation shape inference bug

* Fix Activation backward shape inference

fixes: #13333

* Add softsign Activation to test_gluon.py

* Use activation in GPU if we are using CUDNN and not MKLDNN as it's happening right now

* Don't disable MKLDNN
zhaoyao73 pushed a commit to zhaoyao73/incubator-mxnet that referenced this pull request Dec 13, 2018
…e#13418)

* fix a few operators with large arrays (# of elements)

* fix bug in broadcast_div and add tests

* address reviewer comment

* add unit test

* add empty line

* retrigger CI
zhaoyao73 pushed a commit to zhaoyao73/incubator-mxnet that referenced this pull request Dec 13, 2018
…ile (apache#13478)

* updated to v1.5.0

* Bumped minor version from 1.4.0 to 1.5.0 on master

* added Anirudh as maintainer for R package

... adding something useful and re-trigger PR check

* Updated license file for clojure, onnx-tensorrt, gtest, R-package

* Get the correct include path in pip package (apache#13452)

* add find_include_path API

* address reviewer comment

* change return type from list to string

* add unit test

* address reviewer comment

* address reviewer comment

* address reviewer comment

* address reviewer comment

* fix include path problem in pip package

* add comment

* fix lint error

* address reviewer comment

* address reviewer comment

* Use ~/.ccache as default ccache directory so is not cache is not erased on reboot (apache#13431)

* Skip flaky test apache#13446 (apache#13480)

* Rewrite dataloader with process pool, improves responsiveness and reliability (apache#13447)

* fix recordio.py

* rewrite dataloader with pool

* fix batch as tuple

* fix prefetching

* fix pylint

* picklable function

* use pickle

* add missing commit

* Fix errors in docstrings for subgraph op; use code directive (apache#13463)

* [MXNET-1158] JVM Memory Management Documentation (apache#13105)

* update train_mnist

* Add documentation for JVM Memory Management

* update doc

* address nit picks

* address nit picks

* Grammar and clarity edits for memory management doc

* Edits for scala memory management

* Update memory-management.md

* Update memory-management.md

* Update memory-management.md

* capitalization fix

* Update row_sparse tutorial (apache#13414)

Update row_sparse tutorial

* Add resiliency to onnx export code (apache#13426)

* Added resiliency to onnx export code

- With previous infer-shape implementation, if input shape was list instead of tuple or if extra non-existent parameters were provided, the code would still work. The fixes in this commit make sure that behavior is restored to prevent any compatibility issues with existing export code.

* Fixed name of net in unittest

* Fix pylint

* [MXNET-1185] Support large array in several operators (part 1) (apache#13418)

* fix a few operators with large arrays (# of elements)

* fix bug in broadcast_div and add tests

* address reviewer comment

* add unit test

* add empty line

* retrigger CI

* [MXNET-1210 ] Gluon Audio - Example (apache#13325)

* Initialized the example

* Addressed PR comments, about existing synset.txt file - no overwrite

* RST - docstring issues fixed

* added README

* Addressed PR comments

* Addressed PR comments, checking Divide by 0

* Raising error if format is not supported.

* changed a line for ndarray of labels

* Trigger CI

* Trigger CI

* PR comments addressed around skip_header argument

* Addressed PR comments around librosa import

* PR Comments

* Passing lazy=lazy from argument

* Added PR comments, labels to README.MD

* Trigger CI

* Addressing PR Comments in README

* Modified README.md

* Added example under audio folder

* Retrigger CI

* Retrigger CI

* ONNX export: Instance normalization, Shape (apache#12920)

* ONNX import/export: Make backend_rep common

* ONNX export: Instance Normalization

* ONNX export: Shape operator

* Clarify dependency on OpenCV in CNN Visualization tutorial. (apache#13495)

* clarify ops faq regarding docs strings (apache#13492)

* Add graph_compact operator. (apache#13436)

* add graph_compact.

* fix.

* add doc.

* add tests for graph_compact.

* address comments.

* update docs.

* trigger CI

* Deprecate Jenkinsfile (apache#13474)

* update github location for sampled_block.py (apache#13508)

Updated to https://github.com/dmlc/gluon-nlp/blob/master/src/gluonnlp/model/sampled_block.py

* apache#13453 [Clojure] - Add Spec Validations to the Optimizer namespace (apache#13499)

* ONNX export: Logical operators (apache#12852)

* Fix cmake options parsing in dev_menu (apache#13458)

Add GPU+MKLDNN unittests to dev_menu

* Revert "Manually track num_max_thread (apache#12380)" (apache#13501)

This reverts commit 7541021.

* Feature/mkldnn static 2 (apache#13503)

* build mkldnn as static lib

* update makefile to statically build mkldnn

* build static mkldnn

* fix static name

* fix static name

* update static for mac

* rename mkldnn dep in ci

* remove moving mkldnn dynamic lib

* remove commented code

* remove mkldnn dnaymic for unitest

* force static for mkldnn lib

* remove dynamic mkldnn bind

* only link windows

* add mkldnn.mk

* try force linking

* remove mkldnn dynanmic check

* remove test mkldnn install

* fix spacing

* fix index

* add artifacts

* add comment about windows

* remove static

* update makefile

* fix toctree Sphinx errors (apache#13489)

* fix toctree errors

* nudging file for CI

* Disabled flaky test test_gluon_data.test_recordimage_dataset_with_data_loader_multiworker (apache#13527)

* [MXNET-1234] Fix shape inference problems in Activation backward (apache#13409)

* Provide a failing test for ReLU activation shape inference bug

* Fix Activation backward shape inference

fixes: apache#13333

* Add softsign Activation to test_gluon.py

* Use activation in GPU if we are using CUDNN and not MKLDNN as it's happening right now

* Don't disable MKLDNN
@apeforest apeforest deleted the bugfix/large-array-part1 branch February 25, 2020 18:34
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
NDArray pr-awaiting-review PR is waiting for code review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants