ValueError in sparse_array.astype when read-only with unsorted indices [scipy issue] #6614
The traceback points to sklearn/externals/joblib/parallel.py as the origin of the error. I am not sure whether this error should occur, but I will look into why exactly it is raised.
Another related, maybe more central, issue is #5481. IIRC estimators should avoid in-place modification of their input. More explanation: when the input data is big enough, joblib uses memmapping by default. This allows the input data to be shared across workers instead of each worker holding its own copy. See this for more details. The memmap is opened in read-only mode to avoid possible data corruption if different workers were to write into the same data. If you have access to the …
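To illustrate the memmapping behaviour described above, here is a minimal sketch (not part of the original comment; the array size and worker count are arbitrary):

```python
import numpy as np
from joblib import Parallel, delayed

data = np.random.rand(2000, 2000)  # ~30 MB, well above joblib's 1 MB default

# With the default max_nbytes='1M', the loky backend memmaps `data` read-only
# for the workers, so any in-place write inside a task raises a ValueError.
# Passing max_nbytes=None disables the automatic array-to-memmap conversion.
sums = Parallel(n_jobs=2, max_nbytes=None)(
    delayed(np.sum)(data) for _ in range(4)
)
```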
Thanks for the lucid explanation, Loic!
Would it be possible to add an optional boolean parameter to estimators that, when set to True, passes max_nbytes=None to joblib.Parallel? It shouldn't be too hard to implement, and it would fix the issue.
A more proper fix would be to modify … Looking at this problem more closely, I found a work-around in case it is useful:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.datasets import fetch_20newsgroups

data_train = fetch_20newsgroups(subset='train')
data_test = fetch_20newsgroups(subset='test')

clf = OneVsRestClassifier(estimator=SVC(), n_jobs=-1)

class MyTfidfVectorizer(TfidfVectorizer):
    # Sort the CSR indices of the output up front, so no in-place
    # modification is needed later inside read-only memmapped workers.
    def fit_transform(self, X, y):
        result = super(MyTfidfVectorizer, self).fit_transform(X, y)
        result.sort_indices()
        return result

pipeline = Pipeline([('tfidf', MyTfidfVectorizer(min_df=10)),
                     ('clf', clf)])

trainx = data_train.data
trainy = data_train.target
evalx = data_test.data
evaly = data_test.target

pipeline.fit(trainx, trainy)
predicted = pipeline.predict(evalx)
print(classification_report(evaly, predicted))
```

This is based on the fact that the in-place modification only happens when the output of the TfidfVectorizer doesn't have its indices sorted.
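To make that concrete, here is a small standalone illustration (mine, not from the comment) of checking and sorting CSR indices; the matrix is constructed with deliberately unsorted column indices:

```python
import numpy as np
import scipy.sparse as sp

# Row 0 stores columns [3, 1] out of order; row 1 is empty.
X = sp.csr_matrix((np.array([1.0, 2.0]),   # data
                   np.array([3, 1]),       # column indices (unsorted)
                   np.array([0, 2, 2])),   # indptr
                  shape=(2, 5))

print(bool(X.has_sorted_indices))  # False
X.sort_indices()                   # sorts the index arrays in place
print(bool(X.has_sorted_indices))  # True
```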
There should be a …
It does seem very, very similar to #5481, although this one needs sparse matrices to be triggered, IIRC.
Ok. We should add a common test anyhow that covers dense and sparse. But alright, let's leave this open to make sure we cover this case.
Getting the same error.
Any updates here? Still getting the same error.
No updates, I am afraid. You are more than welcome to submit a PR for this issue, or to use the work-around suggested in #6614 (comment).
… Now trying to find the cause of the difference in shape between the train and test features.
For the record I opened scipy/scipy#8678. |
Because of ValueError: WRITEBACKIFCOPY base is read-only. Bug: scikit-learn/scikit-learn#6614. Solution: none.
I got a similar issue. When I changed the code as below, it worked.
Thanks @csvankhede! Yes, it's a scipy bug, as mentioned above. Adding a workaround for all code that uses … Just for the record, would

```python
x_train_multilabel.sort_indices()
clf.fit(x_train_multilabel, y_train)
```

also work in your case?
Adding a dummy sorting estimator to the pipeline is an even simpler workaround :) (see the sketch below)
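A minimal sketch of what such a dummy estimator could look like (the name SortIndices and the pass-through design are my own assumptions; the comment's original snippet was not preserved):

```python
from sklearn.base import BaseEstimator, TransformerMixin

class SortIndices(BaseEstimator, TransformerMixin):
    """Pass-through step that sorts CSR indices in place before the final estimator."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if hasattr(X, 'sort_indices'):  # only sparse matrices need this
            X.sort_indices()
        return X

# Usage: Pipeline([('tfidf', TfidfVectorizer()), ('sort', SortIndices()), ('clf', clf)])
```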
This PR introduces the optional *max_nbytes* parameter on the *OneVsRestClassifier*, *OneVsOneClassifier* and *OutputCodeClassifier* multiclass learning algorithms within *multiclass.py*. The parameter complements the existing *n_jobs* one and may be useful when dealing with a large training set processed by concurrently running jobs, with *n_jobs* > 0 or *n_jobs* = -1 (meaning the number of jobs is set to the number of CPU cores). In this case, [Parallel](https://joblib.readthedocs.io/en/latest/parallel.html#parallel-reference-documentation) is called with the default "loky" backend, which [implements multi-processing](https://joblib.readthedocs.io/en/latest/parallel.html#thread-based-parallelism-vs-process-based-parallelism); *Parallel* also sets a default 1-megabyte [threshold](https://joblib.readthedocs.io/en/latest/parallel.html#automated-array-to-memmap-conversion) on the size of arrays passed to the workers. That default may not be enough for large arrays and can break the job with the exception **ValueError: UPDATEIFCOPY base is read-only**. *Parallel* uses *max_nbytes* to control this threshold. Through this fix, the multiclass classifiers offer the optional possibility to customize the maximum size of arrays. Fixes scikit-learn#6614. Expected to also fix scikit-learn#4597.
Changing the *OneVsRestClassifier*, *OneVsOneClassifier* and *OutputCodeClassifier* multiclass learning algorithms within multiclass.py by replacing the *n_jobs* parameter with a keyword variable-length argument list, in order to allow any *Parallel* parameter to be passed as well as to support the *parallel_backend* context manager. *n_jobs* remains one of the possible parameters, but others can be added, including *max_nbytes*, which may be useful to avoid a ValueError when dealing with a large training set processed by concurrently running jobs with *n_jobs* > 0 or *n_jobs* = -1. More specifically, in parallel computing of large arrays with the "loky" backend, [Parallel](https://joblib.readthedocs.io/en/latest/parallel.html#parallel-reference-documentation) sets a default 1-megabyte [threshold](https://joblib.readthedocs.io/en/latest/parallel.html#automated-array-to-memmap-conversion) on the size of arrays passed to the workers. That default may not be enough for large arrays and can break jobs with the exception **ValueError: UPDATEIFCOPY base is read-only**. *Parallel* uses *max_nbytes* to control this threshold. Through this fix, the multiclass classifiers offer the optional possibility to customize the maximum size of arrays. Fixes scikit-learn#6614. See also scikit-learn#4597. Also changed _get_args in _testing.py to accept the 'parallel_params' vararg.
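A rough sketch of the idea (the class and helper names here are illustrative, not the PR's actual code):

```python
from joblib import Parallel, delayed
from sklearn.base import clone

def _fit_one(estimator, X, y):
    # Fit an independent clone on one binary column of the label matrix.
    return clone(estimator).fit(X, y)

class VarargsOneVsRest:
    def __init__(self, estimator, **parallel_params):
        self.estimator = estimator
        # Anything Parallel accepts can be forwarded, e.g. n_jobs=-1, max_nbytes=None.
        self.parallel_params = parallel_params

    def fit(self, X, Y):
        # Y is an (n_samples, n_classes) indicator matrix; one binary task per column.
        self.estimators_ = Parallel(**self.parallel_params)(
            delayed(_fit_one)(self.estimator, X, Y[:, i]) for i in range(Y.shape[1])
        )
        return self
```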
Thanks for the workaround; I am no longer getting the error "ValueError: UPDATEIFCOPY base is read-only".
The code to reproduce is the following:
where X is a (1375308, 80614) scipy sparse matrix obtained with a CountVectorizer() and Y is a (1375308, 157) binarized label matrix. I am using a machine with 72 CPU cores and 252 GB of RAM. I previously ran the same code with n_jobs=1 and it took 4 hours 56 minutes, so I was expecting a roughly 72x speedup (about 4 minutes). It indeed crashed after 5 minutes, which suggests that the training was almost finished. Apologies if this is unrelated, or if I am late to the party. EDIT: I used n_jobs=30 and it works; the job was crashing simply because the parallel workers were exhausting the RAM.
I am closing this issue since it should be solved by the upcoming SciPy release, now that scipy/scipy#18192 has been merged.
To reproduce:
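(The original snippet did not survive extraction; judging from the thread, it resembled the pipeline discussed above. A hedged reconstruction:)

```python
# Reconstructed from the thread, not the verbatim original snippet:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

data = fetch_20newsgroups(subset='train')
X = TfidfVectorizer(min_df=10).fit_transform(data.data)  # CSR; indices may be unsorted
clf = OneVsRestClassifier(SVC(), n_jobs=-1)  # n_jobs != 1 enables joblib memmapping
clf.fit(X, data.target)
```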
Output:
ValueError: UPDATEIFCOPY base is read-only