Spelling errors in comments and documentation (#1669)
* Fix spelling mistakes in the code

* Fix typos in the doc

* Update KNN regressor file with changes suggested by reviewer.

Co-authored-by: Geoffrey Bolmier <geoffrey.bolmier@gmail.com>

---------

Co-authored-by: Hoang Anh Ngo <50743576+hoanganhngo610@users.noreply.github.com>
Co-authored-by: Geoffrey Bolmier <geoffrey.bolmier@gmail.com>
3 people authored Feb 3, 2025
1 parent 5531ae5 commit 18f31d6
Showing 45 changed files with 64 additions and 64 deletions.
4 changes: 2 additions & 2 deletions docs/examples/batch-to-online.ipynb
@@ -60,7 +60,7 @@
" ('lin_reg', linear_model.LogisticRegression(solver='lbfgs'))\n",
"])\n",
"\n",
"# Define a determistic cross-validation procedure\n",
"# Define a deterministic cross-validation procedure\n",
"cv = model_selection.KFold(n_splits=5, shuffle=True, random_state=42)\n",
"\n",
"# Compute the MSE values\n",
@@ -356,7 +356,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The results seem to be exactly the same! The twist is that the running statistics won't be very accurate for the first few observations. In general though this doesn't matter too much. Some would even go as far as to say that this descrepancy is beneficial and acts as some sort of regularization...\n",
"The results seem to be exactly the same! The twist is that the running statistics won't be very accurate for the first few observations. In general though this doesn't matter too much. Some would even go as far as to say that this discrepancy is beneficial and acts as some sort of regularization...\n",
"\n",
"Now the idea is that we can compute the running statistics of each feature and scale them as they come along. The way to do this with River is to use the `StandardScaler` class from the `preprocessing` module, as so:"
]
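For context on the scaler this notebook excerpt refers to, a minimal usage sketch might look like the following, assuming River's `preprocessing.StandardScaler` with the `learn_one`/`transform_one` interface; the sample values are made up.

```python
from river import preprocessing

# Minimal sketch: maintain running statistics and scale each sample as it arrives.
scaler = preprocessing.StandardScaler()

for x in [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]:  # toy stream
    scaler.learn_one(x)                # update the running mean and variance
    print(scaler.transform_one(x))     # scale using the statistics seen so far
```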
2 changes: 1 addition & 1 deletion docs/examples/building-a-simple-nowcasting-model.ipynb
@@ -446,7 +446,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We've managed to get a good looking prediction curve with a reasonably simple model. What's more our model has the advantage of being interpretable and easy to debug. There surely are more rocks to squeeze (e.g. tune the hyperparameters, use an ensemble model, etc.) but we'll leave that as an exercice to the reader.\n",
"We've managed to get a good looking prediction curve with a reasonably simple model. What's more our model has the advantage of being interpretable and easy to debug. There surely are more rocks to squeeze (e.g. tune the hyperparameters, use an ensemble model, etc.) but we'll leave that as an exercise to the reader.\n",
"\n",
"As a finishing touch we'll rewrite our pipeline using the `|` operator, which is called a \"pipe\"."
]
2 changes: 1 addition & 1 deletion docs/examples/content-personalization.ipynb
@@ -319,7 +319,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"A good recommender model should at the very least understand what kind of items each user prefers. One of the simplest and yet performant way to do this is Simon Funk's SGD method he developped for the Netflix challenge and wrote about [here](https://sifter.org/simon/journal/20061211.html). It models each user and each item as latent vectors. The dot product of these two vectors is the expected preference of the user for the item."
"A good recommender model should at the very least understand what kind of items each user prefers. One of the simplest and yet performant way to do this is Simon Funk's SGD method he developed for the Netflix challenge and wrote about [here](https://sifter.org/simon/journal/20061211.html). It models each user and each item as latent vectors. The dot product of these two vectors is the expected preference of the user for the item."
]
},
{
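As a rough illustration of the Funk-style SGD idea described in this excerpt (latent vectors updated by gradient steps, preference as a dot product), here is a plain-Python sketch; the factor count, learning rate and initialisation are illustrative, not the notebook's actual settings.

```python
import random

n_factors, lr = 10, 0.02
user_vecs, item_vecs = {}, {}

def latent(store, key):
    # lazily create a small random latent vector for an unseen user/item
    return store.setdefault(key, [random.gauss(0, 0.1) for _ in range(n_factors)])

def predict(user, item):
    u, v = latent(user_vecs, user), latent(item_vecs, item)
    return sum(a * b for a, b in zip(u, v))

def sgd_step(user, item, rating):
    # one SGD update on the squared error of the dot-product prediction
    u, v = latent(user_vecs, user), latent(item_vecs, item)
    err = rating - sum(a * b for a, b in zip(u, v))
    for f in range(n_factors):
        u[f], v[f] = u[f] + lr * err * v[f], v[f] + lr * err * u[f]

sgd_step("alice", "movie_1", 5.0)
print(round(predict("alice", "movie_1"), 3))
```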
2 changes: 1 addition & 1 deletion docs/examples/sentence-classification.ipynb
@@ -814,7 +814,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The command below allows you to download the pre-trained embeddings that spaCy makes available. More informations about spaCy and its installation may be found here [here](https://spacy.io/usage)."
"The command below allows you to download the pre-trained embeddings that spaCy makes available. More information about spaCy and its installation may be found here [here](https://spacy.io/usage)."
]
},
{
2 changes: 1 addition & 1 deletion docs/faq/index.md
@@ -58,4 +58,4 @@ There are many great open-source libraries for building neural network models. W

## Who are the authors of this library?

- We are research engineers, graduate students, PhDs and machine learning researchers. The members of the develompent team are mainly located in France, Brazil and New Zealand.
+ We are research engineers, graduate students, PhDs and machine learning researchers. The members of the development team are mainly located in France, Brazil and New Zealand.
2 changes: 1 addition & 1 deletion docs/introduction/basic-concepts.md
@@ -44,7 +44,7 @@ Dictionaries are therefore a perfect fit. They're native to Python and have exce

In production, you're almost always going to face data streams which you have to react to, such as users visiting your website. The advantage of online machine learning is that you can design models that make predictions as well as learn from this data stream as it flows.

- But of course, when you're developping a model, you don't usually have access to a real-time feed on which to evaluate your model. You usually have an offline dataset which you want to evaluate your model on. River provides some datasets which can be read in online manner, one sample at a time. It is however crucial to keep in mind that the goal is to reproduce a production scenario as closely as possible, in order to ensure your model will perform just as well in production.
+ But of course, when you're developing a model, you don't usually have access to a real-time feed on which to evaluate your model. You usually have an offline dataset which you want to evaluate your model on. River provides some datasets which can be read in online manner, one sample at a time. It is however crucial to keep in mind that the goal is to reproduce a production scenario as closely as possible, in order to ensure your model will perform just as well in production.

## Model evaluation

@@ -179,7 +179,7 @@
}
},
"source": [
"We see that `ADWIN` successfully indicates the presence of drift (red vertical lines) close to the begining of a new data distribution.\n",
"We see that `ADWIN` successfully indicates the presence of drift (red vertical lines) close to the beginning of a new data distribution.\n",
"\n",
"\n",
"---\n",
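The hunk above only touches a typo, but for context, a minimal usage sketch of the detector it mentions could look like this, assuming River's `drift.ADWIN` and the `drift_detected` property referenced elsewhere in this commit; the stream values are synthetic.

```python
from river import drift

detector = drift.ADWIN()
stream = [0.1] * 500 + [0.9] * 500   # synthetic stream with an abrupt change

for i, x in enumerate(stream):
    detector.update(x)               # feed one value at a time
    if detector.drift_detected:
        print(f"Drift detected at index {i}")
```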
2 changes: 1 addition & 1 deletion docs/recipes/active-learning.ipynb
@@ -196,7 +196,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Active learning is primarly used to label data in an efficient manner. However, in an online setting, active learning can also be used simply to speed up training. The point is that you can achieve a very good performance without training on an entire dataset. Active learning is a powerful way to decide which samples to train on."
"Active learning is primarily used to label data in an efficient manner. However, in an online setting, active learning can also be used simply to speed up training. The point is that you can achieve a very good performance without training on an entire dataset. Active learning is a powerful way to decide which samples to train on."
]
},
{
6 changes: 3 additions & 3 deletions docs/recipes/cloning-and-mutating.ipynb
@@ -13,7 +13,7 @@
"source": [
"Sometimes you might want to reset a model, or edit (what we call mutate) its attributes. This can be useful in an online environment. Indeed, if you detect a drift, then you might want to mutate a model's attributes. Or if you see that a model's performance is plummeting, then you might to reset it to its \"factory settings\".\n",
"\n",
"Anyway, this is not to convince you, but rather to say that a model's attributes don't have be to set in stone throughout its lifetime. In particular, if you're developping your own model, then you might want to have good tools to do this. This is what this recipe is about."
"Anyway, this is not to convince you, but rather to say that a model's attributes don't have be to set in stone throughout its lifetime. In particular, if you're developing your own model, then you might want to have good tools to do this. This is what this recipe is about."
]
},
{
@@ -332,9 +332,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"All attributes are immutable by default. Under the hood, each model can specify a set of mutable attributes via the `_mutable_attributes` property. In theory this can be overriden. But the general idea is that we will progressively add more and more mutable attributes with time.\n",
"All attributes are immutable by default. Under the hood, each model can specify a set of mutable attributes via the `_mutable_attributes` property. In theory this can be overridden. But the general idea is that we will progressively add more and more mutable attributes with time.\n",
"\n",
"And that concludes this recipe. Arguably, this recipe caters to advanced users, and in particular users who are developping their own models. And yet, one could also argue that modifying parameters of a model on-the-fly is a great tool to have at your disposal when you're doing online machine learning."
"And that concludes this recipe. Arguably, this recipe caters to advanced users, and in particular users who are developing their own models. And yet, one could also argue that modifying parameters of a model on-the-fly is a great tool to have at your disposal when you're doing online machine learning."
]
}
],
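As a hedged illustration of the cloning and mutating workflow this recipe describes, assuming River's `clone`/`mutate` methods (which attributes are mutable depends on each model's `_mutable_attributes`):

```python
from river import linear_model

model = linear_model.LinearRegression(l2=0.1)

fresh = model.clone()                # unlearned copy with the same parameters
tweaked = model.clone({"l2": 0.01})  # unlearned copy with an overridden parameter
# model.mutate({...})                # would edit mutable attributes in place
```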
6 changes: 3 additions & 3 deletions docs/recipes/on-hoeffding-trees.ipynb
@@ -26,7 +26,7 @@
"\n",
"In this guide, we are going to:\n",
"\n",
"1. summarize the differences accross the multiple HT versions available;\n",
"1. summarize the differences across the multiple HT versions available;\n",
"2. learn how to inspect tree models;\n",
"3. learn how to manage the memory usage of HTs;\n",
"4. compare numerical tree splitters and understand their impact on the iDT induction process.\n",
@@ -888,7 +888,7 @@
"- $n$: Number of observations seen so far.\n",
"- $c$: the number of classes.\n",
"- $s$: the number of split points to evaluate (which means that this is a user-given parameter).\n",
"- $h$: the number of histogram bins or hash slots. Tipically, $h \\ll n$.\n",
"- $h$: the number of histogram bins or hash slots. Typically, $h \\ll n$.\n",
"\n",
"### 4.1. Classification tree splitters\n",
"\n",
@@ -906,7 +906,7 @@
"- The number of split points can be configured in the Gaussian splitter. Increasing this number makes this splitter slower, but it also potentially increases the quality of the obtained query points, implying enhanced tree accuracy. \n",
"- The number of stored bins can be selected in the Histogram splitter. Increasing this number increases the memory footprint and running time of this splitter, but it also potentially makes its split candidates more accurate and positively impacts on the tree's final predictive performance.\n",
"\n",
"Next, we provide a brief comparison of the classification splitters using 10K instances of the Random RBF synthetic dataset. Note that the tree equiped with the Exhaustive splitter does not use Naive Bayes leaves."
"Next, we provide a brief comparison of the classification splitters using 10K instances of the Random RBF synthetic dataset. Note that the tree equipped with the Exhaustive splitter does not use Naive Bayes leaves."
]
},
{
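A small sketch of how a different numerical splitter can be plugged into a Hoeffding Tree, assuming the splitter classes exposed under `river.tree.splitter`; the parameter values are arbitrary.

```python
from river import tree

model = tree.HoeffdingTreeClassifier(
    grace_period=100,                            # observations between split attempts
    splitter=tree.splitter.HistogramSplitter(),  # histogram-based split candidates
)
```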
2 changes: 1 addition & 1 deletion docs/releases/0.12.0.md
@@ -29,7 +29,7 @@
## drift

- Refactor the concept drift detectors to match the remaining of River's API. Warnings are only issued by detectors that support this feature.
- - Drifts can be assessed via the property `drift_detected`. Warning signals can be acessed by the property `warning_detected`. The `update` now returns `self`.
+ - Drifts can be assessed via the property `drift_detected`. Warning signals can be accessed by the property `warning_detected`. The `update` now returns `self`.
- Ensure all detectors automatically reset their inner states after a concept drift detection.
- Streamline `DDM`, `EDDM`, `HDDM_A`, and `HDDM_W`. Make the configurable parameters names match their respective papers.
- Fix bugs in `EDDM` and `HDDM_W`.
2 changes: 1 addition & 1 deletion docs/releases/0.19.0.md
@@ -30,7 +30,7 @@ Calling `learn_one` in a pipeline will now update each part of the pipeline in t
## forest

- Fixed issue with `forest.ARFClassifier` which couldn't be passed a `CrossEntropy` metric.
- - Fixed a bug in `forest.AMFClassifier` which slightly improves predictive accurary.
+ - Fixed a bug in `forest.AMFClassifier` which slightly improves predictive accuracy.
- Added `forest.AMFRegressor`.

## multioutput
2 changes: 1 addition & 1 deletion docs/releases/0.8.0.md
@@ -28,6 +28,6 @@

## tree

- - Unifed base class structure applied to all tree models.
+ - Unified base class structure applied to all tree models.
- Bug fixes.
- Added `tree.SGTClassifier` and `tree.SGTRegressor`.
10 changes: 5 additions & 5 deletions river/anomaly/lof.py
@@ -149,17 +149,17 @@ class LocalOutlierFactor(anomaly.base.AnomalyDetector):
The algorithm take into account the following elements:
- `NewPoints`: new points;
- - `kNN(p)`: the k-nearest neighboors of `p` (the k-closest points to `p`);
- - `RkNN(p)`: the reverse-k-nearest neighboors of `p` (points that have `p` as one of their neighboors);
+ - `kNN(p)`: the k-nearest neighbors of `p` (the k-closest points to `p`);
+ - `RkNN(p)`: the reverse-k-nearest neighbors of `p` (points that have `p` as one of their neighbors);
- `set_upd_lrd`: Set of points that need to have the local reachability distance updated;
- `set_upd_lof`: Set of points that need to have the local outlier factor updated.
This current implementation within `River`, based on the original one in the paper, follows the following steps:
1) Insert new data points (`NewPoints`) and calculate its distance to existing points;
- 2) Update the nreaest neighboors and reverse nearest neighboors of all the points;
+ 2) Update the nearest neighbors and reverse nearest neighbors of all the points;
3) Define sets of affected points that required updates;
- 4) Calculate the reachability-distance from new point to neighboors (`NewPoints` -> `kNN(NewPoints)`)
- and from rev-neighboors to new point (`RkNN(NewPoints)` -> `NewPoints`);
+ 4) Calculate the reachability-distance from new point to neighbors (`NewPoints` -> `kNN(NewPoints)`)
+ and from rev-neighbors to new point (`RkNN(NewPoints)` -> `NewPoints`);
5) Update the reachability-distance for affected points: `RkNN(RkNN(NewPoints))` -> `RkNN(NewPoints)`
6) Update local reachability distance of affected points: `lrd(set_upd_lrd)`;
7) Update local outlier factor: `lof(set_upd_lof)`.
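For readers parsing the algorithm steps quoted above, the textbook LOF quantities that the affected points have recomputed are roughly the following; this is an illustrative sketch of the definitions, not River's implementation.

```python
def reach_dist(p, o, dist, k_distance):
    # reachability distance of p with respect to o
    return max(k_distance[o], dist(p, o))

def lrd(p, knn, dist, k_distance):
    # local reachability density: inverse mean reachability distance to p's neighbors
    return len(knn[p]) / sum(reach_dist(p, o, dist, k_distance) for o in knn[p])

def lof(p, knn, dist, k_distance):
    # local outlier factor: average density ratio between p's neighbors and p
    lrd_p = lrd(p, knn, dist, k_distance)
    return sum(lrd(o, knn, dist, k_distance) for o in knn[p]) / (len(knn[p]) * lrd_p)
```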
2 changes: 1 addition & 1 deletion river/base/base.py
@@ -415,7 +415,7 @@ def log_method_calls(
):
"""A context manager to log method calls.
- All method calls will be logged by default. This behavior can be overriden by passing filtering
+ All method calls will be logged by default. This behavior can be overridden by passing filtering
functions.
Parameters
2 changes: 1 addition & 1 deletion river/cluster/dbstream.py
@@ -13,7 +13,7 @@ class DBSTREAM(base.Clusterer):
DBSTREAM [^1] is a clustering algorithm for evolving data streams.
It is the first micro-cluster-based online clustering component that
- explicitely captures the density between micro-clusters via a shared
+ explicitly captures the density between micro-clusters via a shared
density graph. The density information in the graph is then exploited
for reclustering based on actual density between adjacent micro clusters.
2 changes: 1 addition & 1 deletion river/cluster/odac.py
@@ -73,7 +73,7 @@ class ODAC(base.Clusterer):
├── CH1_LVL_3 d1=0.71 [5, 6]
└── CH2_LVL_3 d1=0.71 [7, 8]
- You can acess some properties of the clustering model directly:
+ You can access some properties of the clustering model directly:
>>> model.n_clusters
11
2 changes: 1 addition & 1 deletion river/cluster/streamkmeans.py
@@ -11,7 +11,7 @@ class STREAMKMeans(base.Clusterer):
However, instead of using the traditional k-means, which requires a total reclustering
each time the temporary chunk of data points is full, the implementation of this algorithm
- uses an increamental k-means.
+ uses an incremental k-means.
At first, the cluster centers are initialized with a `KMeans` instance. For a new point `p`:
2 changes: 1 addition & 1 deletion river/cluster/textclust.py
@@ -583,7 +583,7 @@ def merge(self, microcluster, t, omega, fading_factor, term_fading, realtime):
microcluster.fade(t, omega, fading_factor, term_fading, realtime)

self.time = t
- # here we merge an existing mc wth the current mc. The tf values as well as the ids have to be transferred
+ # here we merge an existing mc with the current mc. The tf values as well as the ids have to be transferred
for k in list(microcluster.tf.keys()):
if k in self.tf:
self.tf[k]["tf"] += microcluster.tf[k]["tf"]
2 changes: 1 addition & 1 deletion river/compose/renamer.py
@@ -11,7 +11,7 @@ class Renamer(base.Transformer):
Parameters
----------
mapping
- Dictionnary describing substitution rules. Keys in `mapping` that are not a feature's name are silently ignored.
+ Dictionary describing substitution rules. Keys in `mapping` that are not a feature's name are silently ignored.
Examples
--------
2 changes: 1 addition & 1 deletion river/datasets/restaurants.py
@@ -10,7 +10,7 @@
class Restaurants(base.RemoteDataset):
"""Data from the Kaggle Recruit Restaurants challenge.
- The goal is to predict the number of visitors in each of 829 Japanese restaurants over a priod
+ The goal is to predict the number of visitors in each of 829 Japanese restaurants over a period
of roughly 16 weeks. The data is ordered by date and then by restaurant ID.
References
2 changes: 1 addition & 1 deletion river/drift/dummy.py
@@ -82,7 +82,7 @@ class DummyDriftDetector(base.DriftDetector):
The 'w' value must be greater than zero when 'trigger_method' is 'random'.
Since we set `dynamic_cloning` to `True`, a clone of the periodic trigger will
- have its internal paramenters changed:
+ have its internal parameters changed:
>>> rtrigger = rtrigger.clone()
>>> for i, v in enumerate(data):
2 changes: 1 addition & 1 deletion river/drift/retrain.py
@@ -7,7 +7,7 @@ class DriftRetrainingClassifier(base.Wrapper, base.Classifier):
"""Drift retraining classifier.
This classifier is a wrapper for any classifier. It monitors the incoming data for concept
- drifts and warnings in the model's accurary. In case a warning is detected, a background model
+ drifts and warnings in the model's accuracy. In case a warning is detected, a background model
starts to train. If a drift is detected, the model will be replaced by the background model,
and the background model will be reset.
2 changes: 1 addition & 1 deletion river/ensemble/boosting.py
@@ -303,7 +303,7 @@ def learn_one(self, x, y, **kwargs):
# the best model's not yet trained will receive lambda values for training from the model's that correctly classified an instance.
# the values of lambda increase in case a mistake is made and decrease in case a right prediction is made.
# the worst models are more likely to make mistakes, increasing the value of lambda.
- # Then, the best's model are likely to receive a high value of lambda and decreasing gradually throughout the remaning models to be trained
+ # Then, the best's model are likely to receive a high value of lambda and decreasing gradually throughout the remaining models to be trained
# It's similar to a system where the rich get richer.
for i in range(self.n_models):
if correct:
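The comment block quoted above describes Oza & Russell-style online boosting weights. As a hedged sketch (variable names are illustrative, not River's internals), the lambda passed from one base model to the next is typically updated like this:

```python
def update_lambda(lam, correct, lambda_sc, lambda_sw):
    # Online AdaBoost-style reweighting: shrink lambda after a correct prediction,
    # boost it after a mistake, so hard examples weigh more for later models.
    if correct:
        lambda_sc += lam
        lam *= (lambda_sc + lambda_sw) / (2 * lambda_sc)
    else:
        lambda_sw += lam
        lam *= (lambda_sc + lambda_sw) / (2 * lambda_sw)
    return lam, lambda_sc, lambda_sw
```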
2 changes: 1 addition & 1 deletion river/imblearn/random.py
@@ -194,7 +194,7 @@ class RandomSampler(ClassificationSampler):
desired_dist
The desired class distribution. The keys are the classes whilst the values are the desired
class percentages. The values must sum up to 1. If set to `None`, then the observations
- will be sampled uniformly at random, which is stricly equivalent to using
+ will be sampled uniformly at random, which is strictly equivalent to using
`ensemble.BaggingClassifier`.
sampling_rate
The desired ratio of data to sample.
2 changes: 1 addition & 1 deletion river/linear_model/base.py
@@ -125,7 +125,7 @@ def _fit(self, x, y, w, get_grad):
def _update_weights(self, x):
# L1 cumulative penalty helper

- # Apply penalty to each weight iteratively, with the potential of being parrallelized by using VectorDict
+ # Apply penalty to each weight iteratively, with the potential of being parallelized by using VectorDict
for j, xj in x.items():
wj_temp = self._weights[j]

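The comment touched in this hunk refers to the cumulative L1 penalty trick (Tsuruoka et al., 2009). A rough standalone sketch of that per-weight update, with illustrative names rather than River's internals, looks like this:

```python
def apply_cumulative_l1(w, j, u, q):
    # w: weights dict, j: feature key, u: total L1 penalty accumulated so far,
    # q: penalty actually applied to each weight so far. Clip the weight toward
    # zero by at most the outstanding penalty, never letting it cross zero.
    z = w[j]
    if w[j] > 0:
        w[j] = max(0.0, w[j] - (u + q.get(j, 0.0)))
    elif w[j] < 0:
        w[j] = min(0.0, w[j] + (u - q.get(j, 0.0)))
    q[j] = q.get(j, 0.0) + (w[j] - z)
```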