From 3e04a41a9b4cfe6b07b2d747883261aaf3830c1b Mon Sep 17 00:00:00 2001 From: Reza Gharibi Date: Mon, 25 Oct 2021 15:18:02 +0330 Subject: [PATCH] Fix some writing issues in the docs (#14136) * Fix some writing issues in the docs * Run code quality check --- ISSUES.md | 2 +- docs/source/add_new_model.rst | 6 +++--- docs/source/add_new_pipeline.rst | 18 +++++++++--------- docs/source/community.md | 2 +- docs/source/converting_tensorflow_models.rst | 4 ++-- docs/source/custom_datasets.rst | 12 ++++++------ docs/source/debugging.rst | 2 +- docs/source/model_sharing.rst | 2 +- docs/source/parallelism.md | 8 ++++---- 9 files changed, 28 insertions(+), 28 deletions(-) diff --git a/ISSUES.md b/ISSUES.md index e35332259a97..fa0c89610025 100644 --- a/ISSUES.md +++ b/ISSUES.md @@ -205,7 +205,7 @@ You are not required to read the following guidelines before opening an issue. H If you really tried to make a short reproducible code but couldn't figure it out, it might be that having a traceback will give the developer enough information to know what's going on. But if it is not enough and we can't reproduce the problem, we can't really solve it. - Do not dispair if you can't figure it out from the begining, just share what you can and perhaps someone else will be able to help you at the forums. + Do not despair if you can't figure it out from the beginning, just share what you can and perhaps someone else will be able to help you at the forums. If your setup involves any custom datasets, the best way to help us reproduce the problem is to create a [Google Colab notebook](https://colab.research.google.com/) that demonstrates the issue and once you verify that the issue still exists, include a link to that notebook in the Issue. Just make sure that you don't copy and paste the location bar url of the open notebook - as this is private and we won't be able to open it. Instead, you need to click on `Share` in the right upper corner of the notebook, select `Get Link` and then copy and paste the public link it will give to you. diff --git a/docs/source/add_new_model.rst b/docs/source/add_new_model.rst index 8a231cbca5c9..7b1d1d495b93 100644 --- a/docs/source/add_new_model.rst +++ b/docs/source/add_new_model.rst @@ -76,7 +76,7 @@ Let's take a look: As you can see, we do make use of inheritance in 🤗 Transformers, but we keep the level of abstraction to an absolute minimum. There are never more than two levels of abstraction for any model in the library. :obj:`BrandNewBertModel` -inherits from :obj:`BrandNewBertPreTrainedModel` which in turn inherits from :class:`~transformres.PreTrainedModel` and +inherits from :obj:`BrandNewBertPreTrainedModel` which in turn inherits from :class:`~transformers.PreTrainedModel` and that's it. As a general rule, we want to make sure that a new model only depends on :class:`~transformers.PreTrainedModel`. The important functionalities that are automatically provided to every new model are :meth:`~transformers.PreTrainedModel.from_pretrained` and @@ -271,7 +271,7 @@ logical components from one another and to have faster debugging cycles as inter notebooks are often easier to share with other contributors, which might be very helpful if you want to ask the Hugging Face team for help. If you are familiar with Jupiter notebooks, we strongly recommend you to work with them. 
-The obvious disadvantage of Jupyther notebooks is that if you are not used to working with them you will have to spend +The obvious disadvantage of Jupyter notebooks is that if you are not used to working with them you will have to spend some time adjusting to the new programming environment and that you might not be able to use your known debugging tools anymore, like ``ipdb``. @@ -674,7 +674,7 @@ the ``input_ids`` (usually the word embeddings) are identical. And then work you network. At some point, you will notice a difference between the two implementations, which should point you to the bug in the 🤗 Transformers implementation. From our experience, a simple and efficient way is to add many print statements in both the original implementation and 🤗 Transformers implementation, at the same positions in the network -respectively, and to successively remove print statements showing the same values for intermediate presentions. +respectively, and to successively remove print statements showing the same values for intermediate presentations. When you're confident that both implementations yield the same output, verifying the outputs with ``torch.allclose(original_output, output, atol=1e-3)``, you're done with the most difficult part! Congratulations - the diff --git a/docs/source/add_new_pipeline.rst b/docs/source/add_new_pipeline.rst index a0f2c1265213..a49772fbd3d2 100644 --- a/docs/source/add_new_pipeline.rst +++ b/docs/source/add_new_pipeline.rst @@ -13,9 +13,9 @@ How to add a pipeline to 🤗 Transformers? ======================================================================================================================= First and foremost, you need to decide the raw entries the pipeline will be able to take. It can be strings, raw bytes, -dictionnaries or whatever seems to be the most likely desired input. Try to keep these inputs as pure Python as -possible as it makes compatibility easier (even through other languages via JSON). Those will be the :obj:`inputs` of -the pipeline (:obj:`preprocess`). +dictionaries or whatever seems to be the most likely desired input. Try to keep these inputs as pure Python as possible +as it makes compatibility easier (even through other languages via JSON). Those will be the :obj:`inputs` of the +pipeline (:obj:`preprocess`). Then define the :obj:`outputs`. Same policy as the :obj:`inputs`. The simpler, the better. Those will be the outputs of :obj:`postprocess` method. @@ -50,15 +50,15 @@ Start by inheriting the base class :obj:`Pipeline`. with the 4 methods needed to return best_class -The structure of this breakdown is to support relatively seemless support for CPU/GPU, while supporting doing +The structure of this breakdown is to support relatively seamless support for CPU/GPU, while supporting doing pre/postprocessing on the CPU on different threads -:obj:`preprocess` will take the original defined inputs, and turn them something feedable to the model. It might -contain more information and is usally a :obj:`Dict`. +:obj:`preprocess` will take the originally defined inputs, and turn them into something feedable to the model. It might +contain more information and is usually a :obj:`Dict`. -:obj:`_forward` is the implementation detail and is not meant to be called directly :obj:`forward` is the preferred +:obj:`_forward` is the implementation detail and is not meant to be called directly. :obj:`forward` is the preferred called method as it contains safeguards to make sure everything is working on the expected device. 
If anything is -linked to a real model it belongs in the :obj:`_forward` method, anything else is in the preprocess/postrocess. +linked to a real model it belongs in the :obj:`_forward` method, anything else is in the preprocess/postprocess. :obj:`postprocess` methods will take the output of :obj:`_forward` and turn it into the final output that were decided earlier. @@ -124,7 +124,7 @@ Create a new file ``tests/test_pipelines_MY_PIPELINE.py`` with example with the The :obj:`run_pipeline_test` function will be very generic and run on small random models on every possible architecture as defined by :obj:`model_mapping` and :obj:`tf_model_mapping`. -This is very important to test future compatibilty, meaning if someone adds a new model for +This is very important to test future compatibility, meaning if someone adds a new model for :obj:`XXXForQuestionAnswering` then the pipeline test will attempt to run on it. Because the models are random it's impossible to check for actual values, that's why There is a helper :obj:`ANY` that will simply attempt to match the output of the pipeline TYPE. diff --git a/docs/source/community.md b/docs/source/community.md index aab47ba899f4..a6f52d835743 100644 --- a/docs/source/community.md +++ b/docs/source/community.md @@ -36,7 +36,7 @@ This page regroups resources around 🤗 Transformers developed by the community |[fine-tune a non-English GPT-2 Model with Trainer class](https://github.com/philschmid/fine-tune-GPT-2/blob/master/Fine_tune_a_non_English_GPT_2_Model_with_Huggingface.ipynb) | How to fine-tune a non-English GPT-2 Model with Trainer class | [Philipp Schmid](https://www.philschmid.de) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/philschmid/fine-tune-GPT-2/blob/master/Fine_tune_a_non_English_GPT_2_Model_with_Huggingface.ipynb)| |[Fine-tune a DistilBERT Model for Multi Label Classification task](https://github.com/DhavalTaunk08/Transformers_scripts/blob/master/Transformers_multilabel_distilbert.ipynb) | How to fine-tune a DistilBERT Model for Multi Label Classification task | [Dhaval Taunk](https://github.com/DhavalTaunk08) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DhavalTaunk08/Transformers_scripts/blob/master/Transformers_multilabel_distilbert.ipynb)| |[Fine-tune ALBERT for sentence-pair classification](https://github.com/NadirEM/nlp-notebooks/blob/master/Fine_tune_ALBERT_sentence_pair_classification.ipynb) | How to fine-tune an ALBERT model or another BERT-based model for the sentence-pair classification task | [Nadir El Manouzi](https://github.com/NadirEM) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NadirEM/nlp-notebooks/blob/master/Fine_tune_ALBERT_sentence_pair_classification.ipynb)| -|[Fine-tune Roberta for sentiment analysis](https://github.com/DhavalTaunk08/NLP_scripts/blob/master/sentiment_analysis_using_roberta.ipynb) | How to fine-tune an Roberta model for sentiment analysis | [Dhaval Taunk](https://github.com/DhavalTaunk08) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DhavalTaunk08/NLP_scripts/blob/master/sentiment_analysis_using_roberta.ipynb)| +|[Fine-tune Roberta for sentiment analysis](https://github.com/DhavalTaunk08/NLP_scripts/blob/master/sentiment_analysis_using_roberta.ipynb) 
| How to fine-tune a Roberta model for sentiment analysis | [Dhaval Taunk](https://github.com/DhavalTaunk08) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DhavalTaunk08/NLP_scripts/blob/master/sentiment_analysis_using_roberta.ipynb)| |[Evaluating Question Generation Models](https://github.com/flexudy-pipe/qugeev) | How accurate are the answers to questions generated by your seq2seq transformer model? | [Pascal Zoleko](https://github.com/zolekode) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bpsSqCQU-iw_5nNoRm_crPq6FRuJthq_?usp=sharing)| |[Classify text with DistilBERT and Tensorflow](https://github.com/peterbayerle/huggingface_notebook/blob/main/distilbert_tf.ipynb) | How to fine-tune DistilBERT for text classification in TensorFlow | [Peter Bayerle](https://github.com/peterbayerle) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/peterbayerle/huggingface_notebook/blob/main/distilbert_tf.ipynb)| |[Leverage BERT for Encoder-Decoder Summarization on CNN/Dailymail](https://github.com/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb) | How to warm-start a *EncoderDecoderModel* with a *bert-base-uncased* checkpoint for summarization on CNN/Dailymail | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb)| diff --git a/docs/source/converting_tensorflow_models.rst b/docs/source/converting_tensorflow_models.rst index feae098fecb2..190946c0b3d3 100644 --- a/docs/source/converting_tensorflow_models.rst +++ b/docs/source/converting_tensorflow_models.rst @@ -13,8 +13,8 @@ Converting Tensorflow Checkpoints ======================================================================================================================= -A command-line interface is provided to convert original Bert/GPT/GPT-2/Transformer-XL/XLNet/XLM checkpoints in models -than be loaded using the ``from_pretrained`` methods of the library. +A command-line interface is provided to convert original Bert/GPT/GPT-2/Transformer-XL/XLNet/XLM checkpoints to models +that can be loaded using the ``from_pretrained`` methods of the library. .. note:: Since 2.3.0 the conversion script is now part of the transformers CLI (**transformers-cli**) available in any diff --git a/docs/source/custom_datasets.rst b/docs/source/custom_datasets.rst index 8c552ab50eba..116b91cb545e 100644 --- a/docs/source/custom_datasets.rst +++ b/docs/source/custom_datasets.rst @@ -17,7 +17,7 @@ Fine-tuning with custom datasets The datasets used in this tutorial are available and can be more easily accessed using the `🤗 Datasets library `_. We do not use this library to access the datasets here since this - tutorial meant to illustrate how to work with your own data. A brief of introduction can be found at the end of the + tutorial is meant to illustrate how to work with your own data. A brief introduction can be found at the end of the tutorial in the section ":ref:`datasetslib`". This tutorial will take you through several examples of using 🤗 Transformers models with your own datasets. The guide @@ -74,8 +74,8 @@ read this in.
train_texts, train_labels = read_imdb_split('aclImdb/train') test_texts, test_labels = read_imdb_split('aclImdb/test') -We now have a train and test dataset, but let's also also create a validation set which we can use for for evaluation -and tuning without tainting our test set results. Sklearn has a convenient utility for creating such splits: +We now have a train and test dataset, but let's also create a validation set which we can use for evaluation and +tuning without tainting our test set results. Sklearn has a convenient utility for creating such splits: .. code-block:: python @@ -91,8 +91,8 @@ pre-trained DistilBert, so let's use the DistilBert tokenizer. tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased') Now we can simply pass our texts to the tokenizer. We'll pass ``truncation=True`` and ``padding=True``, which will -ensure that all of our sequences are padded to the same length and are truncated to be no longer model's maximum input -length. This will allow us to feed batches of sequences into the model at the same time. +ensure that all of our sequences are padded to the same length and are truncated to be no longer than the model's maximum +input length. This will allow us to feed batches of sequences into the model at the same time. .. code-block:: python @@ -213,7 +213,7 @@ instantiate a :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer`. Fine-tuning with native PyTorch/TensorFlow ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -We can also train use native PyTorch or TensorFlow: +We can also train using native PyTorch or TensorFlow: .. code-block:: python diff --git a/docs/source/debugging.rst b/docs/source/debugging.rst index de445b295539..235e32b77fa2 100644 --- a/docs/source/debugging.rst +++ b/docs/source/debugging.rst @@ -154,7 +154,7 @@ input elements was ``6.27e+04`` and same for the output was ``inf``. You can see here, that ``T5DenseGatedGeluDense.forward`` resulted in output activations, whose absolute max value was around 62.7K, which is very close to fp16's top limit of 64K. In the next frame we have ``Dropout`` which renormalizes the weights, after it zeroed some of the elements, which pushes the absolute max value to more than 64K, and we get an -overlow (``inf``). +overflow (``inf``). As you can see it's the previous frames that we need to look into when the numbers start going into very large for fp16 numbers. diff --git a/docs/source/model_sharing.rst b/docs/source/model_sharing.rst index 6c3641749b42..345e0db1be2c 100644 --- a/docs/source/model_sharing.rst +++ b/docs/source/model_sharing.rst @@ -90,7 +90,7 @@ Directly push your model to the hub picture-in-picture" allowfullscreen> Once you have an API token (either stored in the cache or copied and pasted in your notebook), you can directly push a -finetuned model you saved in :obj:`save_drectory` by calling: +finetuned model you saved in :obj:`save_directory` by calling: .. code-block:: python diff --git a/docs/source/parallelism.md b/docs/source/parallelism.md index 5e6a77d74f50..3ab2499542e3 100644 --- a/docs/source/parallelism.md +++ b/docs/source/parallelism.md @@ -35,7 +35,7 @@ The following is the brief description of the main concepts that will be describ 1. DataParallel (DP) - the same setup is replicated multiple times, and each being fed a slice of the data. The processing is done in parallel and all setups are synchronized at the end of each training step. 2.
TensorParallel (TP) - each tensor is split up into multiple chunks, so instead of having the whole tensor reside on a single gpu, each shard of the tensor resides on its designated gpu. During processing each shard gets processed separately and in parallel on different GPUs and the results are synced at the end of the step. This is what one may call horizontal parallelism, as the splitting happens on horizontal level. 3. PipelineParallel (PP) - the model is split up vertically (layer-level) across multiple GPUs, so that only one or several layers of the model are places on a single gpu. Each gpu processes in parallel different stages of the pipeline and working on a small chunk of the batch. -4. Zero Redundancy Optimizer (ZeRO) - Also performs sharding of the tensors somewhat similar to TP, except the whole tensor gets reconstructed in time for a forward or backward computation, therefore the model does't need to be modified. It also supports various offloading techniques to compensate for limited GPU memory. +4. Zero Redundancy Optimizer (ZeRO) - Also performs sharding of the tensors somewhat similar to TP, except the whole tensor gets reconstructed in time for a forward or backward computation, therefore the model doesn't need to be modified. It also supports various offloading techniques to compensate for limited GPU memory. 5. Sharded DDP - is another name for the foundational ZeRO concept as used by various other implementations of ZeRO. @@ -110,7 +110,7 @@ To me this sounds like an efficient group backpacking weight distribution strate 2. person B carries the stove 3. person C carries the axe -Now each night they all share what they have with others and get from others what the don't have, and in the morning they pack up their allocated type of gear and continue on their way. This is Sharded DDP / Zero DP. +Now each night they all share what they have with others and get from others what they don't have, and in the morning they pack up their allocated type of gear and continue on their way. This is Sharded DDP / Zero DP. Compare this strategy to the simple one where each person has to carry their own tent, stove and axe, which would be far more inefficient. This is DataParallel (DP and DDP) in Pytorch. @@ -140,7 +140,7 @@ we just sliced it in 2 vertically, placing layers 0-3 onto GPU0 and 4-7 to GPU1. Now while data travels from layer 0 to 1, 1 to 2 and 2 to 3 this is just the normal model. But when data needs to pass from layer 3 to layer 4 it needs to travel from GPU0 to GPU1 which introduces a communication overhead. If the participating GPUs are on the same compute node (e.g. same physical machine) this copying is pretty fast, but if the GPUs are located on different compute nodes (e.g. multiple machines) the communication overhead could be significantly larger. -Then layers 4 to 5 to 6 to 7 are as a normal model would have and when the 7th layer completes we often need to send the data back to layer 0 where the labels are (or alternatively send the labels to the the last layer). Now the loss can be computed and the optimizer can do its work. +Then layers 4 to 5 to 6 to 7 are as a normal model would have and when the 7th layer completes we often need to send the data back to layer 0 where the labels are (or alternatively send the labels to the last layer). Now the loss can be computed and the optimizer can do its work. Problems: - the main deficiency and why this one is called "naive" MP, is that all but one GPU is idle at any given moment. 
So if 4 GPUs are used, it's almost identical to quadrupling the amount of memory of a single GPU, and ignoring the rest of the hardware. Plus there is the overhead of copying the data between devices. So 4x 6GB cards will be able to accommodate the same size as 1x 24GB card using naive MP, except the latter will complete the training faster, since it doesn't have the data copying overhead. But, say, if you have 40GB cards and need to fit a 45GB model you can with 4x 40GB cards (but barely because of the gradient and optimizer states) @@ -272,7 +272,7 @@ Implementations: One of the main features of DeepSpeed is ZeRO, which is a super-scalable extension of DP. It has already been discussed in [ZeRO Data Parallel](#zero-data-parallel). Normally it's a standalone feature that doesn't require PP or TP. But it can be combined with PP and TP. -When ZeRO-DP is combined with PP (and optinally TP) it typically enables only ZeRO stage 1 (optimizer sharding). +When ZeRO-DP is combined with PP (and optionally TP) it typically enables only ZeRO stage 1 (optimizer sharding). While it's theoretically possible to use ZeRO stage 2 (gradient sharding) with Pipeline Parallelism, it will have bad performance impacts. There would need to be an additional reduce-scatter collective for every micro-batch to aggregate the gradients before sharding, which adds a potentially significant communication overhead. By nature of Pipeline Parallelism, small micro-batches are used and instead the focus is on trying to balance arithmetic intensity (micro-batch size) with minimizing the Pipeline bubble (number of micro-batches). Therefore those communication costs are going to hurt.
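To make the "naive" vertical model split described in the parallelism.md hunks above concrete, here is a minimal sketch in plain PyTorch. It is only an illustration under assumed conditions (two visible GPUs); the `NaiveMPModel` name, layer sizes, and device ids are made up for the example and are not part of this patch or of the transformers API.

```python
import torch
import torch.nn as nn


class NaiveMPModel(nn.Module):
    """Naive model parallelism: layers 0-3 live on GPU 0, layers 4-7 on GPU 1."""

    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(*[nn.Linear(512, 512) for _ in range(4)]).to("cuda:0")
        self.stage1 = nn.Sequential(*[nn.Linear(512, 512) for _ in range(4)]).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))
        # This device-to-device copy is the communication overhead discussed above;
        # while stage1 runs on GPU 1, GPU 0 sits idle (and vice versa), which is why
        # all but one GPU is idle at any given moment with naive MP.
        return self.stage1(x.to("cuda:1"))


model = NaiveMPModel()
loss = model(torch.randn(8, 512)).sum()
loss.backward()  # autograd copies gradients back across the device boundary
```

Pipeline Parallelism and ZeRO exist precisely to recover what this naive split wastes: PP feeds the stages with micro-batches so both GPUs can work at once, and ZeRO shards optimizer states, gradients, and parameters instead of splitting the layer graph.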