This project demonstrates how to use the NLTK (Natural Language Toolkit) module for text tokenization. Tokenization is a fundamental step in Natural Language Processing (NLP) that divides text into smaller, processable units.
This code uses the NLTK library to tokenize a given sentence into individual words. It downloads the necessary data and prints the tokens, including words and punctuation marks, from the input string.
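A minimal sketch of this step, assuming the standard `word_tokenize` function and the `punkt` tokenizer data; the sample sentence is an illustrative placeholder:

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")  # tokenizer models used by word_tokenize

text = "NLTK makes tokenization easy, doesn't it?"  # illustrative placeholder
tokens = word_tokenize(text)
print(tokens)
# ['NLTK', 'makes', 'tokenization', 'easy', ',', 'does', "n't", 'it', '?']
```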
This code uses the NLTK library to tokenize a sentence into individual words. After downloading the necessary data, it prints the tokens, which include only words and punctuation marks from the input string, while special characters like \n are ignored.
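A minimal sketch of the same idea with a \n in the input, again using a placeholder string:

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")

text = "Hello world.\nThis is NLTK."  # illustrative placeholder
print(word_tokenize(text))
# ['Hello', 'world', '.', 'This', 'is', 'NLTK', '.'] -- no '\n' token appears
```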
This code uses the NLTK library to tokenize a text into separate sentences. After downloading the necessary data, it prints the sentences from the input string; unlike word tokenization, special characters like \n are kept inside the resulting sentence strings.
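A minimal sketch of sentence tokenization with `sent_tokenize`; the text is an illustrative placeholder:

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")

text = "Tokenization is useful. It splits\ntext into sentences. Easy!"  # placeholder
for sentence in sent_tokenize(text):
    print(repr(sentence))
# 'Tokenization is useful.'
# 'It splits\ntext into sentences.'
# 'Easy!'
```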
This code uses the WordNet lemmatizer from the NLTK library to convert words to their base forms.
Explanation:
- Importing Libraries: the NLTK module and the WordNetLemmatizer class are imported.
- Creating an Instance: an instance of WordNetLemmatizer is created.
- Lemmatization:
  - 'dogs' is reduced to 'dog'
  - 'pianos' is converted to 'piano'
  - 'python' remains unchanged
  - 'cheaper' is lemmatized to 'cheap' (pos='a')
  - 'better' remains 'better' (pos='n')
  - 'playing' is reduced to 'play' (pos='v')
Summary: the code demonstrates how to use the WordNet lemmatizer to reduce words to their base forms while taking the part of speech into account, as shown in the sketch below.
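A minimal sketch of the steps above, assuming the standard `WordNetLemmatizer` API; the expected outputs are noted in comments:

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")  # WordNet data used by the lemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("dogs"))              # dog    (default pos='n')
print(lemmatizer.lemmatize("pianos"))            # piano
print(lemmatizer.lemmatize("python"))            # python (already a base form)
print(lemmatizer.lemmatize("cheaper", pos="a"))  # cheap  (adjective)
print(lemmatizer.lemmatize("better", pos="n"))   # better (unchanged as a noun)
print(lemmatizer.lemmatize("playing", pos="v"))  # play   (verb)
```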
Regular Expression [How to find all occurrences of 'the' in the text?]
[^a-zA-Z]?[tT]he[^a-zA-Z]
Test String [https://www.regexpal.com]
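A minimal sketch applying this pattern with Python's `re` module instead of the online tester; the sample text is an illustrative placeholder:

```python
import re

pattern = r"[^a-zA-Z]?[tT]he[^a-zA-Z]"
text = "The thesis cites the book. Put it on the shelf."  # placeholder

print(re.findall(pattern, text))
# ['The ', ' the ', ' the '] -- 'thesis' is skipped because a letter follows 'the'
```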
Regular Expression [How to find all occurrences of 'the' in the text except a 'The' at the very start, by removing the question mark]
[^a-zA-Z][tT]he[^a-zA-Z]
Test String [https://www.regexpal.com]
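A minimal sketch contrasting the two patterns; without the `?`, a preceding non-letter character is required, so a 'The' at the very start of the text no longer matches. The sample text is an illustrative placeholder:

```python
import re

text = "The cat saw the dog. Then the dog chased the cat."  # placeholder

print(re.findall(r"[^a-zA-Z]?[tT]he[^a-zA-Z]", text))  # 4 matches, including the leading 'The'
print(re.findall(r"[^a-zA-Z][tT]he[^a-zA-Z]", text))   # 3 matches, leading 'The' excluded
```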