#Text Tokenization with NLTK in Natural Language Processing #Word Tokenisation #Word normalization #stemming #Lemmatization #nltk

Sarinakasaiyan/Natural-Language-Processing-NLP-Tokenization

Text Tokenization with NLTK in Natural Language Processing

This project demonstrates how to use the NLTK (Natural Language Toolkit) module for text tokenization. Tokenization is a fundamental step in Natural Language Processing (NLP) that divides text into smaller, processable units.

1. Section 1: word tokenization

This code uses the NLTK library to tokenize a given sentence into individual words. It downloads the necessary data and prints the tokens, including words and punctuation marks, from the input string.

2. Section 2: word tokenization with newline characters

This code uses the NLTK library to tokenize a sentence into individual words. After downloading the necessary data, it prints the tokens, which contain only the words and punctuation marks from the input string. Whitespace characters such as \n act only as separators and do not appear in the output.

3. Section 3: sentence tokenization

This code uses the NLTK library to split a text into separate sentences. After downloading the necessary data, it prints the sentences from the input string. Special characters like \n are not returned as separate tokens: a \n that falls inside a sentence is kept as part of that sentence's string.

4. Section 4: lemmatization with WordNet

This code uses the WordNet lemmatizer from the NLTK library to convert words to their base forms.

Explanation:

  • Importing Libraries : The nltk package and the WordNetLemmatizer class are imported.
  • Creating an Instance : An instance of the WordNetLemmatizer is created.
  • Lemmatization:
    • 'dogs' is reduced to 'dog'
    • 'pianos' is converted to 'piano'
    • 'python' remains unchanged
    • 'cheaper' is lemmatized to 'cheap' (pos="a")
    • 'better' remains 'better' (pos="n"), because it is looked up as a noun; with pos="a" it would be lemmatized to 'good'
    • 'playing' is reduced to 'play' (pos="v")

Summary: The code demonstrates how to use the WordNet lemmatizer to reduce words to their base forms while considering parts of speech.

Regular Expression [How to find all occurrences of "the" in the text?]

[^a-zA-Z]?[tT]he[^a-zA-Z]

Test String [https://www.regexpal.com]
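The same pattern can be tried in Python's standard `re` module instead of an online tester. This is a sketch with a hypothetical test string. It also shows a side effect of the optional leading character class: because `[^a-zA-Z]?` may match nothing, the pattern also matches the "the" at the end of a word such as "bathe". The `\b` word-boundary version is shown only as a possible alternative, not as the pattern used in this project.

```python
import re

# Hypothetical test string; "bathe" is included on purpose to expose a false positive.
text = "The dog and the cat bathe in the sun."

# Pattern from this section: optional non-letter, "the" or "The", then a non-letter.
pattern = re.compile(r"[^a-zA-Z]?[tT]he[^a-zA-Z]")
matches = pattern.findall(text)
print(matches)

# Possible alternative (an assumption, not part of the original project):
# word boundaries match whole words only and do not consume neighbouring characters.
boundary_matches = re.findall(r"\b[tT]he\b", text)
print(boundary_matches)
```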

Regular Expression [How to find all occurrences of "the" in the text except the one that starts the first sentence, by leaving out the question mark]

[^a-zA-Z][tT]he[^a-zA-Z]

Test String [https://www.regexpal.com]
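A sketch of the effect of leaving out the question mark, using Python's `re` module and a hypothetical test string. It assumes the pattern is the earlier one with only the `?` removed, with no literal spaces inside it. Because a non-letter is now required before "the", a "The" at the very start of the text has nothing in front of it and is not matched.

```python
import re

# Hypothetical test string; it starts with "The" and contains "bathe".
text = "The dog and the cat bathe in the sun."

# A non-letter is now mandatory on both sides of "the" / "The".
pattern = re.compile(r"[^a-zA-Z][tT]he[^a-zA-Z]")
matches = pattern.findall(text)
print(matches)

# The match positions show that the opening "The" (index 0) is skipped.
starts = [m.start() for m in pattern.finditer(text)]
print(starts)
```

Requiring the leading non-letter also stops the pattern from matching the "the" inside "bathe", at the cost of missing a "The" at the start of the text.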
