Skip to content

Latest commit

 

History

History
160 lines (117 loc) · 5.02 KB

README.md

File metadata and controls

160 lines (117 loc) · 5.02 KB

To build a very simple NLP (Natural Language Processing) model , we’ll create a basic text classification model that can categorize text into different categories, such as positive or negative sentiments. We’ll use Python and a library called scikit-learn, which is great for beginners.

Step 1: Set Up Your Environment

  1. Install Python: Make sure you have Python installed on your computer. You can download it from python.org.

  2. Install Required Libraries:
    Open your terminal or command prompt and run:

    pip install scikit-learn  

Step 2: Import Libraries

Open a new Python file or a Jupyter Notebook and import the necessary libraries:

from sklearn.feature_extraction.text import CountVectorizer  
from sklearn.model_selection import train_test_split  
from sklearn.naive_bayes import MultinomialNB  
from sklearn.metrics import accuracy_score  

Step 3: Prepare Your Data

For simplicity, let’s create a small dataset with some example sentences labeled as positive or negative.

# Example data  
texts = [  
    "I love this movie",  
    "This film was terrible",  
    "What a great experience",  
    "I hate this!",  
    "Absolutely fantastic!",  
    "Not good at all",  
    "Best day ever",  
    "Worst day ever"  
]  
   
# Labels: 1 for positive, 0 for negative  
labels = [1, 0, 1, 0, 1, 0, 1, 0]  

Step 4: Convert Text to Numbers

Machines understand numbers, so we need to convert the text into numerical data. We’ll use Bag of Words with CountVectorizer for this.

# Initialize the vectorizer  
vectorizer = CountVectorizer()  
   
# Fit and transform the text data  
X = vectorizer.fit_transform(texts)  
   
# Target labels  
y = labels  

Step 5: Split the Data

We’ll split the data into training and testing sets to evaluate our model’s performance.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)  

Step 6: Train the Model

We’ll use a simple Naive Bayes classifier, which works well for text classification tasks. For more details about Naive Bayes algorithms, check out the scikit-learn Naive Bayes documentation.

# Initialize the classifier  
clf = MultinomialNB()  
   
# Train the model  
clf.fit(X_train, y_train)  

Step 7: Make Predictions

Now, let’s see how well our model does on the test data.

# Predict on the test set  
y_pred = clf.predict(X_test)  

Step 8: Evaluate the Model

We’ll check the accuracy of our model.

# Calculate accuracy  
accuracy = accuracy_score(y_test, y_pred)  
print(f"Accuracy: {accuracy * 100:.2f}%")  

Output:

Accuracy: 100.00%  

Note: Since our dataset is very small, the model might show perfect accuracy, but in real scenarios, you need larger datasets for reliable performance.

Step 9: Test with New Data

Let’s try classifying some new sentences.

# New examples  
new_texts = [  
    "I really enjoy this!",  
    "This is awful",  
    "What a wonderful day",  
    "I’m so sad about this"  
]  
   
# Convert to numerical data  
new_X = vectorizer.transform(new_texts)  
   
# Make predictions  
new_predictions = clf.predict(new_X)  
   
# Display results  
for text, label in zip(new_texts, new_predictions):  
    sentiment = "Positive" if label == 1 else "Negative"  
    print(f"'{text}' -> {sentiment}")  

Output:

'I really enjoy this!' -> Positive  
'This is awful' -> Negative  
'What a wonderful day' -> Positive  
'I’m so sad about this' -> Negative  

Summary

You’ve just built a simple NLP model that can classify text as positive or negative! Here’s what we did:

  1. Set Up: Installed Python and scikit-learn.
  2. Imported Libraries: Brought in necessary tools.
  3. Prepared Data: Created example sentences and labels.
  4. Converted Text: Turned text into numbers using Bag of Words.
  5. Split Data: Divided data into training and testing sets.
  6. Trained Model: Used Naive Bayes to learn from training data.
  7. Made Predictions: Tested the model on unseen data.
  8. Evaluated: Checked accuracy.
  9. Tested New Data: Saw how the model handles new sentences.

Next Steps

To build on this, you can:

  • Use a Larger Dataset: More data can improve your model’s accuracy.
  • Explore Different Algorithms: Try other classifiers like Logistic Regression or Support Vector Machines.
  • Improve Text Processing: Use techniques like TF-IDF or word embeddings for better feature representation.
  • Handle More Classes: Extend the model to classify into more categories beyond positive and negative.