To build a very simple NLP (Natural Language Processing) model, we’ll create a basic text classification model that can sort text into categories, such as positive or negative sentiment. We’ll use Python and a library called scikit-learn, which is great for beginners.
- Install Python: Make sure you have Python installed on your computer. You can download it from python.org.
- Install Required Libraries: Open your terminal or command prompt and run:
pip install scikit-learn
Open a new Python file or a Jupyter Notebook and import the necessary libraries:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
For simplicity, let’s create a small dataset with some example sentences labeled as positive or negative.
# Example data
texts = [
"I love this movie",
"This film was terrible",
"What a great experience",
"I hate this!",
"Absolutely fantastic!",
"Not good at all",
"Best day ever",
"Worst day ever"
]
# Labels: 1 for positive, 0 for negative
labels = [1, 0, 1, 0, 1, 0, 1, 0]
Machines understand numbers, so we need to convert the text into numerical data. We’ll use Bag of Words with CountVectorizer for this.
# Initialize the vectorizer
vectorizer = CountVectorizer()
# Fit and transform the text data
X = vectorizer.fit_transform(texts)
# Target labels
y = labels
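If you’re curious what the Bag-of-Words features actually look like, you can inspect the learned vocabulary and the count matrix. This is a quick, optional sketch (get_feature_names_out requires scikit-learn 1.0 or newer):
# Inspect the learned vocabulary: one column per unique word
print(vectorizer.get_feature_names_out())
# View the sparse matrix as a dense array: one row per sentence,
# each entry counts how often that word appears in the sentence
print(X.toarray())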
We’ll split the data into training and testing sets to evaluate our model’s performance.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
We’ll use a simple Naive Bayes classifier, which works well for text classification tasks. For more details about Naive Bayes algorithms, check out the scikit-learn Naive Bayes documentation.
# Initialize the classifier
clf = MultinomialNB()
# Train the model
clf.fit(X_train, y_train)
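Under the hood, MultinomialNB estimates how likely each word is under each class. If you want some intuition for what was learned, you can peek at its fitted attributes (optional, just for exploration):
# clf.feature_log_prob_ holds log P(word | class):
# rows follow clf.classes_ ([0, 1] here), columns follow the vocabulary
print(clf.classes_)
print(clf.feature_log_prob_.shape)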
Now, let’s see how well our model does on the test data.
# Predict on the test set
y_pred = clf.predict(X_test)
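Beyond hard labels, MultinomialNB can also report how confident it is via predict_proba. A small optional sketch:
# Per-class probabilities for each test sentence
# (column order follows clf.classes_, here [0, 1])
for probs in clf.predict_proba(X_test):
    print(f"Negative: {probs[0]:.2f}, Positive: {probs[1]:.2f}")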
We’ll check the accuracy of our model.
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
Output:
Accuracy: 100.00%
Note: Since our dataset is very small, the model may show perfect accuracy; in real scenarios you need larger datasets for a reliable performance estimate.
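One way to get a somewhat more honest estimate on a small dataset is cross-validation, which trains and tests on several different splits and averages the results. Here’s a minimal sketch using scikit-learn’s cross_val_score; with only 8 sentences the numbers are still noisy, so treat it as a demonstration of the technique rather than a real benchmark:
from sklearn.model_selection import cross_val_score

# 4-fold cross-validation: train on 6 sentences, test on 2, four times over
scores = cross_val_score(MultinomialNB(), X, y, cv=4)
print(f"Fold accuracies: {scores}")
print(f"Mean accuracy: {scores.mean() * 100:.2f}%")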
Let’s try classifying some new sentences.
# New examples
new_texts = [
"I really enjoy this!",
"This is awful",
"What a wonderful day",
"I’m so sad about this"
]
# Convert to numerical data
new_X = vectorizer.transform(new_texts)
# Make predictions
new_predictions = clf.predict(new_X)
# Display results
for text, label in zip(new_texts, new_predictions):
    sentiment = "Positive" if label == 1 else "Negative"
    print(f"'{text}' -> {sentiment}")
Output:
'I really enjoy this!' -> Positive
'This is awful' -> Negative
'What a wonderful day' -> Positive
'I’m so sad about this' -> Negative
You’ve just built a simple NLP model that can classify text as positive or negative! Here’s what we did:
- Set Up: Installed Python and scikit-learn.
- Imported Libraries: Brought in necessary tools.
- Prepared Data: Created example sentences and labels.
- Converted Text: Turned text into numbers using Bag of Words.
- Split Data: Divided data into training and testing sets.
- Trained Model: Used Naive Bayes to learn from training data.
- Made Predictions: Tested the model on unseen data.
- Evaluated: Checked accuracy.
- Tested New Data: Saw how the model handles new sentences.
To build on this, you can:
- Use a Larger Dataset: More data can improve your model’s accuracy.
- Explore Different Algorithms: Try other classifiers like Logistic Regression or Support Vector Machines (Logistic Regression appears in the sketch after this list).
- Improve Text Processing: Use techniques like TF-IDF or word embeddings for better feature representation (TF-IDF also appears in the sketch below).
- Handle More Classes: Extend the model to classify into more categories beyond positive and negative.
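To make these ideas concrete, here is a hedged sketch that swaps in TF-IDF features and a Logistic Regression classifier using a scikit-learn Pipeline. It reuses the texts and labels from above; on a dataset this tiny the results won’t differ much, but the same structure scales to larger corpora and to more than two classes:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# TF-IDF down-weights words that appear in many sentences;
# Logistic Regression is a strong linear baseline for text
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# The pipeline vectorizes and trains in a single call
pipeline.fit(texts, labels)
print(pipeline.predict(["What a great film", "This was terrible"]))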