Commit
ZamanidisAlexios committed May 14, 2020
1 parent 8473091, commit dd17026
Showing 2 changed files with 24 additions and 3 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,16 @@ | ||
# articles | ||
# News articles classification and clustering | ||
|
||
In this repository we run text classification experiments using Support Vector | ||
Machines (SVM), Random Forest, Naive Bayes and K-Nearest Neighbors classifiers. | ||
We also perform text clustering using K-means. | ||
|
||
First of all, we create a data set from our documents. The input is 2225 documents | ||
and the labels are Business, Entertainment, Politics, Sport and Tech. | ||
|
||
To sum up, the whole procedure consists of: | ||
|
||
1) Create a data set of all documents | ||
2) Text pre-processing | ||
3) Generate Word Clouds | ||
4) Vectorization | ||
5) Classification and Clustering |
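
The pre-processing, vectorization and classification steps listed above can be sketched in plain Python. This is only an illustrative sketch, not the repository's code: the stop-word list, the smoothed TF-IDF weighting, the cosine-similarity nearest-neighbour vote, the toy corpus and every function name (`preprocess`, `fit_idf`, `tfidf`, `knn_predict`) are assumptions made for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the repository's list).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on", "as"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def fit_idf(token_lists):
    """Compute smoothed inverse document frequencies from the training documents."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def tfidf(tokens, idf):
    """Map a token list to an L2-normalised sparse TF-IDF vector (a dict)."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {term: w / norm for term, w in vec.items()} if norm else {}


def cosine(u, v):
    """Dot product of two L2-normalised sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(term, 0.0) for term, w in u.items())


def knn_predict(query, train_vecs, train_labels, k=3):
    """Return the majority label among the k most similar training vectors."""
    ranked = sorted(
        zip(train_vecs, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy stand-in corpus, not the real 2225-document data set.
train_texts = [
    "Shares rise as the bank reports profit",
    "Market falls on weak company profit",
    "Striker scores in the cup final match",
    "Team wins the match after late goal",
]
train_labels = ["Business", "Business", "Sport", "Sport"]

train_tokens = [preprocess(t) for t in train_texts]
idf = fit_idf(train_tokens)
train_vecs = [tfidf(tokens, idf) for tokens in train_tokens]

query_vec = tfidf(preprocess("The team scores a late goal"), idf)
prediction = knn_predict(query_vec, train_vecs, train_labels, k=3)
print(prediction)
```

Sparse dictionaries keep the sketch dependency-free. On the full data set a library vectorizer and classifier would likely be the more practical choice.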
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,11 @@ | ||
Consists of 2225 documents from a news website corresponding to stories in five topical areas from 2004-2005. | ||
The data set consists of 2225 documents from a news website, corresponding to stories in | ||
five topical areas from 2004-2005. | ||
|
||
Natural Classes: 5 (business, entertainment, politics, sport, tech) | ||
Natural Classes: 5 | ||
* Business | ||
* Entertainment | ||
* Politics | ||
* Sport | ||
* Tech | ||
|
||
The first line of each document is the title; the rest is the content of the article.
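
The title/content convention described above could be handled with a small helper like the one below. This is a hedged sketch: `split_document` and `load_corpus` are hypothetical names, and the `root/<label>/<name>.txt` directory layout is an assumption, since the description does not say how the files are organised.

```python
from pathlib import Path


def split_document(raw_text):
    """Split one raw article into (title, content): first line vs. the rest."""
    lines = raw_text.strip().splitlines()
    title = lines[0].strip() if lines else ""
    content = "\n".join(lines[1:]).strip()
    return title, content


def load_corpus(root):
    """Collect (label, title, content) rows from a directory tree.

    Assumes a hypothetical layout of root/<label>/<name>.txt, with the
    folder name used as the class label.
    """
    rows = []
    for path in sorted(Path(root).glob("*/*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        title, content = split_document(text)
        rows.append((path.parent.name, title, content))
    return rows


title, content = split_document("Sample headline\n\nFirst paragraph.\nSecond paragraph.\n")
print(title)
print(content)
```

Keeping the title separate from the body makes it possible to weight or vectorize the two fields differently later, if that turns out to be useful.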