Analyzing real-time Twitter data with Microsoft Azure Key Phrase Extraction and MapReduce aggregation to produce a word cloud of trending topics.
- Microsoft Azure account & active subscription
- Twitter developer account & active application
- Azure Cognitive Services resource
- Hadoop 3.7 cluster
- Python 3.X
- Azure Blob Storage (azure-storage-blob)
- tweepy
- pandas
- PIL (Pillow)
- wordcloud
- matplotlib
- cv2 (opencv-python)
- Clone this repository and install the required dependencies using `pip`
- Replace lines 13-16 in `scrape_twitter.py` with your Twitter application keys
- Replace lines 19-20 in `scrape_twitter.py` with your Microsoft Azure Cognitive Services key and Azure Blob Storage account key, respectively
- Replace lines 15-16 in `generate_graph.py` with the same Azure Blob Storage account name & key
- Replace lines 3-4 of `\mapreduce\download.py` and `\mapreduce\upload.py` with your Azure Blob Storage account name & key
- Copy the files in `\mapreduce` to your running Hadoop cluster
- Run `python scrape_twitter.py` to start streaming Twitter data and analyzing it with Microsoft Azure Key Phrase Extraction
- SSH into the Hadoop cluster and run `bash run.sh` to start the MapReduce job
- Run `python generate_graph.py` locally in another window to generate the final cloud graph
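
For readers unfamiliar with the aggregation step, the sketch below shows the general shape of a MapReduce phrase count in plain Python. It is not the code in `\mapreduce`; the function names (`mapper`, `reducer`, `run_local`) and the one-phrase-per-line input are assumptions made for illustration, and the shuffle/sort that Hadoop performs between the two phases is simulated here with `sorted`.

```python
from itertools import groupby


def mapper(lines):
    """Emit (phrase, 1) for every non-empty key phrase (assumed: one phrase per line)."""
    for line in lines:
        phrase = line.strip().lower()
        if phrase:
            yield phrase, 1


def reducer(pairs):
    """Sum the counts for each phrase. Pairs must arrive sorted by phrase,
    which is what Hadoop's shuffle/sort phase guarantees for a reducer."""
    for phrase, group in groupby(pairs, key=lambda kv: kv[0]):
        yield phrase, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> shuffle/sort -> reduce in a single process."""
    return dict(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    sample = ["World Cup", "world cup", "election", "", "World Cup"]
    print(run_local(sample))
```

Running the sample locally is a quick way to check the counting logic before copying anything to the cluster.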
The cloud graph will update in real time (every minute) as it receives new data from the MapReduce framework running on the Hadoop cluster.
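
The periodic refresh can be sketched as a polling loop. This is only an illustration of the idea, not the contents of `generate_graph.py`: the names `parse_counts` and `refresh_loop`, the injected `fetch` and `render` callables, and the tab-separated `phrase<TAB>count` input format are all assumptions.

```python
import time


def parse_counts(text):
    """Parse 'phrase<TAB>count' lines (assumed format) into a {phrase: count} dict.
    Lines without a tab or without a numeric count are skipped."""
    freqs = {}
    for line in text.splitlines():
        phrase, sep, count = line.rpartition("\t")
        if not sep or not phrase or not count.strip().isdigit():
            continue
        freqs[phrase] = freqs.get(phrase, 0) + int(count)
    return freqs


def refresh_loop(fetch, render, interval_seconds=60, max_rounds=None, sleep=time.sleep):
    """Call fetch() every interval, parse the result, and hand it to render().

    fetch  -- hypothetical callable returning the latest aggregated text
              (for example, the contents of a blob downloaded from storage)
    render -- hypothetical callable that redraws the cloud from a frequency dict
    max_rounds -- stop after this many rounds; None means run until interrupted
    """
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        freqs = parse_counts(fetch())
        if freqs:  # keep the previous image if nothing usable arrived
            render(freqs)
        rounds += 1
        if max_rounds is None or rounds < max_rounds:
            sleep(interval_seconds)
```

A frequency dict of this shape can be passed to `WordCloud().generate_from_frequencies(freqs)` from the `wordcloud` package, so `render` could wrap that call together with the matplotlib display code.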