Forked from my term Project for COMP7305 Cluster and Cloud Computing.
And I'll continue to finish and test my ideas here.
- CloudWeb:
- Show statistics data and sentiment analysis result.
- Collector:
- Collect real-time data by Twitter Python Api.
- Transform data to Kafka by Flume.
- StreamProcessorFlink:
- Analyze statistics from different dimensions.
- StreamProcessorSpark:
- Train and generate Naive Bayes Model.
- Analyze sentiment of Twitter.
Spark-MLlib-Twitter-Sentiment-Analysis
-
Navie Bayes Data 用于Spark 提前训练.
-
Stanford Core NLP 模型路径,maven 依赖中 stanford-corenlp-models
-
Deep Learning 模型
手动生成
词向量
Flume-> Kafka -> Spark Streaming -> Kafka
-> Flink -> Kafka
-> DL4J -> Kafka
Flume 将Twitter Data 搬运存储到 topic : alex1
供 Spark & Flink & DL4J 订阅。
Spark Streaming 读取topic : alex1
进行情感分析,存储结果数据到 topic : twitter-result1
供Web端订阅。
Cloud Web DL4J 读取topic : alex1
进行 dl4j 情感分析,结果数据不存储,直接吐到WebSocket监听的路由里,供Web端订阅。
Flink 读取topic : alex1
进行数据统计分析,
- twitter 语言统计结果存储到
topic : twitter-flink-lang
供Web端订阅。 - twitter 用户fans统计结果存储到
topic : twitter-flink-fans
供Web端订阅。 - twitter 用户geo统计结果存储到
topic : twitter-flink-geo
供Web端订阅。
数据格式:
- Twitter 元数据
twitter4j.Status
- 情感分析结果
ID¦Name¦Text¦NLP¦MLlib¦DeepLearning¦Latitude¦Longitude¦Profile¦Date
- Lang 统计结果
{"pt":2,"ot":26,"ja":3,"en":453,"fr":12,"es":4,}
- Fans 统计结果
200|800|500~1000|above 1000
- map 统计结果
Latitude|Longitude|time
- Start Flume to collect twitter data and transport into Kafka.
nohup flume-ng agent -f /opt/spark-twitter/7305CloudProject/Collector/TwitterToKafka.conf -Dflume.root.logger=DEBUG,console -n a1 >> flume.log 2>&1 &
- Start Spark Streaming to analysis twitter text sentiment using stanford nlp & naive bayes.
spark-submit --class "hk.hku.spark.TweetSentimentAnalyzer" --master yarn --deploy-mode cluster --num-executors 2 --executor-memory 4g --executor-cores 4 --driver-memory 4g --conf spark.kryoserializer.buffer.max=2048 --conf spark.yarn.executor.memoryOverhead=2048 /opt/spark-twitter/7305CloudProject/StreamProcessorSpark/target/StreamProcessorSpark-jar-with-dependencies.jar
- Start Flink
flink run /opt/spark-twitter/7305CloudProject/StreamProcessorFlink/target/StreamProcessorFlink-1.0-SNAPSHOT.jar
- Start CloudWeb to show the result on the website.
nohup java -Xmx3072m -jar /opt/spark-twitter/7305CloudProject/CloudWeb/target/CloudWeb-1.0-SNAPSHOT.jar &