
Realtime Twitter Stream Analysis System

Forked from my term project for COMP7305 Cluster and Cloud Computing.

I will continue to develop and test my ideas here.

Environment

Teammates:

Project Structure

  • CloudWeb:
    • Display statistics and sentiment analysis results.
  • Collector:
    • Collect real-time data via the Twitter Python API.
    • Transport the data to Kafka via Flume.
  • StreamProcessorFlink:
    • Compute statistics across different dimensions.
  • StreamProcessorSpark:
    • Train and generate the Naive Bayes model.
    • Analyze the sentiment of tweets.

Project Documents

Proposal

Presentation PPT

Related Project

Spark-MLlib-Twitter-Sentiment-Analysis

flume_kafka_spark_solr_hive

corenlp-scala-examples

deeplearning4j

canvas-barrage

Data

    Flume -> Kafka -> Spark Streaming -> Kafka
                   -> Flink           -> Kafka
                   -> DL4J            -> Kafka

Flume moves the Twitter data into topic alex1, which Spark, Flink, and DL4J subscribe to.

Spark Streaming reads topic alex1, performs sentiment analysis, and stores the results in topic twitter-result1 for the web frontend to subscribe to.
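A rough sketch of how this wiring could look (the topic names and the class name TweetSentimentAnalyzer come from this README; the broker address, group id, and the classifySentiment helper are assumptions for illustration, not the project's actual code):

```scala
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object TweetSentimentAnalyzerSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("TweetSentimentAnalyzer"), Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",        // assumed broker address
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "spark-sentiment")

    // Read raw tweets from topic alex1
    val tweets = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("alex1"), kafkaParams))

    // Score each tweet and write the result back to topic twitter-result1
    tweets.map(record => classifySentiment(record.value())).foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        val props = new java.util.Properties()
        props.put("bootstrap.servers", "localhost:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        partition.foreach(line => producer.send(new ProducerRecord[String, String]("twitter-result1", line)))
        producer.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }

  // Placeholder for the Stanford NLP / Naive Bayes scoring described above
  def classifySentiment(tweetJson: String): String = ???
}
```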

CloudWeb's DL4J component reads topic alex1 and runs DL4J sentiment analysis; the results are not stored, but are pushed directly to the route the WebSocket listens on, for the web frontend to consume.
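As an illustration only, a minimal consumer loop in this spirit might look as follows; scoreWithDl4j and broadcast are hypothetical stand-ins for the DL4J scoring call and CloudWeb's WebSocket push, and the broker address is assumed:

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

object Dl4jWebSocketFeedSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")   // assumed broker address
    props.put("group.id", "cloudweb-dl4j")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("alex1"))

    while (true) {
      val records = consumer.poll(Duration.ofMillis(500)).iterator()
      while (records.hasNext) {
        val record = records.next()
        val score = scoreWithDl4j(record.value())   // hypothetical DL4J scoring call
        broadcast(score)                            // hypothetical WebSocket push; nothing is persisted
      }
    }
  }

  // Placeholders for the real DL4J model and CloudWeb's WebSocket route
  def scoreWithDl4j(tweetJson: String): String = ???
  def broadcast(message: String): Unit = ???
}
```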

Flink reads topic alex1 and computes statistics (a sketch follows this list):

  • Twitter language statistics are stored in topic twitter-flink-lang for the web frontend to subscribe to.
  • Twitter user fans (follower) statistics are stored in topic twitter-flink-fans for the web frontend to subscribe to.
  • Twitter user geo statistics are stored in topic twitter-flink-geo for the web frontend to subscribe to.
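A sketch of the language-statistics branch, assuming the universal Flink Kafka connector (version-suffixed classes such as FlinkKafkaConsumer010 may apply instead), an assumed broker address, and a hypothetical extractLang helper:

```scala
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer, FlinkKafkaProducer}

object TwitterLangStatsSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092")   // assumed broker address
    props.setProperty("group.id", "flink-lang-stats")

    // Raw tweets from topic alex1
    val tweets = env.addSource(new FlinkKafkaConsumer[String]("alex1", new SimpleStringSchema(), props))

    // Count tweets per language in a short processing-time window; this sketch
    // emits one {"lang":count} entry per language rather than one combined object
    val langCounts = tweets
      .map(json => (extractLang(json), 1))
      .keyBy(_._1)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
      .sum(1)
      .map { case (lang, count) => s"""{"$lang":$count}""" }

    // Publish the counts to topic twitter-flink-lang (convenience constructor)
    langCounts.addSink(new FlinkKafkaProducer[String]("localhost:9092", "twitter-flink-lang", new SimpleStringSchema()))

    env.execute("TwitterLangStats")
  }

  // Placeholder: pull the "lang" field out of the tweet JSON
  def extractLang(tweetJson: String): String = ???
}
```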

Data formats (a parsing sketch follows this list):

  • Twitter metadata: twitter4j.Status
  • Sentiment analysis result: ID¦Name¦Text¦NLP¦MLlib¦DeepLearning¦Latitude¦Longitude¦Profile¦Date
  • Language statistics result: {"pt":2,"ot":26,"ja":3,"en":453,"fr":12,"es":4,}
  • Fans statistics result: 200|800|500~1000|above 1000
  • Map statistics result: Latitude|Longitude|time
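For example, a consumer of twitter-result1 could split each record on the ¦ delimiter; the field order follows the format listed above, while the case class and helper names are illustrative:

```scala
// Illustrative record type mirroring the sentiment result format above
case class SentimentRecord(id: String, name: String, text: String,
                           nlp: String, mllib: String, deepLearning: String,
                           latitude: String, longitude: String,
                           profile: String, date: String)

// Parse one line from topic twitter-result1; returns None for malformed records
def parseSentiment(line: String): Option[SentimentRecord] =
  line.split('¦') match {
    case Array(id, name, text, nlp, mllib, dl, lat, lon, profile, date) =>
      Some(SentimentRecord(id, name, text, nlp, mllib, dl, lat, lon, profile, date))
    case _ => None
  }
```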

Operation & Conf

Run

  • Start Flume to collect Twitter data and transport it into Kafka (a configuration sketch is given at the end of this section).
nohup flume-ng agent -f /opt/spark-twitter/7305CloudProject/Collector/TwitterToKafka.conf -Dflume.root.logger=DEBUG,console -n a1 >> flume.log 2>&1 &
  • Start Spark Streaming to analyze tweet text sentiment using Stanford NLP and Naive Bayes.
spark-submit --class "hk.hku.spark.TweetSentimentAnalyzer" --master yarn --deploy-mode cluster --num-executors 2 --executor-memory 4g --executor-cores 4 --driver-memory 4g --conf spark.kryoserializer.buffer.max=2048 --conf spark.yarn.executor.memoryOverhead=2048 /opt/spark-twitter/7305CloudProject/StreamProcessorSpark/target/StreamProcessorSpark-jar-with-dependencies.jar
  • Start Flink
flink run /opt/spark-twitter/7305CloudProject/StreamProcessorFlink/target/StreamProcessorFlink-1.0-SNAPSHOT.jar
  • Start CloudWeb to show the results on the website.
nohup java -Xmx3072m -jar /opt/spark-twitter/7305CloudProject/CloudWeb/target/CloudWeb-1.0-SNAPSHOT.jar &
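For reference, a hypothetical sketch of what Collector/TwitterToKafka.conf could contain. The agent name a1 matches the -n flag above and the topic name alex1 comes from this README, but the source type, file path, channel sizing, and broker address are assumptions, not the project's actual configuration:

```properties
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: tail the file written by the Twitter Python API collector (assumed path)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/spark-twitter/collector/tweets.json
a1.sources.r1.channels = c1

# Channel: in-memory buffer between the collector and Kafka
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# Sink: publish every event to Kafka topic alex1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = alex1
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.channel = c1
```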