
Realtime Twitter Stream Analysis System

Forked from my term project for COMP7305 Cluster and Cloud Computing.

I will continue to develop and test my ideas here.

Environment

Teammates:

Project Structure

  • CloudWeb:
    • Display statistics and sentiment analysis results.
  • Collector:
    • Collect real-time data via the Twitter Python API.
    • Transport the data to Kafka via Flume.
  • StreamProcessorFlink:
    • Compute statistics across different dimensions.
  • StreamProcessorSpark:
    • Train and generate the Naive Bayes model.
    • Analyze the sentiment of tweets.

Project Documents

Proposal

Presentation PPT

Related Project

Spark-MLlib-Twitter-Sentiment-Analysis

flume_kafka_spark_solr_hive

corenlp-scala-examples

deeplearning4j

canvas-barrage

Data

    Flume -> Kafka -> Spark Streaming -> Kafka
                   -> Flink           -> Kafka
                   -> DL4J            -> Kafka

Flume moves the Twitter data into topic alex1, which Spark, Flink, and DL4J subscribe to.

Spark Streaming reads topic alex1, performs sentiment analysis, and stores the results in topic twitter-result1 for the web frontend to subscribe to.
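A rough sketch of how this wiring could look (the topic names and the class name TweetSentimentAnalyzer come from this README; the broker address, group id, and the classifySentiment helper are assumptions for illustration, not the project's actual code):

```scala
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object TweetSentimentAnalyzerSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("TweetSentimentAnalyzer"), Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",        // assumed broker address
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "spark-sentiment")

    // Read raw tweets from topic alex1
    val tweets = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("alex1"), kafkaParams))

    // Score each tweet and write the result back to topic twitter-result1
    tweets.map(record => classifySentiment(record.value())).foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        val props = new java.util.Properties()
        props.put("bootstrap.servers", "localhost:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        partition.foreach(line => producer.send(new ProducerRecord[String, String]("twitter-result1", line)))
        producer.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }

  // Placeholder for the Stanford NLP / Naive Bayes scoring described above
  def classifySentiment(tweetJson: String): String = ???
}
```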

CloudWeb's DL4J component reads topic alex1 and runs DL4J sentiment analysis; the results are not stored, but are pushed directly to the route the WebSocket listens on, for the web frontend to consume.
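As an illustration only, a minimal consumer loop in this spirit might look as follows; scoreWithDl4j and broadcast are hypothetical stand-ins for the DL4J scoring call and CloudWeb's WebSocket push, and the broker address is assumed:

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

object Dl4jWebSocketFeedSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")   // assumed broker address
    props.put("group.id", "cloudweb-dl4j")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("alex1"))

    while (true) {
      val records = consumer.poll(Duration.ofMillis(500)).iterator()
      while (records.hasNext) {
        val record = records.next()
        val score = scoreWithDl4j(record.value())   // hypothetical DL4J scoring call
        broadcast(score)                            // hypothetical WebSocket push; nothing is persisted
      }
    }
  }

  // Placeholders for the real DL4J model and CloudWeb's WebSocket route
  def scoreWithDl4j(tweetJson: String): String = ???
  def broadcast(message: String): Unit = ???
}
```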

Flink reads topic alex1 and computes statistics (a sketch follows this list):

  • Twitter language statistics are stored in topic twitter-flink-lang for the web frontend to subscribe to.
  • Twitter user fans (follower) statistics are stored in topic twitter-flink-fans for the web frontend to subscribe to.
  • Twitter user geo statistics are stored in topic twitter-flink-geo for the web frontend to subscribe to.
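A sketch of the language-statistics branch, assuming the universal Flink Kafka connector (version-suffixed classes such as FlinkKafkaConsumer010 may apply instead), an assumed broker address, and a hypothetical extractLang helper:

```scala
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer, FlinkKafkaProducer}

object TwitterLangStatsSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092")   // assumed broker address
    props.setProperty("group.id", "flink-lang-stats")

    // Raw tweets from topic alex1
    val tweets = env.addSource(new FlinkKafkaConsumer[String]("alex1", new SimpleStringSchema(), props))

    // Count tweets per language in a short processing-time window; this sketch
    // emits one {"lang":count} entry per language rather than one combined object
    val langCounts = tweets
      .map(json => (extractLang(json), 1))
      .keyBy(_._1)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
      .sum(1)
      .map { case (lang, count) => s"""{"$lang":$count}""" }

    // Publish the counts to topic twitter-flink-lang (convenience constructor)
    langCounts.addSink(new FlinkKafkaProducer[String]("localhost:9092", "twitter-flink-lang", new SimpleStringSchema()))

    env.execute("TwitterLangStats")
  }

  // Placeholder: pull the "lang" field out of the tweet JSON
  def extractLang(tweetJson: String): String = ???
}
```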

Data formats (a parsing sketch follows this list):

  • Twitter metadata: twitter4j.Status
  • Sentiment analysis result: ID¦Name¦Text¦NLP¦MLlib¦DeepLearning¦Latitude¦Longitude¦Profile¦Date
  • Language statistics result: {"pt":2,"ot":26,"ja":3,"en":453,"fr":12,"es":4,}
  • Fans statistics result: 200|800|500~1000|above 1000
  • Map statistics result: Latitude|Longitude|time
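For example, a consumer of twitter-result1 could split each record on the ¦ delimiter; the field order follows the format listed above, while the case class and helper names are illustrative:

```scala
// Illustrative record type mirroring the sentiment result format above
case class SentimentRecord(id: String, name: String, text: String,
                           nlp: String, mllib: String, deepLearning: String,
                           latitude: String, longitude: String,
                           profile: String, date: String)

// Parse one line from topic twitter-result1; returns None for malformed records
def parseSentiment(line: String): Option[SentimentRecord] =
  line.split('¦') match {
    case Array(id, name, text, nlp, mllib, dl, lat, lon, profile, date) =>
      Some(SentimentRecord(id, name, text, nlp, mllib, dl, lat, lon, profile, date))
    case _ => None
  }
```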

Operation & Conf

Run

  • Start Flume to collect Twitter data and transport it into Kafka (a configuration sketch is given at the end of this section).
nohup flume-ng agent -f /opt/spark-twitter/7305CloudProject/Collector/TwitterToKafka.conf -Dflume.root.logger=DEBUG,console -n a1 >> flume.log 2>&1 &
  • Start Spark Streaming to analyze tweet text sentiment using Stanford NLP and Naive Bayes.
spark-submit --class "hk.hku.spark.TweetSentimentAnalyzer" --master yarn --deploy-mode cluster --num-executors 2 --executor-memory 4g --executor-cores 4 --driver-memory 4g --conf spark.kryoserializer.buffer.max=2048 --conf spark.yarn.executor.memoryOverhead=2048 /opt/spark-twitter/7305CloudProject/StreamProcessorSpark/target/StreamProcessorSpark-jar-with-dependencies.jar
  • Start Flink
flink run /opt/spark-twitter/7305CloudProject/StreamProcessorFlink/target/StreamProcessorFlink-1.0-SNAPSHOT.jar
  • Start CloudWeb to show the results on the website.
nohup java -Xmx3072m -jar /opt/spark-twitter/7305CloudProject/CloudWeb/target/CloudWeb-1.0-SNAPSHOT.jar &
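For reference, a hypothetical sketch of what Collector/TwitterToKafka.conf could contain. The agent name a1 matches the -n flag above and the topic name alex1 comes from this README, but the source type, file path, channel sizing, and broker address are assumptions, not the project's actual configuration:

```properties
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: tail the file written by the Twitter Python API collector (assumed path)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/spark-twitter/collector/tweets.json
a1.sources.r1.channels = c1

# Channel: in-memory buffer between the collector and Kafka
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# Sink: publish every event to Kafka topic alex1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = alex1
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.channel = c1
```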