Skip to content

AlexTK2012/RealtimeTwitterAnalysis

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Realtime Twitter Stream Analysis System

Forked from my term Project for COMP7305 Cluster and Cloud Computing.

And I'll continue to finish and test my ideas here.

Environment

Teammates:

Project Structure

  • CloudWeb:
    • Show statistics data and sentiment analysis result.
  • Collector:
    • Collect real-time data by Twitter Python Api.
    • Transform data to Kafka by Flume.
  • StreamProcessorFlink:
    • Analyze statistics from different dimensions.
  • StreamProcessorSpark:
    • Train and generate Naive Bayes Model.
    • Analyze sentiment of Twitter.

Project Documents

Proposal

Prensentation PPT

Related Project

Spark-MLlib-Twitter-Sentiment-Analysis

flume_kafka_spark_solr_hive

corenlp-scala-examples

deeplearning4j

canvas-barrage

Data

Flume-> Kafka -> Spark Streaming -> Kafka  
        
              -> Flink -> Kafka
              
              -> DL4J -> Kafka

FlumeTwitter Data 搬运存储到 topic : alex1 供 Spark & Flink & DL4J 订阅。

Spark Streaming 读取topic : alex1 进行情感分析,存储结果数据到 topic : twitter-result1 供Web端订阅。

Cloud Web DL4J 读取topic : alex1 进行 dl4j 情感分析,结果数据不存储,直接吐到WebSocket监听的路由里,供Web端订阅。

Flink 读取topic : alex1 进行数据统计分析,

  • twitter 语言统计结果存储到topic : twitter-flink-lang 供Web端订阅。
  • twitter 用户fans统计结果存储到topic : twitter-flink-fans 供Web端订阅。
  • twitter 用户geo统计结果存储到topic : twitter-flink-geo 供Web端订阅。

数据格式:

  • Twitter 元数据 twitter4j.Status
  • 情感分析结果 ID¦Name¦Text¦NLP¦MLlib¦DeepLearning¦Latitude¦Longitude¦Profile¦Date
  • Lang 统计结果 {"pt":2,"ot":26,"ja":3,"en":453,"fr":12,"es":4,}
  • Fans 统计结果 200|800|500~1000|above 1000
  • map 统计结果 Latitude|Longitude|time

Operation & Conf

Run

  • Start Flume to collect twitter data and transport into Kafka.
nohup flume-ng agent -f /opt/spark-twitter/7305CloudProject/Collector/TwitterToKafka.conf -Dflume.root.logger=DEBUG,console -n a1 >> flume.log 2>&1 &
  • Start Spark Streaming to analysis twitter text sentiment using stanford nlp & naive bayes.
spark-submit --class "hk.hku.spark.TweetSentimentAnalyzer" --master yarn --deploy-mode cluster --num-executors 2 --executor-memory 4g --executor-cores 4 --driver-memory 4g --conf spark.kryoserializer.buffer.max=2048 --conf spark.yarn.executor.memoryOverhead=2048 /opt/spark-twitter/7305CloudProject/StreamProcessorSpark/target/StreamProcessorSpark-jar-with-dependencies.jar
  • Start Flink
flink run /opt/spark-twitter/7305CloudProject/StreamProcessorFlink/target/StreamProcessorFlink-1.0-SNAPSHOT.jar
  • Start CloudWeb to show the result on the website.
nohup java -Xmx3072m -jar /opt/spark-twitter/7305CloudProject/CloudWeb/target/CloudWeb-1.0-SNAPSHOT.jar & 

About

Realtime Twitter Stream Analysis System

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • JavaScript 92.8%
  • Java 3.7%
  • Scala 2.0%
  • HTML 0.9%
  • CSS 0.5%
  • Python 0.1%