A modern dashboard to explore and visualize tweets related to a topic of interest.
Why use this dashboard | What questions can be answered | Installation and Configuration | Example MySQL queries | Resources |
If you're studying social networks, chances are you might want to collect and visualize tweets related to a certain topic. Fortunately, CSMaP has access to Decahose, a 10% sample of all tweets!
With minimal configuration, this repository:
- Loads tweets matching one or more given keywords from Decahose into a local MySQL database
- Sets up a Superset dashboard
- Schedules a background job to load new tweets every 24 hours
- Lets users explore tweets and create dashboards
What questions can be answered
- How many tweets are related to a topic (e.g. BLM, vaccine, election...)?
- Who is sharing those tweets, and when and where do they post?
- What hashtags/URLs are co-shared?
- Is the topic gaining or losing popularity? What is the trend?
- ...
Since new tweets are added every day, the system also schedules a background job that loads new tweets every 24 hours. The overall system architecture is shown below.
Installation and Configuration
Step 1
- Log in to HPC, then run
cd /home/$USER
git clone git@github.com:SMAPPNYU/internal-dashboard.git
cd /home/$USER/internal-dashboard
(do not rename the directory)
- Modify config.json: change KEYWORD or TABLE_NAME. Leave other configurations as they are. Note: KEYWORD needs at least 2 words.
- Run
./init.sh config.json
(do not forget to type the first dot)
- We are ready to create our first dashboard (go to Step 2-a)
Tip: if you want to track other keywords, please create another config file (e.g. config-vaccine.json) and run
./init.sh config-vaccine.json
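The tip above can also be scripted. The following is a minimal sketch, assuming config.json is a flat JSON object that contains the KEYWORD and TABLE_NAME settings named above. The flat layout, the helper names `derive_config` and `write_config`, and the reading of "2 words" as whitespace-separated words are assumptions for illustration, not part of the repository.

```python
import json
from pathlib import Path


def derive_config(base: dict, keyword: str, table_name: str) -> dict:
    """Return a copy of `base` with only KEYWORD and TABLE_NAME replaced."""
    # The README notes that KEYWORD needs at least 2 words; counting
    # whitespace-separated words is an assumption of this sketch.
    if len(keyword.split()) < 2:
        raise ValueError("KEYWORD needs at least 2 words")
    derived = dict(base)  # every other setting is left as it is
    derived["KEYWORD"] = keyword
    derived["TABLE_NAME"] = table_name
    return derived


def write_config(src: Path, dst: Path, keyword: str, table_name: str) -> None:
    """Read the config at `src` and write the derived config to `dst`."""
    base = json.loads(src.read_text())
    derived = derive_config(base, keyword, table_name)
    dst.write_text(json.dumps(derived, indent=2))
```

For example, calling `write_config(Path("config.json"), Path("config-vaccine.json"), "covid vaccine", "vaccine_tweet")` inside the cloned directory (with hypothetical keyword and table values) would produce a second config file that can then be passed to ./init.sh.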
Step 2-a
- Inside the HPC log-in node, run:
source $HOME/.bash_profile; cd /home/$USER/internal-dashboard && ./daily_update.sh config.json
- Wait until the script finishes, then follow the instructions printed on the console (stdout)
- After re-connecting to HPC, proceed to Step 3
Step 2-b
Running ./init.sh config.json schedules a background job that loads new tweets and hosts a new dashboard on a new IP address every 24 hours.
- To connect to a dashboard from a running job, find the hostname YOUR_HOSTNAME via
cat /home/$USER/decahose_visualization_setup/latest_hostname.txt
- Log out of HPC
exit
- Log in to HPC again with port forwarding (change YOUR_HOSTNAME and YOUR_NETID)
ssh -L 8088:YOUR_HOSTNAME:8088 YOUR_NETID@log-1.nyu.cluster
- Proceed to Step 3
Tip: Inside the same HPC log-in node as Step 1, run crontab -e to edit the scheduled job. You can delete the scheduled job if you prefer.
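Because the dashboard moves to a new host every 24 hours, the port-forwarding command has to be rebuilt each time. A minimal sketch of assembling it from the hostname and NetID is shown here; the helper name `port_forward_command` is hypothetical, and the login host and port 8088 are taken from the command in the steps above.

```python
import shlex


def port_forward_command(hostname: str, netid: str, port: int = 8088) -> str:
    """Build the ssh command that forwards a local port to the dashboard host."""
    hostname = hostname.strip()  # file contents usually end with a newline
    if not hostname:
        raise ValueError("hostname is empty")
    forward = f"{port}:{hostname}:{port}"   # local_port:remote_host:remote_port
    target = f"{netid}@log-1.nyu.cluster"   # login node used in the steps above
    return shlex.join(["ssh", "-L", forward, target])
```

The hostname argument would be the contents of latest_hostname.txt, read while still logged in to HPC; the resulting command is then run from your own machine after logging out.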
Step 3
- Open a browser and visit http://localhost:8088/
- Enter the default username (admin) and password (admin). Update the password for enhanced security.
- Go to http://localhost:8088/databaseview/list/, click "+DATABASE", and connect with the following string
mysql+pymysql://csmap_user:csmap@localhost:3306/tweet?read_default_file=~/.my.cnf
- Once connected to the database, add a dataset
- Navigate to the Chart page to create 20+ types of visualizations: http://localhost:8088/chart/list/
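The connection string used in the steps above follows the SQLAlchemy URL layout of dialect+driver://user:password@host:port/database?options. As a rough illustration, the sketch below splits it into its parts using only the Python standard library. The helper name `describe_uri` is hypothetical, and the reading of `read_default_file` as a driver option pointing at a MySQL option file is our understanding of PyMySQL rather than something stated in this README.

```python
from urllib.parse import parse_qs, urlsplit

URI = "mysql+pymysql://csmap_user:csmap@localhost:3306/tweet?read_default_file=~/.my.cnf"


def describe_uri(uri: str) -> dict:
    """Split a SQLAlchemy-style database URL into named components."""
    parts = urlsplit(uri)
    # The scheme has the form "dialect+driver", e.g. "mysql+pymysql".
    dialect, _, driver = parts.scheme.partition("+")
    return {
        "dialect": dialect,
        "driver": driver,
        "username": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
        # Query parameters are passed on to the driver as connection options.
        "options": {k: v[0] for k, v in parse_qs(parts.query).items()},
    }
```

Here the database is tweet on the local MySQL server at port 3306, accessed as csmap_user through the pymysql driver.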
Example MySQL queries
The following MySQL queries can help you get started.
- Count the number of tweets per day
SELECT yymmdd, COUNT(*) AS tweet_count
FROM covid_tweet
GROUP BY yymmdd
ORDER BY yymmdd;
- Top 10 tweets from users with the highest number of followers
SELECT text, user__followers_count, user__screen_name, user__name
FROM covid_tweet
WHERE user__followers_count > 10000
ORDER BY user__followers_count DESC
LIMIT 10;
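The per-day count can also be used to see whether a topic is gaining or losing popularity. The sketch below computes the day-over-day change in tweet counts. To stay self-contained it runs against an in-memory SQLite table that stands in for the MySQL table covid_tweet; the sample rows, the sortable string format of yymmdd, and the helper names are assumptions for illustration only.

```python
import sqlite3

DAILY_COUNT_SQL = """
SELECT yymmdd, COUNT(*) AS tweet_count
FROM covid_tweet
GROUP BY yymmdd
ORDER BY yymmdd
"""


def daily_changes(conn: sqlite3.Connection) -> list:
    """Return (day, count, change versus previous day) for each day after the first."""
    rows = conn.execute(DAILY_COUNT_SQL).fetchall()
    changes = []
    # Pair each day with the one before it to get the difference in counts.
    for (_, prev_count), (day, count) in zip(rows, rows[1:]):
        changes.append((day, count, count - prev_count))
    return changes


def demo_connection() -> sqlite3.Connection:
    """Build a tiny stand-in table with made-up rows (not real Decahose data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE covid_tweet (text TEXT, yymmdd TEXT)")
    sample = [
        ("a", "210101"), ("b", "210101"),
        ("c", "210102"), ("d", "210102"), ("e", "210102"),
        ("f", "210103"),
    ]
    conn.executemany("INSERT INTO covid_tweet VALUES (?, ?)", sample)
    return conn
```

A positive change means more tweets than the previous day and a negative change means fewer. Against the real database, the same SQL could be pasted into Superset; a Python MySQL driver would need a cursor instead of the `conn.execute` shortcut, which is specific to sqlite3.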
Note: If you encounter any problems during setup, please contact Zhouhan (email: zc1245@nyu.edu).