-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathVisualising Demographic Changes _Singapore-Shivam.Rmd
297 lines (197 loc) · 14.6 KB
/
Visualising Demographic Changes _Singapore-Shivam.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
---
title: "***Demographic structure of Singapore population***"
subtitle: 'ISSS608 Visual Analytics and Applications: DataViz Makeover 8'
author: "Shreyansh Shivam"
date: "12-Mar-2020"
output:
html_document:
theme: flatly
highlight: tango
code_folding: show
df_print: paged
toc: true
toc_depth: 2
toc_float: true
---
[![](Data/linkedin.png)](https://www.linkedin.com/in/shreyansh-shivam/) | [![](Data/Tableau.png)](https://public.tableau.com/profile/shreyansh.shivam#!/)
# 1.0 Introduction
In this DataViz makeover, we are required to design a static data visualisation to reveal the demographic structure of Singapore population by age cohort (i.e. 0-4, 5-9,……) and by planning area for the year 2019.
We are using the dataset provided by Department of Statistics, Singapore with details of the population trends. To do the makeover we are going to build the data visualisation in R.
# 2.0 Data and design challenges faced
**Challenge1:** The first challenge is to prepare the data in the form where we could conduct analysis in R. The data provided is in a CSV file, and since we need to use R to build the visuals, we need to prepare the data accordingly. As the task required to provide visualisation to reveal the demographic structure of Singapore population by age cohort for different planning area in year 2019, we need to filter the data the year 2019 only, as the CSV file has data for the years 2011 to 2019.
**Challenge2:** The data provided did not have total of the population for each planning area. We had to explicitely code to get the Sum of the Population for each planning area so that we can get a clear idea of the demographic structure of Singapore population by age cohort for that planning area.
**Challenge3:** The data has population details with respect to the Age-Group and gender for each planning area. Therefore it is difficult to show the age group and gender distribution for their respective planning areas. In the absence of interactive tool tips like in visualisation softwares like Tableau, it become challenging to build an intutive visualisation to illustrate all the details.
# 3.0 Ways to overcome the challenges identified
The above mentioned challenges can be overcome by making using of relevant packages which is readily available in CRAN. Inorder to overcome the first challenge we have used Data Manipulation packages like dplyr which helps to filter the data easily.
In the data provided the individual sum is missing for the population which we are getting by using the aggregate function.
Finally to show the data using appropriate visuals, we make use of the "GGPLOT" package and inorder to make in interactive we have used "plotly". To represent the demographic structure of the Singapore population by age cohort and by planning area, the ideal visual would be a heatmap. Similary we can get and idea of the Age-sex ratio with the help of a Gender Pyramid.
***HeatMap***
Heatmaps helps to visualise data through variations in colouring.Heatmaps are good for showing variance across multiple variables, revealing any patterns, displaying whether any variables are similar to each other, and for detecting if any correlations exist in-between them.
![Sketch of Heat Map](Images/HeatMap.jpg)
Advantage of Heat Map- Providing visual paths to understanding numeric values and give a hollistic view of the spread of the population with a single visual.
***GenderPyramid***
The advantage of age sex pyramid shows the distribution of each gender for each of Age Group.
![Sketch of Gender Pyramid](Images/Gender.jpg)
# 4.0 Preparing the Visualisation
In this section, we shall describe the steps performed inorder to generate the proposed visuals. We are required to start a new R project, and to create a new R Markdown document.
## 4.1 Installing and Launching R Packages
This code chunk installs the necessary R packages and loads them into R Studio Enviornment without having to explicitly load them every time.
```{r echo=TRUE, eval=TRUE, message=FALSE}
#Code for checking if the packages are installed or not
packages <-c ('tidyverse','ggridges','aggregation','plotly','heatmaply','reshape2','plyr','dendextend')
for (p in packages){
if(!require(p, character.only=T)){
install.packages(p)
}
library(p,character.only = T)
}
```
## 4.2 Importing Data and Preparing the data det
**Importing the Data:**In this assignment, the data of Singapore population by age cohort report for each planning area from 2011 to 2019, provided by Department of Statistics, Singapore will be used. The data set is downloaded from the SingStats website. The original data set is in CSV format with the file name as "respopagesextod2011to2019". In the code chunk below, read_csv() of readr is used to import the CSV file into R and parsed it into tibble R data frame format.
```{r}
data <- read_csv("data/respopagesextod2011to2019.csv") # reading the Input CSV file
```
**Getting population for the Year 2019:** The dataset has details of the population from 2011 to 2019. Howvever as per the assignment's scope we need to reveal the demographic structure of Singapore population for the year 2019 only. Therefore we filter the data for the year 2019 using the below chunk of codes-
```{r warning=FALSE}
data2019<-data %>%
select(PA,SZ,AG,Sex,TOD,Pop,Time) %>%
filter( Time == "2019")
```
**Aggregating the population for each Planning Area:** In order to make a heat map for visualisation, we need to get the details of sum of the total population for each of the planning area for the year 2019. We use the below codes to get the desired output-
```{r warning=FALSE}
Planning_Area<- data2019$PA
Age_Group<- data2019$AG
Pops<- data2019$Pop
data_2019new = aggregate(Pops, by=list(Planning_Area,Age_Group), FUN=sum) #getting the sum of the population
names(data_2019new)<-c("PArea","AgeGroup","Population") # Renaming the Column Header with relavant names"
```
Now we are pivoting the population values for each of the corresponding age group using the below code-
```{r echo=TRUE, eval=TRUE, message=FALSE}
df1<-data_2019new %>%
pivot_wider(names_from = AgeGroup, values_from = Population)
```
Next, we need to change the rows by Planning Area name instead of row number by using the code chunk below-
```{r warning=FALSE}
row.names(df1) <- df1$PArea
```
**Transforming the data into matrix:** The data was loaded into a data frame, but it has to be a data matrix to make your heatmap.The code chunk below will be used to transform wh data frame into a data matrix.
```{r warning=FALSE}
df1<-select(df1,-PArea)
wh_matrix <- data.matrix(df1)
```
## 4.3 Making the Static Heatmap
To build the heatmap we can use the built-in R heatmap() function.To plot a cluster heatmap, we just have to use the default as shown in the code chunk below.
```{r warning=FALSE}
wh_heatmap <- heatmap(wh_matrix)
```
The heatmap do a reordering using clusterisation: it calculates the distance between each pair of rows and columns and try to order them by similarity. Moreover, the corresponding dendrogram are provided beside the heatmap. The white area highlights that there is no population for regions such as Central Water Catchment, Changi Bay,Westlands etc. As the colour becomes darker it shows the concentartion of Population.
***Limitations of Static Heatmap***: The static Heatmap is not very intuitive and cannot reveal details at very mico-level such as what is the exact population of the age group in that particular planning area. We can overcome by using a dynamic Heatmap using a package called heatmaply.
## 4.4 Making the Interactive Heatmap
Interactive 'heatmaps' allow the inspection of specific value by hovering the mouse over a cell, as well as zooming into a region of the 'heatmap' by dragging a rectangle around the relevant area. This work is based on the 'ggplot2' and 'plotly.js' engine. It produces similar 'heatmaps' as 'heatmap.2' or 'd3heatmap', with the advantage of speed, the ability to zoom from the 'dendrogram' panes, and the placing of factor variables in the sides of the 'heatmap'.
In order to use the heatmaply package to build the heat map we need to do the following steps-
***Data trasformation***:When analysing multivariate data set,in order to ensure that all the variables have comparable values, data transformation are commonly used before clustering.
Three main data transformation methods are supported by heatmaply(), namely: scale, normalise and percentilse.
***Clustering algorithm***: Heatmaply supports a variety of hierarchical clustering algorithm. The main arguments provided are:distfun,hclustfun,dist_method,hclust_method.
***Statistical approach to find the best Clustering method***: The best clustering method and number of cluster can be found out by using the dend_expend() and find_k() functions of dendextend package.
First, using the dend_expend() we determine the recommended clustering method-
```{r message=FALSE}
wh_d <- dist(normalize(wh_matrix), method = "euclidean")
dend_expend(wh_d)[[3]]
```
The output table shows that “mcquitty” method should be used because it gave the high optimum value.
Next, find_k() is used to determine the optimal number of cluster.
```{r message=FALSE}
wh_clust <- hclust(wh_d, method = "mcquitty")
num_k <- find_k(wh_clust)
plot(num_k)
```
From the above plot we can see that 6 is the most optimal number of clusters.
***The final output***
```{r warning=FALSE}
heatmaply(normalize(wh_matrix),
Colv=NA,
seriate = "none",
colors = viridis(900),
k_row = 6,
margins = c(NA,200,60,NA),
fontsize_row = 4,
fontsize_col = 4,
main="Distribution of Singapre Population \n By Age Group and Planning Area, 2019",
xlab = "Age Group\n Data Source: Sing Stat",
ylab = "Planning Area"
)
```
***Interpreation from the above Heat Map***: From the Heat Map, the area which is having a yellow shades indicates higher population while the areas which are shaded by purple has minimal population.
## 4.5 Making the Gender Pyramid
To Build the gender pyramid we first retain the data which is relevant for constructing the Gender Pyramid. There we retain the Age, Gender and Population column and drop the remaining field.
```{r warning=FALSE}
names(data2019)
datapy<-select(data2019,-PA,-SZ,-TOD,-Time) # Dropping the uwnated columns
names(datapy)<-c("Age","Gender","Pop")
```
In the next step we aggregate the population with respect to Age and Gender.
```{r warning=FALSE}
a <- datapy$Age
b <- datapy$Gender
c <- datapy$Pop
df2 = aggregate(c, by=list(a,b), FUN=sum)
names(df2)<-c("Age","Gender","Pop")
```
Now we split the Gender into separate columns into male and Female using the pivot function-
```{r warning=FALSE}
df<-df2 %>%
pivot_wider(names_from = Gender, values_from = Pop)
names(df) <- c("Age", "Male", "Female")
```
Now after getting the sum of population of male and female for each age group we again aggregate into Population, Gender and Age Group-
```{r warning=FALSE}
names(df) <- c("Age", "Male", "Female")
cols <- 2:3
df[,cols] <- apply(df[,cols], 2, function(x) as.numeric(as.character(gsub(",", "", x))))
df <- df[df$Age != 'Total', ]
df$Male <- -1 * df$Male
df$Age <- factor(df$Age, levels = df$Age, labels = df$Age)
df.melt <- melt(df,
value.name='Population',
variable.name = 'Gender',
id.vars='Age' )
df4<-df.melt
```
Now we have the data in the format for making the gender pyramid using GGPLOT package
```{r warning=FALSE}
gp <- ggplot(df4, aes(x = Age, y = Population, fill = Gender)) +
geom_bar(subset = .(Gender == "Female"), stat = "identity") +
geom_bar(subset = .(Gender == "Male"), stat = "identity") +
scale_y_continuous(breaks = seq(-15000000, 15000000, 5000000),
labels = paste0(as.character(c(seq(15, 0, -5), seq(5, 15, 5))), "m")) +
coord_flip() +
scale_fill_brewer(palette = "Set1") +
theme_bw()
gp1<-gp + scale_fill_manual(values = c("steelblue1","hotpink"))+labs(
x="Age Group",
y = "Population",
title = "Singapore's Gender Pyramid Structure, 2019",
subtitle = "The Male Population for the Active Age group seems higher than female",
caption = "Data Source: Singstat.com"
)
p <- ggplotly(gp1)
p
```
# 5.0 Conclusion
***Inferences from the Heat Map***:
1) From the Heat Map it is clear that the most of the Young Popluation is residing at Sengkang, Jurong West and Woodlands as the Yellow shades indicates denser population.
2) Likewise Bedok,Hougang and Tampines have a denser population with Bedok having a higher population of older Age Group.
***Inferences from the Gender Pyramid***:
1) From the gender pyramid we can see that the Male gender is slightly higher than the female gender across most of the age groups.
2) We can also see that the distribution of age-sex ratio is inclined towards middle Active group that is from 45-65 age group bracket.
# 6.0 Major advantages of building data visualisation in R as compared to Tableau
The following are the major advantages of Building Visuals in R over Tableau-
***Advantage 1: Ease of Making Visuals***: R has amazing data wrangling libraries such as ggplot2 , dplyr , and tidyr as a part of tidyverse package. These libraries have revolutionized how data manipulation and visualization is done. Code is easy to read, easy to write and usually works flawlessly. Most of the graphs/ visuals can be easily constructed with few lines of code. Few Graphics such as Ternary plot requires a lot of steps in Tableau whereas it can easily be implemented in R by a few lines of codes.
***Advantage 2: Open Source and ease of doing Statistical Analysis***:It’s the industry standard for statistics and data mining. It is faster and easier to identify patterns and build practical models using R.Moreover it is open source and most of the important packages are free unlike tableau which needs a paid license.
***Advantage 3: Ease of Making reports***:R Markdown provides an authoring framework for data science. We can use a single R Markdown file to both save and execute code and generate high quality reports that can be shared with an audience.Whereas in Tableau we can make dashboards but report making is not quite feasible as in the case with R.
# 7.0 References
In order to complete the assignement, the following websites have been used for references-
***https://rpubs.com/tskam/heatmap***
***https://www.datanovia.com/en/lessons/heatmap-in-r-static-and-interactive-visualization/#r-base-heatmap-heatmap***
***https://cran.r-project.org/web/packages/heatmaply/vignettes/heatmaply.html#changing-color-palettes***
***https://rpubs.com/walkerke/pyramids_ggplot2***