---
title: "17th Lok Sabha - Winter Session Question Analysis"
author: "Lakshya Agarwal"
date: "23/12/2019"
output:
  html_document:
    highlight: zenburn
    theme: paper
    toc: true
    toc_float: true
    toc_depth: 4
    number_sections: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
This document serves as an example of analysing the questions that were asked in the Winter Session of the *17th Lok Sabha.*
The Lok Sabha or House of the People is the lower house of India's bicameral Parliament.
*This Winter Session ran from 18th November, 2019 to 13th December, 2019.*
## Importing libraries
First, we import the necessary libraries.
```{r library, warning=FALSE, message=FALSE}
library(tidyverse)
library(lubridate)
library(dplyr)
library(bbplot)
library(ggthemes)
library(RColorBrewer)
library(ggwordcloud)
library(tidytext)
library(knitr)
library(kableExtra)
```
```{r, include=FALSE}
windowsFonts(Helvetica = "Product Sans")
```
`tidyverse` and its associated libraries are used to leverage the power of tidy data.
`bbplot` by BBC is a package that will be used to create `ggplot2` charts.
# Working with the data
## Reading the dataset
```{r}
questions <- read.csv('Winter_LokSabha17Questions.csv')
kable(sample_n(questions, 10)) %>% kable_styling(bootstrap_options = c('striped')) # 10 random entries from the dataset
str(questions)
```
As can be seen from the above output, the dataset is messy. More specifically,
* The column names need to be renamed to facilitate repeated usage.
* The `Date` column is not stored as a date. Moreover, we need to keep only the questions asked during the Winter Session.
* The `Q. Type` column contains a lot of unnecessary text.
## Cleaning the dataset
We now turn to cleaning the dataset.
```{r}
questions <- questions %>% rename('Question Number' = "Q.NO.", "Type" = "Q.Type") %>%
mutate(Type = (str_replace(Type, pattern = "(PDF).*", replacement = "")) %>% str_trim(),
Date = as.Date(Date, format = "%d.%m.%Y"),
Ministry = str_trim(Ministry)) %>%
# Filtering the winter session questions only
filter(Date >= as.Date("2019-11-18")) %>%
# Adding actual link
mutate(Link = ifelse(Type == "STARRED",
paste0("http://164.100.24.220/loksabhaquestions/annex/172/AS",
`Question Number`,
'.pdf'),
paste0("http://164.100.24.220/loksabhaquestions/annex/172/AU",
`Question Number`,
'.pdf')))
kable(sample_n(questions, 10)) %>% kable_styling(bootstrap_options = c('striped')) # 10 random entries from the dataset
```
Note that I have also added a link to the actual question and the subsequent answer to that question. The files on the Lok Sabha server follow a pattern, which makes the task very simple.
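To make the two less obvious steps concrete, here is a minimal sketch of the date parsing and the link construction on a toy table. The question numbers and the `toy`, `toy_clean` and `base_url` names are made up for illustration; only the URL prefix and the `AS`/`AU` convention come from the cleaning code above.

```r
library(dplyr)

# Hypothetical rows standing in for the real dataset
toy <- tibble(
  `Question Number` = c(12, 345),
  Type = c("STARRED", "UNSTARRED"),
  Date = c("18.11.2019", "13.12.2019")
)

base_url <- "http://164.100.24.220/loksabhaquestions/annex/172/"

toy_clean <- toy %>%
  mutate(
    # "%d.%m.%Y" matches day.month.year strings such as "18.11.2019"
    Date = as.Date(Date, format = "%d.%m.%Y"),
    # Starred questions get the "AS" prefix, all others get "AU"
    Link = paste0(base_url,
                  if_else(Type == "STARRED", "AS", "AU"),
                  `Question Number`,
                  ".pdf")
  )

toy_clean$Link
```

Building the prefix with `if_else()` and calling `paste0()` once is just a more compact way of writing the same rule as the two `paste0()` calls above.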
# Data analysis and visualization
## Number of questions to Ministries
Let us see which ministry was asked the most questions in this session.
```{r eval=FALSE}
# Creating the dataset
ministry <- questions %>% group_by(Ministry) %>%
summarise(Count = n()) %>%
arrange(desc(Count)) %>%
mutate(Ministry = factor(Ministry, levels = rev(Ministry)))
# Creating the plot
(
ggplot(ministry, aes(x = Ministry, y = Count, fill = Ministry)) + geom_bar(stat = 'identity') +
geom_hline(yintercept = 0, size = 1, colour="#333333") +
scale_fill_manual(values = rev(colorRampPalette(brewer.pal(8, "Dark2"))(length(ministry$Ministry)))) +
bbc_style() +
coord_flip() +
theme(legend.position = "none",
axis.title = element_text(size = 18),
panel.grid.major.x = element_line(color="#cbcbcb"),
panel.grid.major.y=element_blank()) +
labs(title = "Questions asked by each ministry",
subtitle = "Winter Session of 17th Lok Sabha",
y = "Number of questions") +
geom_label(aes(x = Ministry, y = Count, label = Count),
hjust = 1,
vjust = 0.5,
colour = "white",
fill = NA,
label.size = NA,
family="Helvetica",
size = 6)
) %>% finalise_plot(source_name = "Data: Lok Sabha; Created by Lakshya Agarwal",
save_filepath = 'graphs/QuestionsByMinistry.jpg',
width = 1920, height = 1080)
```
![](graphs/ministry.jpg)
That is a lot of code for a single graph. We shall go over it block by block.
### Understanding the code
```{r}
# Creating the dataset
ministry <- questions %>% group_by(Ministry) %>%
summarise(Count = n()) %>%
arrange(desc(Count)) %>%
mutate(Ministry = factor(Ministry, levels = rev(Ministry)))
```
First, I create a new dataframe that consists of the number of questions asked to each ministry, and transform the **Ministry** column into a factor.
The process is:
1. Group the dataset by **Ministry** - using `group_by()`
2. `summarise` to get the number of occurrences - using `n()`
3. Sort the dataframe in descending order of **Count** - using `arrange()`
4. Convert **Ministry** into a factor - using `mutate()`
This gives us the following output:
```{r echo=FALSE}
kable(head(ministry, 5)) %>% kable_styling(bootstrap_options = c('striped'))
```
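The factor step is the one that is easy to overlook, so here is a small sketch on made-up data (the `toy` and `toy_counts` names and the ministry values are assumptions for illustration). It uses `count()` as a shorthand for the group, summarise and arrange steps, then reverses the levels exactly as above.

```r
library(dplyr)

# Hypothetical data: one row per question
toy <- tibble(Ministry = c("RAILWAYS", "FINANCE", "RAILWAYS",
                           "HEALTH", "RAILWAYS", "FINANCE"))

toy_counts <- toy %>%
  # count(..., sort = TRUE) is shorthand for
  # group_by() + summarise(n()) + arrange(desc())
  count(Ministry, name = "Count", sort = TRUE) %>%
  # Reverse the sorted names so the largest count becomes the LAST level
  mutate(Ministry = factor(Ministry, levels = rev(Ministry)))

levels(toy_counts$Ministry)
```

Because `coord_flip()` draws the first factor level at the bottom of the chart, putting the largest ministry last in the levels places its bar at the top.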
Next, we look at the code for creating the plot.
```{r eval=FALSE}
# Creating the plot
ggplot(ministry, aes(x = Ministry, y = Count, fill = Ministry)) + geom_bar(stat = 'identity') +
geom_hline(yintercept = 0, size = 1, colour="#333333") +
scale_fill_manual(values = rev(colorRampPalette(brewer.pal(8, "Dark2"))(length(ministry$Ministry)))) +
bbc_style() +
coord_flip() +
# Removing legend, showing axis labels and adding the title
theme(legend.position = "none",
axis.title = element_text(size = 18),
panel.grid.major.x = element_line(color="#cbcbcb"),
panel.grid.major.y=element_blank()) +
labs(title = "Questions asked by each ministry",
subtitle = "Winter Session of 17th Lok Sabha",
y = "Number of questions") +
# Showing the number of questions in the graph
geom_label(aes(x = Ministry, y = Count, label = Count),
hjust = 1,
vjust = 0.5,
colour = "white",
fill = NA,
label.size = NA,
family="Helvetica",
size = 6)
```
The first section is the basic `ggplot2` code for creating a horizontal bar chart, with manual colors. Of particular importance here is: **`bbc_style()`**
`bbc_style()` (and subsequently, `finalise_plot()`) are functions from `bbplot` that make the chart components follow BBC style, while allowing room for further manual customization. For more information, visit the **[bbplot GitHub repo](https://github.com/bbc/bbplot)**.
In the next two sections, I remove the legend, show the axis labels, and add a title and subtitle to the plot. Post that, I add the number of questions as a label in the chart to facilitate easy comprehension.
`finalise_plot()` simply packages the graphic, adds a footnote and resizes it - producing an image - `QuestionsByMinistry.jpg`
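The manual colours deserve a short note: `Dark2` only provides 8 colours, so the plot code stretches it with `colorRampPalette()` to get one colour per ministry. A minimal sketch of that idea, where the target of 20 colours and the variable names are arbitrary choices for illustration:

```r
library(RColorBrewer)

base_colours <- brewer.pal(8, "Dark2")          # the 8 colours Dark2 provides
make_palette <- colorRampPalette(base_colours)  # returns a function of n
many_colours <- make_palette(20)                # 20 interpolated hex colours

length(many_colours)
```

The interpolated colours are blends of neighbouring `Dark2` colours, so adjacent bars can look similar when there are many ministries.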
### Understanding the graphic
![](graphs/ministry.jpg)
Now that we have understood how to create the above graphic, let us take a moment to interpret it.
**Health and Family Welfare** leads the race with 332 questions asked to it, with **Railways** trailing at 288. <br><br>It seems that most of the core ministries, such as **Human Resource Development, Environment, Finance, Road Transport and Highways**, were asked questions in this session, which does indicate that the House debated some pertinent topics. However, a closer analysis of the subjects of such questions is required.
## Weekly ranking of Ministries
Another interesting way to look at the performance of the Session would be to see whether there was a *shift in focus* of Lok Sabha Members from asking questions to one Ministry to another during the Session.
We can visualise this through a **bump chart**, which plots the ranking of entities over time. The focus here is usually on comparing the position or performance of multiple observations with respect to each other rather than the actual values themselves ([from R-bloggers](https://www.r-bloggers.com/bump-chart/)).
```{r eval = FALSE}
# Has there been a shift in focus from one ministry to another (by week)?
# Creating the dataset
shift_data <- questions %>% group_by(Week = floor_date(Date, 'week'), Ministry) %>%
summarise(count = n()) %>%
top_n(5, wt = count) %>%
group_by(Week) %>%
arrange(Week, desc(count)) %>%
mutate(rank = row_number()) %>%
filter(rank <= 5) %>%
ungroup()
# Creating the plot
(
ggplot(shift_data, aes(x=Week, y=rank, group = Ministry)) +
geom_line(aes(color = Ministry), size = 2) +
geom_point(aes(color = Ministry), size = 5) +
scale_y_reverse(breaks = 1:5) +
bbc_style() +
theme(legend.position = 'right',
axis.title = element_text(size = 18),
panel.grid.major.y = element_line(color="#cbcbcb"),
panel.grid.major.x=element_blank()) +
labs(title = "Top 5 ministries by number of questions",
subtitle = "Winter Session of 17th Lok Sabha",
y = "Rank",
x = "Week")
) %>% finalise_plot(source_name = "Data: Lok Sabha; Created by Lakshya Agarwal",
save_filepath = 'graphs/RankingMinistry.jpg',
width = 1600, height = 900)
```
![](graphs/rank.jpg)
Let's go over the code block by block again.
### Understanding the code
```{r}
# Creating the dataset
shift_data <- questions %>% group_by(Week = floor_date(Date, 'week'), Ministry) %>%
summarise(count = n()) %>%
top_n(5, wt = count) %>%
group_by(Week) %>%
arrange(Week, desc(count)) %>%
mutate(rank = row_number()) %>%
filter(rank <= 5) %>%
ungroup()
```
Here, we want to create rankings for each ministry based on the number of questions asked to them in each week. So, our workflow would be to convert the data from daily to weekly, count the number of questions, assign ranks and plot it. To create the dataset, we follow this process:
1. Group the dataset by **Week** and **Ministry** - using `group_by()` and [`floor_date()`](https://lubridate.tidyverse.org/reference/round_date.html)
2. `summarise` to get the count of number of questions - using `n()`
3. Keep only the **top 5** ministries in each week - using `top_n()`
4. Sort by week and, within each week, by descending number of questions - using `arrange()`
5. Add a **rank** column for each week and ministry - using `mutate()` and `row_number()`
6. Keep at most five ministries per week, dropping any extras that `top_n()` retained because of ties - using `filter()`
This gives us the following output:
```{r echo=FALSE}
kable(head(shift_data, 10)) %>% kable_styling(bootstrap_options = c('striped'))
```
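Two details of this pipeline are worth seeing in isolation: how `floor_date()` assigns a date to a week, and why the final `filter()` is needed after `top_n()`. The sketch below uses made-up ministries and counts, and all variable names are assumptions for illustration. I pass `week_start` explicitly so that the result does not depend on local options.

```r
library(dplyr)
library(lubridate)

# floor_date() snaps a date to the start of its week.
# 18 November 2019 (the first day of the Session) was a Monday.
sunday_week <- floor_date(as.Date("2019-11-18"), "week", week_start = 7)
monday_week <- floor_date(as.Date("2019-11-18"), "week", week_start = 1)

# Hypothetical weekly counts with a tie for fifth place (E and F)
toy <- tibble(
  Week = as.Date("2019-11-17"),
  Ministry = c("A", "B", "C", "D", "E", "F"),
  count = c(9, 8, 7, 6, 5, 5)
)

# top_n() keeps ties, so six rows survive here instead of five
with_ties <- toy %>%
  group_by(Week) %>%
  top_n(5, wt = count)

# row_number() breaks the tie by row order; filter() then trims to five
ranked <- with_ties %>%
  arrange(Week, desc(count)) %>%
  mutate(rank = row_number()) %>%
  filter(rank <= 5) %>%
  ungroup()

nrow(with_ties)
nrow(ranked)
```

Note that which of the tied ministries survives depends only on row order, so the fifth place in a given week can be somewhat arbitrary. In recent versions of `dplyr`, `top_n()` is superseded by `slice_max()`, which could be used instead.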
Next, we look at the code for creating the plot.
```{r eval=FALSE}
# Creating the plot
ggplot(shift_data, aes(x = Week, y = rank, group = Ministry)) +
geom_line(aes(color = Ministry), size = 2) +
geom_point(aes(color = Ministry), size = 5) +
scale_y_reverse(breaks = 1:5) +
bbc_style() +
scale_color_tableau() +
# Adding legend, showing axis labels and adding the title
theme(legend.position = 'right',
axis.title = element_text(size = 18),
panel.grid.major.y = element_line(color="#cbcbcb"),
panel.grid.major.x=element_blank()) +
labs(title = "Top 5 ministries by number of questions",
subtitle = "Winter Session of 17th Lok Sabha",
y = "Rank",
x = "Week")
```
Similar to the previous plot, this one also creates a basic `ggplot2` chart, adds `bbc_style()` and [`scale_color_tableau()`](https://rdrr.io/cran/ggthemes/man/scale_color_tableau.html) and the labels and titles. *Pretty standard stuff!*
### Understanding the graphic
![](graphs/rank.jpg)
As before, let us interpret this visualisation as well.
**Agriculture and Farmers' Welfare** dominated the questions in the first week of the Session, but disappeared in the next two weeks, only to come back at *second place* in the last week. <br><br>
**Health and Family Welfare** jumped to the top spot after the first week and remained there, while **Human Resource Development** created a plateau at the 4th and 2nd place. **Road Transport and Highways, Home Affairs and Jal Shakti** made the top charts at least once. <br><br>
All in all, this visualisation seems to provide a bit more insight into how the focus shifted from one Ministry to another during the session.
## Subject of questions to Ministries
I would now like to move from analysing the Lok Sabha as a whole to looking into each ministry in depth. To this end, we can look at the ***subject of questions*** that were asked to a Ministry and gain some insights from that.
### Creating a wordcloud
To visualise textual data, let's create a wordcloud from the subject of questions asked.
```{r eval=FALSE}
# Ministry-wise - Questions wordcloud
ministry_wordcloud <- function(ministry){
# Filtering selected ministry and creating the dataset
ministry_q <- questions %>% filter(Ministry == str_to_upper(ministry)) %>%
mutate(Subject = as.character(Subject)) %>%
select(Subject) %>%
unnest_tokens(word, Subject) %>%
anti_join(get_stopwords(), by = 'word') %>%
count(word, sort = T) %>%
head(40)
# Plotting the dataset as a wordcloud
(
ggplot(ministry_q, aes(label = word, color = word, size = n)) +
geom_text_wordcloud_area(rm_outside = T, family = 'Helvetica') +
scale_size_area(max_size = 40) +
bbc_style() +
labs(title = paste0("Ministry of ", ministry, " - Subject of questions asked"),
subtitle = "Winter Session of 17th Lok Sabha")
) %>% finalise_plot(source_name = "Data: Lok Sabha; Created by Lakshya Agarwal",
save_filepath = paste0('graphs/wordcloud/', ministry, '.jpg'),
width = 960, height = 540)
}
```
The above code creates a function - `ministry_wordcloud()` - that takes in a **Ministry Name** as an input and produces a wordcloud of the subject of questions asked to that ministry. For example, for **Health and Family Welfare**, we run
```{r eval=FALSE}
ministry_wordcloud("Health and Family Welfare")
```
which gives
![](graphs/wordcloud/Health And Family Welfare.jpg)
#### Understanding the function
So, what does the function do? Let's have a look.
*Note: I use two packages here - [`tidytext`](https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html) and [`ggwordcloud`](https://lepennec.github.io/ggwordcloud/articles/ggwordcloud.html).*
##### Creating the dataset
```{r eval=FALSE}
# Creating the dataset
ministry_q <- questions %>% filter(Ministry == str_to_upper(ministry)) %>%
mutate(Subject = as.character(Subject)) %>%
select(Subject) %>%
unnest_tokens(word, Subject) %>%
anti_join(get_stopwords(), by = 'word') %>%
count(word, sort = T) %>%
head(40)
```
In order to create our dataset for the wordcloud, we need to get the **subject** of questions asked to the selected Ministry and get the number of times each word appears. The process for doing this is:
1. Filter the selected ministry - using `filter()`
2. Select the **Subject** column, after converting it into `character` - using `select()`
3. Generate a list of words from the **Subject** column - using `unnest_tokens()`
4. Remove the *[stopwords](https://en.wikipedia.org/wiki/Stop_words)* - using `anti_join()`
5. Count the number of times each word occurs and sort it in descending order - using `count()`
6. Select the top 40 words - using `head()`
This gives us the following dataset (created for **Health and Family Welfare**, trimmed to 10 words):
```{r echo=FALSE}
min_q <- questions %>% filter(Ministry == str_to_upper('Health and Family Welfare')) %>%
mutate(Subject = as.character(Subject)) %>%
select(Subject) %>%
unnest_tokens(word, Subject) %>%
anti_join(get_stopwords(), by = 'word') %>%
count(word, sort = T) %>%
head(40)
kable(head(min_q, 10)) %>% kable_styling(bootstrap_options = c("striped"))
```
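To see the tokenising and stopword steps on something small, here is a sketch with three made-up subjects (the `toy` and `toy_words` names and the subject strings are assumptions for illustration). It assumes the `stopwords` package, which `get_stopwords()` relies on, is installed.

```r
library(dplyr)
library(tidytext)

# Hypothetical question subjects
toy <- tibble(Subject = c("Shortage of Doctors in Rural Areas",
                          "Doctors and Nurses in Hospitals",
                          "Rural Hospitals"))

toy_words <- toy %>%
  unnest_tokens(word, Subject) %>%             # one lowercase word per row
  anti_join(get_stopwords(), by = "word") %>%  # drops "of", "in", "and"
  count(word, sort = TRUE)                     # most frequent words first

toy_words
```

`unnest_tokens()` lowercases the text by default, which is why words match the lowercase stopword list even though the subjects are capitalised.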
##### Creating the wordcloud
```{r eval = FALSE}
ggplot(ministry_q, aes(label = word, color = word, size = n)) +
geom_text_wordcloud_area(rm_outside = T, family = 'Helvetica') +
scale_size_area(max_size = 40) +
bbc_style() +
labs(title = paste0("Ministry of ", ministry, " - Subject of questions asked"),
subtitle = "Winter Session of 17th Lok Sabha")
```
With the dataset created, the wordcloud can be easily generated using `geom_text_wordcloud_area()`. Post that, I add the usual `bbc_style()` and title and export it through `finalise_plot()`.
### Understanding the wordcloud
Now that we have understood how the wordcloud was generated, we can turn to interpreting the wordclouds and gleaning information from them.
#### Ministry of Human Resource Development
![](graphs/wordcloud/Human Resource Development.jpg)
A majority of the questions to the [MHRD](https://en.wikipedia.org/wiki/Ministry_of_Human_Resource_Development) were focused on topics such as *education, schools, institutes, and "kendriya vidyalayas"*. This is in line with the objective of the Ministry to ensure good, affordable, quality education to the citizens of the country.
#### Ministry of Environment, Forest and Climate Change
![](graphs/wordcloud/Environment, Forests And Climate Change.jpg)
Questions to the [MoEFCC](https://en.wikipedia.org/wiki/Ministry_of_Environment,_Forest_and_Climate_Change) were mainly targeted towards topics such as *pollution, waste, forests, air and plastic*. This is in line with the extremely poor air quality during the Session and growing concerns over deforestation and plastic waste management.
### Shiny app
I have also created an application for the above visualisation that can be accessed [here](https://lakshyaag.shinyapps.io/LokSabhaQuestions).
# Conclusion
I hope this was an informative article. I had a lot of fun working with this dataset and creating visualisations to test out my understanding.\
I will upload the source code to my [GitHub](https://github.com/lakshyaag/). If you have any queries, feel free to ask me at <lakshyagrwal12@gmail.com> or send me a message on [Twitter](https://twitter.com/lakshyaag).