index.Rmd

---
title: "17th Lok Sabha - Winter Session Question Analysis"
author: "Lakshya Agarwal"
date: "23/12/2019"
output: 
    html_document:
        highlight: zenburn
        theme: paper
        toc: true
        toc_float: true
        toc_depth: 4
        number_sections: true
        
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Introduction

This document serves as an example of analysing the questions that were asked in the Winter Session of the *17th Lok Sabha.* 
The Lok Sabha or House of the People is the lower house of India's bicameral Parliament. 

*This Winter Session ran from 18th November, 2019 to 13th December, 2019.*

## Importing libraries
First, we import the necessary libraries.
```{r library, warning=FALSE, message=FALSE}
library(tidyverse)
library(lubridate)
library(dplyr)
library(bbplot)
library(ggthemes)
library(RColorBrewer)
library(ggwordcloud)
library(tidytext)
library(knitr)
library(kableExtra)
```

```{r, include=FALSE}
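# windowsFonts() exists only on Windows, so skip it on other platforms
if (.Platform$OS.type == "windows") windowsFonts(Helvetica = "Product Sans")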
```

`tidyverse` and its associated libraries are used to leverage the power of tidy data.
`bbplot`, a package by the BBC, will be used to style the `ggplot2` charts.


# Working with the data

## Reading the dataset
```{r}
questions <- read.csv('Winter_LokSabha17Questions.csv')

kable(sample_n(questions, 10)) %>% kable_styling(bootstrap_options = c('striped')) # 10 random entries from the dataset

str(questions)
```

As can be seen from the above output, the dataset is messy. More specifically,

* The column names need to be renamed to facilitate repeated usage.
* The `Date` column is not in the correct data type. Moreover, we need to filter only the questions for the Winter Session.
* The `Q. Type` column contains a lot of unnecessary text.

## Cleaning the dataset

We now turn to cleaning the dataset. 

```{r}
questions <- questions %>% rename('Question Number' = "Q.NO.", "Type" = "Q.Type") %>%
    mutate(Type = (str_replace(Type, pattern = "(PDF).*", replacement = "")) %>% str_trim(), 
           Date = as.Date(Date, format = "%d.%m.%Y"),
           Ministry = str_trim(Ministry)) %>%
  
  # Filtering the winter session questions only
    filter(Date >= as.Date("2019-11-18")) %>% 
  
  # Adding actual link
    mutate(Link = ifelse(Type == "STARRED", 
                         paste0("http://164.100.24.220/loksabhaquestions/annex/172/AS", 
                                  `Question Number`, 
                                  '.pdf'), 
                         paste0("http://164.100.24.220/loksabhaquestions/annex/172/AU", 
                                  `Question Number`, 
                                  '.pdf'))) 
    

kable(sample_n(questions, 10)) %>% kable_styling(bootstrap_options = c('striped')) # 10 random entries from the dataset
```

Note that I have also added a link to the actual question and the subsequent answer to that question. The files on the Lok Sabha server follow a pattern, which makes the task very simple.
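
As a minimal sketch of these cleaning steps on a single made-up row (the raw `Type` text, the date and the question number below are hypothetical and not taken from the dataset):

```{r eval=FALSE}
library(stringr)

# Hypothetical raw values, for illustration only
raw_type <- "STARRED PDF/WORD"
raw_date <- "18.11.2019"
q_no     <- 21

# Drop everything from "PDF" onwards, then trim the leftover whitespace
type <- str_trim(str_replace(raw_type, pattern = "(PDF).*", replacement = ""))

# Parse the day.month.year string into a Date
date <- as.Date(raw_date, format = "%d.%m.%Y")

# Starred questions use the "AS" prefix, unstarred ones use "AU"
prefix <- ifelse(type == "STARRED", "AS", "AU")
link <- paste0("http://164.100.24.220/loksabhaquestions/annex/172/", prefix, q_no, ".pdf")
```

Only the prefix differs between the two kinds of question, so the link can be built from the cleaned `Type` and the question number alone.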

# Data analysis and visualization

## Number of questions to Ministries

Let us see which ministry was asked the most questions in this session. 

```{r eval=FALSE}
# Creating the dataset
ministry <- questions %>% group_by(Ministry) %>% 
    summarise(Count = n()) %>% 
    arrange(desc(Count)) %>%
    mutate(Ministry = factor(Ministry, levels = rev(Ministry)))

# Creating the plot
(
    ggplot(ministry, aes(x = Ministry, y = Count, fill = Ministry)) + geom_bar(stat = 'identity') + 
        geom_hline(yintercept = 0, size = 1, colour="#333333") +
        scale_fill_manual(values = rev(colorRampPalette(brewer.pal(8, "Dark2"))(length(ministry$Ministry)))) +
        bbc_style() +
        coord_flip() +
    
        theme(legend.position = "none", 
                axis.title = element_text(size = 18), 
                panel.grid.major.x = element_line(color="#cbcbcb"), 
                panel.grid.major.y=element_blank()) +
    
        labs(title = "Questions asked by each ministry", 
              subtitle = "Winter Session of 17th Lok Sabha", 
              y = "Number of questions") +
    
        geom_label(aes(x = Ministry, y = Count, label = Count),
             hjust = 1, 
             vjust = 0.5, 
             colour = "white", 
             fill = NA, 
             label.size = NA, 
             family="Helvetica", 
             size = 6)


) %>% finalise_plot(source_name = "Data: Lok Sabha; Created by Lakshya Agarwal", 
                    save_filepath = 'graphs/QuestionsByMinistry.jpg', 
                    width = 1920, height = 1080)
```

![](graphs/ministry.jpg)

That is a lot of code for the graph above. We shall go over it block by block. 

### Understanding the code

```{r}
# Creating the dataset
ministry <- questions %>% group_by(Ministry) %>% 
    summarise(Count = n()) %>% 
    arrange(desc(Count)) %>%
    mutate(Ministry = factor(Ministry, levels = rev(Ministry)))
```
First, I create a new dataframe that consists of the number of questions asked to each ministry, and transform the **Ministry** column into a factor.

The process is:

1. Group the dataset by **Ministry** - using `group_by()` 
2. Count the number of occurrences of each ministry - using `summarise()` and `n()`
3. Sort the dataframe in descending order of **Count** - using `arrange()`
4. Convert **Ministry** into a factor - using `mutate()`

This gives us the following output:
```{r echo=FALSE}
kable(head(ministry, 5)) %>% kable_styling(bootstrap_options = c('striped'))
```
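
The factor levels are reversed because `coord_flip()` draws the first level at the bottom of the axis. A small sketch with made-up counts (the ministry names `A`, `B` and `C` are placeholders) shows the resulting level order:

```{r eval=FALSE}
library(dplyr)

# Made-up data, for illustration only
toy <- tibble(Ministry = c("A", "B", "A", "C", "A", "B"))

toy_counts <- toy %>%
    group_by(Ministry) %>%
    summarise(Count = n()) %>%
    arrange(desc(Count)) %>%
    # Reverse the levels so that the largest count becomes the last level,
    # which coord_flip() then places at the top of the chart
    mutate(Ministry = factor(Ministry, levels = rev(Ministry)))

levels(toy_counts$Ministry)
```

Here the rows stay sorted from the largest to the smallest count, while the levels run from the smallest to the largest.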

Next, we look at the code for creating the plot.

```{r eval=FALSE}

# Creating the plot
ggplot(ministry, aes(x = Ministry, y = Count, fill = Ministry)) + geom_bar(stat = 'identity') + 
    geom_hline(yintercept = 0, size = 1, colour="#333333") +
    scale_fill_manual(values = rev(colorRampPalette(brewer.pal(8, "Dark2"))(length(ministry$Ministry)))) +
    bbc_style() +
    coord_flip() +

# Removing legend, showing axis labels and adding the title
    theme(legend.position = "none", 
            axis.title = element_text(size = 18), 
            panel.grid.major.x = element_line(color="#cbcbcb"), 
            panel.grid.major.y=element_blank()) +

    labs(title = "Questions asked by each ministry", 
          subtitle = "Winter Session of 17th Lok Sabha", 
          y = "Number of questions") +

# Showing the number of questions in the graph
    geom_label(aes(x = Ministry, y = Count, label = Count),
         hjust = 1, 
         vjust = 0.5, 
         colour = "white", 
         fill = NA, 
         label.size = NA, 
         family="Helvetica", 
         size = 6)
```

The first section is the basic `ggplot2` code for creating a horizontal bar chart, with manual colors. Of particular importance here is: **`bbc_style()`**

`bbc_style()` and, later, `finalise_plot()` are functions from `bbplot` that make the chart components follow the BBC style, while allowing room for further manual customization. For more information, visit the **[bbplot GitHub repo](https://github.com/bbc/bbplot)**.

In the next two sections, I remove the legend, show the axis labels, and add a title and subtitle to the plot. After that, I add the number of questions as a label on each bar to make the chart easier to read.

`finalise_plot()` simply packages the graphic, adds a footnote and resizes it - producing an image - `QuestionsByMinistry.jpg`
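
The manual fill scale deserves a short note: `brewer.pal()` can supply only 8 colours for the `Dark2` palette, so `colorRampPalette()` interpolates between them to produce one colour per ministry. A minimal sketch, assuming a hypothetical count of 30 ministries:

```{r eval=FALSE}
library(RColorBrewer)

n_ministries <- 30  # hypothetical; the plot code uses length(ministry$Ministry)

# colorRampPalette() returns a function that interpolates the 8 base colours
palette_fn <- colorRampPalette(brewer.pal(8, "Dark2"))
fill_colours <- rev(palette_fn(n_ministries))

length(fill_colours)
```

The result is a character vector of hex colour codes, which is what `scale_fill_manual()` expects in `values`.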

### Understanding the graphic
![](graphs/ministry.jpg)

Now that we have understood how to create the above graphic, let us take a moment to interpret it.

**Health and Family Welfare** leads the race with 332 questions asked to it, with **Railways** trailing at 288. <br><br>It seems that most of the core ministries, such as **Human Resource Development, Environment, Finance, Road Transport and Highways**, were asked questions in this session, which indicates that the House debated some pertinent topics. However, a closer analysis of the subjects of these questions is required.


## Weekly ranking of Ministries

Another interesting way to look at the performance of the Session would be to see whether there was a *shift in focus* of Lok Sabha Members from asking questions to one Ministry to another during the Session. 

We can visualise this through a **bump chart**, which plots the ranking of entities over time. The focus here is usually on comparing the position or performance of multiple observations with respect to each other rather than the actual values themselves. ([From R-bloggers](https://www.r-bloggers.com/bump-chart/))

```{r eval = FALSE}
# Has there been a shift in focus from one ministry to another (by week)?

# Creating the dataset
shift_data <- questions %>% group_by(Week = floor_date(Date, 'week'), Ministry) %>% 
    summarise(count = n()) %>% 
    top_n(5, wt = count) %>% 
    group_by(Week) %>%
    arrange(Week, desc(count)) %>%
    mutate(rank = row_number()) %>%
    filter(rank <= 5) %>%
    ungroup()

# Creating the plot
(
    ggplot(shift_data, aes(x=Week, y=rank, group = Ministry)) +
        geom_line(aes(color = Ministry), size = 2) +
        geom_point(aes(color = Ministry), size = 5) +
        scale_y_reverse(breaks = 1:5) +
        bbc_style() +
        scale_color_tableau() +
        
        theme(legend.position = 'right', 
                axis.title = element_text(size = 18), 
                panel.grid.major.y = element_line(color="#cbcbcb"), 
                panel.grid.major.x=element_blank()) +
        
        labs(title = "Top 5 ministries by number of questions", 
                subtitle = "Winter Session of 17th Lok Sabha",
                y = "Rank", 
                x = "Week")
        
) %>% finalise_plot(source_name = "Data: Lok Sabha; Created by Lakshya Agarwal", 
                    save_filepath = 'graphs/RankingMinistry.jpg', 
                    width = 1600, height = 900)
```

![](graphs/rank.jpg)

Let's go over the code block by block again.

### Understanding the code
```{r}
# Creating the dataset

shift_data <- questions %>% group_by(Week = floor_date(Date, 'week'), Ministry) %>% 
    summarise(count = n()) %>% 
    top_n(5, wt = count) %>% 
    group_by(Week) %>%
    arrange(Week, desc(count)) %>%
    mutate(rank = row_number()) %>%
    filter(rank <= 5) %>%
    ungroup()
```

Here, we want to create rankings for each ministry based on the number of questions asked to them in each week. So, our workflow would be to convert the data from daily to weekly, count the number of questions, assign ranks and plot it. To create the dataset, we follow this process:

1. Group the dataset by **Week** and **Ministry** - using `group_by()` and [`floor_date()`](https://lubridate.tidyverse.org/reference/round_date.html)
2. Count the number of questions - using `summarise()` and `n()`
3. Keep only the **top 5** ministries in each week - using `top_n()`
4. Sort each week in descending order of the number of questions - using `arrange()`
5. Add a **rank** column for each week and ministry - using `mutate()` and `row_number()`
6. Drop any extra ministries that `top_n()` kept because of tied counts - using `filter()`

This gives us the following output:
```{r echo=FALSE}
kable(head(shift_data, 10)) %>% kable_styling(bootstrap_options = c('striped'))
```
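
To make the weekly ranking concrete, here is a small sketch on made-up dates (the ministries `A` and `B` are placeholders). It leaves out `top_n()` and `filter()`, since the toy data has only two ministries:

```{r eval=FALSE}
library(dplyr)
library(lubridate)

# Made-up question dates, for illustration only
toy <- tibble(
    Date = as.Date(c("2019-11-18", "2019-11-19", "2019-11-20",
                     "2019-11-25", "2019-11-26", "2019-11-26")),
    Ministry = c("A", "A", "B", "B", "B", "A")
)

toy_ranks <- toy %>%
    # floor_date() maps every date to the start of its week (Sunday by default)
    group_by(Week = floor_date(Date, "week"), Ministry) %>%
    summarise(count = n()) %>%
    group_by(Week) %>%
    arrange(Week, desc(count)) %>%
    # row_number() numbers the rows within each week in their sorted order
    mutate(rank = row_number()) %>%
    ungroup()
```

Because `row_number()` breaks ties by row order, ministries with equal counts still receive distinct ranks, which is why the real pipeline can safely cut off at rank 5.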

Next, we look at the code for creating the plot.
```{r eval=FALSE}

# Creating the plot
ggplot(shift_data, aes(x = Week, y = rank, group = Ministry)) +
    geom_line(aes(color = Ministry), size = 2) +
    geom_point(aes(color = Ministry), size = 5) +
    scale_y_reverse(breaks = 1:5) +
    bbc_style() +
    scale_color_tableau() +
    
# Adding legend, showing axis labels and adding the title
    theme(legend.position = 'right', 
            axis.title = element_text(size = 18), 
            panel.grid.major.y = element_line(color="#cbcbcb"), 
            panel.grid.major.x=element_blank()) +
    
    labs(title = "Top 5 ministries by number of questions", 
            subtitle = "Winter Session of 17th Lok Sabha",
            y = "Rank", 
            x = "Week")
```

As with the previous plot, this code creates a basic `ggplot2` chart, then adds `bbc_style()`, [`scale_color_tableau()`](https://rdrr.io/cran/ggthemes/man/scale_color_tableau.html), and the labels and titles. *Pretty standard stuff!*

### Understanding the graphic

![](graphs/rank.jpg)

As before, let us interpret this visualisation as well.

**Agriculture and Farmers' Welfare** dominated the questions in the first week of the Session, but disappeared in the next two weeks, only to come back in *second place* in the last week. <br><br>
**Health and Family Welfare** jumped to the top spot after the first week and remained there, while **Human Resource Development** plateaued at the 4th and 2nd places. **Road Transport and Highways, Home Affairs and Jal Shakti** made the top charts at least once. <br><br>
All in all, this visualisation seems to provide a bit more insight into how the focus shifted from one Ministry to another during the session. 

## Subject of questions to Ministries

I would now like to move from analysing the Lok Sabha as a whole to looking into each ministry in depth. To this end, we can look at the ***subject of questions*** that were asked to a Ministry and gain some insights from that. 

### Creating a wordcloud
To visualise textual data, let's create a wordcloud from the subject of questions asked. 

```{r eval=FALSE}
# Ministry-wise - Questions wordcloud

ministry_wordcloud <- function(ministry){
    
    # Filtering selected ministry and creating the dataset
    ministry_q <- questions %>% filter(Ministry == str_to_upper(ministry)) %>%
        mutate(Subject = as.character(Subject)) %>%
        select(Subject) %>%
        unnest_tokens(word, Subject) %>%
        anti_join(get_stopwords(), by = 'word') %>%
        count(word, sort = T) %>%
        head(40)
        
    # Plotting the dataset as a wordcloud
    (
        ggplot(ministry_q, aes(label = word, color = word, size = n)) + 
            geom_text_wordcloud_area(rm_outside = T, family = 'Helvetica') +
            scale_size_area(max_size = 40) +
            bbc_style() +
            labs(title = paste0("Ministry of ", ministry, " - Subject of questions asked"), 
                 subtitle = "Winter Session of 17th Lok Sabha")

    ) %>% finalise_plot(source_name = "Data: Lok Sabha; Created by Lakshya Agarwal", 
                        save_filepath = paste0('graphs/wordcloud/', ministry, '.jpg'), 
                        width = 960, height = 540)
}
```

The above code creates a function - `ministry_wordcloud()` - that takes in a **Ministry Name** as an input and produces a wordcloud of the subject of questions asked to that ministry. For example, for **Health and Family Welfare**, we run

```{r eval=FALSE}
ministry_wordcloud("Health and Family Welfare")
```

which gives

![](graphs/wordcloud/Health And Family Welfare.jpg)


#### Understanding the function

So, what does the function do? Let's have a look.

*Note: I use two packages here - [`tidytext`](https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html) and [`ggwordcloud`](https://lepennec.github.io/ggwordcloud/articles/ggwordcloud.html).*

##### Creating the dataset

```{r eval=FALSE}
# Creating the dataset
ministry_q <- questions %>% filter(Ministry == str_to_upper(ministry)) %>%
        mutate(Subject = as.character(Subject)) %>%
        select(Subject) %>%
        unnest_tokens(word, Subject) %>%
        anti_join(get_stopwords(), by = 'word') %>%
        count(word, sort = T) %>%
        head(40)
```

In order to create our dataset for the wordcloud, we need to get the **subject** of questions asked to the selected Ministry and get the number of times each word appears. The process for doing this is:

1. Filter the selected ministry - using `filter()`
2. Select the **Subject** column, after converting it into `character` - using `select()`
3. Generate a list of words from the **Subject** column - using `unnest_tokens()`
4. Remove the *[stopwords](https://en.wikipedia.org/wiki/Stop_words)* - using `anti_join()`
5. Count the number of times each word occurs and sort it in descending order - using `count()`
6. Select the top 40 words - using `head()`

This gives us the following dataset (created for **Health and Family Welfare**, trimmed to 10 words):
```{r echo=FALSE}
min_q <- questions %>% filter(Ministry == str_to_upper('Health and Family Welfare')) %>%
            mutate(Subject = as.character(Subject)) %>%
            select(Subject) %>%
            unnest_tokens(word, Subject) %>%
            anti_join(get_stopwords(), by = 'word') %>%
            count(word, sort = T) %>%
            head(40)
  
kable(head(min_q, 10)) %>% kable_styling(bootstrap_options = c("striped"))
```
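
The tokenising and stopword steps are easier to follow on a tiny example. The sketch below uses three made-up subject lines, not rows from the dataset:

```{r eval=FALSE}
library(dplyr)
library(tidytext)

# Made-up subjects, for illustration only
toy <- tibble(Subject = c("Setting up of Medical Colleges",
                          "Medical Colleges in the Country",
                          "Shortage of Doctors"))

toy_words <- toy %>%
    unnest_tokens(word, Subject) %>%            # one lowercase word per row
    anti_join(get_stopwords(), by = "word") %>% # drop words such as "of" and "the"
    count(word, sort = TRUE)
```

`unnest_tokens()` lowercases the text by default, so "Medical" in both subjects is counted as the same word.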

##### Creating the wordcloud
```{r eval = FALSE}
ggplot(ministry_q, aes(label = word, color = word, size = n)) + 
            geom_text_wordcloud_area(rm_outside = T, family = 'Helvetica') +
            scale_size_area(max_size = 40) +
            bbc_style() +
            labs(title = paste0("Ministry of ", ministry, " - Subject of questions asked"), 
                 subtitle = "Winter Session of 17th Lok Sabha")
```

With the dataset created, the wordcloud can be easily generated using `geom_text_wordcloud_area()`. After that, I add the usual `bbc_style()` and title, and export the plot through `finalise_plot()`.
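
For anyone who wants to try the wordcloud layer without `bbplot` or a custom font, a stripped-down sketch could look like this. The word counts are made up, and `theme_minimal()` is only a stand-in for `bbc_style()`:

```{r eval=FALSE}
library(ggplot2)
library(ggwordcloud)

# Made-up word counts, for illustration only
toy_words <- data.frame(word = c("medical", "colleges", "doctors", "hospitals"),
                        n = c(12, 9, 5, 3))

p <- ggplot(toy_words, aes(label = word, color = word, size = n)) +
    geom_text_wordcloud_area(rm_outside = TRUE) +  # drop words that do not fit
    scale_size_area(max_size = 20) +               # word area scales with the count
    theme_minimal()
```

Printing `p` draws the wordcloud in the active graphics device.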

### Understanding the wordcloud

Now that we have understood how the wordcloud was generated, we can turn to interpreting the wordclouds and gleaning information from them.

#### Ministry of Human Resource Development
![](graphs/wordcloud/Human Resource Development.jpg)

A majority of the questions to the [MHRD](https://en.wikipedia.org/wiki/Ministry_of_Human_Resource_Development) were focused on topics such as *education, schools, institutes, and "kendriya vidyalayas"*. This is in line with the objective of the Ministry to ensure good, affordable, quality education to the citizens of the country. 

#### Ministry of Environment, Forest and Climate Change
![](graphs/wordcloud/Environment, Forests And Climate Change.jpg)

Questions to the [MoEFCC](https://en.wikipedia.org/wiki/Ministry_of_Environment,_Forest_and_Climate_Change) were mainly targeted towards topics such as *pollution, waste, forests, air and plastic*. This is in line with the extremely poor air quality during the Session and growing concerns over deforestation and plastic waste management.

### Shiny app
I have also created an application for the above visualisation that can be accessed [here](https://lakshyaag.shinyapps.io/LokSabhaQuestions).

# Conclusion
I hope this was an informative article. I had a lot of fun working with this dataset and creating visualisations to test out my understanding.\
I will upload the source code to my [GitHub](https://github.com/lakshyaag/). If you have any queries, feel free to ask me at <lakshyagrwal12@gmail.com> or send me a message on [Twitter](https://twitter.com/lakshyaag).