1 Introduction

This document serves as an example of analysing the questions that were asked in the Winter Session of the 17th Lok Sabha. The Lok Sabha or House of the People is the lower house of India’s bicameral Parliament.

This Winter Session ran from 18th November, 2019 to 13th December, 2019.

1.1 Importing libraries

First, we import the necessary libraries.

library(tidyverse)
library(lubridate)
library(dplyr)
library(bbplot)
library(ggthemes)
library(RColorBrewer)
library(ggwordcloud)
library(tidytext)
library(knitr)
library(kableExtra)

tidyverse and its associated libraries are used to work with tidy data. bbplot, a package by the BBC, is used to style ggplot2 charts.

2 Working with the data

2.1 Reading the dataset

questions <- read.csv('Winter_LokSabha17Questions.csv')

kable(sample_n(questions, 10)) %>% kable_styling(bootstrap_options = c('striped')) # 10 random entries from the dataset

Q.NO.	Q.Type	Date	Ministry	Member	Subject
3737	UNSTARRED PDF/WORD PDF/WORD(Hindi)	11.12.2019	RAILWAYS	Jai Prakash,Shri	Incidents of Animals Hit by Trains
760	UNSTARRED PDF/WORD PDF/WORD(Hindi)	21.11.2019	HOUSING AND URBAN AFFAIRS	Bhoumik,Ms. Pratima,Patel,Shri Devji Mansingram,Shrangre,Shri Sudhakar Tukaram	Status of NURHP
2401	UNSTARRED PDF/WORD PDF/WORD(Hindi)	03.12.2019	HOME AFFAIRS	Adhikari, Shri Deepak (Dev)	National Fingerprint Database
1324	UNSTARRED PDF/WORD PDF/WORD(Hindi)	25.11.2019	HUMAN RESOURCE DEVELOPMENT	Rajoria,Dr. Manoj	Higher Education
97	STARRED PDF/WORD PDF/WORD(Hindi)	22.11.2019	ENVIRONMENT, FORESTS AND CLIMATE CHANGE	Gogoi,Shri Gaurav	Temporary use of Forest Land
1747	UNSTARRED PDF/WORD PDF/WORD(Hindi)	28.11.2019	JAL SHAKTI	Singh,Shri Shyam Yadav	Abatement of Pollution in Ganga River
950	UNSTARRED PDF/WORD PDF/WORD(Hindi)	22.11.2019	HEALTH AND FAMILY WELFARE	Senthilkumar. S.,Shri DNV,Raut,Shri Vinayak Bhaurao,Barne,Shri Shrirang Appa,Patil,Shri Hemant	Assault on Doctor
1542	UNSTARRED PDF/WORD PDF/WORD(Hindi)	27.11.2019	PLANNING	Singh,Shri Shyam Yadav	Nomenclature of Planning Commission as NITI Aayog
3726	UNSTARRED PDF/WORD PDF/WORD(Hindi)	11.12.2019	RAILWAYS	Kaswan,Shri Rahul	Priority in Reservation under Special Quota
2212	UNSTARRED PDF/WORD PDF/WORD(Hindi)	02.12.2019	HUMAN RESOURCE DEVELOPMENT	Misra,Shri Ajay (Teni)	IIIT

str(questions)

## 'data.frame':    4740 obs. of  6 variables:
##  $ Q.NO.   : int  380 379 378 377 376 375 374 373 372 371 ...
##  $ Q.Type  : Factor w/ 7 levels "STARRED PDF/WORD",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Date    : Factor w/ 20 levels "02.12.2019","03.12.2019",..: 10 10 10 10 10 10 10 10 10 10 ...
##  $ Ministry: Factor w/ 54 levels "AGRICULTURE AND FARMERS WELFARE",..: 21 21 16 50 16 3 16 21 53 21 ...
##  $ Member  : Factor w/ 1116 levels "","Abbaiah,Shri Narayana Swamy",..: 835 569 794 826 1018 873 548 774 734 925 ...
##  $ Subject : Factor w/ 4569 levels "12th Five Year Plan for TSP",..: 2576 1788 4069 3864 1209 495 675 85 468 433 ...

As can be seen from the above output, the dataset is messy. More specifically,

The column names need to be renamed to facilitate repeated usage.
The Date column is not in the correct data type. Moreover, we need to keep only the questions for the Winter Session.
The Q.Type column contains a lot of unnecessary text.
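
To see why the Date type matters, here is a minimal sketch using only base R. The two values are dates that occur in the dataset, written in its dd.mm.yyyy format; the variable names are my own and purely illustrative.

```r
# Two dates in the dataset's dd.mm.yyyy text format
raw_date <- c("02.12.2019", "18.11.2019")

# Sorted as text, 2 December wrongly comes before 18 November
text_order <- sort(raw_date)

# Parsed as Date, the chronological order is recovered
parsed <- as.Date(raw_date, format = "%d.%m.%Y")
date_order <- format(sort(parsed))

print(text_order)
print(date_order)
```

Any comparison against the session dates would therefore be unreliable until the column is parsed.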

2.2 Cleaning the dataset

We now turn to cleaning the dataset.

questions <- questions %>% rename('Question Number' = "Q.NO.", "Type" = "Q.Type") %>%
    mutate(Type = (str_replace(Type, pattern = "(PDF).*", replacement = "")) %>% str_trim(), 
           Date = as.Date(Date, format = "%d.%m.%Y"),
           Ministry = str_trim(Ministry)) %>%
  
  # Filtering the winter session questions only
    filter(Date >= as.Date("2019-11-18")) %>% 
  
  # Adding actual link
    mutate(Link = ifelse(Type == "STARRED", 
                         paste0("http://164.100.24.220/loksabhaquestions/annex/172/AS", 
                                  `Question Number`, 
                                  '.pdf'), 
                         paste0("http://164.100.24.220/loksabhaquestions/annex/172/AU", 
                                  `Question Number`, 
                                  '.pdf'))) 
    

kable(sample_n(questions, 10)) %>% kable_styling(bootstrap_options = c('striped')) # 10 random entries from the dataset

Question Number	Type	Date	Ministry	Member	Subject	Link
3541	UNSTARRED	2019-12-10	AGRICULTURE AND FARMERS WELFARE	Kuriakose,Adv. Dean	Remunerative Support Price	http://164.100.24.220/loksabhaquestions/annex/172/AU3541.pdf
3274	UNSTARRED	2019-12-09	PETROLEUM AND NATURAL GAS	Bista,Shri Raju	Home Delivery Charges	http://164.100.24.220/loksabhaquestions/annex/172/AU3274.pdf
4033	UNSTARRED	2019-12-12	HOUSING AND URBAN AFFAIRS	Bista,Shri Raju	PMAY Houses in West Bengal	http://164.100.24.220/loksabhaquestions/annex/172/AU4033.pdf
245	STARRED	2019-12-05	CIVIL AVIATION	Chavda,Shri Vinod	Expansion of Airports	http://164.100.24.220/loksabhaquestions/annex/172/AS245.pdf
4247	UNSTARRED	2019-12-13	HEALTH AND FAMILY WELFARE	Hegde,Shri Anantkumar ,Reddy,Shri Komati Reddy Venkat,Kanumuru,Shri Raghu Ramakrishna Raju	Cancer Deaths	http://164.100.24.220/loksabhaquestions/annex/172/AU4247.pdf
3813	UNSTARRED	2019-12-11	RAILWAYS	Kodikunnil,Shri Suresh	Facilities for Chengannur Station	http://164.100.24.220/loksabhaquestions/annex/172/AU3813.pdf
1005	UNSTARRED	2019-11-22	AYURVEDA,YOGA & NATUROPATHY,UNANI,SIDDHA AND HOMEOPATHY (AYUSH)	Kesineni,Shri Srinivas,Sreekandan,Shri Vellalath Kochukrishnan Nair.	Medicinal Plants	http://164.100.24.220/loksabhaquestions/annex/172/AU1005.pdf
90	UNSTARRED	2019-11-18	PETROLEUM AND NATURAL GAS	Ariff,Adv. Abdul Majeed	Profits of Public Sector Oil Companies	http://164.100.24.220/loksabhaquestions/annex/172/AU90.pdf
2074	UNSTARRED	2019-12-02	HUMAN RESOURCE DEVELOPMENT	Kumar,Shri Kaushalendra	Primary Education	http://164.100.24.220/loksabhaquestions/annex/172/AU2074.pdf
283	UNSTARRED	2019-11-19	CHEMICALS AND FERTILIZERS	Mohan,Shri P. C.	BIS Standards for Chemicals	http://164.100.24.220/loksabhaquestions/annex/172/AU283.pdf

Note that I have also added a link to the actual question and the subsequent answer to that question. The files on the Lok Sabha server follow a pattern, which makes the task very simple.
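
The pattern can also be written as a small helper. This is only a sketch in base R: question_link() is a hypothetical name, and the base URL and the AS/AU prefixes are copied from the cleaning code above.

```r
# Hypothetical helper: build the link for a question from its type and number
question_link <- function(type, number) {
  # Starred questions use the "AS" prefix, everything else "AU"
  prefix <- ifelse(type == "STARRED", "AS", "AU")
  paste0("http://164.100.24.220/loksabhaquestions/annex/172/",
         prefix, number, ".pdf")
}

links <- question_link(c("STARRED", "UNSTARRED"), c(245, 3541))
print(links)
```

The two links produced here match the ones shown for questions 245 and 3541 in the table above.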

3 Data analysis and visualization

3.1 Number of questions to Ministries

Let us see which ministry was asked the most questions in this session.

# Creating the dataset
ministry <- questions %>% group_by(Ministry) %>% 
    summarise(Count = n()) %>% 
    arrange(desc(Count)) %>%
    mutate(Ministry = factor(Ministry, levels = rev(Ministry)))

# Creating the plot
(
    ggplot(ministry, aes(x = Ministry, y = Count, fill = Ministry)) + geom_bar(stat = 'identity') + 
        geom_hline(yintercept = 0, size = 1, colour="#333333") +
        scale_fill_manual(values = rev(colorRampPalette(brewer.pal(8, "Dark2"))(length(ministry$Ministry)))) +
        bbc_style() +
        coord_flip() +
    
        theme(legend.position = "none", 
                axis.title = element_text(size = 18), 
                panel.grid.major.x = element_line(color="#cbcbcb"), 
                panel.grid.major.y=element_blank()) +
    
        labs(title = "Questions asked to each ministry", 
              subtitle = "Winter Session of 17th Lok Sabha", 
              y = "Number of questions") +
    
        geom_label(aes(x = Ministry, y = Count, label = Count),
             hjust = 1, 
             vjust = 0.5, 
             colour = "white", 
             fill = NA, 
             label.size = NA, 
             family="Helvetica", 
             size = 6)


) %>% finalise_plot(source_name = "Data: Lok Sabha; Created by Lakshya Agarwal", 
                    save_filepath = 'graphs/QuestionsByMinistry.jpg', 
                    width = 1920, height = 1080)

That is a lot of code for the graph below, so we shall go over it block by block.

3.1.1 Understanding the code

# Creating the dataset
ministry <- questions %>% group_by(Ministry) %>% 
    summarise(Count = n()) %>% 
    arrange(desc(Count)) %>%
    mutate(Ministry = factor(Ministry, levels = rev(Ministry)))

First, I create a new dataframe that consists of the number of questions asked to each ministry, and transform the Ministry column into a factor.

The process is:

Group the dataset by Ministry - using group_by()
summarise to get the number of occurrences - using n()
Sort the dataframe in descending order of Count - using arrange()
Convert Ministry into a factor - using mutate()
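
The last step is easy to overlook, so here is a minimal base R sketch of what it changes. The three ministry names are the top rows of the summary; the variable names are illustrative only.

```r
# Ministries already sorted by descending Count
ministry_names <- c("HEALTH AND FAMILY WELFARE",
                    "RAILWAYS",
                    "HUMAN RESOURCE DEVELOPMENT")

# Default factor: levels are alphabetical, so the sort order is lost
default_levels <- levels(factor(ministry_names))

# Explicit levels in reversed order: the largest ministry becomes the last level
ordered_levels <- levels(factor(ministry_names, levels = rev(ministry_names)))

print(default_levels)
print(ordered_levels)
```

With coord_flip(), the first level is drawn at the bottom of the chart, so putting the largest ministry last places its bar at the top.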

This gives us the following output:

Ministry	Count
HEALTH AND FAMILY WELFARE	332
RAILWAYS	288
HUMAN RESOURCE DEVELOPMENT	266
ENVIRONMENT, FORESTS AND CLIMATE CHANGE	219
AGRICULTURE AND FARMERS WELFARE	200

Next, we look at the code for creating the plot.

# Creating the plot
ggplot(ministry, aes(x = Ministry, y = Count, fill = Ministry)) + geom_bar(stat = 'identity') + 
    geom_hline(yintercept = 0, size = 1, colour="#333333") +
    scale_fill_manual(values = rev(colorRampPalette(brewer.pal(8, "Dark2"))(length(ministry$Ministry)))) +
    bbc_style() +
    coord_flip() +

# Removing legend, showing axis labels and adding the title
    theme(legend.position = "none", 
            axis.title = element_text(size = 18), 
            panel.grid.major.x = element_line(color="#cbcbcb"), 
            panel.grid.major.y=element_blank()) +

    labs(title = "Questions asked to each ministry", 
          subtitle = "Winter Session of 17th Lok Sabha", 
          y = "Number of questions") +

# Showing the number of questions in the graph
    geom_label(aes(x = Ministry, y = Count, label = Count),
         hjust = 1, 
         vjust = 0.5, 
         colour = "white", 
         fill = NA, 
         label.size = NA, 
         family="Helvetica", 
         size = 6)

The first section is the basic ggplot2 code for creating a horizontal bar chart, with manual colors. Of particular importance here is: bbc_style()

bbc_style() is a function from bbplot (as is finalise_plot(), which is used afterwards) that makes the chart components follow BBC style, while allowing room for further manual customization. For more information, visit the bbplot GitHub repo.

Then, I remove the legend, show the axis labels, add a title and subtitle to the plot in the next two sections. Post that, I add the number of questions as a label in the chart to facilitate easy comprehension.

finalise_plot() simply packages the graphic, adds a footnote and resizes it - producing an image - QuestionsByMinistry.jpg
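
Coming back to the manual colours in the first section: the Dark2 palette has only 8 colours, so colorRampPalette() is used to stretch it to one colour per ministry. The sketch below shows the idea in base R. The hex codes are the 8 Dark2 colours written out by hand so that RColorBrewer is not needed, and 54 is the number of Ministry levels reported by str(questions); the variable names are my own.

```r
# The 8 colours of the ColorBrewer "Dark2" palette
dark2 <- c("#1B9E77", "#D95F02", "#7570B3", "#E7298A",
           "#66A61E", "#E6AB02", "#A6761D", "#666666")

# colorRampPalette() returns a function that interpolates between the colours
make_palette <- colorRampPalette(dark2)

# Ask for as many colours as there are bars
bar_colours <- make_palette(54)
print(length(bar_colours))
print(head(bar_colours, 3))
```

The interpolated colours run from the first Dark2 colour to the last, which is why neighbouring bars get similar shades.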

3.1.2 Understanding the graphic

Now that we have understood how to create the above graphic, let us take a moment to interpret it.

Health and Family Welfare leads the race with 332 questions asked to it, with Railways trailing at 288.

It seems that most of the core ministries, such as Human Resource Development, Environment, Finance, and Road Transport and Highways, were asked questions in this session, which does indicate that the House debated some pertinent topics. However, a closer analysis of the subjects of these questions is required.

3.2 Weekly ranking of Ministries

Another interesting way to look at the performance of the Session would be to see whether there was a shift in focus of Lok Sabha Members from asking questions to one Ministry to another during the Session.

We can visualise this through a bump chart, which plots the ranking of entities over time. The focus here is usually on comparing the position or performance of multiple observations with respect to each other rather than the actual values themselves. (From R-bloggers)
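
As a rough illustration of what a bump chart plots, the base R sketch below turns counts into within-period ranks. The counts and the entity names A, B and C are made up for this example and do not come from the dataset.

```r
# Hypothetical counts for three entities over two periods
counts <- data.frame(
  period = rep(c("week 1", "week 2"), each = 3),
  entity = rep(c("A", "B", "C"), times = 2),
  n      = c(50, 30, 10, 20, 45, 25)
)

# Rank within each period; negating n makes rank 1 the largest count
counts$rank <- ave(-counts$n, counts$period, FUN = rank)
print(counts)
```

Entity A drops from rank 1 to rank 3 between the two periods, and it is this movement that the chart shows, not the counts themselves.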

# Has there been a shift in focus from one ministry to another (by week)?

# Creating the dataset
shift_data <- questions %>% group_by(Week = floor_date(Date, 'week'), Ministry) %>% 
    summarise(count = n()) %>% 
    top_n(5, wt = count) %>% 
    group_by(Week) %>%
    arrange(Week, desc(count)) %>%
    mutate(rank = row_number()) %>%
    filter(rank <= 5) %>%
    ungroup()

# Creating the plot
(
    ggplot(shift_data, aes(x=Week, y=rank, group = Ministry)) +
        geom_line(aes(color = Ministry), size = 2) +
        geom_point(aes(color = Ministry), size = 5) +
        scale_y_reverse(breaks = 1:5) +
        bbc_style() +
        scale_color_tableau() +
        
        theme(legend.position = 'right', 
                axis.title = element_text(size = 18), 
                panel.grid.major.y = element_line(color="#cbcbcb"), 
                panel.grid.major.x=element_blank()) +
        
        labs(title = "Top 5 ministries by number of questions", 
                subtitle = "Winter Session of 17th Lok Sabha",
                y = "Rank", 
                x = "Week")
        
) %>% finalise_plot(source_name = "Data: Lok Sabha; Created by Lakshya Agarwal", 
                    save_filepath = 'graphs/RankingMinistry.jpg', 
                    width = 1600, height = 900)

Let’s go over the code block by block again.

3.2.1 Understanding the code

# Creating the dataset

shift_data <- questions %>% group_by(Week = floor_date(Date, 'week'), Ministry) %>% 
    summarise(count = n()) %>% 
    top_n(5, wt = count) %>% 
    group_by(Week) %>%
    arrange(Week, desc(count)) %>%
    mutate(rank = row_number()) %>%
    filter(rank <= 5) %>%
    ungroup()

Here, we want to create rankings for each ministry based on the number of questions asked to them in each week. So, our workflow would be to convert the data from daily to weekly, count the number of questions, assign ranks and plot it. To create the dataset, we follow this process:

Group the dataset by Week and Ministry - using group_by() and floor_date()
summarise to get the count of number of questions - using n()
Keep only the top 5 ministries in each week - using top_n()
Sort by descending order of number of questions asked in each week - using arrange()
Add a rank column for each week and ministry - using mutate() and row_number()
Drop any extra ministries that top_n() kept because of ties, so that no week has more than five ranks - using filter()
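
Two details of this process are worth a small sketch. First, flooring to the week moves every date back to the start of its week, which the Week values in the output below show to be a Sunday. Second, top_n() keeps every row tied at the cut-off, so more than five rows can survive until the final filter(). The code uses base R stand-ins rather than lubridate and dplyr, and the sixth count is hypothetical.

```r
# Week flooring: step each date back to the preceding Sunday
dates <- as.Date(c("2019-11-18", "2019-11-22", "2019-11-25"))
week_start <- dates - as.POSIXlt(dates)$wday
print(format(week_start))

# Ties: the first five counts are week 1 of the output, the sixth is made up
count <- c(85, 85, 81, 62, 59, 59)

# Ranking with shared ranks for ties keeps all six rows under "rank <= 5"
kept_with_ties <- sum(rank(-count, ties.method = "min") <= 5)

# Numbering the sorted rows 1, 2, 3, ... keeps exactly five
kept_by_row_number <- sum(seq_along(count) <= 5)

print(c(kept_with_ties, kept_by_row_number))
```

This is also why two ministries with the same count get different ranks: row_number() breaks the tie by row order.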

This gives us the following output:

Week	Ministry	count	rank
2019-11-17	AGRICULTURE AND FARMERS WELFARE	85	1
2019-11-17	RAILWAYS	85	2
2019-11-17	HEALTH AND FAMILY WELFARE	81	3
2019-11-17	HUMAN RESOURCE DEVELOPMENT	62	4
2019-11-17	ENVIRONMENT, FORESTS AND CLIMATE CHANGE	59	5
2019-11-24	HEALTH AND FAMILY WELFARE	92	1
2019-11-24	HUMAN RESOURCE DEVELOPMENT	79	2
2019-11-24	RAILWAYS	79	3
2019-11-24	ENVIRONMENT, FORESTS AND CLIMATE CHANGE	55	4
2019-11-24	ROAD TRANSPORT AND HIGHWAYS	45	5

Next, we look at the code for creating the plot.

# Creating the plot
ggplot(shift_data, aes(x = Week, y = rank, group = Ministry)) +
    geom_line(aes(color = Ministry), size = 2) +
    geom_point(aes(color = Ministry), size = 5) +
    scale_y_reverse(breaks = 1:5) +
    bbc_style() +
    scale_color_tableau() +
    
# Adding legend, showing axis labels and adding the title
    theme(legend.position = 'right', 
            axis.title = element_text(size = 18), 
            panel.grid.major.y = element_line(color="#cbcbcb"), 
            panel.grid.major.x=element_blank()) +
    
    labs(title = "Top 5 ministries by number of questions", 
            subtitle = "Winter Session of 17th Lok Sabha",
            y = "Rank", 
            x = "Week")

Similar to the previous plot, this one also creates a basic ggplot2 chart, adds bbc_style() and scale_color_tableau() and the labels and titles. Pretty standard stuff!

3.2.2 Understanding the graphic

As before, let us interpret this visualisation as well.

Agriculture and Farmers’ Welfare dominated the questions in the first week of the Session, but disappeared in the next two weeks, only to come back at second place in the last week.

Health and Family Welfare jumped to the top spot after the first week and remained there, while Human Resource Development moved between 4th and 2nd place. Road Transport and Highways, Home Affairs and Jal Shakti made the top charts at least once.

All in all, this visualisation seems to provide a bit more insight into how the focus shifted from one Ministry to another during the session.

3.3 Subject of questions to Ministries

I would now like to move from analysing the Lok Sabha as a whole to looking into each ministry in depth. To this end, we can look at the subject of questions that were asked to a Ministry and gain some insights from that.

3.3.1 Creating a wordcloud

To visualise textual data, let’s create a wordcloud from the subject of questions asked.

# Ministry-wise - Questions wordcloud

ministry_wordcloud <- function(ministry){
    
    # Filtering selected ministry and creating the dataset
    ministry_q <- questions %>% filter(Ministry == str_to_upper(ministry)) %>%
        mutate(Subject = as.character(Subject)) %>%
        select(Subject) %>%
        unnest_tokens(word, Subject) %>%
        anti_join(get_stopwords(), by = 'word') %>%
        count(word, sort = TRUE) %>%
        head(40)
        
    # Plotting the dataset as a wordcloud
    (
        ggplot(ministry_q, aes(label = word, color = word, size = n)) + 
            geom_text_wordcloud_area(rm_outside = T, family = 'Helvetica') +
            scale_size_area(max_size = 40) +
            bbc_style() +
            labs(title = paste0("Ministry of ", ministry, " - Subject of questions asked"), 
                 subtitle = "Winter Session of 17th Lok Sabha")

    ) %>% finalise_plot(source_name = "Data: Lok Sabha; Created by Lakshya Agarwal", 
                        save_filepath = paste0('graphs/wordcloud/', ministry, '.jpg'), 
                        width = 960, height = 540)
}

The above code creates a function - ministry_wordcloud() - that takes in a Ministry Name as an input and produces a wordcloud of the subject of questions asked to that ministry. For example, for Health and Family Welfare, we run

ministry_wordcloud("Health and Family Welfare")

which gives

3.3.1.1 Understanding the function

So, what does the function do? Let’s have a look.

Note: I use two packages here - tidytext and ggwordcloud.

3.3.1.1.1 Creating the dataset

# Creating the dataset
ministry_q <- questions %>% filter(Ministry == str_to_upper(ministry)) %>%
        mutate(Subject = as.character(Subject)) %>%
        select(Subject) %>%
        unnest_tokens(word, Subject) %>%
        anti_join(get_stopwords(), by = 'word') %>%
        count(word, sort = TRUE) %>%
        head(40)

In order to create our dataset for the wordcloud, we need to get the subject of questions asked to the selected Ministry and get the number of times each word appears. The process for doing this is:

Filter the selected ministry - using filter()
Select the Subject column, after converting it into character - using select()
Generate a list of words from the Subject column - using unnest_tokens()
Remove the stopwords - using anti_join()
Count the number of times each word occurs and sort it in descending order - using count()
Select the top 40 words - using head()
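
The same steps can be sketched on a handful of subjects without any packages. This is a base R stand-in for unnest_tokens(), anti_join() and count(), not what tidytext does internally; the stopword list is a tiny hypothetical one and the subjects are illustrative.

```r
# A few illustrative question subjects
subjects <- c("Assault on Doctor", "Cancer Deaths",
              "Medical Colleges", "Cancer Hospitals")

# Tiny hypothetical stopword list (get_stopwords() returns a much longer one)
stop_words <- c("on", "of", "the", "in", "for", "and", "to")

# Lowercase, split on anything that is not a letter or digit, drop stopwords
words <- unlist(strsplit(tolower(subjects), "[^a-z0-9]+"))
words <- words[nzchar(words) & !(words %in% stop_words)]

# Count each word and sort in descending order
word_counts <- sort(table(words), decreasing = TRUE)
print(head(word_counts, 3))
```

The word "cancer" appears twice and so comes first, while "on" is removed as a stopword.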

This gives us the following dataset (created for Health and Family Welfare, trimmed to 10 words):

word	n
medical	32
health	30
hospitals	20
ayushman	18
cghs	18
bharat	17
centres	13
food	13
cancer	12
facilities	12

3.3.1.1.2 Creating the wordcloud

ggplot(ministry_q, aes(label = word, color = word, size = n)) + 
            geom_text_wordcloud_area(rm_outside = T, family = 'Helvetica') +
            scale_size_area(max_size = 40) +
            bbc_style() +
            labs(title = paste0("Ministry of ", ministry, " - Subject of questions asked"), 
                 subtitle = "Winter Session of 17th Lok Sabha")

With the dataset created, the wordcloud can be easily generated using geom_text_wordcloud_area(). Post that, I add the usual bbc_style() and title and export it through finalise_plot().

3.3.2 Understanding the wordcloud

Now that we have understood how the wordcloud is generated, we can turn to interpreting the wordclouds for individual ministries.

3.3.2.1 Ministry of Human Resource Development

A majority of the questions to the MHRD were focused on topics such as education, schools, institutes, and “kendriya vidyalayas”. This is in line with the objective of the Ministry to ensure good, affordable, quality education to the citizens of the country.

3.3.2.2 Ministry of Environment, Forest and Climate Change

Questions to the MoEFCC were mainly targeted towards topics such as pollution, waste, forests, air and plastic. This is in line with the extremely poor air quality during the Session and growing concerns over deforestation and plastic waste management.

3.3.3 Shiny app

I have also created an application for the above visualisation that can be accessed here.

4 Conclusion

I hope this was an informative article. I had a lot of fun working with this dataset and creating visualisations to test out my understanding.
I will upload the source code to my GitHub. If you have any queries, feel free to ask me at lakshyagrwal12@gmail.com or send me a message on Twitter.

17th Lok Sabha - Winter Session Question Analysis

Lakshya Agarwal

23/12/2019