This document serves as an example of analysing the questions that were asked in the Winter Session of the 17th Lok Sabha. The Lok Sabha or House of the People is the lower house of India’s bicameral Parliament.
This Winter Session ran from 18th November, 2019 to 13th December, 2019.
First, we import the necessary libraries.
library(tidyverse)
library(lubridate)
library(dplyr)
library(bbplot)
library(ggthemes)
library(RColorBrewer)
library(ggwordcloud)
library(tidytext)
library(knitr)
library(kableExtra)
tidyverse and its associated libraries are used to leverage the power of tidy data. bbplot by BBC is a package that will be used to create ggplot2 charts.
questions <- read.csv('Winter_LokSabha17Questions.csv')
kable(sample_n(questions, 10)) %>% kable_styling(bootstrap_options = c('striped')) # 10 random entries from the dataset
Q.NO. | Q.Type | Date | Ministry | Member | Subject |
---|---|---|---|---|---|
3737 | UNSTARRED PDF/WORD PDF/WORD(Hindi) | 11.12.2019 | RAILWAYS | Jai Prakash,Shri | Incidents of Animals Hit by Trains |
760 | UNSTARRED PDF/WORD PDF/WORD(Hindi) | 21.11.2019 | HOUSING AND URBAN AFFAIRS | Bhoumik,Ms. Pratima,Patel,Shri Devji Mansingram,Shrangre,Shri Sudhakar Tukaram | Status of NURHP |
2401 | UNSTARRED PDF/WORD PDF/WORD(Hindi) | 03.12.2019 | HOME AFFAIRS | Adhikari, Shri Deepak (Dev) | National Fingerprint Database |
1324 | UNSTARRED PDF/WORD PDF/WORD(Hindi) | 25.11.2019 | HUMAN RESOURCE DEVELOPMENT | Rajoria,Dr. Manoj | Higher Education |
97 | STARRED PDF/WORD PDF/WORD(Hindi) | 22.11.2019 | ENVIRONMENT, FORESTS AND CLIMATE CHANGE | Gogoi,Shri Gaurav | Temporary use of Forest Land |
1747 | UNSTARRED PDF/WORD PDF/WORD(Hindi) | 28.11.2019 | JAL SHAKTI | Singh,Shri Shyam Yadav | Abatement of Pollution in Ganga River |
950 | UNSTARRED PDF/WORD PDF/WORD(Hindi) | 22.11.2019 | HEALTH AND FAMILY WELFARE | Senthilkumar. S.,Shri DNV,Raut,Shri Vinayak Bhaurao,Barne,Shri Shrirang Appa,Patil,Shri Hemant | Assault on Doctor |
1542 | UNSTARRED PDF/WORD PDF/WORD(Hindi) | 27.11.2019 | PLANNING | Singh,Shri Shyam Yadav | Nomenclature of Planning Commission as NITI Aayog |
3726 | UNSTARRED PDF/WORD PDF/WORD(Hindi) | 11.12.2019 | RAILWAYS | Kaswan,Shri Rahul | Priority in Reservation under Special Quota |
2212 | UNSTARRED PDF/WORD PDF/WORD(Hindi) | 02.12.2019 | HUMAN RESOURCE DEVELOPMENT | Misra,Shri Ajay (Teni) | IIIT |
## 'data.frame': 4740 obs. of 6 variables:
## $ Q.NO. : int 380 379 378 377 376 375 374 373 372 371 ...
## $ Q.Type : Factor w/ 7 levels "STARRED PDF/WORD",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Date : Factor w/ 20 levels "02.12.2019","03.12.2019",..: 10 10 10 10 10 10 10 10 10 10 ...
## $ Ministry: Factor w/ 54 levels "AGRICULTURE AND FARMERS WELFARE",..: 21 21 16 50 16 3 16 21 53 21 ...
## $ Member : Factor w/ 1116 levels "","Abbaiah,Shri Narayana Swamy",..: 835 569 794 826 1018 873 548 774 734 925 ...
## $ Subject : Factor w/ 4569 levels "12th Five Year Plan for TSP",..: 2576 1788 4069 3864 1209 495 675 85 468 433 ...
As can be seen from the above output, the dataset is messy. More specifically:
- The Date column is not in the correct data type. Moreover, we need to keep only the questions from the Winter Session.
- The Q.Type column contains a lot of unnecessary text.

We now turn to cleaning the dataset.
questions <- questions %>% rename('Question Number' = "Q.NO.", "Type" = "Q.Type") %>%
mutate(Type = (str_replace(Type, pattern = "(PDF).*", replacement = "")) %>% str_trim(),
Date = as.Date(Date, format = "%d.%m.%Y"),
Ministry = str_trim(Ministry)) %>%
# Filtering the winter session questions only
filter(Date >= as.Date("2019-11-18")) %>%
# Adding actual link
mutate(Link = ifelse(Type == "STARRED",
paste0("http://164.100.24.220/loksabhaquestions/annex/172/AS",
`Question Number`,
'.pdf'),
paste0("http://164.100.24.220/loksabhaquestions/annex/172/AU",
`Question Number`,
'.pdf')))
kable(sample_n(questions, 10)) %>% kable_styling(bootstrap_options = c('striped')) # 10 random entries from the dataset
Question Number | Type | Date | Ministry | Member | Subject | Link |
---|---|---|---|---|---|---|
3541 | UNSTARRED | 2019-12-10 | AGRICULTURE AND FARMERS WELFARE | Kuriakose,Adv. Dean | Remunerative Support Price | http://164.100.24.220/loksabhaquestions/annex/172/AU3541.pdf |
3274 | UNSTARRED | 2019-12-09 | PETROLEUM AND NATURAL GAS | Bista,Shri Raju | Home Delivery Charges | http://164.100.24.220/loksabhaquestions/annex/172/AU3274.pdf |
4033 | UNSTARRED | 2019-12-12 | HOUSING AND URBAN AFFAIRS | Bista,Shri Raju | PMAY Houses in West Bengal | http://164.100.24.220/loksabhaquestions/annex/172/AU4033.pdf |
245 | STARRED | 2019-12-05 | CIVIL AVIATION | Chavda,Shri Vinod | Expansion of Airports | http://164.100.24.220/loksabhaquestions/annex/172/AS245.pdf |
4247 | UNSTARRED | 2019-12-13 | HEALTH AND FAMILY WELFARE | Hegde,Shri Anantkumar ,Reddy,Shri Komati Reddy Venkat,Kanumuru,Shri Raghu Ramakrishna Raju | Cancer Deaths | http://164.100.24.220/loksabhaquestions/annex/172/AU4247.pdf |
3813 | UNSTARRED | 2019-12-11 | RAILWAYS | Kodikunnil,Shri Suresh | Facilities for Chengannur Station | http://164.100.24.220/loksabhaquestions/annex/172/AU3813.pdf |
1005 | UNSTARRED | 2019-11-22 | AYURVEDA,YOGA & NATUROPATHY,UNANI,SIDDHA AND HOMEOPATHY (AYUSH) | Kesineni,Shri Srinivas,Sreekandan,Shri Vellalath Kochukrishnan Nair. | Medicinal Plants | http://164.100.24.220/loksabhaquestions/annex/172/AU1005.pdf |
90 | UNSTARRED | 2019-11-18 | PETROLEUM AND NATURAL GAS | Ariff,Adv. Abdul Majeed | Profits of Public Sector Oil Companies | http://164.100.24.220/loksabhaquestions/annex/172/AU90.pdf |
2074 | UNSTARRED | 2019-12-02 | HUMAN RESOURCE DEVELOPMENT | Kumar,Shri Kaushalendra | Primary Education | http://164.100.24.220/loksabhaquestions/annex/172/AU2074.pdf |
283 | UNSTARRED | 2019-11-19 | CHEMICALS AND FERTILIZERS | Mohan,Shri P. C. | BIS Standards for Chemicals | http://164.100.24.220/loksabhaquestions/annex/172/AU283.pdf |
Note that I have also added a link to the actual question and the subsequent answer to that question. The files on the Lok Sabha server follow a pattern, which makes the task very simple.
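To make the cleaning steps concrete, here is a minimal sketch that applies the same pattern, date format and link rule to two values copied from the sample rows above. It uses base R's sub() and trimws() as stand-ins for str_replace() and str_trim(), and the names raw_type, raw_date, q_no and base_url are made up for this illustration.

```r
# Two raw values copied from the sample rows shown earlier
raw_type <- c("UNSTARRED PDF/WORD PDF/WORD(Hindi)", "STARRED PDF/WORD PDF/WORD(Hindi)")
raw_date <- c("11.12.2019", "22.11.2019")
q_no     <- c(3737L, 97L)

# Drop everything from the first "PDF" onwards, then trim the trailing space
clean_type <- trimws(sub("(PDF).*", "", raw_type))

# Parse day.month.year strings into Date objects
clean_date <- as.Date(raw_date, format = "%d.%m.%Y")

# Starred questions use the AS prefix, unstarred ones use AU
base_url <- "http://164.100.24.220/loksabhaquestions/annex/172/"
link <- paste0(base_url, ifelse(clean_type == "STARRED", "AS", "AU"), q_no, ".pdf")

print(data.frame(clean_type, clean_date, link))
```

Checking a handful of rows this way is a cheap guard against a pattern that silently matches too much or too little.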
Let us see which ministry was asked the most questions in this session.
# Creating the dataset
ministry <- questions %>% group_by(Ministry) %>%
summarise(Count = n()) %>%
arrange(desc(Count)) %>%
mutate(Ministry = factor(Ministry, levels = rev(Ministry)))
# Creating the plot
(
ggplot(ministry, aes(x = Ministry, y = Count, fill = Ministry)) + geom_bar(stat = 'identity') +
geom_hline(yintercept = 0, size = 1, colour="#333333") +
scale_fill_manual(values = rev(colorRampPalette(brewer.pal(8, "Dark2"))(length(ministry$Ministry)))) +
bbc_style() +
coord_flip() +
theme(legend.position = "none",
axis.title = element_text(size = 18),
panel.grid.major.x = element_line(color="#cbcbcb"),
panel.grid.major.y=element_blank()) +
labs(title = "Questions asked by each ministry",
subtitle = "Winter Session of 17th Lok Sabha",
y = "Number of questions") +
geom_label(aes(x = Ministry, y = Count, label = Count),
hjust = 1,
vjust = 0.5,
colour = "white",
fill = NA,
label.size = NA,
family="Helvetica",
size = 6)
) %>% finalise_plot(source_name = "Data: Lok Sabha; Created by Lakshya Agarwal",
save_filepath = 'graphs/QuestionsByMinistry.jpg',
width = 1920, height = 1080)
That is a lot of code for the graph below, so we shall go over it block by block.
# Creating the dataset
ministry <- questions %>% group_by(Ministry) %>%
summarise(Count = n()) %>%
arrange(desc(Count)) %>%
mutate(Ministry = factor(Ministry, levels = rev(Ministry)))
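First, I create a new dataframe that consists of the number of questions asked to each ministry, and transform the Ministry column into a factor.
The process is:
- Group the questions by ministry - using group_by()
- Use summarise to get the number of occurrences - using n()
- Sort the ministries in descending order of count - using arrange()
- Convert Ministry into a factor whose levels are the sorted names in reverse - using mutate()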
Ministry | Count |
---|---|
HEALTH AND FAMILY WELFARE | 332 |
RAILWAYS | 288 |
HUMAN RESOURCE DEVELOPMENT | 266 |
ENVIRONMENT, FORESTS AND CLIMATE CHANGE | 219 |
AGRICULTURE AND FARMERS WELFARE | 200 |
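
The reversed factor levels matter because ggplot2 orders a discrete axis by factor level, and after coord_flip() the first level is drawn at the bottom of the chart. A small sketch with the top three rows of the table above (toy is a hypothetical name used only here) shows the effect:

```r
# Top three rows of the summary table, already sorted by descending count
toy <- data.frame(
  Ministry = c("HEALTH AND FAMILY WELFARE", "RAILWAYS", "HUMAN RESOURCE DEVELOPMENT"),
  Count = c(332, 288, 266),
  stringsAsFactors = FALSE
)

# Same idea as the mutate() step: levels are the sorted names, reversed
toy$Ministry <- factor(toy$Ministry, levels = rev(toy$Ministry))

# The ministry with the fewest questions is now the first level,
# so the one with the most questions gets the highest position
print(levels(toy$Ministry))
print(as.integer(toy$Ministry))
```

Without the rev(), the largest bar would end up at the bottom of the flipped chart instead of the top.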
Next, we look at the code for creating the plot.
# Creating the plot
ggplot(ministry, aes(x = Ministry, y = Count, fill = Ministry)) + geom_bar(stat = 'identity') +
geom_hline(yintercept = 0, size = 1, colour="#333333") +
scale_fill_manual(values = rev(colorRampPalette(brewer.pal(8, "Dark2"))(length(ministry$Ministry)))) +
bbc_style() +
coord_flip() +
# Removing legend, showing axis labels and adding the title
theme(legend.position = "none",
axis.title = element_text(size = 18),
panel.grid.major.x = element_line(color="#cbcbcb"),
panel.grid.major.y=element_blank()) +
labs(title = "Questions asked by each ministry",
subtitle = "Winter Session of 17th Lok Sabha",
y = "Number of questions") +
# Showing the number of questions in the graph
geom_label(aes(x = Ministry, y = Count, label = Count),
hjust = 1,
vjust = 0.5,
colour = "white",
fill = NA,
label.size = NA,
family="Helvetica",
size = 6)
The first section is the basic ggplot2 code for creating a horizontal bar chart, with manual colors. Of particular importance here is bbc_style().
bbc_style() (and subsequently, finalise_plot()) is a function from bbplot that makes the chart components follow BBC style, while allowing room for further manual customization. For more information, visit the bbplot GitHub repo.
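The manual colors come from stretching the 8-color Dark2 palette to one color per ministry. As a rough sketch of that step in isolation (n_ministries is a hypothetical stand-in for length(ministry$Ministry), set to 54 purely for illustration):

```r
library(RColorBrewer)

n_ministries <- 54  # hypothetical value, for illustration only

# brewer.pal() returns the 8 base colours of the Dark2 palette
base_cols <- brewer.pal(8, "Dark2")

# colorRampPalette() returns a function that interpolates between them
stretch <- colorRampPalette(base_cols)

# Reversed, as in the plot code, to line up with the reversed factor levels
fill_cols <- rev(stretch(n_ministries))

print(length(fill_cols))  # one colour per ministry
```

Interpolating is needed here because brewer.pal() cannot supply more colours than the palette defines, and Dark2 has only 8.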
Then, in the next two sections, I remove the legend, show the axis labels, and add a title and subtitle to the plot. After that, I add the number of questions as a label in the chart to make it easier to read.
finalise_plot() simply packages the graphic, adds a footnote and resizes it, producing an image: QuestionsByMinistry.jpg
Now that we have understood how to create the above graphic, let us take a moment to interpret it.
Health and Family Welfare leads the race with 332 questions asked to it, with Railways trailing at 288.
It seems that most of the core ministries, such as Human Resource Development, Environment, Finance, and Road Transport and Highways, were asked questions in this session, which does indicate that the House debated some pertinent topics. However, a closer analysis of the subjects of these questions is required.
Another interesting way to look at the performance of the Session would be to see whether there was a shift in focus of Lok Sabha Members from asking questions to one Ministry to another during the Session.
We can visualise this through a bump chart, which plots the ranking of entities over time. The focus here is usually on comparing the position or performance of multiple observations with respect to each other rather than the actual values themselves. (From R-bloggers)
# Has there been a shift in focus from one ministry to another (by week)?
# Creating the dataset
shift_data <- questions %>% group_by(Week = floor_date(Date, 'week'), Ministry) %>%
summarise(count = n()) %>%
top_n(5, wt = count) %>%
group_by(Week) %>%
arrange(Week, desc(count)) %>%
mutate(rank = row_number()) %>%
filter(rank <= 5) %>%
ungroup()
# Creating the plot
(
ggplot(shift_data, aes(x=Week, y=rank, group = Ministry)) +
geom_line(aes(color = Ministry), size = 2) +
geom_point(aes(color = Ministry), size = 5) +
scale_y_reverse(breaks = 1:5) +
bbc_style() +
scale_color_tableau() +
theme(legend.position = 'right',
axis.title = element_text(size = 18),
panel.grid.major.y = element_line(color="#cbcbcb"),
panel.grid.major.x=element_blank()) +
labs(title = "Top 5 ministries by number of questions",
subtitle = "Winter Session of 17th Lok Sabha",
y = "Rank",
x = "Week")
) %>% finalise_plot(source_name = "Data: Lok Sabha; Created by Lakshya Agarwal",
save_filepath = 'graphs/RankingMinistry.jpg',
width = 1600, height = 900)
Let’s go over the code block by block again.
# Creating the dataset
shift_data <- questions %>% group_by(Week = floor_date(Date, 'week'), Ministry) %>%
summarise(count = n()) %>%
top_n(5, wt = count) %>%
group_by(Week) %>%
arrange(Week, desc(count)) %>%
mutate(rank = row_number()) %>%
filter(rank <= 5) %>%
ungroup()
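Here, we want to rank the ministries by the number of questions asked to them in each week. So, our workflow is to convert the data from daily to weekly, count the number of questions, assign ranks and plot the result. To create the dataset, we follow this process:
- Group the questions by week and ministry - using group_by() and floor_date()
- Use summarise to get the count of questions - using n()
- Keep the five most-asked ministries in each week - using top_n()
- Sort by week and descending count - using arrange()
- Assign a rank within each week - using mutate() and row_number()
- Keep only ranks 1 to 5, since top_n() retains ties - using filter()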
Week | Ministry | count | rank |
---|---|---|---|
2019-11-17 | AGRICULTURE AND FARMERS WELFARE | 85 | 1 |
2019-11-17 | RAILWAYS | 85 | 2 |
2019-11-17 | HEALTH AND FAMILY WELFARE | 81 | 3 |
2019-11-17 | HUMAN RESOURCE DEVELOPMENT | 62 | 4 |
2019-11-17 | ENVIRONMENT, FORESTS AND CLIMATE CHANGE | 59 | 5 |
2019-11-24 | HEALTH AND FAMILY WELFARE | 92 | 1 |
2019-11-24 | HUMAN RESOURCE DEVELOPMENT | 79 | 2 |
2019-11-24 | RAILWAYS | 79 | 3 |
2019-11-24 | ENVIRONMENT, FORESTS AND CLIMATE CHANGE | 55 | 4 |
2019-11-24 | ROAD TRANSPORT AND HIGHWAYS | 45 | 5 |
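
Two details of this pipeline are easy to miss. floor_date() buckets dates into weeks that start on Sunday by default, which is why a session that opened on Monday, 18 November shows up under 2019-11-17. And row_number() breaks ties by row order, which is why Agriculture and Railways get ranks 1 and 2 despite both having 85 questions. A small sketch, with toy as a made-up data frame holding three rows from the table above:

```r
library(dplyr)
library(lubridate)

# week_start = 7 (Sunday) is lubridate's default, written out here for clarity
first_week <- floor_date(as.Date("2019-11-18"), "week", week_start = 7)
print(first_week)

toy <- tibble(
  Ministry = c("AGRICULTURE AND FARMERS WELFARE", "RAILWAYS", "HEALTH AND FAMILY WELFARE"),
  count = c(85, 85, 81)
)

ranked <- toy %>%
  arrange(desc(count)) %>%
  mutate(rank = row_number(),                # ties broken by row order
         tied_rank = min_rank(desc(count)))  # ties share the same rank

print(ranked)
```

If tied ministries should share a position in the bump chart, min_rank() would be one option; the code in this article uses row_number(), so every rank from 1 to 5 appears exactly once per week.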
Next, we look at the code for creating the plot.
# Creating the plot
ggplot(shift_data, aes(x = Week, y = rank, group = Ministry)) +
geom_line(aes(color = Ministry), size = 2) +
geom_point(aes(color = Ministry), size = 5) +
scale_y_reverse(breaks = 1:5) +
bbc_style() +
scale_color_tableau() +
# Adding legend, showing axis labels and adding the title
theme(legend.position = 'right',
axis.title = element_text(size = 18),
panel.grid.major.y = element_line(color="#cbcbcb"),
panel.grid.major.x=element_blank()) +
labs(title = "Top 5 ministries by number of questions",
subtitle = "Winter Session of 17th Lok Sabha",
y = "Rank",
x = "Week")
Similar to the previous plot, this one also creates a basic ggplot2 chart, adds bbc_style() and scale_color_tableau(), and then the labels and titles. Pretty standard stuff!
As before, let us interpret this visualisation as well.
Agriculture and Farmers’ Welfare dominated the questions in the first week of the Session, but disappeared in the next two weeks, only to come back at second place in the last week.
Health and Family Welfare jumped to the top spot after the first week and remained there, while Human Resource Development plateaued between 4th and 2nd place. Road Transport and Highways, Home Affairs and Jal Shakti made the top charts at least once.
All in all, this visualisation seems to provide a bit more insight into how the focus shifted from one Ministry to another during the session.
I would now like to move from analysing the Lok Sabha as a whole to looking into each ministry in depth. To this end, we can look at the subject of questions that were asked to a Ministry and gain some insights from that.
To visualise textual data, let’s create a wordcloud from the subject of questions asked.
# Ministry-wise - Questions wordcloud
ministry_wordcloud <- function(ministry){
# Filtering selected ministry and creating the dataset
ministry_q <- questions %>% filter(Ministry == str_to_upper(ministry)) %>%
mutate(Subject = as.character(Subject)) %>%
select(Subject) %>%
unnest_tokens(word, Subject) %>%
anti_join(get_stopwords(), by = 'word') %>%
count(word, sort = T) %>%
head(40)
# Plotting the dataset as a wordcloud
(
ggplot(ministry_q, aes(label = word, color = word, size = n)) +
geom_text_wordcloud_area(rm_outside = T, family = 'Helvetica') +
scale_size_area(max_size = 40) +
bbc_style() +
labs(title = paste0("Ministry of ", ministry, " - Subject of questions asked"),
subtitle = "Winter Session of 17th Lok Sabha")
) %>% finalise_plot(source_name = "Data: Lok Sabha; Created by Lakshya Agarwal",
save_filepath = paste0('graphs/wordcloud/', ministry, '.jpg'),
width = 960, height = 540)
}
The above code creates a function - ministry_wordcloud() - that takes in a ministry name as input and produces a wordcloud of the subjects of questions asked to that ministry. For example, for Health and Family Welfare, we run ministry_wordcloud("Health and Family Welfare"), which gives the wordcloud for that ministry.
So, what does the function do? Let’s have a look.
Note: I use two packages here - tidytext and ggwordcloud.
# Creating the dataset
ministry_q <- questions %>% filter(Ministry == str_to_upper(ministry)) %>%
mutate(Subject = as.character(Subject)) %>%
select(Subject) %>%
unnest_tokens(word, Subject) %>%
anti_join(get_stopwords(), by = 'word') %>%
count(word, sort = T) %>%
head(40)
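In order to create our dataset for the wordcloud, we need to get the subject of questions asked to the selected Ministry and get the number of times each word appears. The process for doing this is:
- Keep only the questions asked to the selected Ministry - using filter()
- Convert Subject to character and keep only that column - using mutate() and select()
- Split each subject into individual words - using unnest_tokens()
- Remove stop words - using anti_join()
- Count how often each word appears - using count()
- Keep the 40 most frequent words - using head()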
word | n |
---|---|
medical | 32 |
health | 30 |
hospitals | 20 |
ayushman | 18 |
cghs | 18 |
bharat | 17 |
centres | 13 |
food | 13 |
cancer | 12 |
facilities | 12 |
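
To see what the tokenising and stop-word steps do, here is a small sketch on three subjects. The first two are taken from the sample rows earlier in the article; the third is invented for the illustration, as is the name toy_subjects.

```r
library(dplyr)
library(tidytext)

toy_subjects <- tibble(Subject = c("Assault on Doctor",
                                   "Cancer Deaths",
                                   "Cancer Hospitals in the Country"))  # last one is invented

word_counts <- toy_subjects %>%
  unnest_tokens(word, Subject) %>%             # one lower-cased word per row
  anti_join(get_stopwords(), by = "word") %>%  # drops words such as "on", "in", "the"
  count(word, sort = TRUE)                     # most frequent words first

print(head(word_counts, 3))
```

Because unnest_tokens() lower-cases by default, "Cancer" in two different subjects is counted as the same word.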
ggplot(ministry_q, aes(label = word, color = word, size = n)) +
geom_text_wordcloud_area(rm_outside = T, family = 'Helvetica') +
scale_size_area(max_size = 40) +
bbc_style() +
labs(title = paste0("Ministry of ", ministry, " - Subject of questions asked"),
subtitle = "Winter Session of 17th Lok Sabha")
With the dataset created, the wordcloud can be easily generated using geom_text_wordcloud_area(). After that, I add the usual bbc_style() and title, and export it through finalise_plot().
Now that we have understood how the wordcloud was generated, we can turn to interpreting the wordclouds and gleaning information from them.
A majority of the questions to the MHRD were focused on topics such as education, schools, institutes, and “kendriya vidyalayas”. This is in line with the objective of the Ministry to ensure good, affordable, quality education to the citizens of the country.
Questions to the MoEFCC were mainly targeted towards topics such as pollution, waste, forests, air and plastic. This is in line with the extremely poor air quality during the Session and growing concerns over deforestation and plastic waste management.
I have also created an application for the above visualisation that can be accessed here.
I hope this was an informative article. I had a lot of fun working with this dataset and creating visualisations to test out my understanding.
I will upload the source code to my GitHub. If you have any queries, feel free to ask me at lakshyagrwal12@gmail.com or send me a message on Twitter.