1 Introduction

This document serves as an example of analysing the questions that were asked in the Winter Session of the 17th Lok Sabha. The Lok Sabha or House of the People is the lower house of India’s bicameral Parliament.

This Winter Session ran from 18th November, 2019 to 13th December, 2019.

1.1 Importing libraries

First, we import the necessary libraries.

The tidyverse and its associated packages are used to work with tidy data. bbplot, a package by the BBC, is used to style the ggplot2 charts.
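A minimal sketch of the imports this walkthrough relies on. The exact set of packages loaded by the original script is not shown here, so treat this list as an assumption. lubridate is attached explicitly because older tidyverse releases do not attach it.

```r
library(tidyverse)  # dplyr, ggplot2, stringr, forcats and friends
library(lubridate)  # date parsing and floor_date()

# bbplot is not on CRAN; it is installed from GitHub, for example with
# devtools::install_github("bbc/bbplot")
library(bbplot)
```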

2 Working with the data

2.1 Reading the dataset

| Q.NO. | Q.Type | Date | Ministry | Member | Subject |
|---|---|---|---|---|---|
| 3737 | UNSTARRED PDF/WORD PDF/WORD(Hindi) | 11.12.2019 | RAILWAYS | Jai Prakash,Shri | Incidents of Animals Hit by Trains |
| 760 | UNSTARRED PDF/WORD PDF/WORD(Hindi) | 21.11.2019 | HOUSING AND URBAN AFFAIRS | Bhoumik,Ms. Pratima,Patel,Shri Devji Mansingram,Shrangre,Shri Sudhakar Tukaram | Status of NURHP |
| 2401 | UNSTARRED PDF/WORD PDF/WORD(Hindi) | 03.12.2019 | HOME AFFAIRS | Adhikari, Shri Deepak (Dev) | National Fingerprint Database |
| 1324 | UNSTARRED PDF/WORD PDF/WORD(Hindi) | 25.11.2019 | HUMAN RESOURCE DEVELOPMENT | Rajoria,Dr. Manoj | Higher Education |
| 97 | STARRED PDF/WORD PDF/WORD(Hindi) | 22.11.2019 | ENVIRONMENT, FORESTS AND CLIMATE CHANGE | Gogoi,Shri Gaurav | Temporary use of Forest Land |
| 1747 | UNSTARRED PDF/WORD PDF/WORD(Hindi) | 28.11.2019 | JAL SHAKTI | Singh,Shri Shyam Yadav | Abatement of Pollution in Ganga River |
| 950 | UNSTARRED PDF/WORD PDF/WORD(Hindi) | 22.11.2019 | HEALTH AND FAMILY WELFARE | Senthilkumar. S.,Shri DNV,Raut,Shri Vinayak Bhaurao,Barne,Shri Shrirang Appa,Patil,Shri Hemant | Assault on Doctor |
| 1542 | UNSTARRED PDF/WORD PDF/WORD(Hindi) | 27.11.2019 | PLANNING | Singh,Shri Shyam Yadav | Nomenclature of Planning Commission as NITI Aayog |
| 3726 | UNSTARRED PDF/WORD PDF/WORD(Hindi) | 11.12.2019 | RAILWAYS | Kaswan,Shri Rahul | Priority in Reservation under Special Quota |
| 2212 | UNSTARRED PDF/WORD PDF/WORD(Hindi) | 02.12.2019 | HUMAN RESOURCE DEVELOPMENT | Misra,Shri Ajay (Teni) | IIIT |

## 'data.frame':    4740 obs. of  6 variables:
##  $ Q.NO.   : int  380 379 378 377 376 375 374 373 372 371 ...
##  $ Q.Type  : Factor w/ 7 levels "STARRED PDF/WORD",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Date    : Factor w/ 20 levels "02.12.2019","03.12.2019",..: 10 10 10 10 10 10 10 10 10 10 ...
##  $ Ministry: Factor w/ 54 levels "AGRICULTURE AND FARMERS WELFARE",..: 21 21 16 50 16 3 16 21 53 21 ...
##  $ Member  : Factor w/ 1116 levels "","Abbaiah,Shri Narayana Swamy",..: 835 569 794 826 1018 873 548 774 734 925 ...
##  $ Subject : Factor w/ 4569 levels "12th Five Year Plan for TSP",..: 2576 1788 4069 3864 1209 495 675 85 468 433 ...

As can be seen from the above output, the dataset is messy. More specifically,

  • The columns need to be renamed so that they are easier to reference repeatedly.
  • The Date column is stored as a factor rather than a date. Moreover, we need to keep only the questions from the Winter Session.
  • The Q.Type column contains a lot of unnecessary text.

2.2 Cleaning the dataset

We now turn to cleaning the dataset.

| Question Number | Type | Date | Ministry | Member | Subject | Link |
|---|---|---|---|---|---|---|
| 3541 | UNSTARRED | 2019-12-10 | AGRICULTURE AND FARMERS WELFARE | Kuriakose,Adv. Dean | Remunerative Support Price | http://164.100.24.220/loksabhaquestions/annex/172/AU3541.pdf |
| 3274 | UNSTARRED | 2019-12-09 | PETROLEUM AND NATURAL GAS | Bista,Shri Raju | Home Delivery Charges | http://164.100.24.220/loksabhaquestions/annex/172/AU3274.pdf |
| 4033 | UNSTARRED | 2019-12-12 | HOUSING AND URBAN AFFAIRS | Bista,Shri Raju | PMAY Houses in West Bengal | http://164.100.24.220/loksabhaquestions/annex/172/AU4033.pdf |
| 245 | STARRED | 2019-12-05 | CIVIL AVIATION | Chavda,Shri Vinod | Expansion of Airports | http://164.100.24.220/loksabhaquestions/annex/172/AS245.pdf |
| 4247 | UNSTARRED | 2019-12-13 | HEALTH AND FAMILY WELFARE | Hegde,Shri Anantkumar ,Reddy,Shri Komati Reddy Venkat,Kanumuru,Shri Raghu Ramakrishna Raju | Cancer Deaths | http://164.100.24.220/loksabhaquestions/annex/172/AU4247.pdf |
| 3813 | UNSTARRED | 2019-12-11 | RAILWAYS | Kodikunnil,Shri Suresh | Facilities for Chengannur Station | http://164.100.24.220/loksabhaquestions/annex/172/AU3813.pdf |
| 1005 | UNSTARRED | 2019-11-22 | AYURVEDA,YOGA & NATUROPATHY,UNANI,SIDDHA AND HOMEOPATHY (AYUSH) | Kesineni,Shri Srinivas,Sreekandan,Shri Vellalath Kochukrishnan Nair. | Medicinal Plants | http://164.100.24.220/loksabhaquestions/annex/172/AU1005.pdf |
| 90 | UNSTARRED | 2019-11-18 | PETROLEUM AND NATURAL GAS | Ariff,Adv. Abdul Majeed | Profits of Public Sector Oil Companies | http://164.100.24.220/loksabhaquestions/annex/172/AU90.pdf |
| 2074 | UNSTARRED | 2019-12-02 | HUMAN RESOURCE DEVELOPMENT | Kumar,Shri Kaushalendra | Primary Education | http://164.100.24.220/loksabhaquestions/annex/172/AU2074.pdf |
| 283 | UNSTARRED | 2019-11-19 | CHEMICALS AND FERTILIZERS | Mohan,Shri P. C. | BIS Standards for Chemicals | http://164.100.24.220/loksabhaquestions/annex/172/AU283.pdf |

Note that I have also added a link to the actual question and the subsequent answer to that question. The files on the Lok Sabha server follow a pattern, which makes the task very simple.
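The cleaning steps can be sketched roughly as follows. This is not the original script: the `raw` data frame is a stand-in built from two rows of the sample above plus one invented out-of-session row, and the link pattern (`AS` for starred, `AU` for unstarred, followed by the question number) is inferred from the URLs in the cleaned table.

```r
library(dplyr)
library(stringr)
library(lubridate)

# Stand-in for the raw data: two real sample rows and one invented row
# dated outside the Winter Session (hypothetical, for illustration only)
raw <- data.frame(
  Q.NO.    = c(3737L, 97L, 1L),
  Q.Type   = c("UNSTARRED PDF/WORD PDF/WORD(Hindi)",
               "STARRED PDF/WORD PDF/WORD(Hindi)",
               "STARRED PDF/WORD PDF/WORD(Hindi)"),
  Date     = c("11.12.2019", "22.11.2019", "25.06.2019"),
  Ministry = c("RAILWAYS", "ENVIRONMENT, FORESTS AND CLIMATE CHANGE", "RAILWAYS"),
  Member   = c("Jai Prakash,Shri", "Gogoi,Shri Gaurav", "Hypothetical,Shri Member"),
  Subject  = c("Incidents of Animals Hit by Trains",
               "Temporary use of Forest Land",
               "Hypothetical subject"),
  stringsAsFactors = FALSE
)

# Base URL taken from the links shown in the cleaned table
base_url <- "http://164.100.24.220/loksabhaquestions/annex/172/"

questions <- raw %>%
  rename(`Question Number` = Q.NO., Type = Q.Type) %>%
  mutate(
    Date = dmy(Date),        # "11.12.2019" becomes a Date
    Type = word(Type, 1),    # keep only STARRED / UNSTARRED
    Link = paste0(base_url,
                  if_else(Type == "STARRED", "AS", "AU"),
                  `Question Number`, ".pdf")
  ) %>%
  # Keep only questions asked during the Winter Session
  filter(Date >= ymd("2019-11-18"), Date <= ymd("2019-12-13"))
```

The invented June row is dropped by the date filter, leaving the two in-session questions with their links.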

3 Data analysis and visualization

3.1 Number of questions to Ministries

Let us see which ministry was asked the most questions in this session.

That is a lot of code for a single graph, so we shall go over it block by block.

3.1.1 Understanding the code

First, I create a new dataframe that consists of the number of questions asked to each ministry, and transform the Ministry column into a factor.

The process is:

  1. Group the dataset by Ministry - using group_by()
  2. summarise to get the number of occurrences - using n()
  3. Sort the dataframe in descending order of Count - using arrange()
  4. Convert Ministry into a factor - using mutate()

This gives us the following output:

| Ministry | Count |
|---|---|
| HEALTH AND FAMILY WELFARE | 332 |
| RAILWAYS | 288 |
| HUMAN RESOURCE DEVELOPMENT | 266 |
| ENVIRONMENT, FORESTS AND CLIMATE CHANGE | 219 |
| AGRICULTURE AND FARMERS WELFARE | 200 |
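The four steps listed above can be sketched as below. The `questions` data frame here is a tiny made-up stand-in for the cleaned dataset, and reversing the factor levels is my assumption about how the bars are kept in order on a horizontal chart; the original code may do this differently.

```r
library(dplyr)

# Toy stand-in for the cleaned dataset (only the Ministry column is needed)
questions <- data.frame(
  Ministry = c("RAILWAYS", "HEALTH AND FAMILY WELFARE", "RAILWAYS",
               "HEALTH AND FAMILY WELFARE", "HEALTH AND FAMILY WELFARE",
               "JAL SHAKTI"),
  stringsAsFactors = FALSE
)

ministry_counts <- questions %>%
  group_by(Ministry) %>%
  summarise(Count = n()) %>%      # number of questions per ministry
  arrange(desc(Count)) %>%        # largest first
  # Fix the level order so a flipped bar chart shows the largest bar on top
  mutate(Ministry = factor(Ministry, levels = rev(Ministry)))
```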

Next, we look at the code for creating the plot.

The first section is the basic ggplot2 code for creating a horizontal bar chart, with manual colors. Of particular importance here is: bbc_style()

bbc_style() and, subsequently, finalise_plot() are functions from bbplot that make the chart components follow the BBC style, while leaving room for further manual customization. For more information, visit the bbplot GitHub repo.

Then, in the next two sections, I remove the legend, show the axis labels, and add a title and subtitle to the plot. After that, I add the number of questions as a label in the chart to make it easier to read.

finalise_plot() simply packages the graphic, adds a footnote and resizes it - producing an image - QuestionsByMinistry.jpg
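A rough sketch of such a chart, using the five counts from the table above. The colour, title, subtitle and label placement are my own placeholders rather than the original choices, and the sketch falls back to theme_minimal() when bbplot is not installed. The finalise_plot() call is left commented out; its argument names are written as I recall them from the bbplot README, so check them against your installed version.

```r
library(ggplot2)

# Top five counts copied from the table earlier in this section
ministry_counts <- data.frame(
  Ministry = c("HEALTH AND FAMILY WELFARE", "RAILWAYS",
               "HUMAN RESOURCE DEVELOPMENT",
               "ENVIRONMENT, FORESTS AND CLIMATE CHANGE",
               "AGRICULTURE AND FARMERS WELFARE"),
  Count = c(332, 288, 266, 219, 200)
)
ministry_counts$Ministry <- factor(ministry_counts$Ministry,
                                   levels = rev(ministry_counts$Ministry))

# Use the BBC theme if available, otherwise a plain ggplot2 theme
house_style <- if (requireNamespace("bbplot", quietly = TRUE)) {
  bbplot::bbc_style()
} else {
  theme_minimal()
}

p <- ggplot(ministry_counts, aes(x = Ministry, y = Count)) +
  geom_col(fill = "#1380A1") +            # placeholder manual colour
  coord_flip() +                          # horizontal bars
  house_style +
  theme(legend.position = "none") +       # theme tweaks go after the style
  labs(title = "Questions by Ministry",   # placeholder title and subtitle
       subtitle = "Winter Session of the 17th Lok Sabha") +
  geom_text(aes(label = Count), hjust = 1.2, colour = "white")

# finalise_plot(plot_name = p,
#               source_name = "Source: Lok Sabha",
#               save_filepath = "QuestionsByMinistry.jpg")
```

Adding theme() after bbc_style() matters: bbc_style() sets its own legend position, so any manual override has to come later in the chain.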

3.1.2 Understanding the graphic

Now that we have understood how to create the above graphic, let us take a moment to interpret it.

Health and Family Welfare leads the race with 332 questions asked to it, with Railways trailing at 288.

It seems that most of the core ministries, such as Human Resource Development, Environment, Finance, and Road Transport and Highways, were asked questions in this session, which indicates that the House debated some pertinent topics. However, a closer analysis of the subjects of these questions is required.

3.2 Weekly ranking of Ministries

Another interesting way to look at the performance of the Session would be to see whether there was a shift in focus of Lok Sabha Members from asking questions to one Ministry to another during the Session.

We can visualise this through a bump chart, which plots the ranking of entities over time. The focus here is usually on comparing the position or performance of multiple observations with respect to each other, rather than on the actual values themselves (from R-bloggers).

Let’s go over the code block by block again.

3.2.1 Understanding the code

Here, we want to create rankings for each ministry based on the number of questions asked to them in each week. So, our workflow would be to convert the data from daily to weekly, count the number of questions, assign ranks and plot it. To create the dataset, we follow this process:

  1. Group the dataset by Week and Ministry - using group_by() and floor_date()
  2. summarise to get the count of number of questions - using n()
  3. Keep only the top 5 ministries in each week - using top_n()
  4. Sort by descending order of the number of questions asked in each week - using arrange()
  5. Add a rank column for each week and ministry - using mutate() and row_number()
  6. Remove any ministry with overlapping ranks - using filter()

This gives us the following output:

| Week | Ministry | count | rank |
|---|---|---|---|
| 2019-11-17 | AGRICULTURE AND FARMERS WELFARE | 85 | 1 |
| 2019-11-17 | RAILWAYS | 85 | 2 |
| 2019-11-17 | HEALTH AND FAMILY WELFARE | 81 | 3 |
| 2019-11-17 | HUMAN RESOURCE DEVELOPMENT | 62 | 4 |
| 2019-11-17 | ENVIRONMENT, FORESTS AND CLIMATE CHANGE | 59 | 5 |
| 2019-11-24 | HEALTH AND FAMILY WELFARE | 92 | 1 |
| 2019-11-24 | HUMAN RESOURCE DEVELOPMENT | 79 | 2 |
| 2019-11-24 | RAILWAYS | 79 | 3 |
| 2019-11-24 | ENVIRONMENT, FORESTS AND CLIMATE CHANGE | 55 | 4 |
| 2019-11-24 | ROAD TRANSPORT AND HIGHWAYS | 45 | 5 |
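The six steps above might look something like this. The `questions` data frame is a small made-up stand-in, and the final filter(rank <= 5) is only my guess at what "removing overlapping ranks" means (dropping the extra rows that top_n() keeps when counts are tied); the original filter may differ.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for the cleaned dataset: nine questions over two weeks
questions <- data.frame(
  Date = as.Date(c(rep("2019-11-18", 3), rep("2019-11-19", 2), "2019-11-21",
                   rep("2019-11-25", 2), "2019-11-27")),
  Ministry = c(rep("RAILWAYS", 3), rep("HEALTH AND FAMILY WELFARE", 2),
               "JAL SHAKTI",
               rep("HEALTH AND FAMILY WELFARE", 2), "RAILWAYS"),
  stringsAsFactors = FALSE
)

weekly_ranks <- questions %>%
  # floor_date() rounds each date down to the start of its week (Sunday)
  group_by(Week = floor_date(Date, unit = "week"), Ministry) %>%
  summarise(count = n(), .groups = "drop_last") %>%  # still grouped by Week
  top_n(5, count) %>%                # top five per week, ties included
  arrange(Week, desc(count)) %>%
  mutate(rank = row_number()) %>%    # rank within each week
  filter(rank <= 5)                  # assumed handling of tied extras
```

Because floor_date() starts weeks on Sunday by default, questions asked on Monday 18th November fall under the week labelled 2019-11-17, which matches the table above.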

Next, we look at the code for creating the plot.

Similar to the previous plot, this one also creates a basic ggplot2 chart, adds bbc_style() and scale_color_tableau() and the labels and titles. Pretty standard stuff!
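A bare-bones bump chart along these lines, using the two weeks of rankings shown in the table above. The line and point sizes and the title are my own placeholders, and the sketch falls back to plain ggplot2 equivalents when bbplot or ggthemes is not installed.

```r
library(ggplot2)

# Rankings copied from the table earlier in this section
weekly_ranks <- data.frame(
  Week = as.Date(rep(c("2019-11-17", "2019-11-24"), each = 5)),
  Ministry = c("AGRICULTURE AND FARMERS WELFARE", "RAILWAYS",
               "HEALTH AND FAMILY WELFARE", "HUMAN RESOURCE DEVELOPMENT",
               "ENVIRONMENT, FORESTS AND CLIMATE CHANGE",
               "HEALTH AND FAMILY WELFARE", "HUMAN RESOURCE DEVELOPMENT",
               "RAILWAYS", "ENVIRONMENT, FORESTS AND CLIMATE CHANGE",
               "ROAD TRANSPORT AND HIGHWAYS"),
  rank = rep(1:5, times = 2)
)

# Optional packages: fall back to built-in alternatives if missing
house_style <- if (requireNamespace("bbplot", quietly = TRUE)) {
  bbplot::bbc_style()
} else {
  theme_minimal()
}
colour_scale <- if (requireNamespace("ggthemes", quietly = TRUE)) {
  ggthemes::scale_color_tableau()
} else {
  scale_colour_brewer(palette = "Dark2")
}

bump <- ggplot(weekly_ranks,
               aes(x = Week, y = rank, colour = Ministry, group = Ministry)) +
  geom_line(linewidth = 1.5) +      # linewidth needs ggplot2 3.4 or newer
  geom_point(size = 4) +
  scale_y_reverse(breaks = 1:5) +   # rank 1 at the top
  house_style +
  colour_scale +
  labs(title = "Weekly ranking of Ministries")  # placeholder title
```

Reversing the y axis is what makes this read as a ranking: the ministry with the most questions sits at the top of the chart.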

3.2.2 Understanding the graphic

As before, let us interpret this visualisation as well.

Agriculture and Farmers’ Welfare dominated the questions in the first week of the Session, but disappeared in the next two weeks, only to come back at second place in the last week.

Health and Family Welfare jumped to the top spot after the first week and remained there, while Human Resource Development created a plateau at the 4th and 2nd place. Road Transport and Highways, Home Affairs and Jal Shakti made the top charts at least once.

All in all, this visualisation seems to provide a bit more insight into how the focus shifted from one Ministry to another during the session.

3.3 Subject of questions to Ministries

I would now like to move from analysing the Lok Sabha as a whole to looking into each ministry in depth. To this end, we can look at the subject of questions that were asked to a Ministry and gain some insights from that.

3.3.1 Creating a wordcloud

To visualise textual data, let’s create a wordcloud from the subject of questions asked.

The above code creates a function - ministry_wordcloud() - that takes in a Ministry Name as an input and produces a wordcloud of the subject of questions asked to that ministry. For example, for Health and Family Welfare, we run

which gives

3.3.1.1 Understanding the function

So, what does the function do? Let’s have a look.

Note: I use two packages here - tidytext and ggwordcloud.

3.3.1.1.1 Creating the dataset

In order to create our dataset for the wordcloud, we need to get the subject of questions asked to the selected Ministry and get the number of times each word appears. The process for doing this is:

  1. Filter the selected ministry - using filter()
  2. Select the Subject column, after converting it into character - using select()
  3. Generate a list of words from the Subject column - using unnest_tokens()
  4. Remove the stopwords - using anti_join()
  5. Count the number of times each word occurs and sort it in descending order - using count()
  6. Select the top 40 words - using head()

This gives us the following dataset (created for Health and Family Welfare, trimmed to 10 words):

| word | n |
|---|---|
| medical | 32 |
| health | 30 |
| hospitals | 20 |
| ayushman | 18 |
| cghs | 18 |
| bharat | 17 |
| centres | 13 |
| food | 13 |
| cancer | 12 |
| facilities | 12 |

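The word-count pipeline described above can be sketched as follows. The `questions` data frame is a toy stand-in: two of the Health subjects come from the sample rows earlier in this article, while "Cancer Screening" is invented so that one word occurs twice. The stop word list is tidytext's built-in stop_words table, which I assume is what the original code joins against.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the cleaned dataset ("Cancer Screening" is invented)
questions <- data.frame(
  Ministry = c("HEALTH AND FAMILY WELFARE", "HEALTH AND FAMILY WELFARE",
               "HEALTH AND FAMILY WELFARE", "RAILWAYS"),
  Subject = factor(c("Assault on Doctor", "Cancer Deaths",
                     "Cancer Screening",
                     "Incidents of Animals Hit by Trains"))
)

word_counts <- questions %>%
  filter(Ministry == "HEALTH AND FAMILY WELFARE") %>%
  mutate(Subject = as.character(Subject)) %>%   # factor to character
  select(Subject) %>%
  unnest_tokens(word, Subject) %>%              # one lower-case word per row
  anti_join(stop_words, by = "word") %>%        # drop words such as "on"
  count(word, sort = TRUE) %>%                  # frequency, descending
  head(40)                                      # keep at most 40 words
```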
3.3.1.1.2 Creating the wordcloud

With the dataset created, the wordcloud can be easily generated using geom_text_wordcloud_area(). After that, I add the usual bbc_style() and title and export it through finalise_plot().
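A minimal version of the plotting step, using the ten Health and Family Welfare words from the table above. The maximum text size and the title are my own placeholders, and the styling and export steps are left out to keep the sketch short.

```r
library(ggplot2)
library(ggwordcloud)

# Word counts copied from the table earlier in this section
word_counts <- data.frame(
  word = c("medical", "health", "hospitals", "ayushman", "cghs",
           "bharat", "centres", "food", "cancer", "facilities"),
  n = c(32, 30, 20, 18, 18, 17, 13, 13, 12, 12)
)

cloud <- ggplot(word_counts, aes(label = word, size = n)) +
  geom_text_wordcloud_area() +       # word area scales with the count
  scale_size_area(max_size = 20) +   # placeholder upper bound on text size
  labs(title = "Health and Family Welfare")  # placeholder title
```

geom_text_wordcloud_area() scales the area of each word, rather than its font height, with the count, which keeps long words from looking more important than they are.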

3.3.2 Understanding the wordcloud

Now that we have understood how the wordcloud was generated, we can turn to interpreting the wordclouds and gleaning information from them.

3.3.2.1 Ministry of Human Resource Development

A majority of the questions to the MHRD were focused on topics such as education, schools, institutes, and “kendriya vidyalayas”. This is in line with the objective of the Ministry to ensure good, affordable, quality education to the citizens of the country.

3.3.2.2 Ministry of Environment, Forest and Climate Change

Questions to the MoEFCC were mainly targeted towards topics such as pollution, waste, forests, air and plastic. This is in line with the extremely poor air quality during the Session and growing concerns over deforestation and plastic waste management.

3.3.3 Shiny app

I have also created an application for the above visualisation that can be accessed here.

4 Conclusion

I hope this was an informative article. I had a lot of fun working with this dataset and creating visualisations to test out my understanding.
I will upload the source code to my GitHub. If you have any queries, feel free to ask me at or send me a message on Twitter.