Predicting upper respiratory tract admissions given weather information

This idea came about as a project deliverable for one of the data science courses I was taking. The aim was to identify whether there was any relationship between the number of rainy days observed in a month and the number of polyclinic admissions for upper respiratory tract infections.

TLDR: Mom tells me, “Don’t get caught in the rain or you will get sick,” but do the numbers agree?

Results: quoting Jack Sparrow, it's not the destination so much as the journey that matters.

The simple decision-tree-based model achieved about 50% accuracy in predicting the number of cases (generalized to low, moderate and high rather than an absolute number of cases). But there was much to learn from the journey.

I used data sets available from Data.gov.sg, a store of Singapore’s public data.

1. Average Daily Polyclinic Attendances for Selected Diseases: This data set gives the average daily number of polyclinic attendances in Singapore for each of the selected diseases. https://data.gov.sg/dataset/average-daily-polyclinic-attendances-selected-diseases?view_id=8fb8637d-c1c5-4c5e-9fbe-3f46785804b7&resource_id=dd4dcaac-aa8d-49de-a96a-b809f8d3ae0d

Timeframe: January 1, 2012 to January 3, 2021

Frequency: Weekly

CSV format

2. Rainfall (Monthly Number of Rain Days): This data set indicates the number of days it rains in Singapore each month. https://data.gov.sg/dataset/rainfall-monthly-number-of-rain-days?view_id=8d575155-ac43-4a7b-8b0a-45e2c697c92c&resource_id=8b94f596-91fd-4545-bf9e-7a426493b674

Timeframe: January 1, 1982 to February 28, 2021

Frequency: Monthly

CSV format

3. Relative Humidity (Monthly Mean): This data set indicates the monthly average relative humidity, as a percentage, in Singapore. https://data.gov.sg/dataset/relative-humidity-monthly-mean

Timeframe: January 1, 1982 to February 28, 2021

Frequency: Monthly mean

CSV format

I added another data set later on, which described the monthly average of sunlight.

Preparing the data was actually the hardest and most painful phase. The datasets covered unequal timeframes, which I thought was not such a huge problem. However, I then realized that the respiratory dataset ran from week 1 to week 52 (or 53) in a given year. These were epidemiological weeks, each starting on a Sunday and ending on a Saturday. The weather datasets, on the other hand, were organized by the 12 months of the year.

Now I had to figure out how many weeks there were in each month of a given year, and NO, it's not always 4 weeks! I have described the function below; it's in Python, so be mindful of the indentation.

Image: the code I wrote for working out the number of weeks in each month of a given year. I don't know how to include code in Medium yet.
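Since the original function only exists as an image, here is a minimal sketch of one way to do this alignment. It is not necessarily the approach in the image. It assumes the CDC/MMWR-style convention (week 1 is the Sunday-to-Saturday week containing January 4), and it assumes each week is assigned to the month containing its midpoint (Wednesday). The function names `epi_week_start` and `weeks_per_month` are made up for this illustration.

```python
from collections import Counter
from datetime import date, timedelta


def epi_week_start(year, week):
    """Return the Sunday that opens the given epidemiological week.

    Assumption: week 1 is the Sunday-to-Saturday week containing
    January 4, i.e. the first week with at least 4 days in the year.
    """
    jan4 = date(year, 1, 4)
    # date.weekday() gives Monday=0 ... Sunday=6; shift so Sunday=0.
    days_since_sunday = (jan4.weekday() + 1) % 7
    week1_start = jan4 - timedelta(days=days_since_sunday)
    return week1_start + timedelta(weeks=week - 1)


def weeks_per_month(year):
    """Count how many epidemiological weeks fall in each month of a year.

    Assumption: a week belongs to the month containing its Wednesday,
    which is the month holding the majority of its days.
    """
    counts = Counter()
    next_year_start = epi_week_start(year + 1, 1)
    week = 1
    while epi_week_start(year, week) < next_year_start:
        midpoint = epi_week_start(year, week) + timedelta(days=3)
        counts[midpoint.month] += 1
        week += 1
    return {month: counts[month] for month in range(1, 13)}


if __name__ == "__main__":
    for month, n_weeks in weeks_per_month(2012).items():
        print(month, n_weeks)
```

With these assumptions, each month gets either 4 or 5 weeks, and a year totals 52 or 53. A different assignment rule (for example, by the week's first day) would shift some weeks to a neighbouring month.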

The number of respiratory cases in 2020 was so low that it was an outlier, so I excluded that year from my model. After aligning the datasets to a monthly timeframe, to my horror I found that the number of remaining data points was very small. I had little confidence in a model predicting absolute numbers from such a small training dataset, so I split the targets into three ranges of respiratory cases: ‘low’, ‘moderate’ and ‘high’.
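The exact cut-offs for the three ranges are not given above, so the following is only a sketch of one common choice: splitting at the tertiles, so that each label covers roughly a third of the months. The function name `label_cases` and the sample numbers are invented for illustration and are not taken from the actual dataset.

```python
from statistics import quantiles


def label_cases(monthly_cases):
    """Label each monthly case count as 'low', 'moderate' or 'high'.

    Assumption: the boundaries are the tertiles of the data, so the
    three classes end up roughly the same size.
    """
    lower, upper = quantiles(monthly_cases, n=3)  # two cut points
    labels = []
    for cases in monthly_cases:
        if cases <= lower:
            labels.append("low")
        elif cases <= upper:
            labels.append("moderate")
        else:
            labels.append("high")
    return labels


if __name__ == "__main__":
    # Made-up monthly averages, purely for demonstration.
    sample = [2100, 2500, 2900, 3100, 2700, 2300, 3300, 2600, 3000]
    for cases, label in zip(sample, label_cases(sample)):
        print(cases, label)
```

Balanced classes make an accuracy figure easier to interpret, since always guessing one of three equally sized classes would be right about a third of the time.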

This is my first attempt at documenting my journey. I hope to improve one project at a time!

This is where I try to use numbers to tell a story