Predicting upper respiratory tract admissions given weather information

This idea came about as a project deliverable for one of the data science courses I was taking. The aim was to identify whether there was any relationship between the number of rainy days observed in a month and the number of polyclinic admissions for upper respiratory tract infections.

TLDR: Mom tells me don’t get caught in the rain or you will get sick, but do the numbers agree?

Results: quoting Jack Sparrow, it’s not the destination so much as the journey that matters.

The simple decision-tree-based model achieved about 50% accuracy in predicting the number of cases (generalized to low, moderate and high rather than an absolute number of cases). But there was much to learn from the journey.

I used data sets available from a store of Singapore’s public data.

1. Average Daily Polyclinic Attendances for Selected Diseases: the average daily number of polyclinic attendances in Singapore for each of a set of selected diseases.

Timeframe: January 1, 2012 to January 3, 2021

Frequency: Weekly

CSV format

2. Rainfall (Monthly Number of Rain Days): the number of days it rains in Singapore in each month.

Timeframe: January 1, 1982 to February 28, 2021

Frequency: Monthly

CSV format

3. Relative Humidity (Monthly Mean): the monthly average relative humidity in Singapore, as a percentage.

Timeframe: January 1, 1982 to February 28, 2021

Frequency: Monthly mean

CSV format

I added another data set later on, which described the monthly average of sunlight.

Preparing the data was actually the hardest and most painful phase. The datasets covered unequal timeframes, which I thought was not such a huge problem. However, I then realized that the respiratory dataset was indexed by week, from week 1 to week 52 (or 53) of a given year. These are epidemiological weeks, which start on a Sunday and end on a Saturday. The weather datasets, on the other hand, were indexed by the 12 months of the year.

Now I had to figure out how many weeks there were in each month of a given year, and NO, it’s not always 4 weeks! I have described the function below. It’s in Python, so be mindful of the indentation.

Image of code I wrote for aligning the number of weeks in a month given a year. I don’t know how to include code in Medium yet.
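For readers who want text they can copy, here is a minimal sketch of one way to count the weeks in each month. It is not necessarily identical to the function in the image. It rests on two assumptions: week 1 is the Sunday to Saturday week that contains 4 January (so at least four of its days fall in the new year), and each week is assigned to the month that contains its Wednesday, which is the month holding most of its days. The names `epi_week1_start` and `weeks_per_month` are made up for this example.

```python
from collections import Counter
from datetime import date, timedelta


def epi_week1_start(year: int) -> date:
    """Return the Sunday that starts week 1 of `year`.

    Assumption: week 1 is the Sunday-Saturday week containing 4 January.
    """
    jan4 = date(year, 1, 4)
    # date.weekday() gives Monday=0 ... Sunday=6,
    # so the number of days since the last Sunday is (weekday + 1) % 7.
    return jan4 - timedelta(days=(jan4.weekday() + 1) % 7)


def weeks_per_month(year: int) -> dict:
    """Count how many Sunday-Saturday weeks are assigned to each month.

    Assumption: a week belongs to the month that contains its Wednesday.
    """
    week_start = epi_week1_start(year)
    next_year_start = epi_week1_start(year + 1)
    counts = Counter()
    while week_start < next_year_start:
        midweek = week_start + timedelta(days=3)  # the Wednesday of this week
        counts[midweek.month] += 1
        week_start += timedelta(days=7)
    # Return months 1..12 in calendar order.
    return {month: counts[month] for month in range(1, 13)}
```

Under these assumptions every month gets either 4 or 5 weeks, and the monthly counts add up to 52 for 2019 and 53 for 2020. With the counts in hand, the weekly figures that fall in a month can be averaged to get one value per month.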

The number of respiratory cases in 2020 was so low that it was an outlier, so I excluded that year from my model. After aligning the datasets to a monthly timeframe, I found to my horror that the number of remaining data points was very small. I had little confidence in a model predicting absolute numbers from such a small training dataset, so I split the targets into three ranges: ‘low’, ‘moderate’ and ‘high’ respiratory cases.
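The split into three classes can be sketched as follows. This assumes the cut-offs are tertiles of the monthly values, which is an assumption for illustration and not necessarily the thresholds used in the project. The function name `bin_cases` and the sample numbers are made up.

```python
from statistics import quantiles


def bin_cases(values):
    """Label each value 'low', 'moderate' or 'high'.

    Assumption: the two cut-offs are the tertiles of `values`,
    so each class holds roughly a third of the months.
    """
    lower, upper = quantiles(values, n=3)  # two cut points for three groups
    labels = []
    for value in values:
        if value <= lower:
            labels.append("low")
        elif value <= upper:
            labels.append("moderate")
        else:
            labels.append("high")
    return labels


# Illustrative numbers only, not taken from the real dataset.
example_monthly_cases = [10, 20, 30, 40, 50, 60]
example_labels = bin_cases(example_monthly_cases)
```

Tertiles keep the three classes roughly balanced, which matters when there are only a few data points to train on.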

This is my first attempt at documenting my journey. I hope to improve one project at a time!
