Predicting upper respiratory tract admissions given weather information

Washeem Mohamed
3 min read · Apr 16, 2021

This idea came about as a project deliverable for one of my data science courses. The aim was to identify whether there was any relationship between the number of rainy days observed in a month and the number of polyclinic admissions for upper respiratory tract infections.

TL;DR: Mom tells me not to get caught in the rain or I will get sick, but do the numbers agree?

https://github.com/mohamedwasheem/Predict_respiratory_cases_in_sg

Results: quoting Jack Sparrow, it’s not the destination so much as the journey that matters.

The simple decision-tree-based model achieved about 50% accuracy in predicting the number of cases (grouped into low, moderate and high rather than predicted as an absolute number). But there was much to learn from the journey.

The Data

I used data sets available from Data.gov.sg, a store of Singapore’s public data.

1. Average Daily Polyclinic Attendances for Selected Diseases: This data set gives the average daily number of polyclinic attendances in Singapore for each of the selected diseases. https://data.gov.sg/dataset/average-daily-polyclinic-attendances-selected-diseases?view_id=8fb8637d-c1c5-4c5e-9fbe-3f46785804b7&resource_id=dd4dcaac-aa8d-49de-a96a-b809f8d3ae0d

Timeframe: January 1, 2012 to January 3, 2021

Frequency: Weekly

CSV format

2. Rainfall (Monthly Number of Rain Days): This data set gives the number of days on which it rained in Singapore in each month. https://data.gov.sg/dataset/rainfall-monthly-number-of-rain-days?view_id=8d575155-ac43-4a7b-8b0a-45e2c697c92c&resource_id=8b94f596-91fd-4545-bf9e-7a426493b674

Timeframe: January 1, 1982 to February 28, 2021

Frequency: Monthly

CSV format

3. Relative Humidity (Monthly Mean): This data set gives the monthly mean relative humidity in Singapore, as a percentage. https://data.gov.sg/dataset/relative-humidity-monthly-mean

Timeframe: January 1, 1982 to February 28, 2021

Frequency: Monthly mean

CSV format

Later on, I added another data set, which described the monthly average of sunlight.

Data Preprocessing

This was actually the hardest and most painful phase. The datasets covered unequal timeframes, which I thought was not such a huge problem. However, I then realized that the respiratory dataset was indexed by week, from week 1 to week 52 (or 53) of a given year. These were epidemiological weeks, each starting on a Sunday and ending on a Saturday. The weather datasets, on the other hand, were indexed by the 12 months of the year.

Now I had to figure out how many weeks there were in each month of a given year, and NO, it’s not always 4 weeks! I have described the function below. It’s in Python, so be mindful of the indentation.

Image of the code I wrote for working out the number of weeks in each month of a given year. I don't know how to include code in Medium yet.
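Since the function itself only survives as an image, here is a minimal sketch of one way to do the alignment. It is not the original code. It assumes that a Sunday-to-Saturday week is assigned to the month in which its Saturday falls, and the name weeks_per_month is made up for illustration. The official epidemiological week numbering may treat weeks that straddle two months or two years differently.

```python
from datetime import date, timedelta

SATURDAY = 5  # date.weekday() numbering: Monday is 0, Saturday is 5


def weeks_per_month(year):
    """Count the Sunday-to-Saturday weeks that end in each month of `year`.

    Assumption: a week belongs to the month containing its Saturday.
    Returns a dict mapping month number (1-12) to a week count.
    """
    counts = {month: 0 for month in range(1, 13)}

    # Step forward from 1 January to the first Saturday of the year.
    day = date(year, 1, 1)
    day += timedelta(days=(SATURDAY - day.weekday()) % 7)

    # Walk through the year one Saturday at a time.
    while day.year == year:
        counts[day.month] += 1
        day += timedelta(days=7)
    return counts


if __name__ == "__main__":
    for month, weeks in weeks_per_month(2021).items():
        print(month, weeks)
```

Under this assumption, every month gets either 4 or 5 weeks and the yearly total comes to 52 or 53, so the weekly attendances can be summed or averaged within each month before joining them to the monthly weather data.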

Some insights

The number of respiratory cases in 2020 was so low that it was an outlier, so I excluded that year from my model. After aligning the datasets to a monthly timeframe, I found to my horror that very few data points remained. I had little confidence in a model predicting absolute numbers from such a small training dataset, so I split the targets into three ranges: ‘low’, ‘moderate’ and ‘high’ respiratory cases.
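To make the binning step concrete, here is a small sketch of one way to turn monthly case counts into the three labels. The cut-offs are an assumption: this version splits at the tertiles of the data so that each label gets roughly a third of the months, which may not be how the ranges in the project were chosen. The function name label_cases and the sample numbers are made up for illustration.

```python
from statistics import quantiles


def label_cases(monthly_cases):
    """Label each monthly count as 'low', 'moderate' or 'high'.

    Assumption: the two cut points are the tertiles of the data itself.
    """
    # quantiles(..., n=3) returns the two cut points between three equal groups.
    low_cut, high_cut = quantiles(monthly_cases, n=3)

    labels = []
    for value in monthly_cases:
        if value <= low_cut:
            labels.append("low")
        elif value <= high_cut:
            labels.append("moderate")
        else:
            labels.append("high")
    return labels


if __name__ == "__main__":
    # Made-up counts, only to show the call.
    print(label_cases([10, 20, 30, 40, 50, 60]))
```

With only three classes, a model that guesses at random would be right about a third of the time if the classes are balanced, which gives a rough baseline for reading the accuracy figure.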

This is my first attempt at documenting my journey, and I hope to improve one project at a time!
