*This post was originally published by Tanmay Jain at Medium [AI]*

The 2019 coronavirus (COVID-19) outbreak, which began in Wuhan, China, has had devastating consequences around the globe and has overburdened even advanced health systems. The World Health Organization (WHO) has declared the outbreak a pandemic, and the virus continues to spread. The epidemic is still not under control, even though recoveries are being confirmed.

We preprocessed the data and then applied a hidden Markov model (HMM) to predict the spread of COVID-19 across different countries and regions. The time series forecasting techniques we studied include ARIMA, Facebook Prophet, and Holt's linear trend method. In this article, motivated by the limitations of those earlier models, we apply a hidden Markov model with the aim of overcoming them and predicting more accurately.

The dataset comes from the data repository for the 2019 Novel Coronavirus Visual Dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE), supported by the ESRI Living Atlas Team and the Johns Hopkins University Applied Physics Lab (JHU APL).

This is a COVID-19 time series dataset covering 22 Jan 2020 to 16 July 2020. From it, we selected the information for the US, which consists of the number of infections and the number of deaths in each city of the US.

Both files are in .csv format. The dataset consists of the following columns: UID, iso2, iso3, code3, FIPS, Admin2, Province_State, Country_Region, Lat, Long_, Combined_Key, and one column per date (22 Jan to 16 July). The features of our dataset are infection rate, death rate, and time.

First, the dataset was cleaned by removing the columns we did not need: UID, iso2, iso3, code3, FIPS, Admin2, Lat, and Long_. Since we only needed to identify each row by its location, we kept only the Combined_Key (city, state, country) column. We then plotted a single row to see the curve of COVID cases against dates. Next, we calculated the first derivative of each series and took its maximum, i.e., the maximum infection rate and the maximum death rate. Finally, to relate the maximum infection rate to the maximum death rate, we used correlation, which measures how strongly one variable depends on another.
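The cleaning and feature steps above can be sketched roughly as follows. This is a minimal illustration, not our original notebook: the two small data frames, the helper names `make_frame` and `max_daily_rate`, and all numbers are made up, and we assume the first derivative is taken as the day-over-day difference of the cumulative counts.

```python
import pandas as pd

# Tiny synthetic stand-in for the two JHU CSSE US files (all values are made up).
def make_frame(rows):
    meta = {"UID": 0, "iso2": "US", "iso3": "USA", "code3": 840, "FIPS": 0.0,
            "Admin2": "", "Province_State": "", "Country_Region": "US",
            "Lat": 0.0, "Long_": 0.0}
    records = []
    for key, counts in rows.items():
        rec = dict(meta, Combined_Key=key)
        # One column per date, named like the JHU files: 1/22/20, 1/23/20, ...
        rec.update({f"1/{22 + i}/20": c for i, c in enumerate(counts)})
        records.append(rec)
    return pd.DataFrame(records)

confirmed = make_frame({"A, X, US": [0, 1, 3, 7, 15],
                        "B, Y, US": [0, 0, 2, 4, 5],
                        "C, Z, US": [1, 2, 2, 10, 30]})
deaths = make_frame({"A, X, US": [0, 0, 0, 1, 2],
                     "B, Y, US": [0, 0, 0, 0, 1],
                     "C, Z, US": [0, 0, 1, 1, 4]})

UNUSED = ["UID", "iso2", "iso3", "code3", "FIPS", "Admin2", "Lat", "Long_"]

def max_daily_rate(df):
    # Keep only Combined_Key (as the index) and the date columns.
    dates_only = (df.drop(columns=UNUSED + ["Province_State", "Country_Region"])
                    .set_index("Combined_Key"))
    # Day-over-day difference along the date axis, then the largest jump per row.
    return dates_only.diff(axis=1).max(axis=1)

max_infection_rate = max_daily_rate(confirmed)
max_death_rate = max_daily_rate(deaths)

# Pearson correlation between the two per-location features.
correlation = max_infection_rate.corr(max_death_rate)
print(max_infection_rate, max_death_rate, correlation, sep="\n")
```

A single row can be plotted with `dates_only.loc[key].plot()` to inspect the curve for one location.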

Hidden Markov models (HMMs) are probabilistic models known for their ability to predict and interpret time-based phenomena, which makes them very useful for forecasting. Given a sequence of observations, they allow us to measure the joint probability of those observations and a sequence of hidden states. The hidden states are often called latent states. Once we know the joint probability of each candidate sequence of hidden states, we choose the best sequence, i.e., the one with the highest probability. In its discrete form, a hidden Markov process can be visualized as a generalization of the urn problem with replacement (where each item from the urn is returned to the original urn before the next step).
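To make the joint probability concrete, it can be written out as below. The notation is an assumption introduced here for illustration: $x_{1:T}$ are the observations, $z_{1:T}$ the hidden states, $\pi$ the initial state distribution, $a_{ij}$ the transition probability from state $i$ to state $j$, and $b_j(x)$ the probability (or density) of observing $x$ in state $j$.

$$
P(x_{1:T}, z_{1:T}) = \pi_{z_1}\, b_{z_1}(x_1) \prod_{t=2}^{T} a_{z_{t-1} z_t}\, b_{z_t}(x_t),
\qquad
z^{*}_{1:T} = \arg\max_{z_{1:T}} P(x_{1:T}, z_{1:T})
$$

The best sequence $z^{*}_{1:T}$ is the one that maximizes this joint probability for the observed data.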

[Figure: the general architecture of an instantiated HMM]

A Gaussian hidden Markov model (Gaussian HMM) is a finite-state, homogeneous HMM whose observation probabilities follow a normal distribution. Because of this, the three standard algorithms (the forward algorithm, the backward algorithm, and the Viterbi algorithm) can be used to solve the six common problems: evaluation, filtering, forecasting, smoothing, decoding, and learning. The forward algorithm solves evaluation, filtering, and forecasting; the forward and backward algorithms together solve smoothing; the Viterbi algorithm solves decoding; and learning uses the forward algorithm to compute the likelihood when it is solved by maximum likelihood or maximum a posteriori estimation.
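As a rough sketch of two of these algorithms, the code below runs the forward algorithm (evaluation) and the Viterbi algorithm (decoding) in log space for a Gaussian HMM with one-dimensional observations. The function names, the two-state parameters, and the observations are all illustrative assumptions, not values from our experiments.

```python
import numpy as np

def gaussian_log_pdf(x, means, variances):
    # Log density of each observation under each state's normal distribution.
    # Returns an array of shape (T, K): T observations, K states.
    x = np.asarray(x, dtype=float)[:, None]
    return -0.5 * (np.log(2.0 * np.pi * variances) + (x - means) ** 2 / variances)

def logsumexp(a, axis):
    # Numerically stable log(sum(exp(a))) along one axis.
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def forward_log_likelihood(log_b, log_pi, log_A):
    # Evaluation: log P(x_1..x_T), summing over all hidden state sequences.
    alpha = log_pi + log_b[0]
    for t in range(1, len(log_b)):
        # alpha_new[j] = logsum_i(alpha[i] + log_A[i, j]) + log_b[t, j]
        alpha = logsumexp(alpha[:, None] + log_A, axis=0) + log_b[t]
    return float(logsumexp(alpha, axis=0))

def viterbi(log_b, log_pi, log_A):
    # Decoding: the single most probable hidden state sequence.
    T, K = log_b.shape
    delta = log_pi + log_b[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A      # scores[i, j]: best path ending i -> j
        back[t] = np.argmax(scores, axis=0)  # best predecessor of each state j
        delta = np.max(scores, axis=0) + log_b[t]
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmax(delta)
    for t in range(T - 1, 0, -1):            # follow the back-pointers
        path[t - 1] = back[t, path[t]]
    return path, float(np.max(delta))

# Illustrative two-state model: state 0 = "low rate", state 1 = "high rate".
log_pi = np.log(np.array([0.9, 0.1]))
log_A = np.log(np.array([[0.9, 0.1],
                         [0.1, 0.9]]))
means = np.array([0.0, 10.0])
variances = np.array([1.0, 1.0])

observations = [0.1, -0.2, 0.3, 9.8, 10.1, 10.4]
log_b = gaussian_log_pdf(observations, means, variances)

log_likelihood = forward_log_likelihood(log_b, log_pi, log_A)
best_path, best_log_prob = viterbi(log_b, log_pi, log_A)
print(log_likelihood, best_path, best_log_prob)
```

The forward log-likelihood is always at least as large as the Viterbi log-probability, because it sums over every state sequence rather than keeping only the best one.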

In our case, there were two states, namely max_infection_rate and max_death_rate, as calculated above. The data is split into train and test sets with train_test_split from the sklearn library, in a ratio of 67% train data to 33% test data. The two columns are then combined with the NumPy function numpy.column_stack(), which stacks the 1-D arrays as columns of a 2-D array. The previous-day features are then calculated and combined in the same way with the same NumPy function.
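A minimal sketch of the split and stacking step is shown below. The two feature arrays are random placeholders, and `test_size=0.33` and `random_state=42` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features: one value per location (random numbers, not real data).
rng = np.random.default_rng(0)
max_infection_rate = rng.integers(0, 500, size=100).astype(float)
max_death_rate = rng.integers(0, 50, size=100).astype(float)

# Split both 1-D arrays with the same shuffling so rows stay aligned.
inf_train, inf_test, death_train, death_test = train_test_split(
    max_infection_rate, max_death_rate, test_size=0.33, random_state=42
)

# Stack the 1-D arrays as columns: shape (n_samples, 2).
X_train = np.column_stack([inf_train, death_train])
X_test = np.column_stack([inf_test, death_test])
print(X_train.shape, X_test.shape)
```

Note that train_test_split shuffles by default; passing `shuffle=False` keeps the original row order if that matters for the analysis.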

After that, the Gaussian HMM is fit on the training array with different parameter settings, such as the covariance type and the number of iterations. Viterbi was used as the decoding algorithm.

The score is then calculated with the score function on the test data for the different parameter settings. It is computed on the previous-day features, and in our model we considered three cases: the first 50 days, the first 100 days, and the first 200 days.

We judge the accuracy of the model by its score, which is the log probability of the data under the model. The score is calculated from the previous-day features using the 'diag' covariance type. We calculated it for three time periods: the first 50 days, the first 100 days, and the first 200 days.

The final prediction is made with the predict function, which returns the most likely state sequence for the test data as state_sequence (one label per sample in the test set). The predict_proba function returns the posterior probability of each state in the model for each sample.

As we can see, the rates were very low during the first 50 days and then started increasing.

We have also calculated the correlation between max_infection_rate and max_death_rate before prediction to analyze the relationship between the two parameters.

We set out to address the limitations of the earlier models and to predict more accurately. The results obtained after applying the hidden Markov model show the correlation between the maximum death rate and the maximum infection rate.

As we have seen, the rates were very low during the initial days and then increased rapidly. To examine the relationship between the two parameters, we also measured the correlation between the maximum infection rate and the maximum death rate before prediction.

With forthcoming data, we hope these estimates can help forecast the potential spread of COVID-19 and assist decision-making in health care, manufacturing, the economy, and even academia.
