2019, Nov 07

Predicting Medical Insurance Charges

Insurance Cost Prediction

1. Introduction

As all we know medical insurance companies are very important in some countries where the medical services are not free. In fact, they generate a lot of money due to most of the companies offer this kind of insurance to their employees.

Drawing

Today we are going to analyze the dataset named Medical Cost Personal Dataset from Kaggle. This dataset has a large number of clients from insurance companies of the USA and multiple personal information for each client. We are going to see the relations between the charges and some interesting attributes as well as some criteria these companies have.

2. Dependences

Before starting with data analysis, these are the main libraries used:

  • pandas and numpy: data vector and matrix operations.
  • plotly and matplotlib: graphics generation
  • sklearn: models, metrics and dataset split.
  • prettytable: representate data in a table.
In [1]:
import pandas as pd
import numpy as np

import plotly.express as px
import plotly.graph_objs as go
import plotly.figure_factory as ff

from prettytable import PrettyTable
from matplotlib import pyplot as plt
from plotly.offline import iplot

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.preprocessing import LabelEncoder

3. Data exploration

First of all, we need to familiarize with our data. Doing this will help us to understand the attributes and extracting some information from them.

In [2]:
path = r"./Data/insurance.csv"
df = pd.read_csv(path)

Once we have loaded our data, we can take a look on how many samples and attributes we have:

In [3]:
print("Number of samples:", df.shape[0])
print("Number of attributes:", df.shape[1])
Number of samples: 1338
Number of attributes: 7

As we can appreciate, we do not have a lot of samples. This will be a fact to consider later.

Next thing we are going to do, is to show some examples of different samples and watch the structure of our data.

In [4]:
df.head(5)
Out[4]:
age sex bmi children smoker region charges
0 19 female 27.900 0 yes southwest 16884.92400
1 18 male 33.770 1 no southeast 1725.55230
2 28 male 33.000 3 no southeast 4449.46200
3 33 male 22.705 0 no northwest 21984.47061
4 32 male 28.880 0 no northwest 3866.85520

Now, lets take a look on the attributes data type:

In [5]:
df.dtypes.to_frame("Type").T
Out[5]:
age sex bmi children smoker region charges
Type int64 object float64 int64 object object float64

This table shows some attributes data type that will be needed to be redefined in order to process them: sex, smoker and region.

Lets observe some information about our numerical attributes:

In [6]:
df.describe()
Out[6]:
age bmi children charges
count 1338.000000 1338.000000 1338.000000 1338.000000
mean 39.207025 30.663397 1.094918 13270.422265
std 14.049960 6.098187 1.205493 12110.011237
min 18.000000 15.960000 0.000000 1121.873900
25% 27.000000 26.296250 0.000000 4740.287150
50% 39.000000 30.400000 1.000000 9382.033000
75% 51.000000 34.693750 2.000000 16639.912515
max 64.000000 53.130000 5.000000 63770.428010

One interesting and important fact we can extract from this table is that the BMI mean is 30, which is really bad due to people with a BMI of 30 or higher are considered obese.

Last thing to do in our data exploration is seaching for missing values:

In [7]:
missing_values_table = PrettyTable()

missing_values_table.field_names = ["Attribute", "Missing values"]

for attribute, count in df.isnull().sum().iteritems():
    missing_values_table.add_row([attribute, count])
    
print(missing_values_table)
+-----------+----------------+
| Attribute | Missing values |
+-----------+----------------+
|    age    |       0        |
|    sex    |       0        |
|    bmi    |       0        |
|  children |       0        |
|   smoker  |       0        |
|   region  |       0        |
|  charges  |       0        |
+-----------+----------------+

This table shows how there are no missing values in our data. That is a good fact!

4. Data analysis

Note: During this section you are going to see multiple responsive plots. It is possible to filter them by selecting the labels of the legend or even see some extra information by hovering over them.

First thing we are going to see is if the gender of the clients are actually balanced.

In [8]:
colors = ['#0077B6', '#00B4D8', '#90E0EF', '#CAF0F8', '#03045E']

# We take the 'Sex' attribute values and the corresponding counts.
labels = df["sex"].unique().tolist()
values = df["sex"].value_counts().tolist()

# We generate the pie chart of the 'Sex' attribute.
trace = go.Pie(labels=labels,
               values=values,
               textinfo='label+percent',
               hole=0.3,
               marker_colors=colors)

data = [trace]

figure = go.Figure(data)
figure.update_layout(title="Sex Proportions", width=900, height=400)

iplot(figure)

This pie chart shows how we have practically the same number of samples from both gender.

Another attribute that would be intresting to see is the number of samples per region:

In [9]:
region_colormap = {"southwest": "#00B9FD",
                   "southeast": "#0D8FEC",
                   "northwest": "#3270D0",
                   "northeast": "#114585"}

# We generate the histogram of the 'Region' attribute.
figure = px.histogram(df, x='region', color='region', color_discrete_map=region_colormap)
figure.update_layout(title="Region Histogram", bargap=0.4, width=900, height=300)

iplot(figure)

There are practically the same number of samples per region except for southeast, who has some more samples.

The BMI distribution can be an intresting information, so next step is plotting this attribute distribution:

In [10]:
# We take all the 'BMI' attribute values. 
values = df['bmi'].values.tolist()

data = [values]

# We create a distplot by using the values of the BMI obtained before.
figure = ff.create_distplot(data, ["Body Mass Index (BMI)"], colors=["#0D8FEC"])
figure.update_layout(title="BMI Distribution", xaxis_title='BMI', yaxis_title='Percentage', width=900, height=500)

iplot(figure)

As we can observe, the BMI attribute has a normal distribution.

This attribute is complicated to understand just with numbers, lets try to group the values in order to know the weight status of the samples.

In order to calculate the weight status we are going to consider the following groups:

  • Underweight: BMI is less than 18.5.
  • Normal: BMI is 18.5 to 24.9.
  • Overweight: BMI is 25 to 29.9.
  • Obese: BMI is 30 or more.
In [11]:
# We define the function that classifies the BMI in different weight status.
def get_weight_status(bmi):
    if bmi < 18.5:
        status = "Underweight"
    elif bmi < 25:
        status = "Normal"
    elif bmi < 30:
        status = "Overweight"
    else:
        status = "Obese"
    return status

# We generate a new attribute that indicates the weight status.
df['weight_status'] = df['bmi'].apply(get_weight_status)

Once we have calculated the weight group for each sample, lets see the proportions of each group:

In [12]:
# We take the 'Weight status' attribute values and the corresponding counts.
labels = df['weight_status'].unique().tolist()
values = df['weight_status'].value_counts().tolist()

# We generate the pie chart of the 'Weight status' attribute.
trace = go.Pie(labels=labels,
               values=values,
               textinfo='label+percent',
               hole=0.3,
               marker_colors=colors)

data = [trace]

# We generate a figure containing the 'Weight status' attribute proportions.
figure = go.Figure(data)
figure.update_layout(title="Weight Status Proportions", width=900, height=350)

iplot(figure)

We can clearly see how most of the samples dont have a healthy weight status. In fact, there are more obese people than ones with normal weight. Also, one important fact is that the samples with Overweight represent more than the half of our data.

Insurance companies calculate the charges depending on the impact and probability yo happen of different risks. That is why we can expect weight status to be an important fact. To verify this, we are going to see the relationship between the charges and this weight status:

In [13]:
box_colormap = ["#0D8FEC", "#0479B6", "#0941CF", "#340574"]

# We generate the box of Underweight people.
underweight = df[df['weight_status'] == 'Underweight']['charges'].values
underweight_trace = go.Box(y=underweight, name="Underweight", marker=dict(color=box_colormap[0]))

# We generate the box of normal people.
normal = df[df['weight_status'] == 'Normal']['charges'].values
normal_trace = go.Box(y=normal, name="Normal", marker=dict(color=box_colormap[1]))

# We generate the box of Overweight people.
overweight = df[df['weight_status'] == 'Overweight']['charges'].values
overweight_trace = go.Box(y=overweight, name="Overweight", marker=dict(color=box_colormap[2]))

# We generate the box of Obese people.
obese = df[df['weight_status'] == 'Obese']['charges'].values
obese_trace = go.Box(y=obese, name="Obese", marker=dict(color=box_colormap[3]))

data = [underweight_trace, normal_trace, overweight_trace, obese_trace]

# We create a figure containing all the boxes generated.
figure = go.Figure(data)
figure.update_layout(title="Charges relation by Weight status",
                     xaxis=dict(title="Weight status"),
                     yaxis=dict(title="Charges"),
                     width=900,
                     height=400)

iplot(figure)

As we expected, the weight status plays an important role when calculating the carges. we can clearly see the fact that the more unhealthy the person is the higher the charges are.

Once we have analyzed the BMI, we can take a look on the age attribute. We are also going to group the ages by the criterion shown below:

  • Young: age is less than 30.
  • Adult: age is 30 to 59.
  • Elder: age is 60 or more.
In [14]:
# We define the function that classifies ages.
def get_age_group(age):
    if age < 30:
        category = "Young"
    elif age < 60:
        category = "Adult"
    else:
        category = "Elder"
    return category

# We generate a new attribute that indicates the age group.
df['age_group'] = df['age'].apply(get_age_group)

Once grouped the ages it would be intresting to see the proportions of each group:

In [15]:
# We take the 'Age group' attribute values and the corresponding counts.
labels = df['age_group'].unique().tolist()
values = df['age_group'].value_counts().tolist()

# We generate the pie chart of the 'Weight status' attribute.
trace = go.Pie(labels=labels,
               values=values,
               textinfo='label+percent',
               hole=0.3,
               marker_colors=colors)

data = [trace]

# We generate a figure containing the 'Age group' attribute proportions.
figure = go.Figure(data)
figure.update_layout(title="Age Proportions", width=900, height=350)

iplot(figure)

We can observe how the young people represents more than the half of the population we have. Also, the elder people do not represents even a 10%.

Now it is time to take a look on the relation between charges and age groups:

In [16]:
# We generate the box of Young people.
young = df[df['age_group'] == 'Young']['charges'].values
young_trace = go.Box(y=young, name="Young", marker=dict(color=box_colormap[0]))

# We generate the box of Adult people.
adult = df[df['age_group'] == 'Adult']['charges'].values
adult_trace = go.Box(y=adult, name="Adult", marker=dict(color=box_colormap[1]))

# We generate the box of Elder people.
elder = df[df['age_group'] == 'Elder']['charges'].values
elder_trace = go.Box(y=elder, name="Elder", marker=dict(color=box_colormap[2]))

data = [young_trace, adult_trace, elder_trace]

# We create a figure containing all the boxes generated.
figure = go.Figure(data)
figure.update_layout(title="Charges relation by Age category",
                     xaxis=dict(title="Age group"),
                     yaxis=dict(title="Charges"),
                     width=900,
                     height=400)

iplot(figure)

As we expected, there is a correlation between this two attributes: the older the person is the higher the charges are.

In our data we have one more attribute that represents a very unhealthy habit: smoking. As we have seen before, the charges are higher when the person trends to be unhealthy, so we can expect a huge correlation between being smoker and the charges.

In [17]:
smoker_color_map = {'yes': '#ef476f', 'no': '#06d6a0'}

# We get the charges of smokers and generate a box.
smoker = df[df['smoker'] == 'yes']['charges'].values
smoker_trace = go.Box(y=smoker, name="Yes", marker=dict(color=smoker_color_map['yes']))

# We get the charges of non-smokers and generate a box.
non_smoker = df[df['smoker'] == 'no']['charges'].values
non_smoker_trace = go.Box(y=non_smoker, name="No", marker=dict(color=smoker_color_map['no']))

data = [smoker_trace, non_smoker_trace]

# We create a figure containing all the boxes generated.
figure = go.Figure(data)
figure.update_layout(title="Charges relation by Smoker status",
                     xaxis=dict(title="Smoker"),
                     yaxis=dict(title="Charges"),
                     legend_title_text='Smoker',
                     width=900,
                     height=400)

iplot(figure)

As we expected, there is a huge correlation between the charges and the smoker status of the person. It can be clearly seen how smoker people has more expensive charges.

But, an even more intresting fact would be analyze the relation between the charges and the two attributes that represents the healthy level of a person.

In [18]:
smoker_color_map = {'yes': '#ef476f', 'no': '#06d6a0'}

# We create a grid of therelationship between the 'BMI', the 'Smoker' and the 'Charges' attributes.
figure = ff.create_facet_grid(df, x='bmi', y='charges', color_name='smoker', show_boxes=False,
                              marker={'size': 10, 'opacity': 1.0}, colormap=smoker_color_map)

figure.update_layout(title="Charges by Weight and Smoker status", legend_title_text='Smoker', width=900, height=500)

iplot(figure)

We can extract very important information from this plot:

  • There is a clearly difference on the charges between the smokers and the non-smokers.
  • The charges difference between smokers and non-smokers increases a lot when the BMI exceeds 30, that means this person is considered to be obese so the risk is very high.

In order to appreciate this difference increase we are going to analyze only the obese people grouping by their smoker status:

In [19]:
smoker_color_map = {'yes': '#ef476f', 'no': '#06d6a0'}

# We get the charges of the obese smokers and generate a box.
obese_smokers = df[(df['smoker'] == 'yes') & (df['weight_status'] == 'Obese')]['charges'].values
obese_smokers_trace = go.Box(y=obese_smokers, name="Smokers", marker = dict(color=smoker_color_map['yes']))

# We get the charges of the obese non-smokers and generate a box.
obese_non_smokers = df[(df['smoker'] == 'no') & (df['weight_status'] == 'Obese')]['charges'].values
obese_non_smokers_trace = go.Box(y=obese_non_smokers, name="Non-Smokers", marker = dict(color=smoker_color_map['no']))

data = [obese_smokers_trace, obese_non_smokers_trace]

# We create a figure containing all the boxes generated.
figure = go.Figure(data)
figure.update_layout(title="Relation between Charges and Obese people smoker status",
                     xaxis=dict(title="Obese people"),
                     yaxis=dict(title="Charges"),
                     legend_title_text='Obese people',
                     width=900,
                     height=400)

iplot(figure)

As we have said, there is a huge difference between obese people depending on their smoker status. That is an obvius fact because obese smokers have a huge risks to health insurance companies.

Another thing we can do in order to analyze our data is plotting all the relations between attributes:

In [20]:
colorscale = [[0, '#14B4FE'], [1, '#340574']]

# We generate a figure that contains all posible attribute relationships.
figure = px.scatter_matrix(df.drop(['age_group', 'weight_status'], axis=1),
                           color="charges",
                           color_continuous_scale=colorscale)

figure.update_layout(title="Attributes Relationship", width=900, height=500)

iplot(figure)

We can easily extract a lot of information from this charts. In fact, if we take a look on the charges row we can easily recognize some of the facts we have previously said.

Last step on our data analysis is to compute the correlation between the attributes in order to decide which attributes are going to be used for training the model. However, before doing this we need to represent all our attributes numerically:

In [21]:
# We transform to numerical our categorical attributes.
for attribute in ['sex', 'smoker', 'region']:
    label_encoder = LabelEncoder()
    label_encoder.fit(df[attribute])
    df[attribute] = label_encoder.transform(df[attribute])

Finally, we can represent the correlation values by using a correlation matrix:

In [22]:
# We compute the correlation matrix.
correlation_matrix = df.corr()

# We generate the Heatmap of the correlation matrix.
hm = go.Heatmap(z=correlation_matrix.values,
                x=correlation_matrix.index.values.tolist(),
                y=correlation_matrix.index.values.tolist(),
                colorscale=colorscale)

data = [hm]

# We plot our correlation matrix.
figure = go.Figure(data)
figure.update_layout(title="Correlation Matrix", width=400, height=400)

iplot(figure)

If we take a look on the charges correlations, we can clearly see how only three attributes have an acceptable correlation:

  • Age.
  • BMI.
  • Smoker.

5. Data preparation

Once we have analyzed all our data we need to prepare it for training the models.

First thing we are going to do is to keep only the attributes with an acceptable correlation:

In [23]:
df = df[['age', 'bmi', 'smoker', 'charges']]

Once we have filtered the data, we need to differentiate between the attribute we want to predict and the other ones.

In [24]:
X = df[['age', 'bmi', 'smoker']]
y = df['charges']

Last step is splitting our data in two different sets, one for training the model and the other for testing.

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

6. Model creation

For this problem we are going to create 3 different models and compare them.

6.1. Multiple Linear Regression

A MLR model uses several explanatory variables in order to predict the outcome of a specific variable. The model tries to learn the linear relation between the explanatory variables and the variable we want to predict.

First of all, we are going to define the model:

In [26]:
linear_regressor_model = LinearRegression()

Next step is training it by using the training set previously defined:

In [27]:
linear_regressor_model = linear_regressor_model.fit(X_train, y_train)

Now that we have trained the model we can predict the test set:

In [28]:
linear_regressor_predictions = linear_regressor_model.predict(X_test)

Finally we can evaluate the model predictions by using two different metrics:

MAE (Mean Absolute Error):

This metric expresses the mean of the absolute value of the errors.

R2 Score:

This metric represents how close the data are to the fitted regression line.

In [29]:
linear_regressor_mae = round(mean_absolute_error(y_test, linear_regressor_predictions), 2)
linear_regressor_r2 = round(r2_score(y_test, linear_regressor_predictions), 2)

print('Mean Absolute Error:', linear_regressor_mae)
print('R2 score:', linear_regressor_r2)
Mean Absolute Error: 4260.56
R2 score: 0.78

We can appreciate how we have a MAE of 4260 which means that we have a margin error of 4260 in our predictions.

6.2. Decision Tree Regressor

A decision tree is a model composed by branches where each one represents a decision or outcome. The last nodes of the decision tree, also known as leaf nodes, define the outcome of the model.

First step is creating the decision tree model:

In [30]:
decision_tree_model = DecisionTreeRegressor()

We are going to use one model selection technique provided by the sklearn library named GridSearchCV that finds which combination of parameters fits the best for the model.

In [31]:
k = 3

# We define the parameters we want to test.
parameters = dict(criterion=['mae', 'mse'],
                  max_depth=range(1, 10),
                  splitter=['best', 'random'])

# We create the GridSearchCV.
grid_decision_tree = GridSearchCV(estimator=decision_tree_model,
                                  param_grid=parameters,
                                  scoring='neg_mean_absolute_error',
                                  cv=k)
# We use the GridSearchCV to find the best combination of parameters.
grid_decision_tree.fit(X_train, y_train)

# We get the best combination of parameters.
best_params = grid_decision_tree.best_params_

print("Best parameters:", best_params)
Best parameters: {'criterion': 'mae', 'max_depth': 5, 'splitter': 'best'}

Once we know the parameters that better fit our decision tree, we can train the model:

In [32]:
# We set the parameters of the decision tree.
decision_tree_model.set_params(criterion=best_params['criterion'],
                               max_depth=best_params['max_depth'],
                               splitter=best_params['splitter'])

# We fit the decision tree.
decision_tree_model = decision_tree_model.fit(X_train, y_train)

After training the model, we are going to predict the test set:

In [33]:
decision_tree_predictions = decision_tree_model.predict(X_test)

Finally we can evaluate the model:

In [34]:
decision_tree_mae = round(mean_absolute_error(y_test, decision_tree_predictions), 2)
decision_tree_r2 = round(r2_score(y_test, decision_tree_predictions), 2)

print('Mean Absolute Error:', decision_tree_mae)
print('R2 score:', decision_tree_r2)
Mean Absolute Error: 1983.51
R2 score: 0.86

As we can see, there is only a margin error of 1983 and a R2 score of 0.86 which means our model predicts quite well.

6.3. Random Forest Regressor

Random Forest is a technique that consists in combining the output of multiple decision trees in order to create a model.

We are going to start by defining it:

In [35]:
random_forest_model = RandomForestRegressor()

We are going to find which are the best parameters for our model:

In [36]:
k = 3

# We define the parameters we want to test.
random_forest_parameters = dict(criterion=['mae', 'mse'],
                                max_depth=range(1,10))

# We create the GridSearchCV
grid_random_forest = GridSearchCV(estimator=random_forest_model,
                                  param_grid=random_forest_parameters,
                                  scoring='neg_mean_absolute_error',
                                  cv=k)

# We use the GridSearchCV to find the best combination of parameters.
grid_random_forest.fit(X_train, y_train)

# We get the best combination of parameters.
best_params = grid_random_forest.best_params_ 

print("Best parameters:", best_params)
Best parameters: {'criterion': 'mae', 'max_depth': 4}

Next step is training the model with these parameters:

In [37]:
random_forest_model.set_params(criterion=best_params['criterion'],
                               max_depth=best_params['max_depth'])

random_forest_model = random_forest_model.fit(X_train, y_train)

We can now make predictions on our test set:

In [38]:
random_forest_predictions = random_forest_model.predict(X_test)

Last thing to do is evaluating the model:

In [39]:
random_forest_mae = round(mean_absolute_error(y_test, random_forest_predictions), 2)
random_forest_r2 = round(r2_score(y_test, random_forest_predictions), 2)

print('Mean Absolute Error:', random_forest_mae)
print('R2 score:', random_forest_r2)
Mean Absolute Error: 1965.09
R2 score: 0.87

In these results we can observe how the model predicts quite well with an approximately margin error of 2000.

7. Model comparison

After creating and testing the three models, it is time to pool and compare the results:

In [40]:
# We create a DataFrame of the results obtained for each model.
results = pd.DataFrame()
results['Model'] = ['Multiple Linear Regression', 'Decision Tree Regressor', 'Random Forest Regressor'] 
results['MAE'] = [linear_regressor_mae, decision_tree_mae, random_forest_mae]
results['R2'] = [linear_regressor_r2, decision_tree_r2, random_forest_r2]

# We sort the results by the MAE value.
results = results.sort_values('MAE')

# We create the MAE chart.
mae_figure = px.bar(results, x='MAE', y='Model', color='Model', orientation='h', color_discrete_sequence=colors)
mae_figure.update_layout(title="Mean Absolute Error (MAE)", width=900, height=300)

iplot(mae_figure)

# We create the R2 chart.
r2_figure = px.bar(results, x='R2', y='Model', color='Model', orientation='h', color_discrete_sequence=colors)
r2_figure.update_layout(title="R2 Score", width=900, height=300)

iplot(r2_figure)

As we can see, the Multiple Linear Regression is the worst model. Also we can see how the Decision Tree Regressor model and the Random Forest Regressor model have similar results, beeing the second the best of all them.

For last, lets plot the predictions of all three models in a same chart:

In [41]:
# We create the reference trace.
reference_trace = go.Scatter(x=y_test, y=y_test, mode='markers', name="Reference", marker=dict(color='#03045E'))

# We create the Multiple Linear Regression trace.
mlr_trace = go.Scatter(x=y_test,
                       y=linear_regressor_predictions,
                       mode='markers',
                       name="Multiple Linear Regression",
                       marker=dict(color='#FF007B'))

# We create the Decision Tree trace.
decision_tree_trace = go.Scatter(x=y_test,
                                 y=decision_tree_predictions,
                                 mode='markers',
                                 name="Decision Tree Regressor",
                                 marker=dict(color='#0090FF'))

# We create the Random Forest trace.
random_forest_trace = go.Scatter(x=y_test,
                                 y=random_forest_predictions,
                                 mode='markers',
                                 name="Random Forest Regressor",
                                 marker=dict(color='#8800FF'))

data = [reference_trace, mlr_trace, decision_tree_trace, random_forest_trace]

# We create the figure with all the model traces. 
figure = go.Figure(data)
figure.update_layout(title="Predictions", xaxis_title="Real charges", yaxis_title="Predicted charges", width=900, height=400)

iplot(figure)

In this chart we can appreciate how the MLR is the worst of all three models and also how the Random Forest Regressor predicts quite well our data.

8. Conclusions

After all the analysis and the results obtained, we can expose the following conclusions:

  • There is a huge relation between attributes that represent the health of the client and the charges.
  • Smoker charges are a lot higher than non-smokers.
  • The combination of being obese and smoker is a very combination that results in very expensive charges.
  • We have been able to create different regression models for our dataset.
  • The worst model has been the Multiple Linear Regression one.
  • The model that has better predictions is the Random Forest Regressor.
  • Our best model has a margin error of approximately $1950, which is a good margin since the charges are normally very high.
Author face

Eric Lozano

Computer Science Student at Universitat Autònoma de Barcelona (UAB)