2019, Feb 07

How to get a second date

speed_dating

1. Introduction

Are you curious about knowing what influences love at first sight? Or about knowing what the key attributes to get a call are? In this article we will be covering a problem that will help us to discover what the attributes that make us more attractive to the eyes of other people are. In this case we will be using data extracted from a speed dating experiment, but the conclusions to which we will arrive could be extrapolated to any life situation where the first impressions are crucial, like for example job interviews.

As an outline of this article, the following topics will be covered:

  • Data cleaning and preparation
  • Analysis of gender differences in the partner selection
  • Creation of a classification model to predict if a person will get a call after the date
  • Evaluation of a classification model

The data used for this article has been obtained from Kaggle. This data was gathered from participants in a speed dating experiment conducted by two professors of the Columbia Business School. The aim of this experiment was to obtain experimental data to write a paper showing the gender differences in the mate selection (you can see this paper here).

During the experiment, the attendees had a four minute first date with all the other participants of the opposite sex. At the end of their speed date, the participants had to choose if they would have a second date and also had to rate their partner in the following six attributes: Attractiveness, Sincerity, Intelligence, Fun, Ambition and Shared Interests. The data also contains questionnaire data such as demographics, dating habits, self-perception of the attributes and lifestyle information.

These are the dependencies for this problem:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import scipy.stats as stats
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import confusion_matrix
from sklearn import metrics

2. Data cleaning and preparation

The first step of all problems is reading our data. It is specified at Kaggle to read this data with the following encoding because it contains special characters.

In [2]:
data_path = "Speed_Dating_Data.csv"

#Reading the CSV data 
data = pd.read_csv(data_path, delimiter = ",", encoding = "ISO-8859-1")
call_col = data.loc[:, "them_cal"]

The second step is to check the number of NaN values of each column:

In [3]:
print(data.isnull().sum())
iid            0
id             1
gender         0
idg            0
condtn         0
wave           0
round          0
position       0
positin1    1846
order          0
partner        0
pid           10
match          0
int_corr     158
samerace       0
age_o        104
race_o        73
pf_o_att      89
pf_o_sin      89
pf_o_int      89
pf_o_fun      98
pf_o_amb     107
pf_o_sha     129
dec_o          0
attr_o       212
sinc_o       287
intel_o      306
fun_o        360
amb_o        722
shar_o      1076
            ... 
amb1_3      4404
shar1_3     4404
attr7_3     6362
sinc7_3     6362
intel7_3    6362
fun7_3      6362
amb7_3      6362
shar7_3     6362
attr4_3     5419
sinc4_3     5419
intel4_3    5419
fun4_3      5419
amb4_3      5419
shar4_3     5419
attr2_3     5419
sinc2_3     5419
intel2_3    5419
fun2_3      5419
amb2_3      5419
shar2_3     6362
attr3_3     4404
sinc3_3     4404
intel3_3    4404
fun3_3      4404
amb3_3      4404
attr5_3     6362
sinc5_3     6362
intel5_3    6362
fun5_3      6362
amb5_3      6362
Length: 195, dtype: int64

As we can see, there are a lot of columns that have a large number of NaN values. These columns will not give us information and will hinder the training process of the model. Therefore, we will drop the columns with more than 800 NaN values with the following code.

The only exception that we will have for this rule will be the column “them_cal” because it is the one that we want to predict in our problem.

In [4]:
# Drop columns with more than 800 NaN values
max_num_nan = 800
data = data.loc[:, (data.isnull().sum() < max_num_nan)]
data["them_cal"] = call_col

Now, we will delete all the rows with remaining NaN values.

In [5]:
# Drop rows with NaN values
data = data.dropna()

The following step is usually check the types of the columns, because there are columns that are not useful in the process of training the classification model.

In [6]:
print (data.dtypes)
iid           int64
id          float64
gender        int64
idg           int64
condtn        int64
wave          int64
round         int64
position      int64
order         int64
partner       int64
pid         float64
match         int64
int_corr    float64
samerace      int64
age_o       float64
race_o      float64
pf_o_att    float64
pf_o_sin    float64
pf_o_int    float64
pf_o_fun    float64
pf_o_amb    float64
pf_o_sha    float64
dec_o         int64
attr_o      float64
sinc_o      float64
intel_o     float64
fun_o       float64
amb_o       float64
like_o      float64
prob_o      float64
             ...   
shopping    float64
yoga        float64
exphappy    float64
attr1_1     float64
sinc1_1     float64
intel1_1    float64
fun1_1      float64
amb1_1      float64
shar1_1     float64
attr2_1     float64
sinc2_1     float64
intel2_1    float64
fun2_1      float64
amb2_1      float64
shar2_1     float64
attr3_1     float64
sinc3_1     float64
fun3_1      float64
intel3_1    float64
amb3_1      float64
dec           int64
attr        float64
sinc        float64
intel       float64
fun         float64
amb         float64
like        float64
prob        float64
met         float64
them_cal    float64
Length: 88, dtype: object

As we can see, there are some columns of the type object, which we are going to drop.

In [7]:
# Drop columns of type 'object'
data = data.loc[:, (data.dtypes != 'object')]

In the description of the data it is specified that in the waves from 6 to 9 the participants had to rate the partners with a different method than all the other waves. Since this two methods can’t be mixed, we will drop the rows from the waves from 6 to 9.

In [8]:
# Deleting the waves with different representation
data = data[(data.wave > 9) | (data.wave < 6)]

The last thing we need to do to prepare our data is to get the column that we will want to predict. In the dataset we have the column “them_cal”, that specifies the number of calls this participant has received. We only want to predict if a participant had received any call after the date, so we will convert the column “them_cal” into a binary column named “call”.

In [9]:
# Creating the variable 'call' (Y)
data['call'] = data['them_cal'].map(lambda x: x == 0).astype(int)
data.drop('them_cal', axis=1, inplace=True)

3. Data Analysis

Since the data has a lot attributes, it is very difficult to understand exactly how it is and how it is distributed. That’s the main reason why we need to perform a data analysis before getting to work on the classification model.

The first recommendable thing to do is to analyse the histograms and normal distribution of the attributes. The more normal the distribution of the data the better, because the classification model will not be biased to one of the classes.

The following function is used to plot a histogram and the normal distribution line for a specific attribute:

In [10]:
def histogram(data, column_name):
    column_data = sorted(data.loc[:, column_name])

    fit = stats.norm.pdf(column_data, np.mean(column_data), np.std(column_data))

    plt.title(column_name)
    plt.xlabel(column_name)
    plt.ylabel("frequency")
    plt.plot(column_data, fit, 'o-', color = "salmon")
    plt.hist(column_data, color = "skyblue", density = True, rwidth = 0.75)

    plt.savefig("histograms/" + column_name + ".png")
    plt.close()

We can plot all the attributes histograms using the following loop:

In [11]:
column_names = data.columns.values

for col in column_names:
    histogram(data, col)

All the plots will be saved into a folder named histograms.

Now that we have all our histograms, let’s find some interesting attributes and check their distributions.

For example, the first thing that we can check is if our Y variable (call) is balanced. With the following histogram we can see that there is approximately the same number of participants who received a call as participants who didn’t receive any call.

The next thing we can check are attributes that can help us to better understand how the experiment was performed. For instance, we will check the age and gender distribution of the data. It can easily be seen with the histograms that the age of the participants is within a range of 20 to 35, and that there are more females (0) than males (1) participating in the experiment.

In the data we also have the six attributes that we mentioned in the introduction:

  • Attractiveness
  • Sincerity
  • Intelligence
  • Fun
  • Ambition
  • Shared Interests

Each participant had to rate each partner in these 6 attributes, distributing 100 point between the different attributes.

As well, the data have ratings of each participant in the following hobbies or lifestyle interests:

  • Sports
  • Tv sports
  • Exercise
  • Dining
  • Museums
  • Art
  • Hiking
  • Gaming
  • Clubbing
  • Reading
  • TV
  • Theatre
  • Movies
  • Concerts
  • Music
  • Shopping
  • Yoga

The aim of this article is to discover what personal traits and likes are the ones that make a person more appealing to the other people. That’s why we will try to find if these attributes have any correlation with the fact of receiving a call after the meetup.

In first place, we have created a function to plot the correlation matrix of some variables using the Seaborn library:

In [12]:
def correlation_matrix(data):
    plt.figure()
    sns.heatmap(data.corr(), center=0, annot=True, vmin=-1, vmax=1, cmap="RdBu_r")
    plt.savefig("plots/attr_heatmap.png")
    plt.close()

With that function we will plot the correlation between the personal attributes, the fact of matching in the date and the fact of receiving a call some days later:

In [13]:
# Correlation matrix of the 6 personal attributes
attributes = ['attr', 'sinc', 'intel', 'fun', 'amb', 'match', 'call']
attr_data = data.loc[:, attributes]
correlation_matrix(attr_data)

We will also plot the same correlation heatmap but with the personal interests and hobbies:

In [14]:
interests = ['sports', 'tvsports', 'exercise', 'dining', 'museums', 'art',
             'hiking', 'gaming', 'clubbing', 'reading', 'tv', 'theater',
             'movies', 'concerts', 'music', 'shopping', 'yoga', 'call']
int_data = data.loc[:, interests]
correlation_matrix(int_data)

If we focus on the correlations of all the analysed variables with the fact of receiving a call, we can obtain a conclusion. At the moment of the truth, none of the attributes or the personal likes are really correlated with the fact of receiving a call after the date.

Nevertheless, in these correlation maps we can observe that the Attractiveness is highly correlated with the Fun, and the same thing happens with the Intelligence and the Ambition. Also, some of the personal interests have a high value of correlation, like for example going to museums with liking art or going to concerts with liking to listen to music.

Last but not least, we want to explore one of the topics why this experiment was performed: the differences between males and females when looking for a partner. It is important to explore the problem when we are working with data, because it will allow us to understand the information within these data.

In the dataset we have some columns that contain the ratings of every participant of which attributes are the ones that is looking for in the opposite sex. With these columns we can plot the preferences of males and females in the 6 mentioned personal attributes.

We will plot this information in a spider chart with the following function:

In [15]:
def spider_chart_combined(data_columns_males, data_columns_females, title):
    labels = np.array(['Attractiveness', 'Sincerity', 'Intelligence', 'Fun', 'Ambition', 'Shared interests'])

    # To close the plot
    average_males = data_columns_males.mean()
    average_males = np.concatenate((average_males, [average_males[0]]))
    average_females = data_columns_females.mean()
    average_females = np.concatenate((average_females, [average_females[0]]))
    angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False)
    angles = np.concatenate((angles, [angles[0]]))

    fig = plt.figure()
    ax = fig.add_subplot(111, polar=True)
    ax.plot(angles, average_males, 'b', linewidth=2)
    ax.fill(angles, average_males, 'b', alpha=0.25)
    ax.plot(angles, average_females, 'r', linewidth=2)
    ax.fill(angles, average_females, 'r', alpha=0.25)
    ax.set_thetagrids(angles * 180 / np.pi, labels)
    ax.set_title(title)
    ax.grid(True)

    fig.legend(["Male", "Female"], loc=4)
    fig.savefig("plots/spider_chart.png")

To this function we have to pass the 6 attributes columns split by gender and the title we want for the plot. So, the first step is to separate the data between genders and to select the specific columns:

In [16]:
data_males = data[data.gender == 1]
data_females = data[data.gender == 0]

columns_opp_sex = ['attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1']

col_males = data_males.loc[:, columns_opp_sex]
col_females = data_females.loc[:, columns_opp_sex]
title = "What are the participants looking for in the opposite sex?"

spider_chart_combined(col_males, col_females, title)
Out[16]:

We also can plot with this function what the attributes that each sex thinks the opposite is more interested in are:

In [17]:
columns_opp_sex_wants = ['attr2_1', 'sinc2_1', 'intel2_1', 'fun2_1', 'amb2_1', 'shar2_1']

col_males = data_males.loc[:, columns_opp_sex_wants]
col_females = data_females.loc[:, columns_opp_sex_wants]
title = "What the participants think the opposite sex wants?"

spider_chart_combined(col_males, col_females, title)
Out[17]:

Looking at these 2 graphs we can extract a clear conclusion. The male participants of the experiment are strongly conditioned by the physical attractiveness of the female participants. This personal attribute seems to be the most important for men when looking for a partner.

On the other hand, the female participants seem to be aware of this fact, as can be observed in the second image. Women expect the attractiveness to be the most desirable attribute in the eyes of men.

4. Classification model

Now that we have prepared and analysed our data, we can proceed to use a classification model to predict if a participant will receive a call from the date or not. The first thing we will do is to drop some columns that will hinder the classification model, like for example the ids, and then we will select the X and the y:

In [18]:
data = data.drop(["iid", "id", "pid", "idg"], axis=1)

y = data["call"]
X = data.drop("call", axis=1)

The following step is to split the data. Normally the most common method to evaluate correctly a model is to use a K-fold method, but since we want to create some graphics with the predictions later we will perform a Holdout partition.

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

Then we will declare, train and predict with a Random Forest Classifier:

In [20]:
rf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)

print("Accuracy: " + str(rf.score(X_test, y_test)))
Accuracy: 0.7891891891891892

5. Evaluation

5.1. Confusion Matrix

When we are evaluating a classification model, the most common thing to do is a confusion matrix. Each row of this matrix represents the instances of the predicted class, while each column represents the instances of the actual class.

In a binary classification problem, this matrix has 2 rows and 2 columns and reports the number of false positives, false negatives, true positives and true negatives.

Since accuracy is not a reliable metric for the real performance of a classification, it is important to know how to calculate and use a Confusion Matrix.

Confusion matrix

In [21]:
# Confusion matrix
confusion_matrix = confusion_matrix(y_test, y_pred)

print(confusion_matrix)
[[326  26]
 [130 258]]
In [22]:
# Plotting the confusion matrix
plt.clf()
plt.imshow(confusion_matrix, interpolation='nearest')
classNames = ['No call','Call']
plt.title('Confusion Matrix')
plt.ylabel('True label')
plt.xlabel('Predicted label')
tick_marks = np.arange(len(classNames))
plt.xticks(tick_marks, classNames, rotation=45)
plt.yticks(tick_marks, classNames)
s = [['TN','FP'], ['FN', 'TP']]
for i in range(2):
    for j in range(2):
        plt.text(j, i, str(s[i][j]) + " = " + str(confusion_matrix[i][j]))
plt.show()
plt.close()
Out[22]:

5.2. ROC Curve

The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The TPR is also known as sensitivity or recall, and the FPR is the same as calculating 1 - specificity.

TPR=Sensitivity=TPP=TruePositivePositive
FPR=FPN=FalsePositiveNegative=1Specificity

The hypothetical perfect classifier would be the one with a sensitivity and a specificity of 1 (TPR = 1 and FPR = 0). (the red line). In the following image we can observe the ideal ROC curve, a good ROC curve, and the worst ROC curve that can be obtained.

ROC Curve Examples

The Area Under the Curve (AUC) tells how much the model is capable of distinguishing between classes. Basically, this value says how close our ROC curve is to the ideal ROC curve. The closer this area is to 1 the better. On the other hand, if this value is closer to 0.5 or less it means that the model is not performing well.

In [23]:
# ROC curve
y_probs = rf.predict_proba(X_test)[::,1]

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_probs)
auc = metrics.roc_auc_score(y_test, y_probs)

x = np.linspace(0, 1)
y = x
plt.plot(x, y, linestyle="--", label="auc = 0.5")

plt.plot(fpr, tpr, label="auc = " + str(auc))
plt.title("ROC Curve")
plt.xlabel("FPR (1 - Specificity)")
plt.ylabel("TPR (Sensitivity)")
plt.legend(loc=4)
plt.show()
plt.close()
Out[23]:

This is the obtained ROC curve with the 0.5 AUC line for reference.

The AUC obtained for this problem is 0.89, which is a good value for a classification model. This basically means there is 89% chance that the model will be able to distinguish between the positive class and the negative class.

5.3. Precision-Recall Curve

The precision recall curve summarize the trade-off between the true positive rate and the positive predictive value for a predictive model using different probability thresholds.

Precision=TPTP+FP
Recall=TPTP+FN
In [24]:
# Precision - Recall Curve ------------------------------------
precision, recall, threshold = metrics.precision_recall_curve(y_test, y_probs)
avg = metrics.average_precision_score(y_test, y_pred)

plt.step(recall, precision, color='b', alpha=0.2, where='post')
plt.fill_between(recall, precision, alpha=0.2)

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title("Precision - Recall curve (AP={0:0.2f})".format(avg))
plt.show()
Out[24]:

This Precision - Recall curve and the calculated Average Precision (AP) are showing that the classification works as expected. We've obtained an Average Precision of 0.78, that is an acceptable value for the trade-off between precision and recall.

6. Conclusions

With all the obtained results and plots we have obtained some conclusions for this problem:

  • Men are highly conditioned by the attractiveness of the female participants.
  • Women preferences are more evenly distributed, but the Intelligence is the attribute that they slightly appreciate more.
  • At the moment of the truth, the personal attributes and the personal likes are not highly related with the fact of receiving a call days after the date.
  • Our model is working properly, as can be seen in the ROC and Precision-Recall curves.
Author face

Laura Planas Simón

Computer Science Student at Universitat Autònoma de Barcelona (UAB)