2019, Oct 07

Pulsar Stars Detection

Predicting a Pulsar Star with Deep Learning

1. Motivation

Pulsar stars are a very rare type of neutron star that produce radio emission detectable on Earth, and they are of considerable scientific interest as probes of space-time and states of matter. As a pulsar rotates, its emission beam sweeps across the sky and produces a detectable pattern of broadband radio emission. In practice, however, almost all detections are caused by radio frequency interference and noise, making legitimate signals hard to find.

Having said that, the main purpose of this problem is to build a simple classifier using deep learning tools in order to predict whether a detected signal comes from a pulsar star or from other sources such as noise or interference.

All information and data related to this problem can be found here: https://www.kaggle.com/pavanraj159/predicting-a-pulsar-star

2. Data Information

Each signal is described by eight continuous variables and a single class variable. The first four are simple statistics obtained from the integrated pulse profile, and the remaining four are similarly obtained from the DM-SNR curve. These variables are:

Mean of the integrated profile.

Standard deviation of the integrated profile.

Excess kurtosis of the integrated profile.

Skewness of the integrated profile.

Mean of the DM-SNR curve.

Standard deviation of the DM-SNR curve.

Excess kurtosis of the DM-SNR curve.

Skewness of the DM-SNR curve.

Class

The dataset shared here contains 17,898 samples in total.

3. Dependencies

These are the libraries we will use to develop a solution for this problem:

numpy|pandas: Help us treat and explore the data, and execute vector and matrix operations.
matplotlib|seaborn: Help us plot the information in different ways so we can visualize and understand it better.
keras: Provides all the necessary deep learning tools to develop a solution for the problem.
sklearn: Gives us tools to split our dataset into training and test sets, plus some metrics to evaluate our models.
prettytable: Lets us print the final evaluation metrics as a readable table.
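If any of these are missing, they can typically be installed with pip (the package names below are assumed from the imports in the next cell):

pip install numpy pandas matplotlib seaborn keras scikit-learn prettytable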

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import keras
from prettytable import PrettyTable
from keras.models import Sequential
from keras.utils.vis_utils import plot_model
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from keras.layers import Dense, Dropout
Using TensorFlow backend.

4. Data Exploration

Since this is a deep learning solution we do not want to do any feature selection or give the data any special treatment. However, let's take a quick look at the data first and understand it!

In [2]:
#read the csv that contains our data and print the first 5 rows of it
df = pd.read_csv("pulsar_stars.csv")
df.head()
Out[2]:
Mean of the integrated profile Standard deviation of the integrated profile Excess kurtosis of the integrated profile Skewness of the integrated profile Mean of the DM-SNR curve Standard deviation of the DM-SNR curve Excess kurtosis of the DM-SNR curve Skewness of the DM-SNR curve target_class
0 140.562500 55.683782 -0.234571 -0.699648 3.199833 19.110426 7.975532 74.242225 0
1 102.507812 58.882430 0.465318 -0.515088 1.677258 14.860146 10.576487 127.393580 0
2 103.015625 39.341649 0.323328 1.051164 3.121237 21.744669 7.735822 63.171909 0
3 136.750000 57.178449 -0.068415 -0.636238 3.642977 20.959280 6.896499 53.593661 0
4 88.726562 40.672225 0.600866 1.123492 1.178930 11.468720 14.269573 252.567306 0

First of all, we observe that the names of these columns are quite long and full of spaces, so let's rename them:

In [3]:
#Changing the name of some columns
df.columns = ['mean_profile', 'std_profile', 'kurtosis_profile', 'skewness_profile', 'mean_dmsnr',
               'std_dmsnr', 'kurtosis_dmsnr', 'skewness_dmsnr', 'target']

Now let's look for null values and object datatypes we might want to transform into numerical values:

In [4]:
#Looking for null values
df.isna().sum()
Out[4]:
mean_profile        0
std_profile         0
kurtosis_profile    0
skewness_profile    0
mean_dmsnr          0
std_dmsnr           0
kurtosis_dmsnr      0
skewness_dmsnr      0
target              0
dtype: int64
In [5]:
#Looking for object datatypes
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17898 entries, 0 to 17897
Data columns (total 9 columns):
mean_profile        17898 non-null float64
std_profile         17898 non-null float64
kurtosis_profile    17898 non-null float64
skewness_profile    17898 non-null float64
mean_dmsnr          17898 non-null float64
std_dmsnr           17898 non-null float64
kurtosis_dmsnr      17898 non-null float64
skewness_dmsnr      17898 non-null float64
target              17898 non-null int64
dtypes: float64(8), int64(1)
memory usage: 1.2 MB

Having checked that there are no null values and that all columns contain numerical values, let's look at some statistics of our data:

In [6]:
#Show statistical information of our data
df.describe()
Out[6]:
mean_profile std_profile kurtosis_profile skewness_profile mean_dmsnr std_dmsnr kurtosis_dmsnr skewness_dmsnr target
count 17898.000000 17898.000000 17898.000000 17898.000000 17898.000000 17898.000000 17898.000000 17898.000000 17898.000000
mean 111.079968 46.549532 0.477857 1.770279 12.614400 26.326515 8.303556 104.857709 0.091574
std 25.652935 6.843189 1.064040 6.167913 29.472897 19.470572 4.506092 106.514540 0.288432
min 5.812500 24.772042 -1.876011 -1.791886 0.213211 7.370432 -3.139270 -1.976976 0.000000
25% 100.929688 42.376018 0.027098 -0.188572 1.923077 14.437332 5.781506 34.960504 0.000000
50% 115.078125 46.947479 0.223240 0.198710 2.801839 18.461316 8.433515 83.064556 0.000000
75% 127.085938 51.023202 0.473325 0.927783 5.464256 28.428104 10.702959 139.309331 0.000000
max 192.617188 98.778911 8.069522 68.101622 223.392140 110.642211 34.539844 1191.000837 1.000000

Just by looking at this table we can already extract some important information. For example, in the target column the maximum value is 1 (pulsar star) and the minimum is 0 (not a pulsar), while the mean is close to 0, which tells us there are far more non-pulsar signals than actual pulsars. Additionally, we can clearly see this data needs some scaling, since the ranges of the features differ noticeably.
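Since target only takes the values 0 and 1, its mean is exactly the fraction of pulsar samples, so the class balance can be read straight off the table; a quick one-line check on the same dataframe:

#the mean of a 0/1 column is the fraction of positives
df['target'].mean()   #~0.0916, i.e. about 9.2% of samples are pulsars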

5. Data Visualization

At this point we are going to plot some aspects of the dataset that might be of interest. First, since we know our target variable, let's look at the proportion between its two values:

In [7]:
#counting pulsars and not pulsars
pulsar_count = (df['target'] == 1).sum()
not_pulsar_count = (df['target'] == 0).sum()
In [8]:
#pie plotting the stats between pulsars and not pulsars
plt.figure(figsize=(5,5))
plt.pie(df["target"].value_counts().values,labels=["not pulsar stars","pulsar stars"], autopct="%1.0f%%")
plt.title("Proportion of target variable in dataset")
plt.show()
print("There are " + str(pulsar_count) + " signals that belong to pulsar stars " 
      + "and " + str(not_pulsar_count) + " signals that aren't from pulsars.")
There are 1639 signals that belong to pulsar stars and 16259 signals that aren't from pulsars.

From the graph above we see that approximately 10% of our samples are real pulsar stars, while the other 90% of detected signals belong to something else, such as noise. Now let's have a look at how the features correlate with each other.

In [9]:
#plot correlation matrix
plt.figure(figsize=(6,6))
sns.heatmap(df.iloc[:,0:9].corr(), annot=True, fmt='.0%')
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x1faee85f7f0>

Something curious we can see from this correlation matrix is that four of the eight features correlate positively with our target variable, while the other four correlate negatively; this should really help when training our model, since it suggests the separation between classes is clear. That said, let's dig deeper into these features!
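To see those signs at a glance, we could also sort the correlations with the target directly; a small sketch using the same dataframe:

#signed correlation of each feature with the target, sorted ascending
df.corr()['target'].drop('target').sort_values()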

In [10]:
#violinplot of all features
features = df.iloc[:,0:8]
plt.figure(figsize=(15,20))
j = 0
for i in features:
    plt.subplot(4,3,j+1)
    sns.violinplot(x=df["target"],y=df[i],palette=["red","lime"])
    plt.title(i)
    plt.axhline(df[i].mean(),linestyle = "dashed", label ="Mean value = " + str(round(df[i].mean(), 2)))
    plt.legend(loc="best")
    j = j + 1

Thanks to these violin plots we can see the distribution of values for each feature in our dataset, split by the two values of our target variable; the dashed line marking the mean value of each feature also helps us understand the data better. That said, let's comment on some of the features for which the separation between classes is clearest:

mean_profile. From the correlation matrix we observed that the higher the value, the lower the chance of the signal coming from a pulsar star, and in the violin plot we see it clearly. Additionally, looking at the mean, we could say that if a mean_profile value is above the mean, that signal probably comes from a source other than a pulsar.

kurtosis_profile. Like mean_profile, this feature is also pretty interesting. We clearly observe that the majority of samples whose kurtosis_profile value is above the mean belong to the pulsar group while, with some outliers that break the rule, values lower than the mean come from other signals. In addition, the distribution of values in the non-pulsar group is tightly concentrated, meaning the range of kurtosis_profile values for those signals is quite narrow, while the opposite happens in the pulsar group, whose values spread over a range of roughly 0.48 to 8.

skewness_profile. From skewness_profile we can extract some quite interesting information. It actually seems strange that the mean value is just 1.77 when we have values higher than 60 in the dataset. The reason is that we have approximately 10 times more non-pulsar samples than actual pulsar stars, and the majority of skewness values for the non-pulsar group are very close to 0; since the pulsar group is proportionally smaller, the mean is pulled down. That gives us a very important piece of information: the majority of samples whose skewness_profile value is higher than the mean will probably belong to the pulsar group, and almost 100% of samples whose value is higher than ~23 are pulsars (see the quick check after this list)!

mean_dmsnr/skewness_dmsnr. These features are pretty similar in terms of data distribution, with the difference that in mean_dmsnr the vast majority of non-pulsar values lie below the mean, while in skewness_dmsnr it's exactly the opposite and the pulsar stars are located below the mean value. These distributions look very similar to the one we commented on before (skewness_profile), but with one exception: here we can't confidently say that above or below some value a sample will belong to the pulsar or non-pulsar group, since the range of values for these features is pretty wide.
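As a quick sanity check of the skewness_profile observation above (a sketch; the cut-off of 23 is just the rough value read off the violin plot):

#fraction of pulsars among samples with very high skewness_profile
high_skew = df[df['skewness_profile'] > 23]
print(high_skew['target'].mean())   #should be close to 1.0 if the observation holds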

In order to make sure our hypothesis that the data is easily separable for the majority of features is true, we are going to create a pairplot between columns and check whether we can visually make that separation.

In [11]:
#pairplot between features
sns.pairplot(df, hue="target", palette=["red","lime"])
Out[11]:
<seaborn.axisgrid.PairGrid at 0x1faee843978>

Thanks to the pairplot it's clear that our data splits into two large, well-separated groups that can be differentiated just by looking at the graphic above.

6. Preparing and Fitting Our Deep Learning Model

Now it's time to prepare our deep learning model and, unlike in classical machine learning, we are not going to select any specific features; we are going to set some parameters that we think will make the neural network perform at its best and then feed it all our data. That said, we still have to split our data into training and test sets (a separate validation set would be fine as well) and scale it, too.

In [12]:
#Separating our features from the target variable
X = df.iloc[:,0:8].values
y = df.iloc[:,8].values
In [13]:
#Scaling our data
sc = StandardScaler()
X = sc.fit_transform(X)
In [14]:
#Splitting data
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.3)

Now that we have our data scaled and split into training and test sets, it's time to define our deep learning model. We are going to use a Sequential model since we want to build the network layer by layer. We will create a simple fully-connected neural network with 3 layers in total (2 hidden layers and 1 output layer). Between these layers we use Dropout layers in order to reduce overfitting, with the rate set to 0.25.

The input dimension is 8 since we're using eight features, and the output dimension is 1 since we're predicting a single label. The number of neurons per layer was chosen somewhat arbitrarily, without being excessively high, after trying larger and smaller numbers and observing that the results barely vary. It's also worth mentioning that, after testing other activation functions for the hidden layers, we're using relu since it provided slightly better results, and sigmoid for the output layer since this is a binary classification problem. A softmax activation would only make sense for a multi-class output layer; over a single output unit it would always produce 1, so sigmoid is the right choice here.

In [15]:
#define a sequential Model
model = Sequential()

#First hidden layer
model.add(Dense(16,activation='relu',input_dim=8))
model.add(Dropout(0.25))

#Second hidden Layer
model.add(Dense(8,activation = 'relu'))
model.add(Dropout(0.25))

#Output layer
model.add(Dense(1,activation='sigmoid'))

With our model already defined, it's time to compile it. As the loss function we're using binary_crossentropy since, as we've already said, this is a binary classification problem. We're also using adam as the optimizer, since it adapts the learning rate as training progresses, and as far as metrics are concerned we're tracking accuracy.

In [16]:
#Compiling the model
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
In [17]:
#Printing a summary of our model
model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 16)                144       
_________________________________________________________________
dropout_1 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 8)                 136       
_________________________________________________________________
dropout_2 (Dropout)          (None, 8)                 0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 9         
=================================================================
Total params: 289
Trainable params: 289
Non-trainable params: 0
_________________________________________________________________

We can see our model in the table above, but it's also nice to have an image of what it really looks like. To generate it we used the Kaggle built-in Python editor, since it already has all the necessary libraries installed and we only wanted the picture of the model.
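For reference, that image can be produced with the plot_model utility imported earlier; a minimal sketch (it additionally requires the pydot and graphviz packages to be installed):

#render the model architecture to an image file
plot_model(model, to_file='model.png', show_shapes=True, show_layer_names=True)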

Now would usually come the time to fit our model. However, before doing that it might be interesting to adjust the weights of our different output classes; in other words, since our dataset has far more samples that don't belong to pulsar stars, we want to give higher importance to the samples that really do come from pulsars. This is called handling an imbalanced dataset, and it should help our model predict our positive pulsar samples better.

In [18]:
#Adjusting class weights
weight = {0 : 1., 1 : 2.}
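The 1:2 ratio above is picked by hand. Alternatively, the weights could be derived from the class frequencies themselves, for example with scikit-learn's compute_class_weight; a sketch, not what we used here:

#derive balanced class weights from the training labels
from sklearn.utils.class_weight import compute_class_weight
balanced = compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
weight = {0: balanced[0], 1: balanced[1]}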

Having done that, let's fit our model!

In [19]:
#Fitting our model with training data and, at the same time, using test data for validation
history = model.fit(X_train, y_train, epochs=100, batch_size=64, class_weight=weight, validation_data = (X_test,y_test), verbose=2)

Train on 12528 samples, validate on 5370 samples
Epoch 99/100
 - 0s - loss: 0.1114 - accuracy: 0.9816 - val_loss: 0.0772 - val_accuracy: 0.9799
Epoch 100/100
 - 0s - loss: 0.1110 - accuracy: 0.9829 - val_loss: 0.0794 - val_accuracy: 0.9795

As we can see, our model works pretty well (note that we only show the last two epochs)! We get an accuracy score of approximately 98% for both our training and test data.

7. Evaluating Our Model

In order to do this evaluation we're going to do two things. First we're going to plot the model accuracy and loss for both the training and validation data, and afterwards we're going to make a prediction and see how it goes through a confusion matrix.

In [20]:
#Visualizing accuracy for both training and test data
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='lower right')
plt.show()

Interestingly, we can observe that the accuracy of our model stays practically the same after about 10 epochs; we train for 100 anyway, relying on the dropout layers to keep the extra epochs from causing overfitting.
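If we preferred to stop automatically once validation performance plateaus instead of always running 100 epochs, Keras offers an EarlyStopping callback; a sketch of how our fit call could use it (not what we ran above):

#stop once val_loss hasn't improved for 10 consecutive epochs
from keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
history = model.fit(X_train, y_train, epochs=100, batch_size=64, class_weight=weight,
                    validation_data=(X_test, y_test), callbacks=[early_stop], verbose=2)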

In [21]:
#Visualizing loss for both training and test data
plt.plot(history.history['loss']) 
plt.plot(history.history['val_loss']) 
plt.title('Model Loss') 
plt.ylabel('Loss') 
plt.xlabel('Epoch') 
plt.legend(['Train', 'Test'], loc='upper right') 
plt.show()

With our two graphics in place, let's make a prediction and look at the confusion matrix:

In [22]:
#Confusion matrix
label_aux = plt.subplot()
prediction = model.predict(X_test)
cm = confusion_matrix(y_test, np.round(prediction))
cm_nn = pd.DataFrame(cm, index = ['Not a Star','Star'], columns = ['Not a Star','Star'])
sns.heatmap(cm_nn,annot=True,fmt="d", cbar=False)
label_aux.set_title("Confusion Matrix of Our Model")
label_aux.set_xlabel('Predicted Value');label_aux.set_ylabel('True Value');
In [23]:
#Printing some metrics
ptbl = PrettyTable()
ptbl.field_names = ["Accuracy", "Recall", "F1Score"]
ptbl.add_row([accuracy_score(y_test,np.round(prediction)),recall_score(y_test, np.round(prediction)), 
              f1_score(y_test, np.round(prediction))])
print(ptbl)
+--------------------+--------------------+-------------------+
|      Accuracy      |       Recall       |      F1Score      |
+--------------------+--------------------+-------------------+
| 0.9828398510242086 | 0.8772277227722772 | 0.881592039800995 |
+--------------------+--------------------+-------------------+

8. Conclusions

After solving this problem using Deep Learning we have come to the following conclusions:

  • Our Deep Learning model predicts very well whether a signal comes from a pulsar star or another external source.
  • Thanks to the data being clearly separable into two groups (pulsar and not pulsar), our model has worked very well.
  • Had we used a classical Machine Learning algorithm instead of Deep Learning, feature engineering would have been straightforward, since there are strong correlations in this dataset.
  • The number of Epochs (iterations) in our Neural Network could have been reduced, since results were pretty much the same after the first 40 iterations; early stopping on the validation loss would be a natural way to cut training short.
  • There are some features that are really well separated, as we explained with the violin plots. It would be easy to predict the target class just by checking whether certain values are below or above the mean (see the sketch after this list).
  • This dataset is quite imbalanced, since only about 10% of samples belong to the pulsar class; adjusting class weights before feeding data to our neural network might help us detect True Positives, which is what we're interested in.
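To illustrate that point about well-separated features, here is a naive one-feature baseline (a sketch): predict "pulsar" whenever kurtosis_profile is above its own mean and compare against the true labels; whatever accuracy it prints gives a floor to measure the network against.

#naive baseline: threshold a single feature at its mean
baseline_pred = (df['kurtosis_profile'] > df['kurtosis_profile'].mean()).astype(int)
print((baseline_pred == df['target']).mean())   #accuracy of the one-feature rule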

In case someone wants to know how to solve this particular problem with different Machine Learning algorithms, the following link leads to a very interesting Kaggle kernel where the author solves it using different techniques, explaining both the algorithms he uses and what he is doing at each step.


Ferran Rodríguez Mir

Computer Science Student at Universitat Autònoma de Barcelona (UAB)