Building Datasets – A Step-by-Step Guide

So, you already know that a good dataset is vital for developing an effective AI model; the internet is full of this claim. But how do you actually build one? In this post, you'll follow the process along with an existing dataset, with the goal of applying the same steps to create a dataset tailored to your own model. You don't need to know Python to follow along, but some familiarity with it will certainly help.

Let’s dive into the example of a customer churn predictor dataset.

In the business world, losing customers—or churn—can be costly. Predicting when a customer is about to leave allows companies to take preventive actions, improving customer retention and boosting profits. But how do we build a predictive model that can anticipate customer churn?

The foundation of every successful predictive model is the dataset. A well-prepared dataset is crucial for accurately identifying patterns and trends that can predict churn. In this post, we’ll walk you through the process of building and refining a dataset for customer churn prediction, and we’ll provide hands-on examples using a real dataset that you can follow along with.

What is Customer Churn?

Customer churn refers to the percentage of customers who stop using your product or service within a given timeframe.
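As a quick illustration (the numbers here are made up), the churn rate for a period is simply the number of customers lost divided by the number you started with:

```python
# Hypothetical example: 1,000 customers at the start of the quarter,
# 50 of whom cancel during the quarter
customers_at_start = 1000
customers_lost = 50

churn_rate = customers_lost / customers_at_start
print(f"Quarterly churn rate: {churn_rate:.1%}")  # 5.0%
```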


Step 1: Downloading the Dataset

To get started, download the Telco Customer Churn dataset from Kaggle. This dataset contains information such as customer demographics, account details, and whether or not each customer churned.

Once you have the dataset, we can jump right into the data exploration and cleaning process.

Step 2: Exploring the Dataset

After downloading the dataset, let’s first load it in Python using the pandas library and get an idea of what the data looks like:

```python
import pandas as pd

# Load the dataset (the filename may vary depending on your download)
data = pd.read_csv('Telco-Customer-Churn.csv')

# Display the first few rows
print(data.head())
```

In this dataset, each row represents a customer, and the columns contain information such as their tenure (how long they’ve been a customer), monthly charges, and whether or not they churned (Churn).
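Before cleaning anything, it helps to get a quick overview of the column types and the balance of the target variable. A minimal sketch:

```python
# Column names, dtypes, and non-null counts
data.info()

# How many customers churned vs. stayed?
print(data['Churn'].value_counts())
```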

Step 3: Data Cleaning

Real-world datasets often contain missing or inconsistent data, so we’ll need to clean it up. Let’s first check for missing values:

```python
# Check for missing values
print(data.isnull().sum())
```

At first glance it may look like nothing is missing, because the TotalCharges column is read in as text and its blank entries aren't flagged as null. Converting the column to numeric exposes those blanks as missing values, which we can then drop:

```python
# TotalCharges is stored as text; coercing it to numeric turns
# blank entries into NaN so they can be dropped
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')

# Drop rows with missing TotalCharges
data = data.dropna(subset=['TotalCharges'])

# Check the dataset size after cleaning
print(data.shape)
```

Step 4: Data Preparation

Now that we’ve cleaned the data, we need to prepare it for modeling by converting categorical features into numerical ones. This is important because machine learning models work with numbers, not text.

We'll use one-hot encoding for categorical variables such as gender, InternetService, Contract, and PaymentMethod, and convert Churn into a binary variable (1 for churned, 0 for not churned). In practice we need to encode every text column except customerID and Churn, because the model can't work with leftover strings:

```python
# One-hot encode every categorical column except the ID and the target
categorical_cols = data.drop(columns=['customerID', 'Churn']).select_dtypes(include='object').columns
data = pd.get_dummies(data, columns=categorical_cols, drop_first=True)

# Convert 'Churn' to binary (Yes -> 1, No -> 0)
data['Churn'] = data['Churn'].apply(lambda x: 1 if x == 'Yes' else 0)
```
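A quick sanity check helps confirm the encoding worked; after this step, customerID should be the only remaining text column:

```python
# List any columns that are still text; only customerID should remain
print(data.select_dtypes(include='object').columns.tolist())
```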

Step 5: Splitting the Data

To evaluate the model’s performance, we need to split the dataset into a training set (used to train the model) and a test set (used to evaluate it). We’ll use 80% of the data for training and 20% for testing.

```python
from sklearn.model_selection import train_test_split

# Select features and target variable
X = data.drop(['customerID', 'Churn'], axis=1)
y = data['Churn']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set: {X_train.shape}, Test set: {X_test.shape}")
```

Step 6: Building a Model

We’re now ready to build a model! We’ll start with a simple logistic regression model, which is widely used for binary classification problems like churn prediction.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Create and train the logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test, y_pred))
```
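As an aside, logistic regression often converges faster when numeric features are standardized (which is partly why max_iter is raised above). Here's a minimal sketch using scikit-learn's pipeline utilities; results should be similar, but convergence is typically smoother:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize features, then fit logistic regression in one pipeline
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_train, y_train)
print(f"Accuracy (scaled): {accuracy_score(y_test, scaled_model.predict(X_test))}")
```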

Step 7: Improving the Model with Feature Engineering

To further improve our model, we can create new features based on the existing ones. For example, dividing a customer’s tenure by their monthly charges might provide useful information about loyalty.

```python
# Create a new feature: tenure per monthly charge
data['tenure_per_monthly_charge'] = data['tenure'] / data['MonthlyCharges']

# Rebuild the feature matrix so it includes the new column
X = data.drop(['customerID', 'Churn'], axis=1)
```

After adding new features, retrain the model to see if performance improves. Feature engineering can significantly enhance the model’s ability to detect patterns.
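For completeness, here's what that retraining step might look like, reusing the same split and model settings as before:

```python
# Re-split with the enriched feature set and retrain
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(f"Accuracy with new feature: {accuracy_score(y_test, y_pred)}")
```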

Step 8: Analyzing the Results

Once your model is trained, you can evaluate its performance using the accuracy score, but don’t stop there. It’s also crucial to examine precision, recall, and F1-score to ensure the model is performing well, especially in situations where the dataset might be imbalanced (i.e., more non-churners than churners).

```python
# View classification report with precision, recall, and F1-score
print(classification_report(y_test, y_pred))
```
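A confusion matrix is another quick way to see where the model goes wrong, separating false positives (customers flagged as churners who stayed) from false negatives (churners the model missed):

```python
from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))
```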

Conclusion

Customer churn prediction is a powerful tool that helps businesses retain valuable customers. But remember, the quality of your predictions relies heavily on the quality of your dataset. By properly cleaning, preparing, and engineering your data, you can build highly effective models.

Now that you've walked through the process step by step, download the Telco Customer Churn dataset and try it for yourself. Once you've done that, you'll be in a good position to start planning your own dataset.

If you need help with any of these steps for your own application, we're happy to assist.