Unlocking Customer Insights: RFM Segmentation Strategies with the Online Retail Dataset (Part 1)

Ecesu Olgun Kesici
8 min read · Nov 22, 2024


This article is the first in a four-part series where we explore how the Online Retail Dataset can be leveraged to solve real-world business challenges. Each part focuses on a distinct topic, showcasing practical applications of data analysis and machine learning techniques:

  1. RFM Analysis for Customer Segmentation and Marketing Strategy Recommendations (Bonus: Detailed Analysis of Lost Customers with Tableau Visualizations)
  2. Building Recommendation Systems to Boost Strategic Customer Engagement
  3. Preventing Churn Before It Happens with Predictive Analytics
  4. Optimizing Profits Through Dynamic Pricing Strategies

In this first article, we will dive into RFM analysis, a simple yet powerful framework for understanding customer behavior and segmenting customers effectively.

Understanding your customers' behavior is critical to designing effective marketing strategies, and RFM analysis is one of the most popular methods for customer segmentation. It categorizes customers based on their purchase history, which gives businesses a concrete basis for targeted marketing and for increasing profitability.

What is RFM Analysis?

RFM stands for Recency, Frequency, and Monetary value — three key dimensions that provide a snapshot of customer behavior:

  • Recency measures how recently a customer made a purchase. Customers who purchased recently are more likely to engage with your business again.
  • Frequency assesses how often a customer makes purchases. Frequent buyers are often more loyal and valuable to the business.
  • Monetary value captures how much a customer spends. High-spending customers represent significant revenue potential.

By scoring customers on each of these metrics, RFM analysis identifies patterns in purchasing behavior, enabling businesses to group customers into segments and design tailored strategies for each group.
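To make the three metrics concrete, here is a minimal sketch on a handful of made-up transactions. The frame `toy`, its `Amount` column, and all values are assumptions for illustration only; they are not taken from the Online Retail Dataset.

```python
import pandas as pd

# Toy transactions (made-up values, for illustration only).
toy = pd.DataFrame({
    "CustomerID": ["A", "A", "B", "C", "C", "C"],
    "InvoiceDate": pd.to_datetime([
        "2024-01-05", "2024-03-01", "2024-01-20",
        "2024-02-10", "2024-02-25", "2024-03-10",
    ]),
    "Amount": [20.0, 35.0, 100.0, 10.0, 15.0, 5.0],
})

# Reference date: the most recent transaction in the data.
snapshot = toy["InvoiceDate"].max()

toy_rfm = toy.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),  # days since last purchase
    Frequency=("InvoiceDate", "count"),                            # number of purchases
    Monetary=("Amount", "sum"),                                    # total spending
)
print(toy_rfm)
```

Customer C bought most recently and most often but spends little, while B made a single large purchase some time ago. This is the kind of contrast that the three scores are meant to capture.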

Image was created via Tableau by author

A Practical Example with the Online Retail Dataset

  1. Preparing the Dataset: Before starting RFM analysis, it is crucial to thoroughly understand and analyze the dataset. Questions such as whether the data formats are correct, if there are missing values, and what the dataset represents are all part of the preprocessing steps that can directly impact further analysis.

To address this, we first download the Online Retail dataset from Kaggle to our local environment. Then, using pandas, we read the dataset, correct any incorrect data types, and proceed to the step of making sense of the data:

As observed, there are issues with the formats of the CustomerID and InvoiceDate variables. InvoiceDate is needed later to compute Recency, so it should be converted to a datetime format; subtracting dates does not work on plain strings. The CustomerID variable needs to be converted to an object type, because it is an identifier rather than a numeric quantity. After this step, we need to move on to understanding the dataset, as there are unexpected negative values in the Quantity column. What do these negative values represent? To investigate this, we take a closer look at the InvoiceNo column.
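The type check and conversion can be sketched as follows. This is a minimal example on a tiny in-memory stand-in for the CSV, since the article does not give the file path or read options; the rows in `sample_csv` are illustrative and only mimic the dataset's column layout.

```python
import io
import pandas as pd

# Tiny stand-in for the downloaded CSV (illustrative rows, assumed layout).
sample_csv = io.StringIO(
    "InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country\n"
    "536365,85123A,T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom\n"
    "C536379,D,Discount,-1,2010-12-01 09:41:00,27.50,14527.0,United Kingdom\n"
    "536414,22139,,56,2010-12-01 11:52:00,0.0,,United Kingdom\n"
)
sample = pd.read_csv(sample_csv)
print(sample.dtypes)  # CustomerID comes back as float64 because of the missing value

# Treat CustomerID as an identifier, not a number: 17850.0 becomes "17850".
# Going through the nullable Int64 type drops the trailing ".0" and keeps missing values.
sample["CustomerID"] = (
    sample["CustomerID"].astype("Int64").astype("string").astype(object)
)

# Date arithmetic (for example Recency) needs datetime64 values rather than strings.
sample["InvoiceDate"] = pd.to_datetime(sample["InvoiceDate"])
print(sample.dtypes)
```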

It appears that the values in the InvoiceNo column begin with 5, C, and A. The C-prefixed InvoiceNo values are exclusively associated with negative Quantity values. This suggests that C represents canceled or returned transactions. On the other hand, rows with InvoiceNo values starting with 5 have Quantity values greater than 0, as expected. Therefore, for the RFM analysis, we should only include rows with InvoiceNo values starting with 5, indicating successful transactions that completed the sales process.
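A check along these lines can be sketched by grouping on the first character of InvoiceNo. The frame `orders` below is a made-up example, not real rows from the dataset.

```python
import pandas as pd

# Made-up invoices covering the three prefixes mentioned above.
orders = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "C536379", "A563185", "C536383"],
    "Quantity": [6, 2, -1, 1, -3],
})

# First character of each invoice number: '5', 'C' or 'A'.
prefix = orders["InvoiceNo"].astype(str).str[0]

# Range of Quantity per prefix shows which prefixes carry negative values.
summary = orders.groupby(prefix)["Quantity"].agg(["min", "max", "count"])
print(summary)

# Keep only completed sales for the RFM analysis.
completed = orders[prefix == "5"]
print(completed)
```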

Taking the analysis one step further, we notice unique values of ['S', 'M', 'm'] in the StockCode column of the dataset. What do these represent? Based on the Description column:

  • S corresponds to "Samples."
  • M and m are marked as "Manual."

Note that m is not a separate category; it is a lowercase variant of M and should be merged into it. The categories S and M can be interpreted as:

  • S → Representing products sent as samples.
  • M → Representing transactions processed as normal sales.

To ensure accurate analysis, the dataset is filtered on the StockCode column to include only transactions categorized under M, and is then reorganized accordingly. Lastly, some values in the CustomerID variable, which is critical for the RFM analysis, are NaN, so these rows are dropped. The final dataset's shape is 397,924 x 8.
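The sketch below shows one possible reading of these cleaning steps on made-up rows. It assumes that ordinary product codes are kept alongside M and that only the sample rows (S) are removed; the article does not spell out the exact filter, so treat this as an interpretation rather than the author's code.

```python
import numpy as np
import pandas as pd

# Made-up rows; the frame name `lines` and all values are hypothetical.
lines = pd.DataFrame({
    "StockCode": ["85123A", "M", "m", "S", "22423"],
    "Description": ["T-LIGHT HOLDER", "Manual", "Manual", "SAMPLES", "CAKESTAND"],
    "CustomerID": ["17850", "12345", np.nan, np.nan, "13047"],
})

# Inspect the single-letter (special) codes and what their descriptions say.
special = lines[lines["StockCode"].str.len() == 1]
print(special.groupby("StockCode")["Description"].unique())

# Merge the lowercase variant into M, then drop sample shipments (S).
lines["StockCode"] = lines["StockCode"].replace({"m": "M"})
cleaned = lines[lines["StockCode"] != "S"]

# RFM is computed per customer, so rows without a CustomerID cannot be used.
cleaned = cleaned.dropna(subset=["CustomerID"])
print(cleaned.shape)
```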

  2. RFM Analysis: In this RFM analysis, we first prepared the dataset by calculating the latest transaction date and creating a TotalAmount column as the product of UnitPrice and Quantity to represent the monetary value of each transaction. Next, we grouped the data by CustomerID to compute the three RFM metrics: Recency (days since the last purchase), Frequency (number of purchase records, counted here as invoice lines), and Monetary (total spending). After formatting these metrics, we assigned scores from 1 to 4 using quartiles. Frequency and Monetary were scored directly (higher values get higher scores), while Recency was scored in reverse, because fewer days since the last purchase is better. Finally, these scores were combined into a single RFM_Score, providing a concise summary of each customer's value and behavior and forming the basis for segmentation:

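import pandas as pd

# Date arithmetic below requires datetime64 values, so parse InvoiceDate first.
purchase['InvoiceDate'] = pd.to_datetime(purchase['InvoiceDate'])

latest_date = purchase['InvoiceDate'].max()
purchase['TotalAmount'] = purchase['UnitPrice'] * purchase['Quantity']

rfm = purchase.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (latest_date - x.max()).days,  # Recency: days since last purchase
    'InvoiceNo': 'count',   # Frequency: number of invoice lines (rows) per customer
    'TotalAmount': 'sum'    # Monetary (UnitPrice * Quantity)
}).reset_index()

rfm.columns = ['CustomerID', 'Recency', 'Frequency', 'Monetary']
rfm['Monetary'] = rfm['Monetary'].round(2)

# Recency labels are reversed: fewer days since the last purchase earns a higher score.
rfm['R_Quartile'] = pd.qcut(rfm['Recency'], 4, labels=[4, 3, 2, 1])
rfm['F_Quartile'] = pd.qcut(rfm['Frequency'], 4, labels=[1, 2, 3, 4])
rfm['M_Quartile'] = pd.qcut(rfm['Monetary'], 4, labels=[1, 2, 3, 4])

# Concatenate the three digits into a string score such as '444'.
rfm['RFM_Score'] = rfm['R_Quartile'].astype(str) + rfm['F_Quartile'].astype(str) + rfm['M_Quartile'].astype(str)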

In other words, the higher each of the R, F, and M digits in the RFM_Score, the more active and loyal the customer. For example, a score of 444 represents the best customers.
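The effect of the label order in pd.qcut is easiest to see on a small series. The sketch below uses made-up recency values; `recency_days` is a hypothetical name.

```python
import pandas as pd

# Made-up "days since last purchase" for eight customers.
recency_days = pd.Series([1, 5, 12, 30, 45, 90, 200, 365])

# Reversed labels: the quartile with the fewest days gets the top score of 4.
r_score = pd.qcut(recency_days, 4, labels=[4, 3, 2, 1])

# Ascending labels, as used for Frequency and Monetary: larger values score higher.
direct_score = pd.qcut(recency_days, 4, labels=[1, 2, 3, 4])

print(pd.DataFrame({"days": recency_days, "R": r_score, "direct": direct_score}))
```

With reversed labels, the two most recent buyers (1 and 5 days) land in score 4 and the two least recent (200 and 365 days) in score 1. Note that the final RFM_Score is a string concatenation of the three digits, not a sum.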

After creating the RFM analysis, I used the following logic to segment the customers:

def segment_customers(rfm):
    # 'x' is a wildcard that matches any digit in that position.
    segments = {
        'Best Customers': ['444'],
        'Loyal Customers': ['443', '442', '441', '43x'],
        'Almost Lost': ['14x', '13x', '12x', '22x', '21x'],
        'Lost Customers': ['11x'],
        'New Customers': ['41x', '42x'],
        'Potential Loyalists': ['33x', '34x'],
        'Need Attention': ['24x', '23x', '32x', '31x']
    }

    def assign_segment(score):
        # Return the first segment whose pattern matches the score digit by digit.
        for segment, patterns in segments.items():
            for pattern in patterns:
                if all(a == b or b == 'x' for a, b in zip(score, pattern)):
                    return segment

    rfm['Segment'] = rfm['RFM_Score'].apply(assign_segment)
    return rfm

rfm = segment_customers(rfm)
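One way to sanity-check the pattern table is to enumerate every possible score and confirm that each one lands in a segment. The sketch below copies the patterns from the function above into a standalone dictionary; the names `all_scores`, `coverage`, and `unmatched` are introduced here for illustration.

```python
from collections import Counter
from itertools import product

# Same patterns as in segment_customers; 'x' matches any digit.
segments = {
    "Best Customers": ["444"],
    "Loyal Customers": ["443", "442", "441", "43x"],
    "Almost Lost": ["14x", "13x", "12x", "22x", "21x"],
    "Lost Customers": ["11x"],
    "New Customers": ["41x", "42x"],
    "Potential Loyalists": ["33x", "34x"],
    "Need Attention": ["24x", "23x", "32x", "31x"],
}


def assign_segment(score):
    """Return the first segment whose pattern matches, or None if nothing matches."""
    for segment, patterns in segments.items():
        for pattern in patterns:
            if all(p == "x" or s == p for s, p in zip(score, pattern)):
                return segment
    return None


# Every three-digit score with digits 1 to 4.
all_scores = ["".join(digits) for digits in product("1234", repeat=3)]
coverage = {score: assign_segment(score) for score in all_scores}

unmatched = [score for score, segment in coverage.items() if segment is None]
counts = Counter(coverage.values())

print(len(all_scores), "scores,", len(unmatched), "unmatched")
print(counts)
```

An empty `unmatched` list means no customer can end up without a segment. The counts also show how unevenly the score space is divided, with Almost Lost and Need Attention covering the largest share of combinations.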

Each of R, F, and M takes one of 4 values, so the 3-digit RFM_Score has 4 × 4 × 4 = 64 possible combinations, and the patterns above map each of them to one of seven segments. The definition of each segment and the potential actions we can take for it are as follows:

1. Best Customers

Definition: Customers who bought most recently, shop the most frequently, and spend the most (444).
Goal: Maintain loyalty and reward them.
Campaign Ideas:

  • VIP Programs: Offer special benefits, early access opportunities, and personalized offers. For example, notify this segment first about discounts.

2. Loyal Customers

Definition: Customers who shop regularly but are not in the highest spending group.
Goal: Increase their spending and strengthen their loyalty.
Campaign Ideas:

  • Personalized Recommendations: Provide tailored suggestions based on past purchases for this segment.

3. Almost Lost

Definition: Customers who used to shop but have not been active recently.
Goal: Try to win them back.
Campaign Ideas:

  • Re-engagement Campaigns: Offer exclusive comeback deals (e.g., “Just for you: 30% off”) or reminders and recommendations related to products they previously showed interest in.

4. New Customers

Definition: Customers who recently made their first purchase.
Goal: Make their first experience memorable and increase loyalty.
Campaign Ideas:

  • Welcome Campaigns: Send thank-you emails and offer discounts for their next purchase or recommend best-selling products.

5. Potential Loyalists

Definition: Customers with promising spending history but not fully loyal yet.
Goal: Include them in loyalty programs and strengthen their engagement.
Campaign Ideas:

  • Incentives for Loyalty Programs: Offer bonuses for first-time membership. Provide reminders about products they previously showed interest in (or favorited).

6. Need Attention

Definition: Customers showing a decrease in engagement or at risk of dissatisfaction.
Goal: Increase their engagement and ensure satisfaction.
Campaign Ideas:

  • Personalized Offers: Send reminders about products they previously showed interest in.

7. Lost Customers

Definition: Customers who haven’t shopped for a long time or have been completely lost.
Goal: Make one last effort to win them back or understand why they were lost.
Campaign Ideas:

  • Dramatic Offers: Create “We Miss You!” themed discount campaigns.

Bonus: Detailed Analysis of Lost Customers with Tableau Visualizations

Understanding why customers are lost and acting on it can reduce churn rates, and visualization is a practical way to do this. Using Tableau, a well-known tool in the industry, I analyzed the Online Retail Dataset with a map plot broken down by the Country column. The dataset belongs to a company based in the United Kingdom.

First, to make the analysis more understandable, I adjusted the number of customers by country based on the following condition,

and we can view all segments by country as follows:

As seen, the majority of users are concentrated in Europe and its surrounding regions. The highest concentration is in the United Kingdom, followed by Germany and France, which have significantly more users than other countries. Considering that the company is based in the UK, this result is not surprising. If an advertising campaign or a similar effort is planned outside the UK, Germany and France should be prioritized. Among the remaining countries, Spain, Belgium, and Switzerland also have above-average user numbers, indicating that the company has managed to attract attention in these regions as well.

One particularly striking observation is the absence of Best Customer, Loyal, or Potential Loyalist segments in distant countries such as the USA (except for one user), Canada, Brazil, and Japan. Most of the users from these regions fall into the Lost Customers segment. The first thing to check in this case is likely the shipping process. If shipping takes too long, users might be leaving the company for this reason. Of course, since user density is highest in the United Kingdom, the number of Lost Customers is also relatively high.
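The same country-by-segment view can be approximated in pandas for readers without Tableau. This is a minimal sketch on made-up customers; it assumes one row per customer with a `Country` and a `Segment` column, and the values are not results from the dataset.

```python
import pandas as pd

# Made-up customers, one row each (illustrative values only).
customers = pd.DataFrame({
    "Country": [
        "United Kingdom", "United Kingdom", "United Kingdom",
        "Germany", "Germany", "Japan", "Japan",
    ],
    "Segment": [
        "Best Customers", "Lost Customers", "Loyal Customers",
        "Loyal Customers", "Lost Customers", "Lost Customers", "Lost Customers",
    ],
})

# Share of each segment within a country (each row sums to 1).
share = pd.crosstab(customers["Country"], customers["Segment"], normalize="index")

# Countries ranked by the proportion of customers in the Lost Customers segment.
lost_share = share["Lost Customers"].sort_values(ascending=False)
print(lost_share)
```

Looking at shares rather than raw counts keeps a large market such as the United Kingdom from dominating the comparison simply because it has more customers overall.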

In the next part, Recommendation Systems will be introduced to address situations close to user loss, such as “Almost Lost” or “Need Attention” segments.

Written by Ecesu Olgun Kesici

Co-Founder of Helpimal & Data Scientist
