Excel vs KNIME
Why Market Basket Analysis?
Whenever you try to personalize an offer to a customer, you first need to do some sort of Market Basket Analysis. Guessing what your customer wants to buy next involves looking at what they’ve bought in the past, and at what similar customers like to buy.
Peeking into shopping baskets is a great starting point. If a customer has added laundry detergent, then they probably also want to buy fabric softener. You know that because 75% of your customers buy those two products together.
Market Basket Analysis can help you with:
- Cross-Selling: how do I get customers to buy more stuff?
- Up-Selling: how do I get customers to buy more expensive stuff?
- Solution Selling: what problem is my customer trying to solve?
- Store Layout: how can I make it easy for customers to find what they want?
- Discounting: how do I personalize a promotion while avoiding cannibalization?
- Loyalty Programs: what unique offer can I make to keep each customer loyal?
To answer all these questions, data science first needs to discover:
“Customers who buy A also buy B”
We are going to look at how to get that data science. And we’ll compare an Excel solution with a KNIME solution.
A quick note on where all our test data comes from before we get started. A different KNIME workflow generated the supermarket data used by these examples. See the workflow: KN-123 Data Generation for Market Basket Analysis.
This data generation workflow created a virtual supermarket with:
- 10 categories (Meat & Fish, Dairy, Fruit, Vegetables, …),
- 48 products (Chicken, Salmon, Beef, Pork, Full Cream Milk, …), and
- 10,000 customers
Customers each shop only once at this supermarket (and hence only get 1 basket each). While they can buy as many different products as they wish, they can only buy the same product once. That is, they cannot buy 2 Apples – they can only buy “Apples”. Most shopping baskets contain between 2 and 15 different products, though some customers buy many more. In all, there are about 108,000 rows of transaction data.
When starting out on this article, I expected to find many Excel examples of Market Basket Analysis. Maybe nothing too sophisticated, but Excel is my go-to tool when learning how something works and generating some quick results.
How wrong I was!
I love Excel, so it pains me to say that Excel is absolutely the wrong tool for this type of problem. Taking a full-semester database course then writing an SQL script from scratch would not only be quicker, but also more flexible and more powerful.
On the other hand, I discovered some Excel super-powers I didn’t know existed. For that reason alone, you may wish to try this yourself.
The Big Picture
You can download the Excel model yourself from the link above. When you do, you’ll find there are two parts the model:
- Build a [Customer x Product x Product] array
- Collapse the array into results using a Pivot Table
Customer C00001 Basket:
To understand how the [Customer x Product x Product] array is built, let’s start by looking at the first Customer. The Basket for Customer “C00001” contains the following items:
- Bottled Water
Excel converts each Customer Basket into a 2×2 Product array, with purchased Product pairs flagged with a “1”. The array is buried deep within Excel’s Data Model, but it would look like this:
The Pivot Table then takes each of these Product arrays and counts the number of times each Product pair is found together across all Baskets. The results look like this:
These results show that Cola and Bottled Water are found together 77 times. We can also see that if the Customer purchases Cola, there is a 97% probability they will also purchase Bottled Water. But the opposite is not true: if the Customer purchases Bottled Water, there is only a 73% chance they will purchase Cola.
The size of the Product array hiding within Excel’s Data Model is:
10,000 Customers x 48 If_Products x 47 Then_Products = 22,560,000 rows
If each row uses 1 Kbyte, Excel would need to find 22.6 GB to hold all this data. This is way too much for the purposes of our example. You will soon see how to reduce the size of this Product array by filtering down the number of Customer Baskets included in the analysis.
The Excel super-powers needed for Market Basket Analysis include:
- Queries & Connections
- Power Pivot
Power Pivot is a mini Business Intelligence (BI) platform that allows you to break through Excel’s 1,048,576 row limitation. Power Pivot depends upon relational data sources called “Data Models” that are much more efficient than spreadsheet cells. The Data Models are hidden, which is why there are no tabs in the spreadsheet containing data from intermediate steps.
To see the Power Pivot menu, you must first enable the Add-in. This is just an overview – not step-by-step instructions – but I’ll paste a few screenshots as we go so you can get the general idea. Afterwards feel free to play with the final spreadsheet.
For Excel’s Queries and Power Pivot to work, we must first convert our transactions data into an Excel “Table”. You do that from within Excel’s Insert menu.
Next, we start building the data queries that drive our analysis. The queries can be opened from Excel’s Data tab.
Six queries need to be developed, with later queries depending upon earlier queries. The Query Dependencies view can be seen from within the Excel Power Query Editor (right-click a query to open). The results from most of these queries are then loaded into the Power Pivot Data Model.
Let’s look more closely at what each query is doing.
Query #1 Baskets
The first query, called “Baskets”, is designed to simply filter the input Table from the Baskets spreadsheet tab down to a more manageable size. As described in The Big Picture above, the original 108,000 rows of transactions data are a bit too much for Excel to handle. I’ve filtered this down to the first 2051 rows, representing just 200 of the 10,000 Customers (that’s only 2% of all the Baskets). When you want to get more accurate results, this is where you would come.
Query #2 Customers
The second query, called “Customers”, is quite simple. It just takes those 2051 rows from the Baskets query and finds all the unique Customers by removing duplicate identifiers. As each Customer only ever buys a single Basket, these Customer identifiers are also the Basket identifiers.
Query #3 Products
The “Products” query also finds the unique Product names, but it adds an additional column. The query uses a “Group Rows” function to count the number of times each Product was sold. This value represents the relative frequency of sales.
Query #4 and Query #5 _Customers and _Products
The _Customers and _Products queries are a bit weird.
Unlike the other queries, these are “Connection only” queries. This means the query logic is passed through to the next downstream query without the intermediate results being saved.
Both queries also have a new “Index” column, with all index values set to “1”. Being new myself, it took me a while to figure out this. The explanation goes back to The Big Picture section above.
Remember that we are trying to generate a [Customer x Product x Product] array. To do that, we need a way to “Cross Join” these tables. Many data analytic tools have built-in Cross Join functions. But to do this in Excel you need to create a full “Outer Join” on the index columns. You’ll see how this works in the final Query #6.
Note that I named both these queries with leading “_” underscore characters to help distinguish their purpose, but that is not necessary.
Query #6 Combinations
Cross Joining to create a [Customer x Product x Product] array creates a table with every combination of Customer and Product pair. This table is used to identify each Product pair found in each Customer Basket.
It doesn’t make sense to find Baskets with redundant Product pairs (like “Apples” and “Apples”). So extra steps are needed to remove rows where the If_Product equals the Then_Product.
With only 200 Customers, Excel’s Data Model will be much smaller than before. The final size of the table can be calculated as:
200 Customers x 48 If_Products x 47 Then_Products = 451,200 rows
Excel Power Pivot
Once the [Customer x Product x Product] array has been built, it is time to move onto the next big step building the Excel model. We need to collapse the array into results by using a Pivot Table. Excel’s Power Pivot functions are complicated to explain, but by now should at least be easy to imagine.
Power Pivot sits on top of Excel’s Data Model. Using the “Combinations” query just generated, a new “Calculated Column” is used to flag all the Product pairs for each Customer with another “1”. This calculation is done in the “Has_Pair” column.
In this example, we can see that Customers C00002, C00003, C00004, C00007, C00008, and C00014 all purchased both [Apples] and [Bananas]. The Has_Pair column for each of these Customers is flagged. Comparing the If_Quantity and the Then_Quantity columns, we can also see that Apples are more popular than Bananas.
To make life even more confusing, Excel introduces yet another feature called “Measures”. These Measures are formulas that a regular Pivot Table can understand. Measures get calculated according to the filters and aggregations set up by the Pivot Table.
At last we arrive back at the more familiar Pivot Table. But now the data has been enhanced with new calculations and measures. The Pivot Table will:
- filter just the Product pairs that have been found in Baskets together,
- sum the number of Has_Pair combinations using the “Matches” Measure,
- calculate the “If A Then B” and “If B Then A” percentages
From these results, it looks like customers like to buy Cola and Bottled Water together, as well as Full Cream Milk with Eggs.
Phew, that was a lot of work! And we really didn’t learn that much. We only analyzed our first 200 customers, and our insights didn’t go far beyond the products that were simply the most popular. Evaluating only pairs of products is also a bit limiting. In most cases, we still don’t know what problem our customers are trying to solve.
In the upcoming KNIME example, we will explore how bigger baskets can generate more accurate recommendations. And we will see how rules can be generated that both guide customers and drive profitability.
In the above example, we looked at how to do Market Basket Analysis with Excel. The outcome was disappointing. It took a tremendous amount of work and we were only able to analyze 200 of our 10,000 customer baskets.
The results were also limited to pairs of products, and in most cases that didn’t tell us much about what our customers were trying to do. If a customer put [hotdogs] into their basket then we could confidently recommend [buns]. But what if they put [chicken] into their basket? Do they want to bake a chicken dinner? Or do they want to make a Caesar salad?
Marketers are going to want to do more powerful analysis in a real data science tool. They also want to construct bigger baskets that lead to better insights. And they want a set of rules that can guide customers while driving profitability. All these elements are explored here.
In this example, I’m going to use the KNIME Analytics platform (pronounced “nime”).
KNIME is just one of many powerful data analytics tools, but it is by far the easiest to use. You don’t have to be a Data Scientist to discover useful insights. Anybody who works with numbers can become a KNIME professional within a day.
Why not Python or R?
Python and R are also very powerful. But I consider these to be “production tools” useful when you’ve figured out your algorithm and you want to bake it into a process. KNIME is an “exploration tool” that allows you to jump about searching for that illusive insight.
In any event, even the most determined Python fan will readily appreciate the visual KNIME workflow I’ve prepared below.
KNIME workflows are built from a large library of nodes. The nodes are configured through a GUI (so you don’t have to remember any syntax details) and linked together in chains.
The Market Basket Analysis workflow is a little long, but it is doing two jobs:
- calculating Recommendation Rules from a set of transaction data, and
- using the rules to suggest a recommendation for a New Basket.
The same Customer-Basket data from the Excel example is used in this KNIME example. Again, there are 10,000 Customers (identified as C00001 to C10000) with Baskets typically containing between 2 and 15 Products. The supermarket offers a total of 48 Products across 10 Categories.
The “Association Rule Learner” is the green node that conducts the Affinity Analysis and calculates the Recommendation Rules. It does this with an Apriori Algorithm (optimized by Professor Borgelt) which determines the frequency of not only the individual Products, but the Product pairs, triplets, quadruplets, etc.
This screenshot shows what’s inside the Association Rule Learner. The node will analyze the [Baskets] column and look for relationships that occur at least 10% of the time (minimum support) which lead to a good recommendation (minimum rule confidence). In KNIME, it’s easy to configure the science with a few clicks.
The Association Rule Learner generates the following output:
- “If Basket” (Antecedent)
- “Then Recommendation” (Consequence)
- Rule Confidence
If the Basket contains [Yogurt] then you could Recommend [Beef] with a 39% Confidence that the Customer will buy it.
Rule Confidence compares the “Then Recommendation” against the “If Basket” with this equation:
- Number of Baskets that contain [Yogurt] = 2157
- Number of Baskets that contain [Yogurt] and [Beef] = 845
- Rule Confidence = 845 / 2157 = 39%
It doesn’t matter if the Basket also contains other stuff.
Recommending Beef with Yogurt seems like a bad idea. There are many other Products these Customers are more likely to buy with Beef, including:
- Cheese (67%),
- Eggs (89%), and
- Full Cream Milk (95%)
But Confidence is not the only factor to consider from our Recommendation Rules. Incremental Revenue and Profitability are also important. We do that by calculating the “Expected Revenue” from each Product Recommendation:
Now that we’ve got a set of Recommendation Rules it is time to apply them.
Consider what happens when a new Customer comes along to our supermarket and puts the following Products into their Basket.
The [New Basket] circled in purple contains:
The green “Subset Matcher” node will split that Basket into all combinations of sub-Baskets. The following [If Basket] combinations then emerge:
- Beef + Eggs
- Beef + Onions
- Onions + Tomatoes
These [If Basket] combinations are then matched (joined) against all the Recommendation Rules and sorted by Expected Revenue.
The algorithm has discovered our Customer is worrying about ingredients for dinner. The Top-5 Recommendations it suggests are:
- Frozen Vegetables
- Frozen Pizza
Success! As you can see, the KNIME solution is more powerful and flexible than the Excel solution. Not only were we able to identify rules that described the baskets of our customers, but we were able to use those rules to generate recommendations. Our KNIME Market Basket solution allows us to both guide customers and drive profitability.