Machine Learning, E-Commerce and Implementation Examples

One of the significant benefits of machine learning for e-commerce is that it facilitates personalized services to users.
Previously, simple methods such as keyword matching were used to offer products the user might like as recommendations. Today, when we log in to an e-commerce site with two different accounts, two different pages greet us. Thanks to machine learning, we can now use dozens of data points, such as the types of products purchased, their price ranges, when more purchases are made, and which campaign types were claimed. This way, we can offer customized landing pages and send targeted campaign emails or SMS messages.
What are the possible scenarios?
- We can examine the customers who made regular purchases for a certain period and then reduced their spending or churned. With machine learning, we can determine the reasons, analyze customers with similar profiles, and prevent customer losses.
- We can optimize prices according to the products’ sales potential, thus reducing warehouse costs by not investing in the products that remain in stock for a long time.
- We can perform regional analysis using the customer’s location and tailor recommendations to regional conditions such as temperature.
- Another vital advantage of machine learning is its support for increasing repeat customers. (The cost of customer acquisition is very high; by improving customer retention by only 5%, we can achieve an increase in profitability of up to 75%.*)
According to Amazon, sales via its recommendation engine account for 35% of all sales. With machine learning, you can increase your revenue by analyzing your users across millions of data points and regularly informing them about the campaigns they are most likely to purchase from.
E-commerce & Machine learning
- Segmentation, Personalization, and Targeting: Directing customers to purchase by profiling and dividing them into specific clusters
- Pricing Optimization: Optimization based on competitor prices or sales rates within a specific price range
- Search Ranking: Providing search results according to the customer profile so that the customer can easily reach the desired product without getting lost among the products while searching
- Product Recommendations: Recommending products that fit the customer’s shopping pattern, such as “customers who looked at this product also looked at these” or “you might also like the following products”
- Customer Support and Self-Service: Providing a chatbot or smart assistant (such as those used in banking) to serve the customer faster. (With these tools providing instant support, the customer solves the problem faster and leaves more satisfied.)
- Supply and Demand Forecasting: With machine learning, it is possible to estimate which products customers prefer in which periods and which products sell less, and to plan purchasing accordingly
Netflix Example
Netflix is a series and movie streaming service with hundreds of thousands of titles.
Although it does not fall directly into the e-commerce category, a service is being sold, and sales volume is expected to increase.

Netflix’s most prominent feature in recent times is the personalized cover photos of series and movies. A user sees these cover photos according to her/his viewing habits. For example, if you have seen almost all the films starring actor X, Netflix places X on the cover photo of series Y, even if X appears in that series only as a guest in a two-minute scene.
Netflix does not rely on old-school demographic parameters such as age and gender. Instead, it uses different data, such as:
- Which scenes did the user rewind or fast-forward?
- Did they watch it all at once or in pieces?
- What type of device did they use to watch?
- How high was the volume while watching?
- How many times did they pause the movie?
And dozens of other data points we don’t know about.

As you can see in the example above, different cover photos are produced for the same movie according to the genres users love. The majority of these operations are carried out by Netflix’s own machine learning systems.
Age Prediction Example Using BigQuery ML
(This example has been added to explain the primary usage and parameters.)
We first create an account at https://console.cloud.google.com/, then search for BigQuery and click on the top result.
Then the following screen appears. Click on the Create Data Set button at the bottom right of the screen.

Then, a panel opens on the right side of the screen. We enter the Data set ID, choose the data location (the server region where the data is stored), and, if we want the data to be deleted automatically after a specific time, enter that time in the box below. Finally, click the Create data set button at the bottom of the panel and save.

After the dataset is created, we need to create a table. To do so, click the Create Table button, as in the image below.

From the drop-down menu on the right of the screen;
- First, click the drop-down menu that says Create table from and choose Upload.
- From the Select file section, navigate to the location of the test data and select your file.
- In the Target section, select your project and dataset. Then fill in the table name at the bottom.
- Finally, fill and save the column names and data types from the Schema section.

Once you have completed these steps, you will see the message “Table is created, go to the source.” at the bottom left of the screen.
You can test your table by writing your query in the query field, as in the image below.
When writing the query, be sure to write in the format DataSetName.TableName. You may receive errors if you only type the table name.
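For example, a simple test query might look like the following sketch (DataSetName and TableName are placeholders for the dataset and table created above):
-- Hypothetical example: select a few rows using the DataSetName.TableName format.
SELECT *
FROM `DataSetName.TableName`
LIMIT 10;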

Now we are going to the training stage. First, we need to create a model. To do this, we open the query section and proceed as follows.
CREATE MODEL `DataSetName.ModelName`
We write the regression type we want to use in the model_type option:
{ 'LINEAR_REG' | 'LOGISTIC_REG' | 'KMEANS' | 'TENSORFLOW' }
We use linear_reg. Next, we add the input_label_cols parameter.
In input_label_cols, we write the column whose value we want to predict. For linear regression, this must be a column that takes numeric values.
Finally, we specify the table that the model learns from.
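Putting these pieces together, the full statement looks roughly like the sketch below (the exact query was shown as a screenshot in the original post; DataSetName, ModelName, TrainingTable, and the Age label column are placeholder names):
-- Hypothetical sketch: create a linear regression model for age prediction.
CREATE MODEL `DataSetName.ModelName`
OPTIONS (
  model_type = 'LINEAR_REG',      -- the regression type chosen above
  input_label_cols = ['Age']      -- placeholder: the numeric column to be predicted
) AS
SELECT * FROM `DataSetName.TrainingTable`;  -- placeholder: the training table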

After the model is created, we create a table with our test data.
Things to pay attention to:
* Our training and test data sets should be different. If we test with the same data, the model may simply have memorized the results, and we cannot tell whether the algorithm is actually working correctly.
* An overfitted model performs very well when tested on the available data, but its error rate is very high on new data sets, because it has also memorized the noise in the data.
Example: You are preparing for an exam and have learned all the question patterns of the last five years. When you come across a different type of question in the exam, you likely cannot solve it, because you have memorized rather than learned.
There are two methods to solve this problem;
1. Cross-Validation
We divide the data set into several equal parts; usually this number is 10. Nine of them are selected as the training set, and one is chosen as the test set.
We carry out training and evaluation ten times, taking a different one of the 10 parts as the test set each time.
In the end, we average the accuracy value from each phase. The result gives us the accuracy of our classification algorithm.
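In BigQuery, this split can be sketched by hashing a key column into folds (a minimal illustration, assuming a table `DataSetName.TrainingTable` with a unique id column; BigQuery ML does not run the full cross-validation loop for you, so each fold would be trained and evaluated in turn):
-- Assign each row to one of 10 folds; rows with fold_id != 0 form the training set
-- and rows with fold_id = 0 form the test set. Repeat with each fold_id in turn
-- and average the resulting accuracy values.
SELECT
  *,
  MOD(ABS(FARM_FINGERPRINT(CAST(id AS STRING))), 10) AS fold_id
FROM `DataSetName.TrainingTable`;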
2. Regularization
Regularization aims to prevent overfitting in the model, so that learning takes place rather than memorization. It can be done by expanding the data set, producing synthetic data (mostly used for data such as images and videos), and applying dropout.
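BigQuery ML also supports classic L1/L2 regularization for linear models through the l1_reg and l2_reg options; a minimal sketch, reusing the placeholder names from above with arbitrary illustrative penalty values:
-- Hypothetical example: add L1/L2 penalties when creating the regression model.
CREATE MODEL `DataSetName.RegularizedModel`
OPTIONS (
  model_type = 'LINEAR_REG',
  input_label_cols = ['Age'],
  l1_reg = 0.1,   -- L1 penalty (illustrative value)
  l2_reg = 0.1    -- L2 penalty (illustrative value)
) AS
SELECT * FROM `DataSetName.TrainingTable`;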
SELECT * FROM ML.PREDICT(MODEL `DataSetName.ModelName`, (SELECT * FROM `DataSetName.TestTable`))
ML.EVALUATE: Used to evaluate the model.
ML.PREDICT: Used to make predictions with the model.
Since we are making a prediction here, we used the ML.PREDICT function. As a result of the query, you can see the predicted values in the predicted_XY column (BigQuery ML prefixes the label column’s name with predicted_).

Let’s determine the error rate of the algorithm:
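The evaluation query behind that screen is not shown in this text; a minimal sketch of the corresponding ML.EVALUATE call, reusing the placeholder names from the model-creation sketch above:
-- Evaluate the trained model against the test table and return error metrics
-- such as mean_absolute_error, mean_squared_error, median_absolute_error, and r2_score.
SELECT *
FROM ML.EVALUATE(MODEL `DataSetName.ModelName`,
  (SELECT * FROM `DataSetName.TestTable`));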


The image above shows the error rates of our algorithm. Let’s examine what the column names mean.
MAE (Mean Absolute Error): The predicted value is subtracted from the actual value; after this is done for all predictions, the absolute differences are summed and divided by N (the number of predictions). Values are always positive because the absolute value is used.

MSE (Mean Squared Error): MSE measures the performance of an ML model (an estimator) by averaging the squared differences between actual and predicted values. It is always positive, and estimators with an MSE closer to zero perform better.

MdAE (Median Absolute Error): Median Absolute Error is calculated by taking the median of all absolute differences between the actual values and the estimated values. The MdAE value of the well-performing estimator takes values close to zero.
R2_Score: The best possible value of r2_score is 1; to get 1, all estimates must be correct. In our example, the value was 0.499, which means the regression algorithm we used did not learn our data set well. In this case, we need to try different algorithms until we get a result closer to 1.
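For reference, the standard formulas behind these metrics, where y_i are the actual values, \hat{y}_i the predictions, \bar{y} the mean of the actual values, and N the number of predictions:
\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2
\mathrm{MdAE} = \operatorname{median}\bigl(\lvert y_1 - \hat{y}_1 \rvert, \ldots, \lvert y_N - \hat{y}_N \rvert\bigr)
R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}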
Customer Segmentation Example

Using our data set, we create a customer segmentation example to increase a shopping center’s revenue.
Our data set has the following fields:
- CustomerID
- Gender
- Age
- Annual_Income (in thousands of dollars)
- Spending_Score — 0-100 points given to customers by the shopping center
First, we test our data set for null fields.
SELECT * FROM `staj-krtc.magaza.musteri` WHERE CustomerID IS NULL OR Gender IS NULL OR Age IS NULL OR Annual_Income IS NULL OR Spending_Score IS NULL
Our query returned no null fields. (We ran this query so that the null fields do not cause complexity while we train our data set. If there were null fields, we would have to clear them first.)
Let’s examine our data set a little; based on gender distribution, 42.6% of the customers are male, and 57.4% are female.

Average annual income by gender and customer points given by the shopping center according to purchases:

The average annual income for men is $57.2K, with an average spending score of 48.3.
The average annual income for women is $55.33K, with an average spending score of 51.2.
We understand from these values that women earn slightly less than men but shop more.
We can see this more clearly from the below graph:

We need to build a model to train our data set. To do this, we must first determine which fields to use. (Using wrong or unnecessary fields makes the model more complex.)
In this example, we use the annual income and the spending score, and we use the K-means algorithm. K-means is a clustering algorithm, so we need to specify how many clusters we want when creating the model; in our example, we chose 5 clusters.
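The model-creation query was shown as a screenshot in the original post; a sketch of what it likely looks like, assuming the column names listed above:
-- Hypothetical sketch: cluster customers into 5 groups by income and spending score.
CREATE MODEL `staj-krtc.magaza.segmentation_sample`
OPTIONS (
  model_type = 'KMEANS',
  num_clusters = 5
) AS
SELECT Annual_Income, Spending_Score
FROM `staj-krtc.magaza.musteri`;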

Visualization of the Model

We need to visualize the cluster results so that we can examine them more easily. We begin by typing the following query:
SELECT * FROM ML.CENTROIDS (MODEL `staj-krtc.magaza.segmentation_sample`)

ML.CENTROIDS: Returns the centroid information of the k-means model, i.e., the center value of each cluster.
- Centroid_id: ID of the cluster
- Feature: Name of feature column
- Numerical_value: Numerical value of the centroid.
To better compare clusters, let’s write our query using the UNNEST operator:

As a result of our query, we can see the result of our clustering process more clearly.

SELECT * FROM ML.PREDICT (MODEL `staj-krtc.magaza.segmentation_sample`, (select * from `staj-krtc.magaza.musteri`))
We write the above query and get an output that calculates the distance of each customer to the 5 clusters and shows which cluster the customer is closest to.
NEAREST_CENTROIDS_DISTANCE.CENTROID_ID: Cluster ID
NEAREST_CENTROIDS_DISTANCE.DISTANCE: Distance to the clusters
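This nested column can be flattened with UNNEST to get one row per customer-cluster pair; a sketch assuming the column names above:
-- Hypothetical sketch: one row per (customer, cluster) with the distance to that cluster.
SELECT
  CustomerID,
  d.centroid_id,
  d.distance
FROM ML.PREDICT(MODEL `staj-krtc.magaza.segmentation_sample`,
     (SELECT * FROM `staj-krtc.magaza.musteri`)),
  UNNEST(nearest_centroids_distance) AS d
ORDER BY CustomerID, d.distance;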
Since we did not specify a distance type when creating our model, k-means calculates the distances with the default Euclidean distance. Let’s move on to the visualization process.

Using Data Studio, we can visualize our clusters. First, we click the Discover with Data Studio button.
In the chart below, we can see the average annual revenue of the customers and their shopping center scores based on their clusters.


Let’s interpret our chart:
When we examine the cluster with Centroid id 2, we see that the annual income is very high, but the spending score is low.
When we examine the cluster with Centroid id 5, we see that both the annual income and the spending scores are very high.
When we examine the cluster with Centroid id 4, we see that both the annual income and spending scores are average.
When we examine the cluster with Centroid id 1, we see that the annual income and spending scores are low.
When we examine the cluster with Centroid id 3, we see that the annual income is low, whereas the spending score is very high.
Conclusion: When we examine these clusters, clusters 2 and 3 stand out the most. There seems to be little need to work on the customers in cluster 3, because they already have a high spending score despite their low income. On the other hand, we can focus on the customers in clusters 1, 4, 5, and 2 and encourage them to shop more.
Purchase Prediction Example

In this example, we use the data provided by the state of Iowa. You can access this data set from BigQuery examples (bigquery-public-data.iowa_liquor_sales.sales).
Our data set contains the following columns. We use the county_number, item_number, and pack fields when training the data set.
invoice_and_item_number, date, store_number, store_name, address, city, zip_code, store_location, county_number, county, category, category_name, vendor_number, vendor_name, item_number, item_description, pack, bottle_volume_ml, state_bottle_cost, state_bottle_retail, bottles_sold, sale_dollars, volume_sold_liters, volume_sold_gallons
Objective: To determine how many packages of any product can be sold in any district’s shop.
When creating our model, we use linear_reg as the algorithm.
First, we identify and clear the null fields in our data set, then create our model. (We explained fields like model_type and input_label_cols in the age prediction example.)
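The query itself was shown as a screenshot; a rough sketch of what such a model creation could look like, assuming (as the text above suggests) that pack serves as the label and the other two fields as features, with a placeholder model name:
-- Hypothetical sketch: train a linear regression on the Iowa liquor sales data,
-- filtering out rows with null values first.
CREATE MODEL `staj-krtc.magaza.purchase_prediction`
OPTIONS (
  model_type = 'LINEAR_REG',
  input_label_cols = ['pack']       -- assumption: pack is the label column
) AS
SELECT county_number, item_number, pack
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE county_number IS NOT NULL
  AND item_number IS NOT NULL
  AND pack IS NOT NULL;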

There are nearly 16 million records in our data set. Rather than splitting it into separate sets, we sorted the records from smallest to largest and trained our model on the first 8 million. We then tested the model over the whole data set. This is not recommended, because the training data is also in the test data set, so the score may come out higher than it should.


We tested our model and found its r2 score to be 0.23. The closer the r2 value is to 1, the more successful the model. It is not realistic to expect these values to be very high in real life, because sales are not determined only by the variables in our data set; factors beyond our control, such as the weather or newly released competitor products, also affect them.
Author: Furkan Özalp
Date Published: Sep 11, 2019
