Sakthi’s Blog

How Amazon Runs on MIS: The Data Systems, Architecture, and Algorithms Behind the World’s Most Efficient Company

2025-12-07T00:00:00+00:00

While working on another project recently, I ended up reading about how Amazon actually works internally — and it quickly became obvious that Amazon isn’t just an e-commerce platform.
It’s a gigantic, integrated Management Information System (MIS) that connects customers, warehouses, inventory, pricing, logistics, forecasting, and even executive decisions through data pipelines and algorithms.

The deeper you go, the more it becomes clear:

Amazon’s competitive advantage is an MIS advantage.

This blog breaks down the MIS architecture powering Amazon — from real-time transaction systems to massive data lakes and predictive decision-support engines.

1. Amazon’s MIS Architecture: A Layered System

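If we map Amazon’s internal technology to the MIS framework we study in class, the layers line up as transaction processing (TPS), the data backbone, MIS reporting, decision support (DSS), and executive support (ESS), each covered in the sections below.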

Amazon didn’t design these layers academically — they evolved naturally out of scale.
But the match is nearly perfect.

2. Transaction Processing Systems (TPS): The Real-Time Engine

Amazon runs millions of TPS events per minute:

Every product search
Every page view
Every “Add to Cart”
Every barcode scan in a warehouse
Every inventory movement
Every delivery update

These TPS events hit high-speed data stores and streaming services like:

Amazon DynamoDB (for low-latency key-value reads)
Aurora & RDS (for transactional SQL)
Amazon Kinesis (for high-volume event streams)

TPS is the foundation because everything Amazon does is time-sensitive:
If a warehouse worker picks an item, inventory updates instantly; if a user searches for a product, recommendations update instantly.

🟡 MIS takeaway:
TPS enables Amazon’s MIS to have real-time accuracy, not daily or weekly reporting.

3. The Data Backbone: S3 Data Lake + Redshift Warehouse

Behind the scenes, Amazon has a two-part analytics backbone:

A) Data Lake (Amazon S3)

Stores raw logs from:

Website clicks
Order histories
Warehouse scanner logs
IoT sensors on robots
Delivery GPS data
Supplier feeds
Customer service transcripts

This is petabyte-scale data.

B) Data Warehouse (Amazon Redshift)

Redshift performs:

Sales analysis
Forecasting
Inventory planning
Profitability reporting
Cohort analysis
Pricing optimization

The data lake → warehouse pipeline uses:

AWS Glue (ETL)
Athena (interactive querying)
EMR (big data processing)

🟡 MIS takeaway:
This architecture gives Amazon a single source of truth for managerial reporting and decision-making.

4. MIS Layer: Dashboards, Monitoring and Operational Control

Once data is processed, it flows into Amazon’s MIS dashboards.

These dashboards are used by:

Category managers
Supply chain planners
Inbound/outbound operations teams
Delivery station managers
Finance
Vendor managers
Marketplace teams

Examples of MIS reports:

1. Inventory Health Dashboards

Shows:

Sell-through rate
Excess inventory
Out-of-stock risk
Aging inventory
Safety stock levels
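As a rough illustration of how two of these metrics could be computed, here is a minimal R sketch. The data frame, its column names, and the formulas (sell-through as units sold divided by units received, safety stock as z times the standard deviation of daily demand times the square root of lead time) are textbook assumptions, not Amazon’s actual definitions.

```r
# Hypothetical SKU-level data; every value is made up for illustration.
inventory <- data.frame(
  sku             = c("A", "B", "C"),
  units_received  = c(100, 200, 50),
  units_sold      = c(80, 50, 50),
  daily_demand_sd = c(4, 10, 2),
  lead_time_days  = c(9, 4, 16)
)

# Sell-through rate: share of received units that were sold.
inventory$sell_through <- inventory$units_sold / inventory$units_received

# Textbook safety stock: z * sd(daily demand) * sqrt(lead time in days).
z <- qnorm(0.95)  # assumed ~95% service level
inventory$safety_stock <- z * inventory$daily_demand_sd *
  sqrt(inventory$lead_time_days)

print(inventory[, c("sku", "sell_through", "safety_stock")])
```

A low sell-through next to a large on-hand quantity would flag excess inventory, while a high sell-through with stock below the safety level would flag out-of-stock risk.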

2. Supply Chain & Fulfillment Dashboards

Shows:

Picking/packing time
Dock-to-stock metrics
SLA compliance
Throughput per shift
Bottleneck alerts

3. Customer Experience Dashboards

Shows:

Late delivery rates
Cancellation rates
Return rates
Page load performance
Recommendation success rate

These dashboards are updated hourly or even in real time, not monthly like traditional MIS reports.

🟡 MIS takeaway:
Amazon’s MIS is a live operational cockpit, not a passive reporting system.

5. Decision Support Systems (DSS): Forecasting, Algorithms & Optimization

Amazon’s DSS layer is where the intelligence happens.

This includes:

1. Demand Forecasting Systems

Forecasts demand at the SKU × region × week level
Uses historical sales, seasonality, pricing, competitor trends

Amazon built custom forecasting systems, both internally and on AWS.
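As a toy illustration of the kind of time-series logic a demand forecast builds on, here is simple exponential smoothing in R. It is a textbook method chosen for brevity, not a description of Amazon’s models; the sales figures and the `ses_forecast` helper are hypothetical.

```r
# Hypothetical weekly unit sales for one SKU in one region (made-up numbers).
weekly_sales <- c(120, 135, 128, 150, 160, 155, 170, 180)

# Simple exponential smoothing: each new level blends the latest observation
# with the previous level; alpha controls how quickly old data is forgotten.
ses_forecast <- function(y, alpha = 0.3) {
  level <- y[1]
  for (obs in y[-1]) {
    level <- alpha * obs + (1 - alpha) * level
  }
  level  # one-step-ahead forecast
}

next_week <- ses_forecast(weekly_sales, alpha = 0.3)
next_week
```

With `alpha = 1` the forecast is just the last observed week; with `alpha = 0` it never moves from the first week. Real systems would add seasonality, price, and competitor signals on top of a baseline like this.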

2. Inventory Placement Algorithms

Predict where to store each product BEFORE it’s even ordered.

This is why Amazon can ship so fast — items are pre-positioned near likely buyers.

3. Dynamic Pricing Engine

Prices change based on:

Competitor prices
Inventory levels
Conversion probability
Sales velocity
Time-of-day patterns

4. Route Optimization for Delivery

Routing algorithms evaluate:

Traffic
Weather
Driver capacity
Delivery density

Amazon uses:

Amazon Logistics Routing Engine
Map-based ML models
DSP (delivery service provider) optimization tools

🟡 MIS takeaway:
DSS turns raw data into optimized decisions.
This is the “brains” of Amazon.

6. Executive Support Systems (ESS): Strategic MIS at Scale

At the top level, Amazon’s senior leadership uses MIS outputs to make:

New market entry decisions
Prime pricing changes
Infrastructure investment choices
Vendor negotiations
Long-term supply chain strategy

Key ESS tools include:

Enterprise financial dashboards
Corporate BI platforms
Multi-year trend analysis
Customer lifetime value models
High-level cohort insights

ESS gives a bird’s-eye view of the whole ecosystem.

🟡 MIS takeaway:
Amazon’s “Day 1” philosophy is driven by data — ESS ensures leaders have high-quality information to stay agile.

7. How All These Systems Connect: A Simplified Data Flow

Every part of Amazon — from a warehouse picker to the CEO — is looking at different layers of the same integrated MIS ecosystem.

8. Why Amazon’s MIS Gives It an Unfair Competitive Advantage

1️⃣ Speed

Decisions are made based on up-to-the-minute data.

2️⃣ Predictive Intelligence

Amazon forecasts what customers are likely to want before they order it.

3️⃣ Scale

Systems are built on AWS, so capacity can scale elastically with demand.

4️⃣ Integration

Every part of the value chain talks to every other part.

5️⃣ Automation

Humans don’t decide most operational tasks — algorithms do.

This is MIS at its absolute maximum potential.

9. Final Thoughts: Amazon as an MIS Success Story

If you strip away the brand, the website, the fast delivery and Prime…

Amazon is essentially a giant MIS.
Every competitive edge it has — speed, accuracy, customer obsession, low prices — is enabled by information systems and real-time decision architecture.

For students, analysts, or anyone in business/tech, studying Amazon gives us the clearest example of what a modern MIS can look like when it is:

Vast in scale
Deeply integrated
Real-time
Predictive
Automated
Relentlessly optimized

And that’s why Amazon remains one of the best MIS case studies of the 21st century.

Virtual Organization and the Flattening of Management: What MIS Enables

2025-11-02T00:00:00+00:00

A blog post on how IT enables decentralized decision-making and post-pandemic collaboration models.

Introduction

The 21st-century firm is no longer confined to glass offices and fixed hierarchies.
Cloud collaboration tools, real-time dashboards, and AI-driven MIS have flattened organizations — pushing decision-making to the edge.
Employees now manage data, processes, and innovation directly through systems — not through layers of supervision.
According to Laudon & Laudon’s MIS framework, IT reduces transaction and agency costs, enabling organizations to operate with fewer management layers and greater autonomy.

The Concept of the “Flattened” Organization

Flattening refers to reducing the vertical hierarchy — fewer managers, broader spans of control.
MIS automates information flow → fewer intermediaries needed to gather and report data.
With digital dashboards, analytics, and collaborative tools, frontline employees can access the same insights executives see.
Example: In GitLab’s all-remote setup, developers, designers, and marketers access shared dashboards and OKR boards — no need to “wait for approval loops.”
This structure promotes agility, transparency, and accountability.

Virtual Organizations – Beyond Geography

A virtual organization operates through digital linkages rather than physical proximity.
It’s a network of individuals and teams connected through MIS platforms — Slack, Asana, Jira, Notion, or custom ERP dashboards.
Virtual setups allow:
- Cross-time-zone workflows
- Access to global talent
- Real-time updates and version control
MIS integrates communication (Zoom, Teams) + coordination (Asana, Trello) + data (Power BI, Tableau) to maintain organizational coherence.

Example:

GitLab, a fully remote company with 2,000+ employees across 60+ countries, relies on its open-source MIS stack — issue trackers, analytics boards, and handbooks — to function without any offices.
Asana enables similar cross-time-zone task visibility with analytics integrations that feed directly into management dashboards.

The Role of MIS in Enabling Decentralized Decision-Making

MIS serves as the digital nervous system of modern firms.
It provides:
- Shared databases → everyone accesses current, accurate data.
- Real-time analytics dashboards → decision support for all levels.
- Workflow automation → reduced dependency on manual reporting.
As a result:
- Managers act as facilitators, not controllers.
- Employees take initiative using transparent data insights.
- Decisions happen closer to the problem source.

Example Systems:

ERP (SAP, Odoo) – integrates departments for transparency.
BI Tools (Power BI, Tableau) – democratize analytics.
Project MIS (Asana, ClickUp) – merge operations and metrics.

Challenges and Counterpoints

Information Overload: Too much access can confuse priorities.
Cultural Gaps: Flat, virtual systems require strong digital etiquette.
Security Risks: Decentralized systems widen the attack surface.
Coordination Complexity: Without structured roles, accountability may blur.

Organizations need governance frameworks within MIS to balance openness with control.

Leadership in the Age of Flattened Hierarchies

Leaders evolve from “commanders” to coaches and connectors.
Key leadership traits in virtual organizations:
- Digital literacy & tool fluency
- Data-driven empathy (understanding through analytics, not assumptions)
- Transparency & trust-based accountability
- Comfort with asynchronous communication
MIS helps track performance objectively — but leadership ensures meaning and motivation stay human.

“The best leaders today manage through information, not through proximity.”

Conclusion – The Future Organization Is Flat, Fast, and Fluid

MIS + Cloud + AI have made geography irrelevant and hierarchies optional.
Tomorrow’s firms will function as living ecosystems — nodes of collaboration powered by information systems.
The challenge is not implementing more technology, but learning to lead effectively through it.

Sustainability, Leadership & Performance: 5‑Year Analytics

2025-09-14T00:00:00+00:00

Sakthi Swaroopan S - CB.BU.P2ASB25147

0) Setup

# install.packages(c("readxl","dplyr","magrittr","factoextra","rattle","DT","psych","tibble","tidyr","janitor"))


library(readxl)
library(dplyr)
library(magrittr)
library(factoextra)
library(rattle)
library(DT)
library(psych)
library(tibble)
library(tidyr)


set.seed(42)


# ---- Paths ----
DATA_PATH <- "/ESG_Dataset_Sakthi.xlsx"
SHEET_NAME <- "Sheet1"

1) Importing the data

Data was collected from annual reports sourced from NSE.

raw <- read_excel(DATA_PATH, sheet = SHEET_NAME) %>%
  janitor::clean_names() # using janitor fully qualified (not attaching)


# Expected columns after clean_names():
# company_name, year, carbon_emissions, energy_consumption, employee_turnover,
# roa, roe, industry_type, location, ceo_name, ceo_gender

2) Exploratory analysis

# Structure and a peek
str(raw)

## tibble [48 × 11] (S3: tbl_df/tbl/data.frame)
##  $ company_name      : chr [1:48] "Sona BLW Percision forgings ltd" "Sona BLW Percision forgings ltd" "Sona BLW Percision forgings ltd" "Sona BLW Percision forgings ltd" ...
##  $ year              : num [1:48] 2021 2022 2023 2024 2025 ...
##  $ carbon_emissions  : chr [1:48] "32756" "40330" "48468" "58317" ...
##  $ energy_consumption: chr [1:48] "41800.07" "52308.14" "311100" "358157" ...
##  $ employee_turnover : chr [1:48] "7.6499999999999999E-2" "0.11" "0.16" "0.13" ...
##  $ roa               : num [1:48] 0.1065 0.141 0.1303 0.1351 0.0941 ...
##  $ roe               : num [1:48] 0.153 0.179 0.172 0.188 0.107 ...
##  $ industry_type     : chr [1:48] "Automotive" "Automotive" "Automotive" "Automotive" ...
##  $ location          : chr [1:48] "Haryana" "Haryana" "Haryana" "Haryana" ...
##  $ ceo_name          : chr [1:48] "Vivek Vikram Singh" "Vivek Vikram Singh" "Vivek Vikram Singh" "Vivek Vikram Singh" ...
##  $ ceo_gender        : chr [1:48] "Male" "Male" "Male" "Male" ...

DT::datatable(head(raw, 20), options = list(pageLength = 10), caption = "Raw data (first 20 rows)")

# Simple counts
raw %>% count(company_name, sort = TRUE) %>% DT::datatable(caption = "Rows per company")

raw %>% count(year, sort = TRUE) %>% DT::datatable(caption = "Rows per year")

Data is uneven across companies and years; not all variables are consistently reported.
Some firms dominate in reporting while others have sparse records.

3) Pre‑processing the data

Rules applied

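Drop rows where Energy_Consumption == “Not Reported”, so that every remaining company-year is comparable on energy use.
Replace NA in Carbon_Emissions with 0. This is intended for non-manufacturing firms, but the code below applies it to every missing value, including any that coercion to numeric introduces.
Cast types for numeric columns; keep factors for categories.

Round numeric columns to two decimals for readability and consistency.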

df <- raw %>%
  # Normalize text placeholders to real NAs
  mutate(
    energy_consumption = dplyr::na_if(energy_consumption, "Not Reported"),
    carbon_emissions = dplyr::na_if(carbon_emissions, "NA")
  ) %>%
  # Coerce to appropriate types
  mutate(
    year = as.integer(year),
    industry_type = as.factor(industry_type),
    ceo_gender = factor(ceo_gender, levels = c("Male","Female","Other")),
    energy_consumption = suppressWarnings(as.numeric(energy_consumption)),
    carbon_emissions = suppressWarnings(as.numeric(carbon_emissions)),
    employee_turnover = suppressWarnings(as.numeric(employee_turnover)),
    roe = suppressWarnings(as.numeric(roe)),
    roa = suppressWarnings(as.numeric(roa))
  ) %>%
  # Apply the two cleaning rules
  filter(!is.na(energy_consumption)) %>% # drop Not Reported rows
  mutate(carbon_emissions = dplyr::coalesce(carbon_emissions, 0)) %>%
  mutate(across(c(carbon_emissions, energy_consumption, employee_turnover, roe, roa), ~ round(.x, 2))) %>%
  arrange(company_name, year)
    
    
# Quick sanity check
stopifnot(all(c("company_name","year","industry_type","ceo_gender",
"carbon_emissions","energy_consumption","employee_turnover","roe","roa") %in% names(df)))
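Because the pipeline wraps as.numeric() in suppressWarnings(), any unexpected text placeholder silently becomes NA. A small check like the one below can count how many values are lost that way before they are filtered or zero-filled. The helper name `count_coercion_nas` and the sample vector are hypothetical, written in base R only.

```r
# Count values that become NA only because of coercion
# (i.e. they were not missing in the raw text column).
count_coercion_nas <- function(x) {
  converted <- suppressWarnings(as.numeric(x))
  sum(is.na(converted) & !is.na(x))
}

# Hypothetical raw column mixing numbers, a true NA, and text placeholders.
energy_raw <- c("41800.07", "Not Reported", "311100", NA, "n/a")

count_coercion_nas(energy_raw)
```

Running it on each raw character column before cleaning would show whether placeholders other than "Not Reported" and "NA" are present.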

4) Preview of data before analysis

DT::datatable(head(df, 20), options = list(pageLength = 10), caption = "Cleaned data (first 20 rows)")

5) Exploratory analysis (post‑clean)

# Numeric summary by year
year_summary <- df %>%
  group_by(year) %>%
  summarise(
    n = dplyr::n(),
    mean_emissions = mean(carbon_emissions, na.rm = TRUE),
    mean_energy = mean(energy_consumption, na.rm = TRUE),
    mean_turnover = mean(employee_turnover, na.rm = TRUE),
    mean_roe = mean(roe, na.rm = TRUE),
    mean_roa = mean(roa, na.rm = TRUE)
  )
DT::datatable(year_summary, caption = "Year‑wise summary")

# Correlation matrix (pooled numeric)
num_cols <- c("carbon_emissions","energy_consumption","employee_turnover","roe","roa")
cor_mat <- stats::cor(df[, num_cols], use = "pairwise.complete.obs")
cor_mat <- round(cor_mat, 2)
cor_mat

##                    carbon_emissions energy_consumption employee_turnover   roe   roa
## carbon_emissions               1.00              -0.15             -0.46  0.03  0.13
## energy_consumption            -0.15               1.00              0.04 -0.22 -0.14
## employee_turnover             -0.46               0.04              1.00 -0.24 -0.22
## roe                            0.03              -0.22             -0.24  1.00  0.94
## roa                            0.13              -0.14             -0.22  0.94  1.00

# helper function to convert psych::describe output into vertical key-value tibble
describe_long <- function(x) {
  psych::describe(x) %>%
    as_tibble() %>%
    select(-vars, -n) %>%   # drop unneeded cols (keep stats)
    pivot_longer(cols = everything(),
                 names_to = "Metric",
                 values_to = "Value")
}

# Now display each variable as vertical DT
DT::datatable(describe_long(df["carbon_emissions"]), caption = "Carbon Emissions")

DT::datatable(describe_long(df["energy_consumption"]), caption = "Energy Consumption")

DT::datatable(describe_long(df["employee_turnover"]), caption = "Employee Turnover")

DT::datatable(describe_long(df["roe"]), caption = "ROE")

DT::datatable(describe_long(df["roa"]), caption = "ROA")

Average ROE and ROA vary noticeably from year to year.
Turnover shows a weak negative correlation with profitability (r = -0.24 with ROE, r = -0.22 with ROA), while ROE and ROA move almost together (r = 0.94).

Energy consumption and emissions remain volatile, without a steady downward trend.
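With a sample this small, a pooled correlation can look meaningful without being distinguishable from zero. The sketch below computes the usual two-sided p-value for a Pearson correlation from r and the number of pairs. The helper `cor_p_value` is hypothetical, and `n_assumed = 40` is only a placeholder for the real number of cleaned rows, which is not shown in the output above.

```r
# Two-sided p-value for a Pearson correlation r computed from n pairs,
# using the t statistic r * sqrt((n - 2) / (1 - r^2)) with n - 2 df.
cor_p_value <- function(r, n) {
  t_stat <- r * sqrt((n - 2) / (1 - r^2))
  2 * pt(-abs(t_stat), df = n - 2)
}

n_assumed <- 40  # hypothetical; substitute nrow(df) after cleaning

cor_p_value(-0.24, n_assumed)  # turnover vs ROE
cor_p_value(0.94, n_assumed)   # ROE vs ROA
```

Under that assumed n, a correlation of -0.24 would not reach the 5% level, whereas 0.94 clearly would. On the actual data, stats::cor.test() gives the same test directly from the two columns.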

6) Analysis

6.1) Company Performance Snapshot

company_summary <- df %>%
  group_by(company_name) %>%
  summarise(
    avg_emissions   = mean(carbon_emissions, na.rm = TRUE),
    avg_energy      = mean(energy_consumption, na.rm = TRUE),
    avg_turnover    = mean(employee_turnover, na.rm = TRUE),
    avg_roe         = mean(roe, na.rm = TRUE),
    avg_roa         = mean(roa, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_roe))

DT::datatable(
  company_summary,
  caption = "5-Year Company Performance Snapshot (Averages)",
  options = list(pageLength = 10)
)

Some firms balance profitability with efficiency, while others underperform despite high energy/emissions.

6.2) Year-over-year Trends

# Recompute to keep the section self-contained
yoy <- df %>%
  group_by(year) %>%
  summarise(
    mean_emissions = mean(carbon_emissions, na.rm = TRUE),
    mean_energy    = mean(energy_consumption, na.rm = TRUE),
    mean_turnover  = mean(employee_turnover, na.rm = TRUE),
    mean_roe       = mean(roe, na.rm = TRUE),
    mean_roa       = mean(roa, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(year)

DT::datatable(
  yoy,
  caption = "Year-wise Averages (Sustainability & Profitability)"
)

# Emissions trend
plot(
  yoy$year, yoy$mean_emissions, type = "b",
  xlab = "Year", ylab = "Avg Carbon Emissions",
  main = "Trend: Average Carbon Emissions by Year"
)

# ROE trend
plot(
  yoy$year, yoy$mean_roe, type = "b",
  xlab = "Year", ylab = "Avg ROE",
  main = "Trend: Average ROE by Year"
)

# ROA trend
plot(
  yoy$year, yoy$mean_roa, type = "b",
  xlab = "Year", ylab = "Avg ROA",
  main = "Trend: Average ROA by Year"
)

Modest improvement in profitability, but sustainability indicators (emissions, energy) do not show consistent decline.

6.3) Comparative Plots

# Helper to draw scatter + OLS line + correlation 
make_scatter <- function(x, y, xlab, ylab, title) {
  ok <- is.finite(x) & is.finite(y)
  plot(x[ok], y[ok],
       xlab = xlab, ylab = ylab,
       main = title, pch = 19)
  # OLS line
  fit <- stats::lm(y[ok] ~ x[ok])
  abline(fit, lwd = 2)
  # Pearson correlation
  r <- stats::cor(x[ok], y[ok], method = "pearson")
  legend("topleft", bty = "n",
         legend = paste0("r = ", round(r, 2)))
}

# Emissions vs ROE
make_scatter(
  x = df$carbon_emissions, y = df$roe,
  xlab = "Carbon Emissions", ylab = "ROE",
  title = "Carbon Emissions vs ROE"
)

# Energy vs ROA
make_scatter(
  x = df$energy_consumption, y = df$roa,
  xlab = "Energy Consumption", ylab = "ROA",
  title = "Energy Consumption vs ROA"
)

# Turnover vs ROE
make_scatter(
  x = df$employee_turnover, y = df$roe,
  xlab = "Employee Turnover", ylab = "ROE",
  title = "Employee Turnover vs ROE"
)

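The negative relationship between turnover and ROE is visible but weak (r = -0.24).
Energy consumption shows a slight negative relationship with returns (r = -0.22 with ROE, r = -0.14 with ROA), while emissions are essentially uncorrelated with ROE (r = 0.03) and slightly positively correlated with ROA (r = 0.13).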

6.4) Clustering

clust_df <- df %>%
  group_by(company_name, industry_type) %>%
  summarise(
    carbon_emissions   = mean(carbon_emissions,   na.rm = TRUE),
    energy_consumption = mean(energy_consumption, na.rm = TRUE),
    employee_turnover  = mean(employee_turnover,  na.rm = TRUE),
    roe                = mean(roe,                na.rm = TRUE),
    roa                = mean(roa,                na.rm = TRUE),
    .groups = "drop"
  )

# Variables used for clustering (inputs)
vars_used <- c("carbon_emissions", "energy_consumption",
               "employee_turnover", "roe", "roa")
DT::datatable(
  tibble::tibble(Variables_Used = vars_used),
  caption = "Variables used for clustering (company-level averages)"
)

# Prepare numeric matrix and scale
X <- clust_df %>% dplyr::select(all_of(vars_used)) %>% as.data.frame()
X_scaled <- scale(X)

# choose K (WSS / Silhouette)
factoextra::fviz_nbclust(X_scaled, kmeans, method = "wss", k.max = 8)

factoextra::fviz_nbclust(X_scaled, kmeans, method = "silhouette", k.max = 8)

# Fit K-means (set k after inspecting above plots)
k <- 3
km <- stats::kmeans(X_scaled, centers = k, nstart = 50)

# Attach cluster ids to companies
clust_out <- clust_df %>% mutate(cluster = factor(km$cluster))
DT::datatable(clust_out, caption = "Company → Cluster assignments")

# Companies in each cluster (compact list)
companies_by_cluster <- clust_out %>%
  group_by(cluster) %>%
  summarise(companies = paste(company_name, collapse = ", "), .groups = "drop")
DT::datatable(companies_by_cluster, caption = "Companies in each cluster")

# PCA for labeled visualization (labels = company names)
rownames(X_scaled) <- clust_df$company_name
pca_obj <- stats::prcomp(X_scaled, center = FALSE, scale. = FALSE)  # already scaled

factoextra::fviz_pca_ind(
  pca_obj,
  geom = "point",
  habillage = clust_out$cluster,   # color by cluster
  addEllipses = FALSE,             # avoid ellipse warnings for small clusters
  label = "all",                   # show company labels
  repel = TRUE,                    # nicer label placement
  title = "Company Segments (PCA with labels)"
)

# PCA loadings table to interpret Dim1/Dim2 drivers
loadings <- tibble::as_tibble(pca_obj$rotation[, 1:2], rownames = "variable")
colnames(loadings) <- c("variable", "Dim1_loading", "Dim2_loading")
DT::datatable(loadings, caption = "PCA loadings (which variables drive Dim1/Dim2)")

# Cluster-level profiles (means of original features)
cluster_profiles <- clust_out %>%
  group_by(cluster) %>%
  summarise(across(all_of(vars_used), ~ mean(.x, na.rm = TRUE)), .groups = "drop")
DT::datatable(cluster_profiles, caption = "Cluster profiles (feature means)")

Three groups emerge:
- Efficient & Profitable (low emissions, higher returns).
- Transitioners (partial improvements).
- Underperformers (high energy/emissions, low returns).

The automotive company stands out with both high emissions and high profitability.
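The WSS (elbow) criterion used above to choose k can be reproduced with base R, which makes it easier to see what the plot is measuring. The sketch below uses simulated data with three well-separated groups; the `toy` matrix is hypothetical and stands in for the scaled company-level averages.

```r
set.seed(42)

# Hypothetical data: 12 "companies" with 2 features, drawn around 3 centers.
toy <- rbind(
  matrix(rnorm(8, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(8, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(8, mean = 6, sd = 0.3), ncol = 2)
)
toy_scaled <- scale(toy)

# Total within-cluster sum of squares for k = 1..5.
wss <- sapply(1:5, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 25)$tot.withinss
})

round(wss, 2)
```

The "elbow" is the k after which WSS stops dropping sharply. With only a handful of companies, that elbow can shift with a single outlier, so the choice of k = 3 is best read as a working assumption.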

6.5) Predictive analysis

6.5.1) Company-wise Prediction

# Filter company
df_c <- df %>% filter(company_name == "Artemis Medicare Services Ltd.")

# Train/test split (latest year as test, earlier years as train)
train <- df_c %>% filter(year < max(year))
test  <- df_c %>% filter(year == max(year))

# Model: ROE as dependent, predictors = sustainability + turnover + time
model <- lm(roe ~ carbon_emissions + energy_consumption + employee_turnover + year,
            data = train)

summary(model)

## 
## Call:
## lm(formula = roe ~ carbon_emissions + energy_consumption + employee_turnover + 
##     year, data = train)
## 
## Residuals:
## ALL 3 residuals are 0: no residual degrees of freedom!
## 
## Coefficients: (2 not defined because of singularities)
##                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)        1.006e-01        NaN     NaN      NaN
## carbon_emissions   8.860e-07        NaN     NaN      NaN
## energy_consumption 4.147e-07        NaN     NaN      NaN
## employee_turnover         NA         NA      NA       NA
## year                      NA         NA      NA       NA
## 
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared:      1,	Adjusted R-squared:    NaN 
## F-statistic:   NaN on 2 and 0 DF,  p-value: NA

# Predict on test
test$pred_roe <- predict(model, newdata = test)

# Metrics
rmse <- function(a, p) sqrt(mean((a - p)^2, na.rm = TRUE))
mae  <- function(a, p) mean(abs(a - p), na.rm = TRUE)
r2   <- function(a, p) 1 - sum((a - p)^2, na.rm = TRUE) /
                      sum((a - mean(a, na.rm = TRUE))^2, na.rm = TRUE)

cat("RMSE:", rmse(test$roe, test$pred_roe), "\n")

## RMSE: 0.002823746

cat("MAE :", mae(test$roe,  test$pred_roe), "\n")

## MAE : 0.002823746

cat("R^2 :", r2(test$roe,   test$pred_roe), "\n")

## R^2 : -Inf

# Example: Artemis Medicare Services Ltd.
df_c <- df %>% filter(company_name == "Artemis Medicare Services Ltd.")

# Set up 1 row, 3 columns layout
par(mfrow = c(1, 3))

# 1. Carbon Emissions vs ROE
plot(df_c$carbon_emissions, df_c$roe,
     xlab = "Carbon Emissions", ylab = "ROE",
     main = "Emissions vs ROE",
     pch = 19, col = "blue")
abline(lm(roe ~ carbon_emissions, data = df_c), col = "red", lwd = 2)

# 2. Energy Consumption vs ROE
plot(df_c$energy_consumption, df_c$roe,
     xlab = "Energy Consumption", ylab = "ROE",
     main = "Energy vs ROE",
     pch = 19, col = "darkgreen")
abline(lm(roe ~ energy_consumption, data = df_c), col = "red", lwd = 2)

# 3. Employee Turnover vs ROE
plot(df_c$employee_turnover, df_c$roe,
     xlab = "Employee Turnover", ylab = "ROE",
     main = "Turnover vs ROE",
     pch = 19, col = "purple")
abline(lm(roe ~ employee_turnover, data = df_c), col = "red", lwd = 2)

# Reset to default
par(mfrow = c(1,1))

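The per-company model collapses because there are too few years per company: three training rows cannot identify five coefficients, so two are dropped as NA, the fit is exact (zero residual degrees of freedom), and the remaining coefficients are meaningless. R^2 on the single test year is -Inf because its denominator is zero when there is only one observation.
This highlights the data depth problem in company-level analytics.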
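The collapse is easy to reproduce on made-up numbers. The sketch below fits the same kind of model on a hypothetical three-row training set (`train_toy` and its values are invented) and checks the three symptoms seen in the summary above.

```r
# Hypothetical 3-year training set for one company (made-up numbers).
train_toy <- data.frame(
  year      = c(2021, 2022, 2023),
  emissions = c(100, 120, 150),
  energy    = c(900, 950, 1200),
  turnover  = c(0.10, 0.12, 0.11),
  roe       = c(0.08, 0.09, 0.07)
)

# Five coefficients (intercept + 4 slopes) but only three rows.
fit <- lm(roe ~ emissions + energy + turnover + year, data = train_toy)

df.residual(fit)          # residual degrees of freedom: 0
sum(is.na(coef(fit)))     # coefficients that could not be estimated: 2
max(abs(residuals(fit)))  # effectively zero: the fit passes through every point

# With a single test observation the R^2 denominator is zero.
a <- 0.10   # hypothetical actual ROE
p <- 0.097  # hypothetical predicted ROE
1 - sum((a - p)^2) / sum((a - mean(a))^2)  # -Inf
```

Any model with at least as many coefficients as rows will show this pattern, so a per-company regression needs either many more years or far fewer predictors.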

6.5.2) Collective Model

# Train/test split: use <=2023 for training, >=2024 for testing
train <- df %>% filter(year <= 2023)
test  <- df %>% filter(year >= 2024)

# --- ROE model (main dependent variable) ---
model_roe <- lm(
  roe ~ carbon_emissions + energy_consumption + employee_turnover +
        industry_type + ceo_gender + year,
  data = train
)

summary(model_roe)

## 
## Call:
## lm(formula = roe ~ carbon_emissions + energy_consumption + employee_turnover + 
##     industry_type + ceo_gender + year, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.17750 -0.07029 -0.00265  0.02201  0.39258 
## 
## Coefficients:
##                                                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                    -2.063e+01  1.320e+02  -0.156    0.878
## carbon_emissions                                4.553e-06  6.610e-06   0.689    0.504
## energy_consumption                             -2.880e-09  4.787e-09  -0.602    0.559
## employee_turnover                              -2.251e-01  4.274e-01  -0.527    0.608
## industry_typeHealthcare                         1.937e-01  2.263e-01   0.856    0.409
## industry_typePharmaceuticals and Biotechnology  2.534e-01  2.857e-01   0.887    0.392
## ceo_genderFemale                               -1.680e-02  1.198e-01  -0.140    0.891
## year                                            1.021e-02  6.532e-02   0.156    0.878
## 
## Residual standard error: 0.1561 on 12 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.1638,	Adjusted R-squared:  -0.324 
## F-statistic: 0.3358 on 7 and 12 DF,  p-value: 0.9222

# Predictions
test$pred_roe <- predict(model_roe, newdata = test)

# --- ROA model (secondary diagnostic) ---
model_roa <- lm(
  roa ~ carbon_emissions + energy_consumption + employee_turnover +
        industry_type + ceo_gender + year,
  data = train
)

summary(model_roa)

## 
## Call:
## lm(formula = roa ~ carbon_emissions + energy_consumption + employee_turnover + 
##     industry_type + ceo_gender + year, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.14946 -0.04262 -0.00180  0.02661  0.34493 
## 
## Coefficients:
##                                                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                    -2.239e+01  1.084e+02  -0.207    0.840
## carbon_emissions                                4.189e-06  5.430e-06   0.771    0.455
## energy_consumption                             -1.371e-09  3.932e-09  -0.349    0.733
## employee_turnover                              -5.568e-02  3.511e-01  -0.159    0.877
## industry_typeHealthcare                         1.465e-01  1.859e-01   0.788    0.446
## industry_typePharmaceuticals and Biotechnology  1.429e-01  2.347e-01   0.609    0.554
## ceo_genderFemale                               -2.333e-02  9.838e-02  -0.237    0.817
## year                                            1.106e-02  5.366e-02   0.206    0.840
## 
## Residual standard error: 0.1282 on 12 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.1457,	Adjusted R-squared:  -0.3527 
## F-statistic: 0.2923 on 7 and 12 DF,  p-value: 0.9443

test$pred_roa <- predict(model_roa, newdata = test)

# --- Model performance metrics ---
rmse <- function(a, p) sqrt(mean((a - p)^2, na.rm = TRUE))
mae  <- function(a, p) mean(abs(a - p), na.rm = TRUE)
r2   <- function(a, p) 1 - sum((a - p)^2, na.rm = TRUE) /
                      sum((a - mean(a, na.rm = TRUE))^2, na.rm = TRUE)

metrics <- tibble::tibble(
  Metric = c("RMSE", "MAE", "R^2"),
  ROE    = c(rmse(test$roe, test$pred_roe),
             mae(test$roe,  test$pred_roe),
             r2(test$roe,   test$pred_roe)),
  ROA    = c(rmse(test$roa, test$pred_roa),
             mae(test$roa,  test$pred_roa),
             r2(test$roa,   test$pred_roa))
)

DT::datatable(metrics, caption = "Pooled Model Performance (ROE vs ROA)")

# --- Actual vs Predicted table for test set ---
test_results <- test %>%
  select(company_name, year,
         roe, pred_roe,
         roa, pred_roa)

DT::datatable(test_results, caption = "Test Set Predictions — Pooled Model")

# --- Partial effect plots for each predictor in ROE model ---

# 1. Carbon Emissions vs ROE
plot(train$carbon_emissions, train$roe,
     xlab = "Carbon Emissions", ylab = "ROE",
     main = "Effect of Carbon Emissions on ROE",
     pch = 19, col = "blue")

emm_em <- data.frame(
  carbon_emissions = seq(min(train$carbon_emissions, na.rm = TRUE),
                         max(train$carbon_emissions, na.rm = TRUE),
                         length.out = 100),
  energy_consumption = mean(train$energy_consumption, na.rm = TRUE),
  employee_turnover  = mean(train$employee_turnover, na.rm = TRUE),
  industry_type      = train$industry_type[1], # pick one level as baseline
  ceo_gender         = train$ceo_gender[1],
  year               = mean(train$year, na.rm = TRUE)
)

lines(emm_em$carbon_emissions,
      predict(model_roe, newdata = emm_em),
      col = "red", lwd = 2)

legend("topleft", legend = c("Actual Data", "Fitted Line"),
       col = c("blue","red"), pch = c(19, NA), lty = c(NA,1))

# 2. Energy Consumption vs ROE
plot(train$energy_consumption, train$roe,
     xlab = "Energy Consumption", ylab = "ROE",
     main = "Effect of Energy Consumption on ROE",
     pch = 19, col = "darkgreen")

emm_en <- data.frame(
  carbon_emissions   = mean(train$carbon_emissions, na.rm = TRUE),
  energy_consumption = seq(min(train$energy_consumption, na.rm = TRUE),
                           max(train$energy_consumption, na.rm = TRUE),
                           length.out = 100),
  employee_turnover  = mean(train$employee_turnover, na.rm = TRUE),
  industry_type      = train$industry_type[1],
  ceo_gender         = train$ceo_gender[1],
  year               = mean(train$year, na.rm = TRUE)
)

lines(emm_en$energy_consumption,
      predict(model_roe, newdata = emm_en),
      col = "red", lwd = 2)

legend("topleft", legend = c("Actual Data", "Fitted Line"),
       col = c("darkgreen","red"), pch = c(19, NA), lty = c(NA,1))

# 3. Employee Turnover vs ROE
plot(train$employee_turnover, train$roe,
     xlab = "Employee Turnover", ylab = "ROE",
     main = "Effect of Employee Turnover on ROE",
     pch = 19, col = "purple")

emm_to <- data.frame(
  carbon_emissions   = mean(train$carbon_emissions, na.rm = TRUE),
  energy_consumption = mean(train$energy_consumption, na.rm = TRUE),
  employee_turnover  = seq(min(train$employee_turnover, na.rm = TRUE),
                           max(train$employee_turnover, na.rm = TRUE),
                           length.out = 100),
  industry_type      = train$industry_type[1],
  ceo_gender         = train$ceo_gender[1],
  year               = mean(train$year, na.rm = TRUE)
)

lines(emm_to$employee_turnover,
      predict(model_roe, newdata = emm_to),
      col = "red", lwd = 2)

legend("topleft", legend = c("Actual Data", "Fitted Line"),
       col = c("purple","red"), pch = c(19, NA), lty = c(NA,1))
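The three plots above share one pattern: vary a single predictor over its range, hold the others fixed, and draw the model's predictions. Here is a minimal sketch of that pattern on synthetic data. The data frame `toy`, its columns `x1`, `x2`, `grp`, `y`, and the model `fit` are all hypothetical stand-ins, since the definition of `model_roe` is not shown in this section.

```r
set.seed(1)
# Synthetic stand-in data (hypothetical columns, not the sustainability data).
toy <- data.frame(
  x1  = runif(50, 0, 10),
  x2  = runif(50, 0, 5),
  grp = factor(sample(c("A", "B"), 50, replace = TRUE))
)
toy$y <- 2 + 0.5 * toy$x1 - 1 * toy$x2 + rnorm(50, sd = 0.5)

fit <- lm(y ~ x1 + x2 + grp, data = toy)

# Vary x1 over its observed range; hold x2 at its mean and grp at one fixed level.
grid <- data.frame(
  x1  = seq(min(toy$x1), max(toy$x1), length.out = 100),
  x2  = mean(toy$x2),
  grp = factor(levels(toy$grp)[1], levels = levels(toy$grp))
)
grid$fit <- predict(fit, newdata = grid)

# For a linear model, the slope of this line equals the x1 coefficient.
slope <- (grid$fit[100] - grid$fit[1]) / (grid$x1[100] - grid$x1[1])
```

One design note: the sketch fixes the factor with `levels(toy$grp)[1]`, which is the model's reference level. Taking the first row's value, as in `train$industry_type[1]`, fixes it at whichever level happens to appear first in the data. Either choice only shifts the line up or down in an additive model.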

The pooled regression is statistically valid, but its predictive accuracy remains weak.
It is still useful for identifying directional patterns: higher turnover and higher emissions are both associated with lower ROE.
This suggests that external factors (market shocks, policies, R&D) drive much of the unexplained variation.

7) Conclusion

Our analysis linked sustainability metrics (emissions, energy use, and employee turnover) to financial performance measures (ROE and ROA) across 10 companies over 5 years.
Company-level models failed due to very limited data (4–5 years per firm), showing why data depth matters in predictive analytics.
Pooled regression models were statistically valid but had weak predictive accuracy, highlighting the complex nature of ROE.
Despite poor prediction, the models provided directional insights:
- High employee turnover → consistently lower ROE/ROA.
- High emissions & energy intensity → generally linked with weaker returns.
Clustering analysis grouped firms into: efficient & profitable, underperformers, and transitioners — offering a method for strategic benchmarking.

8) Use of AI declaration

Declaration: AI tools were used only for grammatical refinement, formatting, and polishing tables and graphs. All analysis, data preparation, modeling choices, and interpretations are original work.

9) Data sources declaration

Annual reports: sourced from the NSE India website.
Sustainability reports: also sourced from the NSE India website.
Ratios: provided by Dion Solutions Ltd., available on MoneyControl.

10) Blog link

https://proplayerplayz.github.io

Customer Review Analysis

2025-08-26T00:00:00+00:00

1. Introduction

Customer reviews are a valuable source of information for business decisions.

We will use text mining to extract quantifiable information from the reviews for analysis.

2. Data Pre-processing

We have to convert the unstructured review text into a structured format before we can apply descriptive statistics.
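To make "structured format" concrete, here is a minimal base-R sketch that turns three made-up reviews into a document-term count matrix. The reviews in `texts_toy` are hypothetical. The sketch leaves out the stop-word removal and other cleaning that a text-mining package would normally apply, so it only illustrates the idea and does not reproduce the pipeline used here.

```r
# Hypothetical mini-reviews, used only for illustration.
texts_toy <- c("Great pizza, great sauce!",
               "The chicken alfredo was good.",
               "Good pizza and good service")

# Lower-case the text, strip punctuation, and split on whitespace.
tokens <- strsplit(gsub("[[:punct:]]", "", tolower(texts_toy)), "\\s+")

# Build the vocabulary, then count each term in each document.
vocab <- sort(unique(unlist(tokens)))
dtm_toy <- t(vapply(tokens,
                    function(tok) as.integer(table(factor(tok, levels = vocab))),
                    integer(length(vocab))))
dimnames(dtm_toy) <- list(paste0("doc", seq_along(texts_toy)), vocab)
print(dtm_toy)
```

Each row is now a document and each column a term, so ordinary numeric tools (sums, distances, clustering) can be applied to the text.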

##  [1] "building"           "corpus"             "cosine_dist_matrix" "cosine_distance"    "cosine_similarity" 
##  [6] "crs"                "crv"                "d"                  "ddata"              "denominator"       
## [11] "dist_obj"           "dtm"                "dtm_matrix"         "g"                  "groceries_data"    
## [16] "hclust_obj"         "m"                  "numerator"          "p01"                "p02"               
## [21] "p03"                "p04"                "p05"                "p06"                "p07"               
## [26] "reviews"            "rules"              "rules_conf"         "scoring"            "texts"             
## [31] "transactions_data"  "transactions_list"  "v"

dtm <- DocumentTermMatrix(corpus)
inspect(dtm)

## <<DocumentTermMatrix (documents: 21, terms: 185)>>
## Non-/sparse entries: 254/3631
## Sparsity           : 93%
## Maximal term length: 11
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs alfredo best chicken deep dish food good great pizza sauce
##   10       1    0       1    0    0    1    0     0     1     0
##   11       0    0       0    0    0    0    0     0     0     1
##   12       0    1       0    1    1    1    1     0     2     0
##   17       0    0       0    1    1    0    0     0     1     0
##   19       0    1       2    1    1    0    0     0     2     0
##   2        0    0       0    0    0    2    1     0     0     0
##   20       1    0       0    0    1    0    1     0     1     0
##   21       0    0       0    0    1    0    1     2     0     0
##   5        0    0       0    0    0    0    0     0     0     2
##   9        1    0       1    0    0    0    1     1     1     1

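library(slam)  # provides crossprod_simple_triplet_matrix() and col_sums()

# Term-by-term cosine similarity: dot products of term columns divided by the product of their norms
numerator <- crossprod_simple_triplet_matrix(dtm)
denominator <- sqrt(col_sums(dtm^2)) %*% t(sqrt(col_sums(dtm^2)))
cosine_similarity <- numerator / denominator
cosine_distance <- 1 - cosine_similarity  # 0 = terms occur in identical proportions across documents, 1 = never together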

cosine_dist_matrix <- as.matrix(cosine_distance)
print(round(cosine_dist_matrix, 2))

##              Terms
## Terms         absolutely  add alfredo also although amazing appetizers atmosphere attitudes bad barely best better butter
##   absolutely        0.00 1.00    0.59 1.00     1.00    1.00       1.00       0.29         1   1   1.00 0.65   0.29   1.00
##   add               1.00 0.00    1.00 1.00     1.00    1.00       1.00       1.00         1   1   1.00 1.00   1.00   1.00
##   alfredo           0.59 1.00    0.00 1.00     0.42    1.00       1.00       1.00         1   1   1.00 1.00   0.42   0.42
##   also              1.00 1.00    1.00 0.00     1.00    1.00       1.00       1.00         1   1   1.00 0.65   1.00   1.00
##   although          1.00 1.00    0.42 1.00     0.00    1.00       1.00       1.00         1   1   1.00 1.00   1.00   0.00
##              Terms
## Terms         caccatore called cheese cheeses chew chicago chicken choose classic comes complaints cooked creamy crepe cute  day
##   absolutely       1.00   1.00   1.00    1.00 1.00    1.00    1.00   1.00    1.00  1.00       1.00   1.00   1.00  1.00 0.29 1.00
##   add              1.00   1.00   0.29    1.00 1.00    1.00    1.00   1.00    1.00  0.00       1.00   1.00   1.00  1.00 1.00 1.00
##   alfredo          1.00   1.00   1.00    1.00 1.00    1.00    0.53   1.00    1.00  1.00       1.00   1.00   1.00  1.00 1.00 1.00
##   also             1.00   0.29   1.00    1.00 1.00    0.29    1.00   1.00    1.00  1.00       1.00   1.00   1.00  1.00 1.00 1.00
##   although         1.00   1.00   1.00    1.00 1.00    1.00    0.59   1.00    1.00  1.00       1.00   1.00   1.00  1.00 1.00 1.00
##              Terms
## Terms         deep definite definitely delicious deterrent diamond  die dish dont  dry entire ever excellent eyes famous
##   absolutely  1.00     1.00       1.00      1.00      1.00    1.00 1.00 0.68 1.00 1.00   1.00 1.00      1.00 1.00   1.00
##   add         1.00     1.00       0.00      1.00      1.00    1.00 1.00 0.55 0.00 1.00   0.00 1.00      1.00 1.00   1.00
##   alfredo     1.00     1.00       1.00      1.00      1.00    1.00 1.00 0.74 1.00 0.42   1.00 1.00      1.00 1.00   1.00
##   also        0.59     1.00       1.00      1.00      1.00    1.00 1.00 0.68 1.00 1.00   1.00 0.50      1.00 1.00   1.00
##   although    1.00     1.00       1.00      1.00      1.00    1.00 1.00 1.00 1.00 1.00   1.00 1.00      1.00 1.00   1.00
##              Terms
## Terms         fantastic fantastico favorite fettuccine fighting filthy five flavorful food fooddo forever fourty fresh friend
##   absolutely       0.29       1.00     1.00       0.29     1.00   1.00 1.00      1.00 0.76   1.00    1.00   1.00  1.00   1.00
##   add              1.00       1.00     1.00       1.00     1.00   1.00 1.00      1.00 1.00   1.00    1.00   1.00  1.00   1.00
##   alfredo          1.00       1.00     1.00       0.42     1.00   1.00 1.00      0.42 0.81   1.00    1.00   1.00  0.59   1.00
##   also             1.00       1.00     1.00       1.00     0.29   1.00 0.50      1.00 0.53   1.00    1.00   0.29  1.00   1.00
##   although         1.00       1.00     1.00       1.00     1.00   1.00 1.00      0.00 0.67   1.00    1.00   1.00  0.29   1.00
##              Terms
## Terms         friendly garden garlic gnocchi going good  got great happy heard hiking home homemade house however huge including
##   absolutely      1.00   0.29   1.00    1.00     1 0.79 1.00  0.75  1.00  1.00   1.00 1.00     1.00  1.00    1.00 1.00      1.00
##   add             1.00   1.00   1.00    1.00     1 0.70 1.00  0.29  1.00  1.00   1.00 1.00     1.00  1.00    1.00 1.00      1.00
##   alfredo         1.00   0.42   0.42    1.00     1 0.65 1.00  0.80  1.00  0.42   1.00 1.00     1.00  1.00    0.42 1.00      1.00
##   also            1.00   1.00   1.00    1.00     1 0.36 0.29  0.75  1.00  1.00   1.00 1.00     1.00  0.29    1.00 1.00      1.00
##   although        1.00   1.00   0.00    1.00     1 1.00 1.00  1.00  1.00  1.00   1.00 1.00     1.00  1.00    0.00 1.00      1.00
##              Terms
## Terms         instead italian item  ive just lasagna lasagne left like linguini little  lol long loud lousy love lovers made
##   absolutely     1.00    1.00 1.00 0.29 1.00    1.00    0.29 1.00 1.00     1.00   1.00 0.29 1.00 1.00     1 1.00   1.00 1.00
##   add            1.00    1.00 1.00 1.00 1.00    0.11    1.00 1.00 0.00     1.00   1.00 1.00 1.00 1.00     1 1.00   1.00 0.42
##   alfredo        1.00    1.00 1.00 0.42 0.42    1.00    0.42 1.00 1.00     0.42   1.00 1.00 1.00 1.00     1 1.00   1.00 1.00
##   also           1.00    1.00 1.00 1.00 1.00    1.00    1.00 0.29 1.00     1.00   1.00 1.00 1.00 1.00     1 1.00   1.00 1.00
##   although       1.00    1.00 1.00 1.00 1.00    1.00    1.00 1.00 1.00     0.00   1.00 1.00 1.00 1.00     1 1.00   1.00 1.00
##              Terms
## Terms         make manicotti many meals meat meatballs melt melts menu minute mouth mozarella much mushrooms mussels nice okay
##   absolutely  1.00      1.00 0.50     1 1.00      1.00 1.00  1.00 1.00   1.00  1.00      1.00 1.00      1.00    1.00 1.00 1.00
##   add         1.00      1.00 1.00     1 0.42      1.00 1.00  1.00 1.00   1.00  1.00      1.00 1.00      1.00    1.00 1.00 1.00
##   alfredo     1.00      1.00 0.59     1 1.00      1.00 1.00  1.00 1.00   1.00  1.00      1.00 1.00      1.00    0.42 0.59 0.42
##   also        1.00      1.00 1.00     1 1.00      1.00 1.00  1.00 0.29   0.29  1.00      1.00 1.00      1.00    1.00 0.50 1.00
##   although    1.00      1.00 1.00     1 1.00      1.00 1.00  1.00 1.00   1.00  1.00      1.00 1.00      1.00    0.00 1.00 1.00
##              Terms
## Terms         okive olive options order ordered ordering overpower overrated  pan parm pasta people perfectly pesto  pie pizza
##   absolutely   0.29  0.29    1.00  1.00    1.00     1.00      1.00         1 1.00 1.00  1.00   1.00      1.00  1.00 1.00  0.82
##   add          1.00  1.00    1.00  1.00    1.00     1.00      0.00         1 1.00 1.00  1.00   1.00      1.00  1.00 1.00  1.00
##   alfredo      0.42  0.42    1.00  1.00    0.42     1.00      1.00         1 1.00 1.00  1.00   1.00      1.00  1.00 1.00  0.55
##   also         1.00  1.00    1.00  1.00    1.00     1.00      1.00         1 0.29 1.00  1.00   0.29      1.00  1.00 0.29  0.45
##   although     1.00  1.00    1.00  1.00    0.00     1.00      1.00         1 1.00 1.00  1.00   1.00      1.00  1.00 1.00  0.74
##              Terms
## Terms         pizzas place places plump portions pretty prices ready real really reason reasonable recommend rest ricotta rough
##   absolutely    1.00  1.00   0.29  1.00     1.00   1.00   1.00  1.00 1.00   1.00   1.00       1.00      1.00 1.00    1.00  1.00
##   add           1.00  1.00   1.00  1.00     1.00   1.00   1.00  1.00 1.00   0.42   1.00       1.00      1.00 1.00    0.29  1.00
##   alfredo       1.00  1.00   0.42  1.00     1.00   1.00   1.00  1.00 1.00   0.67   1.00       1.00      1.00 1.00    1.00  1.00
##   also          1.00  1.00   1.00  1.00     1.00   1.00   1.00  0.29 0.29   0.59   1.00       1.00      1.00 0.29    1.00  1.00
##   although      1.00  1.00   1.00  1.00     1.00   1.00   1.00  1.00 1.00   1.00   1.00       1.00      1.00 1.00    1.00  1.00
##              Terms
## Terms         sauce seamlessly seating service shrimp shrimps slow spaghetti special spectacular spices staff stars steamed stop
##   absolutely   1.00       1.00    1.00    1.00   1.00    1.00 1.00      1.00    1.00        1.00   1.00  1.00  1.00    1.00 1.00
##   add          1.00       1.00    1.00    1.00   1.00    1.00 1.00      1.00    1.00        1.00   0.00  1.00  1.00    1.00 1.00
##   alfredo      0.76       1.00    1.00    1.00   1.00    1.00 1.00      1.00    1.00        0.42   1.00  1.00  1.00    0.42 1.00
##   also         1.00       1.00    0.29    0.59   1.00    1.00 1.00      1.00    1.00        1.00   1.00  1.00  1.00    1.00 1.00
##   although     1.00       1.00    1.00    1.00   1.00    1.00 1.00      1.00    1.00        1.00   1.00  1.00  1.00    0.00 1.00
##              Terms
## Terms         stopped stuffed style sublime super tails take takes taste tasted thing time tops tortellini tortellinis tried
##   absolutely        1    1.00  1.00    1.00  1.00  1.00 1.00  1.00  0.50   1.00  1.00 1.00 1.00       1.00        1.00  0.50
##   add               1    1.00  1.00    1.00  1.00  1.00 1.00  1.00  1.00   1.00  1.00 1.00 1.00       1.00        1.00  1.00
##   alfredo           1    1.00  1.00    1.00  1.00  1.00 1.00  1.00  0.18   0.42  0.42 1.00 1.00       1.00        1.00  0.59
##   also              1    1.00  0.29    1.00  1.00  1.00 1.00  1.00  1.00   1.00  1.00 1.00 0.29       1.00        1.00  1.00
##   although          1    1.00  1.00    1.00  1.00  1.00 1.00  1.00  0.29   1.00  1.00 1.00 1.00       1.00        1.00  1.00
##              Terms
## Terms         veggie wait waiter want wasnt watering way white wine worth yummy
##   absolutely    1.00 1.00   0.50 1.00  1.00     1.00   1  1.00 1.00  1.00  1.00
##   add           1.00 1.00   1.00 0.00  1.00     1.00   1  1.00 1.00  1.00  1.00
##   alfredo       1.00 1.00   0.59 1.00  0.42     1.00   1  0.42 0.42  1.00  1.00
##   also          1.00 0.29   1.00 1.00  1.00     1.00   1  1.00 1.00  0.29  1.00
##   although      1.00 1.00   1.00 1.00  1.00     1.00   1  0.00 0.00  1.00  1.00
##  [ reached 'max' / getOption("max.print") -- omitted 180 rows ]
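To help read the distance table above, here is a small worked sketch of the same term-by-term cosine calculation on a dense toy matrix. The matrix `m_toy` and its term names are hypothetical. It uses base `crossprod()` and `colSums()` in place of the sparse-matrix helpers.

```r
# Toy document-term matrix (rows = documents, columns = terms); values are hypothetical.
m_toy <- matrix(c(1, 1, 0,
                  2, 2, 0,
                  0, 0, 3),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("doc", 1:3), c("pizza", "sauce", "wine")))

# Dot products between term columns, divided by the product of the column norms.
num      <- crossprod(m_toy)            # same as t(m_toy) %*% m_toy
norms    <- sqrt(colSums(m_toy^2))
cos_sim  <- num / (norms %o% norms)
cos_dist <- 1 - cos_sim
print(round(cos_dist, 2))
```

In this toy case "pizza" and "sauce" appear in the same documents in the same proportions, so their distance is 0, while "wine" never shares a document with either, so its distance to both is 1. The many 1.00 entries in the real table reflect the same thing: most term pairs never occur in the same review.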

heatmap(cosine_dist_matrix, col = colorRampPalette(c("white", "steelblue"))(100))

dtm_matrix <- as.matrix(dtm)  # Convert sparse DTM to full matrix
dist_obj <- proxy::dist(dtm_matrix, method = "cosine")  # Proper cosine distance

hclust_obj <- hclust(dist_obj, method = "ward.D2")
plot(hclust_obj, labels = paste("Doc", 1:nrow(dtm_matrix)), main = "Document Clustering")

m <- as.matrix(dtm)
v <- sort(colSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)


set.seed(123)  # for reproducibility
wordcloud(
  words = d$word,
  freq = d$freq,
  min.freq = 1,
  max.words = 100,
  random.order = FALSE,
  rot.per = 0.35,
  colors = brewer.pal(8, "Dark2")
)

Market Basket Analysis

2025-08-26T00:00:00+00:00

1. Introduction

Market Basket Analysis is a method for understanding the purchasing behavior of customers. Based on how frequently items are purchased together (support) and how strongly items are associated (confidence), we can derive rules that predict the items in a customer's basket.
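As a small worked example of these measures, the sketch below computes support, confidence, and lift by hand for the rule {bread} => {milk}. The five baskets are hypothetical and are not drawn from the Kaggle data. Lift, which appears in the rule tables later on, is confidence divided by the overall share of baskets containing the right-hand item.

```r
# Five hypothetical baskets (illustration only).
baskets <- list(c("milk", "bread"),
                c("milk", "bread", "butter"),
                c("milk"),
                c("bread"),
                c("butter"))

# Fraction of baskets that contain every item in `items`.
support <- function(items) mean(sapply(baskets, function(b) all(items %in% b)))

supp_rule <- support(c("bread", "milk"))     # share of baskets with both items
conf_rule <- supp_rule / support("bread")    # share of bread baskets that also have milk
lift_rule <- conf_rule / support("milk")     # confidence relative to milk's baseline share
```

Here the rule has support 0.4, confidence of about 0.67, and lift of about 1.11. A lift above 1 means milk shows up with bread more often than its baseline share alone would suggest.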

2. Market Basket Analysis using Groceries Dataset from Kaggle

2.1. Preliminary Data Exploration

##   Member_number       Date  itemDescription
## 1          1808 21-07-2015   tropical fruit
## 2          2552 05-01-2015       whole milk
## 3          2300 19-09-2015        pip fruit
## 4          1187 12-12-2015 other vegetables
## 5          3037 01-02-2015       whole milk
## 6          4941 14-02-2015       rolls/buns

##       Member_number       Date       itemDescription
## 38760          3364 06-05-2014                   oil
## 38761          4471 08-10-2014         sliced cheese
## 38762          2022 23-02-2014                 candy
## 38763          1097 16-04-2014              cake bar
## 38764          1510 03-12-2014 fruit/vegetable juice
## 38765          1521 26-12-2014              cat food

## 'data.frame':	38765 obs. of  3 variables:
##  $ Member_number  : int  1808 2552 2300 1187 3037 4941 4501 3803 2762 4119 ...
##  $ Date           : chr  "21-07-2015" "05-01-2015" "19-09-2015" "12-12-2015" ...
##  $ itemDescription: chr  "tropical fruit" "whole milk" "pip fruit" "other vegetables" ...

## [1] 3898

The Groceries dataset from Kaggle has 38,765 observations and 3 variables: Member_number, Date (the date of purchase), and itemDescription (the item purchased). The transaction data spans the years 2014 and 2015.

2.2. Preparing the data for Market Basket Analysis

The data is currently in a “row per item” format. We need to convert it into a “row per transaction” format to perform the market basket analysis.
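The reshaping step can be sketched in base R as follows. The rows in `items_long` are hypothetical, but they use the same three column names as the data shown above. The sketch assumes that one transaction means one member on one date, which may or may not match the grouping used for the actual analysis.

```r
# A few hypothetical rows in the same three-column layout as the data above.
items_long <- data.frame(
  Member_number   = c(1808, 1808, 2552, 2552, 2300),
  Date            = c("21-07-2015", "21-07-2015", "05-01-2015", "06-01-2015", "19-09-2015"),
  itemDescription = c("tropical fruit", "whole milk", "whole milk", "rolls/buns", "pip fruit")
)

# Assumption: one transaction = one member on one date.
basket_id <- paste(items_long$Member_number, items_long$Date, sep = "_")

# One list element per transaction, holding that basket's items.
transactions_toy <- split(items_long$itemDescription, basket_id)
print(transactions_toy)
```

A list of item vectors like this is the usual starting point for association-rule packages, which typically convert it into their own transactions object before mining.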

Using an association rules package, we extract the “baskets” of items from the data and pass them to the apriori algorithm to find association rules.

These transactions yield combinations of items in the “basket” along with measures such as support, confidence, and lift, which help us gauge how likely a customer is to buy a certain item given the items they have already picked. We display the first 10 rules to get an idea of what the result looks like.

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target  ext
##         0.2    0.1    1 none FALSE            TRUE       5   5e-04      2     10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 7 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[167 item(s), 14963 transaction(s)] done [0.00s].
## sorting and recoding items ... [158 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [19 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

##      lhs                                rhs          support      confidence coverage    lift     count
## [1]  {artif. sweetener}              => {whole milk} 0.0005346521 0.2758621  0.001938114 1.746815  8   
## [2]  {brandy}                        => {whole milk} 0.0008688097 0.3421053  0.002539598 2.166281 13   
## [3]  {spices}                        => {soda}       0.0006014837 0.2250000  0.002673261 2.317051  9   
## [4]  {softener}                      => {whole milk} 0.0008019782 0.2926829  0.002740092 1.853328 12   
## [5]  {house keeping products}        => {whole milk} 0.0007351467 0.2444444  0.003007418 1.547872 11   
## [6]  {finished products}             => {whole milk} 0.0008688097 0.2031250  0.004277217 1.286229 13   
## [7]  {rolls/buns, white bread}       => {whole milk} 0.0006014837 0.2812500  0.002138609 1.780933  9   
## [8]  {other vegetables, white bread} => {whole milk} 0.0005346521 0.2051282  0.002606429 1.298914  8   
## [9]  {margarine, soda}               => {whole milk} 0.0005346521 0.2051282  0.002606429 1.298914  8   
## [10] {curd, rolls/buns}              => {whole milk} 0.0006014837 0.2195122  0.002740092 1.389996  9

We can sort the output based on confidence for a clearer picture.

##      lhs                          rhs                support      confidence coverage    lift     count
## [1]  {pork, sausage}           => {whole milk}       0.0006014837 0.3913043  0.001537125 2.477819  9   
## [2]  {brandy}                  => {whole milk}       0.0008688097 0.3421053  0.002539598 2.166281 13   
## [3]  {softener}                => {whole milk}       0.0008019782 0.2926829  0.002740092 1.853328 12   
## [4]  {rolls/buns, white bread} => {whole milk}       0.0006014837 0.2812500  0.002138609 1.780933  9   
## [5]  {artif. sweetener}        => {whole milk}       0.0005346521 0.2758621  0.001938114 1.746815  8   
## [6]  {sausage, shopping bags}  => {other vegetables} 0.0005346521 0.2758621  0.001938114 2.259291  8   
## [7]  {sausage, yogurt}         => {whole milk}       0.0014702934 0.2558140  0.005747511 1.619866 22   
## [8]  {house keeping products}  => {whole milk}       0.0007351467 0.2444444  0.003007418 1.547872 11   
## [9]  {pastry, soda}            => {whole milk}       0.0009356412 0.2295082  0.004076723 1.453293 14   
## [10] {pastry, sausage}         => {whole milk}       0.0007351467 0.2291667  0.003207913 1.451130 11

Looking at the top 10 rows of the confidence-sorted results, “whole milk” is the item most likely to be picked when “pork” and “sausage” are already in the basket (confidence of about 0.39). Similar relationships hold between the other “LHS” and “RHS” pairs. The confidence column estimates the probability that the “RHS” item is bought given that the “LHS” items are already in the basket.
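The columns of a rule are tied together by simple arithmetic, which can be checked with the numbers printed for rule [1] of the confidence-sorted table ({pork, sausage} => {whole milk}). The sketch below only re-derives values from that output and from the transaction count in the apriori log; the variable names are chosen for illustration.

```r
# Values copied from rule [1] of the confidence-sorted output and the apriori log above.
n_transactions <- 14963
count          <- 9
confidence     <- 0.3913043
lift           <- 2.477819

support     <- count / n_transactions    # share of baskets with pork, sausage and whole milk
coverage    <- support / confidence      # share of baskets with pork and sausage
supp_milk   <- confidence / lift         # implied overall share of baskets with whole milk
lhs_baskets <- round(coverage * n_transactions)
```

The derived support and coverage match the printed 0.0006014837 and 0.001537125. The rule rests on 9 of the 23 baskets that contain both pork and sausage, so its confidence comes from a very small sample. The implied baseline share for whole milk is roughly 0.158.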

Market Segmentation

2025-08-26T00:00:00+00:00

1. Introduction

This document performs a Market Segmentation Analysis on the data provided by KTC. We import the data and apply cluster analysis to find useful patterns in the customer data.

2. Descriptive Mining

We explore the data, look for patterns, and then segment the customers.

2.1. Data Exploration

We have information on 30 customers of KTC Company: their age, gender, income, marital status, number of children, and financial status, i.e. whether they have a mortgage and other loans.

## # A tibble: 30 × 7
##      Age Female Income Married Children  Loan Mortgage
##    <dbl>  <dbl>  <dbl>   <dbl>    <dbl> <dbl>    <dbl>
##  1    48      1 17546        0        1     0        0
##  2    40      0 30085.       1        3     1        1
##  3    51      1 16575.       1        0     1        0
##  4    23      1 20375.       1        3     0        0
##  5    57      1 50576.       1        0     0        0
##  6    57      1 37870.       1        2     0        0
##  7    22      0  8877.       0        0     0        0
##  8    58      0 24947.       1        0     1        0
##  9    37      1 25304.       1        2     1        0
## 10    54      0 24212.       1        2     1        0
## # ℹ 20 more rows

## 
## Data frame:crs$dataset[, c(crs$input, crs$risk, crs$target)]	30 observations and 7 variables    Maximum # NAs:0
## 
## 
##          Storage
## Age       double
## Female    double
## Income    double
## Married   double
## Children  double
## Loan      double
## Mortgage  double

##       Age            Female           Income         Married       Children           Loan           Mortgage  
##  Min.   :22.00   Min.   :0.0000   Min.   : 8877   Min.   :0.0   Min.   :0.0000   Min.   :0.0000   Min.   :0.0  
##  1st Qu.:37.25   1st Qu.:0.0000   1st Qu.:18166   1st Qu.:1.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0  
##  Median :47.00   Median :1.0000   Median :24241   Median :1.0   Median :0.5000   Median :0.0000   Median :0.0  
##  Mean   :45.97   Mean   :0.5667   Mean   :28012   Mean   :0.8   Mean   :0.9333   Mean   :0.4333   Mean   :0.4  
##  3rd Qu.:56.75   3rd Qu.:1.0000   3rd Qu.:35923   3rd Qu.:1.0   3rd Qu.:2.0000   3rd Qu.:1.0000   3rd Qu.:1.0  
##  Max.   :66.00   Max.   :1.0000   Max.   :59804   Max.   :1.0   Max.   :3.0000   Max.   :1.0000   Max.   :1.0

2.1.1. Age

## crs$dataset["Age"] 
## 
##  1  Variables      30  Observations
## ------------------------------------------------------------------------------------------------------------------------------------
## Age 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05      .10      .25      .50      .75      .90      .95 
##       30        0       23    0.998    45.97     46.5    15.12    22.45    26.60    37.25    47.00    56.75    61.10    64.20 
## 
## lowest : 22 23 27 31 36, highest: 57 58 61 62 66
## ------------------------------------------------------------------------------------------------------------------------------------

2.1.2. Female (Gender Column)

## crs$dataset["Female"] 
## 
##  1  Variables      30  Observations
## ------------------------------------------------------------------------------------------------------------------------------------
## Female 
##        n  missing distinct     Info      Sum     Mean 
##       30        0        2    0.737       17   0.5667 
## 
## ------------------------------------------------------------------------------------------------------------------------------------

2.1.3. Income

## crs$dataset["Income"] 
## 
##  1  Variables      30  Observations
## ------------------------------------------------------------------------------------------------------------------------------------
## Income 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05      .10      .25      .50      .75      .90      .95 
##       30        0       30        1    28012    25590    14919    13945    15716    18166    24241    35923    51039    56676 
## 
## lowest : 8877.07 12640.3 15538.8 15735.8 16497.3, highest: 41034   50576.3 55204.7 57880.7 59803.9
## ------------------------------------------------------------------------------------------------------------------------------------

2.1.4. Marriage

## crs$dataset["Married"] 
## 
##  1  Variables      30  Observations
## ------------------------------------------------------------------------------------------------------------------------------------
## Married 
##        n  missing distinct     Info      Sum     Mean 
##       30        0        2    0.481       24      0.8 
## 
## ------------------------------------------------------------------------------------------------------------------------------------

2.1.5. Children

## crs$dataset["Children"] 
## 
##  1  Variables      30  Observations
## ------------------------------------------------------------------------------------------------------------------------------------
## Children 
##        n  missing distinct     Info     Mean  pMedian      Gmd 
##       30        0        4    0.858   0.9333        1    1.163 
##                                   
## Value          0     1     2     3
## Frequency     15     5     7     3
## Proportion 0.500 0.167 0.233 0.100
## 
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------------------------------------------------------------------------------

2.1.6. Loan

## crs$dataset["Loan"] 
## 
##  1  Variables      30  Observations
## ------------------------------------------------------------------------------------------------------------------------------------
## Loan 
##        n  missing distinct     Info      Sum     Mean 
##       30        0        2    0.737       13   0.4333 
## 
## ------------------------------------------------------------------------------------------------------------------------------------

2.1.7. Mortgage

## crs$dataset["Mortgage"] 
## 
##  1  Variables      30  Observations
## ------------------------------------------------------------------------------------------------------------------------------------
## Mortgage 
##        n  missing distinct     Info      Sum     Mean 
##       30        0        2    0.721       12      0.4 
## 
## ------------------------------------------------------------------------------------------------------------------------------------

2.1.8. Distributions

# Generate the plot.

p01 <- crs %>%
  with(dataset[,]) %>%
  dplyr::select(Age) %>%
  ggplot2::ggplot(ggplot2::aes(x=Age)) +
  ggplot2::geom_density(lty=3) +
  ggplot2::xlab("Age\n\nRattle 2025-Jul-19 09:48:10 sakth") +
  ggplot2::ggtitle("Distribution of Age") +
  ggplot2::labs(y="Density")

# Use ggplot2 to generate a density plot for Female

# Generate the plot.

p02 <- crs %>%
  with(dataset[,]) %>%
  dplyr::select(Female) %>%
  ggplot2::ggplot(ggplot2::aes(x=Female)) +
  ggplot2::geom_density(lty=3) +
  ggplot2::xlab("Female\n\nRattle 2025-Jul-19 09:48:10 sakth") +
  ggplot2::ggtitle("Distribution of Female") +
  ggplot2::labs(y="Density")

# Use ggplot2 to generate a density plot for Income

# Generate the plot.

p03 <- crs %>%
  with(dataset[,]) %>%
  dplyr::select(Income) %>%
  ggplot2::ggplot(ggplot2::aes(x=Income)) +
  ggplot2::geom_density(lty=3) +
  ggplot2::xlab("Income\n\nRattle 2025-Jul-19 09:48:10 sakth") +
  ggplot2::ggtitle("Distribution of Income") +
  ggplot2::labs(y="Density")

# Use ggplot2 to generate a density plot for Married

# Generate the plot.

p04 <- crs %>%
  with(dataset[,]) %>%
  dplyr::select(Married) %>%
  ggplot2::ggplot(ggplot2::aes(x=Married)) +
  ggplot2::geom_density(lty=3) +
  ggplot2::xlab("Married\n\nRattle 2025-Jul-19 09:48:10 sakth") +
  ggplot2::ggtitle("Distribution of Married") +
  ggplot2::labs(y="Density")

# Use ggplot2 to generate a density plot for Children

# Generate the plot.

p05 <- crs %>%
  with(dataset[,]) %>%
  dplyr::select(Children) %>%
  ggplot2::ggplot(ggplot2::aes(x=Children)) +
  ggplot2::geom_density(lty=3) +
  ggplot2::xlab("Children\n\nRattle 2025-Jul-19 09:48:10 sakth") +
  ggplot2::ggtitle("Distribution of Children") +
  ggplot2::labs(y="Density")

# Use ggplot2 to generate a density plot for Loan

# Generate the plot.

p06 <- crs %>%
  with(dataset[,]) %>%
  dplyr::select(Loan) %>%
  ggplot2::ggplot(ggplot2::aes(x=Loan)) +
  ggplot2::geom_density(lty=3) +
  ggplot2::xlab("Loan\n\nRattle 2025-Jul-19 09:48:10 sakth") +
  ggplot2::ggtitle("Distribution of Loan") +
  ggplot2::labs(y="Density")

# Use ggplot2 to generate a density plot for Mortgage

# Generate the plot.

p07 <- crs %>%
  with(dataset[,]) %>%
  dplyr::select(Mortgage) %>%
  ggplot2::ggplot(ggplot2::aes(x=Mortgage)) +
  ggplot2::geom_density(lty=3) +
  ggplot2::xlab("Mortgage\n\nRattle 2025-Jul-19 09:48:10 sakth") +
  ggplot2::ggtitle("Distribution of Mortgage") +
  ggplot2::labs(y="Density")

# Display the plots.

gridExtra::grid.arrange(p01, p02, p03, p04, p05, p06, p07)

2.2. Dendrogram

Observing the above dendrogram we can observe that

2.3. Elbow Method

# Elbow method for finding the no of clusters
library(factoextra)

fviz_nbclust(crs$dataset[, c(1:7)], kmeans, method = "wss") +
  labs(subtitle = "Elbow Method")

We can observe a sharp change in the total within-cluster sum of squares at 2 clusters. This suggests that 2 is the optimal number of clusters for this dataset.
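The quantity behind the elbow plot can also be computed directly. The sketch below runs base `kmeans()` for k = 1 to 6 on synthetic two-group data and records the total within-cluster sum of squares. The matrix `toy` is a hypothetical stand-in, since the customer table is not reproduced here.

```r
set.seed(123)
# Synthetic data with two well-separated groups (hypothetical values).
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 6), ncol = 2))

# Total within-cluster sum of squares for k = 1..6.
wss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)

# Reduction in WSS gained by each additional cluster; the elbow follows the largest drop.
drops <- -diff(wss)
print(round(wss, 1))
```

One caveat to keep in mind: k-means works on raw Euclidean distances, so a variable on a much larger scale (Income, compared with the 0/1 columns) will dominate the sum of squares unless the columns are standardized first, for example with `scale()`.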

3. Segmentation and Clustering

Clustering is a method of grouping observations based on their similarities. We use distance measures to assess the dissimilarity among observations. There are many distance measures, including Euclidean and Manhattan. Similarly, there are different types of clustering algorithms, such as K-means, hierarchical, and biclustering. We begin with hierarchical clustering as part of our exploratory analysis.
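To illustrate the two distance measures and the hierarchical step, here is a minimal sketch on four made-up points. The matrix `pts` and its feature names are hypothetical. The `ward.D2` linkage is only an assumption for the example, since the linkage used for the dendrograms in this post is not stated.

```r
# Four hypothetical customers described by two comparable features.
pts <- matrix(c(0, 0,
                3, 4,
                0, 1,
                3, 5),
              ncol = 2, byrow = TRUE,
              dimnames = list(paste0("c", 1:4), c("f1", "f2")))

d_euc <- dist(pts, method = "euclidean")   # straight-line distance
d_man <- dist(pts, method = "manhattan")   # sum of absolute differences

# Agglomerative clustering on the Euclidean distances, cut into two groups.
hc     <- hclust(d_euc, method = "ward.D2")
groups <- cutree(hc, k = 2)
print(groups)
```

Between c1 and c2 the Euclidean distance is 5 and the Manhattan distance is 7. Cutting the tree at two clusters pairs c1 with c3 and c2 with c4, the two nearest-neighbor pairs.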

3.1. Hierarchical Clustering

No. of Clusters = 5

No. of Clusters = 4

No. of Clusters = 3

No. of Clusters = 2

3.2. K-means Clustering

## [1] "12 10 8"

##          Age       Female       Income      Married     Children         Loan     Mortgage 
## 4.596667e+01 5.666667e-01 2.801187e+04 8.000000e-01 9.333333e-01 4.333333e-01 4.000000e-01

##      Age    Female   Income   Married Children Loan  Mortgage
## 1 37.000 0.5833333 16826.18 0.6666667    1.000 0.25 0.4166667
## 2 47.200 0.5000000 25661.02 0.8000000    1.100 0.80 0.4000000
## 3 57.875 0.6250000 47728.97 1.0000000    0.625 0.25 0.3750000

## [1] 131352595  61338314 586111857
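The four outputs above appear to correspond to the cluster sizes, the overall variable means, the cluster centers, and the within-cluster sums of squares of a three-cluster k-means fit. That reading is an assumption, since the generating code is not shown. The sketch below produces the same four quantities with base `kmeans()` on synthetic data; `toy` and its columns are hypothetical.

```r
set.seed(42)
# Synthetic stand-in data with three loose groups (hypothetical values).
toy <- rbind(matrix(rnorm(20, mean = 0),  ncol = 2),
             matrix(rnorm(20, mean = 5),  ncol = 2),
             matrix(rnorm(20, mean = 10), ncol = 2))
colnames(toy) <- c("f1", "f2")

km <- kmeans(toy, centers = 3, nstart = 10)

print(paste(km$size, collapse = " "))  # cluster sizes as one string
print(colMeans(toy))                   # overall mean of each variable
print(km$centers)                      # one row of variable means per cluster
print(km$withinss)                     # within-cluster sum of squares, one per cluster
```

Comparing each row of the centers with the overall means is what gives a cluster its profile, for example an older, higher-income group against a younger, lower-income one.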

4. Conclusion

We have successfully explored the data and performed the appropriate clustering methods to identify the pattern in the data.

From this we can see the formed clusters clearly, and the data points within each cluster are more similar to each other than to points in other clusters. We can build on this with further analysis, such as assigning a new entry to a cluster or identifying the largest cluster to find the most common type of customer.
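Assigning a new entry to a cluster can be sketched as a nearest-center lookup. The centers below are copied from the k-means output in section 3.2, restricted to the Age and Income columns as a simplification. The `new_customer` values are hypothetical.

```r
# Cluster centers copied from the k-means output above (Age and Income columns only).
centers <- rbind(`1` = c(Age = 37.000, Income = 16826.18),
                 `2` = c(Age = 47.200, Income = 25661.02),
                 `3` = c(Age = 57.875, Income = 47728.97))

# Hypothetical new customer, not part of the KTC data.
new_customer <- c(Age = 45, Income = 27000)

# Squared Euclidean distance from the new customer to each center.
d2 <- rowSums(sweep(centers, 2, new_customer)^2)
assigned <- names(which.min(d2))
print(assigned)
```

With these numbers the new customer falls into cluster 2. On unscaled data the assignment is driven almost entirely by Income, so standardizing the variables first would be worth considering in a fuller treatment.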