Sakthi Swaroopan S - CB.BU.P2ASB25147

0) Setup

# install.packages(c("readxl","dplyr","magrittr","factoextra","rattle","DT","psych","tibble","tidyr"))


library(readxl)
library(dplyr)
library(magrittr)
library(factoextra)
library(rattle)
library(DT)
library(psych)
library(tibble)
library(tidyr)


set.seed(42)


# ---- Paths ----
DATA_PATH <- "/ESG_Dataset_Sakthi.xlsx"
SHEET_NAME <- "Sheet1"

1) Importing the data

Data was collected through the annual reports sourced from NSE.

raw <- read_excel(DATA_PATH, sheet = SHEET_NAME) %>%
janitor::clean_names() # using janitor fully qualified (not attaching)


# Expected columns after clean_names():
# company_name, year, industry_type, ceo_name, ceo_gender,
# carbon_emissions, energy_consumption, employee_turnover, roe, roa

2) Exploratory analysis

# Structure and a peek
str(raw)

## tibble [48 × 11] (S3: tbl_df/tbl/data.frame)
##  $ company_name      : chr [1:48] "Sona BLW Percision forgings ltd" "Sona BLW Percision forgings ltd" "Sona BLW Percision forgings ltd" "Sona BLW Percision forgings ltd" ...
##  $ year              : num [1:48] 2021 2022 2023 2024 2025 ...
##  $ carbon_emissions  : chr [1:48] "32756" "40330" "48468" "58317" ...
##  $ energy_consumption: chr [1:48] "41800.07" "52308.14" "311100" "358157" ...
##  $ employee_turnover : chr [1:48] "7.6499999999999999E-2" "0.11" "0.16" "0.13" ...
##  $ roa               : num [1:48] 0.1065 0.141 0.1303 0.1351 0.0941 ...
##  $ roe               : num [1:48] 0.153 0.179 0.172 0.188 0.107 ...
##  $ industry_type     : chr [1:48] "Automotive" "Automotive" "Automotive" "Automotive" ...
##  $ location          : chr [1:48] "Haryana" "Haryana" "Haryana" "Haryana" ...
##  $ ceo_name          : chr [1:48] "Vivek Vikram Singh" "Vivek Vikram Singh" "Vivek Vikram Singh" "Vivek Vikram Singh" ...
##  $ ceo_gender        : chr [1:48] "Male" "Male" "Male" "Male" ...

DT::datatable(head(raw, 20), options = list(pageLength = 10), caption = "Raw data (first 20 rows)")

## Error in loadNamespace(name): there is no package called 'webshot'

# Simple counts
raw %>% count(company_name, sort = TRUE) %>% DT::datatable(caption = "Rows per company")

## Error in loadNamespace(name): there is no package called 'webshot'

raw %>% count(year, sort = TRUE) %>% DT::datatable(caption = "Rows per year")

## Error in loadNamespace(name): there is no package called 'webshot'

Data is uneven across companies and years; not all variables are consistently reported.
Some firms dominate in reporting while others have sparse records.

3) Pre‑processing the data

Rules applied

Drop rows where Energy_Consumption == “Not Reported”. This ensures comparability between the companies.
Replace NA in Carbon_Emissions with 0 for non-manufacturing firms.
Cast types for numeric columns; keep factors for categories.

Standardizing Decimals to improve readability and consistency.

df <- raw %>%
# Normalize text placeholders to real NAs
mutate(
energy_consumption = dplyr::na_if(energy_consumption, "Not Reported"),
carbon_emissions = dplyr::na_if(carbon_emissions, "NA")
) %>%
# Coerce to appropriate types
mutate(
year = as.integer(year),
industry_type = as.factor(industry_type),
ceo_gender = factor(ceo_gender, levels = c("Male","Female","Other")),
energy_consumption = suppressWarnings(as.numeric(energy_consumption)),
carbon_emissions = suppressWarnings(as.numeric(carbon_emissions)),
employee_turnover = suppressWarnings(as.numeric(employee_turnover)),
roe = suppressWarnings(as.numeric(roe)),
roa = suppressWarnings(as.numeric(roa))
) %>%
# Apply the two cleaning rules
filter(!is.na(energy_consumption)) %>% # drop Not Reported rows
mutate(carbon_emissions = dplyr::coalesce(carbon_emissions, 0)) %>%
mutate(across(c(carbon_emissions, energy_consumption,employee_turnover, roe, roa), ~ round(.x, 2))) %>%
arrange(company_name, year)
    
    
# Quick sanity check
stopifnot(all(c("company_name","year","industry_type","ceo_gender",
"carbon_emissions","energy_consumption","employee_turnover","roe","roa") %in% names(df)))

4) Preview of data before analysis

DT::datatable(head(df, 20), options = list(pageLength = 10), caption = "Cleaned data (first 20 rows)")

## Error in loadNamespace(name): there is no package called 'webshot'

5) Exploratory analysis (post‑clean)

# Numeric summary by year
year_summary <- df %>%
group_by(year) %>%
summarise(
n = dplyr::n(),
mean_emissions = mean(carbon_emissions, na.rm = TRUE),
mean_energy = mean(energy_consumption, na.rm = TRUE),
mean_turnover = mean(employee_turnover, na.rm = TRUE),
mean_roe = mean(roe, na.rm = TRUE),
mean_roa = mean(roa, na.rm = TRUE)
)
DT::datatable(year_summary, caption = "Year‑wise summary")

## Error in loadNamespace(name): there is no package called 'webshot'

# Correlation matrix (pooled numeric)
num_cols <- c("carbon_emissions","energy_consumption","employee_turnover","roe","roa")
cor_mat <- stats::cor(df[, num_cols], use = "pairwise.complete.obs")
cor_mat <- round(cor_mat, 2)
cor_mat

##                    carbon_emissions energy_consumption employee_turnover   roe   roa
## carbon_emissions               1.00              -0.15             -0.46  0.03  0.13
## energy_consumption            -0.15               1.00              0.04 -0.22 -0.14
## employee_turnover             -0.46               0.04              1.00 -0.24 -0.22
## roe                            0.03              -0.22             -0.24  1.00  0.94
## roa                            0.13              -0.14             -0.22  0.94  1.00

# helper function to convert psych::describe output into vertical key-value tibble
describe_long <- function(x) {
  psych::describe(x) %>%
    as_tibble() %>%
    select(-vars, -n) %>%   # drop unneeded cols (keep stats)
    pivot_longer(cols = everything(),
                 names_to = "Metric",
                 values_to = "Value")
}

# Now display each variable as vertical DT
DT::datatable(describe_long(df["carbon_emissions"]), caption = "Carbon Emissions")

## Error in loadNamespace(name): there is no package called 'webshot'

DT::datatable(describe_long(df["energy_consumption"]), caption = "Energy Consumption")

## Error in loadNamespace(name): there is no package called 'webshot'

DT::datatable(describe_long(df["employee_turnover"]), caption = "Employee Turnover")

## Error in loadNamespace(name): there is no package called 'webshot'

DT::datatable(describe_long(df["roe"]), caption = "ROE")

## Error in loadNamespace(name): there is no package called 'webshot'

DT::datatable(describe_long(df["roa"]), caption = "ROA")

## Error in loadNamespace(name): there is no package called 'webshot'

Average ROE and ROA vary significantly year to year.
Turnover shows an overall negative correlation with profitability

Energy consumption and emissions remain volatile, without a steady downward trend.

6) Analysis

6.1) Company Performance Snapshot

company_summary <- df %>%
  group_by(company_name) %>%
  summarise(
    avg_emissions   = mean(carbon_emissions, na.rm = TRUE),
    avg_energy      = mean(energy_consumption, na.rm = TRUE),
    avg_turnover    = mean(employee_turnover, na.rm = TRUE),
    avg_roe         = mean(roe, na.rm = TRUE),
    avg_roa         = mean(roa, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_roe))

DT::datatable(
  company_summary,
  caption = "5-Year Company Performance Snapshot (Averages)",
  options = list(pageLength = 10)
)

## Error in loadNamespace(name): there is no package called 'webshot'

Some firms balance profitability with efficiency, while others underperform despite high energy/emissions.

6.2) Year-over-year Trends

# Recompute to keep the section self-contained
yoy <- df %>%
  group_by(year) %>%
  summarise(
    mean_emissions = mean(carbon_emissions, na.rm = TRUE),
    mean_energy    = mean(energy_consumption, na.rm = TRUE),
    mean_turnover  = mean(employee_turnover, na.rm = TRUE),
    mean_roe       = mean(roe, na.rm = TRUE),
    mean_roa       = mean(roa, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(year)

DT::datatable(
  yoy,
  caption = "Year-wise Averages (Sustainability & Profitability)"
)

## Error in loadNamespace(name): there is no package called 'webshot'

# Emissions trend
plot(
  yoy$year, yoy$mean_emissions, type = "b",
  xlab = "Year", ylab = "Avg Carbon Emissions",
  main = "Trend: Average Carbon Emissions by Year"
)

plot of chunk year_trend

# ROE trend
plot(
  yoy$year, yoy$mean_roe, type = "b",
  xlab = "Year", ylab = "Avg ROE",
  main = "Trend: Average ROE by Year"
)

plot of chunk year_trend

# ROA trend
plot(
  yoy$year, yoy$mean_roa, type = "b",
  xlab = "Year", ylab = "Avg ROA",
  main = "Trend: Average ROA by Year"
)

plot of chunk year_trend

Modest improvement in profitability, but sustainability indicators (emissions, energy) do not show consistent decline.

6.3) Comparative Plots

# Helper to draw scatter + OLS line + correlation 
make_scatter <- function(x, y, xlab, ylab, title) {
  ok <- is.finite(x) & is.finite(y)
  plot(x[ok], y[ok],
       xlab = xlab, ylab = ylab,
       main = title, pch = 19)
  # OLS line
  fit <- stats::lm(y[ok] ~ x[ok])
  abline(fit, lwd = 2)
  # Pearson correlation
  r <- stats::cor(x[ok], y[ok], method = "pearson")
  legend("topleft", bty = "n",
         legend = paste0("r = ", round(r, 2)))
}

# Emissions vs ROE
make_scatter(
  x = df$carbon_emissions, y = df$roe,
  xlab = "Carbon Emissions", ylab = "ROE",
  title = "Carbon Emissions vs ROE"
)

plot of chunk comparative_plots

# Energy vs ROA
make_scatter(
  x = df$energy_consumption, y = df$roa,
  xlab = "Energy Consumption", ylab = "ROA",
  title = "Energy Consumption vs ROA"
)

plot of chunk comparative_plots

# Turnover vs ROE
make_scatter(
  x = df$employee_turnover, y = df$roe,
  xlab = "Employee Turnover", ylab = "ROE",
  title = "Employee Turnover vs ROE"
)

plot of chunk comparative_plots

Negative relationship between turnover and ROE is evident.
Emissions and energy show a slight negative relationship with performance.

6.4) Clustering

clust_df <- df %>%
  group_by(company_name, industry_type) %>%
  summarise(
    carbon_emissions   = mean(carbon_emissions,   na.rm = TRUE),
    energy_consumption = mean(energy_consumption, na.rm = TRUE),
    employee_turnover  = mean(employee_turnover,  na.rm = TRUE),
    roe                = mean(roe,                na.rm = TRUE),
    roa                = mean(roa,                na.rm = TRUE),
    .groups = "drop"
  )

# Variables used for clustering (inputs)
vars_used <- c("carbon_emissions", "energy_consumption",
               "employee_turnover", "roe", "roa")
DT::datatable(
  tibble::tibble(Variables_Used = vars_used),
  caption = "Variables used for clustering (company-level averages)"
)

## Error in loadNamespace(name): there is no package called 'webshot'

# Prepare numeric matrix and scale
X <- clust_df %>% dplyr::select(all_of(vars_used)) %>% as.data.frame()
X_scaled <- scale(X)

# choose K (WSS / Silhouette)
factoextra::fviz_nbclust(X_scaled, kmeans, method = "wss", k.max = 8)

plot of chunk clustering

factoextra::fviz_nbclust(X_scaled, kmeans, method = "silhouette", k.max = 8)

plot of chunk clustering

# Fit K-means (set k after inspecting above plots)
k <- 3
km <- stats::kmeans(X_scaled, centers = k, nstart = 50)

# Attach cluster ids to companies
clust_out <- clust_df %>% mutate(cluster = factor(km$cluster))
DT::datatable(clust_out, caption = "Company → Cluster assignments")

## Error in loadNamespace(name): there is no package called 'webshot'

# Companies in each cluster (compact list)
companies_by_cluster <- clust_out %>%
  group_by(cluster) %>%
  summarise(companies = paste(company_name, collapse = ", "), .groups = "drop")
DT::datatable(companies_by_cluster, caption = "Companies in each cluster")

## Error in loadNamespace(name): there is no package called 'webshot'

# PCA for labeled visualization (labels = company names)
rownames(X_scaled) <- clust_df$company_name
pca_obj <- stats::prcomp(X_scaled, center = FALSE, scale. = FALSE)  # already scaled

factoextra::fviz_pca_ind(
  pca_obj,
  geom = "point",
  habillage = clust_out$cluster,   # color by cluster
  addEllipses = FALSE,             # avoid ellipse warnings for small clusters
  label = "all",                   # show company labels
  repel = TRUE,                    # nicer label placement
  title = "Company Segments (PCA with labels)"
)

plot of chunk clustering

# PCA loadings table to interpret Dim1/Dim2 drivers
loadings <- tibble::as_tibble(pca_obj$rotation[, 1:2], rownames = "variable")
colnames(loadings) <- c("variable", "Dim1_loading", "Dim2_loading")
DT::datatable(loadings, caption = "PCA loadings (which variables drive Dim1/Dim2)")

## Error in loadNamespace(name): there is no package called 'webshot'

# Cluster-level profiles (means of original features)
cluster_profiles <- clust_out %>%
  group_by(cluster) %>%
  summarise(across(all_of(vars_used), ~ mean(.x, na.rm = TRUE)), .groups = "drop")
DT::datatable(cluster_profiles, caption = "Cluster profiles (feature means)")

## Error in loadNamespace(name): there is no package called 'webshot'

Three groups emerge:
- Efficient & Profitable (low emissions, higher returns).
- Transitioners (partial improvements).
- Underperformers (high energy/emissions, low returns).

Automotive company stands out with both high emissions and high profitability.

6.5) Predictive analysis

6.5.1) Company-wise Prediction

# Filter company
df_c <- df %>% filter(company_name == "Artemis Medicare Services Ltd.")

# Train/test split (latest year as test, earlier years as train)
train <- df_c %>% filter(year < max(year))
test  <- df_c %>% filter(year == max(year))

# Model: ROE as dependent, predictors = sustainability + turnover + time
model <- lm(roe ~ carbon_emissions + energy_consumption + employee_turnover + year,
            data = train)

summary(model)

## 
## Call:
## lm(formula = roe ~ carbon_emissions + energy_consumption + employee_turnover + 
##     year, data = train)
## 
## Residuals:
## ALL 3 residuals are 0: no residual degrees of freedom!
## 
## Coefficients: (2 not defined because of singularities)
##                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)        1.006e-01        NaN     NaN      NaN
## carbon_emissions   8.860e-07        NaN     NaN      NaN
## energy_consumption 4.147e-07        NaN     NaN      NaN
## employee_turnover         NA         NA      NA       NA
## year                      NA         NA      NA       NA
## 
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared:      1,	Adjusted R-squared:    NaN 
## F-statistic:   NaN on 2 and 0 DF,  p-value: NA

# Predict on test
test$pred_roe <- predict(model, newdata = test)

# Metrics
rmse <- function(a, p) sqrt(mean((a - p)^2, na.rm = TRUE))
mae  <- function(a, p) mean(abs(a - p), na.rm = TRUE)
r2   <- function(a, p) 1 - sum((a - p)^2, na.rm = TRUE) /
                      sum((a - mean(a, na.rm = TRUE))^2, na.rm = TRUE)

cat("RMSE:", rmse(test$roe, test$pred_roe), "\n")

## RMSE: 0.002823746

cat("MAE :", mae(test$roe,  test$pred_roe), "\n")

## MAE : 0.002823746

cat("R^2 :", r2(test$roe,   test$pred_roe), "\n")

## R^2 : -Inf

# Example: Artemis Medicare Services Ltd.
df_c <- df %>% filter(company_name == "Artemis Medicare Services Ltd.")

# Set up 1 row, 3 columns layout
par(mfrow = c(1, 3))

# 1. Carbon Emissions vs ROE
plot(df_c$carbon_emissions, df_c$roe,
     xlab = "Carbon Emissions", ylab = "ROE",
     main = "Emissions vs ROE",
     pch = 19, col = "blue")
abline(lm(roe ~ carbon_emissions, data = df_c), col = "red", lwd = 2)

# 2. Energy Consumption vs ROE
plot(df_c$energy_consumption, df_c$roe,
     xlab = "Energy Consumption", ylab = "ROE",
     main = "Energy vs ROE",
     pch = 19, col = "darkgreen")
abline(lm(roe ~ energy_consumption, data = df_c), col = "red", lwd = 2)

# 3. Employee Turnover vs ROE
plot(df_c$employee_turnover, df_c$roe,
     xlab = "Employee Turnover", ylab = "ROE",
     main = "Turnover vs ROE",
     pch = 19, col = "purple")
abline(lm(roe ~ employee_turnover, data = df_c), col = "red", lwd = 2)

plot of chunk prediction1

# Reset to default
par(mfrow = c(1,1))

Models collapse due to very few years per company (overfitting, meaningless coefficients).
Highlights the data depth problem in company-level analytics.

6.5.2) Collective Model

# Train/test split: use <=2023 for training, >=2024 for testing
train <- df %>% filter(year <= 2023)
test  <- df %>% filter(year >= 2024)

# --- ROE model (main dependent variable) ---
model_roe <- lm(
  roe ~ carbon_emissions + energy_consumption + employee_turnover +
        industry_type + ceo_gender + year,
  data = train
)

summary(model_roe)

## 
## Call:
## lm(formula = roe ~ carbon_emissions + energy_consumption + employee_turnover + 
##     industry_type + ceo_gender + year, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.17750 -0.07029 -0.00265  0.02201  0.39258 
## 
## Coefficients:
##                                                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                    -2.063e+01  1.320e+02  -0.156    0.878
## carbon_emissions                                4.553e-06  6.610e-06   0.689    0.504
## energy_consumption                             -2.880e-09  4.787e-09  -0.602    0.559
## employee_turnover                              -2.251e-01  4.274e-01  -0.527    0.608
## industry_typeHealthcare                         1.937e-01  2.263e-01   0.856    0.409
## industry_typePharmaceuticals and Biotechnology  2.534e-01  2.857e-01   0.887    0.392
## ceo_genderFemale                               -1.680e-02  1.198e-01  -0.140    0.891
## year                                            1.021e-02  6.532e-02   0.156    0.878
## 
## Residual standard error: 0.1561 on 12 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.1638,	Adjusted R-squared:  -0.324 
## F-statistic: 0.3358 on 7 and 12 DF,  p-value: 0.9222

# Predictions
test$pred_roe <- predict(model_roe, newdata = test)

# --- ROA model (secondary diagnostic) ---
model_roa <- lm(
  roa ~ carbon_emissions + energy_consumption + employee_turnover +
        industry_type + ceo_gender + year,
  data = train
)

summary(model_roa)

## 
## Call:
## lm(formula = roa ~ carbon_emissions + energy_consumption + employee_turnover + 
##     industry_type + ceo_gender + year, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.14946 -0.04262 -0.00180  0.02661  0.34493 
## 
## Coefficients:
##                                                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                    -2.239e+01  1.084e+02  -0.207    0.840
## carbon_emissions                                4.189e-06  5.430e-06   0.771    0.455
## energy_consumption                             -1.371e-09  3.932e-09  -0.349    0.733
## employee_turnover                              -5.568e-02  3.511e-01  -0.159    0.877
## industry_typeHealthcare                         1.465e-01  1.859e-01   0.788    0.446
## industry_typePharmaceuticals and Biotechnology  1.429e-01  2.347e-01   0.609    0.554
## ceo_genderFemale                               -2.333e-02  9.838e-02  -0.237    0.817
## year                                            1.106e-02  5.366e-02   0.206    0.840
## 
## Residual standard error: 0.1282 on 12 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.1457,	Adjusted R-squared:  -0.3527 
## F-statistic: 0.2923 on 7 and 12 DF,  p-value: 0.9443

test$pred_roa <- predict(model_roa, newdata = test)

# --- Model performance metrics ---
rmse <- function(a, p) sqrt(mean((a - p)^2, na.rm = TRUE))
mae  <- function(a, p) mean(abs(a - p), na.rm = TRUE)
r2   <- function(a, p) 1 - sum((a - p)^2, na.rm = TRUE) /
                      sum((a - mean(a, na.rm = TRUE))^2, na.rm = TRUE)

metrics <- tibble::tibble(
  Metric = c("RMSE", "MAE", "R^2"),
  ROE    = c(rmse(test$roe, test$pred_roe),
             mae(test$roe,  test$pred_roe),
             r2(test$roe,   test$pred_roe)),
  ROA    = c(rmse(test$roa, test$pred_roa),
             mae(test$roa,  test$pred_roa),
             r2(test$roa,   test$pred_roa))
)

DT::datatable(metrics, caption = "Pooled Model Performance (ROE vs ROA)")

## Error in loadNamespace(name): there is no package called 'webshot'

# --- Actual vs Predicted table for test set ---
test_results <- test %>%
  select(company_name, year,
         roe, pred_roe,
         roa, pred_roa)

DT::datatable(test_results, caption = "Test Set Predictions — Pooled Model")

## Error in loadNamespace(name): there is no package called 'webshot'

# --- Partial effect plots for each predictor in ROE model ---

# 1. Carbon Emissions vs ROE
plot(train$carbon_emissions, train$roe,
     xlab = "Carbon Emissions", ylab = "ROE",
     main = "Effect of Carbon Emissions on ROE",
     pch = 19, col = "blue")

emm_em <- data.frame(
  carbon_emissions = seq(min(train$carbon_emissions, na.rm = TRUE),
                         max(train$carbon_emissions, na.rm = TRUE),
                         length.out = 100),
  energy_consumption = mean(train$energy_consumption, na.rm = TRUE),
  employee_turnover  = mean(train$employee_turnover, na.rm = TRUE),
  industry_type      = train$industry_type[1], # pick one level as baseline
  ceo_gender         = train$ceo_gender[1],
  year               = mean(train$year, na.rm = TRUE)
)

lines(emm_em$carbon_emissions,
      predict(model_roe, newdata = emm_em),
      col = "red", lwd = 2)

legend("topleft", legend = c("Actual Data", "Fitted Line"),
       col = c("blue","red"), pch = c(19, NA), lty = c(NA,1))

plot of chunk prediction2

# 2. Energy Consumption vs ROE
plot(train$energy_consumption, train$roe,
     xlab = "Energy Consumption", ylab = "ROE",
     main = "Effect of Energy Consumption on ROE",
     pch = 19, col = "darkgreen")

emm_en <- data.frame(
  carbon_emissions   = mean(train$carbon_emissions, na.rm = TRUE),
  energy_consumption = seq(min(train$energy_consumption, na.rm = TRUE),
                           max(train$energy_consumption, na.rm = TRUE),
                           length.out = 100),
  employee_turnover  = mean(train$employee_turnover, na.rm = TRUE),
  industry_type      = train$industry_type[1],
  ceo_gender         = train$ceo_gender[1],
  year               = mean(train$year, na.rm = TRUE)
)

lines(emm_en$energy_consumption,
      predict(model_roe, newdata = emm_en),
      col = "red", lwd = 2)

legend("topleft", legend = c("Actual Data", "Fitted Line"),
       col = c("darkgreen","red"), pch = c(19, NA), lty = c(NA,1))

plot of chunk prediction2

# 3. Employee Turnover vs ROE
plot(train$employee_turnover, train$roe,
     xlab = "Employee Turnover", ylab = "ROE",
     main = "Effect of Employee Turnover on ROE",
     pch = 19, col = "purple")

emm_to <- data.frame(
  carbon_emissions   = mean(train$carbon_emissions, na.rm = TRUE),
  energy_consumption = mean(train$energy_consumption, na.rm = TRUE),
  employee_turnover  = seq(min(train$employee_turnover, na.rm = TRUE),
                           max(train$employee_turnover, na.rm = TRUE),
                           length.out = 100),
  industry_type      = train$industry_type[1],
  ceo_gender         = train$ceo_gender[1],
  year               = mean(train$year, na.rm = TRUE)
)

lines(emm_to$employee_turnover,
      predict(model_roe, newdata = emm_to),
      col = "red", lwd = 2)

legend("topleft", legend = c("Actual Data", "Fitted Line"),
       col = c("purple","red"), pch = c(19, NA), lty = c(NA,1))

plot of chunk prediction2

Pooled regression statistically valid but predictive accuracy remains weak.
Useful for identifying directional patterns (-turnover -> -ROE ; +emissions -> -ROE).
Confirms that external factors (market shocks, policies, R&D) drive much of the unexplained variation.

7) Conclusion

Our analysis linked sustainability metrics such as emissions, energy use and turnover with financial performance such as ROE/ROA across 10 companies over 5 years.
Company-level models failed due to very limited data (4–5 years per firm), showing why data depth matters in predictive analytics.
Pooled regression models were statistically valid but had weak predictive accuracy, highlighting the complex nature of ROE.
Despite poor prediction, the models provided directional insights:
- High employee turnover → consistently lower ROE/ROA.
- High emissions & energy intensity → generally linked with weaker returns.
Clustering analysis grouped firms into: efficient & profitable, underperformers, and transitioners — offering a method for strategic benchmarking.

8) Use of AI declaration

Declaration: AI tools were used only for grammatical refinement, formatting and pretty tables and graphs. All analysis, data preparation, modeling choices, and interpretations are original work.

9) Data sources declaration

Annual reports sourced from NSE India webpage
Sustainability reports also sourced from NSE India Webpage
Ratios through Dion Solutions Ltd. Available on MoneyControl

10) Blog link

https://proplayerplayz.github.io

Market Segmentation

Virtual Organization and the Flattening of Management: What MIS Enables