Correlation analysis and predictive modeling¶
This notebook builds on the multi-scan assembly demonstrated in basic_usage.ipynb (DatasetBuilder.from_date_scan_numbers) and walks through:
- Correlation ranking (
geecs_data_utils.analysis.CorrelationReport) — Pearson / Spearman / Kendall with row filters and substring exclusions. - Predictive modeling (
geecs_data_utils.modeling.ml) —MLDatasetBuilderfor X/y selection,RegressionTrainerfor linear / ridge / elastic-net fits, sklearn pipeline + CV examples (PCA, PLS, Optuna), andsave_model_artifact/load_model_artifactfor persistence.
Requires geecs_data_utils installed with the ml extra (pip install "geecs_data_utils[ml]" or poetry install --extras ml).
Correlation ranking (geecs_data_utils.analysis)¶
Use CorrelationReport.from_dataframe instead of ad-hoc notebook functions. Row filters use RowFilterSpec tuples. exclude_terms trims the reported correlation series only.
Outliers: use apply_outlier_config + OutlierConfig from geecs_data_utils.data (e.g. method="nan" with sigma=) before correlating.
Each recipe reloads data from disk independently — run either correlation example on its own.
from geecs_data_utils.data import DatasetBuilder, OutlierConfig, apply_outlier_config
from geecs_data_utils.analysis import CorrelationReport
assembled = DatasetBuilder.from_date_scan_numbers(
year=2026,
month=4,
day=14,
experiment="Undulator",
numbers=range(39, 47),
load_scalars=True,
source="sfile",
on_missing="skip",
dropna=False,
)
df_capped = apply_outlier_config(
assembled.frame.copy(),
OutlierConfig(method="nan", sigma=5),
)
param = "U_BCaveICT Python Results.ChA Alias:U_BCaveICT Charge pC"
report = CorrelationReport.from_dataframe(
df_capped,
target=param,
exclude_terms=[
"charge",
"magspec",
"magcam",
"esp",
"HP_daq",
"1HzShifted",
"HallProbe",
"jetz",
"probecamstage",
"timestamp",
"scan",
"Bin #",
"Elapsed time",
"Shotnumber",
"Aerotech",
],
filters=[(param, ">", 2)],
top_n=100,
)
corrs = report.correlations
df_filt = report.filtered_frame
print(corrs.head(5))
You can also handle filtering and outliers directly during dataset building rather than in downstream function calls or during correlation analysis. Each of these calls utilize the same central functionality.
from geecs_data_utils.data import DatasetBuilder, OutlierConfig
from geecs_data_utils.analysis import CorrelationReport
param = "U_BCaveICT Python Results.ChA Alias:U_BCaveICT Charge pC"
assembled = DatasetBuilder.from_date_scan_numbers(
year=2026,
month=4,
day=14,
experiment="Undulator",
numbers=range(39, 47),
filters=[(param, ">", 2)],
outlier_config=OutlierConfig(method="nan", sigma=5),
load_scalars=True,
source="sfile",
on_missing="skip",
dropna=False,
)
report = CorrelationReport.from_dataframe(
assembled.frame,
target=param,
exclude_terms=[
"charge",
"magspec",
"magcam",
"esp",
"HP_daq",
"1HzShifted",
"HallProbe",
"jetz",
"probecamstage",
"timestamp",
"scan",
"Bin #",
"Elapsed time",
"Shotnumber",
"Aerotech",
],
top_n=100,
)
corrs = report.correlations
df_filt = report.filtered_frame
print(corrs.head(5))
Modeling with sklearn (notebook-first)¶
This section is self-contained: it reloads the same scan merge and preprocessing used in basic_usage.ipynb (multi-scan table, ±5σ outliers → NaN on numeric columns, ICT charge target, threshold row filter) using DatasetBuilder, OutlierConfig / apply_outlier_config, CorrelationReport, and MLDatasetBuilder.
Earlier notebook sections are optional here.
The following cells mirror basic_usage.ipynb: in-sample OLS, Monte Carlo train/test splits, PCA + reduced-rank regression, PLS, correlation preview, missing-value footprint, Optuna (linear / ridge / lasso), best-trial diagnostics, and sklearn Pipeline cross-validation.
For packaged fit/save/load workflows, see geecs_data_utils.modeling.ml.
Shared modeling table (MLDatasetBuilder)¶
After correlation ranking, MLDatasetBuilder.from_dataframe drops rows with NaN in any selected predictor or the target — the same effective definition as df_filt.dropna(subset=[target] + check_columns), but kept in one place for all sklearn cells above.
from geecs_data_utils.data import DatasetBuilder, OutlierConfig, apply_outlier_config
from geecs_data_utils.analysis import CorrelationReport
from geecs_data_utils.modeling.ml import MLDatasetBuilder
# Demo scans for this notebook (shared with the multi-scan section of basic_usage.ipynb): ±5σ → NaN, ICT charge target
_DEMO_DATE = dict(year=2026, month=4, day=14, experiment="Undulator")
_DEMO_NUMBERS = range(39, 47)
assembled_model = DatasetBuilder.from_date_scan_numbers(
**_DEMO_DATE,
numbers=_DEMO_NUMBERS,
append_paths=False,
on_missing="skip",
dropna=False,
)
df_raw_model = assembled_model.frame.copy()
df_capped_model = apply_outlier_config(
df_raw_model,
OutlierConfig(method="nan", sigma=5),
)
param = "U_BCaveICT Python Results.ChA Alias:U_BCaveICT Charge pC"
report_model = CorrelationReport.from_dataframe(
df_capped_model,
target=param,
exclude_terms=[
"charge",
"magspec",
"magcam",
"esp",
"HP_daq",
"1HzShifted",
"HallProbe",
"jetz",
"probecamstage",
"timestamp",
"scan",
"Bin #",
"Elapsed time",
"Shotnumber",
"Aerotech",
],
filters=[(param, ">", 2)],
top_n=100,
)
corrs = report_model.correlations
df_filt = report_model.filtered_frame
key_column = param
check_columns = corrs.index.tolist()
ml_ds = MLDatasetBuilder.from_dataframe(
df_filt,
target_column=key_column,
feature_columns=check_columns,
dropna=True,
)
X_arr = ml_ds.frame[ml_ds.feature_columns].to_numpy(dtype=float)
y_arr = ml_ds.frame[ml_ds.target_column].to_numpy(dtype=float)
_lr = assembled_model.load_report
if _lr is not None:
print(
"Demo merge: loaded scan numbers",
list(_lr.numbers_loaded),
f"({len(_lr.numbers_loaded)} ok, {len(_lr.skipped)} skipped)",
)
print(f"Rows in correlation filtered_frame: {len(df_filt)}")
print(
f"Modeling matrix after dropna on target + {len(check_columns)} features: {ml_ds.rows_final} rows"
)
In-sample multivariate linear fit¶
Matches basic_usage.ipynb: scale all predictors, LinearRegression, variance explained (score), scatter actual vs predicted charge, 2D histogram.
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_arr)
model = LinearRegression()
model.fit(X_scaled, y_arr)
r2 = model.score(X_scaled, y_arr)
y_pred = model.predict(X_scaled)
print(f"Total variance explained by check parameters: {r2:.2f}")
plt.figure(figsize=(4, 4))
plt.scatter(y_arr, y_pred, alpha=0.5)
plt.plot(
[y_arr.min(), y_arr.max()],
[y_arr.min(), y_arr.max()],
"r--",
label="Perfect prediction",
)
plt.xlabel("Actual Charge [pC]")
plt.ylabel("Predicted Charge [pC]")
plt.legend()
plt.show()
plt.figure(figsize=(4, 4))
plt.hist2d(y_arr, y_pred, bins=100)
plt.xlabel("Actual Charge [pC]")
plt.ylabel("Predicted Charge [pC]")
plt.colorbar(label="Counts")
plt.show()
Train / test splits (Monte Carlo loop)¶
Repeat train_test_split with random_state=i on X_scaled / y_arr, collect test R², plot the last split (same layout as basic_usage.ipynb), then mean(vals) in the next cell.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
vals = []
for i in range(1, 50):
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y_arr, test_size=0.3, random_state=i
)
reg = LinearRegression()
reg.fit(X_train, y_train)
y_hat = reg.predict(X_test)
r2_test = reg.score(X_test, y_test)
vals.append(r2_test)
print(f"Last split test R²: {r2_test:.3f}")
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
axes[0].scatter(y_test, y_hat, alpha=0.5)
axes[0].plot(
[y_test.min(), y_test.max()],
[y_test.min(), y_test.max()],
"r--",
)
axes[0].set_xlabel("Actual Key Parameter")
axes[0].set_ylabel("Predicted")
axes[0].set_title("Predicted vs Actual")
axes[1].hist(y_test, bins=30, alpha=0.5, label="Actual", density=True)
axes[1].hist(y_hat, bins=30, alpha=0.5, label="Predicted", density=True)
axes[1].set_xlabel("Key Parameter Value")
axes[1].set_ylabel("Density")
axes[1].set_title("Distribution: Actual vs Predicted")
axes[1].legend()
residuals = y_test - y_hat
axes[2].scatter(y_hat, residuals, alpha=0.5)
axes[2].axhline(0, color="r", linestyle="--")
axes[2].set_xlabel("Predicted")
axes[2].set_ylabel("Residual (Actual - Predicted)")
axes[2].set_title("Residuals vs Predicted")
plt.tight_layout()
plt.show()
print(f"residual rms: {np.std(residuals):.3f}")
print(f"rms of original signal: {np.std(y_test):.3f}")
print(f"percentage of mean rms of residuals: {np.std(residuals) / np.mean(y_test):.3f}")
print(f"percentage of mean rms of signal: {np.std(y_test) / np.mean(y_test):.3f}")
PCA on ranked predictors¶
Use df_filt[check_columns].dropna() and align y with df_filt.loc[X.index, key_column] — same indexing as basic_usage.ipynb (distinct from the MLDatasetBuilder matrix so PCA row counts match the simpler df_filt-driven path).
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
X = df_filt[check_columns].copy()
X = X.dropna()
scaler_x = StandardScaler()
X_scaled_pca = scaler_x.fit_transform(X)
pca = PCA()
X_pca = pca.fit_transform(X_scaled_pca)
explained_variance = pca.explained_variance_ratio_
cum_variance = np.cumsum(explained_variance)
display(
pd.DataFrame(
{
"PC": np.arange(1, len(explained_variance) + 1),
"Explained Variance Ratio": explained_variance,
"Cumulative Variance": cum_variance,
}
).head(15)
)
plt.figure(figsize=(5, 3))
plt.plot(cum_variance, marker="o")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("PCA Cumulative Explained Variance")
plt.grid(True)
plt.show()
import numpy as np
n_80 = int(np.argmax(cum_variance >= 0.80) + 1)
n_90 = int(np.argmax(cum_variance >= 0.90) + 1)
n_95 = int(np.argmax(cum_variance >= 0.95) + 1)
print("Components for 80% variance:", n_80)
print("Components for 90% variance:", n_90)
print("Components for 95% variance:", n_95)
from sklearn.linear_model import LinearRegression
y_pc = df_filt.loc[X.index, key_column]
N = n_95
X_reduced = X_pca[:, :N]
reg_pc = LinearRegression()
reg_pc.fit(X_reduced, y_pc)
r2_pca = reg_pc.score(X_reduced, y_pc)
print("R² using first", N, "PCs:", r2_pca)
PLS regression¶
PLSRegression on X_scaled_pca with y_pls aligned to X.index (same rows as the PCA cell).
from sklearn.cross_decomposition import PLSRegression
y_pls = df_filt.loc[X.index, key_column].values
n_components = 10
pls = PLSRegression(n_components=n_components)
pls.fit(X_scaled_pca, y_pls)
r2_train = pls.score(X_scaled_pca, y_pls)
print(f"R² using {n_components} PLS components: {r2_train:.3f}")
Ranked correlations and per-column NaN footprint¶
Inspect corrs and quantify rows lost if each column were required individually.
import pandas as pd
corrs.head(40)
cols_ms = [key_column] + check_columns
rows_before = len(df_filt)
rows_lost = {}
for col in cols_ms:
rows_after = df_filt.dropna(subset=[col]).shape[0]
rows_lost[col] = rows_before - rows_after
rows_lost_series = pd.Series(rows_lost).sort_values(ascending=False)
rows_lost_series.head(20)
Hyperparameter search with Optuna¶
Objective matches basic_usage.ipynb (linear / ridge / lasso + top_n_features). The gbr branch stays present but unreachable unless you widen model_name suggestions.
import optuna
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
optuna.logging.set_verbosity(optuna.logging.WARNING)
df_clean = df_filt.dropna(subset=[key_column] + check_columns)
def objective(trial):
"""Return mean ShuffleSplit CV R² for the suggested model and feature count."""
model_name = trial.suggest_categorical("model", ["linear", "ridge", "lasso"])
top_n = trial.suggest_int("top_n_features", 5, len(check_columns))
features = corrs.index[:top_n].tolist()
X = df_clean[features].values
y = df_clean[key_column].values
if model_name == "linear":
model = LinearRegression()
elif model_name == "ridge":
alpha = trial.suggest_float("alpha", 1e-3, 100, log=True)
model = Ridge(alpha=alpha)
elif model_name == "lasso":
alpha = trial.suggest_float("alpha", 1e-4, 10, log=True)
model = Lasso(alpha=alpha, max_iter=5000)
elif model_name == "gbr":
model = GradientBoostingRegressor(
n_estimators=trial.suggest_int("n_estimators", 50, 200),
max_depth=trial.suggest_int("max_depth", 2, 6),
learning_rate=trial.suggest_float("lr", 0.01, 0.3, log=True),
subsample=trial.suggest_float("subsample", 0.6, 1.0),
random_state=42,
)
pipe = Pipeline([("scaler", StandardScaler()), ("model", model)])
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="r2")
return scores.mean()
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=60, show_progress_bar=True)
print(f"\nBest CV R²: {study.best_value:.3f}")
print(f"Best params: {study.best_params}")
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
bp = study.best_params
top_n = bp["top_n_features"]
features = corrs.index[:top_n].tolist()
sub = df_filt[[key_column] + features].dropna()
X_ot = sub[features].values
y_ot = sub[key_column].values
model_name = bp["model"]
if model_name == "ridge":
best_model = Ridge(alpha=bp["alpha"])
elif model_name == "lasso":
best_model = Lasso(alpha=bp["alpha"], max_iter=5000)
elif model_name == "elasticnet":
best_model = ElasticNet(alpha=bp["alpha"], l1_ratio=bp["l1_ratio"], max_iter=5000)
elif model_name == "rf":
best_model = RandomForestRegressor(
n_estimators=bp["n_estimators"],
max_depth=bp["max_depth"],
random_state=42,
)
elif model_name == "pls":
best_model = PLSRegression(n_components=bp["n_components"])
else:
best_model = LinearRegression()
pipe = Pipeline([("scaler", StandardScaler()), ("model", best_model)])
X_train, X_test, y_train, y_test = train_test_split(
X_ot, y_ot, test_size=0.3, random_state=42
)
pipe.fit(X_train, y_train)
y_pred_ot = pipe.predict(X_test).ravel()
r2_test = pipe.score(X_test, y_test)
residuals = y_test - y_pred_ot
fig, axes = plt.subplots(1, 3, figsize=(16, 4.5))
axes[0].scatter(y_test, y_pred_ot, alpha=0.4, s=10)
lims = [min(y_test.min(), y_pred_ot.min()), max(y_test.max(), y_pred_ot.max())]
axes[0].plot(lims, lims, "r--", lw=1)
axes[0].set_xlabel("Actual")
axes[0].set_ylabel("Predicted")
axes[0].set_title(f"Predicted vs Actual (R²={r2_test:.3f})")
axes[1].hist(y_test, bins=40, alpha=0.5, label="Actual", density=True)
axes[1].hist(y_pred_ot, bins=40, alpha=0.5, label="Predicted", density=True)
axes[1].set_xlabel(key_column.split(".")[-1])
axes[1].set_title("Distribution comparison")
axes[1].legend()
axes[2].scatter(y_pred_ot, residuals, alpha=0.4, s=10)
axes[2].axhline(0, color="r", linestyle="--", lw=1)
axes[2].set_xlabel("Predicted")
axes[2].set_ylabel("Residual")
axes[2].set_title("Residuals")
plt.tight_layout()
plt.show()
print(f"Test R²: {r2_test:.3f}")
print(f"Residual RMS: {np.std(residuals):.3f}")
print(f"Signal RMS: {np.std(y_test):.3f}")
print(f"Residual/Mean: {np.std(residuals) / np.mean(y_test):.1%}")
fitted_model = pipe.named_steps["model"]
if hasattr(fitted_model, "feature_importances_"):
imp = fitted_model.feature_importances_
idx = np.argsort(imp)[-15:]
plt.figure(figsize=(8, 5))
plt.barh([features[i] for i in idx], imp[idx])
plt.xlabel("Feature Importance")
plt.title(f"Top features ({model_name})")
plt.tight_layout()
plt.show()
elif hasattr(fitted_model, "coef_"):
coefs = np.abs(np.array(fitted_model.coef_).ravel())
idx = np.argsort(coefs)[-15:]
plt.figure(figsize=(8, 5))
plt.barh([features[i] for i in idx], coefs[idx])
plt.xlabel("|Coefficient| (on scaled features)")
plt.title(f"Top features ({model_name})")
plt.tight_layout()
plt.show()
top_keep = study.best_params["top_n_features"]
features_best = corrs.index[:top_keep].tolist()
df_clean = df_filt.dropna(subset=[key_column] + check_columns)
sub_best = df_filt[[key_column] + features_best].dropna()
print(f"All ranked features: {len(df_clean)} rows, {len(check_columns)} features")
print(f"Optuna best subset: {len(sub_best)} rows, {top_keep} features")
Plain pipeline CV (linear regression)¶
Pipeline + cross_val_score with 3-fold, 5-fold, and ShuffleSplit — same checks as basic_usage.ipynb.
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
df_clean = df_filt.dropna(subset=[key_column] + check_columns)
X_cv = df_clean[check_columns].values
y_cv = df_clean[key_column].values
pipe = Pipeline([("scaler", StandardScaler()), ("model", LinearRegression())])
cv3 = cross_val_score(pipe, X_cv, y_cv, cv=3, scoring="r2")
cv5 = cross_val_score(pipe, X_cv, y_cv, cv=5, scoring="r2")
print(f"3-fold CV R²: {cv3.mean():.3f} ± {cv3.std():.3f} (per fold: {cv3})")
print(f"5-fold CV R²: {cv5.mean():.3f} ± {cv5.std():.3f} (per fold: {cv5})")
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=42)
scores = cross_val_score(pipe, X_cv, y_cv, cv=cv, scoring="r2")
print(f"ShuffleSplit CV R²: {scores.mean():.3f} ± {scores.std():.3f}")
Packaged fit / persist / reload (RegressionTrainer)¶
Earlier cells used sklearn directly for teaching; MLDatasetBuilder only standardized the tabular X / y matrix.
The modeling.ml layer adds RegressionTrainer (default preprocessing pipeline + linear / ridge / elastic-net), TrainingMetrics (optional cv summary), and save_model_artifact / load_model_artifact for a directory of joblib + JSON sidecars. Run after the modeling setup cell that defines ml_ds.
import tempfile
from pathlib import Path
import pandas as pd
from geecs_data_utils.modeling.ml import (
RegressionTrainer,
load_model_artifact,
save_model_artifact,
)
# Uses ml_ds from the modeling setup cell (same merged-scan table).
top_k = min(25, len(ml_ds.feature_columns))
feat_cols = ml_ds.feature_columns[:top_k]
X_demo = ml_ds.frame[feat_cols]
y_demo = ml_ds.frame[ml_ds.target_column]
trainer = RegressionTrainer(model="ridge", model_params={"alpha": 1.0})
artifact = trainer.fit(
X_demo,
y_demo,
cv=5,
target_name="ict_charge_pc",
scan_info={"notebook": "basic_usage_refactor"},
)
print(f"In-sample R²: {artifact.metrics.r2:.4f}")
if artifact.metrics.cv_r2_mean is not None:
print(
"5-fold CV R²:",
f"{artifact.metrics.cv_r2_mean:.4f} ± {artifact.metrics.cv_r2_std:.4f}",
)
tmpdir = Path(tempfile.mkdtemp(prefix="geecs_ml_demo_"))
artifact_dir = tmpdir / "ridge_charge_demo"
save_model_artifact(artifact, artifact_dir)
print("Saved artifact dir:", artifact_dir)
loaded = load_model_artifact(artifact_dir)
preview = X_demo.iloc[:3]
pred = loaded.predict(preview)
actual = y_demo.iloc[:3].to_numpy()
compare = pd.DataFrame({"actual": actual, "predicted": pred, "residual": pred - actual})
print(compare.round(4).to_string(index=False))