Correlation analysis and predictive modeling¶

This notebook builds on the multi-scan assembly demonstrated in basic_usage.ipynb (DatasetBuilder.from_date_scan_numbers) and walks through:

Correlation ranking (geecs_data_utils.analysis.CorrelationReport) — Pearson / Spearman / Kendall with row filters and substring exclusions.
Predictive modeling (geecs_data_utils.modeling.ml) — MLDatasetBuilder for X/y selection, RegressionTrainer for linear / ridge / elastic-net fits, sklearn pipeline + CV examples (PCA, PLS, Optuna), and save_model_artifact / load_model_artifact for persistence.

Requires geecs_data_utils installed with the ml extra (pip install "geecs_data_utils[ml]" or poetry install --extras ml).

Correlation ranking (`geecs_data_utils.analysis`)¶

Use CorrelationReport.from_dataframe instead of ad-hoc notebook functions. Row filters use RowFilterSpec tuples. exclude_terms trims the reported correlation series only.

Outliers: use apply_outlier_config + OutlierConfig from geecs_data_utils.data (e.g. method="nan" with sigma=) before correlating. Each recipe reloads data from disk independently — run either correlation example on its own.

In [ ]:

Copied!





from geecs_data_utils.data import DatasetBuilder, OutlierConfig, apply_outlier_config
from geecs_data_utils.analysis import CorrelationReport

assembled = DatasetBuilder.from_date_scan_numbers(
    year=2026,
    month=4,
    day=14,
    experiment="Undulator",
    numbers=range(39, 47),
    load_scalars=True,
    source="sfile",
    on_missing="skip",
    dropna=False,
)

df_capped = apply_outlier_config(
    assembled.frame.copy(),
    OutlierConfig(method="nan", sigma=5),
)

param = "U_BCaveICT Python Results.ChA Alias:U_BCaveICT Charge pC"

report = CorrelationReport.from_dataframe(
    df_capped,
    target=param,
    exclude_terms=[
        "charge",
        "magspec",
        "magcam",
        "esp",
        "HP_daq",
        "1HzShifted",
        "HallProbe",
        "jetz",
        "probecamstage",
        "timestamp",
        "scan",
        "Bin #",
        "Elapsed time",
        "Shotnumber",
        "Aerotech",
    ],
    filters=[(param, ">", 2)],
    top_n=100,
)

corrs = report.correlations
df_filt = report.filtered_frame

print(corrs.head(5))
from geecs_data_utils.data import DatasetBuilder, OutlierConfig, apply_outlier_config
from geecs_data_utils.analysis import CorrelationReport

assembled = DatasetBuilder.from_date_scan_numbers(
    year=2026,
    month=4,
    day=14,
    experiment="Undulator",
    numbers=range(39, 47),
    load_scalars=True,
    source="sfile",
    on_missing="skip",
    dropna=False,
)

df_capped = apply_outlier_config(
    assembled.frame.copy(),
    OutlierConfig(method="nan", sigma=5),
)

param = "U_BCaveICT Python Results.ChA Alias:U_BCaveICT Charge pC"

report = CorrelationReport.from_dataframe(
    df_capped,
    target=param,
    exclude_terms=[
        "charge",
        "magspec",
        "magcam",
        "esp",
        "HP_daq",
        "1HzShifted",
        "HallProbe",
        "jetz",
        "probecamstage",
        "timestamp",
        "scan",
        "Bin #",
        "Elapsed time",
        "Shotnumber",
        "Aerotech",
    ],
    filters=[(param, ">", 2)],
    top_n=100,
)

corrs = report.correlations
df_filt = report.filtered_frame

print(corrs.head(5))

You can also handle filtering and outliers directly during dataset building rather than in downstream function calls or during correlation analysis. Each of these calls utilize the same central functionality.

In [ ]:

Copied!





from geecs_data_utils.data import DatasetBuilder, OutlierConfig
from geecs_data_utils.analysis import CorrelationReport

param = "U_BCaveICT Python Results.ChA Alias:U_BCaveICT Charge pC"

assembled = DatasetBuilder.from_date_scan_numbers(
    year=2026,
    month=4,
    day=14,
    experiment="Undulator",
    numbers=range(39, 47),
    filters=[(param, ">", 2)],
    outlier_config=OutlierConfig(method="nan", sigma=5),
    load_scalars=True,
    source="sfile",
    on_missing="skip",
    dropna=False,
)

report = CorrelationReport.from_dataframe(
    assembled.frame,
    target=param,
    exclude_terms=[
        "charge",
        "magspec",
        "magcam",
        "esp",
        "HP_daq",
        "1HzShifted",
        "HallProbe",
        "jetz",
        "probecamstage",
        "timestamp",
        "scan",
        "Bin #",
        "Elapsed time",
        "Shotnumber",
        "Aerotech",
    ],
    top_n=100,
)

corrs = report.correlations
df_filt = report.filtered_frame

print(corrs.head(5))
from geecs_data_utils.data import DatasetBuilder, OutlierConfig
from geecs_data_utils.analysis import CorrelationReport

param = "U_BCaveICT Python Results.ChA Alias:U_BCaveICT Charge pC"

assembled = DatasetBuilder.from_date_scan_numbers(
    year=2026,
    month=4,
    day=14,
    experiment="Undulator",
    numbers=range(39, 47),
    filters=[(param, ">", 2)],
    outlier_config=OutlierConfig(method="nan", sigma=5),
    load_scalars=True,
    source="sfile",
    on_missing="skip",
    dropna=False,
)

report = CorrelationReport.from_dataframe(
    assembled.frame,
    target=param,
    exclude_terms=[
        "charge",
        "magspec",
        "magcam",
        "esp",
        "HP_daq",
        "1HzShifted",
        "HallProbe",
        "jetz",
        "probecamstage",
        "timestamp",
        "scan",
        "Bin #",
        "Elapsed time",
        "Shotnumber",
        "Aerotech",
    ],
    top_n=100,
)

corrs = report.correlations
df_filt = report.filtered_frame

print(corrs.head(5))

Modeling with sklearn (notebook-first)¶

This section is self-contained: it reloads the same scan merge and preprocessing used in basic_usage.ipynb (multi-scan table, ±5σ outliers → NaN on numeric columns, ICT charge target, threshold row filter) using DatasetBuilder, OutlierConfig / apply_outlier_config, CorrelationReport, and MLDatasetBuilder.

Earlier notebook sections are optional here.

The following cells mirror basic_usage.ipynb: in-sample OLS, Monte Carlo train/test splits, PCA + reduced-rank regression, PLS, correlation preview, missing-value footprint, Optuna (linear / ridge / lasso), best-trial diagnostics, and sklearn Pipeline cross-validation.

For packaged fit/save/load workflows, see geecs_data_utils.modeling.ml.

Shared modeling table (`MLDatasetBuilder`)¶

After correlation ranking, MLDatasetBuilder.from_dataframe drops rows with NaN in any selected predictor or the target — the same effective definition as df_filt.dropna(subset=[target] + check_columns), but kept in one place for all sklearn cells above.

In [ ]:

Copied!





from geecs_data_utils.data import DatasetBuilder, OutlierConfig, apply_outlier_config
from geecs_data_utils.analysis import CorrelationReport
from geecs_data_utils.modeling.ml import MLDatasetBuilder

# Demo scans for this notebook (shared with the multi-scan section of basic_usage.ipynb): ±5σ → NaN, ICT charge target
_DEMO_DATE = dict(year=2026, month=4, day=14, experiment="Undulator")
_DEMO_NUMBERS = range(39, 47)

assembled_model = DatasetBuilder.from_date_scan_numbers(
    **_DEMO_DATE,
    numbers=_DEMO_NUMBERS,
    append_paths=False,
    on_missing="skip",
    dropna=False,
)

df_raw_model = assembled_model.frame.copy()
df_capped_model = apply_outlier_config(
    df_raw_model,
    OutlierConfig(method="nan", sigma=5),
)

param = "U_BCaveICT Python Results.ChA Alias:U_BCaveICT Charge pC"

report_model = CorrelationReport.from_dataframe(
    df_capped_model,
    target=param,
    exclude_terms=[
        "charge",
        "magspec",
        "magcam",
        "esp",
        "HP_daq",
        "1HzShifted",
        "HallProbe",
        "jetz",
        "probecamstage",
        "timestamp",
        "scan",
        "Bin #",
        "Elapsed time",
        "Shotnumber",
        "Aerotech",
    ],
    filters=[(param, ">", 2)],
    top_n=100,
)

corrs = report_model.correlations
df_filt = report_model.filtered_frame

key_column = param
check_columns = corrs.index.tolist()

ml_ds = MLDatasetBuilder.from_dataframe(
    df_filt,
    target_column=key_column,
    feature_columns=check_columns,
    dropna=True,
)

X_arr = ml_ds.frame[ml_ds.feature_columns].to_numpy(dtype=float)
y_arr = ml_ds.frame[ml_ds.target_column].to_numpy(dtype=float)

_lr = assembled_model.load_report
if _lr is not None:
    print(
        "Demo merge: loaded scan numbers",
        list(_lr.numbers_loaded),
        f"({len(_lr.numbers_loaded)} ok, {len(_lr.skipped)} skipped)",
    )
print(f"Rows in correlation filtered_frame: {len(df_filt)}")
print(
    f"Modeling matrix after dropna on target + {len(check_columns)} features: {ml_ds.rows_final} rows"
)
from geecs_data_utils.data import DatasetBuilder, OutlierConfig, apply_outlier_config
from geecs_data_utils.analysis import CorrelationReport
from geecs_data_utils.modeling.ml import MLDatasetBuilder

# Demo scans for this notebook (shared with the multi-scan section of basic_usage.ipynb): ±5σ → NaN, ICT charge target
_DEMO_DATE = dict(year=2026, month=4, day=14, experiment="Undulator")
_DEMO_NUMBERS = range(39, 47)

assembled_model = DatasetBuilder.from_date_scan_numbers(
    **_DEMO_DATE,
    numbers=_DEMO_NUMBERS,
    append_paths=False,
    on_missing="skip",
    dropna=False,
)

df_raw_model = assembled_model.frame.copy()
df_capped_model = apply_outlier_config(
    df_raw_model,
    OutlierConfig(method="nan", sigma=5),
)

param = "U_BCaveICT Python Results.ChA Alias:U_BCaveICT Charge pC"

report_model = CorrelationReport.from_dataframe(
    df_capped_model,
    target=param,
    exclude_terms=[
        "charge",
        "magspec",
        "magcam",
        "esp",
        "HP_daq",
        "1HzShifted",
        "HallProbe",
        "jetz",
        "probecamstage",
        "timestamp",
        "scan",
        "Bin #",
        "Elapsed time",
        "Shotnumber",
        "Aerotech",
    ],
    filters=[(param, ">", 2)],
    top_n=100,
)

corrs = report_model.correlations
df_filt = report_model.filtered_frame

key_column = param
check_columns = corrs.index.tolist()

ml_ds = MLDatasetBuilder.from_dataframe(
    df_filt,
    target_column=key_column,
    feature_columns=check_columns,
    dropna=True,
)

X_arr = ml_ds.frame[ml_ds.feature_columns].to_numpy(dtype=float)
y_arr = ml_ds.frame[ml_ds.target_column].to_numpy(dtype=float)

_lr = assembled_model.load_report
if _lr is not None:
    print(
        "Demo merge: loaded scan numbers",
        list(_lr.numbers_loaded),
        f"({len(_lr.numbers_loaded)} ok, {len(_lr.skipped)} skipped)",
    )
print(f"Rows in correlation filtered_frame: {len(df_filt)}")
print(
    f"Modeling matrix after dropna on target + {len(check_columns)} features: {ml_ds.rows_final} rows"
)

In-sample multivariate linear fit¶

Matches basic_usage.ipynb: scale all predictors, LinearRegression, variance explained (score), scatter actual vs predicted charge, 2D histogram.

In [ ]:

Copied!





import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_arr)

model = LinearRegression()
model.fit(X_scaled, y_arr)
r2 = model.score(X_scaled, y_arr)
y_pred = model.predict(X_scaled)

print(f"Total variance explained by check parameters: {r2:.2f}")

plt.figure(figsize=(4, 4))
plt.scatter(y_arr, y_pred, alpha=0.5)
plt.plot(
    [y_arr.min(), y_arr.max()],
    [y_arr.min(), y_arr.max()],
    "r--",
    label="Perfect prediction",
)
plt.xlabel("Actual Charge [pC]")
plt.ylabel("Predicted Charge [pC]")
plt.legend()
plt.show()

plt.figure(figsize=(4, 4))
plt.hist2d(y_arr, y_pred, bins=100)
plt.xlabel("Actual Charge [pC]")
plt.ylabel("Predicted Charge [pC]")
plt.colorbar(label="Counts")
plt.show()
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_arr)

model = LinearRegression()
model.fit(X_scaled, y_arr)
r2 = model.score(X_scaled, y_arr)
y_pred = model.predict(X_scaled)

print(f"Total variance explained by check parameters: {r2:.2f}")

plt.figure(figsize=(4, 4))
plt.scatter(y_arr, y_pred, alpha=0.5)
plt.plot(
    [y_arr.min(), y_arr.max()],
    [y_arr.min(), y_arr.max()],
    "r--",
    label="Perfect prediction",
)
plt.xlabel("Actual Charge [pC]")
plt.ylabel("Predicted Charge [pC]")
plt.legend()
plt.show()

plt.figure(figsize=(4, 4))
plt.hist2d(y_arr, y_pred, bins=100)
plt.xlabel("Actual Charge [pC]")
plt.ylabel("Predicted Charge [pC]")
plt.colorbar(label="Counts")
plt.show()

Train / test splits (Monte Carlo loop)¶

Repeat train_test_split with random_state=i on X_scaled / y_arr, collect test R², plot the last split (same layout as basic_usage.ipynb), then mean(vals) in the next cell.

In [ ]:

Copied!





import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

vals = []

for i in range(1, 50):
    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled, y_arr, test_size=0.3, random_state=i
    )
    reg = LinearRegression()
    reg.fit(X_train, y_train)
    y_hat = reg.predict(X_test)
    r2_test = reg.score(X_test, y_test)
    vals.append(r2_test)

print(f"Last split test R²: {r2_test:.3f}")

fig, axes = plt.subplots(1, 3, figsize=(14, 4))

axes[0].scatter(y_test, y_hat, alpha=0.5)
axes[0].plot(
    [y_test.min(), y_test.max()],
    [y_test.min(), y_test.max()],
    "r--",
)
axes[0].set_xlabel("Actual Key Parameter")
axes[0].set_ylabel("Predicted")
axes[0].set_title("Predicted vs Actual")

axes[1].hist(y_test, bins=30, alpha=0.5, label="Actual", density=True)
axes[1].hist(y_hat, bins=30, alpha=0.5, label="Predicted", density=True)
axes[1].set_xlabel("Key Parameter Value")
axes[1].set_ylabel("Density")
axes[1].set_title("Distribution: Actual vs Predicted")
axes[1].legend()

residuals = y_test - y_hat
axes[2].scatter(y_hat, residuals, alpha=0.5)
axes[2].axhline(0, color="r", linestyle="--")
axes[2].set_xlabel("Predicted")
axes[2].set_ylabel("Residual (Actual - Predicted)")
axes[2].set_title("Residuals vs Predicted")

plt.tight_layout()
plt.show()

print(f"residual rms: {np.std(residuals):.3f}")
print(f"rms of original signal: {np.std(y_test):.3f}")
print(f"percentage of mean rms of residuals: {np.std(residuals) / np.mean(y_test):.3f}")
print(f"percentage of mean rms of signal: {np.std(y_test) / np.mean(y_test):.3f}")
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

vals = []

for i in range(1, 50):
    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled, y_arr, test_size=0.3, random_state=i
    )
    reg = LinearRegression()
    reg.fit(X_train, y_train)
    y_hat = reg.predict(X_test)
    r2_test = reg.score(X_test, y_test)
    vals.append(r2_test)

print(f"Last split test R²: {r2_test:.3f}")

fig, axes = plt.subplots(1, 3, figsize=(14, 4))

axes[0].scatter(y_test, y_hat, alpha=0.5)
axes[0].plot(
    [y_test.min(), y_test.max()],
    [y_test.min(), y_test.max()],
    "r--",
)
axes[0].set_xlabel("Actual Key Parameter")
axes[0].set_ylabel("Predicted")
axes[0].set_title("Predicted vs Actual")

axes[1].hist(y_test, bins=30, alpha=0.5, label="Actual", density=True)
axes[1].hist(y_hat, bins=30, alpha=0.5, label="Predicted", density=True)
axes[1].set_xlabel("Key Parameter Value")
axes[1].set_ylabel("Density")
axes[1].set_title("Distribution: Actual vs Predicted")
axes[1].legend()

residuals = y_test - y_hat
axes[2].scatter(y_hat, residuals, alpha=0.5)
axes[2].axhline(0, color="r", linestyle="--")
axes[2].set_xlabel("Predicted")
axes[2].set_ylabel("Residual (Actual - Predicted)")
axes[2].set_title("Residuals vs Predicted")

plt.tight_layout()
plt.show()

print(f"residual rms: {np.std(residuals):.3f}")
print(f"rms of original signal: {np.std(y_test):.3f}")
print(f"percentage of mean rms of residuals: {np.std(residuals) / np.mean(y_test):.3f}")
print(f"percentage of mean rms of signal: {np.std(y_test) / np.mean(y_test):.3f}")

PCA on ranked predictors¶

Use df_filt[check_columns].dropna() and align y with df_filt.loc[X.index, key_column] — same indexing as basic_usage.ipynb (distinct from the MLDatasetBuilder matrix so PCA row counts match the simpler df_filt-driven path).

In [ ]:

Copied!





import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = df_filt[check_columns].copy()
X = X.dropna()

scaler_x = StandardScaler()
X_scaled_pca = scaler_x.fit_transform(X)

pca = PCA()
X_pca = pca.fit_transform(X_scaled_pca)

explained_variance = pca.explained_variance_ratio_
cum_variance = np.cumsum(explained_variance)

display(
    pd.DataFrame(
        {
            "PC": np.arange(1, len(explained_variance) + 1),
            "Explained Variance Ratio": explained_variance,
            "Cumulative Variance": cum_variance,
        }
    ).head(15)
)

plt.figure(figsize=(5, 3))
plt.plot(cum_variance, marker="o")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("PCA Cumulative Explained Variance")
plt.grid(True)
plt.show()
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = df_filt[check_columns].copy()
X = X.dropna()

scaler_x = StandardScaler()
X_scaled_pca = scaler_x.fit_transform(X)

pca = PCA()
X_pca = pca.fit_transform(X_scaled_pca)

explained_variance = pca.explained_variance_ratio_
cum_variance = np.cumsum(explained_variance)

display(
    pd.DataFrame(
        {
            "PC": np.arange(1, len(explained_variance) + 1),
            "Explained Variance Ratio": explained_variance,
            "Cumulative Variance": cum_variance,
        }
    ).head(15)
)

plt.figure(figsize=(5, 3))
plt.plot(cum_variance, marker="o")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("PCA Cumulative Explained Variance")
plt.grid(True)
plt.show()

In [ ]:

Copied!





import numpy as np

n_80 = int(np.argmax(cum_variance >= 0.80) + 1)
n_90 = int(np.argmax(cum_variance >= 0.90) + 1)
n_95 = int(np.argmax(cum_variance >= 0.95) + 1)

print("Components for 80% variance:", n_80)
print("Components for 90% variance:", n_90)
print("Components for 95% variance:", n_95)
import numpy as np

n_80 = int(np.argmax(cum_variance >= 0.80) + 1)
n_90 = int(np.argmax(cum_variance >= 0.90) + 1)
n_95 = int(np.argmax(cum_variance >= 0.95) + 1)

print("Components for 80% variance:", n_80)
print("Components for 90% variance:", n_90)
print("Components for 95% variance:", n_95)

In [ ]:

Copied!

from sklearn.linear_model import LinearRegression

y_pc = df_filt.loc[X.index, key_column]

N = n_95
X_reduced = X_pca[:, :N]

reg_pc = LinearRegression()
reg_pc.fit(X_reduced, y_pc)

r2_pca = reg_pc.score(X_reduced, y_pc)
print("R² using first", N, "PCs:", r2_pca)
from sklearn.linear_model import LinearRegression

y_pc = df_filt.loc[X.index, key_column]

N = n_95
X_reduced = X_pca[:, :N]

reg_pc = LinearRegression()
reg_pc.fit(X_reduced, y_pc)

r2_pca = reg_pc.score(X_reduced, y_pc)
print("R² using first", N, "PCs:", r2_pca)

PLS regression¶

PLSRegression on X_scaled_pca with y_pls aligned to X.index (same rows as the PCA cell).

In [ ]:

Copied!

from sklearn.cross_decomposition import PLSRegression

y_pls = df_filt.loc[X.index, key_column].values

n_components = 10
pls = PLSRegression(n_components=n_components)
pls.fit(X_scaled_pca, y_pls)

r2_train = pls.score(X_scaled_pca, y_pls)
print(f"R² using {n_components} PLS components: {r2_train:.3f}")
from sklearn.cross_decomposition import PLSRegression

y_pls = df_filt.loc[X.index, key_column].values

n_components = 10
pls = PLSRegression(n_components=n_components)
pls.fit(X_scaled_pca, y_pls)

r2_train = pls.score(X_scaled_pca, y_pls)
print(f"R² using {n_components} PLS components: {r2_train:.3f}")

Ranked correlations and per-column NaN footprint¶

Inspect corrs and quantify rows lost if each column were required individually.

In [ ]:

Copied!





import pandas as pd

corrs.head(40)

cols_ms = [key_column] + check_columns
rows_before = len(df_filt)

rows_lost = {}
for col in cols_ms:
    rows_after = df_filt.dropna(subset=[col]).shape[0]
    rows_lost[col] = rows_before - rows_after

rows_lost_series = pd.Series(rows_lost).sort_values(ascending=False)
rows_lost_series.head(20)
import pandas as pd

corrs.head(40)

cols_ms = [key_column] + check_columns
rows_before = len(df_filt)

rows_lost = {}
for col in cols_ms:
    rows_after = df_filt.dropna(subset=[col]).shape[0]
    rows_lost[col] = rows_before - rows_after

rows_lost_series = pd.Series(rows_lost).sort_values(ascending=False)
rows_lost_series.head(20)

Hyperparameter search with Optuna¶

Objective matches basic_usage.ipynb (linear / ridge / lasso + top_n_features). The gbr branch stays present but unreachable unless you widen model_name suggestions.

In [ ]:

Copied!





import optuna
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

optuna.logging.set_verbosity(optuna.logging.WARNING)

df_clean = df_filt.dropna(subset=[key_column] + check_columns)


def objective(trial):
    """Return mean ShuffleSplit CV R² for the suggested model and feature count."""
    model_name = trial.suggest_categorical("model", ["linear", "ridge", "lasso"])
    top_n = trial.suggest_int("top_n_features", 5, len(check_columns))

    features = corrs.index[:top_n].tolist()

    X = df_clean[features].values
    y = df_clean[key_column].values

    if model_name == "linear":
        model = LinearRegression()
    elif model_name == "ridge":
        alpha = trial.suggest_float("alpha", 1e-3, 100, log=True)
        model = Ridge(alpha=alpha)
    elif model_name == "lasso":
        alpha = trial.suggest_float("alpha", 1e-4, 10, log=True)
        model = Lasso(alpha=alpha, max_iter=5000)
    elif model_name == "gbr":
        model = GradientBoostingRegressor(
            n_estimators=trial.suggest_int("n_estimators", 50, 200),
            max_depth=trial.suggest_int("max_depth", 2, 6),
            learning_rate=trial.suggest_float("lr", 0.01, 0.3, log=True),
            subsample=trial.suggest_float("subsample", 0.6, 1.0),
            random_state=42,
        )

    pipe = Pipeline([("scaler", StandardScaler()), ("model", model)])
    cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=42)
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="r2")
    return scores.mean()


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=60, show_progress_bar=True)

print(f"\nBest CV R²: {study.best_value:.3f}")
print(f"Best params: {study.best_params}")
import optuna
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

optuna.logging.set_verbosity(optuna.logging.WARNING)

df_clean = df_filt.dropna(subset=[key_column] + check_columns)


def objective(trial):
    """Return mean ShuffleSplit CV R² for the suggested model and feature count."""
    model_name = trial.suggest_categorical("model", ["linear", "ridge", "lasso"])
    top_n = trial.suggest_int("top_n_features", 5, len(check_columns))

    features = corrs.index[:top_n].tolist()

    X = df_clean[features].values
    y = df_clean[key_column].values

    if model_name == "linear":
        model = LinearRegression()
    elif model_name == "ridge":
        alpha = trial.suggest_float("alpha", 1e-3, 100, log=True)
        model = Ridge(alpha=alpha)
    elif model_name == "lasso":
        alpha = trial.suggest_float("alpha", 1e-4, 10, log=True)
        model = Lasso(alpha=alpha, max_iter=5000)
    elif model_name == "gbr":
        model = GradientBoostingRegressor(
            n_estimators=trial.suggest_int("n_estimators", 50, 200),
            max_depth=trial.suggest_int("max_depth", 2, 6),
            learning_rate=trial.suggest_float("lr", 0.01, 0.3, log=True),
            subsample=trial.suggest_float("subsample", 0.6, 1.0),
            random_state=42,
        )

    pipe = Pipeline([("scaler", StandardScaler()), ("model", model)])
    cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=42)
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="r2")
    return scores.mean()


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=60, show_progress_bar=True)

print(f"\nBest CV R²: {study.best_value:.3f}")
print(f"Best params: {study.best_params}")

In [ ]:

Copied!





import matplotlib.pyplot as plt
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

bp = study.best_params
top_n = bp["top_n_features"]
features = corrs.index[:top_n].tolist()

sub = df_filt[[key_column] + features].dropna()
X_ot = sub[features].values
y_ot = sub[key_column].values

model_name = bp["model"]
if model_name == "ridge":
    best_model = Ridge(alpha=bp["alpha"])
elif model_name == "lasso":
    best_model = Lasso(alpha=bp["alpha"], max_iter=5000)
elif model_name == "elasticnet":
    best_model = ElasticNet(alpha=bp["alpha"], l1_ratio=bp["l1_ratio"], max_iter=5000)
elif model_name == "rf":
    best_model = RandomForestRegressor(
        n_estimators=bp["n_estimators"],
        max_depth=bp["max_depth"],
        random_state=42,
    )
elif model_name == "pls":
    best_model = PLSRegression(n_components=bp["n_components"])
else:
    best_model = LinearRegression()

pipe = Pipeline([("scaler", StandardScaler()), ("model", best_model)])

X_train, X_test, y_train, y_test = train_test_split(
    X_ot, y_ot, test_size=0.3, random_state=42
)
pipe.fit(X_train, y_train)
y_pred_ot = pipe.predict(X_test).ravel()

r2_test = pipe.score(X_test, y_test)
residuals = y_test - y_pred_ot

fig, axes = plt.subplots(1, 3, figsize=(16, 4.5))

axes[0].scatter(y_test, y_pred_ot, alpha=0.4, s=10)
lims = [min(y_test.min(), y_pred_ot.min()), max(y_test.max(), y_pred_ot.max())]
axes[0].plot(lims, lims, "r--", lw=1)
axes[0].set_xlabel("Actual")
axes[0].set_ylabel("Predicted")
axes[0].set_title(f"Predicted vs Actual (R²={r2_test:.3f})")

axes[1].hist(y_test, bins=40, alpha=0.5, label="Actual", density=True)
axes[1].hist(y_pred_ot, bins=40, alpha=0.5, label="Predicted", density=True)
axes[1].set_xlabel(key_column.split(".")[-1])
axes[1].set_title("Distribution comparison")
axes[1].legend()

axes[2].scatter(y_pred_ot, residuals, alpha=0.4, s=10)
axes[2].axhline(0, color="r", linestyle="--", lw=1)
axes[2].set_xlabel("Predicted")
axes[2].set_ylabel("Residual")
axes[2].set_title("Residuals")

plt.tight_layout()
plt.show()

print(f"Test R²: {r2_test:.3f}")
print(f"Residual RMS: {np.std(residuals):.3f}")
print(f"Signal RMS: {np.std(y_test):.3f}")
print(f"Residual/Mean: {np.std(residuals) / np.mean(y_test):.1%}")

fitted_model = pipe.named_steps["model"]
if hasattr(fitted_model, "feature_importances_"):
    imp = fitted_model.feature_importances_
    idx = np.argsort(imp)[-15:]
    plt.figure(figsize=(8, 5))
    plt.barh([features[i] for i in idx], imp[idx])
    plt.xlabel("Feature Importance")
    plt.title(f"Top features ({model_name})")
    plt.tight_layout()
    plt.show()
elif hasattr(fitted_model, "coef_"):
    coefs = np.abs(np.array(fitted_model.coef_).ravel())
    idx = np.argsort(coefs)[-15:]
    plt.figure(figsize=(8, 5))
    plt.barh([features[i] for i in idx], coefs[idx])
    plt.xlabel("|Coefficient| (on scaled features)")
    plt.title(f"Top features ({model_name})")
    plt.tight_layout()
    plt.show()
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

bp = study.best_params
top_n = bp["top_n_features"]
features = corrs.index[:top_n].tolist()

sub = df_filt[[key_column] + features].dropna()
X_ot = sub[features].values
y_ot = sub[key_column].values

model_name = bp["model"]
if model_name == "ridge":
    best_model = Ridge(alpha=bp["alpha"])
elif model_name == "lasso":
    best_model = Lasso(alpha=bp["alpha"], max_iter=5000)
elif model_name == "elasticnet":
    best_model = ElasticNet(alpha=bp["alpha"], l1_ratio=bp["l1_ratio"], max_iter=5000)
elif model_name == "rf":
    best_model = RandomForestRegressor(
        n_estimators=bp["n_estimators"],
        max_depth=bp["max_depth"],
        random_state=42,
    )
elif model_name == "pls":
    best_model = PLSRegression(n_components=bp["n_components"])
else:
    best_model = LinearRegression()

pipe = Pipeline([("scaler", StandardScaler()), ("model", best_model)])

X_train, X_test, y_train, y_test = train_test_split(
    X_ot, y_ot, test_size=0.3, random_state=42
)
pipe.fit(X_train, y_train)
y_pred_ot = pipe.predict(X_test).ravel()

r2_test = pipe.score(X_test, y_test)
residuals = y_test - y_pred_ot

fig, axes = plt.subplots(1, 3, figsize=(16, 4.5))

axes[0].scatter(y_test, y_pred_ot, alpha=0.4, s=10)
lims = [min(y_test.min(), y_pred_ot.min()), max(y_test.max(), y_pred_ot.max())]
axes[0].plot(lims, lims, "r--", lw=1)
axes[0].set_xlabel("Actual")
axes[0].set_ylabel("Predicted")
axes[0].set_title(f"Predicted vs Actual (R²={r2_test:.3f})")

axes[1].hist(y_test, bins=40, alpha=0.5, label="Actual", density=True)
axes[1].hist(y_pred_ot, bins=40, alpha=0.5, label="Predicted", density=True)
axes[1].set_xlabel(key_column.split(".")[-1])
axes[1].set_title("Distribution comparison")
axes[1].legend()

axes[2].scatter(y_pred_ot, residuals, alpha=0.4, s=10)
axes[2].axhline(0, color="r", linestyle="--", lw=1)
axes[2].set_xlabel("Predicted")
axes[2].set_ylabel("Residual")
axes[2].set_title("Residuals")

plt.tight_layout()
plt.show()

print(f"Test R²: {r2_test:.3f}")
print(f"Residual RMS: {np.std(residuals):.3f}")
print(f"Signal RMS: {np.std(y_test):.3f}")
print(f"Residual/Mean: {np.std(residuals) / np.mean(y_test):.1%}")

fitted_model = pipe.named_steps["model"]
if hasattr(fitted_model, "feature_importances_"):
    imp = fitted_model.feature_importances_
    idx = np.argsort(imp)[-15:]
    plt.figure(figsize=(8, 5))
    plt.barh([features[i] for i in idx], imp[idx])
    plt.xlabel("Feature Importance")
    plt.title(f"Top features ({model_name})")
    plt.tight_layout()
    plt.show()
elif hasattr(fitted_model, "coef_"):
    coefs = np.abs(np.array(fitted_model.coef_).ravel())
    idx = np.argsort(coefs)[-15:]
    plt.figure(figsize=(8, 5))
    plt.barh([features[i] for i in idx], coefs[idx])
    plt.xlabel("|Coefficient| (on scaled features)")
    plt.title(f"Top features ({model_name})")
    plt.tight_layout()
    plt.show()

In [ ]:

Copied!





top_keep = study.best_params["top_n_features"]
features_best = corrs.index[:top_keep].tolist()

df_clean = df_filt.dropna(subset=[key_column] + check_columns)
sub_best = df_filt[[key_column] + features_best].dropna()

print(f"All ranked features: {len(df_clean)} rows, {len(check_columns)} features")
print(f"Optuna best subset: {len(sub_best)} rows, {top_keep} features")
top_keep = study.best_params["top_n_features"]
features_best = corrs.index[:top_keep].tolist()

df_clean = df_filt.dropna(subset=[key_column] + check_columns)
sub_best = df_filt[[key_column] + features_best].dropna()

print(f"All ranked features: {len(df_clean)} rows, {len(check_columns)} features")
print(f"Optuna best subset: {len(sub_best)} rows, {top_keep} features")

Plain pipeline CV (linear regression)¶

Pipeline + cross_val_score with 3-fold, 5-fold, and ShuffleSplit — same checks as basic_usage.ipynb.

In [ ]:

Copied!





from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df_clean = df_filt.dropna(subset=[key_column] + check_columns)
X_cv = df_clean[check_columns].values
y_cv = df_clean[key_column].values

pipe = Pipeline([("scaler", StandardScaler()), ("model", LinearRegression())])

cv3 = cross_val_score(pipe, X_cv, y_cv, cv=3, scoring="r2")
cv5 = cross_val_score(pipe, X_cv, y_cv, cv=5, scoring="r2")

print(f"3-fold CV R²: {cv3.mean():.3f} ± {cv3.std():.3f}  (per fold: {cv3})")
print(f"5-fold CV R²: {cv5.mean():.3f} ± {cv5.std():.3f}  (per fold: {cv5})")

cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=42)
scores = cross_val_score(pipe, X_cv, y_cv, cv=cv, scoring="r2")
print(f"ShuffleSplit CV R²: {scores.mean():.3f} ± {scores.std():.3f}")
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df_clean = df_filt.dropna(subset=[key_column] + check_columns)
X_cv = df_clean[check_columns].values
y_cv = df_clean[key_column].values

pipe = Pipeline([("scaler", StandardScaler()), ("model", LinearRegression())])

cv3 = cross_val_score(pipe, X_cv, y_cv, cv=3, scoring="r2")
cv5 = cross_val_score(pipe, X_cv, y_cv, cv=5, scoring="r2")

print(f"3-fold CV R²: {cv3.mean():.3f} ± {cv3.std():.3f}  (per fold: {cv3})")
print(f"5-fold CV R²: {cv5.mean():.3f} ± {cv5.std():.3f}  (per fold: {cv5})")

cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=42)
scores = cross_val_score(pipe, X_cv, y_cv, cv=cv, scoring="r2")
print(f"ShuffleSplit CV R²: {scores.mean():.3f} ± {scores.std():.3f}")

Packaged fit / persist / reload (`RegressionTrainer`)¶

Earlier cells used sklearn directly for teaching; MLDatasetBuilder only standardized the tabular X / y matrix.

The modeling.ml layer adds RegressionTrainer (default preprocessing pipeline + linear / ridge / elastic-net), TrainingMetrics (optional cv summary), and save_model_artifact / load_model_artifact for a directory of joblib + JSON sidecars. Run after the modeling setup cell that defines ml_ds.

In [ ]:

Copied!





import tempfile
from pathlib import Path

import pandas as pd

from geecs_data_utils.modeling.ml import (
    RegressionTrainer,
    load_model_artifact,
    save_model_artifact,
)

# Uses ml_ds from the modeling setup cell (same merged-scan table).
top_k = min(25, len(ml_ds.feature_columns))
feat_cols = ml_ds.feature_columns[:top_k]
X_demo = ml_ds.frame[feat_cols]
y_demo = ml_ds.frame[ml_ds.target_column]

trainer = RegressionTrainer(model="ridge", model_params={"alpha": 1.0})
artifact = trainer.fit(
    X_demo,
    y_demo,
    cv=5,
    target_name="ict_charge_pc",
    scan_info={"notebook": "basic_usage_refactor"},
)

print(f"In-sample R²: {artifact.metrics.r2:.4f}")
if artifact.metrics.cv_r2_mean is not None:
    print(
        "5-fold CV R²:",
        f"{artifact.metrics.cv_r2_mean:.4f} ± {artifact.metrics.cv_r2_std:.4f}",
    )

tmpdir = Path(tempfile.mkdtemp(prefix="geecs_ml_demo_"))
artifact_dir = tmpdir / "ridge_charge_demo"
save_model_artifact(artifact, artifact_dir)
print("Saved artifact dir:", artifact_dir)

loaded = load_model_artifact(artifact_dir)
preview = X_demo.iloc[:3]
pred = loaded.predict(preview)
actual = y_demo.iloc[:3].to_numpy()
compare = pd.DataFrame({"actual": actual, "predicted": pred, "residual": pred - actual})
print(compare.round(4).to_string(index=False))
import tempfile
from pathlib import Path

import pandas as pd

from geecs_data_utils.modeling.ml import (
    RegressionTrainer,
    load_model_artifact,
    save_model_artifact,
)

# Uses ml_ds from the modeling setup cell (same merged-scan table).
top_k = min(25, len(ml_ds.feature_columns))
feat_cols = ml_ds.feature_columns[:top_k]
X_demo = ml_ds.frame[feat_cols]
y_demo = ml_ds.frame[ml_ds.target_column]

trainer = RegressionTrainer(model="ridge", model_params={"alpha": 1.0})
artifact = trainer.fit(
    X_demo,
    y_demo,
    cv=5,
    target_name="ict_charge_pc",
    scan_info={"notebook": "basic_usage_refactor"},
)

print(f"In-sample R²: {artifact.metrics.r2:.4f}")
if artifact.metrics.cv_r2_mean is not None:
    print(
        "5-fold CV R²:",
        f"{artifact.metrics.cv_r2_mean:.4f} ± {artifact.metrics.cv_r2_std:.4f}",
    )

tmpdir = Path(tempfile.mkdtemp(prefix="geecs_ml_demo_"))
artifact_dir = tmpdir / "ridge_charge_demo"
save_model_artifact(artifact, artifact_dir)
print("Saved artifact dir:", artifact_dir)

loaded = load_model_artifact(artifact_dir)
preview = X_demo.iloc[:3]
pred = loaded.predict(preview)
actual = y_demo.iloc[:3].to_numpy()
compare = pd.DataFrame({"actual": actual, "predicted": pred, "residual": pred - actual})
print(compare.round(4).to_string(index=False))

Correlation analysis and predictive modeling¶

Correlation ranking (geecs_data_utils.analysis)¶

Modeling with sklearn (notebook-first)¶

Shared modeling table (MLDatasetBuilder)¶

In-sample multivariate linear fit¶

Train / test splits (Monte Carlo loop)¶

PCA on ranked predictors¶

PLS regression¶

Ranked correlations and per-column NaN footprint¶

Hyperparameter search with Optuna¶

Plain pipeline CV (linear regression)¶

Packaged fit / persist / reload (RegressionTrainer)¶

Correlation ranking (`geecs_data_utils.analysis`)¶

Shared modeling table (`MLDatasetBuilder`)¶

Packaged fit / persist / reload (`RegressionTrainer`)¶