pyphi.calc module

Phi for Python (pyPhi) — Version 2.0

By Sal Garcia (sgarciam@ic.ac.uk salvadorgarciamunoz@gmail.com)

Added Feb 23 2026
  • Added _validate_inputs function for input validation and observation reconciliation

  • Integrated validation into pca, pls, lpls entry points

  • Replaced np.tile with numpy broadcasting throughout

  • Optimized _Ab_btbinv with fast path for complete data

  • var_t (score covariance matrix) stored in model objects to avoid recalculation

  • Added _extract_array and _calc_r2 helper functions to reduce duplication

  • Replaced hardcoded F-distribution and chi2 lookup tables with scipy.stats

  • Replaced hardcoded t-distribution with scipy.stats

Added Feb 07 2026
  • Fixed cat_2_matrix so that its output is consistent with MBPLS

Added Jan 30 2025
  • Added a pinv fallback in spectra_savgol for the case where inv fails

Added Jan 20 2025
  • Added the 'cca' flag to the pls routine to calculate CCA between the Ts and each of the Ys (one by one), calculating loadings and scores equivalent to a perfectly orthogonalized OPLS model. The covariant scores (Tcv), the covariant loadings (Pcv), and the predictive weights (Wcv) are added as keys to the model object. [The covariant loadings (Pcv) are equivalent to the predictive loadings in OPLS.]

  • Added cca and cca-multi routines to perform PLS-CCA (an alternative to OPLS). As of now, cca-multi remains unused.

Added Nov 18th, 2024
  • replaced interp2d with RectBivariateSpline

  • Protected SPE lim calculations for near zero residuals

  • Added build_polynomial function to create linear regression models with variable selection assisted by PLS

by merge from James
  • Added spectra preprocessing methods

  • bootstrap PLS

by Salvador Garcia (sgarciam@ic.ac.uk salvadorgarciamunoz@gmail.com) Added Dec 19th 2023

  • phi.clean_htmls removes all html files in the working directory

  • clean_empty_rows also returns the names of the rows removed

Added May 1st
  • YMB is now added in the same structure as the XMB

  • Corrected the dimensionality of the lwpls prediction, which was a double-nested array

Added Apr 30
  • Modified Multi-block PLS to include the block name in the variable name

Added Apr 29
  • Included the unique routine and adjusted the parse_materials routine so materials and lots are in the same order as in the raw data

Added Apr 27
  • Enhanced adapt_pls_4_pyomo to use variable names as indices if flag is sent

Added Apr 25
  • Enhanced the varimax_rotation to adjust the r2 and r2pv to the rotated loadings

Added Apr 21
  • Re-added varimax_rotation with complete model rotation for PCA and PLS

Added Apr 17
  • Added tpls and tpls_pred

Added Apr 15
  • Added jrpls model and jrpls_pred

  • Added routines to reconcile column identifiers to row identifiers so that the X and R matrices correspond correctly

  • Added routines to reconcile rows across a list of dataframes and produce a list of dataframes containing only those observations present in all dataframes

Added on Apr 9 2023
  • Added lpls and lpls_pred routines

  • Added parse_materials to read linear table and produce R or Ri

Release as of Nov 23 2022
  • Added a function to export PLS model to gPROMS code

Release as of Aug 22 2022
  • Fixed access to NEOS server and use of GAMS instead of IPOPT

Release as of Aug 12 2022
  • Fixed the SPE calculations in pls_pred and pca_pred

  • Changed to a more efficient inversion in pca_pred (=pls_pred)

  • Added a pseudo-inverse option in pmp for pca_pred

Release as of Aug 2 2022
  • Added replicate_data

pyphi.calc.ma57_dummy_check()[source]
pyphi.calc.f99(i, j)[source]

Return the F-distribution critical value at 99% confidence.

Parameters:
  • df1 (int) – Numerator degrees of freedom.

  • df2 (int) – Denominator degrees of freedom.

Returns:

F critical value at alpha = 0.01.

Return type:

float

pyphi.calc.f95(i, j)[source]

Return the F-distribution critical value at 95% confidence.

Parameters:
  • df1 (int) – Numerator degrees of freedom.

  • df2 (int) – Denominator degrees of freedom.

Returns:

F critical value at alpha = 0.05.

Return type:

float
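
The changelog above notes that the hardcoded F-distribution tables were replaced with scipy.stats. A minimal sketch of how such critical values can be obtained, assuming the usual upper-tail convention; the helper name f_crit_sketch is hypothetical and is not part of pyphi:

```python
from scipy.stats import f


def f_crit_sketch(df1, df2, confidence):
    """Upper critical value of the F distribution (hypothetical helper)."""
    # ppf is the inverse CDF, so ppf(0.95, ...) leaves 5% in the upper tail.
    return float(f.ppf(confidence, df1, df2))


f95_value = f_crit_sketch(1, 10, 0.95)  # alpha = 0.05
f99_value = f_crit_sketch(1, 10, 0.99)  # alpha = 0.01
```

The 99% value is always larger than the 95% value for the same degrees of freedom.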

pyphi.calc.spe_ci(spe)[source]

Estimate SPE control limits from training data using a chi-squared approximation.

Parameters:
  • spe_values (ndarray) – SPE values from the training set (n_obs × 1).

  • alpha (float) – Confidence level. Default 0.95 (also returns 99%).

Returns:

(lim95, lim99) — SPE control limits at 95% and 99%.

Return type:

tuple
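
One common form of the chi-squared approximation matches the mean and variance of the training SPE to a scaled chi-squared distribution. The sketch below illustrates that idea only; whether pyphi uses exactly this moment matching is an assumption, and spe_limits_sketch is a hypothetical name:

```python
import numpy as np
from scipy.stats import chi2


def spe_limits_sketch(spe_values):
    """Moment-matched chi-squared limits for SPE (illustrative only)."""
    spe_values = np.asarray(spe_values, dtype=float).ravel()
    m = spe_values.mean()
    v = spe_values.var(ddof=1)
    # Match g * chi2(h): mean = g * h and variance = 2 * g**2 * h.
    g = v / (2.0 * m)
    h = 2.0 * m**2 / v
    return g * chi2.ppf(0.95, h), g * chi2.ppf(0.99, h)


rng = np.random.default_rng(0)
lim95, lim99 = spe_limits_sketch(rng.chisquare(df=4, size=500))
```

This sketch does not guard against near-zero residual variance, which the changelog mentions as a protected case in the library.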

pyphi.calc.single_score_conf_int(t)[source]

Calculate confidence ellipse radii for score scatter plots.

Parameters:
  • mvmobj (dict) – Fitted PCA or PLS model.

  • alpha (float) – Confidence level. Default 0.95.

Returns:

Ellipse radii for each pair of scores.

Return type:

ndarray

pyphi.calc.scores_conf_int_calc(st, N)[source]

Calculate per-score univariate confidence intervals.

Parameters:
  • mvmobj (dict) – Fitted PCA or PLS model.

  • alpha (float) – Confidence level. Default 0.95.

Returns:

Confidence interval half-widths for each latent variable (A,).

Return type:

ndarray

pyphi.calc.clean_htmls()[source]

Deletes all .html files in the current directory.

pyphi.calc.z2n(X, X_nan_map)[source]

Replace zeros with NaN (zero to NaN).

Parameters:

X (np.ndarray) – Input array.

Returns:

Array with zeros replaced by np.nan.

Return type:

np.ndarray

pyphi.calc.n2z(X)[source]

Replace NaN with zero (NaN to zero).

Parameters:

X (np.ndarray) – Input array.

Returns:

(X_filled, nan_map) — array with NaNs replaced by 0, and a boolean mask where True indicates original non-NaN positions.

Return type:

tuple
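
The pair n2z / z2n can be pictured as a round trip through a boolean mask. A minimal sketch, assuming the mask convention described above (True marks originally non-NaN cells); the function names are hypothetical:

```python
import numpy as np


def nan_to_zero_sketch(X):
    """Return (X_filled, observed); True marks originally non-NaN cells."""
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    return np.where(observed, X, 0.0), observed


def restore_nan_sketch(X_filled, observed):
    """Put NaN back wherever the mask says the value was missing."""
    return np.where(observed, X_filled, np.nan)


X = np.array([[1.0, np.nan], [0.0, 4.0]])
X_filled, observed = nan_to_zero_sketch(X)
X_back = restore_nan_sketch(X_filled, observed)
```

Carrying the mask keeps genuine zeros (such as X[1, 0] here) distinct from zeros that stand in for missing values.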

pyphi.calc.mean(X)[source]
pyphi.calc.std(X)[source]
pyphi.calc.meancenterscale(X, *, mcs=True)[source]

Mean-center and/or scale a data matrix.

Parameters:
  • X (np.ndarray) – Data matrix to preprocess (n_obs × n_vars).

  • mcs (str or bool) – Preprocessing mode. 'autoscale': mean-center and scale to unit variance. 'center': mean-center only. False: return unchanged.

Returns:

(X_processed, x_mean, x_std) — preprocessed matrix, column means, and column standard deviations.

Return type:

tuple
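
Autoscaling with missing values can be sketched with NaN-aware column statistics and broadcasting. This is an illustration, not the library code; the use of ddof=1 and the name autoscale_sketch are assumptions:

```python
import numpy as np


def autoscale_sketch(X):
    """Column-wise mean-centering and unit-variance scaling, NaN-aware."""
    X = np.asarray(X, dtype=float)
    x_mean = np.nanmean(X, axis=0)
    x_std = np.nanstd(X, axis=0, ddof=1)  # ddof=1 is an assumption
    # Broadcasting applies the column statistics to every row without np.tile.
    return (X - x_mean) / x_std, x_mean, x_std


X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])
X_scaled, x_mean, x_std = autoscale_sketch(X)
```

Missing entries stay NaN after scaling, so they can still be handled by a missing-data algorithm downstream.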

pyphi.calc.find(a, func)[source]

Find row indices where the first column equals a given value.

Parameters:
  • X (pd.DataFrame or np.ndarray) – Data matrix to search.

  • value – Value to search for in the first column.

Returns:

Row indices where the match was found.

Return type:

list

pyphi.calc.pca(X, A, *, mcs=True, md_algorithm='nipals', force_nipals=True, shush=False, cross_val=0)[source]

Fit a Principal Component Analysis (PCA) model.

Supports missing data via NIPALS. Can use SVD for complete data as well.

Parameters:
  • X (pd.DataFrame or np.ndarray) – Observations × variables matrix. If a DataFrame, the first column must contain observation IDs.

  • A (int) – Number of principal components to extract.

  • mcs (str or bool) – Mean-centering/scaling flag. 'autoscale' (default): mean-center and scale to unit variance. 'center': mean-center only. False: no preprocessing.

  • md_algorithm (str) – Missing-data algorithm. 'nipals' (default) or 'nlp'.

  • force_nipals (bool) – If True, forces NIPALS even when data is complete. Default True.

  • cross_val (int) – Cross-validation percentage of elements to remove per round. 0 = no CV, 100 = leave-one-out, 1–99 = element-wise removal. Default 0.

  • shush (bool) – If True, suppresses printed output. Default False.

  • tolerance (float) – NIPALS convergence tolerance. Default 1e-10.

  • maxit (int) – Maximum NIPALS iterations per component. Default 5000.

Returns:

Fitted PCA model with keys:

  • T (ndarray): Scores matrix (n_obs × A).

  • P (ndarray): Loadings matrix (n_vars × A).

  • r2x (float): Cumulative R² for X.

  • r2xpv (ndarray): Per-variable R² (n_vars × A).

  • mx (ndarray): Variable means used for preprocessing.

  • sx (ndarray): Variable std devs used for preprocessing.

  • var_t (ndarray): Score covariance matrix (A × A).

  • T2 (ndarray): Hotelling’s T² for training observations.

  • T2_lim95 (float): 95% T² control limit.

  • T2_lim99 (float): 99% T² control limit.

  • speX (ndarray): X-space SPE for training observations.

  • speX_lim95 (float): 95% SPE control limit.

  • speX_lim99 (float): 99% SPE control limit.

  • obsidX (list): Observation IDs (only if X was a DataFrame).

  • varidX (list): Variable IDs (only if X was a DataFrame).

  • q2x (float): Cross-validated Q² (only if cross_val > 0).

Return type:

dict

NLP algorithm for missing data as in:

de la Fuente, R.L.N., García‐Muñoz, S. and Biegler, L.T., 2010. An efficient nonlinear programming strategy for PCA models with incomplete data sets. Journal of Chemometrics, 24(6), pp.301-311.
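
To illustrate how NIPALS accommodates missing data, the sketch below skips NaN cells in each regression step. It assumes X is already mean-centered, and it is a simplified illustration rather than the pyphi implementation; nipals_pca_sketch and its defaults are hypothetical:

```python
import numpy as np


def nipals_pca_sketch(X, A, tol=1e-10, maxit=5000):
    """Illustrative NIPALS PCA on centered data; NaN cells are skipped."""
    X = np.array(X, dtype=float)
    observed = ~np.isnan(X)
    Xd = np.where(observed, X, 0.0)  # missing cells contribute zero
    n, k = Xd.shape
    T, P = np.zeros((n, A)), np.zeros((k, A))
    for a in range(A):
        # Start from the column with the largest sum of squares.
        t = Xd[:, [np.argmax((Xd**2).sum(axis=0))]]
        for _ in range(maxit):
            # Regress each column on t using only its observed cells.
            p = (Xd.T @ t) / (observed.T @ t**2)
            p /= np.linalg.norm(p)
            # Regress each row on p using only its observed cells.
            t_new = (Xd @ p) / (observed @ p**2)
            converged = np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        T[:, [a]], P[:, [a]] = t, p
        Xd = np.where(observed, Xd - t @ p.T, 0.0)  # deflate observed cells
    return T, P


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 6)) * np.array([5.0, 3.0, 1.0, 1.0, 1.0, 1.0])
X_demo -= X_demo.mean(axis=0)
T_demo, P_demo = nipals_pca_sketch(X_demo, 2)

X_miss = X_demo.copy()
X_miss[0, 0] = np.nan  # one missing cell
T_miss, P_miss = nipals_pca_sketch(X_miss, 2)
```

For complete data the loadings agree with the leading right singular vectors of X up to sign, which is why SVD can be used instead when nothing is missing.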

pyphi.calc.pca_(X, A, *, mcs=True, md_algorithm='nipals', force_nipals=True, shush=False)[source]
pyphi.calc.pls_cca(pls_obj, Xmcs, Ymcs, not_Xmiss)[source]
pyphi.calc.pls(X, Y, A, *, mcsX=True, mcsY=True, md_algorithm='nipals', force_nipals=True, shush=False, cross_val=0, cross_val_X=False, cca=False)[source]

Fit a Partial Least Squares (PLS) regression model.

Supports missing data in both X and Y via NIPALS. Optionally computes CCA-based covariant components (equivalent to OPLS predictive space).

Parameters:
  • X (pd.DataFrame or np.ndarray) – Predictor matrix (n_obs × n_x). If a DataFrame, the first column must contain observation IDs.

  • Y (pd.DataFrame or np.ndarray) – Response matrix (n_obs × n_y). If a DataFrame, the first column must contain observation IDs. Observation IDs are reconciled with X automatically.

  • A (int) – Number of latent variables.

  • mcsX – Preprocessing flag for X. Can be 'autoscale', 'center', or False. Default 'autoscale'.

  • mcsY – Preprocessing flag for Y. Can be 'autoscale', 'center', or False. Default 'autoscale'.

  • md_algorithm (str) – Missing-data algorithm. 'nipals' (default) or 'nlp'.

  • force_nipals (bool) – Force NIPALS even for complete data. Default True.

  • cross_val (int) – Cross-validation level. 0 = none, 100 = LOO, 1–99 = element-wise. Default 0.

  • cross_val_X (bool) – Also cross-validate X-space. Default False.

  • shush (bool) – Suppress printed output. Default False.

  • tolerance (float) – NIPALS convergence tolerance. Default 1e-10.

  • maxit (int) – Max NIPALS iterations per component. Default 5000.

  • cca (bool) – If True, compute CCA-based covariant components and add Tcv, Pcv, Wcv to the model. Default False.

Returns:

Fitted PLS model with keys:

  • T (ndarray): X-scores (n_obs × A).

  • P (ndarray): X-loadings (n_x × A).

  • Q (ndarray): Y-loadings (n_y × A).

  • W (ndarray): X-weights (n_x × A).

  • Ws (ndarray): Rotated weights W*(P’W)⁻¹ (n_x × A).

  • r2x (float): Cumulative R² for X.

  • r2xpv (ndarray): Per-variable R² for X (n_x × A).

  • r2y (float): Cumulative R² for Y.

  • r2ypv (ndarray): Per-variable R² for Y (n_y × A).

  • mx, sx (ndarray): X preprocessing parameters.

  • my, sy (ndarray): Y preprocessing parameters.

  • var_t (ndarray): Score covariance matrix (A × A).

  • T2, T2_lim95, T2_lim99: Hotelling’s T² and limits.

  • speX, speX_lim95, speX_lim99: X-space SPE and limits.

  • speY, speY_lim95, speY_lim99: Y-space SPE and limits.

  • obsidX, varidX: IDs (only if X was a DataFrame).

  • obsidY, varidY: IDs (only if Y was a DataFrame).

  • q2x, q2y (float): Cross-validated Q² (if cross_val > 0).

  • Tcv, Pcv, Wcv: CCA covariant components (if cca=True).

Return type:

dict

NLP approach to missing data as in:

Puwakkatiya‐Kankanamage, E.H., García‐Muñoz, S. and Biegler, L.T., 2014. An optimization‐based undeflated PLS (OUPLS) method to handle missing data in the training set. Journal of Chemometrics, 28(7), pp.575-584.
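
The relationship between the weights W, the loadings P, and the rotated weights Ws listed above can be checked numerically. The following is a minimal NIPALS PLS sketch for complete, already preprocessed data; it is not the pyphi code, and nipals_pls_sketch is a hypothetical name:

```python
import numpy as np


def nipals_pls_sketch(X, Y, A, tol=1e-10, maxit=5000):
    """Illustrative NIPALS PLS for complete, preprocessed X and Y."""
    X, Y = np.array(X, dtype=float), np.array(Y, dtype=float)
    W, P, Q, T = [], [], [], []
    for _lv in range(A):
        u = Y[:, [0]]
        for _ in range(maxit):
            w = X.T @ u
            w /= np.linalg.norm(w)
            t = X @ w
            q = Y.T @ t / (t.T @ t)
            u_new = Y @ q / (q.T @ q)
            done = np.linalg.norm(u_new - u) <= tol * np.linalg.norm(u_new)
            u = u_new
            if done:
                break
        p = X.T @ t / (t.T @ t)
        X, Y = X - t @ p.T, Y - t @ q.T  # deflate both blocks
        for store, vec in ((W, w), (P, p), (Q, q), (T, t)):
            store.append(vec)
    W, P, Q, T = (np.hstack(m) for m in (W, P, Q, T))
    Ws = W @ np.linalg.inv(P.T @ W)  # rotated weights W (P'W)^-1
    return T, P, Q, W, Ws


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(25, 4))
X_demo -= X_demo.mean(axis=0)
Y_demo = X_demo @ rng.normal(size=(4, 2)) + 0.1 * rng.normal(size=(25, 2))
Y_demo -= Y_demo.mean(axis=0)
T_demo, P_demo, Q_demo, W_demo, Ws_demo = nipals_pls_sketch(X_demo, Y_demo, 2)
```

The rotated weights map the undeflated X directly to the scores (T = X Ws), which is what makes them convenient for projecting new observations.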

pyphi.calc.pls_(X, Y, A, *, mcsX=True, mcsY=True, md_algorithm='nipals', force_nipals=True, shush=False, cca=False)[source]
pyphi.calc.hott2(mvmobj, *, Xnew=False, Tnew=False)[source]

Compute Hotelling’s T² statistic.

Parameters:
  • mvmobj (dict) – Fitted PCA or PLS model.

  • Xnew (pd.DataFrame or np.ndarray) – New X observations (optional). If provided, scores are computed internally before T² calculation.

  • Tnew (np.ndarray) – Pre-computed scores (optional). Used directly if provided; avoids redundant projection.

Returns:

T² value for each observation (n_obs,).

Return type:

ndarray

Note

If neither Xnew nor Tnew is provided, returns T² for the training set stored in mvmobj.
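
Hotelling's T² is the quadratic form of each score vector with the inverse score covariance matrix (the var_t stored in the model). A minimal sketch with made-up scores; hotelling_t2_sketch is a hypothetical helper, and estimating var_t as the sample covariance is an assumption:

```python
import numpy as np


def hotelling_t2_sketch(T, var_t):
    """T2 for each row of scores T given the score covariance var_t (A x A)."""
    T = np.atleast_2d(np.asarray(T, dtype=float))
    # Quadratic form t_i' inv(var_t) t_i, evaluated row by row.
    return np.einsum("ij,jk,ik->i", T, np.linalg.inv(var_t), T)


T_train = np.array([[2.0, 0.0], [-2.0, 1.0], [0.0, -1.0]])
var_t = np.cov(T_train, rowvar=False)  # assumption: sample covariance, ddof=1
t2_train = hotelling_t2_sketch(T_train, var_t)
```

With centered training scores and this covariance estimate, the T² values sum to (n_obs - 1) × A, which is a handy sanity check.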

pyphi.calc.pca_pred(Xnew, pcaobj, *, algorithm='p2mp')[source]

Project new observations onto a fitted PCA model.

Parameters:
  • Xnew (pd.DataFrame or np.ndarray) – New observations to project. Variables must match those used to train pcaobj.

  • pcaobj (dict) – Fitted PCA model from pca().

  • algorithm (str) – Projection algorithm. 'p2mp' (default) handles missing data; 'standard' uses direct matrix multiplication and requires complete data.

Returns:

Prediction results with keys:

  • Tnew (ndarray): Projected scores (n_new × A).

  • Xhat (ndarray): Reconstructed X in original scale.

  • speX (ndarray): SPE for each new observation.

  • T2 (ndarray): Hotelling’s T² for each new observation.

Return type:

dict
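
One way to project an observation with missing values is to regress its observed entries onto the corresponding rows of the loadings. The sketch below illustrates that least-squares projection for a single preprocessed row; that 'p2mp' works exactly this way is an assumption, and project_with_missing_sketch is a hypothetical name:

```python
import numpy as np


def project_with_missing_sketch(x_new, P):
    """Least-squares scores for one preprocessed row with NaNs (illustrative)."""
    x_new = np.asarray(x_new, dtype=float).ravel()
    observed = ~np.isnan(x_new)
    P_obs = P[observed, :]  # loadings rows for the measured variables only
    t_new, *_ = np.linalg.lstsq(P_obs, x_new[observed], rcond=None)
    x_hat = P @ t_new  # reconstruction, including the missing entries
    # SPE is the sum of squared residuals over the observed variables.
    spe = float(np.sum((x_new[observed] - x_hat[observed]) ** 2))
    return t_new, x_hat, spe


# Two orthonormal loading vectors for four variables (made-up values).
P_demo = np.array([[1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [1.0, -1.0]]) / 2.0
x_demo = P_demo @ np.array([2.0, -1.0])
x_demo[3] = np.nan  # hide one variable
t_new, x_hat, spe_new = project_with_missing_sketch(x_demo, P_demo)
```

Because the example row lies exactly on the model plane, the scores are recovered and the hidden variable is reconstructed with zero SPE.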

pyphi.calc.pls_pred(Xnew, plsobj)[source]

Predict Y for new observations using a fitted PLS model.

Parameters:
  • Xnew (pd.DataFrame or np.ndarray) – New predictor observations. Variables must match those used to train plsobj.

  • plsobj (dict) – Fitted PLS model from pls().

  • algorithm (str) – Projection algorithm. 'p2mp' (default) handles missing data; 'standard' requires complete data.

Returns:

Prediction results with keys:

  • Tnew (ndarray): X-scores for new observations (n_new × A).

  • Yhat (ndarray): Predicted Y in original scale (n_new × n_y).

  • Xhat (ndarray): Reconstructed X in original scale.

  • speX (ndarray): X-space SPE for each new observation.

  • T2 (ndarray): Hotelling’s T² for each new observation.

  • Tcv (ndarray): CCA covariant scores (only if model has Wcv).

Return type:

dict

pyphi.calc.spe(mvmobj, Xnew, *, Ynew=False)[source]

Compute Squared Prediction Error (SPE / Q statistic).

Parameters:
  • mvmobj (dict) – Fitted PCA or PLS model.

  • Xnew (pd.DataFrame or np.ndarray) – New X observations.

  • Ynew (pd.DataFrame or np.ndarray) – New Y observations (optional). Only used for PLS models to also return Y-space SPE.

Returns:

  • If Ynew is not provided (or model is PCA): returns speX (ndarray, shape n_obs × 1).

  • If Ynew is provided and model is PLS: returns (speX, speY) tuple of arrays.

Return type:

ndarray or tuple

pyphi.calc.lwpls(xnew, loc_par, mvmobj, X, Y, *, shush=False)[source]

Locally Weighted PLS (LWPLS) prediction for a single new observation.

Per Kim et al. Int. J. Pharmaceutics 421 (2011) 269–274.

Parameters:
  • xnew (np.ndarray or pd.DataFrame) – Single new observation (1 × n_x).

  • loc_par (float) – Locality parameter controlling the width of the Gaussian kernel. Larger values include more training observations.

  • mvmobj (dict) – Global PLS model from pls(), used to define the score space for distance calculation.

  • X (pd.DataFrame or np.ndarray) – Training X data.

  • Y (pd.DataFrame or np.ndarray) – Training Y data.

  • shush (bool) – Suppress printed output. Default False.

Returns:

Prediction results with keys:

  • Yhat (ndarray): Locally predicted Y (1 × n_y).

  • weights (ndarray): Observation weights used in local model.

Return type:

dict

pyphi.calc.contributions(mvmobj, X, cont_type, *, Y=False, from_obs=False, to_obs=False, lv_space=False)[source]

Compute variable contributions to monitoring statistics.

Parameters:
  • mvmobj (dict) – Fitted PCA or PLS model.

  • Xnew (pd.DataFrame or np.ndarray) – Observations to diagnose.

  • cont_type (str) – Type of contribution to compute. 'scores': contribution to each score. 'spex': contribution to X-space SPE. 'spey': contribution to Y-space SPE (PLS only). 't2': contribution to Hotelling’s T².

  • Ynew (pd.DataFrame or np.ndarray) – Y observations (optional, required for cont_type='spey').

Returns:

Contribution values (n_obs × n_vars).

Return type:

ndarray

Ref: Miller, P., Swanson, R.E. and Heckler, C.E., 1998. Contribution plots: a missing link in multivariate quality control. Applied mathematics and computer science, 8(4), pp.775-792.

pyphi.calc.clean_empty_rows(X, *, shush=False)[source]

Remove rows that are entirely NaN.

Parameters:
  • X (pd.DataFrame or np.ndarray) – Input data matrix.

  • shush (bool) – Suppress printed output. Default False.

Returns:

Data with fully empty rows removed.

Return type:

pd.DataFrame or np.ndarray

pyphi.calc.clean_low_variances(X, *, shush=False, min_var=1e-10)[source]

Remove columns with variance below a threshold.

Parameters:
  • X (pd.DataFrame or np.ndarray) – Input data matrix.

  • min_var (float) – Minimum acceptable variance. Default 1e-10.

  • shush (bool) – Suppress printed output. Default False.

Returns:

Data with low-variance columns removed.

Return type:

pd.DataFrame or np.ndarray
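
The two cleaning steps above can be sketched on a plain array as follows. This is an illustration under assumptions (ddof=1 for the variance, indices returned instead of names) with hypothetical function names, not the library code:

```python
import numpy as np


def drop_empty_rows_sketch(X):
    """Remove rows that are entirely NaN; also report their indices."""
    X = np.asarray(X, dtype=float)
    keep = ~np.isnan(X).all(axis=1)
    return X[keep], np.flatnonzero(~keep)


def drop_low_variance_sketch(X, min_var=1e-10):
    """Remove columns whose NaN-aware variance is below min_var."""
    X = np.asarray(X, dtype=float)
    keep = np.nanvar(X, axis=0, ddof=1) >= min_var
    return X[:, keep], np.flatnonzero(~keep)


X = np.array([[1.0, 5.0, 2.0], [np.nan, np.nan, np.nan], [3.0, 5.0, 4.0]])
X_rows, removed_rows = drop_empty_rows_sketch(X)
X_clean, removed_cols = drop_low_variance_sketch(X_rows)
```

Removing empty rows first avoids counting them when column variances are computed.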

pyphi.calc.spectra_snv(x)[source]

Apply Standard Normal Variate (SNV) correction to spectra.

Each spectrum (row) is mean-centered and scaled by its own standard deviation. Removes multiplicative scatter effects.

Parameters:

X (pd.DataFrame or np.ndarray) – Spectra matrix (n_samples × n_wavelengths). If a DataFrame, the first column must contain sample IDs.

Returns:

SNV-corrected spectra (same type as input).

Return type:

pd.DataFrame or np.ndarray
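
The row-wise operation described above can be sketched on a plain array (no ID column). The use of ddof=1 and the name snv_sketch are assumptions for illustration:

```python
import numpy as np


def snv_sketch(spectra):
    """Center and scale each spectrum (row) by its own mean and std."""
    spectra = np.asarray(spectra, dtype=float)
    row_mean = spectra.mean(axis=1, keepdims=True)
    row_std = spectra.std(axis=1, ddof=1, keepdims=True)  # ddof=1 assumed
    return (spectra - row_mean) / row_std


base = np.array([1.0, 2.0, 4.0, 3.0])
spectra = np.vstack([base, 3.0 * base + 5.0])  # same shape, different scatter
corrected = snv_sketch(spectra)
```

Both rows collapse onto the same corrected spectrum because they differ only by an offset and a positive multiplicative factor.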

pyphi.calc.spectra_savgol(ws, od, op, Dm)[source]

Apply Savitzky-Golay smoothing and/or differentiation to spectra.

Parameters:
  • X (pd.DataFrame or np.ndarray) – Spectra matrix (n_samples × n_wavelengths). If a DataFrame, the first column must contain sample IDs.

  • window (int) – Window length (must be odd and greater than poly).

  • poly (int) – Polynomial order for the filter.

  • deriv (int) – Derivative order. 0 = smoothing only, 1 = first derivative, 2 = second derivative.

Returns:

Filtered spectra (same type as input).

Return type:

pd.DataFrame or np.ndarray
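
A Savitzky-Golay filter fits a polynomial to each moving window by least squares and evaluates it (or its derivative) at the window center. The sketch below builds the weights with a pseudo-inverse and assumes unit sample spacing; it returns interior points only and is not the pyphi implementation. All names are hypothetical:

```python
from math import factorial

import numpy as np


def savgol_weights_sketch(window, poly, deriv=0):
    """Convolution weights for the window center (unit sample spacing)."""
    half = window // 2
    z = np.arange(-half, half + 1)
    J = np.vander(z, poly + 1, increasing=True)  # columns: 1, z, z^2, ...
    # pinv(J) maps window values to polynomial coefficients; coefficient
    # number `deriv` times deriv! is the deriv-th derivative at z = 0.
    return factorial(deriv) * np.linalg.pinv(J)[deriv]


def savgol_apply_sketch(spectra, window, poly, deriv=0):
    """Filter each row; only interior points are returned (no edge padding)."""
    w = savgol_weights_sketch(window, poly, deriv)
    spectra = np.atleast_2d(np.asarray(spectra, dtype=float))
    # np.convolve flips its kernel, so pass the weights reversed.
    return np.array([np.convolve(row, w[::-1], mode="valid") for row in spectra])


x = np.arange(10, dtype=float)
smoothed = savgol_apply_sketch(x**2, window=5, poly=2, deriv=0)
slope = savgol_apply_sketch(x**2, window=5, poly=2, deriv=1)
```

A quadratic signal filtered with a second-order polynomial is reproduced exactly, and its first derivative comes out as 2x at the interior points.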

pyphi.calc.spectra_mean_center(Dm)[source]

Mean-center each wavelength across the sample set.

Parameters:

X (pd.DataFrame or np.ndarray) – Spectra matrix (n_samples × n_wavelengths).

Returns:

Mean-centered spectra.

Return type:

pd.DataFrame or np.ndarray

pyphi.calc.spectra_autoscale(Dm)[source]

Autoscale spectra (mean-center and scale each wavelength to unit variance).

Parameters:

X (pd.DataFrame or np.ndarray) – Spectra matrix (n_samples × n_wavelengths).

Returns:

Autoscaled spectra.

Return type:

pd.DataFrame or np.ndarray

pyphi.calc.spectra_baseline_correction(Dm)[source]

Apply piecewise linear baseline correction to spectra.

Parameters:
  • X (pd.DataFrame or np.ndarray) – Spectra matrix (n_samples × n_wavelengths). If a DataFrame, the first column must contain sample IDs.

  • anchor_points (list of int) – Column indices to use as baseline anchor points for the piecewise linear interpolation.

Returns:

Baseline-corrected spectra.

Return type:

pd.DataFrame or np.ndarray

pyphi.calc.spectra_msc(Dm, reference_spectra=None)[source]

Apply Multiplicative Scatter Correction (MSC) to spectra.

Parameters:
  • X (pd.DataFrame or np.ndarray) – Spectra matrix (n_samples × n_wavelengths). If a DataFrame, the first column must contain sample IDs.

  • reference (np.ndarray) – Reference spectrum to correct against. Defaults to the mean spectrum of X.

Returns:

MSC-corrected spectra (same type as input).

Return type:

pd.DataFrame or np.ndarray
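
MSC is commonly described as regressing each spectrum on the reference and then removing the fitted offset and slope. A minimal sketch of that textbook form on a plain array; msc_sketch is a hypothetical name and the library may differ in details:

```python
import numpy as np


def msc_sketch(spectra, reference=None):
    """Regress each spectrum on the reference and undo offset and slope."""
    spectra = np.asarray(spectra, dtype=float)
    if reference is None:
        reference = spectra.mean(axis=0)  # default: mean spectrum
    corrected = np.empty_like(spectra)
    for i, row in enumerate(spectra):
        # Fit row = intercept + slope * reference by least squares.
        slope, intercept = np.polyfit(reference, row, 1)
        corrected[i] = (row - intercept) / slope
    return corrected


reference = np.array([1.0, 2.0, 4.0, 3.0])
spectra = np.vstack([2.0 * reference + 1.0, 0.5 * reference - 0.2])
corrected = msc_sketch(spectra, reference)
corrected_default = msc_sketch(spectra)
```

Spectra that are exact linear transforms of the reference are mapped back onto it.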

pyphi.calc.bootstrap_pls(X, Y, num_latents, num_samples, **kwargs)[source]

Estimate PLS loading uncertainty via bootstrap resampling.

Parameters:
  • X (pd.DataFrame or np.ndarray) – Training X data.

  • Y (pd.DataFrame or np.ndarray) – Training Y data.

  • A (int) – Number of latent variables.

  • n_boots (int) – Number of bootstrap iterations.

  • mcs (tuple) – Preprocessing flags. Default ('autoscale', 'autoscale').

  • shush (bool) – Suppress per-iteration output. Default True.

Returns:

Bootstrap results with keys:

  • W_boot (ndarray): Bootstrap distribution of W (n_boots × n_x × A).

  • Q_boot (ndarray): Bootstrap distribution of Q (n_boots × n_y × A).

  • W_mean, W_std: Mean and std of bootstrap W.

  • Q_mean, Q_std: Mean and std of bootstrap Q.

Return type:

dict

pyphi.calc.bootstrap_pls_pred(X_new, bootstrap_pls_obj, quantiles=[0.025, 0.975])[source]

Predict Y with uncertainty estimates using a bootstrap PLS ensemble.

Parameters:
  • Xnew (pd.DataFrame or np.ndarray) – New X observations to predict.

  • boot_obj (dict) – Bootstrap model from bootstrap_pls().

  • alpha (float) – Confidence level for prediction intervals. Default 0.95.

Returns:

Prediction results with keys:

  • Yhat (ndarray): Mean predicted Y (n_new × n_y).

  • Yhat_lb (ndarray): Lower bound of prediction interval.

  • Yhat_ub (ndarray): Upper bound of prediction interval.

  • Yhat_std (ndarray): Std dev of bootstrap predictions.

Return type:

dict
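
Given predictions from every model in the ensemble, the interval bounds can be taken as quantiles across models. The sketch assumes the predictions are stacked as (n_models, n_new, n_y) and reuses the key names listed above for readability; ensemble_interval_sketch is hypothetical:

```python
import numpy as np


def ensemble_interval_sketch(predictions, quantiles=(0.025, 0.975)):
    """Summarize predictions stacked as (n_models, n_new, n_y)."""
    predictions = np.asarray(predictions, dtype=float)
    lower, upper = np.quantile(predictions, quantiles, axis=0)
    return {
        "Yhat": predictions.mean(axis=0),
        "Yhat_lb": lower,
        "Yhat_ub": upper,
        "Yhat_std": predictions.std(axis=0, ddof=1),
    }


# 101 made-up ensemble predictions for one observation and one response.
predictions = np.arange(101.0).reshape(101, 1, 1)
summary = ensemble_interval_sketch(predictions)
```

The default quantiles (0.025, 0.975) correspond to a 95% interval.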

pyphi.calc.np2D2pyomo(arr, *, varids=False)[source]

Convert a 2D NumPy array to a Pyomo-compatible dictionary.

Parameters:

data (np.ndarray) – 2D array to convert.

Returns:

Dictionary keyed by (i, j) integer index tuples.

Return type:

dict

pyphi.calc.np1D2pyomo(arr, *, indexes=False)[source]

Convert a 1D NumPy array to a Pyomo-compatible dictionary.

Parameters:

data (np.ndarray) – 1D array to convert.

Returns:

Dictionary keyed by integer index.

Return type:

dict
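
Index-keyed dictionaries like these are the form that can be passed as initialize= to a Pyomo Param. A minimal sketch of both conversions; the 1-based start index is an assumption, and the function names are hypothetical:

```python
import numpy as np


def array2d_to_indexed_dict(arr, start=1):
    """Map a 2D array to {(i, j): value}; start=1 is an assumed convention."""
    arr = np.asarray(arr)
    return {
        (i + start, j + start): float(arr[i, j])
        for i in range(arr.shape[0])
        for j in range(arr.shape[1])
    }


def array1d_to_indexed_dict(arr, start=1):
    """Map a 1D array to {i: value}."""
    return {i + start: float(v) for i, v in enumerate(np.asarray(arr).ravel())}


ws_dict = array2d_to_indexed_dict(np.array([[0.1, 0.2], [0.3, 0.4]]))
mx_dict = array1d_to_indexed_dict(np.array([10.0, 20.0]))
```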

pyphi.calc.adapt_pls_4_pyomo(plsobj, *, use_var_ids=False)[source]

Convert PLS model arrays to Pyomo-compatible dictionaries.

Transforms P, Q, W, Ws, mx, sx, my, sy into the indexed dict format required by Pyomo Param objects.

Parameters:

plsobj (dict) – Fitted PLS model from pls().

Returns:

Model parameters as Pyomo-indexed dictionaries.

Return type:

dict

pyphi.calc.prep_pca_4_MDbyNLP(pcaobj, X)[source]

Prepare a PCA model for missing-data imputation by NLP.

Extracts and formats the loadings and preprocessing parameters needed to set up a Pyomo optimization problem for MD imputation.

Parameters:

pcaobj (dict) – Fitted PCA model from pca().

Returns:

Parameters formatted for use in a Pyomo MD-by-NLP formulation.

Return type:

dict

pyphi.calc.prep_pls_4_MDbyNLP(plsobj, X, Y)[source]

Prepare a PLS model for missing-data imputation by NLP.

Parameters:

plsobj (dict) – Fitted PLS model from pls().

Returns:

Parameters formatted for use in a Pyomo MD-by-NLP formulation.

Return type:

dict

pyphi.calc.conv_pls_2_eiot(plsobj, *, r_length=False)[source]

Convert a PLS model for use in EIOT (Extended Iterative Optimization Technology).

Parameters:
  • plsobj (dict) – Fitted PLS model from pls().

  • r2y_threshold (float) – Minimum cumulative R²Y to determine the number of LVs to retain. Default 0.95.

Returns:

EIOT-compatible model parameters.

Return type:

dict

pyphi.calc.cat_2_matrix(X)[source]

Convert a categorical variable column to a binary indicator matrix.

Parameters:
  • x (pd.DataFrame) – Data frame with columns of categorical data. The first column is the variable ID.

  • shush (bool) – Suppress printed output. Default False.

Returns:

  • x_binary (pd.DataFrame): Binary indicator matrix with one column per unique category (same type as input), all categories concatenated.

  • xmb (pd.DataFrame): Binary indicator matrix with one column per unique category (same type as input), with categories organized by block for multi-block models (if the DataFrame has multiple columns).

pyphi.calc.mbpls(XMB, YMB, A, *, mcsX=True, mcsY=True, md_algorithm_='nipals', force_nipals_=False, shush_=False, cross_val_=0, cross_val_X_=False, cca=False)[source]

Fit a Multi-Block PLS (MBPLS) model.

Parameters:
  • Xmb (dict) – Dictionary of X blocks {'block_name': pd.DataFrame}. Each DataFrame’s first column must contain observation IDs.

  • Y (pd.DataFrame or np.ndarray) – Response matrix. First column is observation IDs if a DataFrame.

  • A (int) – Number of latent variables.

  • mcs (tuple) – Preprocessing flags (mcs_X, mcs_Y). Default ('autoscale', 'autoscale').

  • shush (bool) – Suppress printed output. Default False.

  • cross_val (int) – Cross-validation level (same as pls()).

  • cross_val_X (bool) – Cross-validate X-space. Default False.

Returns:

Fitted MBPLS model, extending the standard PLS model dict with per-block keys:

  • T (ndarray): Super-scores.

  • Tb (dict): Per-block scores keyed by block name.

  • Pb (dict): Per-block loadings.

  • Wb (dict): Per-block weights.

  • r2xb (dict): Per-block R² contributions.

  • block_importance (ndarray): Variance importance per block.

Plus all standard PLS keys (Q, r2y, speX, etc.).

Return type:

dict

pyphi.calc.replicate_data(mvm_obj, X, num_replicates, *, as_set=False, rep_Y=False, Y=False)[source]

Augment a dataset by adding small noise replicates.

Useful for regularizing models when training data is limited.

Parameters:
  • X (pd.DataFrame or np.ndarray) – Original data matrix.

  • n_reps (int) – Number of noisy replicates to add. Default 2.

  • noise_level (float) – Standard deviation of additive Gaussian noise relative to each variable’s std dev. Default 0.01.

Returns:

Augmented matrix with original + replicated rows (same type as input).

Return type:

pd.DataFrame or np.ndarray

pyphi.calc.export_2_gproms(mvmobj, *, fname='phi_export.txt')[source]

Export PLS model to gPROMS syntax.

pyphi.calc.unique(df, colid)[source]

Return unique values preserving original order.

Parameters:

x (list or np.ndarray) – Input sequence.

Returns:

Unique values in the order they first appear.

Return type:

list
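
Order-preserving de-duplication of a plain sequence can be sketched with a dictionary, since Python dicts keep insertion order. unique_in_order is a hypothetical helper operating on a list rather than on a DataFrame column:

```python
def unique_in_order(values):
    """Return the distinct items of values in order of first appearance."""
    return list(dict.fromkeys(values))


lots = unique_in_order(["B2", "A1", "B2", "C3", "A1"])
```

Unlike sorting-based approaches, this keeps materials and lots in the same order as in the raw data.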

pyphi.calc.parse_materials(filename, sheetname)[source]

Build R matrices for JRPLS from linear table in Excel.

pyphi.calc.isin_ordered_col0(df, alist)[source]
pyphi.calc.reconcile_rows(df_list)[source]

Align two DataFrames by their observation IDs (first column).

Reorders Y to match the row order of X. Observations present in one but not the other are dropped, with a warning printed.

Parameters:
  • X (pd.DataFrame) – Reference DataFrame. First column is observation IDs.

  • Y (pd.DataFrame) – DataFrame to align. First column is observation IDs.

Returns:

(X_aligned, Y_aligned) — DataFrames sharing the same ordered set of observation IDs.

Return type:

tuple
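
The reconciliation idea (keep only observations present in every table, in one shared order) can be sketched without pandas. Here each table is a list of rows whose first element is the observation ID; unique IDs within a table are assumed, and reconcile_rows_sketch is a hypothetical name:

```python
def reconcile_rows_sketch(tables):
    """Keep only rows whose ID (first element) appears in every table."""
    common = set.intersection(*(set(row[0] for row in t) for t in tables))
    # Order follows the first table, so every output shares one row order.
    order = [row[0] for row in tables[0] if row[0] in common]
    reconciled = []
    for t in tables:
        by_id = {row[0]: row for row in t}
        reconciled.append([by_id[obs_id] for obs_id in order])
    return reconciled


X_rows = [["L1", 1.0], ["L2", 2.0], ["L3", 3.0]]
Y_rows = [["L3", 30.0], ["L1", 10.0], ["L4", 40.0]]
X_rec, Y_rec = reconcile_rows_sketch([X_rows, Y_rows])
```

Observations L2 and L4 are dropped because each appears in only one of the two tables.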

pyphi.calc.reconcile_rows_to_columns(df_list_r, df_list_c)[source]

Map DataFrame rows to the columns of another DataFrame.

Used in L-shaped data structures where material lot IDs appear as column headers in X and as row IDs in R.

Parameters:
  • X (pd.DataFrame) – Process data where columns (after the first) correspond to lot IDs.

  • R (pd.DataFrame) – Material property data where the first column contains lot IDs.

Returns:

(X_matched, R_matched) — aligned matrices ready for LPLS.

Return type:

tuple

pyphi.calc.lpls(X, R, Y, A, *, shush=False)[source]

Fit an L-shaped PLS (LPLS) model.

Models the relationship between lot physical properties (R), process observations (X), and product quality (Y), where X rows correspond to lots described by R columns.

Per Muteki et al., Chemom. Intell. Lab. Syst. 85 (2007) 186–194.

Parameters:
  • X (pd.DataFrame or np.ndarray) – Process data matrix (n_obs × n_x). First column is observation IDs if a DataFrame.

  • R (pd.DataFrame or np.ndarray) – Raw material property matrix (n_lots × n_r). Columns of X map to rows of R.

  • Y (pd.DataFrame or np.ndarray) – Quality/response matrix (n_lots × n_y). Rows match rows of R.

  • A (int) – Number of latent variables.

  • shush (bool) – Suppress printed output. Default False.

Returns:

Fitted LPLS model with keys:

  • T (ndarray): X-space scores (n_obs × A).

  • P (ndarray): X-loadings (n_x × A).

  • Q (ndarray): Y-loadings (n_y × A).

  • H (ndarray): R-space scores (n_lots × A).

  • V (ndarray): R-space loadings (n_r × A).

  • Rscores (ndarray): R projected scores.

  • Ss (ndarray): Rotated R weights S*(V’S)⁻¹.

  • r2x, r2xpv: R² for X space.

  • r2y, r2ypv: R² for Y space.

  • r2r, r2rpv: R² for R space.

  • mx, sx, my, sy, mr, sr: Preprocessing params.

  • var_t: Score covariance matrix.

  • T2, T2_lim95, T2_lim99: Hotelling’s T² and limits.

  • speX, speX_lim95, speX_lim99: X SPE and limits.

  • speY, speY_lim95, speY_lim99: Y SPE and limits.

  • speR, speR_lim95, speR_lim99: R SPE and limits.

Return type:

dict

pyphi.calc.lpls_pred(rnew, lpls_obj)[source]

Predict Y for new lot(s) using a fitted LPLS model.

Parameters:
  • rnew (np.ndarray or pd.DataFrame) – R-space observation(s) for new lot(s). Variables must match those in lpls_obj.

  • lpls_obj (dict) – Fitted LPLS model from lpls().

Returns:

Prediction results with keys:

  • Tnew (ndarray): Projected scores (n_new × A).

  • Yhat (ndarray): Predicted Y in original scale.

  • speR (ndarray): R-space SPE for each new lot.

Return type:

dict

pyphi.calc.jrpls(Xi, Ri, Y, A, *, shush=False)[source]

Fit a Joint R-LPLS (JRPLS) model across multiple campaigns.

Extends LPLS to handle multiple manufacturing campaigns, each with their own X (process) and R (raw material) blocks sharing a common Y.

Per Garcia-Munoz, Chemom. Intell. Lab. Syst. 133 (2014) 49–62.

Parameters:
  • Xi (dict) – Process data blocks {'campaign': pd.DataFrame}. Each DataFrame’s first column is observation IDs.

  • Ri (dict) – Raw material property blocks {'campaign': pd.DataFrame}. Keys must match Xi. First column is lot IDs.

  • Y (pd.DataFrame or np.ndarray) – Shared response matrix. Rows match lots across all campaigns.

  • A (int) – Number of latent variables.

  • shush (bool) – Suppress printed output. Default False.

Returns:

Fitted JRPLS model with per-campaign sub-dicts and shared keys.

Structure mirrors lpls() output but indexed by campaign.

Return type:

dict

pyphi.calc.jrpls_pred(rnew, jrplsobj)[source]

Predict Y for a new observation using a fitted JRPLS model.

Parameters:
  • xnew (pd.DataFrame or np.ndarray) – New process observation(s). Variables must match the specified campaign’s X block.

  • rnew (pd.DataFrame or np.ndarray) – New raw material lot properties. Variables must match the specified campaign’s R block.

  • campaign (str) – Name of the campaign this observation belongs to.

  • jrpls_obj (dict) – Fitted JRPLS model from jrpls().

Returns:

Prediction results with keys:

  • Tnew (ndarray): Projected X-scores.

  • Yhat (ndarray): Predicted Y in original scale.

  • speX (ndarray): X-space SPE.

  • speR (ndarray): R-space SPE.

  • T2 (ndarray): Hotelling’s T².

Return type:

dict

Example

rnew = {
    'MAT1': [('A0129', 0.557949425), ('A0130', 0.442050575)],
    'MAT2': [('Lac0003', 1)],
    'MAT3': [('TLC018', 1)],
    'MAT4': [('M0012', 1)],
    'MAT5': [('CS0017', 1)],
}

pyphi.calc.tpls(Xi, Ri, Z, Y, A, *, shush=False)[source]

Fit a TPLS model.

Models relationships between time-varying process trajectories (Z), raw material properties (R), and product quality (Y).

Parameters:
  • Z (pd.DataFrame or np.ndarray) – Process trajectory matrix. First column is observation IDs if a DataFrame.

  • Xi (dict) – Process data blocks {'campaign': pd.DataFrame}. Each DataFrame’s first column is observation IDs.

  • Ri (dict) – Raw material property blocks {'campaign': pd.DataFrame}. Keys must match Xi. First column is lot IDs.

  • Y (pd.DataFrame or np.ndarray) – Shared response matrix. Rows match lots across all campaigns.

  • A (int) – Number of latent variables.

  • shush (bool) – Suppress printed output. Default False.

Returns:

Fitted TPLS model. Keys mirror jrpls() with an additional Ws (ndarray) rotated weight matrix for Z-space.

Return type:

dict

pyphi.calc.jypls(Xi, Yi, A, *, shush=False)[source]

Fit a Joint-Y PLS (JYPLS) model across multiple campaigns.

Each campaign has its own X block (different variables allowed), but all campaigns share a common Y column space and a jointly estimated Q matrix.

Per Garcia-Munoz, MacGregor, Kourti, Chemom. Intell. Lab. Syst. 79 (2005) 101–114.

Parameters:
  • Xi (dict) – Predictor blocks {'campaign_name': pd.DataFrame}. Each X can have a different number of columns. First column of each DataFrame is observation IDs.

  • Yi (dict) – Response blocks {'campaign_name': pd.DataFrame}. Keys must match Xi. All Y blocks must have identical columns (same Y variable space across campaigns). First column of each DataFrame is observation IDs.

  • A (int) – Number of latent variables.

  • shush (bool) – Suppress printed output. Default False.

Returns:

Fitted JYPLS model with keys:

  • Q (ndarray): Shared Y-loadings (n_y × A).

  • T (dict): Per-campaign X-scores.

  • P (dict): Per-campaign X-loadings.

  • W (dict): Per-campaign X-weights.

  • Ws (dict): Per-campaign rotated weights W*(P’W)⁻¹.

  • r2xi (dict): Per-campaign R² for X.

  • r2yi (dict): Per-campaign R² for Y.

  • r2y (float): Overall R² for Y.

  • mx, sx (dict): Per-campaign X preprocessing params.

  • my, sy (ndarray): Shared Y preprocessing params.

  • blk_scale (dict): Per-campaign block scaling factors.

  • var_t (ndarray): Pooled score covariance matrix.

  • campaigns (list): Ordered list of campaign names.

Return type:

dict
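The Ws entry listed above is defined as W(P'W)⁻¹. A minimal NumPy sketch of that formula for a single block follows. The function name rotated_weights and the random W and P matrices are placeholders for illustration; this is not the code pyphi runs internally.

```python
import numpy as np


def rotated_weights(W, P):
    """Return Ws = W (P'W)^-1 for one block of weights and loadings."""
    return W @ np.linalg.inv(P.T @ W)


# Placeholder weights and loadings for a block with 5 variables, A = 2.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 2))
P = W + 0.1 * rng.normal(size=(5, 2))

Ws = rotated_weights(W, P)

# By construction P' Ws is the identity matrix.
identity_check = P.T @ Ws
```

The rotated weights let scores be computed directly from the preprocessed data as T = X Ws, without deflating X one component at a time.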

pyphi.calc.jypls_pred(xnew, campaign, jypls_obj)[source]

Predict Y for a new observation using a fitted JYPLS model.

Parameters:
  • xnew (pd.DataFrame or np.ndarray) – New X observation(s). Variables must match those of the specified campaign.

  • campaign (str) – Campaign name this observation belongs to. Must match a key used when building the model with jypls().

  • jypls_obj (dict) – Fitted JYPLS model from jypls().

Returns:

Prediction results with keys:

  • Tnew (ndarray): Projected X-scores (n_new × A).

  • Yhat (ndarray): Predicted Y in original scale (n_new × n_y).

  • speX (ndarray): X-space SPE for each new observation.

  • T2 (ndarray): Hotelling’s T² using pooled score covariance.

Return type:

dict
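The returned quantities can be illustrated with generic PLS prediction arithmetic. The sketch below is written under stated assumptions: X is centred and scaled with mx and sx, block scaling (blk_scale) is ignored, and T² uses the inverse of the score covariance var_t. The function pls_predict_sketch is hypothetical and may differ in detail from what jypls_pred does.

```python
import numpy as np


def pls_predict_sketch(xnew, mx, sx, Ws, P, Q, my, sy, var_t):
    """Generic PLS projection and diagnostics (illustrative only)."""
    x = (np.atleast_2d(xnew) - mx) / sx       # preprocess new X
    tnew = x @ Ws                             # projected scores (n_new x A)
    yhat = (tnew @ Q.T) * sy + my             # back to original Y scale
    resid = x - tnew @ P.T                    # X-space residuals
    spe_x = np.sum(resid ** 2, axis=1)        # squared prediction error
    # Hotelling's T2: t * inv(var_t) * t' for each row of tnew
    t2 = np.einsum("ij,jk,ik->i", tnew, np.linalg.inv(var_t), tnew)
    return {"Tnew": tnew, "Yhat": yhat, "speX": spe_x, "T2": t2}


# Toy model: 3 X variables, A = 2 components, 1 Y variable.
Ws = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
P = Ws.copy()
Q = np.array([[1.0, 1.0]])
pred = pls_predict_sketch(
    xnew=np.array([1.0, 2.0, 3.0]),
    mx=np.zeros(3), sx=np.ones(3),
    Ws=Ws, P=P, Q=Q,
    my=np.array([1.0]), sy=np.array([2.0]),
    var_t=np.diag([1.0, 4.0]),
)
```

In this toy case the third X variable is not captured by either component, so all of its value ends up in speX.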

pyphi.calc.tpls_pred(rnew, znew, tplsobj)[source]

Predict Y for new observations using a fitted TPLS model.

Args:

rnew (np.ndarray or pd.DataFrame): New R-space (raw material) data.

znew (np.ndarray or pd.DataFrame): New Z-space (trajectory) data.

tplsobj (dict): Fitted TPLS model from tpls().

Returns:

dict: Prediction results with keys:

  • Tnew (ndarray): Projected scores.

  • Yhat (ndarray): Predicted Y in original scale.

  • speR (ndarray): R-space SPE.

  • speZ (ndarray): Z-space SPE.

  • T2 (ndarray): Hotelling’s T².

Example for rnew:

rnew = {
    'MAT1': [('A0129', 0.557949425), ('A0130', 0.442050575)],
    'MAT2': [('Lac0003', 1)],
    'MAT3': [('TLC018', 1)],
    'MAT4': [('M0012', 1)],
    'MAT5': [('CS0017', 1)],
}

pyphi.calc.varimax_(X, gamma=1.0, q=20, tol=1e-06)[source]
pyphi.calc.varimax_rotation(mvm_obj, X, *, Y=False)[source]

Apply Varimax rotation to PCA or PLS loadings.

Rotates loadings toward a simple structure (sparse, interpretable). Updates the model object in-place and returns the rotated model.

Parameters:
  • mvm_obj (dict) – Fitted PCA or PLS model.

  • X (pd.DataFrame or np.ndarray) – Training X data used to reproject scores after rotation.

  • Y (pd.DataFrame or np.ndarray) – Training Y data (optional, for PLS). Keyword-only. Default False.

Returns:

Model with rotated loadings and reprojected scores.

Return type:

dict
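For readers unfamiliar with Varimax, the sketch below shows the widely used SVD-based iteration for finding an orthogonal rotation of a loading matrix. The argument names gamma, q and tol mirror the varimax_ signature above, but the body is the textbook algorithm written for illustration; it is an assumption that pyphi follows the same scheme, and the name varimax_sketch is hypothetical.

```python
import numpy as np


def varimax_sketch(Phi, gamma=1.0, q=20, tol=1e-6):
    """Textbook Varimax: return (rotated loadings, rotation matrix)."""
    p, k = Phi.shape
    R = np.eye(k)          # start from no rotation
    d = 0.0
    for _ in range(q):     # at most q iterations
        d_old = d
        Lam = Phi @ R
        # Gradient-like term of the Varimax criterion
        G = Phi.T @ (
            Lam ** 3 - (gamma / p) * Lam @ np.diag(np.sum(Lam ** 2, axis=0))
        )
        u, s, vh = np.linalg.svd(G)
        R = u @ vh         # nearest orthogonal matrix to G
        d = np.sum(s)
        if d_old != 0 and d / d_old < 1 + tol:
            break          # criterion stopped improving
    return Phi @ R, R


rng = np.random.default_rng(1)
loadings = rng.normal(size=(6, 2))
rotated, R = varimax_sketch(loadings)
```

Because R is orthogonal, the rotation changes how variance is distributed across components but leaves each variable's total squared loading unchanged, which is why scores need to be reprojected afterwards.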

pyphi.calc.findstr(string)[source]

Find indices of strings containing a given pattern.

Parameters:
  • str_list (list of str) – List of strings to search.

  • pattern (str) – Substring to search for.

Returns:

Indices of elements in str_list that contain pattern.

Return type:

list
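The behaviour described above can be sketched in a few lines of plain Python. The function below uses the documented parameter names str_list and pattern; its name find_indices is hypothetical, and it is an illustration rather than the pyphi implementation.

```python
def find_indices(str_list, pattern):
    """Return the indices of the strings in str_list that contain pattern."""
    return [i for i, text in enumerate(str_list) if pattern in text]


variable_names = ["temp_in", "pressure", "temp_out"]
matches = find_indices(variable_names, "temp")
```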

pyphi.calc.evalvar(data, vname)[source]
pyphi.calc.writeeq(beta_, features_)[source]
pyphi.calc.build_polynomial(data, factors, response, *, bias_term=True)[source]

Linear regression with variable selection assisted by PLS.

pyphi.calc.cca(X, Y, tol=1e-06, max_iter=1000)[source]

Canonical Correlation Analysis (CCA) between PLS scores and Y.

Computes the maximum covariance directions between the score matrix T and response Y. Equivalent to computing the predictive component in OPLS.

Parameters:
  • T (np.ndarray) – Score matrix from a fitted PLS model (n_obs × A).

  • Y (pd.DataFrame or np.ndarray) – Response matrix (n_obs × n_y).

  • mcs (tuple) – Preprocessing flags for T and Y. Default ('autoscale', 'autoscale').

Returns:

CCA results with keys:

  • Tcv (ndarray): Covariant scores.

  • Pcv (ndarray): Covariant loadings (predictive loadings in OPLS sense).

  • Wcv (ndarray): Covariant weights.

Return type:

dict
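To make the idea concrete, the sketch below computes the first pair of canonical directions between two matrices with a closed-form QR and SVD approach. This is a swapped-in technique chosen for brevity: the tol and max_iter arguments in the signature above suggest pyphi uses an iterative scheme, and no autoscaling is applied here, only centring. The function name cca_first_pair and the synthetic data are assumptions, and both matrices must have full column rank.

```python
import numpy as np


def cca_first_pair(X, Y):
    """First canonical weight vectors and correlation via QR + SVD."""
    Xc = X - X.mean(axis=0)            # centre both blocks
    Yc = Y - Y.mean(axis=0)
    Qx, Rx = np.linalg.qr(Xc)          # orthonormal bases of each block
    Qy, Ry = np.linalg.qr(Yc)
    U, s, Vt = np.linalg.svd(Qx.T @ Qy, full_matrices=False)
    a = np.linalg.solve(Rx, U[:, 0])   # weights for X
    b = np.linalg.solve(Ry, Vt[0, :])  # weights for Y
    return a, b, s[0]                  # s[0] is the canonical correlation


# Synthetic stand-ins for a score matrix T (50 x 3) and a single response.
rng = np.random.default_rng(0)
T = rng.normal(size=(50, 3))
y = T @ np.array([[1.0], [0.5], [-1.0]]) + 0.1 * rng.normal(size=(50, 1))

a, b, rho = cca_first_pair(T, y)
observed_corr = np.corrcoef(T @ a, y @ b)[0, 1]
```

With a single response column, the direction a is the combination of scores most correlated with y, which is the sense in which the result resembles a single predictive component.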

pyphi.calc.cca_multi(X, Y, num_components=1, tol=1e-06, max_iter=1000)[source]

CCA with multiple canonical variates.