pyphi.calc module

Phi for Python (pyPhi) — Version 2.0

By Sal Garcia (sgarciam@ic.ac.uk salvadorgarciamunoz@gmail.com)

Added March 13

  • Added routine to fix duplicated obsID

Added Feb 23 2026

  • Added _validate_inputs function for input validation and observation reconciliation

  • Integrated validation into pca, pls, lpls entry points

  • Replaced np.tile with numpy broadcasting throughout

  • Optimized _Ab_btbinv with fast path for complete data

  • var_t (score covariance matrix) stored in model objects to avoid recalculation

  • Added _extract_array and _calc_r2 helper functions to reduce duplication

  • Replaced hardcoded F-distribution and chi2 lookup tables with scipy.stats

  • Replaced hardcoded t-distribution with scipy.stats

Added Feb 07 2026

  • fixed cat_2_matrix for the output to be consistent with MBPLS

Added Jan 30 2025

  • Added a pinv alternative protection in spectra_savgol for the case where inv fails

Added Jan 20 2025

  • Added the ‘cca’ flag to the pls routine to calculate CCA between the Ts and each of the Ys (one by one), calculating loadings and scores equivalent to a perfectly orthogonalized OPLS model. The covariant scores (Tcv), the covariant loadings (Pcv), and the predictive weights (Wcv) are added as keys to the model object. [The covariant loadings (Pcv) are equivalent to the predictive loadings in OPLS.]

  • Added cca and cca-multi routines to perform PLS-CCA (an alternative to OPLS). As of now, cca-multi remains unused.

Added Nov 18th, 2024

  • replaced interp2d with RectBivariateSpline

  • Protected SPE lim calculations for near zero residuals

  • Added build_polynomial function to create linear regression models with variable selection assisted by PLS

by merge from James

  • Added spectra preprocessing methods

  • bootstrap PLS

by Salvador Garcia (sgarciam@ic.ac.uk salvadorgarciamunoz@gmail.com) Added Dec 19th 2023

  • phi.clean_htmls removes all html files in the working directory

  • clean_empty_rows returns also the names of the rows removed

Added May 1st

  • YMB is now added in the same structure as the XMB

  • Corrected the dimensionality of the lwpls prediction, it was a double-nested array.

Added Apr 30

  • Modified Multi-block PLS to include the block name in the variable name

Added Apr 29

  • Included the unique routine and adjusted the parse_materials routine so materials and lots are in the same order as in the raw data

Added Apr 27

  • Enhanced adapt_pls_4_pyomo to use variable names as indices if flag is sent

Added Apr 25

  • Enhanced the varimax_rotation to adjust the r2 and r2pv to the rotated loadings

Added Apr 21

  • Re-added varimax_rotation with complete model rotation for PCA and PLS

Added Apr 17

  • Added tpls and tpls_pred

Added Apr 15

  • Added jrpls model and jrpls_pred

  • Added routines to reconcile column identifiers with row identifiers so that the X and R matrices correspond correctly

  • Added routines to reconcile rows across a list of dataframes and produce a list of dataframes containing only those observations present in all dataframes

Added on Apr 9 2023

  • Added lpls and lpls_pred routines

  • Added parse_materials to read linear table and produce R or Ri

Release as of Nov 23 2022

  • Added a function to export PLS model to gPROMS code

Release as of Aug 22 2022

  • Fixed access to NEOS server and use of GAMS instead of IPOPT

Release as of Aug 12 2022

  • Fixed the SPE calculations in pls_pred and pca_pred

  • Changed to a more efficient inversion in pca_pred (=pls_pred)

  • Added a pseudo-inverse option in pmp for pca_pred

Release as of Aug 2 2022

  • Added replicate_data

pyphi.calc.ma57_dummy_check()[source]
pyphi.calc.add_auto_obs_id(df: DataFrame) DataFrame[source]

Add an Auto Obs ID column that makes duplicate first-column values unique.

Scans the first column of df for duplicate values. Unique values are copied as-is; duplicated values receive a zero-padded numeric suffix (e.g. A.1, A.01, A.001) whose width is determined by the total occurrence count for that specific ID:

  • fewer than 10 occurrences → no padding (A.1 to A.9)

  • 10–99 occurrences → one leading zero (A.01 to A.99)

  • 100–999 occurrences → two leading zeros (A.001 to A.999)

The new column is inserted at position 0; the original identifier column is preserved unchanged.

Parameters:

df (pd.DataFrame) – Input dataframe whose first column contains observation identifiers. Must have at least one column.

Returns:

df (pd.DataFrame): A copy of df with Auto Obs ID prepended as the first column.

df_classid (pd.DataFrame): A dataframe with Auto Obs ID as first column and the original column ID as second column.

Return type:

pd.DataFrame

Raises:

ValueError – If df is empty (no columns).

Example

>>> data = {"Batch ID": ["A", "A", "B"], "Temp": [10, 11, 22]}
>>> df = pd.DataFrame(data)
>>> add_auto_obs_id(df)["Auto Obs ID"].tolist()
['A.1', 'A.2', 'B']
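
The suffix-width rule described above can be sketched in plain Python. This is an illustration of the stated rule only, not pyPhi's implementation; `make_unique_ids` is a hypothetical helper name.

```python
from collections import Counter

def make_unique_ids(ids):
    """Append a zero-padded counter to every ID that occurs more than once."""
    totals = Counter(ids)   # total occurrences of each ID
    seen = Counter()        # occurrences encountered so far
    out = []
    for obs_id in ids:
        if totals[obs_id] == 1:
            out.append(str(obs_id))          # unique IDs are copied as-is
            continue
        seen[obs_id] += 1
        width = len(str(totals[obs_id]))     # 1 digit below 10, 2 digits for 10-99, ...
        out.append(f"{obs_id}.{seen[obs_id]:0{width}d}")
    return out

short = make_unique_ids(["A", "A", "B"])
long = make_unique_ids(["A"] * 10)
```

The padding width depends on the total count of that specific ID, so suffixes sort correctly as strings within each ID.
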
pyphi.calc.f99(i, j)[source]

Return the F-distribution critical value at 99% confidence.

Parameters:
  • i (int) – Numerator degrees of freedom.

  • j (int) – Denominator degrees of freedom.

Returns:

F critical value at alpha = 0.01.

Return type:

float

pyphi.calc.f95(i, j)[source]

Return the F-distribution critical value at 95% confidence.

Parameters:
  • i (int) – Numerator degrees of freedom.

  • j (int) – Denominator degrees of freedom.

Returns:

F critical value at alpha = 0.05.

Return type:

float
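
The changelog notes that the hardcoded F tables were replaced with scipy.stats. A minimal sketch of such a lookup is shown here, assuming the percent-point function of `scipy.stats.f` is what is used; `f_critical` is a hypothetical name, not part of pyphi.calc.

```python
from scipy import stats

def f_critical(dfn, dfd, confidence):
    """Critical value of the F distribution at the given confidence level."""
    # ppf is the inverse CDF, so confidence=0.95 corresponds to alpha=0.05.
    return float(stats.f.ppf(confidence, dfn, dfd))

f95_value = f_critical(2, 20, 0.95)
f99_value = f_critical(2, 20, 0.99)
```
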

pyphi.calc.spe_ci(spe)[source]

Estimate SPE control limits from training data using a chi-squared approximation.

Parameters:
  • spe_values (ndarray) – SPE values from the training set (n_obs × 1).

  • alpha (float) – Confidence level. Default 0.95 (also returns 99%).

Returns:

(lim95, lim99) — SPE control limits at 95% and 99%.

Return type:

tuple
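
A common way to build such a chi-squared approximation is to match the mean and variance of the training SPE to a scaled chi-squared distribution. The sketch below assumes that moment-matching form; it is not confirmed to be the exact calculation in spe_ci, and `spe_limits` is a hypothetical name.

```python
import numpy as np
from scipy import stats

def spe_limits(spe_values, levels=(0.95, 0.99)):
    """Approximate SPE limits as g * chi2(h), matching mean and variance."""
    spe_values = np.asarray(spe_values, dtype=float).ravel()
    m = spe_values.mean()
    v = spe_values.var(ddof=1)
    g = v / (2.0 * m)        # scale factor
    h = 2.0 * m**2 / v       # effective degrees of freedom (may be non-integer)
    return tuple(float(g * stats.chi2.ppf(level, h)) for level in levels)

lim95, lim99 = spe_limits([0.8, 1.1, 0.5, 1.9, 1.3, 0.7, 2.4, 1.0])
```

When the residuals are close to zero the variance `v` approaches zero as well, which is presumably why the changelog mentions protecting this calculation.
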

pyphi.calc.single_score_conf_int(t)[source]

Calculate confidence ellipse radii for score scatter plots.

Parameters:
  • mvmobj (dict) – Fitted PCA or PLS model.

  • alpha (float) – Confidence level. Default 0.95.

Returns:

Ellipse radii for each pair of scores.

Return type:

ndarray

pyphi.calc.scores_conf_int_calc(st, N)[source]

Calculate per-score univariate confidence intervals.

Parameters:
  • mvmobj (dict) – Fitted PCA or PLS model.

  • alpha (float) – Confidence level. Default 0.95.

Returns:

Confidence interval half-widths for each latent variable (A,).

Return type:

ndarray

pyphi.calc.clean_htmls()[source]

Deletes all .html files in the current directory.

pyphi.calc.z2n(X, X_nan_map)[source]

Replace zeros with NaN (zero to NaN).

Parameters:

X (np.ndarray) – Input array.

Returns:

Array with zeros replaced by np.nan.

Return type:

np.ndarray

pyphi.calc.n2z(X)[source]

Replace NaN with zero (NaN to zero).

Parameters:

X (np.ndarray) – Input array.

Returns:

(X_filled, nan_map) — array with NaNs replaced by 0, and a boolean mask where True indicates original non-NaN positions.

Return type:

tuple
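
The round trip between the two helpers can be sketched with NumPy. The function names below are hypothetical, and the mask convention (True marks an observed value) follows the description above rather than the library source.

```python
import numpy as np

def nan_to_zero(X):
    """Return a zero-filled copy of X and a mask of observed positions."""
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    return np.where(observed, X, 0.0), observed

def zero_to_nan(X_filled, observed):
    """Restore NaN only where the mask says the value was missing."""
    return np.where(observed, X_filled, np.nan)

X = np.array([[1.0, np.nan], [0.0, 4.0]])
X_filled, observed = nan_to_zero(X)
X_back = zero_to_nan(X_filled, observed)
```

Keeping the mask is what lets a genuine zero (row 2, column 1 here) survive the round trip instead of being turned into NaN.
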

pyphi.calc.mean(X)[source]
pyphi.calc.std(X)[source]
pyphi.calc.meancenterscale(X, *, mcs=True)[source]

Mean-center and/or scale a data matrix.

Parameters:
  • X (np.ndarray) – Data matrix to preprocess (n_obs × n_vars).

  • mcs (str or bool) – Preprocessing mode. 'autoscale': mean-center and scale to unit variance. 'center': mean-center only. False: return unchanged.

Returns:

(X_processed, x_mean, x_std) — preprocessed matrix, column means, and column standard deviations.

Return type:

tuple
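
As an illustration of the 'autoscale' mode, the sketch below mean-centers and scales each column with NumPy broadcasting while ignoring NaNs. The use of the sample standard deviation (`ddof=1`) is an assumption, and `autoscale` is a hypothetical name.

```python
import numpy as np

def autoscale(X):
    """Mean-center each column and scale it to unit standard deviation."""
    X = np.asarray(X, dtype=float)
    x_mean = np.nanmean(X, axis=0, keepdims=True)        # shape (1, n_vars)
    x_std = np.nanstd(X, axis=0, ddof=1, keepdims=True)  # shape (1, n_vars)
    # Broadcasting applies the (1, n_vars) rows to every observation.
    return (X - x_mean) / x_std, x_mean, x_std

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
X_scaled, x_mean, x_std = autoscale(X)
```
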

pyphi.calc.find(a, func)[source]

Find row indices where the first column equals a given value.

Parameters:
  • X (pd.DataFrame or np.ndarray) – Data matrix to search.

  • value – Value to search for in the first column.

Returns:

Row indices where the match was found.

Return type:

list

pyphi.calc.pca(X, A, *, mcs=True, md_algorithm='nipals', force_nipals=True, shush=False, cross_val=0)[source]

Fit a Principal Component Analysis (PCA) model.

Supports missing data via NIPALS. Can use SVD for complete data as well.

Parameters:
  • X (pd.DataFrame or np.ndarray) – Observations × variables matrix. If a DataFrame, the first column must contain observation IDs.

  • A (int) – Number of principal components to extract.

  • mcs (str or bool) – Mean-centering/scaling flag. 'autoscale' (default): mean-center and scale to unit variance. 'center': mean-center only. False: no preprocessing.

  • md_algorithm (str) – Missing-data algorithm. 'nipals' (default) or 'nlp'.

  • force_nipals (bool) – If True, forces NIPALS even when data is complete. Default True.

  • cross_val (int) – Cross-validation percentage of elements to remove per round. 0 = no CV, 100 = leave-one-out, 1–99 = element-wise removal. Default 0.

  • shush (bool) – If True, suppresses printed output. Default False.

  • tolerance (float) – NIPALS convergence tolerance. Default 1e-10.

  • maxit (int) – Maximum NIPALS iterations per component. Default 5000.

Returns:

Fitted PCA model with keys:

  • T (ndarray): Scores matrix (n_obs × A).

  • P (ndarray): Loadings matrix (n_vars × A).

  • r2x (float): Cumulative R² for X.

  • r2xpv (ndarray): Per-variable R² (n_vars × A).

  • mx (ndarray): Variable means used for preprocessing.

  • sx (ndarray): Variable std devs used for preprocessing.

  • var_t (ndarray): Score covariance matrix (A × A).

  • T2 (ndarray): Hotelling’s T² for training observations.

  • T2_lim95 (float): 95% T² control limit.

  • T2_lim99 (float): 99% T² control limit.

  • speX (ndarray): X-space SPE for training observations.

  • speX_lim95 (float): 95% SPE control limit.

  • speX_lim99 (float): 99% SPE control limit.

  • obsidX (list): Observation IDs (only if X was a DataFrame).

  • varidX (list): Variable IDs (only if X was a DataFrame).

  • q2x (float): Cross-validated Q² (only if cross_val > 0).

Return type:

dict

NLP algorithm for missing data as in:

de la Fuente, R.L.N., García‐Muñoz, S. and Biegler, L.T., 2010. An efficient nonlinear programming strategy for PCA models with incomplete data sets. Journal of Chemometrics, 24(6), pp.301-311.
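
For readers unfamiliar with NIPALS, the sketch below extracts components one at a time from complete, mean-centered data. It is a textbook version written for illustration, not the pca() implementation: it does not handle missing data, and the function name and starting-vector choice are assumptions.

```python
import numpy as np

def nipals_pca(X, n_components, tol=1e-10, max_iter=5000):
    """Mean-center X, then extract scores T and loadings P by NIPALS."""
    E = np.asarray(X, dtype=float)
    E = E - E.mean(axis=0)
    T = np.zeros((E.shape[0], n_components))
    P = np.zeros((E.shape[1], n_components))
    for a in range(n_components):
        # Start from the column with the largest variance.
        t = E[:, [int(np.argmax(E.var(axis=0)))]]
        for _ in range(max_iter):
            p = E.T @ t / (t.T @ t)          # regress columns of E on t
            p = p / np.linalg.norm(p)        # unit-length loading
            t_new = E @ p
            converged = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        T[:, [a]], P[:, [a]] = t, p
        E = E - t @ p.T                      # deflate before the next component
    return T, P

rng = np.random.default_rng(0)
X = (rng.normal(size=(30, 2)) * [5.0, 1.0]) @ rng.normal(size=(2, 4))
T, P = nipals_pca(X, 2)
singular_values = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
```

For complete data the score norms agree with the leading singular values of the centered matrix, which is why an SVD can be used as an alternative in that case.
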

pyphi.calc.pca_(X, A, *, mcs=True, md_algorithm='nipals', force_nipals=True, shush=False)[source]
pyphi.calc.pls_cca(pls_obj, Xmcs, Ymcs, not_Xmiss)[source]
pyphi.calc.pls(X, Y, A, *, mcsX=True, mcsY=True, md_algorithm='nipals', force_nipals=True, shush=False, cross_val=0, cross_val_X=False, cca=False)[source]

Fit a Partial Least Squares (PLS) regression model.

Supports missing data in both X and Y via NIPALS. Optionally computes CCA-based covariant components (equivalent to OPLS predictive space).

Parameters:
  • X (pd.DataFrame or np.ndarray) – Predictor matrix (n_obs × n_x). If a DataFrame, the first column must contain observation IDs.

  • Y (pd.DataFrame or np.ndarray) – Response matrix (n_obs × n_y). If a DataFrame, the first column must contain observation IDs. Observation IDs are reconciled with X automatically.

  • A (int) – Number of latent variables.

  • mcsX (str or bool) – Preprocessing flag for X. Can be 'autoscale', 'center', or False. Default 'autoscale'.

  • mcsY (str or bool) – Preprocessing flag for Y. Can be 'autoscale', 'center', or False. Default 'autoscale'.

  • md_algorithm (str) – Missing-data algorithm. 'nipals' (default) or 'nlp'.

  • force_nipals (bool) – Force NIPALS even for complete data. Default True.

  • cross_val (int) – Cross-validation level. 0 = none, 100 = LOO, 1–99 = element-wise. Default 0.

  • cross_val_X (bool) – Also cross-validate X-space. Default False.

  • shush (bool) – Suppress printed output. Default False.

  • tolerance (float) – NIPALS convergence tolerance. Default 1e-10.

  • maxit (int) – Max NIPALS iterations per component. Default 5000.

  • cca (bool) – If True, compute CCA-based covariant components and add Tcv, Pcv, Wcv to the model. Default False.

Returns:

Fitted PLS model with keys:

  • T (ndarray): X-scores (n_obs × A).

  • P (ndarray): X-loadings (n_x × A).

  • Q (ndarray): Y-loadings (n_y × A).

  • W (ndarray): X-weights (n_x × A).

  • Ws (ndarray): Rotated weights W*(P’W)⁻¹ (n_x × A).

  • r2x (float): Cumulative R² for X.

  • r2xpv (ndarray): Per-variable R² for X (n_x × A).

  • r2y (float): Cumulative R² for Y.

  • r2ypv (ndarray): Per-variable R² for Y (n_y × A).

  • mx, sx (ndarray): X preprocessing parameters.

  • my, sy (ndarray): Y preprocessing parameters.

  • var_t (ndarray): Score covariance matrix (A × A).

  • T2, T2_lim95, T2_lim99: Hotelling’s T² and limits.

  • speX, speX_lim95, speX_lim99: X-space SPE and limits.

  • speY, speY_lim95, speY_lim99: Y-space SPE and limits.

  • obsidX, varidX: IDs (only if X was a DataFrame).

  • obsidY, varidY: IDs (only if Y was a DataFrame).

  • q2x, q2y (float): Cross-validated Q² (if cross_val > 0).

  • Tcv, Pcv, Wcv: CCA covariant components (if cca=True).

Return type:

dict

NLP approach to missing data as in:

Puwakkatiya‐Kankanamage, E.H., García‐Muñoz, S. and Biegler, L.T., 2014. An optimization‐based undeflated PLS (OUPLS) method to handle missing data in the training set. Journal of Chemometrics, 28(7), pp.575-584.
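
The relationship between W, P and the rotated weights Ws listed above can be checked with a small textbook NIPALS PLS for complete data. This is a sketch for illustration only (hypothetical function name, no missing-data handling, centering only), not the pls() implementation.

```python
import numpy as np

def nipals_pls(X, Y, n_components, tol=1e-10, max_iter=5000):
    """Textbook NIPALS PLS on mean-centered, complete X and Y."""
    Ex = np.asarray(X, dtype=float) - np.mean(X, axis=0)
    Ey = np.asarray(Y, dtype=float) - np.mean(Y, axis=0)
    T, P, Q, W = [], [], [], []
    for _ in range(n_components):
        u = Ey[:, [0]]
        for _ in range(max_iter):
            w = Ex.T @ u
            w = w / np.linalg.norm(w)
            t = Ex @ w
            q = Ey.T @ t / (t.T @ t)
            u_new = Ey @ q / (q.T @ q)
            converged = np.linalg.norm(u_new - u) <= tol * np.linalg.norm(u_new)
            u = u_new
            if converged:
                break
        p = Ex.T @ t / (t.T @ t)
        Ex = Ex - t @ p.T                    # deflate X
        Ey = Ey - t @ q.T                    # deflate Y
        T.append(t); P.append(p); Q.append(q); W.append(w)
    T, P, Q, W = (np.hstack(m) for m in (T, P, Q, W))
    Ws = W @ np.linalg.inv(P.T @ W)          # rotated weights
    return T, P, Q, W, Ws

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 4))
Y = X @ rng.normal(size=(4, 2)) + 0.1 * rng.normal(size=(25, 2))
T, P, Q, W, Ws = nipals_pls(X, Y, n_components=2)
Xc = X - X.mean(axis=0)
```

The point of Ws is that scores follow directly from the preprocessed data as `Xc @ Ws`, without repeating the deflation steps.
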

pyphi.calc.pls_(X, Y, A, *, mcsX=True, mcsY=True, md_algorithm='nipals', force_nipals=True, shush=False, cca=False)[source]
pyphi.calc.hott2(mvmobj, *, Xnew=False, Tnew=False)[source]

Compute Hotelling’s T² statistic.

Parameters:
  • mvmobj (dict) – Fitted PCA or PLS model.

  • Xnew (pd.DataFrame or np.ndarray) – New X observations (optional). If provided, scores are computed internally before T² calculation.

  • Tnew (np.ndarray) – Pre-computed scores (optional). Used directly if provided; avoids redundant projection.

Returns:

T² value for each observation (n_obs,).

Return type:

ndarray

Note

If neither Xnew nor Tnew is provided, returns T² for the training set stored in mvmobj.
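
Given scores and the stored score covariance matrix var_t, Hotelling's T² is the quadratic form t·var_t⁻¹·tᵀ for each observation. The sketch below assumes var_t is computed as TᵀT/(n−1); that detail and the function name are assumptions, not taken from the library.

```python
import numpy as np

def hotelling_t2(T, var_t):
    """Row-wise quadratic form t @ inv(var_t) @ t."""
    T = np.asarray(T, dtype=float)
    return np.einsum("ia,ab,ib->i", T, np.linalg.inv(var_t), T)

rng = np.random.default_rng(0)
T_train = rng.normal(size=(50, 3))
var_t = T_train.T @ T_train / (T_train.shape[0] - 1)
t2 = hotelling_t2(T_train, var_t)
```
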

pyphi.calc.pca_pred(Xnew, pcaobj, *, algorithm='p2mp')[source]

Project new observations onto a fitted PCA model.

Parameters:
  • Xnew (pd.DataFrame or np.ndarray) – New observations to project. Variables must match those used to train pcaobj.

  • pcaobj (dict) – Fitted PCA model from pca().

  • algorithm (str) – Projection algorithm. 'p2mp' (default) handles missing data; 'standard' uses direct matrix multiplication and requires complete data.

Returns:

Prediction results with keys:

  • Tnew (ndarray): Projected scores (n_new × A).

  • Xhat (ndarray): Reconstructed X in original scale.

  • speX (ndarray): SPE for each new observation.

  • T2 (ndarray): Hotelling’s T² for each new observation.

Return type:

dict
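
One common way to project an observation with missing values is to fit the scores by least squares using only the observed variables. The sketch below shows that idea; whether the 'p2mp' option does exactly this is an assumption, and `project_with_missing` is a hypothetical name.

```python
import numpy as np

def project_with_missing(x_row, P):
    """Least-squares scores for one preprocessed row, skipping NaN entries."""
    x_row = np.asarray(x_row, dtype=float)
    observed = ~np.isnan(x_row)
    # Solve P_obs @ t ≈ x_obs using only the rows of P for observed variables.
    t, *_ = np.linalg.lstsq(P[observed, :], x_row[observed], rcond=None)
    return t

rng = np.random.default_rng(1)
P, _ = np.linalg.qr(rng.normal(size=(5, 2)))   # orthonormal loadings (5 vars, 2 PCs)
t_true = np.array([1.5, -0.7])
x_new = P @ t_true                              # lies exactly on the model plane
x_new[1] = np.nan                               # hide one variable
t_est = project_with_missing(x_new, P)
```

With complete data and orthonormal loadings this reduces to the direct product `x @ P`, which corresponds to the 'standard' option described above.
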

pyphi.calc.pls_pred(Xnew, plsobj)[source]

Predict Y for new observations using a fitted PLS model.

Parameters:
  • Xnew (pd.DataFrame or np.ndarray) – New predictor observations. Variables must match those used to train plsobj.

  • plsobj (dict) – Fitted PLS model from pls().

  • algorithm (str) – Projection algorithm. 'p2mp' (default) handles missing data; 'standard' requires complete data.

Returns:

Prediction results with keys:

  • Tnew (ndarray): X-scores for new observations (n_new × A).

  • Yhat (ndarray): Predicted Y in original scale (n_new × n_y).

  • Xhat (ndarray): Reconstructed X in original scale.

  • speX (ndarray): X-space SPE for each new observation.

  • T2 (ndarray): Hotelling’s T² for each new observation.

  • Tcv (ndarray): CCA covariant scores (only if model has Wcv).

Return type:

dict

pyphi.calc.spe(mvmobj, Xnew, *, Ynew=False)[source]

Compute Squared Prediction Error (SPE / Q statistic).

Parameters:
  • mvmobj (dict) – Fitted PCA or PLS model.

  • Xnew (pd.DataFrame or np.ndarray) – New X observations.

  • Ynew (pd.DataFrame or np.ndarray) – New Y observations (optional). Only used for PLS models to also return Y-space SPE.

Returns:

  • If Ynew is not provided (or model is PCA): returns speX (ndarray, shape n_obs × 1).

  • If Ynew is provided and model is PLS: returns (speX, speY) tuple of arrays.

Return type:

ndarray or tuple
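
SPE is the row-wise sum of squared residuals after reconstruction from the scores. A minimal sketch for preprocessed, complete data follows; `squared_prediction_error` is a hypothetical name and the example assumes orthonormal loadings.

```python
import numpy as np

def squared_prediction_error(X_scaled, T, P):
    """Row-wise sum of squared residuals, returned with shape (n_obs, 1)."""
    residuals = X_scaled - T @ P.T
    return np.sum(residuals**2, axis=1, keepdims=True)

rng = np.random.default_rng(2)
P, _ = np.linalg.qr(rng.normal(size=(4, 2)))   # orthonormal loadings
X_scaled = rng.normal(size=(6, 4))
T = X_scaled @ P
spe_x = squared_prediction_error(X_scaled, T, P)
```
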

pyphi.calc.lwpls(xnew, loc_par, mvmobj, X, Y, *, shush=False)[source]

Locally Weighted PLS (LWPLS) prediction for a single new observation.

Per Kim et al. Int. J. Pharmaceutics 421 (2011) 269–274.

Parameters:
  • xnew (np.ndarray or pd.DataFrame) – Single new observation (1 × n_x).

  • loc_par (float) – Locality parameter controlling the width of the Gaussian kernel. Larger values include more training observations.

  • mvmobj (dict) – Global PLS model from pls(), used to define the score space for distance calculation.

  • X (pd.DataFrame or np.ndarray) – Training X data.

  • Y (pd.DataFrame or np.ndarray) – Training Y data.

  • shush (bool) – Suppress printed output. Default False.

Returns:

Prediction results with keys:

  • Yhat (ndarray): Locally predicted Y (1 × n_y).

  • weights (ndarray): Observation weights used in local model.

Return type:

dict
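
To illustrate how a locality parameter shapes the observation weights, the sketch below uses a distance-decay kernel on score-space distances. The exact kernel used by lwpls is not shown on this page, so the functional form here is an assumption and `locality_weights` is a hypothetical name.

```python
import numpy as np

def locality_weights(t_new, T_train, loc_par):
    """Weights that decay with score-space distance from the new observation."""
    d = np.linalg.norm(np.asarray(T_train, dtype=float) - t_new, axis=1)
    # Distances are scaled by their own spread; a larger loc_par flattens the decay.
    return np.exp(-d / (np.std(d, ddof=1) * loc_par))

T_train = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
t_new = np.array([0.0, 0.0])
narrow = locality_weights(t_new, T_train, loc_par=1.0)
wide = locality_weights(t_new, T_train, loc_par=5.0)
```

Increasing `loc_par` raises the weight of distant observations, consistent with the description that larger values include more of the training set.
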

pyphi.calc.contributions(mvmobj, X, cont_type, *, Y=False, from_obs=False, to_obs=False, lv_space=False)[source]

Compute variable contributions to monitoring statistics.

Parameters:
  • mvmobj (dict) – Fitted PCA or PLS model.

  • Xnew (pd.DataFrame or np.ndarray) – Observations to diagnose.

  • cont_type (str) – Type of contribution to compute. 'scores': contribution to each score. 'spex': contribution to X-space SPE. 'spey': contribution to Y-space SPE (PLS only). 't2': contribution to Hotelling’s T².

  • Ynew (pd.DataFrame or np.ndarray) – Y observations (optional, required for cont_type='spey').

Returns:

Contribution values (n_obs × n_vars).

Return type:

ndarray

Ref: Miller, P., Swanson, R.E. and Heckler, C.E., 1998. Contribution plots: a missing link in multivariate quality control. Applied Mathematics and Computer Science, 8(4), pp.775-792.

pyphi.calc.clean_empty_rows(X, *, shush=False)[source]

Remove rows that are entirely NaN.

Parameters:
  • X (pd.DataFrame or np.ndarray) – Input data matrix.

  • shush (bool) – Suppress printed output. Default False.

Returns:

Data with fully empty rows removed.

Return type:

pd.DataFrame or np.ndarray

pyphi.calc.clean_low_variances(X, *, shush=False, min_var=1e-10)[source]

Remove columns with variance below a threshold.

Parameters:
  • X (pd.DataFrame or np.ndarray) – Input data matrix.

  • min_var (float) – Minimum acceptable variance. Default 1e-10.

  • shush (bool) – Suppress printed output. Default False.

Returns:

Data with low-variance columns removed.

Return type:

pd.DataFrame or np.ndarray

pyphi.calc.spectra_snv(x)[source]

Apply Standard Normal Variate (SNV) correction to spectra.

Each spectrum (row) is mean-centered and scaled by its own standard deviation. Removes multiplicative scatter effects.

Parameters:

X (pd.DataFrame or np.ndarray) – Spectra matrix (n_samples × n_wavelengths). If a DataFrame, the first column must contain sample IDs.

Returns:

SNV-corrected spectra (same type as input).

Return type:

pd.DataFrame or np.ndarray
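
The row-wise operation described above can be sketched in a few lines of NumPy for a plain array input. The function name is hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption.

```python
import numpy as np

def snv(spectra):
    """Center and scale each spectrum (row) by its own mean and std."""
    spectra = np.asarray(spectra, dtype=float)
    row_mean = spectra.mean(axis=1, keepdims=True)
    row_std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - row_mean) / row_std

corrected = snv([[1.0, 2.0, 3.0], [10.0, 20.0, 60.0]])
```
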

pyphi.calc.spectra_savgol(ws, od, op, Dm)[source]

Apply Savitzky-Golay smoothing and/or differentiation to spectra.

Parameters:
  • X (pd.DataFrame or np.ndarray) – Spectra matrix (n_samples × n_wavelengths). If a DataFrame, the first column must contain sample IDs.

  • window (int) – Window length (must be odd and greater than poly).

  • poly (int) – Polynomial order for the filter.

  • deriv (int) – Derivative order. 0 = smoothing only, 1 = first derivative, 2 = second derivative.

Returns:

Filtered spectra (same type as input).

Return type:

pd.DataFrame or np.ndarray

pyphi.calc.spectra_mean_center(Dm)[source]

Mean-center each wavelength across the sample set.

Parameters:

X (pd.DataFrame or np.ndarray) – Spectra matrix (n_samples × n_wavelengths).

Returns:

Mean-centered spectra.

Return type:

pd.DataFrame or np.ndarray

pyphi.calc.spectra_autoscale(Dm)[source]

Autoscale spectra (mean-center and scale each wavelength to unit variance).

Parameters:

X (pd.DataFrame or np.ndarray) – Spectra matrix (n_samples × n_wavelengths).

Returns:

Autoscaled spectra.

Return type:

pd.DataFrame or np.ndarray

pyphi.calc.spectra_baseline_correction(Dm)[source]

Apply piecewise linear baseline correction to spectra.

Parameters:
  • X (pd.DataFrame or np.ndarray) – Spectra matrix (n_samples × n_wavelengths). If a DataFrame, the first column must contain sample IDs.

  • anchor_points (list of int) – Column indices to use as baseline anchor points for the piecewise linear interpolation.

Returns:

Baseline-corrected spectra.

Return type:

pd.DataFrame or np.ndarray

pyphi.calc.spectra_msc(Dm, reference_spectra=None)[source]

Apply Multiplicative Scatter Correction (MSC) to spectra.

Parameters:
  • X (pd.DataFrame or np.ndarray) – Spectra matrix (n_samples × n_wavelengths). If a DataFrame, the first column must contain sample IDs.

  • reference (np.ndarray) – Reference spectrum to correct against. Defaults to the mean spectrum of X.

Returns:

MSC-corrected spectra (same type as input).

Return type:

pd.DataFrame or np.ndarray
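
MSC is usually described as regressing each spectrum on the reference and then removing the fitted offset and slope. The sketch below follows that textbook description for a plain array; it is not the library code, and `msc` here is a stand-alone illustrative function.

```python
import numpy as np

def msc(spectra, reference=None):
    """Remove per-spectrum offset and slope relative to a reference spectrum."""
    spectra = np.asarray(spectra, dtype=float)
    if reference is None:
        reference = spectra.mean(axis=0)       # default: mean spectrum
    corrected = np.empty_like(spectra)
    for i, row in enumerate(spectra):
        # Fit row ≈ intercept + slope * reference.
        slope, intercept = np.polyfit(reference, row, deg=1)
        corrected[i] = (row - intercept) / slope
    return corrected

reference = np.array([1.0, 2.0, 4.0, 3.0])
spectra = np.vstack([0.5 + 2.0 * reference, -1.0 + 0.5 * reference])
corrected = msc(spectra, reference)
```

Both synthetic spectra are exact affine transforms of the reference, so the correction maps each of them back onto it.
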

pyphi.calc.bootstrap_pls(X, Y, num_latents, num_samples, **kwargs)[source]

Estimate PLS loading uncertainty via bootstrap resampling.

Parameters:
  • X (pd.DataFrame or np.ndarray) – Training X data.

  • Y (pd.DataFrame or np.ndarray) – Training Y data.

  • A (int) – Number of latent variables.

  • n_boots (int) – Number of bootstrap iterations.

  • mcs (tuple) – Preprocessing flags. Default ('autoscale', 'autoscale').

  • shush (bool) – Suppress per-iteration output. Default True.

Returns:

Bootstrap results with keys:

  • W_boot (ndarray): Bootstrap distribution of W (n_boots × n_x × A).

  • Q_boot (ndarray): Bootstrap distribution of Q (n_boots × n_y × A).

  • W_mean, W_std: Mean and std of bootstrap W.

  • Q_mean, Q_std: Mean and std of bootstrap Q.

Return type:

dict

pyphi.calc.bootstrap_pls_pred(X_new, bootstrap_pls_obj, quantiles=[0.025, 0.975])[source]

Predict Y with uncertainty estimates using a bootstrap PLS ensemble.

Parameters:
  • Xnew (pd.DataFrame or np.ndarray) – New X observations to predict.

  • boot_obj (dict) – Bootstrap model from bootstrap_pls().

  • alpha (float) – Confidence level for prediction intervals. Default 0.95.

Returns:

Prediction results with keys:

  • Yhat (ndarray): Mean predicted Y (n_new × n_y).

  • Yhat_lb (ndarray): Lower bound of prediction interval.

  • Yhat_ub (ndarray): Upper bound of prediction interval.

  • Yhat_std (ndarray): Std dev of bootstrap predictions.

Return type:

dict
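
Given one prediction per bootstrap model, the returned summary can be sketched as a mean, a standard deviation, and two quantiles taken across the ensemble. The array layout (n_boots × n_new × n_y) and the function name are assumptions made for this illustration.

```python
import numpy as np

def summarize_ensemble(predictions, quantiles=(0.025, 0.975)):
    """Summarize stacked bootstrap predictions of shape (n_boots, n_new, n_y)."""
    predictions = np.asarray(predictions, dtype=float)
    lower, upper = np.quantile(predictions, quantiles, axis=0)
    return {
        "Yhat": predictions.mean(axis=0),
        "Yhat_lb": lower,
        "Yhat_ub": upper,
        "Yhat_std": predictions.std(axis=0, ddof=1),
    }

rng = np.random.default_rng(3)
summary = summarize_ensemble(rng.normal(size=(200, 3, 2)))
```
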

pyphi.calc.np2D2pyomo(arr, *, varids=False)[source]

Convert a 2D NumPy array to a Pyomo-compatible dictionary.

Parameters:

data (np.ndarray) – 2D array to convert.

Returns:

Dictionary keyed by (i, j) integer index tuples.

Return type:

dict
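
A dictionary of this shape can be built with a comprehension. The sketch below assumes 1-based integer indices, which is a common convention for Pyomo sets but is not confirmed by this page; `array2d_to_indexed_dict` is a hypothetical name.

```python
import numpy as np

def array2d_to_indexed_dict(arr, start=1):
    """Map each element to a (row, column) key; indices begin at `start`."""
    arr = np.asarray(arr)
    return {
        (i + start, j + start): arr[i, j].item()   # .item() yields a plain Python scalar
        for i in range(arr.shape[0])
        for j in range(arr.shape[1])
    }

indexed = array2d_to_indexed_dict(np.array([[1.0, 2.0], [3.0, 4.0]]))
```
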

pyphi.calc.np1D2pyomo(arr, *, indexes=False)[source]

Convert a 1D NumPy array to a Pyomo-compatible dictionary.

Parameters:

data (np.ndarray) – 1D array to convert.

Returns:

Dictionary keyed by integer index.

Return type:

dict

pyphi.calc.adapt_pls_4_pyomo(plsobj, *, use_var_ids=False)[source]

Convert PLS model arrays to Pyomo-compatible dictionaries.

Transforms P, Q, W, Ws, mx, sx, my, sy into the indexed dict format required by Pyomo Param objects.

Parameters:

plsobj (dict) – Fitted PLS model from pls().

Returns:

Model parameters as Pyomo-indexed dictionaries.

Return type:

dict

pyphi.calc.prep_pca_4_MDbyNLP(pcaobj, X)[source]

Prepare a PCA model for missing-data imputation by NLP.

Extracts and formats the loadings and preprocessing parameters needed to set up a Pyomo optimization problem for MD imputation.

Parameters:

pcaobj (dict) – Fitted PCA model from pca().

Returns:

Parameters formatted for use in a Pyomo MD-by-NLP formulation.

Return type:

dict

pyphi.calc.prep_pls_4_MDbyNLP(plsobj, X, Y)[source]

Prepare a PLS model for missing-data imputation by NLP.

Parameters:

plsobj (dict) – Fitted PLS model from pls().

Returns:

Parameters formatted for use in a Pyomo MD-by-NLP formulation.

Return type:

dict

pyphi.calc.conv_pls_2_eiot(plsobj, *, r_length=False)[source]

Convert a PLS model for use in EIOT (Extended Iterative Optimization Technology).

Parameters:
  • plsobj (dict) – Fitted PLS model from pls().

  • r2y_threshold (float) – Minimum cumulative R²Y to determine the number of LVs to retain. Default 0.95.

Returns:

EIOT-compatible model parameters.

Return type:

dict

pyphi.calc.cat_2_matrix(X)[source]

Convert a categorical variable column to a binary indicator matrix.

Parameters:
  • x (pd.DataFrame) – Data frame with columns of categorical data. The first column is the variable ID.

  • shush (bool) – Suppress printed output. Default False.

Returns:

x_binary (pd.DataFrame): Binary indicator matrix with one column per unique category (same type as input), with all categories concatenated.

xmb (pd.DataFrame): Binary indicator matrix with one column per unique category (same type as input), with categories organized by block for multi-block models (if the DataFrame has multiple columns).

Return type:

pd.DataFrame

pyphi.calc.mbpls(XMB, YMB, A, *, mcsX=True, mcsY=True, md_algorithm_='nipals', force_nipals_=False, shush_=False, cross_val_=0, cross_val_X_=False, cca=False)[source]

Fit a Multi-Block PLS (MBPLS) model.

Parameters:
  • Xmb (dict) – Dictionary of X blocks {'block_name': pd.DataFrame}. Each DataFrame’s first column must contain observation IDs.

  • Y (pd.DataFrame or np.ndarray) – Response matrix. First column is observation IDs if a DataFrame.

  • A (int) – Number of latent variables.

  • mcs (tuple) – Preprocessing flags (mcs_X, mcs_Y). Default ('autoscale', 'autoscale').

  • shush (bool) – Suppress printed output. Default False.

  • cross_val (int) – Cross-validation level (same as pls()).

  • cross_val_X (bool) – Cross-validate X-space. Default False.

Returns:

Fitted MBPLS model, extending the standard PLS model dict with per-block keys:

  • T (ndarray): Super-scores.

  • Tb (dict): Per-block scores keyed by block name.

  • Pb (dict): Per-block loadings.

  • Wb (dict): Per-block weights.

  • r2xb (dict): Per-block R² contributions.

  • block_importance (ndarray): Variance importance per block.

Plus all standard PLS keys (Q, r2y, speX, etc.).

Return type:

dict

pyphi.calc.replicate_data(mvm_obj, X, num_replicates, *, as_set=False, rep_Y=False, Y=False)[source]

Augment a dataset by adding small noise replicates.

Useful for regularizing models when training data is limited.

Parameters:
  • X (pd.DataFrame or np.ndarray) – Original data matrix.

  • n_reps (int) – Number of noisy replicates to add. Default 2.

  • noise_level (float) – Standard deviation of additive Gaussian noise relative to each variable’s std dev. Default 0.01.

Returns:

Augmented matrix with original + replicated rows (same type as input).

Return type:

pd.DataFrame or np.ndarray

pyphi.calc.export_2_gproms(mvmobj, *, fname='phi_export.txt')[source]

Export PLS model to gPROMS syntax.

pyphi.calc.unique(df, colid)[source]

Return unique values preserving original order.

Parameters:

x (list or np.ndarray) – Input sequence.

Returns:

Unique values in the order they first appear.

Return type:

list
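
Order-preserving uniqueness can be sketched in plain Python, because dictionaries keep insertion order. This is an illustration of the behaviour described above for a generic sequence, with a hypothetical function name.

```python
def unique_in_order(values):
    """Drop repeated values while keeping the order of first appearance."""
    return list(dict.fromkeys(values))

lots = unique_in_order(["B", "A", "B", "C", "A"])
```

This differs from a sorted unique (such as `numpy.unique`), which would return the values in sorted order rather than in the order of the raw data.
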

pyphi.calc.parse_materials(filename, sheetname)[source]

Build JR matrices for JRPLS from a linear materials table in Excel.

Reads a structured Excel sheet describing the composition of finished product lots in terms of their constituent material lots and quantities (or ratios). Validates that every finished product lot has a material lot assigned for each material, then constructs one JR matrix per material.

Each JR matrix is a DataFrame with one row per finished product lot and one column per material lot for that material. The cell values are the ratio or quantity of that material lot used in that finished product lot (0 if not used).

The input Excel sheet must contain the following columns (in any order):

  • Finished Product Lot: identifier for the finished product lot.

  • Material Lot: identifier for the specific lot of the raw material.

  • Ratio or Quantity: numeric contribution of this material lot to the finished product lot (e.g. mass fraction or absolute quantity).

  • Material: name or identifier for the material type.

Note

A summary of the ratio/quantity sum is printed for each finished product lot as a simple consistency check. The function prints diagnostic messages to stdout and returns (False, False) if any material lot assignment is missing.

Parameters:
  • filename (str) – Path to the Excel workbook containing the materials table.

  • sheetname (str) – Name of the worksheet within the workbook to read.

Returns:

A two-element tuple (JR, materials_used):

  • JR (list of pandas.DataFrame): One DataFrame per material (in the same order as materials_used). Each DataFrame has shape (n_fp_lots, n_material_lots + 1), where the first column FPLot holds the finished product lot identifiers and the remaining columns are named after the material lots.

  • materials_used (list of str): Ordered list of unique material names found in the sheet, corresponding to the entries in JR.

If validation fails, both elements are False.

Return type:

tuple

Raises:
  • FileNotFoundError – If filename does not point to an existing file.

  • ValueError – If sheetname is not found in the workbook.

Example

>>> JR, materials = parse_materials("batch_records.xlsx", "Blends")
Lot :Lot_A ratio/qty adds to 1.0
Lot :Lot_B ratio/qty adds to 1.0
>>> len(JR)          # one matrix per material
3
>>> JR[0].columns.tolist()
['FPLot', 'MatLot_1', 'MatLot_2']
>>> materials
['Excipient_A', 'Excipient_B', 'API']
pyphi.calc.isin_ordered_col0(df, alist)[source]
pyphi.calc.reconcile_rows(df_list)[source]

Align two DataFrames by their observation IDs (first column).

Reorders Y to match the row order of X. Observations present in one but not the other are dropped, with a warning printed.

Parameters:
  • X (pd.DataFrame) – Reference DataFrame. First column is observation IDs.

  • Y (pd.DataFrame) – DataFrame to align. First column is observation IDs.

Returns:

(X_aligned, Y_aligned) — DataFrames sharing the same ordered set of observation IDs.

Return type:

tuple
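
The reconciliation step can be sketched with pandas: keep only the IDs present in every frame, ordered as in the first one. The sketch assumes observation IDs are unique within each frame; `keep_common_rows` is a hypothetical name and not the library routine.

```python
import pandas as pd

def keep_common_rows(frames):
    """Keep observations whose ID (first column) appears in every DataFrame."""
    common = set(frames[0].iloc[:, 0])
    for df in frames[1:]:
        common &= set(df.iloc[:, 0])
    # Follow the row order of the first frame.
    order = [obs for obs in frames[0].iloc[:, 0] if obs in common]
    aligned = []
    for df in frames:
        indexed = df.set_index(df.columns[0], drop=False)
        aligned.append(indexed.loc[order].reset_index(drop=True))
    return aligned

X = pd.DataFrame({"ID": ["a", "b", "c"], "x": [1, 2, 3]})
Y = pd.DataFrame({"ID": ["c", "a", "d"], "y": [30, 10, 40]})
X_aligned, Y_aligned = keep_common_rows([X, Y])
```
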

pyphi.calc.reconcile_rows_to_columns(df_list_r, df_list_c)[source]

Map DataFrame rows to the columns of another DataFrame.

Used in L-shaped data structures where material lot IDs appear as column headers in X and as row IDs in R.

Parameters:
  • X (pd.DataFrame) – Process data where columns (after the first) correspond to lot IDs.

  • R (pd.DataFrame) – Material property data where the first column contains lot IDs.

Returns:

(X_matched, R_matched) — aligned matrices ready for LPLS.

Return type:

tuple

pyphi.calc.lpls(X, R, Y, A, *, shush=False)[source]

Fit an L-shaped PLS (LPLS) model.

Models the relationship between lot physical properties (R), process observations (X), and product quality (Y), where X rows correspond to lots described by R columns.

Per Muteki et al., Chemom. Intell. Lab. Syst. 85 (2007) 186–194.

Parameters:
  • X (pd.DataFrame or np.ndarray) – Process data matrix (n_obs × n_x). First column is observation IDs if a DataFrame.

  • R (pd.DataFrame or np.ndarray) – Raw material property matrix (n_lots × n_r). Columns of X map to rows of R.

  • Y (pd.DataFrame or np.ndarray) – Quality/response matrix (n_lots × n_y). Rows match rows of R.

  • A (int) – Number of latent variables.

  • shush (bool) – Suppress printed output. Default False.

Returns:

Fitted LPLS model with keys:

  • T (ndarray): X-space scores (n_obs × A).

  • P (ndarray): X-loadings (n_x × A).

  • Q (ndarray): Y-loadings (n_y × A).

  • H (ndarray): R-space scores (n_lots × A).

  • V (ndarray): R-space loadings (n_r × A).

  • Rscores (ndarray): R projected scores.

  • Ss (ndarray): Rotated R weights S*(V’S)⁻¹.

  • r2x, r2xpv: R² for X space.

  • r2y, r2ypv: R² for Y space.

  • r2r, r2rpv: R² for R space.

  • mx, sx, my, sy, mr, sr: Preprocessing params.

  • var_t: Score covariance matrix.

  • T2, T2_lim95, T2_lim99: Hotelling’s T² and limits.

  • speX, speX_lim95, speX_lim99: X SPE and limits.

  • speY, speY_lim95, speY_lim99: Y SPE and limits.

  • speR, speR_lim95, speR_lim99: R SPE and limits.

Return type:

dict
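How T2 relates to T and var_t can be sketched as follows. This assumes var_t is the covariance of the training scores computed as T'T / n_obs; the score matrix here is random stand-in data, not output from lpls().

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.normal(size=(50, 2))      # stand-in score matrix (n_obs x A)
T = T - T.mean(axis=0)            # scores of centred data have zero mean

# Assumed definition of the score covariance matrix.
var_t = (T.T @ T) / T.shape[0]

# Hotelling's T2 for each row i: t_i' * inv(var_t) * t_i
T2 = np.einsum("ij,jk,ik->i", T, np.linalg.inv(var_t), T)
print(T2.shape, T2.mean())
```

With this definition of var_t the mean of T2 over the training rows equals the number of latent variables A, which is a quick sanity check on a fitted model.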

pyphi.calc.lpls_pred(rnew, lpls_obj)[source]

Predict Y for new lot(s) using a fitted LPLS model.

Parameters:
  • rnew (np.ndarray or pd.DataFrame) – R-space observation(s) for new lot(s). Variables must match those in lpls_obj.

  • lpls_obj (dict) – Fitted LPLS model from lpls().

Returns:

Prediction results with keys:

  • Tnew (ndarray): Projected scores (n_new × A).

  • Yhat (ndarray): Predicted Y in original scale.

  • speR (ndarray): R-space SPE for each new lot.

Return type:

dict

pyphi.calc.jrpls(Xi, Ri, Y, A, *, shush=False)[source]

Fit a Joint R-LPLS (JRPLS) model across multiple campaigns.

Extends LPLS to handle multiple manufacturing campaigns, each with its own X (process) and R (raw material) block, all sharing a common Y.

Per Garcia-Munoz, Chemom. Intell. Lab. Syst. 133 (2014) 49–62.

Parameters:
  • Xi (dict) – Raw material property blocks {'Material Type': pd.DataFrame}. Each DataFrame’s first column is observation IDs.

  • Ri (dict) – Blending matrices {'Material Type': pd.DataFrame}. Keys must match Xi. First column is lot IDs.

  • Y (pd.DataFrame or np.ndarray) – Shared response matrix. Rows match lots across all campaigns.

  • A (int) – Number of latent variables.

  • shush (bool) – Suppress printed output. Default False.

Returns:

Fitted JRPLS model with per-campaign sub-dicts and shared keys.

Structure mirrors lpls() output but indexed by campaign.

Return type:

dict

pyphi.calc.jrpls_pred(rnew, jrplsobj)[source]

Predict Y for a new observation using a fitted JRPLS model.

Parameters:
  • rnew (dict) – Blend specification for the new observation, keyed by material type. Each value is a list of (lot ID, fraction) tuples, as in the example below.

  • jrplsobj (dict) – Fitted JRPLS model from jrpls().

Returns:

Prediction results with keys:

  • Tnew (ndarray): Projected X-scores.

  • Yhat (ndarray): Predicted Y in original scale.

  • speX (ndarray): X-space SPE.

  • speR (ndarray): R-space SPE.

  • T2 (ndarray): Hotelling’s T².

Return type:

dict

Example

    rnew = {
        'MAT1': [('A0129', 0.557949425), ('A0130', 0.442050575)],
        'MAT2': [('Lac0003', 1)],
        'MAT3': [('TLC018', 1)],
        'MAT4': [('M0012', 1)],
        'MAT5': [('CS0017', 1)],
    }
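A quick check of such a dictionary before calling the prediction routine might look like the sketch below. Reading the numbers as blend fractions that sum to 1 within each material type is an assumption, and fractions_by_material is a hypothetical helper, not part of pyPhi.

```python
import math

# Same shape as the example above, shortened to two material types.
rnew = {
    "MAT1": [("A0129", 0.557949425), ("A0130", 0.442050575)],
    "MAT2": [("Lac0003", 1)],
}

def fractions_by_material(spec):
    """Sum the fractions listed for each material type."""
    return {mat: sum(frac for _, frac in lots) for mat, lots in spec.items()}

totals = fractions_by_material(rnew)

# Assumption: the fractions of each material type should add up to 1.
bad = [mat for mat, total in totals.items() if not math.isclose(total, 1.0)]
print(bad)
```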

pyphi.calc.tpls(Xi, Ri, Z, Y, A, *, shush=False)[source]

Fit a TPLS model.

Models relationships between time-varying process trajectories (Z), raw material properties (R), and product quality (Y).

Parameters:
  • Xi (dict) – Material property blocks {'Material Type': pd.DataFrame}. Each DataFrame’s first column is observation IDs.

  • Ri (dict) – Blending information blocks {'Material Type': pd.DataFrame}. Keys must match Xi. First column is lot IDs.

  • Z (pd.DataFrame or np.ndarray) – Process data matrix. First column is observation IDs if a DataFrame.

  • Y (pd.DataFrame or np.ndarray) – Shared response matrix. Rows match lots across all campaigns.

  • A (int) – Number of latent variables.

  • shush (bool) – Suppress printed output. Default False.

Returns:

Fitted TPLS model. Keys mirror jrpls(), with an additional Ws (ndarray): rotated weight matrix for Z-space.

Return type:

dict

pyphi.calc.jypls(Xi, Yi, A, *, shush=False)[source]

Fit a Joint-Y PLS (JYPLS) model across multiple campaigns.

Each campaign has its own X block (different variables allowed), but all campaigns share a common Y column space and a jointly estimated Q matrix.

Per Garcia-Munoz, MacGregor, Kourti, Chemom. Intell. Lab. Syst. 79 (2005) 101–114.

Parameters:
  • Xi (dict) – Predictor blocks {'campaign_name': pd.DataFrame}. Each X can have a different number of columns. First column of each DataFrame is observation IDs.

  • Yi (dict) – Response blocks {'campaign_name': pd.DataFrame}. Keys must match Xi. All Y blocks must have identical columns (same Y variable space across campaigns). First column of each DataFrame is observation IDs.

  • A (int) – Number of latent variables.

  • shush (bool) – Suppress printed output. Default False.

Returns:

Fitted JYPLS model with keys:

  • Q (ndarray): Shared Y-loadings (n_y × A).

  • T (dict): Per-campaign X-scores.

  • P (dict): Per-campaign X-loadings.

  • W (dict): Per-campaign X-weights.

  • Ws (dict): Per-campaign rotated weights W*(P’W)⁻¹.

  • r2xi (dict): Per-campaign R² for X.

  • r2yi (dict): Per-campaign R² for Y.

  • r2y (float): Overall R² for Y.

  • mx, sx (dict): Per-campaign X preprocessing params.

  • my, sy (ndarray): Shared Y preprocessing params.

  • blk_scale (dict): Per-campaign block scaling factors.

  • var_t (ndarray): Pooled score covariance matrix.

  • campaigns (list): Ordered list of campaign names.

Return type:

dict
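The rotated weights W*(P’W)⁻¹ listed above can be computed as in this sketch. W and P are random stand-ins for one campaign's weights and loadings, not values produced by jypls().

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, A = 6, 2
W = rng.normal(size=(n_x, A))   # stand-in X-weights
P = rng.normal(size=(n_x, A))   # stand-in X-loadings

# Ws = W (P'W)^-1, written as a linear solve to avoid an explicit inverse.
Ws = np.linalg.solve((P.T @ W).T, W.T).T

print(np.round(P.T @ Ws, 6))
```

By construction P'Ws is the identity matrix, so scores obtained as X @ Ws are consistent with the loadings P.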

pyphi.calc.jypls_pred(xnew, campaign, jypls_obj)[source]

Predict Y for a new observation using a fitted JYPLS model.

Parameters:
  • xnew (pd.DataFrame or np.ndarray) – New X observation(s). Variables must match those of the specified campaign.

  • campaign (str) – Campaign name this observation belongs to. Must match a key used when building the model with jypls().

  • jypls_obj (dict) – Fitted JYPLS model from jypls().

Returns:

Prediction results with keys:

  • Tnew (ndarray): Projected X-scores (n_new × A).

  • Yhat (ndarray): Predicted Y in original scale (n_new × n_y).

  • speX (ndarray): X-space SPE for each new observation.

  • T2 (ndarray): Hotelling’s T² using pooled score covariance.

Return type:

dict
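The returned quantities can be illustrated with a PLS-style projection on a toy model. This is a sketch under stated assumptions, not the library's code: the dictionary below is hand-made for a single campaign, block scaling (blk_scale) is ignored, and the function predict is hypothetical.

```python
import numpy as np

# Toy stand-in for one campaign of a fitted model; all values are made up.
model = {
    "mx": np.array([1.0, 2.0, 3.0]),
    "sx": np.array([1.0, 2.0, 0.5]),
    "Ws": np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]),
    "P": np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]),
    "Q": np.array([[2.0, -1.0]]),       # n_y x A
    "my": np.array([10.0]),
    "sy": np.array([4.0]),
    "var_t": np.diag([1.0, 0.25]),
}

def predict(xnew, m):
    """Assumed PLS-style prediction: scale, project, reconstruct."""
    xs = (np.atleast_2d(xnew) - m["mx"]) / m["sx"]   # preprocess like training X
    tnew = xs @ m["Ws"]                              # projected scores
    yhat = (tnew @ m["Q"].T) * m["sy"] + m["my"]     # back to original Y scale
    resid = xs - tnew @ m["P"].T                     # part of X not explained
    spe_x = (resid ** 2).sum(axis=1)
    t2 = np.einsum("ij,jk,ik->i", tnew, np.linalg.inv(m["var_t"]), tnew)
    return {"Tnew": tnew, "Yhat": yhat, "speX": spe_x, "T2": t2}

pred = predict([2.0, 4.0, 4.0], model)
print(pred)
```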

pyphi.calc.tpls_pred(rnew, znew, tplsobj)[source]

Predict Y for new observations using a fitted TPLS model.

Parameters:
  • rnew (np.ndarray or pd.DataFrame) – New R-space (raw material) data.

  • znew (np.ndarray or pd.DataFrame) – New Z-space (trajectory) data.

  • tplsobj (dict) – Fitted TPLS model from tpls().

Returns:

Prediction results with keys:

  • Tnew (ndarray): Projected scores.

  • Yhat (ndarray): Predicted Y in original scale.

  • speR (ndarray): R-space SPE.

  • speZ (ndarray): Z-space SPE.

  • T2 (ndarray): Hotelling’s T².

Return type:

dict

Example for rnew:

    rnew = {
        'MAT1': [('A0129', 0.557949425), ('A0130', 0.442050575)],
        'MAT2': [('Lac0003', 1)],
        'MAT3': [('TLC018', 1)],
        'MAT4': [('M0012', 1)],
        'MAT5': [('CS0017', 1)],
    }

pyphi.calc.varimax_(X, gamma=1.0, q=20, tol=1e-06)[source]
pyphi.calc.varimax_rotation(mvm_obj, X, *, Y=False)[source]

Apply Varimax rotation to PCA or PLS loadings.

Rotates loadings toward a simple structure (sparse, interpretable). Updates the model object in-place and returns the rotated model.

Parameters:
  • mvm_obj (dict) – Fitted PCA or PLS model.

  • X (pd.DataFrame or np.ndarray) – Training X data used to reproject scores after rotation.

  • Y (pd.DataFrame or np.ndarray) – Training Y data, needed only for PLS models. Default False (no Y supplied).

Returns:

Model with rotated loadings and reprojected scores.

Return type:

dict
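For reference, a textbook Varimax iteration is sketched below. It reuses the argument names gamma, q and tol from varimax_() above, but whether pyPhi follows exactly this scheme is an assumption; the loadings are random stand-in data.

```python
import numpy as np

def varimax(loadings, gamma=1.0, q=20, tol=1e-6):
    """Standard SVD-based Varimax; returns rotated loadings and the rotation."""
    p, k = loadings.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(q):
        d_old = d
        L = loadings @ R
        # Gradient of the Varimax criterion with respect to the rotation.
        grad = loadings.T @ (L**3 - (gamma / p) * L @ np.diag((L**2).sum(axis=0)))
        u, s, vt = np.linalg.svd(grad)
        R = u @ vt                      # nearest orthogonal matrix
        d = s.sum()
        if d_old != 0 and d / d_old < 1 + tol:
            break
    return loadings @ R, R

rng = np.random.default_rng(2)
P = rng.normal(size=(8, 2))             # stand-in loadings (n_vars x A)
P_rot, R = varimax(P)
print(np.round(R.T @ R, 6))
```

Because R is orthogonal, the sum of squared loadings of each variable is unchanged by the rotation; only its distribution across components changes, which is why scores must be reprojected afterwards.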

pyphi.calc.findstr(string)[source]

Find indices of strings containing a given pattern.

Parameters:
  • str_list (list of str) – List of strings to search.

  • pattern (str) – Substring to search for.

Returns:

Indices of elements in str_list that contain pattern.

Return type:

list

pyphi.calc.evalvar(data, vname)[source]
pyphi.calc.writeeq(beta_, features_)[source]
pyphi.calc.build_polynomial(data, factors, response, *, bias_term=True)[source]

Linear regression with variable selection assisted by PLS.

pyphi.calc.cca(X, Y, tol=1e-06, max_iter=1000)[source]

Canonical Correlation Analysis (CCA) between PLS scores and Y.

Computes the maximum-correlation directions between the score matrix T and the response Y. Equivalent to computing the predictive component in OPLS.

Parameters:
  • X (np.ndarray) – Score matrix T from a fitted PLS model (n_obs × A).

  • Y (pd.DataFrame or np.ndarray) – Response matrix (n_obs × n_y).

  • tol (float) – Convergence tolerance. Default 1e-06.

  • max_iter (int) – Maximum number of iterations. Default 1000.

Returns:

CCA results with keys:

  • Tcv (ndarray): Covariant scores.

  • Pcv (ndarray): Covariant loadings (predictive loadings in OPLS sense).

  • Wcv (ndarray): Covariant weights.

Return type:

dict
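For a single response column, the canonical direction has a closed form: it is the least-squares direction of y on the scores. The sketch below uses that shortcut on synthetic data instead of the iterative scheme implied by tol and max_iter, so it illustrates the idea rather than the routine itself.

```python
import numpy as np

rng = np.random.default_rng(3)
T = rng.normal(size=(40, 3))                 # stand-in PLS scores (n_obs x A)
T -= T.mean(axis=0)
y = T @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)
y -= y.mean()

# With one response, the canonical weight is the least-squares direction.
w, *_ = np.linalg.lstsq(T, y, rcond=None)
w /= np.linalg.norm(w)
t_cv = T @ w                                 # single covariant score vector

def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

r_cv = corr(t_cv, y)
r_single = [abs(corr(T[:, a], y)) for a in range(T.shape[1])]
print(r_cv, r_single)
```

The combined score correlates with y at least as strongly as any individual score column, which is the sense in which it plays the role of a single predictive component.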

pyphi.calc.cca_multi(X, Y, num_components=1, tol=1e-06, max_iter=1000)[source]

CCA with multiple canonical variates.