⚙️ Data Preprocessing

Data preprocessing and transformations available in Trinity-Neo

Data Preparation

Missing Values

Datasets may, for various reasons, contain missing values or empty records, often encoded as blanks or NaN. Most machine learning algorithms cannot deal with missing or blank values. Removing samples with missing values is a basic strategy that is sometimes used, but it comes at the cost of losing potentially valuable data and the associated information or patterns. A better strategy is to impute the missing values.

PARAMETERS

  • imputation_type: string, default = 'simple' The type of imputation to use. It can be either simple or iterative. If None, no imputation of missing values is performed.

  • numeric_imputation: int, float, or string, default = ‘mean’ Imputing strategy for numerical columns. Ignored when imputation_type=iterative. Choose from:

    • drop: Drop rows containing missing values.

    • mean: Impute with mean of column.

    • median: Impute with median of column.

    • mode: Impute with most frequent value.

    • knn: Impute using a K-Nearest Neighbors approach.

    • int or float: Impute with provided numerical value.

  • categorical_imputation: string, default = ‘mode’ Imputing strategy for categorical columns. Ignored when imputation_type=iterative. Choose from:

    • drop: Drop rows containing missing values.

    • mode: Impute with most frequent value.

    • str: Impute with provided string.

  • iterative_imputation_iters: int, default = 5 The number of iterations. Ignored when imputation_type=simple.

  • numeric_iterative_imputer: str or sklearn estimator, default = 'lightgbm' Regressor for iterative imputation of missing values in numeric features. If None, it uses LightGBM (LGBMRegressor). Ignored when imputation_type=simple.

  • categorical_iterative_imputer: str or sklearn estimator, default = 'lightgbm' Classifier for iterative imputation of missing values in categorical features. If None, it uses LightGBM (LGBMClassifier). Ignored when imputation_type=simple.

Example

# load dataset
from trinity_neo.datasets import get_data
hepatitis = get_data('hepatitis')

# init setup
from trinity_neo.classification import *
clf1 = setup(data = hepatitis, target = 'Class')

Before

After

Comparison of Simple imputer vs. Iterative imputer
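
A minimal sketch of switching from the default simple imputation to iterative imputation, reusing the same hepatitis dataset (the parameter values shown are illustrative):

# load dataset
from trinity_neo.datasets import get_data
hepatitis = get_data('hepatitis')

# init setup with iterative imputation instead of the default simple strategy
from trinity_neo.classification import *
clf1 = setup(data = hepatitis, target = 'Class',
             imputation_type = 'iterative',
             iterative_imputation_iters = 10,            # more passes than the default 5
             numeric_iterative_imputer = 'lightgbm',     # estimator for numeric columns
             categorical_iterative_imputer = 'lightgbm') # estimator for categorical columns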

Data Types

Each feature in the dataset has an associated data type such as numeric, categorical, or Datetime. Trinity-Neo's inference algorithm automatically detects the data type of each feature. However, sometimes the inferred data types are incorrect. Ensuring data types are correct is important as several downstream processes depend on the data type of the features. One example is that missing values for numeric and categorical features are imputed differently. To overwrite the inferred data types, the numeric_features, categorical_features and date_features parameters can be used in the setup function. You can also use ignore_features to ignore certain features for model training.

PARAMETERS

  • numeric_features: list of string, default = None If the inferred data types are not correct, numeric_features can be used to overwrite the inferred data types.

  • categorical_features: list of string, default = None If the inferred data types are not correct, categorical_features can be used to overwrite the inferred data types.

  • date_features: list of string, default = None If the data has a Datetime column that is not automatically inferred when running the setup, date_features can be used to force the data type. It can work with multiple date columns. Datetime related features are not used in modeling. Instead, feature extraction is performed and original Datetime columns are ignored during model training. If the Datetime column includes a timestamp, features related to time will also be extracted.

  • create_date_columns: list of str, default = ["day", "month", "year"]

    Columns to create from the date features. Note that created features with zero variance (e.g. the feature hour in a column that only contains dates) are ignored. Allowed values are datetime attributes from pandas.Series.dt. The datetime format of the feature is inferred automatically from the first non-NaN value.

  • text_features: list of str, default = None Column names that contain a text corpus. If None, no text features are selected.

  • text_features_method: str, default = 'tf-idf' Method with which to embed the text features in the dataset. Choose between 'bow' (Bag of Words - CountVectorizer) or 'tf-idf' (TfidfVectorizer). Be aware that the sparse matrix output of the transformer is converted internally to its full array. This can cause memory issues for large text embeddings.

  • ignore_features: list of string, default = None ignore_features can be used to ignore features during model training. It takes a list of strings with column names that are to be ignored.

  • keep_features: list of str, default = None

    keep_features parameter can be used to always keep specific features during preprocessing, i.e. these features are never dropped by any kind of feature selection. It takes a list of strings with column names that are to be kept.

Example 1 - Categorical Features

# load dataset
from trinity_neo.datasets import get_data
hepatitis = get_data('hepatitis')

# init setup
from trinity_neo.classification import *
clf1 = setup(data = hepatitis, target = 'Class', categorical_features = ['AGE'])

Before

After

Example 2 - Ignore Features

# load dataset
from trinity_neo.datasets import get_data
pokemon = get_data('pokemon')

# init setup
from trinity_neo.classification import *
clf1 = setup(data = pokemon, target = 'Legendary', ignore_features = ['#', 'Name'])

Before

After
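
The parameters above also cover Datetime and text handling; the sketch below shows how they fit together. It is illustrative only: df, admission_date, notes, patient_id, age, and outcome are hypothetical placeholders, not columns from one of the bundled datasets.

# init setup on a hypothetical dataframe that contains a Datetime and a text column
from trinity_neo.classification import *
clf1 = setup(data = df, target = 'outcome',
             date_features = ['admission_date'],              # force Datetime parsing
             create_date_columns = ['day', 'month', 'year'],  # features extracted from the date
             text_features = ['notes'],                       # embedded with tf-idf by default
             ignore_features = ['patient_id'],                # never used for training
             keep_features = ['age'])                         # never dropped by feature selection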

One-Hot Encoding

Categorical features in the dataset contain the label values (ordinal or nominal) rather than continuous numbers. The majority of the machine learning algorithms cannot directly deal with categorical features and they must be transformed into numeric values before training a model. The most common type of categorical encoding is One-Hot Encoding (also known as dummy encoding) where each categorical level becomes a separate feature in the dataset containing binary values (1 or 0).

Since this is an imperative step in any ML experiment, trinity-neo will transform all categorical features in the dataset using one-hot encoding. This is ideal for features with nominal categorical data, i.e. data that cannot be ordered. In other scenarios, different encoding methods must be used. For example, when the data is ordinal, i.e. it has intrinsic levels, ordinal encoding must be used. One-Hot Encoding works on all features that are either inferred as categorical or are forced as categorical using categorical_features in the setup function.

PARAMETERS

  • max_encoding_ohe: int, default = 25 Categorical columns with max_encoding_ohe or fewer unique values are encoded using OneHotEncoding. If more, the encoding_method estimator is used. Note that columns with exactly two classes are always encoded ordinally. Set to below 0 to always use OneHotEncoding.

  • encoding_method: category-encoders estimator, default = None A category-encoders estimator to encode the categorical columns with more than max_encoding_ohe unique values. If None, category_encoders.leave_one_out.LeaveOneOutEncoder is used by default.

# load dataset
from trinity_neo.datasets import get_data
pokemon = get_data('pokemon')

# init setup
from trinity_neo.classification import *
clf1 = setup(data = pokemon, target = 'Legendary')

Before

After
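
A minimal sketch of tuning the encoding behavior: columns with up to 10 unique values are one-hot encoded, anything with higher cardinality is handled by a category-encoders estimator (TargetEncoder is used here purely as an example of such an estimator):

# load dataset
from trinity_neo.datasets import get_data
pokemon = get_data('pokemon')

# init setup with a lower one-hot threshold and a custom high-cardinality encoder
from category_encoders import TargetEncoder
from trinity_neo.classification import *
clf1 = setup(data = pokemon, target = 'Legendary',
             max_encoding_ohe = 10,
             encoding_method = TargetEncoder())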

Ordinal Encoding

When the categorical features in the dataset contain variables with intrinsic natural order such as Low, Medium, and High, these must be encoded differently than nominal variables (where there is no intrinsic order for e.g. Male or Female). This can be achieved using the ordinal_features parameter in the setup function that accepts a dictionary with feature names and the levels in the increasing order from lowest to highest.

PARAMETERS

  • ordinal_features: dictionary, default = None When the data contains ordinal features, they must be encoded differently using the ordinal_features. If the data has a categorical variable with values of low, medium, high and it is known that low < medium < high, then it can be passed as ordinal_features = { 'column_name' : ['low', 'medium', 'high'] }. The list sequence must be in increasing order from lowest to highest.

Example

# load dataset
from trinity_neo.datasets import get_data
employee = get_data('employee')

# init setup
from trinity_neo.classification import *
clf1 = setup(data = employee, target = 'left', ordinal_features = {'salary' : ['low', 'medium', 'high']})

Before

Target Imbalance

When the training dataset has an unequal distribution of target class it can be fixed using the fix_imbalance parameter in the setup. When set to True, SMOTE (Synthetic Minority Over-sampling Technique) is used as a default method for resampling. The method for resampling can be changed using the fix_imbalance_method within the setup.

PARAMETERS

  • fix_imbalance: bool, default = False When set to True, the training dataset is resampled using the algorithm defined in fix_imbalance_method. When None, SMOTE is used by default.

  • fix_imbalance_method: str or imblearn estimator, default = 'SMOTE' Estimator with which to perform class balancing. Choose from the name of an imblearn estimator, or a custom instance of such. Ignored when fix_imbalance=False.

Example

# load dataset
from trinity_neo.datasets import get_data
credit = get_data('credit')

# init setup
from trinity_neo.classification import *
clf1 = setup(data = credit, target = 'default', fix_imbalance = True)

Before and After SMOTE
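
A minimal sketch of swapping the default SMOTE for a different resampler; RandomUnderSampler from imblearn is used here only as an example of an estimator that can be passed to fix_imbalance_method:

# load dataset
from trinity_neo.datasets import get_data
credit = get_data('credit')

# init setup with random under-sampling instead of the default SMOTE
from imblearn.under_sampling import RandomUnderSampler
from trinity_neo.classification import *
clf1 = setup(data = credit, target = 'default',
             fix_imbalance = True,
             fix_imbalance_method = RandomUnderSampler())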

Remove Outliers

The remove_outliers function in trinity-neo allows you to identify and remove outliers from the dataset before training the model. Outliers are identified using the method defined in the outliers_method parameter (an Isolation Forest by default). Outlier removal is enabled through the remove_outliers parameter within setup, and the proportion of outliers is controlled through the outliers_threshold parameter.

PARAMETERS

  • remove_outliers: bool, default = False When set to True, outliers from the training data are removed using an Isolation Forest.

  • outliers_method: str, default = 'iforest' Method with which to remove outliers. Ignored when remove_outliers=False. Possible values are:

    • 'iforest': Uses sklearn's IsolationForest.

    • 'ee': Uses sklearn's EllipticEnvelope.

    • 'lof': Uses sklearn's LocalOutlierFactor.

  • outliers_threshold: float, default = 0.05 The percentage of outliers to be removed from the dataset. Ignored when remove_outliers=False.

Example

# load dataset
from trinity_neo.datasets import get_data
insurance = get_data('insurance')

# init setup
from trinity_neo.regression import *
reg1 = setup(data = insurance, target = 'charges', remove_outliers = True)

Before and After removing outliers
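
A minimal sketch of changing the outlier detector and the fraction of rows removed (values are illustrative):

# load dataset
from trinity_neo.datasets import get_data
insurance = get_data('insurance')

# init setup: use Local Outlier Factor and drop the most extreme 2% of rows
from trinity_neo.regression import *
reg1 = setup(data = insurance, target = 'charges',
             remove_outliers = True,
             outliers_method = 'lof',
             outliers_threshold = 0.02)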

Scale and Transform

Normalize

Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to rescale the values of numeric columns in the dataset without distorting differences in the ranges of values or losing information. There are several methods available for normalization, by default, trinity-neo uses zscore.

PARAMETERS

  • normalize: bool, default = False When set to True, the feature space is transformed using the method defined under the normalize_method parameter.

  • normalize_method: string, default = ‘zscore’ Defines the method to be used for normalization. By default, the method is set to zscore. The other available options are:

    • zscore The standard zscore is calculated as z = (x – u) / s

    • minmax scales and translates each feature individually such that it is in the range of 0 – 1.

    • maxabs scales and translates each feature individually such that the maximal absolute value of each feature will be 1.0. It does not shift/center the data and thus does not destroy any sparsity.

    • robust scales and translates each feature according to the Interquartile range. When the dataset contains outliers, the robust scaler often gives better results.

Example

# load dataset
from trinity_neo.datasets import get_data
pokemon = get_data('pokemon')

# init setup
from trinity_neo.classification import *
clf1 = setup(data = pokemon, target = 'Legendary', normalize = True)

Before

After

Effect of Normalization:
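
A minimal sketch of selecting a different scaler, for example the robust scaler when the data contains outliers:

# load dataset
from trinity_neo.datasets import get_data
pokemon = get_data('pokemon')

# init setup with robust scaling (interquartile range) instead of the default zscore
from trinity_neo.classification import *
clf1 = setup(data = pokemon, target = 'Legendary',
             normalize = True,
             normalize_method = 'robust')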

Feature Transform

While normalization rescales the data within new limits to reduce the impact of magnitude in the variance, feature transformation is a more radical technique. Transformation changes the shape of the distribution such that the transformed data can be represented by a normal or approximately normal distribution. There are two methods available for transformation: yeo-johnson and quantile.

PARAMETERS

  • transformation: bool, default = False When set to True, a power transformer is applied to make the data more normal / Gaussian-like. This is useful for modeling issues related to heteroscedasticity or other situations where normality is desired. The optimal parameter for stabilizing variance and minimizing skewness is estimated through maximum likelihood.

  • transformation_method: string, default = ‘yeo-johnson’ Defines the method for transformation. By default, the transformation method is set to yeo-johnson. The other available option is quantile transformation. Both transformations transform the feature set to follow a Gaussian-like or normal distribution. Note that the quantile transformer is non-linear and may distort linear correlations between variables measured at the same scale.

Example

# load dataset
from trinity_neo.datasets import get_data
pokemon = get_data('pokemon')

# init setup
from trinity_neo.classification import *
clf1 = setup(data = pokemon, target = 'Legendary', transformation = True)

Before

Dataframe view before transformation

After

Dataframe view after transformation

Effect of Feature Transformation:
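
A minimal sketch of using the quantile transformer instead of the default yeo-johnson:

# load dataset
from trinity_neo.datasets import get_data
pokemon = get_data('pokemon')

# init setup with quantile transformation of the features
from trinity_neo.classification import *
clf1 = setup(data = pokemon, target = 'Legendary',
             transformation = True,
             transformation_method = 'quantile')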

Target Transform

Target transformation is similar to feature transformation, except that it changes the shape of the distribution of the target variable instead of the features. This feature is only available in the trinity_neo.regression module.

PARAMETERS

  • transform_target: bool, default = False When set to True, the target variable is transformed using the method defined in the transform_target_method parameter. Target transformation is applied separately from feature transformations.

  • transform_target_method: string, default = ‘yeo-johnson’ Defines the method for transformation. By default, the transformation method is set to yeo-johnson. The other available option for transformation is quantile. Ignored when transform_target = False.

Example

# load dataset
from trinity_neo.datasets import get_data
diamond = get_data('diamond')

# init setup
from trinity_neo.regression import *
reg1 = setup(data = diamond, target = 'Price', transform_target = True)

Before

Dataframe view before target transformation

After

Dataframe view after target transformation
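
A minimal sketch of changing the target transformation method to quantile:

# load dataset
from trinity_neo.datasets import get_data
diamond = get_data('diamond')

# init setup with quantile transformation of the target variable
from trinity_neo.regression import *
reg1 = setup(data = diamond, target = 'Price',
             transform_target = True,
             transform_target_method = 'quantile')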

Feature Engineering

Polynomial Features

In machine learning experiments, the relationship between the dependent and independent variables is often assumed to be linear; however, this is not always the case. Sometimes the relationship between dependent and independent variables is more complex. Creating new polynomial features sometimes might help in capturing that relationship, which otherwise may go unnoticed.

PARAMETERS

  • polynomial_features: bool, default = False When set to True, new features are created based on all polynomial combinations that exist within the numeric features in a dataset to the degree defined in the polynomial_degree parameter.

  • polynomial_degree: int, default = 2 Degree of polynomial features. For example, if an input sample is two dimensional and of the form [a, b], the polynomial features with degree = 2 are: [1, a, b, a^2, ab, b^2].

Example

# load dataset
from trinity_neo.datasets import get_data
juice = get_data('juice')

# init setup
from trinity_neo.classification import *
clf1 = setup(data = juice, target = 'Purchase', polynomial_features = True)

Before

Dataframe view before polynomial features

After

Dataframe view after polynomial features
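
To make the expansion concrete, the short sketch below uses plain scikit-learn (independent of Trinity-Neo's internals) to show how a single two-dimensional sample [a, b] becomes [1, a, b, a^2, ab, b^2] with degree = 2:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# a single two-dimensional sample [a, b] = [2, 3]
X = np.array([[2, 3]])

poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))                    # [[1. 2. 3. 4. 6. 9.]]
print(poly.get_feature_names_out(['a', 'b']))   # ['1' 'a' 'b' 'a^2' 'a b' 'b^2']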

Group Features

When a dataset contains features that are related to each other in some way, for example features recorded at fixed time intervals, new statistical features such as the mean, median, variance, and standard deviation for a group of such features can be created from the existing features using the group_features parameter.

PARAMETERS

  • group_features: list or list of list, default = None When a dataset contains features that have related characteristics, the group_features param can be used for statistical feature extraction. For example, if a dataset has numeric features that are related with each other (i.e ‘Col1’, ‘Col2’, ‘Col3’), a list containing the column names can be passed under group_features to extract statistical information such as the mean, median, mode and standard deviation.

  • group_names: list, default = None When group_features is passed, names for the groups can be passed into the group_names parameter as a list of strings. The length of the group_names list must equal the length of group_features. When the length doesn’t match or a name is not passed, new features are named sequentially, such as group_1, group_2, etc.

Example

# load dataset
from trinity_neo.datasets import get_data
credit = get_data('credit')

# init setup
from trinity_neo.classification import *
clf1 = setup(data = credit, target = 'default', group_features = ['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6'])

Before

Dataframe before group features

After

Dataframe after group features
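
A minimal sketch of naming the extracted group through group_names; here the group is passed as a list of lists, and the length of group_names matches the number of groups:

# load dataset
from trinity_neo.datasets import get_data
credit = get_data('credit')

# init setup with a named group of related bill-amount columns
from trinity_neo.classification import *
clf1 = setup(data = credit, target = 'default',
             group_features = [['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3',
                                'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']],
             group_names = ['bill_amt'])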

Bin Numeric Features

Feature binning is a method of turning continuous variables into categorical values using a pre-defined number of bins. It is effective when a continuous feature has too many unique values or a few extreme values outside the expected range. Such extreme values can influence the trained model and affect its prediction accuracy. In trinity-neo, continuous numeric features can be binned into intervals using the bin_numeric_features parameter. trinity-neo uses the 'sturges' rule to determine the number of bins and uses K-Means clustering to convert continuous numeric features into categorical features.

PARAMETERS

  • bin_numeric_features: list, default = None When a list of numeric features is passed, they are transformed into categorical features using K-Means, where values in each bin have the same nearest center of a 1D k-means cluster. The number of clusters is determined based on the 'sturges' method, which is only optimal for Gaussian data and underestimates the number of bins for large non-Gaussian datasets.

Example

# load dataset
from trinity_neo.datasets import get_data
income = get_data('income')

# init setup
from trinity_neo.classification import *
clf1 = setup(data = income, target = 'income >50K', bin_numeric_features = ['age'])

Before

Dataframe view before bin numeric bin features

After

Dataframe view after numeric bin features

Combine Rare Levels

Sometimes a dataset can have one or more categorical features with a very high number of levels (i.e. high-cardinality features). If such features are encoded into numeric values, the resulting matrix is a sparse matrix. This not only slows down the experiment due to the manifold increase in the number of features (and hence the size of the dataset), but also introduces noise. A sparse matrix can be avoided by combining the rare levels in the high-cardinality feature(s). This can be achieved in trinity-neo using the rare_to_value parameter.

PARAMETERS

  • rare_to_value: float or None, default=None

    Minimum fraction of category occurrences in a categorical column. If a category is less frequent than rare_to_value * len(X), it is replaced with the string in rare_value. Use this parameter to group rare categories before encoding the column. If None, ignores this step.

  • rare_value: str, default="rare"

    Value with which to replace rare categories. Ignored when rare_to_value is None.

Example

# load dataset
from trinity_neo.datasets import get_data
income = get_data('income')

# init setup
from trinity_neo.classification import *
clf1 = setup(data = income, target = 'income >50K', rare_to_value = 0.1)

Before

Dataframe view before combine rare levels

After

Dataframe view after combine rare levels

Effect of combining rare levels
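
A minimal sketch of grouping rare categories under a custom label instead of the default "rare" (the 5% cutoff is illustrative):

# load dataset
from trinity_neo.datasets import get_data
income = get_data('income')

# init setup: categories covering less than 5% of rows are merged into 'other'
from trinity_neo.classification import *
clf1 = setup(data = income, target = 'income >50K',
             rare_to_value = 0.05,
             rare_value = 'other')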

Feature Selection

Feature Selection

Feature selection is a process used to select the features in the dataset that contribute the most to predicting the target variable. Working with selected features instead of all the features reduces the risk of over-fitting, improves accuracy, and decreases the training time. In trinity-neo, this can be achieved using the feature_selection parameter.

PARAMETERS

  • feature_selection: bool, default = False When set to True, a subset of features is selected based on a feature importance score determined by feature_selection_estimator.

  • feature_selection_method: str, default = 'classic'

    Algorithm for feature selection. Choose from:

    • 'univariate': Uses sklearn's SelectKBest.

    • 'classic': Uses sklearn's SelectFromModel.

    • 'sequential': Uses sklearn's SequentialFeatureSelector.

  • feature_selection_estimator: str or sklearn estimator, default = 'lightgbm'

    Classifier used to determine the feature importance. The estimator should have a feature_importances_ or coef_ attribute after fitting. If None, it uses LGBMClassifier. This parameter is ignored when feature_selection_method=univariate.

  • n_features_to_select: int or float, default = 0.2

    The maximum number of features to select with feature_selection. If <1, it's the fraction of starting features. Note that this parameter doesn't take features in ignore_features or keep_features into account when counting.

Example

# load dataset
from trinity_neo.datasets import get_data
diabetes = get_data('diabetes')

# init setup
from trinity_neo.classification import *
clf1 = setup(data = diabetes, target = 'Class variable', feature_selection = True)

Before

Dataframe before feature importance

After

Dataframe after feature importance
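
A minimal sketch of changing the selection algorithm and the number of features kept (values are illustrative):

# load dataset
from trinity_neo.datasets import get_data
diabetes = get_data('diabetes')

# init setup: univariate selection (SelectKBest), keeping 5 features
from trinity_neo.classification import *
clf1 = setup(data = diabetes, target = 'Class variable',
             feature_selection = True,
             feature_selection_method = 'univariate',
             n_features_to_select = 5)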

Remove Multicollinearity

Multicollinearity (also called collinearity) is a phenomenon in which one feature variable in the dataset is highly linearly correlated with another feature variable in the same dataset. Multicollinearity increases the variance of the coefficients, making them unstable and noisy for linear models. One way to deal with multicollinearity is to drop one of the two features that are highly correlated with each other. This can be achieved in Trinity-Neo using the remove_multicollinearity parameter.

PARAMETERS

  • remove_multicollinearity: bool, default = False When set to True, features with inter-correlations higher than the defined threshold are removed. For each group of correlated features, it removes all except the feature with the highest correlation to y.

  • multicollinearity_threshold: float, default = 0.9 Minimum absolute Pearson correlation to identify correlated features. The default value removes equal columns. Ignored when remove_multicollinearity is not True.

Example

# load dataset
from trinity_neo.datasets import get_data
concrete = get_data('concrete')

# init setup
from trinity_neo.regression import *
reg1 = setup(data = concrete, target = 'strength', remove_multicollinearity = True, multicollinearity_threshold = 0.3)

Before

Dataframe view before remove multicollinearity

After

Dataframe view after remove multicollinearity

Principal Component Analysis

Principal Component Analysis (PCA) is an unsupervised technique used in machine learning to reduce the dimensionality of a dataset. It does so by compressing the feature space and identifying a subspace that captures most of the information in the complete feature matrix, projecting the original feature space into a lower dimensionality.

PARAMETERS

  • pca: bool, default = False When set to True, dimensionality reduction is applied to project the data into a lower dimensional space using the method defined in pca_method parameter.

  • pca_method: string, default = ‘linear’ Method with which to apply PCA. Possible values are:

    • 'linear': Uses Singular Value Decomposition.

    • 'kernel': Dimensionality reduction through the use of RBF kernel.

    • 'incremental': Similar to 'linear', but more efficient for large datasets.

  • pca_components: int, float, str or None, default = None Number of components to keep. This parameter is ignored when pca=False.

    • If None: All components are kept.

    • If int: Absolute number of components. It must be strictly less than the original number of features in the dataset.

    • If float: Such an amount that the variance that needs to be explained is greater than the percentage specified by pca_components. The value should lie between 0 and 1 (only for pca_method='linear').

    • If 'mle': Minka’s MLE is used to guess the dimension (only for pca_method='linear').

Example

# load dataset
from trinity_neo.datasets import get_data
income = get_data('income')

# init setup
from trinity_neo.classification import *
clf1 = setup(data = income, target = 'income >50K', pca = True, pca_components = 10)

Before

Dataframe view before pca

After

Dataframe view after pca
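
A minimal sketch of keeping enough components to retain a target share of variance instead of a fixed count (a float value is only meaningful for pca_method='linear'):

# load dataset
from trinity_neo.datasets import get_data
income = get_data('income')

# init setup: keep as many components as needed to explain 80% of the variance
from trinity_neo.classification import *
clf1 = setup(data = income, target = 'income >50K',
             pca = True,
             pca_method = 'linear',
             pca_components = 0.8)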

Ignore Low Variance

Sometimes a dataset may have a categorical feature with multiple levels, where the distribution of those levels is skewed and one level may dominate over the others. This means there is not much variation in the information provided by such a feature. For an ML model, such a feature may not add a lot of information and can therefore be ignored for modeling. This can be achieved in trinity-neo using the low_variance_threshold parameter.

PARAMETERS

  • low_variance_threshold: float or None, default = None

    Remove features with a training-set variance lower than the provided threshold. If 0, keep all features with non-zero variance, i.e. remove the features that have the same value in all samples. If None, skip this transformation step.

Example

# load dataset
from trinity_neo.datasets import get_data
mice = get_data('mice')

# filter dataset
mice = mice[mice['Genotype'] == 'Control']

# init setup
from trinity_neo.classification import *
clf1 = setup(data = mice, target = 'class', low_variance_threshold = 0.1)

Before

Dataframe view before ignore low variance

After

Dataframe view after ignore low variance
