⚙️ Data Preprocessing
Data preprocessing and transformations available in Trinity-Neo
Data Preparation
Missing Values
Datasets may, for various reasons, have missing values or empty records, often encoded as blanks or NaN. Most machine learning algorithms cannot handle missing or blank values. Removing samples with missing values is a basic strategy that is sometimes used, but it comes at the cost of losing potentially valuable data and the associated information or patterns. A better strategy is to impute the missing values.
PARAMETERS
imputation_type: string, default = 'simple'
The type of imputation to use. It can be either 'simple' or 'iterative'. If None, no imputation of missing values is performed.

numeric_imputation: int, float, or string, default = 'mean'
Imputing strategy for numerical columns. Ignored when imputation_type = 'iterative'. Choose from:
drop: Drop rows containing missing values.
mean: Impute with mean of column.
median: Impute with median of column.
mode: Impute with most frequent value.
knn: Impute using a K-Nearest Neighbors approach.
int or float: Impute with provided numerical value.

categorical_imputation: string, default = 'mode'
Imputing strategy for categorical columns. Ignored when imputation_type = 'iterative'. Choose from:
drop: Drop rows containing missing values.
mode: Impute with most frequent value.
str: Impute with provided string.

iterative_imputation_iters: int, default = 5
The number of iterations. Ignored when imputation_type = 'simple'.

numeric_iterative_imputer: str or sklearn estimator, default = 'lightgbm'
Regressor for iterative imputation of missing values in numeric features. If None, it uses LGBMRegressor. Ignored when imputation_type = 'simple'.

categorical_iterative_imputer: str or sklearn estimator, default = 'lightgbm'
Estimator for iterative imputation of missing values in categorical features. If None, it uses LGBMClassifier. Ignored when imputation_type = 'simple'.
Example
# load dataset
from trinity_neo.datasets import get_data
hepatitis = get_data('hepatitis')
# init setup
from trinity_neo.classification import *
clf1 = setup(data = hepatitis, target = 'Class')
Before
After
Comparison of Simple imputer vs. Iterative imputer
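For intuition, the sketch below reproduces the two imputation strategies directly with scikit-learn on a small made-up frame; it illustrates the concept only and is not trinity-neo's internal pipeline.

# Standalone illustration of simple vs. iterative imputation with scikit-learn
# (not trinity-neo internals; toy data and column names are made up).
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

df = pd.DataFrame({
    'age':       [25, np.nan, 47, 33, np.nan],
    'bilirubin': [0.7, 1.2, np.nan, 0.9, 1.8],
})

# Simple imputation: replace NaN with the column mean (numeric_imputation = 'mean')
simple = SimpleImputer(strategy='mean')
print(pd.DataFrame(simple.fit_transform(df), columns=df.columns))

# Iterative imputation: model each column with missing values as a function of the
# other columns, repeating for a fixed number of rounds (roughly what
# imputation_type = 'iterative' with iterative_imputation_iters = 5 does).
iterative = IterativeImputer(max_iter=5, random_state=0)
print(pd.DataFrame(iterative.fit_transform(df), columns=df.columns))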
Data Types
Each feature in the dataset has an associated data type such as numeric, categorical, or datetime. Trinity-Neo's inference algorithm automatically detects the data type of each feature. However, the inferred data types are sometimes incorrect. Ensuring data types are correct is important, as several downstream processes depend on the data type of the features. One example is that missing values for numeric and categorical features are imputed differently. To overwrite the inferred data types, the numeric_features, categorical_features, and date_features parameters can be used in the setup function. You can also use ignore_features to exclude certain features from model training.
PARAMETERS
numeric_features: list of string, default = None
If the inferred data types are not correct, numeric_features can be used to overwrite the inferred data types.

categorical_features: list of string, default = None
If the inferred data types are not correct, categorical_features can be used to overwrite the inferred data types.

date_features: list of string, default = None
If the data has a datetime column that is not automatically inferred when running the setup, date_features can be used to force the data type. It can work with multiple date columns. Datetime-related features are not used in modeling. Instead, feature extraction is performed and the original datetime columns are ignored during model training. If the datetime column includes a timestamp, features related to time will also be extracted.

create_date_columns: list of str, default = ["day", "month", "year"]
Columns to create from the date features. Note that created features with zero variance (e.g. the feature hour in a column that only contains dates) are ignored. Allowed values are datetime attributes from pandas.Series.dt. The datetime format of the feature is inferred automatically from the first non-NaN value.

text_features: list of str, default = None
Column names that contain a text corpus. If None, no text features are selected.

text_features_method: str, default = 'tf-idf'
Method with which to embed the text features in the dataset. Choose between 'bow' (Bag of Words - CountVectorizer) or 'tf-idf' (TfidfVectorizer). Be aware that the sparse matrix output of the transformer is converted internally to its full array. This can cause memory issues for large text embeddings.

ignore_features: list of string, default = None
ignore_features can be used to ignore features during model training. It takes a list of strings with column names that are to be ignored.

keep_features: list of str, default = None
keep_features can be used to always keep specific features during preprocessing, i.e. these features are never dropped by any kind of feature selection. It takes a list of strings with column names that are to be kept.
Example 1 - Categorical Features
# load dataset
from trinity_neo.datasets import get_data
hepatitis = get_data('hepatitis')
# init setup
from trinity_neo.classification import *
clf1 = setup(data = hepatitis, target = 'Class', categorical_features = ['AGE'])
Before
After
Example 2 - Ignore Features
# load dataset
from trinity_neo.datasets import get_data
pokemon = get_data('pokemon')
# init setup
from trinity_neo.classification import *
clf1 = setup(data = pokemon, target = 'Legendary', ignore_features = ['#', 'Name'])
Before
After
One-Hot Encoding
Categorical features in the dataset contain label values (ordinal or nominal) rather than continuous numbers. Most machine learning algorithms cannot deal with categorical features directly, so they must be transformed into numeric values before training a model. The most common type of categorical encoding is One-Hot Encoding (also known as dummy encoding), where each categorical level becomes a separate feature in the dataset containing binary values (1 or 0).
Since this is an essential step in an ML experiment, trinity-neo transforms all categorical features in the dataset using one-hot encoding. This is ideal for nominal categorical data, i.e. data that cannot be ordered. In other scenarios, other encoding methods must be used. For example, when the data is ordinal, i.e. it has intrinsic levels, ordinal encoding must be used. One-Hot Encoding works on all features that are either inferred as categorical or are forced as categorical using the categorical_features parameter in the setup function.
PARAMETERS
max_encoding_ohe: int, default = 25
Categorical columns with max_encoding_ohe or less unique values are encoded using One-Hot Encoding. If more, the encoding_method estimator is used. Note that columns with exactly two classes are always encoded ordinally. Set to below 0 to always use One-Hot Encoding.

encoding_method: category-encoders estimator, default = None
A category-encoders estimator to encode the categorical columns with more than max_encoding_ohe unique values. If None, category_encoders.leave_one_out.LeaveOneOutEncoder is used by default.
# load dataset
from trinity_neo.datasets import get_data
pokemon = get_data('pokemon')
# init setup
from trinity_neo.classification import *
clf1 = setup(data = pokemon, target = 'Legendary')
Before
After
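For intuition, here is a minimal standalone sketch of one-hot (dummy) encoding with pandas on made-up data; trinity-neo performs the equivalent step internally during setup.

# Standalone illustration of one-hot (dummy) encoding with pandas
# (not trinity-neo internals; toy data).
import pandas as pd

df = pd.DataFrame({'Type 1': ['Grass', 'Fire', 'Water', 'Fire']})

# Each level of the categorical column becomes its own 0/1 column.
encoded = pd.get_dummies(df, columns=['Type 1'], dtype=int)
print(encoded)
#    Type 1_Fire  Type 1_Grass  Type 1_Water
# 0            0             1             0
# 1            1             0             0
# ...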
Ordinal Encoding
When the categorical features in the dataset contain variables with an intrinsic natural order such as Low, Medium, and High, these must be encoded differently than nominal variables (where there is no intrinsic order, e.g. Male or Female). This can be achieved using the ordinal_features parameter in the setup function, which accepts a dictionary with feature names and the levels in increasing order from lowest to highest.
PARAMETERS
ordinal_features: dictionary, default = None
When the data contains ordinal features, they must be encoded differently using the ordinal_features parameter. If the data has a categorical variable with values of low, medium, high and it is known that low < medium < high, then it can be passed as ordinal_features = { 'column_name' : ['low', 'medium', 'high'] }. The list sequence must be in increasing order from lowest to highest.
Example
# load dataset
from trinity_neo.datasets import get_data
employee = get_data('employee')
# init setup
from trinity_neo.classification import *
clf1 = setup(data = employee, target = 'left', ordinal_features = {'salary' : ['low', 'medium', 'high']})
Before
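As a standalone illustration of the same idea, the sketch below encodes an ordinal column with an explicit level order using scikit-learn; the values are made up and this is not trinity-neo's internal implementation.

# Standalone sketch of ordinal encoding with an explicit level order
# (not trinity-neo internals; toy data).
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'salary': ['low', 'high', 'medium', 'low']})

# The categories list fixes the order: low -> 0, medium -> 1, high -> 2.
encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
df['salary_encoded'] = encoder.fit_transform(df[['salary']]).ravel()
print(df)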
Target Imbalance
When the training dataset has an unequal distribution of the target class, it can be fixed using the fix_imbalance parameter in the setup. When set to True, SMOTE (Synthetic Minority Over-sampling Technique) is used as the default resampling method. The resampling method can be changed using the fix_imbalance_method parameter within the setup.
PARAMETERS
fix_imbalance: bool, default = False
When set to True, the training dataset is resampled using the algorithm defined in fix_imbalance_method. When None, SMOTE is used by default.

fix_imbalance_method: str or imblearn estimator, default = 'SMOTE'
Estimator with which to perform class balancing. Choose from the name of an imblearn estimator, or a custom instance of such. Ignored when fix_imbalance = False.
Example
# load dataset
from trinity_neo.datasets import get_data
credit = get_data('credit')
# init setup
from trinity_neo.classification import *
clf1 = setup(data = credit, target = 'default', fix_imbalance = True)
Before and After SMOTE
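To see what SMOTE resampling does, the following standalone sketch uses imbalanced-learn on a synthetic imbalanced dataset; trinity-neo applies the equivalent step internally when fix_imbalance = True.

# Standalone sketch of SMOTE oversampling with imbalanced-learn
# (illustrative only; synthetic data, not trinity-neo internals).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Build a dataset with a 95/5 class imbalance.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print('before:', Counter(y))

# SMOTE synthesizes new minority-class samples by interpolating between
# existing minority samples and their nearest neighbors.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print('after: ', Counter(y_res))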
Remove Outliers
The remove_outliers parameter in trinity-neo allows you to identify and remove outliers from the dataset before training the model. Outliers are identified using the method defined in the outliers_method parameter (an Isolation Forest by default). The proportion of outliers to remove is controlled through the outliers_threshold parameter.
PARAMETERS
remove_outliers: bool, default = False
When set to True, outliers from the training data are removed using an Isolation Forest.

outliers_method: str, default = 'iforest'
Method with which to remove outliers. Ignored when remove_outliers = False. Possible values are:
'iforest': Uses sklearn's IsolationForest.
'ee': Uses sklearn's EllipticEnvelope.
'lof': Uses sklearn's LocalOutlierFactor.

outliers_threshold: float, default = 0.05
The percentage of outliers to be removed from the dataset. Ignored when remove_outliers = False.
Example
# load dataset
from trinity_neo.datasets import get_data
insurance = get_data('insurance')
# init setup
from trinity_neo.regression import *
reg1 = setup(data = insurance, target = 'charges', remove_outliers = True)
Before and After removing outliers
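The sketch below shows the same idea on made-up data using scikit-learn's IsolationForest directly, with contamination playing the role of outliers_threshold; it is illustrative only.

# Standalone sketch of outlier removal with an Isolation Forest
# (similar in spirit to remove_outliers=True; not trinity-neo internals; toy data).
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
df = pd.DataFrame({'bmi': rng.normal(28, 4, 500), 'charges': rng.normal(12000, 3000, 500)})

# contamination is the expected fraction of outliers (compare outliers_threshold = 0.05).
iso = IsolationForest(contamination=0.05, random_state=0)
mask = iso.fit_predict(df) == 1          # 1 = inlier, -1 = outlier
df_clean = df[mask]
print(len(df), '->', len(df_clean))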
Scale and Transform
Normalize
Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to rescale the values of numeric columns in the dataset without distorting differences in the ranges of values or losing information. There are several methods available for normalization; by default, trinity-neo uses zscore.
PARAMETERS
normalize: bool, default = False
When set to True, the feature space is transformed using the method defined under the normalize_method parameter.

normalize_method: string, default = 'zscore'
Defines the method to be used for normalization. By default, the method is set to zscore. The available options are:
zscore: The standard zscore is calculated as z = (x - u) / s.
minmax: Scales and translates each feature individually such that it is in the range of 0 to 1.
maxabs: Scales and translates each feature individually such that the maximal absolute value of each feature will be 1.0. It does not shift/center the data and thus does not destroy any sparsity.
robust: Scales and translates each feature according to the interquartile range. When the dataset contains outliers, the robust scaler often gives better results.
Example
# load dataset
from trinity_neo.datasets import get_data
pokemon = get_data('pokemon')
# init setup
from trinity_neo.classification import *
clf1 = setup(data = pokemon, target = 'Legendary', normalize = True)
Before
After
Effect of Normalization:
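As a quick check of the zscore formula, the standalone sketch below compares scikit-learn's StandardScaler with the manual calculation z = (x - u) / s on made-up values.

# Standalone sketch of z-score normalization (normalize_method = 'zscore')
# (equivalent to sklearn's StandardScaler; toy data, not trinity-neo internals).
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[50.], [60.], [70.], [80.], [90.]])

# z = (x - u) / s, where u is the column mean and s the column standard deviation.
scaler = StandardScaler()
print(scaler.fit_transform(X).ravel())

# Manual check of the same formula.
print((X.ravel() - X.mean()) / X.std())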
Feature Transform
While normalization rescales the data within new limits to reduce the impact of magnitude in the variance, feature transformation is a more radical technique. Transformation changes the shape of the distribution so that the transformed data can be represented by a normal or approximately normal distribution. There are two methods available for transformation: yeo-johnson and quantile.
PARAMETERS
transformation: bool, default = False
When set to True, a power transformer is applied to make the data more normal / Gaussian-like. This is useful for modeling issues related to heteroscedasticity or other situations where normality is desired. The optimal parameter for stabilizing variance and minimizing skewness is estimated through maximum likelihood.

transformation_method: string, default = 'yeo-johnson'
Defines the method for transformation. By default, the transformation method is set to yeo-johnson. The other available option is quantile transformation. Both transformations transform the feature set to follow a Gaussian-like or normal distribution. Note that the quantile transformer is non-linear and may distort linear correlations between variables measured at the same scale.
Example
# load dataset
from trinity_neo.datasets import get_data
pokemon = get_data('pokemon')
# init setup
from trinity_neo.classification import *
clf1 = setup(data = pokemon, target = 'Legendary', transformation = True)
Before
Dataframe view before transformation
After
Dataframe view after transformation
Effect of Feature Transformation:
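The standalone sketch below applies a yeo-johnson power transform to a skewed synthetic feature with scikit-learn and checks that skewness drops; it illustrates the transformation family, not trinity-neo's internal code.

# Standalone sketch of the yeo-johnson power transform
# (the same family of transform used when transformation=True; synthetic data).
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=(500, 1))    # heavily right-skewed feature

pt = PowerTransformer(method='yeo-johnson')            # also standardizes output by default
transformed = pt.fit_transform(skewed)

# Skewness should drop towards 0 after the transform.
print('before:', skew(skewed.ravel()), 'after:', skew(transformed.ravel()))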
Target Transform
Target transformation is similar to feature transformation, except that it changes the shape of the distribution of the target variable instead of the features. This feature is only available in the trinity_neo.regression module.
PARAMETERS
transform_target: bool, default = False
When set to True, the target variable is transformed using the method defined in the transform_target_method parameter. Target transformation is applied separately from feature transformations.

transform_target_method: string, default = 'yeo-johnson'
Defines the method for transformation. By default, the transformation method is set to yeo-johnson. The other available option is quantile. Ignored when transform_target = False.
Example
# load dataset
from trinity_neo.datasets import get_data
diamond = get_data('diamond')
# init setup
from trinity_neo.regression import *
reg1 = setup(data = diamond, target = 'Price', transform_target = True)
Before
Dataframe view before target transformation
After
Dataframe view after target transformation
Feature Engineering
Polynomial Features
In machine learning experiments, the relationship between the dependent and independent variables is often assumed to be linear; however, this is not always the case. Sometimes the relationship between dependent and independent variables is more complex. Creating new polynomial features can help capture that relationship, which might otherwise go unnoticed.
PARAMETERS
polynomial_features: bool, default = False
When set to True, new features are created based on all polynomial combinations that exist within the numeric features in a dataset to the degree defined in the polynomial_degree parameter.

polynomial_degree: int, default = 2
Degree of polynomial features. For example, if an input sample is two dimensional and of the form [a, b], the polynomial features with degree = 2 are: [1, a, b, a^2, ab, b^2].
Example
# load dataset
from trinity_neo.datasets import get_data
juice = get_data('juice')
# init setup
from trinity_neo.classification import *
clf1 = setup(data = juice, target = 'Purchase', polynomial_features = True)
Before
Dataframe view before polynomial features
After
Dataframe view after polynomial features
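The standalone sketch below shows the degree-2 expansion on a single made-up sample [a, b] using scikit-learn, producing exactly the terms listed above.

# Standalone sketch of degree-2 polynomial feature expansion
# (what polynomial_features=True does conceptually; toy data, not trinity-neo internals).
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2., 3.]])                  # one sample of the form [a, b]

poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))              # [[1. 2. 3. 4. 6. 9.]] -> [1, a, b, a^2, ab, b^2]
print(poly.get_feature_names_out(['a', 'b']))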
Group Features
When a dataset contains features that are related to each other in some way, for example features recorded at fixed time intervals, new statistical features such as the mean, median, variance, and standard deviation for a group of such features can be created from the existing features using the group_features parameter.
PARAMETERS
group_features: list or list of list, default = None
When a dataset contains features that have related characteristics, the group_features parameter can be used for statistical feature extraction. For example, if a dataset has numeric features that are related to each other (i.e. 'Col1', 'Col2', 'Col3'), a list containing the column names can be passed under group_features to extract statistical information such as the mean, median, mode, and standard deviation.

group_names: list, default = None
When group_features is passed, the name of the group can be passed into the group_names parameter as a list of strings. The length of the group_names list must equal the length of group_features. When the lengths don't match or the names are not passed, new features are named sequentially, such as group_1, group_2, etc.
Example
# load dataset
from trinity_neo.datasets import get_data
credit = get_data('credit')
# init setup
from trinity_neo.classification import *
clf1 = setup(data = credit, target = 'default', group_features = ['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6'])
Before
Dataframe before group features
After
Dataframe after group features
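As a standalone illustration, the sketch below derives per-row group statistics from a few made-up BILL_AMT columns with plain pandas; trinity-neo builds comparable features internally.

# Standalone sketch of the kind of statistics group_features derives
# from a set of related columns (toy data, not trinity-neo internals).
import pandas as pd

df = pd.DataFrame({
    'BILL_AMT1': [3913, 2682, 29239],
    'BILL_AMT2': [3102, 1725, 14027],
    'BILL_AMT3': [689, 2682, 13559],
})
group = ['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3']

# New per-row statistics computed across the grouped columns.
df['group_1_mean'] = df[group].mean(axis=1)
df['group_1_median'] = df[group].median(axis=1)
df['group_1_std'] = df[group].std(axis=1)
print(df)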
Bin Numeric Features
Feature binning is a method of turning continuous variables into categorical values using a pre-defined number of bins. It is effective when a continuous feature has too many unique values or a few extreme values outside the expected range. Such extreme values influence the trained model and thereby affect its prediction accuracy. In trinity-neo, continuous numeric features can be binned into intervals using the bin_numeric_features parameter. trinity-neo uses the 'sturges' rule to determine the number of bins and uses K-Means clustering to convert continuous numeric features into categorical features.
PARAMETERS
bin_numeric_features: list, default = None
When a list of numeric features is passed, they are transformed into categorical features using K-Means, where values in each bin have the same nearest center of a 1D k-means cluster. The number of clusters is determined based on the 'sturges' method. It is only optimal for Gaussian data and underestimates the number of bins for large non-Gaussian datasets.
Example
# load dataset
from trinity_neo.datasets import get_data
income = get_data('income')
# init setup
from trinity_neo.classification import *
clf1 = setup(data = income, target = 'income >50K', bin_numeric_features = ['age'])
Before
Dataframe view before bin numeric features
After
Dataframe view after numeric bin features
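The standalone sketch below combines the Sturges rule with k-means binning using scikit-learn's KBinsDiscretizer on made-up ages; it mirrors the idea, not trinity-neo's exact implementation.

# Standalone sketch of k-means binning with a Sturges-rule bin count
# (illustrative only; toy data, not trinity-neo internals).
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
age = rng.integers(17, 90, size=(500, 1)).astype(float)

# Sturges' rule: number of bins = 1 + log2(n), rounded up.
n_bins = int(np.ceil(1 + np.log2(len(age))))

binner = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy='kmeans')
age_binned = binner.fit_transform(age)
print(n_bins, np.unique(age_binned))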
Combine Rare Levels
Sometimes a dataset can have one or more categorical features with a very high number of levels (i.e. high-cardinality features). If such features are encoded into numeric values, the resulting matrix is a sparse matrix. This not only slows down the experiment due to the manifold increase in the number of features (and hence the size of the dataset), but also introduces noise into the experiment. A sparse matrix can be avoided by combining the rare levels in the high-cardinality feature(s). This can be achieved in trinity-neo using the rare_to_value parameter.
PARAMETERS
rare_to_value: float or None, default = None
Minimum fraction of category occurrences in a categorical column. If a category is less frequent than rare_to_value * len(X), it is replaced with the string in rare_value. Use this parameter to group rare categories before encoding the column. If None, this step is skipped.

rare_value: str, default = "rare"
Value with which to replace rare categories. Ignored when rare_to_value is None.
Example
# load dataset
from trinity_neo.datasets import get_data
income = get_data('income')
# init setup
from trinity_neo.classification import *
clf1 = setup(data = income, target = 'income >50K', rare_to_value = 0.1)
Before
Dataframe view before combine rare levels
After
Dataframe view after combine rare levels
Effect of combining rare levels
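The standalone pandas sketch below groups categories whose frequency falls under a chosen rare_to_value fraction into a single 'rare' level; the data is made up and the logic is illustrative only.

# Standalone sketch of grouping rare categories before encoding
# (what rare_to_value / rare_value do conceptually; toy data, not trinity-neo internals).
import pandas as pd

s = pd.Series(['US', 'US', 'US', 'India', 'India', 'Peru', 'Fiji', 'Malta', 'US', 'India'])

rare_to_value = 0.2                        # categories below 20% frequency are grouped
freq = s.value_counts(normalize=True)
rare_levels = freq[freq < rare_to_value].index

s_grouped = s.where(~s.isin(rare_levels), 'rare')   # rare_value = "rare"
print(s_grouped.value_counts())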
Feature Selection
Feature Selection
Feature selection is a process used to select the features in the dataset that contribute the most to predicting the target variable. Working with selected features instead of all the features reduces the risk of over-fitting, improves accuracy, and decreases training time. In trinity-neo, this can be achieved using the feature_selection parameter.
PARAMETERS
feature_selection: bool, default = False
When set to True, a subset of features is selected based on a feature importance score determined by feature_selection_estimator.

feature_selection_method: str, default = 'classic'
Algorithm for feature selection. Choose from:
'univariate': Uses sklearn's SelectKBest.
'classic': Uses sklearn's SelectFromModel.
'sequential': Uses sklearn's SequentialFeatureSelector.

feature_selection_estimator: str or sklearn estimator, default = 'lightgbm'
Classifier used to determine the feature importances. The estimator should have a feature_importances_ or coef_ attribute after fitting. If None, it uses LGBMClassifier. This parameter is ignored when feature_selection_method = 'univariate'.

n_features_to_select: int or float, default = 0.2
The maximum number of features to select with feature_selection. If <1, it is the fraction of starting features. Note that this parameter doesn't take features in ignore_features or keep_features into account when counting.
Example
# load dataset
from trinity_neo.datasets import get_data
diabetes = get_data('diabetes')
# init setup
from trinity_neo.classification import *
clf1 = setup(data = diabetes, target = 'Class variable', feature_selection = True)
Before
Dataframe before feature importance
After
Dataframe after feature importance
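As a standalone illustration of the 'classic' method, the sketch below uses scikit-learn's SelectFromModel on synthetic data; a random forest stands in here for the default lightgbm estimator.

# Standalone sketch of model-based feature selection with SelectFromModel
# (illustrative only; a random forest is used instead of the default lightgbm
# estimator; synthetic data, not trinity-neo internals).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=300, n_features=20, n_informative=4, random_state=0)

# Keep roughly the top 20% of features by importance (compare n_features_to_select = 0.2).
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    max_features=int(0.2 * X.shape[1]),
    threshold=-float('inf'),               # rank purely by importance, keep max_features
)
X_selected = selector.fit_transform(X, y)
print(X.shape, '->', X_selected.shape)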
Remove Multicollinearity
Multicollinearity (also called collinearity) is a phenomenon in which one feature variable in the dataset is highly linearly correlated with another feature variable in the same dataset. Multicollinearity increases the variance of the coefficients, making them unstable and noisy for linear models. One way to deal with multicollinearity is to drop one of the two features that are highly correlated with each other. This can be achieved in Trinity-Neo using the remove_multicollinearity parameter.
PARAMETERS
remove_multicollinearity: bool, default = False
When set to True, features with inter-correlations higher than the defined threshold are removed. For each group, all features are removed except the one with the highest correlation to y.

multicollinearity_threshold: float, default = 0.9
Minimum absolute Pearson correlation to identify correlated features. The default value removes equal columns. Ignored when remove_multicollinearity is not True.
Example
# load dataset
from trinity_neo.datasets import get_data
concrete = get_data('concrete')
# init setup
from trinity_neo.regression import *
reg1 = setup(data = concrete, target = 'strength', remove_multicollinearity = True, multicollinearity_threshold = 0.3)
Before
Dataframe view before remove multicollinearity
After
Dataframe view after remove multicollinearity
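The standalone sketch below drops one column from each pair whose absolute Pearson correlation exceeds a threshold, using made-up data; it is a simplified version of the idea (trinity-neo keeps the feature most correlated with the target, this sketch simply drops the later column).

# Standalone sketch of dropping one feature from each highly correlated pair
# (simplified illustration; toy data, not trinity-neo internals).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cement = rng.normal(300, 50, 200)
df = pd.DataFrame({
    'cement': cement,
    'cement_dup': cement * 1.01 + rng.normal(0, 1, 200),   # nearly identical column
    'water': rng.normal(180, 20, 200),
})

threshold = 0.9
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))   # keep upper triangle
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print('dropping:', to_drop)
df_reduced = df.drop(columns=to_drop)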
Principal Component Analysis
Principal Component Analysis (PCA) is an unsupervised technique used in machine learning to reduce the dimensionality of the data. It does so by compressing the feature space and identifying a subspace that captures most of the information in the complete feature matrix. It projects the original feature space into a lower dimensionality.
PARAMETERS
pca: bool, default = False
When set to True, dimensionality reduction is applied to project the data into a lower dimensional space using the method defined in the pca_method parameter.

pca_method: string, default = 'linear'
Method with which to apply PCA. Possible values are:
'linear': Uses Singular Value Decomposition.
'kernel': Dimensionality reduction through the use of an RBF kernel.
'incremental': Similar to 'linear', but more efficient for large datasets.

pca_components: int, float, str or None, default = None
Number of components to keep. This parameter is ignored when pca = False.
If None: All components are kept.
If int: Absolute number of components. It must be strictly less than the original number of features in the dataset.
If float: Such an amount that the variance that needs to be explained is greater than the percentage specified by pca_components. The value should lie between 0 and 1 (only for pca_method = 'linear').
If 'mle': Minka's MLE is used to guess the dimension (only for pca_method = 'linear').
Example
# load dataset
from trinity_neo.datasets import get_data
income = get_data('income')
# init setup
from trinity_neo.classification import *
clf1 = setup(data = income, target = 'income >50K', pca = True, pca_components = 10)
Before
Dataframe view before pca
After
Dataframe view after pca
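The standalone scikit-learn sketch below shows the difference between passing an int and a float as the number of components, using a built-in dataset for illustration; it is not trinity-neo's internal code.

# Standalone sketch of PCA with sklearn
# (pca_method='linear' uses Singular Value Decomposition; illustrative only).
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)        # 64 features

# An int keeps an absolute number of components...
pca_int = PCA(n_components=10).fit(X)
# ...while a float keeps enough components to explain that share of the variance.
pca_float = PCA(n_components=0.99).fit(X)

print(pca_int.transform(X).shape, pca_float.n_components_)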
Ignore Low Variance
Sometimes a dataset may have a categorical feature with multiple levels, where the distribution of such levels is skewed and one level dominates the others. This means there is not much variation in the information provided by such a feature. For an ML model, such a feature may not add a lot of information and can therefore be ignored for modeling. This can be achieved in trinity-neo using the low_variance_threshold parameter.
PARAMETERS
low_variance_threshold: float or None, default = None
Remove features with a training-set variance lower than the provided threshold. If 0, keep all features with non-zero variance, i.e. remove the features that have the same value in all samples. If None, this transformation step is skipped.
Example
# load dataset
from trinity_neo.datasets import get_data
mice = get_data('mice')
# filter dataset
mice = mice[mice['Genotype'] == 'Control']
# init setup
from trinity_neo.classification import *
clf1 = setup(data = mice, target = 'class', low_variance_threshold = 0.1)
Before
Dataframe view before ignore low variance
After
Dataframe view after ignore low variance
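The standalone sketch below shows the equivalent step with scikit-learn's VarianceThreshold on made-up columns; it is illustrative only.

# Standalone sketch of removing low-variance features with VarianceThreshold
# (what low_variance_threshold does conceptually; toy data, not trinity-neo internals).
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({
    'constant':   [1, 1, 1, 1, 1],         # zero variance -> dropped
    'near_const': [0, 0, 0, 0, 1],         # variance 0.16 -> dropped at this threshold
    'useful':     [3, 7, 1, 9, 4],         # variance 8.16 -> kept
})

selector = VarianceThreshold(threshold=0.2)
selector.fit(df)
print(list(df.columns[selector.get_support()]))   # columns kept: ['useful']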