Effectiveness of Standardization and Normalization in Machine Learning

by Manu Joseph   Last Updated January 14, 2018 12:19 PM

I am studying the basics of machine learning and have a question about the standardisation and normalisation of features and how effective they are.

I have read this CrossValidated question and this blog, among many other articles, but I am still not clear on when I should apply which technique.

To make things worse, sklearn offers a StandardScaler, a MinMaxScaler, a MaxAbsScaler, a RobustScaler, and a Normalizer.
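
To make the differences concrete, here is a minimal sketch that applies each of these sklearn classes to a small made-up array (`X_toy` and its values are assumptions for illustration, not SECOM data). It shows that the four scalers work per feature (column), while Normalizer rescales each sample (row) to unit length.

```python
import numpy as np
from sklearn.preprocessing import (
    MaxAbsScaler,
    MinMaxScaler,
    Normalizer,
    RobustScaler,
    StandardScaler,
)

# Toy data (made up): the second column has a much larger range and an outlier.
X_toy = np.array([
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, 300.0],
    [4.0, 1000.0],
])

# Apply every transformer to the same array and keep the results by class name.
scaled = {
    type(s).__name__: s.fit_transform(X_toy)
    for s in (StandardScaler(), MinMaxScaler(), MaxAbsScaler(),
              RobustScaler(), Normalizer())
}

for name, values in scaled.items():
    print(name)
    print(values.round(3))
```

StandardScaler gives each column zero mean and unit variance, MinMaxScaler maps each column to [0, 1], MaxAbsScaler divides each column by its largest absolute value, RobustScaler centers each column on its median and divides by the interquartile range, and Normalizer (with its default L2 norm) divides each row by its own length.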

So I took the SECOM Manufacturing dataset and ran almost every classification algorithm I could find on it, first without scaling and then with each of these scaling methods, and calculated the test accuracy for each combination.

What I got was not in line with most of the things I had read online.

For example:

  1. Methods using Euclidean distance were supposed to be the most affected by scaling, but in fact the accuracy of the KNN algorithm decreased when I scaled the features.

  2. Logistic Regression and SVM, which I understood to use Gradient Descent, were also supposed to show an improvement, but I could not really see that either, especially for SVM (or SVC, the classification variant).

  3. For most of the algorithms, RobustScaler gives lower accuracy and Normalizer gives higher accuracy.
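
As a small illustration of why Euclidean-distance methods are expected to react to scaling (point 1), the sketch below uses three made-up points and assumed feature ranges. It only shows that rescaling can change which neighbour is nearest; whether that raises or lowers accuracy depends on which features carry the signal, so it does not by itself explain the SECOM results.

```python
import numpy as np

# Made-up points: feature 0 lives in roughly [0, 1], feature 1 in roughly [0, 1000].
query = np.array([0.5, 500.0])
candidates = {
    "a": np.array([0.5, 520.0]),  # identical to the query in feature 0
    "b": np.array([0.9, 505.0]),  # closer in feature 1, far away in feature 0
}

def nearest(point, others):
    """Return the name of the closest candidate and all Euclidean distances."""
    dists = {name: float(np.linalg.norm(point - p)) for name, p in others.items()}
    return min(dists, key=dists.get), dists

# Raw features: the large-range feature dominates the distance.
raw_nearest, raw_dists = nearest(query, candidates)

# Assumed feature ranges (hypothetical), used here as a simple min-max style rescaling.
feature_range = np.array([1.0, 1000.0])
scaled_candidates = {name: p / feature_range for name, p in candidates.items()}
scaled_nearest, scaled_dists = nearest(query / feature_range, scaled_candidates)

print("nearest before scaling:", raw_nearest, raw_dists)
print("nearest after scaling: ", scaled_nearest, scaled_dists)
```

Before scaling, `b` is the nearest neighbour because only the second feature matters numerically; after scaling, `a` is nearest because both features contribute on a comparable scale.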

Please find below the results of the experiment I ran. Can you please tell me whether I am doing something wrong or interpreting the results incorrectly?

[Image: results of the experiment (not reproduced here)]

The code I used:

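import pandas as pd
import xgboost as xgb
import lightgbm as lgb
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.linear_model import (LogisticRegressionCV, PassiveAggressiveClassifier,
                                  Perceptron, RidgeClassifierCV, SGDClassifier)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, Normalizer, RobustScaler
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier

def benchmark_ml_algo_v2(df_features, df_target):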
    #Machine Learning Algorithm (MLA) Selection and initialization
    MLA = [
        #Ensemble Methods
        AdaBoostClassifier(),
        BaggingClassifier(),
        ExtraTreesClassifier(),
        GradientBoostingClassifier(),
        RandomForestClassifier(n_estimators = 100),
        xgb.XGBClassifier(),
        lgb.LGBMClassifier(),
    #     CatBoostClassifier(),
        #Gaussian Processes
        GaussianProcessClassifier(),

        #GLM
        LogisticRegressionCV(),
        PassiveAggressiveClassifier(max_iter=1000),
        RidgeClassifierCV(),
        SGDClassifier(max_iter= 1000),
        Perceptron(max_iter=1000),

        #Naive Bayes
        GaussianNB(),

        #Nearest Neighbor
        KNeighborsClassifier(n_neighbors = 3),

        #SVM
        SVC(probability=True),
        LinearSVC(),

        #Trees    
        DecisionTreeClassifier(),
        ExtraTreeClassifier(),

        ]

    scalers = [
        None,
        MinMaxScaler(),
        MaxAbsScaler(),
        RobustScaler(),
        Normalizer()
    ]

    import time

    #create table to compare MLA
    MLA_columns = ['MLA Name','Scaler', 'MLA Test Accuracy','MLA Time']
    MLA_compare = pd.DataFrame(columns = MLA_columns)

    train_x, test_x, train_y, test_y = train_test_split(df_features, df_target,random_state  = 40)

    #index through MLA and save performance to table
    row_index = 0
    for alg in MLA:
        print ("Running {}...".format(alg.__class__.__name__))
        for scaler in scalers:

            #set name and parameters
            MLA_compare.loc[row_index, 'MLA Name'] = alg.__class__.__name__
            MLA_compare.loc[row_index, 'Scaler'] = scaler.__class__.__name__

            #time.clock() was removed in Python 3.8; perf_counter() is the replacement
            start = time.perf_counter()

            #fit the scaler (if any) on the training data, then fit the model
            if scaler:
                alg.fit(X=scaler.fit_transform(train_x), y=train_y.values.ravel())
            else:
                alg.fit(X=train_x, y=train_y.values.ravel())
            MLA_compare.loc[row_index, 'MLA Time'] = time.perf_counter() - start

            #score on the test set, reusing the scaler fitted on the training data
            #(refitting it on the test data would scale train and test differently)
            if scaler:
                MLA_compare.loc[row_index, 'MLA Test Accuracy'] = alg.score(X=scaler.transform(test_x), y=test_y)
            else:
                MLA_compare.loc[row_index, 'MLA Test Accuracy'] = alg.score(X=test_x, y=test_y)

            row_index+=1


    #print and sort table: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html
    return MLA_compare.sort_values(by = ['MLA Test Accuracy'], ascending = False)
    #print(MLA_compare)
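
One common way to keep the scaler and the model in step is sklearn's `Pipeline`, which fits the scaler on the training data only and reuses those statistics when scoring. The sketch below is a hedged illustration, not the experiment above: it uses synthetic data from `make_classification` as a stand-in for SECOM, and the names `pipe`, `test_accuracy`, and `cv_scores` are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; replace with the real features and target).
X, y = make_classification(n_samples=300, n_features=10, random_state=40)
train_x, test_x, train_y, test_y = train_test_split(X, y, random_state=40)

# The pipeline fits the scaler on whatever data it is trained on,
# then applies the same fitted transform at prediction time.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])

pipe.fit(train_x, train_y)
test_accuracy = pipe.score(test_x, test_y)
fitted_scaler = pipe.named_steps["scaler"]

# With cross-validation, the scaler is refit inside each training fold.
cv_scores = cross_val_score(pipe, X, y, cv=5)

print("test accuracy:", test_accuracy)
print("cross-validated accuracy:", cv_scores.mean())
```

Averaging over several folds also gives a less noisy comparison between scalers than a single train/test split.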

