Sklearn DecisionTreeClassifier F-Score Different Results with Each run























I'm trying to train a decision tree classifier using Python. I'm using MinMaxScaler() to scale the data, and f1_score as my evaluation metric. Strangely, the model gives me different results on each run, although they follow a pattern.



data in my code is a (2000, 7) pandas.DataFrame, with 6 feature columns and the last column being the target value. Columns 1, 3, and 5 are categorical data.



The following code is what I did to preprocess and format my data:



import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import f1_score


# Data Preprocessing Step
# =============================================================================
data = pd.read_csv("./data/train.csv")

X = data.iloc[:, :-1]
y = data.iloc[:, 6]

# Choose which columns are categorical data, and convert them to numeric data.
labelenc = LabelEncoder()
categorical_data = list(data.select_dtypes(include='object').columns)

for col in categorical_data:
    X[col] = labelenc.fit_transform(X[col])


# Convert categorical numeric data to one-of-K data, and change y from Series to ndarray.
onehotenc = OneHotEncoder()
X = onehotenc.fit_transform(X).toarray()
y = y.values

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

# Fit the scaler on the training data only, then apply the same scaling to the validation data.
min_max_scaler = MinMaxScaler()
X_train_scaled = min_max_scaler.fit_transform(X_train)
X_val_scaled = min_max_scaler.transform(X_val)





The next code is for the actual decision tree model training:



dectree = DecisionTreeClassifier(class_weight='balanced')
dectree = dectree.fit(X_train_scaled, y_train)
predictions = dectree.predict(X_val_scaled)
score = f1_score(y_val, predictions, average='macro')

print("Score is = {}".format(score))





The output that I get (i.e. the score) varies, but in a pattern. For example, it fluctuates between 0.39 and 0.42.



On some iterations, I even get the UndefinedMetricWarning, that claims "F-score is ill-defined and being set to 0.0 in labels with no predicted samples."
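For readers unfamiliar with the warning, the following is a minimal hand-rolled sketch of a per-label F1 computation. It is not scikit-learn's implementation; the function name `per_label_f1` and the toy labels are made up for illustration. It shows the case the warning describes: a label that is never predicted has an undefined precision (0/0), and its F1 is set to 0.0, which drags the macro average down.

```python
def per_label_f1(y_true, y_pred):
    """Return {label: F1} computed from first principles (illustrative helper)."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # With no predicted samples, tp + fp == 0 and precision is 0/0.
        # The warning quoted above says the score is set to 0.0 in that case;
        # this sketch mirrors that choice.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


# Toy data: label 2 exists in the true labels but is never predicted.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 1, 0]

scores = per_label_f1(y_true, y_pred)
macro_f1 = sum(scores.values()) / len(scores)
print(scores, macro_f1)
```

Whether a given run hits this case depends on which samples land in the validation set and on what the fitted tree predicts, which is why the warning can appear on some runs and not others.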



I'm familiar with what the UndefinedMetricWarning means, after doing some searching on this community and Google. My two questions can be summarized as:







  1. Why does my output vary for each iteration? Is there something in the preprocessing stage that happens which I'm not aware of?


  2. I've also tried to use the F-score with other data splits, but I always get the warning. Is this unpreventable?



Thank you.






























  • 1




    try with dectree = DecisionTreeClassifier(random_state=42)
    – Sociopath
    Nov 22 at 16:04










  • Hello. Thanks for the comment. I've tried with that and noticed that I no longer get the warning message and also get a consistent value. May I ask why 42 though? Also, I've noticed that some values for random_state produce the warning message but others don't... would you happen to know why?
    – Seankala
    Nov 22 at 16:09






  • 1




    Use random_state everywhere it is applicable. In your case, that is train_test_split() and DecisionTreeClassifier(). Also, use the stratify option in train_test_split() to keep the class proportions balanced between the train and test data (which may help in avoiding the UndefinedMetricWarning).
    – Vivek Kumar
    Nov 23 at 5:58










  • Thanks for the feedback @VivekKumar. I didn't know about these methods. I'll make sure to keep them in mind next time. :)
    – Seankala
    Nov 24 at 2:10















python machine-learning scikit-learn














edited Nov 22 at 16:06

























asked Nov 22 at 16:00









Seankala









1 Answer



accepted










train_test_split randomly divides the dataset into training and test sets. Because the model is trained on different data and evaluated on different data every time you run the script, the F-score varies from run to run, depending on how well the model fits that particular split.



To reproduce the same result on every run, use the random_state parameter. It seeds the random number generator, so the same sequence of random numbers is produced each time. The value can be any integer. Set it on DecisionTreeClassifier as well, since the tree also uses randomness when it chooses splits.



#train test split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=13)

#Decision tree model
dectree = DecisionTreeClassifier(class_weight='balanced', random_state=2018)
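As a quick check of the claim above, here is a minimal sketch that runs the split and the fit twice with the same seeds and compares the scores. It uses synthetic data from `make_classification` as a stand-in for the asker's CSV, and the helper name `run_once` is made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def run_once(split_seed, tree_seed):
    """Split, fit, and score once with the given seeds (illustrative helper)."""
    # Synthetic 3-class data standing in for the real dataset.
    X, y = make_classification(
        n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=split_seed
    )
    tree = DecisionTreeClassifier(class_weight="balanced", random_state=tree_seed)
    tree.fit(X_train, y_train)
    return f1_score(y_val, tree.predict(X_val), average="macro")


first = run_once(split_seed=13, tree_seed=2018)
second = run_once(split_seed=13, tree_seed=2018)
print(first, second, first == second)
```

With both seeds fixed, the two runs use the same split and build the same tree, so the scores match. A fixed seed makes a result repeatable, not better: a different seed may give a higher or lower score on the same data.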





        answered Nov 22 at 16:15









        Naveen






























