Sklearn DecisionTreeClassifier F-Score Different Results with Each run
I'm trying to train a decision tree classifier in Python. I'm using MinMaxScaler() to scale the data and f1_score as my evaluation metric. The strange thing is that my model gives different results, in a pattern, on each run.
data in my code is a (2000, 7) pandas.DataFrame with 6 feature columns; the last column is the target value. Columns 1, 3, and 5 are categorical.
The following code preprocesses and formats my data:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import f1_score
# Data Preprocessing Step
# =============================================================================
data = pd.read_csv("./data/train.csv")
X = data.iloc[:, :-1]
y = data.iloc[:, 6]
# Choose which columns are categorical data, and convert them to numeric data.
labelenc = LabelEncoder()
categorical_data = list(data.select_dtypes(include='object').columns)
for i in range(len(categorical_data)):
    X[categorical_data[i]] = labelenc.fit_transform(X[categorical_data[i]])
# Convert categorical numeric data to one-of-K data, and change y from Series to ndarray.
onehotenc = OneHotEncoder()
X = onehotenc.fit_transform(X).toarray()
y = y.values
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
min_max_scaler = MinMaxScaler()
X_train_scaled = min_max_scaler.fit_transform(X_train)
X_val_scaled = min_max_scaler.transform(X_val)  # transform only; re-fitting on validation data would leak its statistics
The next code is for the actual decision tree model training:
dectree = DecisionTreeClassifier(class_weight='balanced')
dectree = dectree.fit(X_train_scaled, y_train)
predictions = dectree.predict(X_val_scaled)
score = f1_score(y_val, predictions, average='macro')
print("Score is = {}".format(score))
The output that I get (i.e., the score) varies, but in a pattern. For example, it fluctuates within the range of 0.39 to 0.42.
On some iterations, I even get an UndefinedMetricWarning, which says "F-score is ill-defined and being set to 0.0 in labels with no predicted samples."
I'm familiar with what the UndefinedMetricWarning means, after some searching on this community and Google. I guess my two questions can be summarized as:
1. Why does my output vary on each run? Is something happening in the preprocessing stage that I'm not aware of?
2. I've also tried the F-score with other data splits, but I always get the warning. Is this unavoidable?
Thank you.
python machine-learning scikit-learn
try with dectree = DecisionTreeClassifier(random_state=42)
– Sociopath, Nov 22 at 16:04
Hello. Thanks for the comment. I've tried that and noticed that I no longer get the warning message and also get a consistent value. May I ask why 42, though? Also, I've noticed that some values for random_state produce the warning message but others don't... would you happen to know why?
– Seankala, Nov 22 at 16:09
Use random_state everywhere it is applicable. In your case, that's train_test_split() and DecisionTreeClassifier(). Also, use the stratify option in train_test_split() to get a balanced split of classes between train and test data (which may help avoid the UndefinedMetricWarning).
– Vivek Kumar, Nov 23 at 5:58
Thanks for the feedback @VivekKumar. I didn't know about these methods. I'll make sure to keep them in mind next time. :)
– Seankala, Nov 24 at 2:10
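Vivek Kumar's stratify suggestion can be sketched on synthetic data like this (the array sizes, class ratio, and seed are illustrative assumptions, not from the thread):

```python
# Fix random_state and stratify on y so each class keeps the same
# proportion in the train and validation sets.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = np.array([0] * 90 + [1] * 10)  # imbalanced: 10% positive class

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=13, stratify=y
)
# Stratification preserves the 90/10 class ratio in both splits,
# so the rare class cannot vanish from the validation set.
print((y_va == 1).sum())  # → 2 positives out of 20
```

Without stratify, an unlucky split can leave a rare class with no validation samples at all, which is exactly the situation that triggers the UndefinedMetricWarning.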
edited Nov 22 at 16:06
asked Nov 22 at 16:00
Seankala
1 Answer
You are splitting the dataset into train and test sets, and train_test_split randomly divides the data each time it runs. Because of this, every run trains the model on different training data and evaluates it on different test data, so you get a range of F-scores depending on how well the model happened to be trained.
To get the same result on every run, use the random_state parameter. It seeds the random number generator so that the random numbers are generated in the same order each time. It can be any number.
# train/test split with a fixed seed
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=13)
# decision tree model with a fixed seed
dectree = DecisionTreeClassifier(class_weight='balanced', random_state=2018)
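A quick sketch on synthetic data shows that fixing random_state in both train_test_split and DecisionTreeClassifier makes the score identical across runs (the data shape and seed values here are illustrative assumptions):

```python
# Verify reproducibility: two independent runs with the same seeds
# must produce exactly the same macro F1 score.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

rng = np.random.RandomState(0)
X = rng.rand(200, 6)
y = rng.randint(0, 2, 200)

def run():
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, y, test_size=0.2, random_state=13, stratify=y
    )
    tree = DecisionTreeClassifier(class_weight='balanced', random_state=2018)
    tree.fit(X_tr, y_tr)
    return f1_score(y_va, tree.predict(X_va), average='macro')

assert run() == run()  # identical score on every run
```

Note that both seeds matter: train_test_split controls which rows land in each split, while DecisionTreeClassifier's random_state controls tie-breaking when several splits look equally good during tree construction.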
answered Nov 22 at 16:15
Naveen