finding rows with one difference in DataFrame
I have a data set where a number of rows are nearly identical, meaning they have the same values for all fields except column C.
     A   B         C      D   .....  Z
0    50  'Ohio'    'Rep'  3          45
1    50  'Ohio'    'Dem'  3          45
2    40  'Kansas'  'Dem'  34         1
3    30  'Kansas'  'Dem'  45         2
4    55  'Texas'   'Rep'  2          7
....
38   55  'Texas'   'Dem'  2          7
I would like to identify all rows that are identical except for column C, but within column C I only want to find combinations of 'Rep' and 'Dem'. So I don't want two identical rows where column C is, for instance, 'Rep' and 'Rep'. The expected result would be:
     A   B         C      D   .....  Z
0    50  'Ohio'    'Rep'  3          45
1    50  'Ohio'    'Dem'  3          45
4    55  'Texas'   'Rep'  2          7
38   55  'Texas'   'Dem'  2          7
I have used the duplicated method on all columns (but C), and that returns all the rows that are identical. However, it does not guarantee that each duplicated row with 'Rep' has exactly one matching duplicated row with 'Dem'.
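For illustration, here is a minimal sketch of the behaviour described above. It assumes only the columns A, B, C, D and Z, and it adds a hypothetical pair of identical 'Rep' rows (index 39 and 40, 'Iowa') that is not part of the original data, so the unwanted case is visible:

```python
import pandas as pd

# Sample rows from the question, plus a hypothetical Rep/Rep pair (index 39, 40)
df = pd.DataFrame(
    {
        "A": [50, 50, 40, 30, 55, 55, 60, 60],
        "B": ["Ohio", "Ohio", "Kansas", "Kansas", "Texas", "Texas", "Iowa", "Iowa"],
        "C": ["Rep", "Dem", "Dem", "Dem", "Rep", "Dem", "Rep", "Rep"],
        "D": [3, 3, 34, 45, 2, 2, 9, 9],
        "Z": [45, 45, 1, 2, 7, 7, 5, 5],
    },
    index=[0, 1, 2, 3, 4, 38, 39, 40],
)

# Every column except C defines what "identical" means here
cols = df.columns.difference(["C"]).tolist()

# keep=False marks every member of a duplicated group, not only the repeats
dup_mask = df.duplicated(subset=cols, keep=False)
flagged = df[dup_mask]

# The Rep/Rep pair (39, 40) is flagged alongside the wanted Rep/Dem pairs
print(flagged)
```

The mask alone says nothing about which values of C occur inside each group, which is why an extra per-group condition is needed.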
pandas dataframe duplicates
edited Nov 22 at 16:45
asked Nov 22 at 13:54
Mike
2 Answers
Get all columns except C with difference and convert them to a list, then sort by column C with sort_values and convert the values of C to tuples per group. Finally, join the result to the original DataFrame, compare against ('Dem','Rep') and filter by boolean indexing:
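# All columns except C; these define which rows count as identical
cols = df.columns.difference(['C']).tolist()
# Per group, collect the sorted C values as a tuple and test for exactly one 'Dem' and one 'Rep'
s = df.sort_values('C').groupby(cols)['C'].apply(tuple).rename('m') == ('Dem','Rep')
# Map the per-group flag back onto the rows and keep the matching ones
df = df[df.join(s, on=cols)['m']]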
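To see what the intermediate per-group tuples look like, here is a self-contained sketch of the same idea. The 'Iowa' rows (index 39, 40) are a hypothetical Rep/Rep pair added for illustration, and the comparison is written with map and a lambda so that each tuple is compared as a whole; that is a choice made for this sketch, not a change to the approach:

```python
import pandas as pd

# Sample rows from the question, plus a hypothetical Rep/Rep pair (index 39, 40)
df = pd.DataFrame(
    {
        "A": [50, 50, 40, 30, 55, 55, 60, 60],
        "B": ["Ohio", "Ohio", "Kansas", "Kansas", "Texas", "Texas", "Iowa", "Iowa"],
        "C": ["Rep", "Dem", "Dem", "Dem", "Rep", "Dem", "Rep", "Rep"],
        "D": [3, 3, 34, 45, 2, 2, 9, 9],
        "Z": [45, 45, 1, 2, 7, 7, 5, 5],
    },
    index=[0, 1, 2, 3, 4, 38, 39, 40],
)

cols = df.columns.difference(["C"]).tolist()

# One tuple of sorted C values per group of otherwise identical rows,
# e.g. ('Dem', 'Rep') for Ohio and ('Rep', 'Rep') for the hypothetical Iowa pair
per_group = df.sort_values("C").groupby(cols)["C"].apply(tuple)

# A group qualifies only if it holds exactly one 'Dem' and one 'Rep'
is_pair = per_group.map(lambda t: t == ("Dem", "Rep")).rename("m")

# Broadcast the per-group flag back to the rows and filter
row_mask = df.join(is_pair, on=cols)["m"].astype(bool)
result = df[row_mask]
print(result)
```

Sorting by C first is what makes the tuple order predictable, so a single comparison against ('Dem', 'Rep') is enough.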
Another solution is to compare by sets. Because a group may contain repeated values such as Rep,Dem,Dem, chain the set condition with a check on the group size:
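g = df.groupby(cols)['C']
# True for rows whose group of identical rows has exactly two members
m1 = g.transform('size') == 2
# True for rows whose group contains exactly the values 'Rep' and 'Dem'
m2 = g.transform(lambda x: set(x) == set(['Rep','Dem']))
# Keep only rows that satisfy both conditions
df = df[m1 & m2]
print (df)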
     A   B        C    D  Z
0    50  'Ohio'   Rep  3  45
1    50  'Ohio'   Dem  3  45
4    55  'Texas'  Rep  2  7
38   55  'Texas'  Dem  2  7
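A runnable sketch of why the size check matters, assuming a hypothetical 'Utah' group with the values Rep,Dem,Dem (index 39 to 41) that is not part of the original data. The set condition alone accepts that group; adding the size condition rejects it:

```python
import pandas as pd

# Sample rows from the question, plus a hypothetical Rep/Dem/Dem group (index 39-41)
df = pd.DataFrame(
    {
        "A": [50, 50, 40, 30, 55, 55, 70, 70, 70],
        "B": ["Ohio", "Ohio", "Kansas", "Kansas", "Texas", "Texas", "Utah", "Utah", "Utah"],
        "C": ["Rep", "Dem", "Dem", "Dem", "Rep", "Dem", "Rep", "Dem", "Dem"],
        "D": [3, 3, 34, 45, 2, 2, 8, 8, 8],
        "Z": [45, 45, 1, 2, 7, 7, 6, 6, 6],
    },
    index=[0, 1, 2, 3, 4, 38, 39, 40, 41],
)

cols = df.columns.difference(["C"]).tolist()
g = df.groupby(cols)["C"]

# Group has exactly two rows
m1 = g.transform("size") == 2
# Group contains exactly the values 'Rep' and 'Dem' (repeats are invisible to a set)
m2 = g.transform(lambda x: set(x) == {"Rep", "Dem"}).astype(bool)

set_only = df[m2]        # still includes the Rep/Dem/Dem group
strict = df[m1 & m2]     # only true one-Rep-one-Dem pairs
print(strict)
```

Both masks are aligned with the original index, so they can be combined with & and used directly for boolean indexing.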
edited Nov 22 at 14:28
answered Nov 22 at 14:00
jezrael
You can use duplicated with the argument keep set to False to create a mask for the rows that are duplicated once column C has been dropped, and use isin to filter the rows that have any of ['Rep','Dem'] in them:
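# Flag rows that are identical once column C is ignored
mask = df.drop(['C'], axis = 1).duplicated(keep=False)
# Combine both conditions in one mask so the boolean index stays aligned with df
df[mask & df['C'].isin(['Rep','Dem'])].drop_duplicates()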
    A   B        C      D  Z
0   50  'Ohio'   'Rep'  3  45
1   50  'Ohio'   'Dem'  3  45
4   55  'Texas'  'Rep'  2  7
5   55  'Texas'  'Dem'  2  7
edited Nov 22 at 14:30
answered Nov 22 at 13:59
nixon
isin cannot be used here, because it also selects 'Rep' and 'Rep'.
– jezrael
Nov 22 at 14:23
True. I guess adding drop_duplicates() at the end solves it for the cases where there's only Rep and Rep in C?
– nixon
Nov 22 at 14:28
Hmm, not sure if a general solution with isin is possible...
– jezrael
Nov 22 at 14:30
From what I understand, @Mike only wants rows that are identical except for column C, where two instances of Rep and Dem cannot be found, only one of each. So adding a drop_duplicates should solve it in my opinion. Thanks for pointing it out though!
– nixon
Nov 22 at 14:37
The OP needs: "I would like to identify all rows that are identical except for column C, but within column C I only want to find combinations of 'Rep' and 'Dem'. So I don't want 2 identical rows with column C for instance being 'Rep' and 'Rep'."
– jezrael
Nov 22 at 14:38
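The point under discussion in these comments can be checked with a small sketch. It assumes a hypothetical pair of fully identical 'Rep' rows (index 39 and 40, 'Iowa') that is not in the original data. With that pair present, the isin filter followed by drop_duplicates() removes one of the two rows but still leaves a 'Rep' row that has no 'Dem' partner:

```python
import pandas as pd

# Sample rows from the question, plus a hypothetical Rep/Rep pair (index 39, 40)
df = pd.DataFrame(
    {
        "A": [50, 50, 40, 30, 55, 55, 60, 60],
        "B": ["Ohio", "Ohio", "Kansas", "Kansas", "Texas", "Texas", "Iowa", "Iowa"],
        "C": ["Rep", "Dem", "Dem", "Dem", "Rep", "Dem", "Rep", "Rep"],
        "D": [3, 3, 34, 45, 2, 2, 9, 9],
        "Z": [45, 45, 1, 2, 7, 7, 5, 5],
    },
    index=[0, 1, 2, 3, 4, 38, 39, 40],
)

# Rows that are identical once column C is ignored
mask = df.drop(columns=["C"]).duplicated(keep=False)

# isin only checks each row's own value of C, not what its partner row holds
candidate = df[mask & df["C"].isin(["Rep", "Dem"])].drop_duplicates()

# Row 40 is dropped as an exact duplicate of row 39, but row 39 remains
print(candidate)
```

Here isin tests each row in isolation, so it cannot express a condition about the other rows in the same group; that requires a per-group check such as a groupby.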