Finding rows with one difference in DataFrame

I have a data set where a number of rows are nearly identical, meaning they have the same values for all fields except column C.

     A    B         C      D  .....  Z
0    50  'Ohio'    'Rep'   3         45
1    50  'Ohio'    'Dem'   3         45
2    40  'Kansas'  'Dem'   34        1
3    30  'Kansas'  'Dem'   45        2
4    55  'Texas'   'Rep'   2         7
....
38   55  'Texas'   'Dem'   2         7

I would like to identify all rows that are identical except for column C, but within column C I only want to find combinations of 'Rep' and 'Dem'. So I don't want two identical rows where column C is, for instance, 'Rep' and 'Rep'. For the sample above, the result should be:

     A    B         C      D  .....  Z
0    50  'Ohio'    'Rep'   3         45
1    50  'Ohio'    'Dem'   3         45
4    55  'Texas'   'Rep'   2         7
38   55  'Texas'   'Dem'   2         7

I have used the duplicated method on all columns except C, and that returns all the rows that are identical. However, it does not guarantee that each duplicated row with 'Rep' is matched by exactly one duplicated row with 'Dem'.
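To make the problem reproducible, here is a minimal sketch of the sample data and of the duplicated attempt described above. The columns are cut down to A, B, C, D and Z, and the two Florida rows are hypothetical: they are not in the original data and are added only to show an identical 'Rep'/'Rep' pair, which is the case that should be excluded.

```python
import pandas as pd

# Sample rows from the question, reduced to columns A, B, C, D, Z.
# Rows 5 and 6 (Florida) are hypothetical: an identical pair whose
# C values are 'Rep' and 'Rep', i.e. the unwanted case.
df = pd.DataFrame(
    {
        "A": [50, 50, 40, 30, 55, 20, 20, 55],
        "B": ["Ohio", "Ohio", "Kansas", "Kansas", "Texas",
              "Florida", "Florida", "Texas"],
        "C": ["Rep", "Dem", "Dem", "Dem", "Rep", "Rep", "Rep", "Dem"],
        "D": [3, 3, 34, 45, 2, 9, 9, 2],
        "Z": [45, 45, 1, 2, 7, 8, 8, 7],
    },
    index=[0, 1, 2, 3, 4, 5, 6, 38],
)

# Every column except C.
other_cols = df.columns.difference(["C"]).tolist()

# Rows that share all non-C values with at least one other row.
dupes = df[df.duplicated(subset=other_cols, keep=False)]
print(dupes)
```

With this data, dupes contains the Ohio and Texas pairs but also the Florida 'Rep'/'Rep' pair, so duplicated alone is not enough.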

edited Nov 22 at 16:45

asked Nov 22 at 13:54

Mike

65121123

pandas dataframe duplicates

2 Answers
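Get all columns except C with difference and convert them to a list. Then sort with sort_values on column C, group by the remaining columns and convert each group's C values to a tuple. Finally, compare the tuples with ('Dem','Rep'), join the result back to the original and filter by boolean indexing: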

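# all columns except C
cols = df.columns.difference(['C']).tolist()
# per group of identical non-C values: sorted C values as a tuple, compared with ('Dem','Rep')
s = df.sort_values('C').groupby(cols)['C'].apply(tuple).rename('m') == ('Dem','Rep')
# join the per-group flag back to the rows and filter
df = df[df.join(s, on=cols)['m']]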

Another solution is to compare by sets. Because a group may contain repeated values, such as Rep, Dem, Dem, chain the set condition with a condition on the group size:

g = df.groupby(cols)['C']
m1 = g.transform('size') == 2
m2 = g.transform(lambda x: set(x) == set(['Rep','Dem']))

df = df[m1 & m2]
print (df)

     A        B    C  D   Z
0   50   'Ohio'  Rep  3  45
1   50   'Ohio'  Dem  3  45
4   55  'Texas'  Rep  2   7
38  55  'Texas'  Dem  2   7
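For anyone who wants to try the size-plus-set approach end to end, here is a self-contained sketch. The Florida and Utah rows are hypothetical additions, not part of the question's data: Florida is an identical 'Rep'/'Rep' pair and Utah is a 'Rep'/'Dem'/'Dem' triple, so both conditions get exercised. The .astype(bool) call is a defensive assumption in case transform returns an object-dtype result.

```python
import pandas as pd

# Question's sample rows plus hypothetical edge cases:
# Florida (rows 5, 6): identical pair with 'Rep' and 'Rep'.
# Utah (rows 7, 8, 9): identical triple with 'Rep', 'Dem', 'Dem'.
df = pd.DataFrame(
    {
        "A": [50, 50, 40, 30, 55, 20, 20, 60, 60, 60, 55],
        "B": ["Ohio", "Ohio", "Kansas", "Kansas", "Texas", "Florida",
              "Florida", "Utah", "Utah", "Utah", "Texas"],
        "C": ["Rep", "Dem", "Dem", "Dem", "Rep", "Rep",
              "Rep", "Rep", "Dem", "Dem", "Dem"],
        "D": [3, 3, 34, 45, 2, 9, 9, 1, 1, 1, 2],
        "Z": [45, 45, 1, 2, 7, 8, 8, 5, 5, 5, 7],
    },
    index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 38],
)

# Group rows that agree on every column except C.
cols = df.columns.difference(["C"]).tolist()
g = df.groupby(cols)["C"]

# Condition 1: the group has exactly two rows (rejects the Utah triple).
m1 = g.transform("size") == 2

# Condition 2: the group's C values are exactly {'Rep', 'Dem'}
# (rejects the Florida 'Rep'/'Rep' pair). The scalar returned by the
# lambda is broadcast to every row of the group.
m2 = g.transform(lambda x: set(x) == {"Rep", "Dem"}).astype(bool)

result = df[m1 & m2]
print(result)
```

Only the Ohio and Texas pairs survive both masks. Dropping m1 would let the Utah triple through, and dropping m2 would let the Florida pair through.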

edited Nov 22 at 14:28

answered Nov 22 at 14:00

jezrael

314k21252330


You can use duplicated with the argument keep set to False to create a mask for the rows that are duplicated once column C is dropped, and use isin to keep the rows whose C value is any of ['Rep','Dem']:

mask = df.drop(['C'], axis=1).duplicated(keep=False)
# combine both conditions in one mask instead of chaining two boolean selections
df[mask & df['C'].isin(['Rep','Dem'])].drop_duplicates()

      A        B      C  D   Z
0  50   'Ohio'  'Rep'  3  45
1  50   'Ohio'  'Dem'  3  45
4  55  'Texas'  'Rep'  2   7
5  55  'Texas'  'Dem'  2   7

edited Nov 22 at 14:30

answered Nov 22 at 13:59

nixon

1,961117

isin cannot be used here, because it also selects 'Rep' and 'Rep'.
– jezrael
Nov 22 at 14:23

True. I guess adding drop_duplicates() at the end solves it for the cases where there's only Rep and Rep in C?
– nixon
Nov 22 at 14:28

Hmm, not sure if a general solution is possible with isin...
– jezrael
Nov 22 at 14:30

From what I understand, @Mike only wants rows that are identical except for column C, where two instances of Rep and Dem cannot be found, only one of each. So adding a drop_duplicates should solve it in my opinion. Thanks for pointing it out though!
– nixon
Nov 22 at 14:37

The OP needs: "I would like to identify all rows that are identical except for column C, but within column C I only want to find combinations of 'Rep' and 'Dem'. So I don't want 2 identical rows with column C for instance being 'Rep' and 'Rep'"
– jezrael
Nov 22 at 14:38
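The point raised in these comments can be checked with a small sketch. The data here is hypothetical: one 'Rep'/'Dem' pair for Ohio and one fully identical 'Rep'/'Rep' pair for Florida, with the columns cut down to A, B, C, D and Z.

```python
import pandas as pd

# Hypothetical data: Ohio is a valid 'Rep'/'Dem' pair,
# Florida is an identical pair with 'Rep' and 'Rep'.
df = pd.DataFrame(
    {
        "A": [50, 50, 20, 20],
        "B": ["Ohio", "Ohio", "Florida", "Florida"],
        "C": ["Rep", "Dem", "Rep", "Rep"],
        "D": [3, 3, 9, 9],
        "Z": [45, 45, 8, 8],
    }
)

# Rows duplicated on every column except C.
mask = df.drop(columns="C").duplicated(keep=False)

# isin keeps every row whose C is 'Rep' or 'Dem'; drop_duplicates then
# removes only the second of the two fully identical Florida rows.
out = df[mask & df["C"].isin(["Rep", "Dem"])].drop_duplicates()
print(out)
```

One Florida 'Rep' row is still in out even though it has no 'Dem' partner, so under these assumptions drop_duplicates only shrinks the unwanted pair rather than removing it.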
