Finding rows with one difference in a DataFrame























I have a data set where a number of rows are nearly identical: they have the same values in every field except column C.

     A   B         C      D   ...  Z
0    50  'Ohio'    'Rep'  3   ...  45
1    50  'Ohio'    'Dem'  3   ...  45
2    40  'Kansas'  'Dem'  34  ...  1
3    30  'Kansas'  'Dem'  45  ...  2
4    55  'Texas'   'Rep'  2   ...  7
...
38   55  'Texas'   'Dem'  2   ...  7

I would like to identify all rows that are identical except for column C, but within column C I only want to find combinations of 'Rep' and 'Dem'. For instance, I don't want two otherwise identical rows whose column C values are 'Rep' and 'Rep'. The expected result is:

     A   B         C      D   ...  Z
0    50  'Ohio'    'Rep'  3   ...  45
1    50  'Ohio'    'Dem'  3   ...  45
4    55  'Texas'   'Rep'  2   ...  7
38   55  'Texas'   'Dem'  2   ...  7

I have used the duplicated method on all columns except C, and that returns all the rows that are identical in those columns. However, it does not guarantee that each duplicated row with 'Rep' has exactly one matching duplicated row with 'Dem'.
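To make the problem concrete, here is a minimal sketch that rebuilds the sample above and applies duplicated to every column except C. The frame is an assumption for illustration: only columns A, B, C, D and Z are included, and the 'Iowa' rows (labels 39 and 40) are hypothetical additions that form a 'Rep'/'Rep' pair.

```python
import pandas as pd

# Hypothetical sample: A, B, C, D, Z stand in for the full A..Z frame,
# and the two 'Iowa' rows are made up to show an unwanted Rep/Rep pair.
df = pd.DataFrame(
    {
        "A": [50, 50, 40, 30, 55, 55, 60, 60],
        "B": ["Ohio", "Ohio", "Kansas", "Kansas", "Texas", "Texas", "Iowa", "Iowa"],
        "C": ["Rep", "Dem", "Dem", "Dem", "Rep", "Dem", "Rep", "Rep"],
        "D": [3, 3, 34, 45, 2, 2, 9, 9],
        "Z": [45, 45, 1, 2, 7, 7, 8, 8],
    },
    index=[0, 1, 2, 3, 4, 38, 39, 40],
)

# Every column except C defines what "nearly identical" means.
other_cols = [c for c in df.columns if c != "C"]

# keep=False flags every member of a duplicated group, not just the repeats.
near_dupes = df[df.duplicated(subset=other_cols, keep=False)]
print(near_dupes)
```

On this sample the mask keeps the Ohio and Texas pairs but also both Iowa rows, which is the case the question wants to exclude.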












































      pandas dataframe duplicates














      edited Nov 22 at 16:45

























      asked Nov 22 at 13:54









      Mike

























          2 Answers






























Get all column names except C with difference and convert them to a list. Then sort by column C with sort_values, group by the remaining columns, and convert the C values of each group to a tuple. Finally, compare each tuple with ('Dem', 'Rep'), join the result back to the original DataFrame, and filter by boolean indexing:

    cols = df.columns.difference(['C']).tolist()

    # compare inside apply so each group's tuple is tested as a whole
    is_pair = lambda x: tuple(x) == ('Dem', 'Rep')
    s = df.sort_values('C').groupby(cols)['C'].apply(is_pair).rename('m')
    df = df[df.join(s, on=cols)['m']]

Another solution is to compare sets. Because a set ignores repeated values, a group such as Rep, Dem, Dem would also match, so chain the condition with a check on the group size:

    g = df.groupby(cols)['C']
    m1 = g.transform('size') == 2
    m2 = g.transform(lambda x: set(x) == {'Rep', 'Dem'}).astype(bool)

    df = df[m1 & m2]

    print(df)
         A        B    C  D   Z
    0   50   'Ohio'  Rep  3  45
    1   50   'Ohio'  Dem  3  45
    4   55  'Texas'  Rep  2   7
    38  55  'Texas'  Dem  2   7
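The grouping idea can also be written with built-in transforms only. The following is a sketch under assumptions rather than the answer's exact code: the sample frame is hypothetical (the 'Iowa' and 'Utah' rows and the 'Ind' value are invented to exercise the edge cases), and the names wanted, other_cols and pairs are illustrative.

```python
import pandas as pd

# Hypothetical sample: Iowa is a Rep/Rep pair, Utah pairs Rep with a third value.
df = pd.DataFrame(
    {
        "A": [50, 50, 40, 30, 55, 55, 60, 60, 70, 70],
        "B": ["Ohio", "Ohio", "Kansas", "Kansas", "Texas", "Texas",
              "Iowa", "Iowa", "Utah", "Utah"],
        "C": ["Rep", "Dem", "Dem", "Dem", "Rep", "Dem",
              "Rep", "Rep", "Rep", "Ind"],
        "D": [3, 3, 34, 45, 2, 2, 9, 9, 5, 5],
        "Z": [45, 45, 1, 2, 7, 7, 8, 8, 6, 6],
    },
    index=[0, 1, 2, 3, 4, 38, 39, 40, 41, 42],
)

wanted = {"Rep", "Dem"}
other_cols = [c for c in df.columns if c != "C"]
g = df.groupby(other_cols)["C"]

# A valid group has exactly two rows ...
exactly_two = g.transform("size") == 2
# ... whose C values differ from each other ...
two_distinct = g.transform("nunique") == 2
# ... and both of which are one of the wanted labels.
only_wanted = (
    df["C"].isin(wanted)
    .groupby([df[c] for c in other_cols])
    .transform("all")
)

pairs = df[exactly_two & two_distinct & only_wanted]
print(pairs)
```

Splitting the test into three masks keeps each condition readable and makes it easy to relax one of them, for example to allow groups larger than two.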
















































You can use duplicated with the argument keep set to False to create a mask for the rows that are duplicated once column C is dropped, and use isin to keep only the rows whose C value is one of ['Rep', 'Dem']:

    mask = df.drop(['C'], axis=1).duplicated(keep=False)
    df[mask & df['C'].isin(['Rep', 'Dem'])].drop_duplicates()

        A        B      C  D   Z
    0  50   'Ohio'  'Rep'  3  45
    1  50   'Ohio'  'Dem'  3  45
    4  55  'Texas'  'Rep'  2   7
    5  55  'Texas'  'Dem'  2   7




























• isin cannot be used here, because it also selects 'Rep' and 'Rep'.
  – jezrael
  Nov 22 at 14:23

• True. I guess adding drop_duplicates() at the end solves it for the cases where there's only Rep and Rep in C?
  – nixon
  Nov 22 at 14:28

• Hmm, I'm not sure a general solution with isin is possible...
  – jezrael
  Nov 22 at 14:30

• From what I understand, @Mike only wants rows that are identical except for column C, where two instances of Rep or Dem cannot be found, only one of each. So adding a drop_duplicates should solve it, in my opinion. Thanks for pointing it out, though!
  – nixon
  Nov 22 at 14:37

• The OP needs: "I would like to identify all rows that are identical except for column C, but within column C I only want to find combinations of 'Rep' and 'Dem'. So I don't want 2 identical rows with column C for instance being 'Rep' and 'Rep'."
  – jezrael
  Nov 22 at 14:38
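The objection raised in this comment thread can be checked on a small made-up frame. This is only a sketch: the 'Iowa' rows (labels 39 and 40) are a hypothetical 'Rep'/'Rep' pair added for the test, and the mask and filter follow the duplicated plus isin plus drop_duplicates approach discussed above.

```python
import pandas as pd

# Hypothetical sample with one unwanted Rep/Rep pair (Iowa).
df = pd.DataFrame(
    {
        "A": [50, 50, 55, 55, 60, 60],
        "B": ["Ohio", "Ohio", "Texas", "Texas", "Iowa", "Iowa"],
        "C": ["Rep", "Dem", "Rep", "Dem", "Rep", "Rep"],
        "D": [3, 3, 2, 2, 9, 9],
        "Z": [45, 45, 7, 7, 8, 8],
    },
    index=[0, 1, 4, 38, 39, 40],
)

# Rows that are duplicated once column C is ignored.
mask = df.drop(columns="C").duplicated(keep=False)

# isin accepts every Rep or Dem row; drop_duplicates then removes fully
# identical rows, so one of the two Iowa rows is dropped but the other stays.
result = df[mask & df["C"].isin(["Rep", "Dem"])].drop_duplicates()
print(result)
```

On this sample, row 39 survives as a lone 'Rep' row without a 'Dem' partner, so drop_duplicates alone does not remove such groups; a per-group check is still needed.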



















First answer (groupby on all columns except C): answered Nov 22 at 14:00, edited Nov 22 at 14:28, by jezrael
































Second answer (duplicated with isin): answered Nov 22 at 13:59, edited Nov 22 at 14:30, by nixon











