A pythonic and uFunc-y way to turn pandas column into “increasing” index?











up vote
7
down vote

favorite
2












Let's say I have a pandas df like so:



Index   A     B
0 foo 3
1 foo 2
2 foo 5
3 bar 3
4 bar 4
5 baz 5


What's a good fast way to add a column like so:



Index   A     B    Aidx
0 foo 3 0
1 foo 2 0
2 foo 5 0
3 bar 3 1
4 bar 4 1
5 baz 5 2


I.e. adding an increasing index for each unique value?



I know I could use df.unique(), then use a dict and enumerate to create a lookup, and then apply that dictionary lookup to create the column. But I feel like there should be faster way, possibly involving groupby with some special function?










share|improve this question


























    up vote
    7
    down vote

    favorite
    2












    Let's say I have a pandas df like so:



    Index   A     B
    0 foo 3
    1 foo 2
    2 foo 5
    3 bar 3
    4 bar 4
    5 baz 5


    What's a good fast way to add a column like so:



    Index   A     B    Aidx
    0 foo 3 0
    1 foo 2 0
    2 foo 5 0
    3 bar 3 1
    4 bar 4 1
    5 baz 5 2


    I.e. adding an increasing index for each unique value?



    I know I could use df.unique(), then use a dict and enumerate to create a lookup, and then apply that dictionary lookup to create the column. But I feel like there should be faster way, possibly involving groupby with some special function?










    share|improve this question
























      up vote
      7
      down vote

      favorite
      2









      up vote
      7
      down vote

      favorite
      2






      2





      Let's say I have a pandas df like so:



      Index   A     B
      0 foo 3
      1 foo 2
      2 foo 5
      3 bar 3
      4 bar 4
      5 baz 5


      What's a good fast way to add a column like so:



      Index   A     B    Aidx
      0 foo 3 0
      1 foo 2 0
      2 foo 5 0
      3 bar 3 1
      4 bar 4 1
      5 baz 5 2


      I.e. adding an increasing index for each unique value?



      I know I could use df.unique(), then use a dict and enumerate to create a lookup, and then apply that dictionary lookup to create the column. But I feel like there should be faster way, possibly involving groupby with some special function?










      share|improve this question













      Let's say I have a pandas df like so:



      Index   A     B
      0 foo 3
      1 foo 2
      2 foo 5
      3 bar 3
      4 bar 4
      5 baz 5


      What's a good fast way to add a column like so:



      Index   A     B    Aidx
      0 foo 3 0
      1 foo 2 0
      2 foo 5 0
      3 bar 3 1
      4 bar 4 1
      5 baz 5 2


      I.e. adding an increasing index for each unique value?



      I know I could use df.unique(), then use a dict and enumerate to create a lookup, and then apply that dictionary lookup to create the column. But I feel like there should be faster way, possibly involving groupby with some special function?







      python pandas






      share|improve this question













      share|improve this question











      share|improve this question




      share|improve this question










      asked 2 hours ago









      Lagerbaer

      2,6231124




      2,6231124
























          3 Answers
          3






          active

          oldest

          votes

















          up vote
          7
          down vote













          One way is to use ngroup. Just remember you have to make sure your groupby isn't resorting the groups to get your desired output, so set sort=False:



          df['Aidx'] = df.groupby('A',sort=False).ngroup()
          >>> df
          Index A B Aidx
          0 0 foo 3 0
          1 1 foo 2 0
          2 2 foo 5 0
          3 3 bar 3 1
          4 4 bar 4 1
          5 5 baz 5 2





          share|improve this answer






























            up vote
            6
            down vote













            No need groupby using





            Method 1factorize



            pd.factorize(df.A)[0]
            array([0, 0, 0, 1, 1, 2], dtype=int64)
            #df['Aidx']=pd.factorize(df.A)[0]




            Method 2 sklearn



            from sklearn import preprocessing
            le = preprocessing.LabelEncoder()
            le.fit(df.A)
            LabelEncoder()
            le.transform(df.A)
            array([2, 2, 2, 0, 0, 1])




            Method 3 cat.codes



            df.A.astype('category').cat.codes




            Method 4 map + unique



            l=df.A.unique()
            df.A.map(dict(zip(l,range(len(l)))))
            0 0
            1 0
            2 0
            3 1
            4 1
            5 2
            Name: A, dtype: int64




            Method 5 np.unique



            x,y=np.unique(df.A.values,return_inverse=True)
            y
            array([2, 2, 2, 0, 0, 1], dtype=int64)





            share|improve this answer























            • Good solutions, they should be really fast as well. May be add time comparison as OP is looking for the most efficient solution
              – Vaishali
              38 mins ago












            • @Vaishali sorry it is hard for me to get the timing , would you mind add that for me , thanks a lot
              – W-B
              25 mins ago


















            up vote
            4
            down vote













            One more method of doing so could be.



            df['C'] = i.ne(df.A.shift()).cumsum()-1
            df


            When we print df value it will be as follows.



              Index  A    B  C
            0 0 foo 3 0
            1 1 foo 2 0
            2 2 foo 5 0
            3 3 bar 3 1
            4 4 bar 4 1
            5 5 baz 5 2


            Explanation of solution: Let's break above solution into parts for understanding purposes.



            1st step: Compare df's A column by shifting its value down to itself as follows.



            i.ne(df.A.shift())


            Output we will get is:



            0     True
            1 False
            2 False
            3 True
            4 False
            5 True


            2nd step: Use of cumsum() function, so wherever TRUE value is coming(which will come when a match of A column and its shift is NOT found) it will call cumsum() function and its value will be increased.



            i.ne(df.A.shift()).cumsum()-1
            0 0
            1 0
            2 0
            3 1
            4 1
            5 2
            Name: A, dtype: int32


            3rd step: Save command's value into df['C'] which will create a new column named C in df.






            share|improve this answer



















            • 1




              Nice method ve++ for you
              – W-B
              1 hour ago










            • @W-B, thank you for encouragement sir, ++ve for your unique style already :)
              – RavinderSingh13
              1 hour ago













            Your Answer






            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "1"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53772121%2fa-pythonic-and-ufunc-y-way-to-turn-pandas-column-into-increasing-index%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            3 Answers
            3






            active

            oldest

            votes








            3 Answers
            3






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes








            up vote
            7
            down vote













            One way is to use ngroup. Just remember you have to make sure your groupby isn't resorting the groups to get your desired output, so set sort=False:



            df['Aidx'] = df.groupby('A',sort=False).ngroup()
            >>> df
            Index A B Aidx
            0 0 foo 3 0
            1 1 foo 2 0
            2 2 foo 5 0
            3 3 bar 3 1
            4 4 bar 4 1
            5 5 baz 5 2





            share|improve this answer



























              up vote
              7
              down vote













              One way is to use ngroup. Just remember you have to make sure your groupby isn't resorting the groups to get your desired output, so set sort=False:



              df['Aidx'] = df.groupby('A',sort=False).ngroup()
              >>> df
              Index A B Aidx
              0 0 foo 3 0
              1 1 foo 2 0
              2 2 foo 5 0
              3 3 bar 3 1
              4 4 bar 4 1
              5 5 baz 5 2





              share|improve this answer

























                up vote
                7
                down vote










                up vote
                7
                down vote









                One way is to use ngroup. Just remember you have to make sure your groupby isn't resorting the groups to get your desired output, so set sort=False:



                df['Aidx'] = df.groupby('A',sort=False).ngroup()
                >>> df
                Index A B Aidx
                0 0 foo 3 0
                1 1 foo 2 0
                2 2 foo 5 0
                3 3 bar 3 1
                4 4 bar 4 1
                5 5 baz 5 2





                share|improve this answer














                One way is to use ngroup. Just remember you have to make sure your groupby isn't resorting the groups to get your desired output, so set sort=False:



                df['Aidx'] = df.groupby('A',sort=False).ngroup()
                >>> df
                Index A B Aidx
                0 0 foo 3 0
                1 1 foo 2 0
                2 2 foo 5 0
                3 3 bar 3 1
                4 4 bar 4 1
                5 5 baz 5 2






                share|improve this answer














                share|improve this answer



                share|improve this answer








                edited 1 hour ago

























                answered 1 hour ago









                sacul

                29.7k41640




                29.7k41640
























                    up vote
                    6
                    down vote













                    No need groupby using





                    Method 1factorize



                    pd.factorize(df.A)[0]
                    array([0, 0, 0, 1, 1, 2], dtype=int64)
                    #df['Aidx']=pd.factorize(df.A)[0]




                    Method 2 sklearn



                    from sklearn import preprocessing
                    le = preprocessing.LabelEncoder()
                    le.fit(df.A)
                    LabelEncoder()
                    le.transform(df.A)
                    array([2, 2, 2, 0, 0, 1])




                    Method 3 cat.codes



                    df.A.astype('category').cat.codes




                    Method 4 map + unique



                    l=df.A.unique()
                    df.A.map(dict(zip(l,range(len(l)))))
                    0 0
                    1 0
                    2 0
                    3 1
                    4 1
                    5 2
                    Name: A, dtype: int64




                    Method 5 np.unique



                    x,y=np.unique(df.A.values,return_inverse=True)
                    y
                    array([2, 2, 2, 0, 0, 1], dtype=int64)





                    share|improve this answer























                    • Good solutions, they should be really fast as well. May be add time comparison as OP is looking for the most efficient solution
                      – Vaishali
                      38 mins ago












                    • @Vaishali sorry it is hard for me to get the timing , would you mind add that for me , thanks a lot
                      – W-B
                      25 mins ago















                    up vote
                    6
                    down vote













                    No need groupby using





                    Method 1factorize



                    pd.factorize(df.A)[0]
                    array([0, 0, 0, 1, 1, 2], dtype=int64)
                    #df['Aidx']=pd.factorize(df.A)[0]




                    Method 2 sklearn



                    from sklearn import preprocessing
                    le = preprocessing.LabelEncoder()
                    le.fit(df.A)
                    LabelEncoder()
                    le.transform(df.A)
                    array([2, 2, 2, 0, 0, 1])




                    Method 3 cat.codes



                    df.A.astype('category').cat.codes




                    Method 4 map + unique



                    l=df.A.unique()
                    df.A.map(dict(zip(l,range(len(l)))))
                    0 0
                    1 0
                    2 0
                    3 1
                    4 1
                    5 2
                    Name: A, dtype: int64




                    Method 5 np.unique



                    x,y=np.unique(df.A.values,return_inverse=True)
                    y
                    array([2, 2, 2, 0, 0, 1], dtype=int64)





                    share|improve this answer























                    • Good solutions, they should be really fast as well. May be add time comparison as OP is looking for the most efficient solution
                      – Vaishali
                      38 mins ago












                    • @Vaishali sorry it is hard for me to get the timing , would you mind add that for me , thanks a lot
                      – W-B
                      25 mins ago













                    up vote
                    6
                    down vote










                    up vote
                    6
                    down vote









                    No need groupby using





                    Method 1factorize



                    pd.factorize(df.A)[0]
                    array([0, 0, 0, 1, 1, 2], dtype=int64)
                    #df['Aidx']=pd.factorize(df.A)[0]




                    Method 2 sklearn



                    from sklearn import preprocessing
                    le = preprocessing.LabelEncoder()
                    le.fit(df.A)
                    LabelEncoder()
                    le.transform(df.A)
                    array([2, 2, 2, 0, 0, 1])




                    Method 3 cat.codes



                    df.A.astype('category').cat.codes




                    Method 4 map + unique



                    l=df.A.unique()
                    df.A.map(dict(zip(l,range(len(l)))))
                    0 0
                    1 0
                    2 0
                    3 1
                    4 1
                    5 2
                    Name: A, dtype: int64




                    Method 5 np.unique



                    x,y=np.unique(df.A.values,return_inverse=True)
                    y
                    array([2, 2, 2, 0, 0, 1], dtype=int64)





                    share|improve this answer














                    No need groupby using





                    Method 1factorize



                    pd.factorize(df.A)[0]
                    array([0, 0, 0, 1, 1, 2], dtype=int64)
                    #df['Aidx']=pd.factorize(df.A)[0]




                    Method 2 sklearn



                    from sklearn import preprocessing
                    le = preprocessing.LabelEncoder()
                    le.fit(df.A)
                    LabelEncoder()
                    le.transform(df.A)
                    array([2, 2, 2, 0, 0, 1])




                    Method 3 cat.codes



                    df.A.astype('category').cat.codes




                    Method 4 map + unique



                    l=df.A.unique()
                    df.A.map(dict(zip(l,range(len(l)))))
                    0 0
                    1 0
                    2 0
                    3 1
                    4 1
                    5 2
                    Name: A, dtype: int64




                    Method 5 np.unique



                    x,y=np.unique(df.A.values,return_inverse=True)
                    y
                    array([2, 2, 2, 0, 0, 1], dtype=int64)






                    share|improve this answer














                    share|improve this answer



                    share|improve this answer








                    edited 28 mins ago

























                    answered 1 hour ago









                    W-B

                    97.8k73162




                    97.8k73162












                    • Good solutions, they should be really fast as well. May be add time comparison as OP is looking for the most efficient solution
                      – Vaishali
                      38 mins ago












                    • @Vaishali sorry it is hard for me to get the timing , would you mind add that for me , thanks a lot
                      – W-B
                      25 mins ago


















                    • Good solutions, they should be really fast as well. May be add time comparison as OP is looking for the most efficient solution
                      – Vaishali
                      38 mins ago












                    • @Vaishali sorry it is hard for me to get the timing , would you mind add that for me , thanks a lot
                      – W-B
                      25 mins ago
















                    Good solutions, they should be really fast as well. May be add time comparison as OP is looking for the most efficient solution
                    – Vaishali
                    38 mins ago






                    Good solutions, they should be really fast as well. May be add time comparison as OP is looking for the most efficient solution
                    – Vaishali
                    38 mins ago














                    @Vaishali sorry it is hard for me to get the timing , would you mind add that for me , thanks a lot
                    – W-B
                    25 mins ago




                    @Vaishali sorry it is hard for me to get the timing , would you mind add that for me , thanks a lot
                    – W-B
                    25 mins ago










                    up vote
                    4
                    down vote













                    One more method of doing so could be.



                    df['C'] = i.ne(df.A.shift()).cumsum()-1
                    df


                    When we print df value it will be as follows.



                      Index  A    B  C
                    0 0 foo 3 0
                    1 1 foo 2 0
                    2 2 foo 5 0
                    3 3 bar 3 1
                    4 4 bar 4 1
                    5 5 baz 5 2


                    Explanation of solution: Let's break above solution into parts for understanding purposes.



                    1st step: Compare df's A column by shifting its value down to itself as follows.



                    i.ne(df.A.shift())


                    Output we will get is:



                    0     True
                    1 False
                    2 False
                    3 True
                    4 False
                    5 True


                    2nd step: Use of cumsum() function, so wherever TRUE value is coming(which will come when a match of A column and its shift is NOT found) it will call cumsum() function and its value will be increased.



                    i.ne(df.A.shift()).cumsum()-1
                    0 0
                    1 0
                    2 0
                    3 1
                    4 1
                    5 2
                    Name: A, dtype: int32


                    3rd step: Save command's value into df['C'] which will create a new column named C in df.






                    share|improve this answer



















                    • 1




                      Nice method ve++ for you
                      – W-B
                      1 hour ago










                    • @W-B, thank you for encouragement sir, ++ve for your unique style already :)
                      – RavinderSingh13
                      1 hour ago

















                    up vote
                    4
                    down vote













                    One more method of doing so could be.



                    df['C'] = i.ne(df.A.shift()).cumsum()-1
                    df


                    When we print df value it will be as follows.



                      Index  A    B  C
                    0 0 foo 3 0
                    1 1 foo 2 0
                    2 2 foo 5 0
                    3 3 bar 3 1
                    4 4 bar 4 1
                    5 5 baz 5 2


                    Explanation of solution: Let's break above solution into parts for understanding purposes.



                    1st step: Compare df's A column by shifting its value down to itself as follows.



                    i.ne(df.A.shift())


                    Output we will get is:



                    0     True
                    1 False
                    2 False
                    3 True
                    4 False
                    5 True


                    2nd step: Use of cumsum() function, so wherever TRUE value is coming(which will come when a match of A column and its shift is NOT found) it will call cumsum() function and its value will be increased.



                    i.ne(df.A.shift()).cumsum()-1
                    0 0
                    1 0
                    2 0
                    3 1
                    4 1
                    5 2
                    Name: A, dtype: int32


                    3rd step: Save command's value into df['C'] which will create a new column named C in df.






                    share|improve this answer



















                    • 1




                      Nice method ve++ for you
                      – W-B
                      1 hour ago










                    • @W-B, thank you for encouragement sir, ++ve for your unique style already :)
                      – RavinderSingh13
                      1 hour ago















                    up vote
                    4
                    down vote










                    up vote
                    4
                    down vote









                    One more method of doing so could be.



                    df['C'] = i.ne(df.A.shift()).cumsum()-1
                    df


                    When we print df value it will be as follows.



                      Index  A    B  C
                    0 0 foo 3 0
                    1 1 foo 2 0
                    2 2 foo 5 0
                    3 3 bar 3 1
                    4 4 bar 4 1
                    5 5 baz 5 2


                    Explanation of solution: Let's break above solution into parts for understanding purposes.



                    1st step: Compare df's A column by shifting its value down to itself as follows.



                    i.ne(df.A.shift())


                    Output we will get is:



                    0     True
                    1 False
                    2 False
                    3 True
                    4 False
                    5 True


                    2nd step: Use of cumsum() function, so wherever TRUE value is coming(which will come when a match of A column and its shift is NOT found) it will call cumsum() function and its value will be increased.



                    i.ne(df.A.shift()).cumsum()-1
                    0 0
                    1 0
                    2 0
                    3 1
                    4 1
                    5 2
                    Name: A, dtype: int32


                    3rd step: Save command's value into df['C'] which will create a new column named C in df.






                    share|improve this answer














                    One more method of doing so could be.



                    df['C'] = i.ne(df.A.shift()).cumsum()-1
                    df


                    When we print df value it will be as follows.



                      Index  A    B  C
                    0 0 foo 3 0
                    1 1 foo 2 0
                    2 2 foo 5 0
                    3 3 bar 3 1
                    4 4 bar 4 1
                    5 5 baz 5 2


                    Explanation of solution: Let's break above solution into parts for understanding purposes.



                    1st step: Compare df's A column by shifting its value down to itself as follows.



                    i.ne(df.A.shift())


                    Output we will get is:



                    0     True
                    1 False
                    2 False
                    3 True
                    4 False
                    5 True


                    2nd step: Use of cumsum() function, so wherever TRUE value is coming(which will come when a match of A column and its shift is NOT found) it will call cumsum() function and its value will be increased.



                    i.ne(df.A.shift()).cumsum()-1
                    0 0
                    1 0
                    2 0
                    3 1
                    4 1
                    5 2
                    Name: A, dtype: int32


                    3rd step: Save command's value into df['C'] which will create a new column named C in df.







                    share|improve this answer














                    share|improve this answer



                    share|improve this answer








                    edited 1 hour ago

























                    answered 1 hour ago









                    RavinderSingh13

                    25k41437




                    25k41437








                    • 1




                      Nice method ve++ for you
                      – W-B
                      1 hour ago










                    • @W-B, thank you for encouragement sir, ++ve for your unique style already :)
                      – RavinderSingh13
                      1 hour ago
















                    • 1




                      Nice method ve++ for you
                      – W-B
                      1 hour ago










                    • @W-B, thank you for encouragement sir, ++ve for your unique style already :)
                      – RavinderSingh13
                      1 hour ago










                    1




                    1




                    Nice method ve++ for you
                    – W-B
                    1 hour ago




                    Nice method ve++ for you
                    – W-B
                    1 hour ago












                    @W-B, thank you for encouragement sir, ++ve for your unique style already :)
                    – RavinderSingh13
                    1 hour ago






                    @W-B, thank you for encouragement sir, ++ve for your unique style already :)
                    – RavinderSingh13
                    1 hour ago




















                    draft saved

                    draft discarded




















































                    Thanks for contributing an answer to Stack Overflow!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    To learn more, see our tips on writing great answers.





                    Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


                    Please pay close attention to the following guidance:


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53772121%2fa-pythonic-and-ufunc-y-way-to-turn-pandas-column-into-increasing-index%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    How to ignore python UserWarning in pytest?

                    What visual should I use to simply compare current year value vs last year in Power BI desktop

                    Script to remove string up to first number