Performant cartesian product (CROSS JOIN) with pandas

up vote
7
down vote

favorite

The contents of this post were originally meant to be a part of
Pandas Merging 101,
but due to the nature and size of the content required to fully do
justice to this topic, it has been moved to its own QnA.

Given two simple DataFrames;

left = pd.DataFrame({'col1' : ['A', 'B', 'C'], 'col2' : [1, 2, 3]})

right = pd.DataFrame({'col1' : ['X', 'Y', 'Z'], 'col2' : [20, 30, 50]})



left



  col1  col2

0    A     1

1    B     2

2    C     3



right



  col1  col2

0    X    20

1    Y    30

2    Z    50

The cross product of these frames can be computed, and will look something like:

A       1      X      20

A       1      Y      30

A       1      Z      50

B       2      X      20

B       2      Y      30

B       2      Z      50

C       3      X      20

C       3      Y      30

C       3      Z      50

What is the most performant method of computing this result?

edited 26 mins ago

asked 6 hours ago

coldspeed

113k18104177

add a comment |

up vote
7
down vote

favorite

The contents of this post were originally meant to be a part of
Pandas Merging 101,
but due to the nature and size of the content required to fully do
justice to this topic, it has been moved to its own QnA.

Given two simple DataFrames;

left = pd.DataFrame({'col1' : ['A', 'B', 'C'], 'col2' : [1, 2, 3]})

right = pd.DataFrame({'col1' : ['X', 'Y', 'Z'], 'col2' : [20, 30, 50]})



left



  col1  col2

0    A     1

1    B     2

2    C     3



right



  col1  col2

0    X    20

1    Y    30

2    Z    50

The cross product of these frames can be computed, and will look something like:

A       1      X      20

A       1      Y      30

A       1      Z      50

B       2      X      20

B       2      Y      30

B       2      Z      50

C       3      X      20

C       3      Y      30

C       3      Z      50

What is the most performant method of computing this result?

edited 26 mins ago

asked 6 hours ago

coldspeed

113k18104177

add a comment |

up vote
7
down vote

favorite

The contents of this post were originally meant to be a part of
Pandas Merging 101,
but due to the nature and size of the content required to fully do
justice to this topic, it has been moved to its own QnA.

Given two simple DataFrames;

left = pd.DataFrame({'col1' : ['A', 'B', 'C'], 'col2' : [1, 2, 3]})

right = pd.DataFrame({'col1' : ['X', 'Y', 'Z'], 'col2' : [20, 30, 50]})



left



  col1  col2

0    A     1

1    B     2

2    C     3



right



  col1  col2

0    X    20

1    Y    30

2    Z    50

The cross product of these frames can be computed, and will look something like:

A       1      X      20

A       1      Y      30

A       1      Z      50

B       2      X      20

B       2      Y      30

B       2      Z      50

C       3      X      20

C       3      Y      30

C       3      Z      50

What is the most performant method of computing this result?

edited 26 mins ago

asked 6 hours ago

coldspeed

113k18104177

The contents of this post were originally meant to be a part of
Pandas Merging 101,
but due to the nature and size of the content required to fully do
justice to this topic, it has been moved to its own QnA.

Given two simple DataFrames;

left = pd.DataFrame({'col1' : ['A', 'B', 'C'], 'col2' : [1, 2, 3]})

right = pd.DataFrame({'col1' : ['X', 'Y', 'Z'], 'col2' : [20, 30, 50]})



left



  col1  col2

0    A     1

1    B     2

2    C     3



right



  col1  col2

0    X    20

1    Y    30

2    Z    50

The cross product of these frames can be computed, and will look something like:

A       1      X      20

A       1      Y      30

A       1      Z      50

B       2      X      20

B       2      Y      30

B       2      Z      50

C       3      X      20

C       3      Y      30

C       3      Z      50

What is the most performant method of computing this result?

python pandas numpy dataframe merge

edited 26 mins ago

asked 6 hours ago

coldspeed

113k18104177

edited 26 mins ago

asked 6 hours ago

coldspeed

113k18104177

edited 26 mins ago

asked 6 hours ago

coldspeed

113k18104177

asked 6 hours ago

coldspeed

113k18104177

asked 6 hours ago

coldspeed

113k18104177

add a comment |

2 Answers
2

active

oldest

votes

up vote
8
down vote

Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:

def cartesian_product_basic(left, right):

    return (

       left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))



cartesian_product_basic(left, right)



  col1_x  col2_x col1_y  col2_y

0      A       1      X      20

1      A       1      Y      30

2      A       1      Z      50

3      B       2      X      20

4      B       2      Y      30

5      B       2      Z      50

6      C       3      X      20

7      C       3      Y      30

8      C       3      Z      50

How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".

While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.

A faster implementation will require NumPy. Here are some famous NumPy implementations of 1D cartesian product. We can build on some of these performant solutions to get our desired output. My favourite, however, is @senderle's first implementation.

def cartesian_product(*arrays):

    la = len(arrays)

    dtype = np.result_type(*arrays)

    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)

    for i, a in enumerate(np.ix_(*arrays)):

        arr[...,i] = a

    return arr.reshape(-1, la)

Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames

This trick will work on any kind of DataFrame. We compute the cartesian product of the DataFrames' numeric indices using the aforementioned cartesian_product, use this to reindex the DataFrames, and

def cartesian_product_generalized(left, right):

    la, lb = len(left), len(right)

    idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])

    return pd.DataFrame(

        np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))



cartesian_product_generalized(left, right)



   0  1  2   3

0  A  1  X  20

1  A  1  Y  30

2  A  1  Z  50

3  B  2  X  20

4  B  2  Y  30

5  B  2  Z  50

6  C  3  X  20

7  C  3  Y  30

8  C  3  Z  50



np.array_equal(cartesian_product_generalized(left, right),

               cartesian_product_basic(left, right))

True

And, along similar lines,

left2 = left.copy()

left2.index = ['s1', 's2', 's1']



right2 = right.copy()

right2.index = ['x', 'y', 'y']





left2

   col1  col2

s1    A     1

s2    B     2

s1    C     3



right2

  col1  col2

x    X    20

y    Y    30

y    Z    50



np.array_equal(cartesian_product_generalized(left, right),

               cartesian_product_basic(left2, right2))

True

This solution can generalise to multiple DataFrames. For example,

def cartesian_product_multi(*dfs):

    idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])

    return pd.DataFrame(

        np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))



cartesian_product_multi(*[left, right, left]).head()



   0  1  2   3  4  5

0  A  1  X  20  A  1

1  A  1  X  20  B  2

2  A  1  X  20  C  3

3  A  1  X  20  D  4

4  A  1  Y  30  A  1

Further Simplification

A simpler solution not involving @senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.

def cartesian_product_simplified(left, right):

    la, lb = len(left), len(right)

    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])



    return pd.DataFrame(

        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))



np.array_equal(cartesian_product_simplified(left, right),

               cartesian_product_basic(left2, right2))

True

Performance Comparison

Benchmarking these solutions on some contrived DataFrames with unique indices, we have

enter image description here

Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.

Functions from Other Answers

# Wen's answer: https://stackoverflow.com/a/53699198/4909087

# I've put my own spin on this to make it as fast as possible.

def cartesian_product_itertools(left, right):

    return pd.DataFrame([

           [*x, *y] for x, y in itertools.product(

               left.values.tolist(), right.values.tolist())])

Performance Benchmarking Code

This is the timing script. All functions called here are defined above.

from timeit import timeit

import pandas as pd

import matplotlib.pyplot as plt



res = pd.DataFrame(

       index=['cartesian_product_basic', 'cartesian_product_generalized', 

              'cartesian_product_multi', 'cartesian_product_simplified',

              'cartesian_product_itertools'],

       columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],

       dtype=float

)



for f in res.index: 

    for c in res.columns:

        # print(f,c)

        if f in {'cartesian_product_itertools'} and c > 600:

            continue

        left2 = pd.concat([left] * c, ignore_index=True)

        right2 = pd.concat([right] * c, ignore_index=True)

        stmt = '{}(left2, right2)'.format(f)

        setp = 'from __main__ import left2, right2, {}'.format(f)

        res.at[f, c] = timeit(stmt, setp, number=5)



ax = res.div(res.min()).T.plot(loglog=True) 

ax.set_xlabel("N"); 

ax.set_ylabel("time (relative)");



plt.show()

edited 5 hours ago

answered 6 hours ago

coldspeed

113k18104177

1

Happy to to upvote twice (actually 1.5 times) :-), :D, lol :-)
– U9-Forward
6 hours ago

1

@U9-Forward Thanks, appreciate your support!
– coldspeed
5 hours ago

Good post. Upvoted 1.5 too hehe. Would be nice to see the performance of itertools.product as suggested by @W-B in your post as well.
– b-fg
5 hours ago

@b-fg Fair enough, done.
– coldspeed
5 hours ago

Wow, I didn't expect that! Thanks, appreciate the effort.
– b-fg
5 hours ago

add a comment |

up vote
3
down vote

Using itertools product and recreate the value in dataframe

import itertools

l=list(itertools.product(left.values.tolist(),right.values.tolist()))

pd.DataFrame(list(map(lambda x : sum(x,),l)))

   0  1  2   3

0  A  1  X  20

1  A  1  Y  30

2  A  1  Z  50

3  B  2  X  20

4  B  2  Y  30

5  B  2  Z  50

6  C  3  X  20

7  C  3  Y  30

8  C  3  Z  50

answered 5 hours ago

W-B

96.8k73162

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53699012%2fperformant-cartesian-product-cross-join-with-pandas%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

up vote
8
down vote

Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:

def cartesian_product_basic(left, right):

    return (

       left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))



cartesian_product_basic(left, right)



  col1_x  col2_x col1_y  col2_y

0      A       1      X      20

1      A       1      Y      30

2      A       1      Z      50

3      B       2      X      20

4      B       2      Y      30

5      B       2      Z      50

6      C       3      X      20

7      C       3      Y      30

8      C       3      Z      50

How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".

While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.

def cartesian_product(*arrays):

    la = len(arrays)

    dtype = np.result_type(*arrays)

    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)

    for i, a in enumerate(np.ix_(*arrays)):

        arr[...,i] = a

    return arr.reshape(-1, la)

Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames

def cartesian_product_generalized(left, right):

    la, lb = len(left), len(right)

    idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])

    return pd.DataFrame(

        np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))



cartesian_product_generalized(left, right)



   0  1  2   3

0  A  1  X  20

1  A  1  Y  30

2  A  1  Z  50

3  B  2  X  20

4  B  2  Y  30

5  B  2  Z  50

6  C  3  X  20

7  C  3  Y  30

8  C  3  Z  50



np.array_equal(cartesian_product_generalized(left, right),

               cartesian_product_basic(left, right))

True

And, along similar lines,

left2 = left.copy()

left2.index = ['s1', 's2', 's1']



right2 = right.copy()

right2.index = ['x', 'y', 'y']





left2

   col1  col2

s1    A     1

s2    B     2

s1    C     3



right2

  col1  col2

x    X    20

y    Y    30

y    Z    50



np.array_equal(cartesian_product_generalized(left, right),

               cartesian_product_basic(left2, right2))

True

This solution can generalise to multiple DataFrames. For example,

def cartesian_product_multi(*dfs):

    idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])

    return pd.DataFrame(

        np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))



cartesian_product_multi(*[left, right, left]).head()



   0  1  2   3  4  5

0  A  1  X  20  A  1

1  A  1  X  20  B  2

2  A  1  X  20  C  3

3  A  1  X  20  D  4

4  A  1  Y  30  A  1

Further Simplification

A simpler solution not involving @senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.

def cartesian_product_simplified(left, right):

    la, lb = len(left), len(right)

    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])



    return pd.DataFrame(

        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))



np.array_equal(cartesian_product_simplified(left, right),

               cartesian_product_basic(left2, right2))

True

Performance Comparison

Benchmarking these solutions on some contrived DataFrames with unique indices, we have

enter image description here

Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.

Functions from Other Answers

# Wen's answer: https://stackoverflow.com/a/53699198/4909087

# I've put my own spin on this to make it as fast as possible.

def cartesian_product_itertools(left, right):

    return pd.DataFrame([

           [*x, *y] for x, y in itertools.product(

               left.values.tolist(), right.values.tolist())])

Performance Benchmarking Code

This is the timing script. All functions called here are defined above.

from timeit import timeit

import pandas as pd

import matplotlib.pyplot as plt



res = pd.DataFrame(

       index=['cartesian_product_basic', 'cartesian_product_generalized', 

              'cartesian_product_multi', 'cartesian_product_simplified',

              'cartesian_product_itertools'],

       columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],

       dtype=float

)



for f in res.index: 

    for c in res.columns:

        # print(f,c)

        if f in {'cartesian_product_itertools'} and c > 600:

            continue

        left2 = pd.concat([left] * c, ignore_index=True)

        right2 = pd.concat([right] * c, ignore_index=True)

        stmt = '{}(left2, right2)'.format(f)

        setp = 'from __main__ import left2, right2, {}'.format(f)

        res.at[f, c] = timeit(stmt, setp, number=5)



ax = res.div(res.min()).T.plot(loglog=True) 

ax.set_xlabel("N"); 

ax.set_ylabel("time (relative)");



plt.show()

edited 5 hours ago

answered 6 hours ago

coldspeed

113k18104177

1

Happy to to upvote twice (actually 1.5 times) :-), :D, lol :-)
– U9-Forward
6 hours ago

1

@U9-Forward Thanks, appreciate your support!
– coldspeed
5 hours ago

Good post. Upvoted 1.5 too hehe. Would be nice to see the performance of itertools.product as suggested by @W-B in your post as well.
– b-fg
5 hours ago

@b-fg Fair enough, done.
– coldspeed
5 hours ago

Wow, I didn't expect that! Thanks, appreciate the effort.
– b-fg
5 hours ago

add a comment |

up vote
8
down vote

Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:

def cartesian_product_basic(left, right):

    return (

       left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))



cartesian_product_basic(left, right)



  col1_x  col2_x col1_y  col2_y

0      A       1      X      20

1      A       1      Y      30

2      A       1      Z      50

3      B       2      X      20

4      B       2      Y      30

5      B       2      Z      50

6      C       3      X      20

7      C       3      Y      30

8      C       3      Z      50

How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".

While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.

def cartesian_product(*arrays):

    la = len(arrays)

    dtype = np.result_type(*arrays)

    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)

    for i, a in enumerate(np.ix_(*arrays)):

        arr[...,i] = a

    return arr.reshape(-1, la)

Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames

def cartesian_product_generalized(left, right):

    la, lb = len(left), len(right)

    idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])

    return pd.DataFrame(

        np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))



cartesian_product_generalized(left, right)



   0  1  2   3

0  A  1  X  20

1  A  1  Y  30

2  A  1  Z  50

3  B  2  X  20

4  B  2  Y  30

5  B  2  Z  50

6  C  3  X  20

7  C  3  Y  30

8  C  3  Z  50



np.array_equal(cartesian_product_generalized(left, right),

               cartesian_product_basic(left, right))

True

And, along similar lines,

left2 = left.copy()

left2.index = ['s1', 's2', 's1']



right2 = right.copy()

right2.index = ['x', 'y', 'y']





left2

   col1  col2

s1    A     1

s2    B     2

s1    C     3



right2

  col1  col2

x    X    20

y    Y    30

y    Z    50



np.array_equal(cartesian_product_generalized(left, right),

               cartesian_product_basic(left2, right2))

True

This solution can generalise to multiple DataFrames. For example,

def cartesian_product_multi(*dfs):

    idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])

    return pd.DataFrame(

        np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))



cartesian_product_multi(*[left, right, left]).head()



   0  1  2   3  4  5

0  A  1  X  20  A  1

1  A  1  X  20  B  2

2  A  1  X  20  C  3

3  A  1  X  20  D  4

4  A  1  Y  30  A  1

Further Simplification

A simpler solution not involving @senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.

def cartesian_product_simplified(left, right):

    la, lb = len(left), len(right)

    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])



    return pd.DataFrame(

        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))



np.array_equal(cartesian_product_simplified(left, right),

               cartesian_product_basic(left2, right2))

True

Performance Comparison

Benchmarking these solutions on some contrived DataFrames with unique indices, we have

enter image description here

Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.

Functions from Other Answers

# Wen's answer: https://stackoverflow.com/a/53699198/4909087

# I've put my own spin on this to make it as fast as possible.

def cartesian_product_itertools(left, right):

    return pd.DataFrame([

           [*x, *y] for x, y in itertools.product(

               left.values.tolist(), right.values.tolist())])

Performance Benchmarking Code

This is the timing script. All functions called here are defined above.

from timeit import timeit

import pandas as pd

import matplotlib.pyplot as plt



res = pd.DataFrame(

       index=['cartesian_product_basic', 'cartesian_product_generalized', 

              'cartesian_product_multi', 'cartesian_product_simplified',

              'cartesian_product_itertools'],

       columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],

       dtype=float

)



for f in res.index: 

    for c in res.columns:

        # print(f,c)

        if f in {'cartesian_product_itertools'} and c > 600:

            continue

        left2 = pd.concat([left] * c, ignore_index=True)

        right2 = pd.concat([right] * c, ignore_index=True)

        stmt = '{}(left2, right2)'.format(f)

        setp = 'from __main__ import left2, right2, {}'.format(f)

        res.at[f, c] = timeit(stmt, setp, number=5)



ax = res.div(res.min()).T.plot(loglog=True) 

ax.set_xlabel("N"); 

ax.set_ylabel("time (relative)");



plt.show()

edited 5 hours ago

answered 6 hours ago

coldspeed

113k18104177

1

Happy to to upvote twice (actually 1.5 times) :-), :D, lol :-)
– U9-Forward
6 hours ago

1

@U9-Forward Thanks, appreciate your support!
– coldspeed
5 hours ago

Good post. Upvoted 1.5 too hehe. Would be nice to see the performance of itertools.product as suggested by @W-B in your post as well.
– b-fg
5 hours ago

@b-fg Fair enough, done.
– coldspeed
5 hours ago

Wow, I didn't expect that! Thanks, appreciate the effort.
– b-fg
5 hours ago

add a comment |

up vote
8
down vote

Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:

def cartesian_product_basic(left, right):

    return (

       left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))



cartesian_product_basic(left, right)



  col1_x  col2_x col1_y  col2_y

0      A       1      X      20

1      A       1      Y      30

2      A       1      Z      50

3      B       2      X      20

4      B       2      Y      30

5      B       2      Z      50

6      C       3      X      20

7      C       3      Y      30

8      C       3      Z      50

How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".

While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.

def cartesian_product(*arrays):

    la = len(arrays)

    dtype = np.result_type(*arrays)

    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)

    for i, a in enumerate(np.ix_(*arrays)):

        arr[...,i] = a

    return arr.reshape(-1, la)

Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames

def cartesian_product_generalized(left, right):

    la, lb = len(left), len(right)

    idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])

    return pd.DataFrame(

        np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))



cartesian_product_generalized(left, right)



   0  1  2   3

0  A  1  X  20

1  A  1  Y  30

2  A  1  Z  50

3  B  2  X  20

4  B  2  Y  30

5  B  2  Z  50

6  C  3  X  20

7  C  3  Y  30

8  C  3  Z  50



np.array_equal(cartesian_product_generalized(left, right),

               cartesian_product_basic(left, right))

True

And, along similar lines,

left2 = left.copy()

left2.index = ['s1', 's2', 's1']



right2 = right.copy()

right2.index = ['x', 'y', 'y']





left2

   col1  col2

s1    A     1

s2    B     2

s1    C     3



right2

  col1  col2

x    X    20

y    Y    30

y    Z    50



np.array_equal(cartesian_product_generalized(left, right),

               cartesian_product_basic(left2, right2))

True

This solution can generalise to multiple DataFrames. For example,

def cartesian_product_multi(*dfs):

    idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])

    return pd.DataFrame(

        np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))



cartesian_product_multi(*[left, right, left]).head()



   0  1  2   3  4  5

0  A  1  X  20  A  1

1  A  1  X  20  B  2

2  A  1  X  20  C  3

3  A  1  X  20  D  4

4  A  1  Y  30  A  1

Further Simplification

A simpler solution not involving @senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.

def cartesian_product_simplified(left, right):

    la, lb = len(left), len(right)

    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])



    return pd.DataFrame(

        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))



np.array_equal(cartesian_product_simplified(left, right),

               cartesian_product_basic(left2, right2))

True

Performance Comparison

Benchmarking these solutions on some contrived DataFrames with unique indices, we have

enter image description here

Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.

Functions from Other Answers

# Wen's answer: https://stackoverflow.com/a/53699198/4909087

# I've put my own spin on this to make it as fast as possible.

def cartesian_product_itertools(left, right):

    return pd.DataFrame([

           [*x, *y] for x, y in itertools.product(

               left.values.tolist(), right.values.tolist())])

Performance Benchmarking Code

This is the timing script. All functions called here are defined above.

from timeit import timeit

import pandas as pd

import matplotlib.pyplot as plt



res = pd.DataFrame(

       index=['cartesian_product_basic', 'cartesian_product_generalized', 

              'cartesian_product_multi', 'cartesian_product_simplified',

              'cartesian_product_itertools'],

       columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],

       dtype=float

)



for f in res.index: 

    for c in res.columns:

        # print(f,c)

        if f in {'cartesian_product_itertools'} and c > 600:

            continue

        left2 = pd.concat([left] * c, ignore_index=True)

        right2 = pd.concat([right] * c, ignore_index=True)

        stmt = '{}(left2, right2)'.format(f)

        setp = 'from __main__ import left2, right2, {}'.format(f)

        res.at[f, c] = timeit(stmt, setp, number=5)



ax = res.div(res.min()).T.plot(loglog=True) 

ax.set_xlabel("N"); 

ax.set_ylabel("time (relative)");



plt.show()

edited 5 hours ago

answered 6 hours ago

coldspeed

113k18104177

Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:

def cartesian_product_basic(left, right):

    return (

       left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))



cartesian_product_basic(left, right)



  col1_x  col2_x col1_y  col2_y

0      A       1      X      20

1      A       1      Y      30

2      A       1      Z      50

3      B       2      X      20

4      B       2      Y      30

5      B       2      Z      50

6      C       3      X      20

7      C       3      Y      30

8      C       3      Z      50

How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".

While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.

def cartesian_product(*arrays):

    la = len(arrays)

    dtype = np.result_type(*arrays)

    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)

    for i, a in enumerate(np.ix_(*arrays)):

        arr[...,i] = a

    return arr.reshape(-1, la)

Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames

def cartesian_product_generalized(left, right):

    la, lb = len(left), len(right)

    idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])

    return pd.DataFrame(

        np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))



cartesian_product_generalized(left, right)



   0  1  2   3

0  A  1  X  20

1  A  1  Y  30

2  A  1  Z  50

3  B  2  X  20

4  B  2  Y  30

5  B  2  Z  50

6  C  3  X  20

7  C  3  Y  30

8  C  3  Z  50



np.array_equal(cartesian_product_generalized(left, right),

               cartesian_product_basic(left, right))

True

And, along similar lines,

left2 = left.copy()

left2.index = ['s1', 's2', 's1']



right2 = right.copy()

right2.index = ['x', 'y', 'y']





left2

   col1  col2

s1    A     1

s2    B     2

s1    C     3



right2

  col1  col2

x    X    20

y    Y    30

y    Z    50



np.array_equal(cartesian_product_generalized(left, right),

               cartesian_product_basic(left2, right2))

True

This solution can generalise to multiple DataFrames. For example,

def cartesian_product_multi(*dfs):

    idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])

    return pd.DataFrame(

        np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))



cartesian_product_multi(*[left, right, left]).head()



   0  1  2   3  4  5

0  A  1  X  20  A  1

1  A  1  X  20  B  2

2  A  1  X  20  C  3

3  A  1  X  20  D  4

4  A  1  Y  30  A  1

Further Simplification

A simpler solution not involving @senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.

def cartesian_product_simplified(left, right):

    la, lb = len(left), len(right)

    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])



    return pd.DataFrame(

        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))



np.array_equal(cartesian_product_simplified(left, right),

               cartesian_product_basic(left2, right2))

True

Performance Comparison

Benchmarking these solutions on some contrived DataFrames with unique indices, we have

enter image description here

Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.

Functions from Other Answers

# Wen's answer: https://stackoverflow.com/a/53699198/4909087

# I've put my own spin on this to make it as fast as possible.

def cartesian_product_itertools(left, right):

    return pd.DataFrame([

           [*x, *y] for x, y in itertools.product(

               left.values.tolist(), right.values.tolist())])

Performance Benchmarking Code

This is the timing script. All functions called here are defined above.

from timeit import timeit

import pandas as pd

import matplotlib.pyplot as plt



res = pd.DataFrame(

       index=['cartesian_product_basic', 'cartesian_product_generalized', 

              'cartesian_product_multi', 'cartesian_product_simplified',

              'cartesian_product_itertools'],

       columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],

       dtype=float

)



for f in res.index: 

    for c in res.columns:

        # print(f,c)

        if f in {'cartesian_product_itertools'} and c > 600:

            continue

        left2 = pd.concat([left] * c, ignore_index=True)

        right2 = pd.concat([right] * c, ignore_index=True)

        stmt = '{}(left2, right2)'.format(f)

        setp = 'from __main__ import left2, right2, {}'.format(f)

        res.at[f, c] = timeit(stmt, setp, number=5)



ax = res.div(res.min()).T.plot(loglog=True) 

ax.set_xlabel("N"); 

ax.set_ylabel("time (relative)");



plt.show()

edited 5 hours ago

answered 6 hours ago

coldspeed

113k18104177

edited 5 hours ago

answered 6 hours ago

coldspeed

113k18104177

answered 6 hours ago

coldspeed

113k18104177

answered 6 hours ago

coldspeed

113k18104177

1

Happy to to upvote twice (actually 1.5 times) :-), :D, lol :-)
– U9-Forward
6 hours ago

1

@U9-Forward Thanks, appreciate your support!
– coldspeed
5 hours ago

Good post. Upvoted 1.5 too hehe. Would be nice to see the performance of itertools.product as suggested by @W-B in your post as well.
– b-fg
5 hours ago

@b-fg Fair enough, done.
– coldspeed
5 hours ago

Wow, I didn't expect that! Thanks, appreciate the effort.
– b-fg
5 hours ago

add a comment |

1

Happy to to upvote twice (actually 1.5 times) :-), :D, lol :-)
– U9-Forward
6 hours ago

1

@U9-Forward Thanks, appreciate your support!
– coldspeed
5 hours ago

Good post. Upvoted 1.5 too hehe. Would be nice to see the performance of itertools.product as suggested by @W-B in your post as well.
– b-fg
5 hours ago

@b-fg Fair enough, done.
– coldspeed
5 hours ago

Wow, I didn't expect that! Thanks, appreciate the effort.
– b-fg
5 hours ago

Happy to to upvote twice (actually 1.5 times) :-), :D, lol :-)
– U9-Forward
6 hours ago

@U9-Forward Thanks, appreciate your support!
– coldspeed
5 hours ago

Good post. Upvoted 1.5 too hehe. Would be nice to see the performance of itertools.product as suggested by @W-B in your post as well.
– b-fg
5 hours ago

@b-fg Fair enough, done.
– coldspeed
5 hours ago

Wow, I didn't expect that! Thanks, appreciate the effort.
– b-fg
5 hours ago

add a comment |

up vote
3
down vote

Using itertools product and recreate the value in dataframe

import itertools

l=list(itertools.product(left.values.tolist(),right.values.tolist()))

pd.DataFrame(list(map(lambda x : sum(x,),l)))

   0  1  2   3

0  A  1  X  20

1  A  1  Y  30

2  A  1  Z  50

3  B  2  X  20

4  B  2  Y  30

5  B  2  Z  50

6  C  3  X  20

7  C  3  Y  30

8  C  3  Z  50

answered 5 hours ago

W-B

96.8k73162

add a comment |

up vote
3
down vote

Using itertools product and recreate the value in dataframe

import itertools

l=list(itertools.product(left.values.tolist(),right.values.tolist()))

pd.DataFrame(list(map(lambda x : sum(x,),l)))

   0  1  2   3

0  A  1  X  20

1  A  1  Y  30

2  A  1  Z  50

3  B  2  X  20

4  B  2  Y  30

5  B  2  Z  50

6  C  3  X  20

7  C  3  Y  30

8  C  3  Z  50

answered 5 hours ago

W-B

96.8k73162

add a comment |

up vote
3
down vote

Using itertools product and recreate the value in dataframe

import itertools

l=list(itertools.product(left.values.tolist(),right.values.tolist()))

pd.DataFrame(list(map(lambda x : sum(x,),l)))

   0  1  2   3

0  A  1  X  20

1  A  1  Y  30

2  A  1  Z  50

3  B  2  X  20

4  B  2  Y  30

5  B  2  Z  50

6  C  3  X  20

7  C  3  Y  30

8  C  3  Z  50

answered 5 hours ago

W-B

96.8k73162

Using itertools product and recreate the value in dataframe

import itertools

l=list(itertools.product(left.values.tolist(),right.values.tolist()))

pd.DataFrame(list(map(lambda x : sum(x,),l)))

   0  1  2   3

0  A  1  X  20

1  A  1  Y  30

2  A  1  Z  50

3  B  2  X  20

4  B  2  Y  30

5  B  2  Z  50

6  C  3  X  20

7  C  3  Y  30

8  C  3  Z  50

answered 5 hours ago

W-B

96.8k73162

answered 5 hours ago

W-B

96.8k73162

answered 5 hours ago

W-B

96.8k73162

answered 5 hours ago

W-B

96.8k73162

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Qfyilyi