Working with Multiple DataFrames¶

Often you’ll work with multiple dataframes that you want to stick together or merge. df.merge() and df.concat() are all you need to know for combining dataframes. The Pandas documentation is very helpful for these functions, but they are pretty easy to grasp.

Note

The example joins shown in this section are inspired by Chapter 15 of Jenny Bryan’s STAT 545 materials.

Sticking DataFrames Together with `pd.concat()`¶

You can use `pd.concat()` to stick dataframes together: - Vertically: if they have the same **columns**, OR - Horizontally: if they have the same **rows**

import pandas as pd

df1 = pd.DataFrame({'A': [1, 3, 5],
                    'B': [2, 4, 6]})
df2 = pd.DataFrame({'A': [7, 9, 11],
                    'B': [8, 10, 12]})

df1

	A	B
0	1	2
1	3	4
2	5	6

df2

	A	B
0	7	8
1	9	10
2	11	12

pd.concat((df1, df2), axis=0)  # axis=0 specifies a vertical stick, i.e., on the columns

	A	B
0	1	2
1	3	4
2	5	6
0	7	8
1	9	10
2	11	12

Notice that the indexes were simply joined together? This may or may not be what you want. To reset the index, you can specify the argument ignore_index=True:

pd.concat((df1, df2), axis=0, ignore_index=True)

	A	B
0	1	2
1	3	4
2	5	6
3	7	8
4	9	10
5	11	12

Use axis=1 to stick together horizontally:

pd.concat((df1, df2), axis=1, ignore_index=True)

	0	1	2	3
0	1	2	7	8
1	3	4	9	10
2	5	6	11	12

You are not limited to just two dataframes, you can concatenate as many as you want:

pd.concat((df1, df2, df1, df2), axis=0, ignore_index=True)

	A	B
0	1	2
1	3	4
2	5	6
3	7	8
4	9	10
5	11	12
6	1	2
7	3	4
8	5	6
9	7	8
10	9	10
11	11	12

Joining DataFrames with `pd.merge()`¶

pd.merge() gives you the ability to “join” dataframes using different rules (just like with SQL if you’re familiar with it). You can use df.merge() to join dataframes based on shared key columns. Methods include:

“inner join”
“outer join”
“left join”
“right join”

See this great cheat sheet and these great animations for more insights.

df1 = pd.DataFrame({"name": ['Magneto', 'Storm', 'Mystique', 'Batman', 'Joker', 'Catwoman', 'Hellboy'],
                    'alignment': ['bad', 'good', 'bad', 'good', 'bad', 'bad', 'good'],
                    'gender': ['male', 'female', 'female', 'male', 'male', 'female', 'male'],
                    'publisher': ['Marvel', 'Marvel', 'Marvel', 'DC', 'DC', 'DC', 'Dark Horse Comics']})
df2 = pd.DataFrame({'publisher': ['DC', 'Marvel', 'Image'],
                    'year_founded': [1934, 1939, 1992]})

df1

	name	alignment	gender	publisher
0	Magneto	bad	male	Marvel
1	Storm	good	female	Marvel
2	Mystique	bad	female	Marvel
3	Batman	good	male	DC
4	Joker	bad	male	DC
5	Catwoman	bad	female	DC
6	Hellboy	good	male	Dark Horse Comics

An “inner” join will return all rows of df1 where matching values for “publisher” are found in df2:

pd.merge(df1, df2, how="inner", on="publisher")

	name	alignment	gender	publisher	year_founded
0	Magneto	bad	male	Marvel	1939
1	Storm	good	female	Marvel	1939
2	Mystique	bad	female	Marvel	1939
3	Batman	good	male	DC	1934
4	Joker	bad	male	DC	1934
5	Catwoman	bad	female	DC	1934

An “outer” join will return all rows of df1 and df2, placing NaNs where information is unavailable:

pd.merge(df1, df2, how="outer", on="publisher")

	name	alignment	gender	publisher	year_founded
0	Magneto	bad	male	Marvel	1939.0
1	Storm	good	female	Marvel	1939.0
2	Mystique	bad	female	Marvel	1939.0
3	Batman	good	male	DC	1934.0
4	Joker	bad	male	DC	1934.0
5	Catwoman	bad	female	DC	1934.0
6	Hellboy	good	male	Dark Horse Comics	NaN
7	NaN	NaN	NaN	Image	1992.0

Return all rows from df1 and all columns of df1 and df2, populated where matches occur:

pd.merge(df1, df2, how="left", on="publisher")

	name	alignment	gender	publisher	year_founded
0	Magneto	bad	male	Marvel	1939.0
1	Storm	good	female	Marvel	1939.0
2	Mystique	bad	female	Marvel	1939.0
3	Batman	good	male	DC	1934.0
4	Joker	bad	male	DC	1934.0
5	Catwoman	bad	female	DC	1934.0
6	Hellboy	good	male	Dark Horse Comics	NaN

pd.merge(df1, df2, how="right", on="publisher")

	name	alignment	gender	publisher	year_founded
0	Batman	good	male	DC	1934
1	Joker	bad	male	DC	1934
2	Catwoman	bad	female	DC	1934
3	Magneto	bad	male	Marvel	1939
4	Storm	good	female	Marvel	1939
5	Mystique	bad	female	Marvel	1939
6	NaN	NaN	NaN	Image	1992

There are many ways to specify the key to join dataframes on, you can join on index values, different, column names, etc. Another helpful argument is the indicator argument which will add a column to the result telling you where matches were found in the dataframes:

pd.merge(df1, df2, how="outer", on="publisher", indicator=True)

	name	alignment	gender	publisher	year_founded	_merge
0	Magneto	bad	male	Marvel	1939.0	both
1	Storm	good	female	Marvel	1939.0	both
2	Mystique	bad	female	Marvel	1939.0	both
3	Batman	good	male	DC	1934.0	both
4	Joker	bad	male	DC	1934.0	both
5	Catwoman	bad	female	DC	1934.0	both
6	Hellboy	good	male	Dark Horse Comics	NaN	left_only
7	NaN	NaN	NaN	Image	1992.0	right_only

By the way, you can use pd.concat() to do a simple “inner” or “outer” join on multiple datadrames at once. It’s less flexible than merge, but can be useful sometimes.

Data Science using Python

Working with Multiple DataFrames

Contents

Working with Multiple DataFrames¶

Sticking DataFrames Together with `pd.concat()`¶

Joining DataFrames with `pd.merge()`¶

Data Science using Python

Working with Multiple DataFrames

Contents

Working with Multiple DataFrames¶

Sticking DataFrames Together with pd.concat()¶

Joining DataFrames with pd.merge()¶

Sticking DataFrames Together with `pd.concat()`¶

Joining DataFrames with `pd.merge()`¶