Grouping and Pivoting
Contents
Grouping and Pivoting¶
In this section, we will answer the question:
What were the most popular male and female names in each year?
Here’s the Baby Names dataset once again:
import pandas as pd
baby = pd.read_csv('babynames.csv')
baby.head()
# the .head() method outputs the first five rows of the DataFrame
Name | Sex | Count | Year | |
---|---|---|---|---|
0 | Mary | F | 9217 | 1884 |
1 | Anna | F | 3860 | 1884 |
2 | Emma | F | 2587 | 1884 |
3 | Elizabeth | F | 2549 | 1884 |
4 | Minnie | F | 2243 | 1884 |
Breaking the Problem Down
We decompose this problem into simpler table manipulations.
Group the
baby
DataFrame by ‘Year’ and ‘Sex’.For each group, compute the most popular name.
Recognizing which operation is needed for each problem is sometimes tricky. Usually, a convoluted series of steps will signal to you that there might be a simpler way to express what you want. If we didn’t immediately recognize that we needed to group, for example, we might write steps like the following:
Loop through each unique year.
For each year, loop through each unique sex.
For each unique year and sex, find the most common name.
Note
There is almost always a better alternative to looping over a pandas
DataFrame. In particular, looping over unique values of a DataFrame should usually be replaced with a group.
Grouping¶
To group in pandas
. we use the .groupby()
method.
baby.groupby('Year')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7ff990dd36d0>
.groupby()
returns a strange-looking DataFrameGroupBy
object. We can call .agg()
on this object with an aggregation function in order to get a familiar output:
# The aggregation function takes in a series of values for each group
# and outputs a single value
def length(series):
return len(series)
# Count up number of values for each year. This is equivalent to
# counting the number of rows where each year appears.
baby.groupby('Year').agg(length)
Name | Sex | Count | |
---|---|---|---|
Year | |||
1880 | 2000 | 2000 | 2000 |
1881 | 1935 | 1935 | 1935 |
1882 | 2127 | 2127 | 2127 |
1883 | 2084 | 2084 | 2084 |
1884 | 2297 | 2297 | 2297 |
... | ... | ... | ... |
2012 | 33715 | 33715 | 33715 |
2013 | 33253 | 33253 | 33253 |
2014 | 33206 | 33206 | 33206 |
2015 | 33063 | 33063 | 33063 |
2016 | 32868 | 32868 | 32868 |
137 rows × 3 columns
You might notice that the length
function simply calls the len
function, so we can simplify the code above.
baby.groupby('Year').agg(len)
Name | Sex | Count | |
---|---|---|---|
Year | |||
1880 | 2000 | 2000 | 2000 |
1881 | 1935 | 1935 | 1935 |
1882 | 2127 | 2127 | 2127 |
1883 | 2084 | 2084 | 2084 |
1884 | 2297 | 2297 | 2297 |
... | ... | ... | ... |
2012 | 33715 | 33715 | 33715 |
2013 | 33253 | 33253 | 33253 |
2014 | 33206 | 33206 | 33206 |
2015 | 33063 | 33063 | 33063 |
2016 | 32868 | 32868 | 32868 |
137 rows × 3 columns
A further shorthand to accomplish the same result would be by using .count() method as our aggregating function. Pandas has shorthands for common aggregation functions, including count
, sum
, and mean
.
baby.groupby('Year').count()
Name | Sex | Count | |
---|---|---|---|
Year | |||
1880 | 2000 | 2000 | 2000 |
1881 | 1935 | 1935 | 1935 |
1882 | 2127 | 2127 | 2127 |
1883 | 2084 | 2084 | 2084 |
1884 | 2297 | 2297 | 2297 |
... | ... | ... | ... |
2012 | 33715 | 33715 | 33715 |
2013 | 33253 | 33253 | 33253 |
2014 | 33206 | 33206 | 33206 |
2015 | 33063 | 33063 | 33063 |
2016 | 32868 | 32868 | 32868 |
137 rows × 3 columns
The aggregation is applied to each column of the DataFrame, producing redundant information. We can restrict the output columns by slicing before grouping.
year_rows = baby[['Year', 'Count']].groupby('Year').agg(len)
year_rows
Count | |
---|---|
Year | |
1880 | 2000 |
1881 | 1935 |
1882 | 2127 |
1883 | 2084 |
1884 | 2297 |
... | ... |
2012 | 33715 |
2013 | 33253 |
2014 | 33206 |
2015 | 33063 |
2016 | 32868 |
137 rows × 1 columns
Note that the index of the resulting DataFrame now contains the unique years, so we can slice subsets of years using .loc
as before:
# Every twentieth year starting at 1880
year_rows.loc[1880:2016:20, :]
Count | |
---|---|
Year | |
1880 | 2000 |
1900 | 3730 |
1920 | 10755 |
1940 | 8961 |
1960 | 11924 |
1980 | 19440 |
2000 | 29764 |
Grouping on Multiple Columns¶
We can group on multiple columns to get groups based on unique pairs of values. To do this, pass in a list of column labels into .groupby()
.
grouped_counts = baby.groupby(['Year', 'Sex']).sum()
grouped_counts
Count | ||
---|---|---|
Year | Sex | |
1880 | F | 90992 |
M | 110491 | |
1881 | F | 91953 |
M | 100743 | |
1882 | F | 107847 |
... | ... | ... |
2014 | M | 1913434 |
2015 | F | 1776538 |
M | 1907211 | |
2016 | F | 1756647 |
M | 1880674 |
274 rows × 1 columns
The code above computes the total number of babies born for each year and sex. We can now use grouping by multiple columns to compute the most popular names for each year and sex. Since the data are already sorted in descending order of Count for each year and sex, we can define an aggregation function that returns the first value in each series. (If the data weren’t sorted, we can call sort_values()
first.)
# The most popular name is simply the first one that appears in the series
def most_popular(series):
return series.iloc[0]
baby_pop = baby.groupby(['Year', 'Sex']).agg(most_popular)
baby_pop
Name | Count | ||
---|---|---|---|
Year | Sex | ||
1880 | F | Mary | 7065 |
M | John | 9655 | |
1881 | F | Mary | 6919 |
M | John | 8769 | |
1882 | F | Mary | 8148 |
... | ... | ... | ... |
2014 | M | Noah | 19263 |
2015 | F | Emma | 20415 |
M | Noah | 19594 | |
2016 | F | Emma | 19414 |
M | Noah | 19015 |
274 rows × 2 columns
Notice that grouping by multiple columns results in multiple labels for each row. This is called a “multilevel index” and is tricky to work with. The important thing to know is that .loc
takes in a tuple for the row index instead of a single value:
baby_pop.loc[(2000, 'F'), 'Name']
'Emily'
But .iloc
behaves the same as usual since it uses indices instead of labels:
baby_pop.iloc[10:15, :]
Name | Count | ||
---|---|---|---|
Year | Sex | ||
1885 | F | Mary | 9128 |
M | John | 8756 | |
1886 | F | Mary | 9889 |
M | John | 9026 | |
1887 | F | Mary | 9888 |
Pivoting¶
If you group by two columns, you can often use pivot to present your data in a more convenient format. Using a pivot lets you use one set of grouped labels as the columns of the resulting table.
To pivot, use the pd.pivot_table()
function.
pd.pivot_table(baby,
index='Year', # Index for rows
columns='Sex', # Columns
values='Name', # Values in table
aggfunc=most_popular) # Aggregation function
Sex | F | M |
---|---|---|
Year | ||
1880 | Mary | John |
1881 | Mary | John |
1882 | Mary | John |
1883 | Mary | John |
1884 | Mary | John |
... | ... | ... |
2012 | Sophia | Jacob |
2013 | Sophia | Noah |
2014 | Emma | Noah |
2015 | Emma | Noah |
2016 | Emma | Noah |
137 rows × 2 columns
Compare this result to the baby_pop
table that we computed using .groupby()
. We can see that the Sex
index in baby_pop
became the columns of the pivot table.
baby_pop
Name | Count | ||
---|---|---|---|
Year | Sex | ||
1880 | F | Mary | 7065 |
M | John | 9655 | |
1881 | F | Mary | 6919 |
... | ... | ... | ... |
2015 | M | Noah | 19594 |
2016 | F | Emma | 19414 |
M | Noah | 19015 |
274 rows × 2 columns
Summary¶
We now have the most popular baby names for each sex and year in our dataset and learned to express the following operations in pandas
:
Operation |
|
---|---|
Group |
|
Group by multiple columns |
|
Group and aggregate |
|
Pivot |
|