DataFrames in Python


Much of data anlysis involves working with data contained in rectangular form in tables or csv files. In python we call these data tables as DataFrames, they are one of the most common and useful forms of data for analysis. We introduce data manipulation using pandas, the standard Python library for working with dataframes.

It is more important that you understand the types of useful operations on data than the exact details of pandas syntax. For example, knowing when to perform a group-by is generally more useful than knowing how to call the pandas function to group data.

Because we will cover only the most commonly used pandas functions in this tutorial, you should bookmark the pandas documentation for reference when you conduct your own data analyses.

We begin by talking about the types of dataset structures that pandas can read. Then, we introduce indexes, grouping, apply, and strings.

Chapter Learning Objectives


  • Create Pandas series with pd.Series() and Pandas dataframe with pd.DataFrame()

  • Be able to access values from a Series/DataFrame by indexing, slicing and boolean indexing using notation such as df[], df.loc[], df.iloc[], df.query[]

  • Perform basic arithmetic operations between two series and anticipate the result.

  • Describe how Pandas assigns dtypes to Series and what the object dtype is

  • Read a standard .csv file from a local path or url using Pandas pd.read_csv().

Introduction to Pandas


Pandas is most popular Python library for tabular data structures. You can think of Pandas as an extremely powerful version of Excel (but free and with a lot more features!)

Pandas can be installed using conda:

conda install pandas

We usually import pandas with the alias pd. You’ll see these two imports at the top of most data science workflows:

import pandas as pd

Pandas Series


What are Series?

A Series is like an list/array but with labels. They are strictly 1-dimensional and can contain any data type (integers, strings, floats, objects, etc), including a mix of them. Series can be created from a scalar, a list, ndarray or dictionary using pd.Series() (note the captial “S”). Here are some example series:

Creating Series

By default, series are labelled with indices starting from 0. For example:

pd.Series(data = [-5, 1.3, 21, 6, 3])
0    -5.0
1     1.3
2    21.0
3     6.0
4     3.0
dtype: float64

But you can add a custom index:

pd.Series(data = [-5, 1.3, 21, 6, 3],
          index = ['a', 'b', 'c', 'd', 'e'])
a    -5.0
b     1.3
c    21.0
d     6.0
e     3.0
dtype: float64

You can create a Series from a dictionary:

pd.Series(data = {'a': 10, 'b': 20, 'c': 30})
a    10
b    20
c    30
dtype: int64

Series Characteristics

Series can be given a name attribute. I almost never use this but it might come up sometimes:

s = pd.Series(data = [34, 56, 45, 75, 90], name='random_series')
s
0    34
1    56
2    45
3    75
4    90
Name: random_series, dtype: int64
s.name
'random_series'
s.rename("another_name")
0    34
1    56
2    45
3    75
4    90
Name: another_name, dtype: int64

You can access the index labels of your series using the .index attribute:

s.index
RangeIndex(start=0, stop=5, step=1)

Pandas DataFrames


What are DataFrames?

Pandas DataFrames are you’re new best friend. They are like the Excel spreadsheets you may be used to. DataFrames are really just Series stuck together! Think of a DataFrame as a dictionary of series, with the “keys” being the column labels and the “values” being the series data:

Creating DataFrames

Dataframes can be created using pd.DataFrame() (note the capital “D” and “F”). Like series, index and column labels of dataframes are labelled starting from 0 by default:

pd.DataFrame([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9

We can use the index and columns arguments to give them labels:

pd.DataFrame([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]],
             index = ["R1", "R2", "R3"],
             columns = ["C1", "C2", "C3"])
C1 C2 C3
R1 1 2 3
R2 4 5 6
R3 7 8 9

There are so many ways to create dataframes. We can create them from dictionaries

pd.DataFrame({"C1": [1, 2, 3],
              "C2": ['A', 'B', 'C']},
             index=["R1", "R2", "R3"])
C1 C2
R1 1 A
R2 2 B
R3 3 C

Here’s a table of the main ways you can create dataframes (see the Pandas documentation for more):

Create DataFrame from

Code

Lists of lists

pd.DataFrame([['Tom', 7], ['Mike', 15], ['Tiffany', 3]])

ndarray

pd.DataFrame(np.array([['Tom', 7], ['Mike', 15], ['Tiffany', 3]]))

Dictionary

pd.DataFrame({"Name": ['Tom', 'Mike', 'Tiffany'], "Number": [7, 15, 3]})

List of tuples

pd.DataFrame(zip(['Tom', 'Mike', 'Tiffany'], [7, 15, 3]))

Series

pd.DataFrame({"Name": pd.Series(['Tom', 'Mike', 'Tiffany']), "Number": pd.Series([7, 15, 3])})

Indexing and Slicing DataFrames

There are several main ways to select data from a DataFrame:

  1. []

  2. .loc[]

  3. .iloc[]

  4. Boolean indexing

  5. .query()

df = pd.DataFrame({"Name": ["Harry", "George", "Lucas"],
                   "Language": ["Python", "Python", "R"],
                   "Confidence": [9, 4, 7]})
df
Name Language Confidence
0 Harry Python 9
1 George Python 4
2 Lucas R 7

Indexing with []

Select columns by single labels, lists of labels, or slices:

df['Name']  # returns a series
0     Harry
1    George
2     Lucas
Name: Name, dtype: object
df[['Name']]  # returns a dataframe!
Name
0 Harry
1 George
2 Lucas
df[['Name', 'Language']]
Name Language
0 Harry Python
1 George Python
2 Lucas R

You can only index rows by using slices, not single values (but not recommended, see preferred methods below).

df[0] # doesn't work
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/Applications/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2894             try:
-> 2895                 return self._engine.get_loc(casted_key)
   2896             except KeyError as err:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-16-feb9bd85061b> in <module>
----> 1 df[0] # doesn't work

/Applications/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2900             if self.columns.nlevels > 1:
   2901                 return self._getitem_multilevel(key)
-> 2902             indexer = self.columns.get_loc(key)
   2903             if is_integer(indexer):
   2904                 indexer = [indexer]

/Applications/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2895                 return self._engine.get_loc(casted_key)
   2896             except KeyError as err:
-> 2897                 raise KeyError(key) from err
   2898 
   2899         if tolerance is not None:

KeyError: 0
df[0:1] # does work
Name Language Confidence
0 Harry Python 9
df[1:] # does work
Name Language Confidence
1 George Python 4
2 Lucas R 7

Indexing with .loc and .iloc

Pandas created the methods .loc[] and .iloc[] as more flexible alternatives for accessing data from a dataframe. Use df.iloc[] for indexing with integers and df.loc[] for indexing with labels. These are typically the recommended methods of indexing in Pandas.

df
Name Language Confidence
0 Harry Python 9
1 George Python 4
2 Lucas R 7

First we’ll try out .iloc which accepts integers as references to rows/columns:

df.iloc[0] 
Name           Harry
Language      Python
Confidence         9
Name: 0, dtype: object
df.iloc[0:2]  # slicing returns a dataframe
Name Language Confidence
0 Harry Python 9
1 George Python 4
df.iloc[2, 1]  # returns the indexed object
'R'
df.iloc[[0, 1], [1, 2]]  # returns a dataframe
Language Confidence
0 Python 9
1 Python 4

Now let’s look at .loc which accepts labels as references to rows/columns:

df.loc[:, 'Name']
0     Harry
1    George
2     Lucas
Name: Name, dtype: object
df.loc[:, 'Name':'Language']
Name Language
0 Harry Python
1 George Python
2 Lucas R
df.loc[[0, 2], ['Language']]
Language
0 Python
2 R

Sometimes we want to use a mix of integers and labels to reference data in a dataframe. The easiest way to do this is to use .loc[] with a label then use an integer in combinations with .index or .columns:

df.index
RangeIndex(start=0, stop=3, step=1)
df.columns
Index(['Name', 'Language', 'Confidence'], dtype='object')
df.loc[df.index[0], 'Confidence']  # I want to reference the first row and the column named "Courses"
9
df.loc[2, df.columns[1]]  # I want to reference row "2" and the second column
'R'

Boolean indexing

Just like with series, we can select data based on boolean masks:

df[df['Confidence'] > 5]
Name Language Confidence
0 Harry Python 9
2 Lucas R 7
df[df['Name'] == "Lucas"]
Name Language Confidence
2 Lucas R 7

Indexing with .query()

Boolean masks work fine, but we can also use the .query() method for selecting data. df.query() is a powerful tool for filtering data. It has an odd syntax, it is more like SQL - df.query() accepts a string expression to evaluate and it “knows” the names of the columns in your dataframe.

df.query("Confidence > 4 & Language == 'Python'")
Name Language Confidence
0 Harry Python 9

Note the use of single quotes AND double quotes above, lucky we have both in Python! Compare this to the equivalent boolean indexing operation and you can see that .query() is much more readable, especially as the query gets bigger!

df[(df['Confidence'] > 4) & (df['Language'] == 'Python')]
Name Language Confidence
0 Harry Python 9

Query also allows you to reference variable in the current workspace using the @ symbol:

confidence_threshold = 4
df.query("Confidence > @confidence_threshold")
Name Language Confidence
0 Harry Python 9
2 Lucas R 7

Reading/Writing Data From External Sources


.csv files

A lot of the time we will be loading .csv files for use in pandas. We can use the pd.read_csv() function for this. In the remaining sections of this chapter we will work with the Baby Names dataset. There are so many arguments that can be used to help read in your .csv file in an efficient and appropriate manner, feel free to check them out now (by using shift + tab in Jupyter, or typing help(pd.read_csv)).

path = 'babynames.csv'
baby = pd.read_csv(path)
baby
Name Sex Count Year
0 Mary F 9217 1884
1 Anna F 3860 1884
2 Emma F 2587 1884
3 Elizabeth F 2549 1884
4 Minnie F 2243 1884
... ... ... ... ...
1891889 Titus M 5 1883
1891890 Toney M 5 1883
1891891 Verna M 5 1883
1891892 Winnie M 5 1883
1891893 Winthrop M 5 1883

1891894 rows × 4 columns

You can print a dataframe to .csv using df.to_csv(). Be sure to check out all of the possible arguments to write your dataframe exactly how you want it.

Slicing using .loc

To select subsets of a DataFrame, we use the .loc slicing syntax. The first argument is the label of the row and the second is the label of the column:

baby.loc[1, 'Name'] # Row labeled 1, Column labeled 'Name'
'Anna'

To slice out multiple rows or columns, we can use :. Note that .loc slicing is inclusive, unlike Python’s slicing.

# Get rows 1 through 5, columns Name through Count inclusive
baby.loc[1:5, 'Name':'Count']
Name Sex Count
1 Anna F 3860
2 Emma F 2587
3 Elizabeth F 2549
4 Minnie F 2243
5 Margaret F 2142

We will often want a single column from a DataFrame:

baby.loc[:, 'Year']
0          1884
1          1884
2          1884
3          1884
4          1884
           ... 
1891889    1883
1891890    1883
1891891    1883
1891892    1883
1891893    1883
Name: Year, Length: 1891894, dtype: int64

To select out specific columns, we can pass a list into the .loc slice:

# This is a DataFrame again
baby.loc[:, ['Name', 'Year']]
Name Year
0 Mary 1884
1 Anna 1884
2 Emma 1884
3 Elizabeth 1884
4 Minnie 1884
... ... ...
1891889 Titus 1883
1891890 Toney 1883
1891891 Verna 1883
1891892 Winnie 1883
1891893 Winthrop 1883

1891894 rows × 2 columns

Common DataFrame Operations

DataFrames have built-in functions for performing most common operations, e.g., .min(), idxmin(), sort_values(), etc. They’re all documented in the Pandas documentation here but I’ll demonstrate a few below:

baby.min()
Name     Aaban
Sex          F
Count        5
Year      1880
dtype: object
baby['Year'].min()
1880
baby['Year'].idxmin()
1883748
baby['Year'].sum()
3734830352

We can use the .describe() method to get the basic summary statistics for our numerical columns and .info() to get an overview of the different data types contained inside our dataframe.

baby.describe()
Count Year
count 1.891894e+06 1.891894e+06
mean 1.821106e+02 1.974122e+03
std 1.544197e+03 3.386497e+01
min 5.000000e+00 1.880000e+03
25% 7.000000e+00 1.951000e+03
50% 1.200000e+01 1.984000e+03
75% 3.200000e+01 2.002000e+03
max 9.968500e+04 2.016000e+03
baby.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1891894 entries, 0 to 1891893
Data columns (total 4 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   Name    object
 1   Sex     object
 2   Count   int64 
 3   Year    int64 
dtypes: int64(2), object(2)
memory usage: 57.7+ MB

Some methods require arguments to be specified, like .sort_values():

baby.sort_values(by='Year')
Name Sex Count Year
1883748 Mary F 7065 1880
1885089 Vern M 19 1880
1885088 Stewart M 19 1880
1885087 Randolph M 19 1880
1885086 Lucien M 19 1880
... ... ... ... ...
1864551 Livanna F 7 2016
1864552 Livinia F 7 2016
1864553 Liyanna F 7 2016
1864335 Jian F 7 2016
1865157 Adianna F 6 2016

1891894 rows × 4 columns

baby.sort_values(by='Year', ascending=False)
Name Sex Count Year
1881450 Schyler M 6 2016
1861918 Ellenora F 9 2016
1861824 Braylea F 9 2016
1861825 Brayli F 9 2016
1861826 Breelle F 9 2016
... ... ... ... ...
1885085 Hardy M 19 1880
1885086 Lucien M 19 1880
1885087 Randolph M 19 1880
1885088 Stewart M 19 1880
1885671 Humphrey M 5 1880

1891894 rows × 4 columns

Summary


We now have learned to express the following operations in pandas:

Operation

pandas

Read a CSV file

pd.read_csv()

Slicing using labels or indices

.loc and .iloc

Slicing rows using a predicate

Use a boolean-valued Series in .loc

Sorting rows

.sort_values()