DataFrames in Python¶

Much of data anlysis involves working with data contained in rectangular form in tables or csv files. In python we call these data tables as DataFrames, they are one of the most common and useful forms of data for analysis. We introduce data manipulation using pandas, the standard Python library for working with dataframes.

It is more important that you understand the types of useful operations on data than the exact details of pandas syntax. For example, knowing when to perform a group-by is generally more useful than knowing how to call the pandas function to group data.

Because we will cover only the most commonly used pandas functions in this tutorial, you should bookmark the pandas documentation for reference when you conduct your own data analyses.

We begin by talking about the types of dataset structures that pandas can read. Then, we introduce indexes, grouping, apply, and strings.

Chapter Learning Objectives¶

Create Pandas series with pd.Series() and Pandas dataframe with pd.DataFrame()
Be able to access values from a Series/DataFrame by indexing, slicing and boolean indexing using notation such as df[], df.loc[], df.iloc[], df.query[]
Perform basic arithmetic operations between two series and anticipate the result.
Describe how Pandas assigns dtypes to Series and what the object dtype is
Read a standard .csv file from a local path or url using Pandas pd.read_csv().

Introduction to Pandas¶

Pandas is most popular Python library for tabular data structures. You can think of Pandas as an extremely powerful version of Excel (but free and with a lot more features!)

Pandas can be installed using conda:

conda install pandas

We usually import pandas with the alias pd. You’ll see these two imports at the top of most data science workflows:

import pandas as pd

Pandas Series¶

What are Series?¶

A Series is like an list/array but with labels. They are strictly 1-dimensional and can contain any data type (integers, strings, floats, objects, etc), including a mix of them. Series can be created from a scalar, a list, ndarray or dictionary using pd.Series() (note the captial “S”). Here are some example series:

Creating Series¶

By default, series are labelled with indices starting from 0. For example:

pd.Series(data = [-5, 1.3, 21, 6, 3])

  -5.0
   1.3
  21.0
   6.0
   3.0
dtype: float64

But you can add a custom index:

pd.Series(data = [-5, 1.3, 21, 6, 3],
          index = ['a', 'b', 'c', 'd', 'e'])

a    -5.0
b     1.3
c    21.0
d     6.0
e     3.0
dtype: float64

You can create a Series from a dictionary:

pd.Series(data = {'a': 10, 'b': 20, 'c': 30})

a    10
b    20
c    30
dtype: int64

Series Characteristics¶

Series can be given a name attribute. I almost never use this but it might come up sometimes:

s = pd.Series(data = [34, 56, 45, 75, 90], name='random_series')
s

  34
  56
  45
  75
  90
Name: random_series, dtype: int64

s.name

'random_series'

s.rename("another_name")

  34
  56
  45
  75
  90
Name: another_name, dtype: int64

You can access the index labels of your series using the .index attribute:

s.index

RangeIndex(start=0, stop=5, step=1)

Pandas DataFrames¶

What are DataFrames?¶

Pandas DataFrames are you’re new best friend. They are like the Excel spreadsheets you may be used to. DataFrames are really just Series stuck together! Think of a DataFrame as a dictionary of series, with the “keys” being the column labels and the “values” being the series data:

Creating DataFrames¶

Dataframes can be created using pd.DataFrame() (note the capital “D” and “F”). Like series, index and column labels of dataframes are labelled starting from 0 by default:

pd.DataFrame([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

	0	1	2
0	1	2	3
1	4	5	6
2	7	8	9

We can use the index and columns arguments to give them labels:

pd.DataFrame([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]],
             index = ["R1", "R2", "R3"],
             columns = ["C1", "C2", "C3"])

	C1	C2	C3
R1	1	2	3
R2	4	5	6
R3	7	8	9

There are so many ways to create dataframes. We can create them from dictionaries

pd.DataFrame({"C1": [1, 2, 3],
              "C2": ['A', 'B', 'C']},
             index=["R1", "R2", "R3"])

	C1	C2
R1	1	A
R2	2	B
R3	3	C

Here’s a table of the main ways you can create dataframes (see the Pandas documentation for more):

Create DataFrame from	Code
Lists of lists	`pd.DataFrame([['Tom', 7], ['Mike', 15], ['Tiffany', 3]])`
ndarray	`pd.DataFrame(np.array([['Tom', 7], ['Mike', 15], ['Tiffany', 3]]))`
Dictionary	`pd.DataFrame({"Name": ['Tom', 'Mike', 'Tiffany'], "Number": [7, 15, 3]})`
List of tuples	`pd.DataFrame(zip(['Tom', 'Mike', 'Tiffany'], [7, 15, 3]))`
Series	`pd.DataFrame({"Name": pd.Series(['Tom', 'Mike', 'Tiffany']), "Number": pd.Series([7, 15, 3])})`

Indexing and Slicing DataFrames¶

There are several main ways to select data from a DataFrame:

[]
.loc[]
.iloc[]
Boolean indexing
.query()

df = pd.DataFrame({"Name": ["Harry", "George", "Lucas"],
                   "Language": ["Python", "Python", "R"],
                   "Confidence": [9, 4, 7]})
df

	Name	Language	Confidence
0	Harry	Python	9
1	George	Python	4
2	Lucas	R	7

Indexing with `[]`¶

Select columns by single labels, lists of labels, or slices:

df['Name']  # returns a series

   Harry
  George
   Lucas
Name: Name, dtype: object

df[['Name']]  # returns a dataframe!

	Name
0	Harry
1	George
2	Lucas

df[['Name', 'Language']]

	Name	Language
0	Harry	Python
1	George	Python
2	Lucas	R

You can only index rows by using slices, not single values (but not recommended, see preferred methods below).

df[0] # doesn't work

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/Applications/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2894             try:
-> 2895                 return self._engine.get_loc(casted_key)
   2896             except KeyError as err:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-16-feb9bd85061b> in <module>
----> 1 df[0] # doesn't work

/Applications/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2900             if self.columns.nlevels > 1:
   2901                 return self._getitem_multilevel(key)
-> 2902             indexer = self.columns.get_loc(key)
   2903             if is_integer(indexer):
   2904                 indexer = [indexer]

/Applications/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2895                 return self._engine.get_loc(casted_key)
   2896             except KeyError as err:
-> 2897                 raise KeyError(key) from err
   2898 
   2899         if tolerance is not None:

KeyError: 0

df[0:1] # does work

	Name	Language	Confidence
0	Harry	Python	9

df[1:] # does work

	Name	Language	Confidence
1	George	Python	4
2	Lucas	R	7

Indexing with `.loc` and `.iloc`¶

Pandas created the methods .loc[] and .iloc[] as more flexible alternatives for accessing data from a dataframe. Use df.iloc[] for indexing with integers and df.loc[] for indexing with labels. These are typically the recommended methods of indexing in Pandas.

df

	Name	Language	Confidence
0	Harry	Python	9
1	George	Python	4
2	Lucas	R	7

First we’ll try out .iloc which accepts integers as references to rows/columns:

df.iloc[0] 

Name           Harry
Language      Python
Confidence         9
Name: 0, dtype: object

df.iloc[0:2]  # slicing returns a dataframe

	Name	Language	Confidence
0	Harry	Python	9
1	George	Python	4

df.iloc[2, 1]  # returns the indexed object

'R'

df.iloc[[0, 1], [1, 2]]  # returns a dataframe

	Language	Confidence
0	Python	9
1	Python	4

Now let’s look at .loc which accepts labels as references to rows/columns:

df.loc[:, 'Name']

   Harry
  George
   Lucas
Name: Name, dtype: object

df.loc[:, 'Name':'Language']

	Name	Language
0	Harry	Python
1	George	Python
2	Lucas	R

df.loc[[0, 2], ['Language']]

	Language
0	Python
2	R

Sometimes we want to use a mix of integers and labels to reference data in a dataframe. The easiest way to do this is to use .loc[] with a label then use an integer in combinations with .index or .columns:

df.index

RangeIndex(start=0, stop=3, step=1)

df.columns

Index(['Name', 'Language', 'Confidence'], dtype='object')

df.loc[df.index[0], 'Confidence']  # I want to reference the first row and the column named "Courses"

df.loc[2, df.columns[1]]  # I want to reference row "2" and the second column

'R'

Boolean indexing¶

Just like with series, we can select data based on boolean masks:

df[df['Confidence'] > 5]

	Name	Language	Confidence
0	Harry	Python	9
2	Lucas	R	7

df[df['Name'] == "Lucas"]

	Name	Language	Confidence
2	Lucas	R	7

Indexing with `.query()`¶

Boolean masks work fine, but we can also use the .query() method for selecting data. df.query() is a powerful tool for filtering data. It has an odd syntax, it is more like SQL - df.query() accepts a string expression to evaluate and it “knows” the names of the columns in your dataframe.

df.query("Confidence > 4 & Language == 'Python'")

	Name	Language	Confidence
0	Harry	Python	9

Note the use of single quotes AND double quotes above, lucky we have both in Python! Compare this to the equivalent boolean indexing operation and you can see that .query() is much more readable, especially as the query gets bigger!

df[(df['Confidence'] > 4) & (df['Language'] == 'Python')]

	Name	Language	Confidence
0	Harry	Python	9

Query also allows you to reference variable in the current workspace using the @ symbol:

confidence_threshold = 4
df.query("Confidence > @confidence_threshold")

	Name	Language	Confidence
0	Harry	Python	9
2	Lucas	R	7

Reading/Writing Data From External Sources¶

.csv files¶

A lot of the time we will be loading .csv files for use in pandas. We can use the pd.read_csv() function for this. In the remaining sections of this chapter we will work with the Baby Names dataset. There are so many arguments that can be used to help read in your .csv file in an efficient and appropriate manner, feel free to check them out now (by using shift + tab in Jupyter, or typing help(pd.read_csv)).

path = 'babynames.csv'
baby = pd.read_csv(path)
baby

	Name	Sex	Count	Year
0	Mary	F	9217	1884
1	Anna	F	3860	1884
2	Emma	F	2587	1884
3	Elizabeth	F	2549	1884
4	Minnie	F	2243	1884
...	...	...	...	...
1891889	Titus	M	5	1883
1891890	Toney	M	5	1883
1891891	Verna	M	5	1883
1891892	Winnie	M	5	1883
1891893	Winthrop	M	5	1883

1891894 rows × 4 columns

You can print a dataframe to .csv using df.to_csv(). Be sure to check out all of the possible arguments to write your dataframe exactly how you want it.

Slicing using `.loc`¶

To select subsets of a DataFrame, we use the .loc slicing syntax. The first argument is the label of the row and the second is the label of the column:

baby.loc[1, 'Name'] # Row labeled 1, Column labeled 'Name'

'Anna'

To slice out multiple rows or columns, we can use :. Note that .loc slicing is inclusive, unlike Python’s slicing.

# Get rows 1 through 5, columns Name through Count inclusive
baby.loc[1:5, 'Name':'Count']

	Name	Sex	Count
1	Anna	F	3860
2	Emma	F	2587
3	Elizabeth	F	2549
4	Minnie	F	2243
5	Margaret	F	2142

We will often want a single column from a DataFrame:

baby.loc[:, 'Year']

0          1884
1          1884
2          1884
3          1884
4          1884
           ... 
1891889    1883
1891890    1883
1891891    1883
1891892    1883
1891893    1883
Name: Year, Length: 1891894, dtype: int64

To select out specific columns, we can pass a list into the .loc slice:

# This is a DataFrame again
baby.loc[:, ['Name', 'Year']]

	Name	Year
0	Mary	1884
1	Anna	1884
2	Emma	1884
3	Elizabeth	1884
4	Minnie	1884
...	...	...
1891889	Titus	1883
1891890	Toney	1883
1891891	Verna	1883
1891892	Winnie	1883
1891893	Winthrop	1883

1891894 rows × 2 columns

Common DataFrame Operations¶

DataFrames have built-in functions for performing most common operations, e.g., .min(), idxmin(), sort_values(), etc. They’re all documented in the Pandas documentation here but I’ll demonstrate a few below:

baby.min()

Name     Aaban
Sex          F
Count        5
Year      1880
dtype: object

baby['Year'].min()

baby['Year'].idxmin()

baby['Year'].sum()

3734830352

We can use the .describe() method to get the basic summary statistics for our numerical columns and .info() to get an overview of the different data types contained inside our dataframe.

baby.describe()

	Count	Year
count	1.891894e+06	1.891894e+06
mean	1.821106e+02	1.974122e+03
std	1.544197e+03	3.386497e+01
min	5.000000e+00	1.880000e+03
25%	7.000000e+00	1.951000e+03
50%	1.200000e+01	1.984000e+03
75%	3.200000e+01	2.002000e+03
max	9.968500e+04	2.016000e+03

baby.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1891894 entries, 0 to 1891893
Data columns (total 4 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   Name    object
 1   Sex     object
 2   Count   int64 
 3   Year    int64 
dtypes: int64(2), object(2)
memory usage: 57.7+ MB

Some methods require arguments to be specified, like .sort_values():

baby.sort_values(by='Year')

	Name	Sex	Count	Year
1883748	Mary	F	7065	1880
1885089	Vern	M	19	1880
1885088	Stewart	M	19	1880
1885087	Randolph	M	19	1880
1885086	Lucien	M	19	1880
...	...	...	...	...
1864551	Livanna	F	7	2016
1864552	Livinia	F	7	2016
1864553	Liyanna	F	7	2016
1864335	Jian	F	7	2016
1865157	Adianna	F	6	2016

1891894 rows × 4 columns

baby.sort_values(by='Year', ascending=False)

	Name	Sex	Count	Year
1881450	Schyler	M	6	2016
1861918	Ellenora	F	9	2016
1861824	Braylea	F	9	2016
1861825	Brayli	F	9	2016
1861826	Breelle	F	9	2016
...	...	...	...	...
1885085	Hardy	M	19	1880
1885086	Lucien	M	19	1880
1885087	Randolph	M	19	1880
1885088	Stewart	M	19	1880
1885671	Humphrey	M	5	1880

1891894 rows × 4 columns

Summary¶

We now have learned to express the following operations in pandas:

Operation	`pandas`
Read a CSV file	`pd.read_csv()`
Slicing using labels or indices	`.loc` and `.iloc`
Slicing rows using a predicate	Use a boolean-valued Series in `.loc`
Sorting rows	`.sort_values()`

Data Science using Python

DataFrames in Python

Contents

DataFrames in Python¶

Chapter Learning Objectives¶

Introduction to Pandas¶

Pandas Series¶

What are Series?¶

Creating Series¶

Series Characteristics¶

Pandas DataFrames¶

What are DataFrames?¶

Creating DataFrames¶

Indexing and Slicing DataFrames¶

Indexing with `[]`¶

Indexing with `.loc` and `.iloc`¶

Boolean indexing¶

Indexing with `.query()`¶

Reading/Writing Data From External Sources¶

.csv files¶

Slicing using `.loc`¶

Common DataFrame Operations¶

Summary¶

Data Science using Python

DataFrames in Python

Contents

DataFrames in Python¶

Chapter Learning Objectives¶

Introduction to Pandas¶

Pandas Series¶

What are Series?¶

Creating Series¶

Series Characteristics¶

Pandas DataFrames¶

What are DataFrames?¶

Creating DataFrames¶

Indexing and Slicing DataFrames¶

Indexing with []¶

Indexing with .loc and .iloc¶

Boolean indexing¶

Indexing with .query()¶

Reading/Writing Data From External Sources¶

.csv files¶

Slicing using .loc¶

Common DataFrame Operations¶

Summary¶

Indexing with `[]`¶

Indexing with `.loc` and `.iloc`¶

Indexing with `.query()`¶

Slicing using `.loc`¶