DataFrames 101

Introduction

(aka why do we care?)

The daily workflow of a data scientist involves data cleaning, data analysis, data modeling, and organization of the results in a presentable manner. Many languages did not have everything that data scientists were looking for in these areas – until pandas came on the scene. Pandas is one of the most common and well-known Python libraries for data analysis around the world. Pandas allows you to carry out the entire data science workflow without having to switch to a more domain specific language (like R).

One of the most powerful tools offered in pandas is the DataFrame. A DataFrame is a tabular data structure with rows and columns. Labelled rows and columns improve the clarity and intuition of a data scientist’s daily tasks. If you understand the structure of the data you are working with, it makes it easier to clean it, slice it, dice it, and analyze it.

The columns of a DataFrame are a data structure called a Series. If you extract one column from a data frame, then you will receive a Series. You can access many of the same methods and attributes of a Series that you can of a DataFrame.

A DataFrame is comprised of multiple Series.

Creating a DataFrame

DataFrames can be populated in many different ways, depending on the task at hand. We can create DataFrames from a dictionary of lists with keys as column labels.

# from a dictionary of lists with keys as column labels
import pandas as pd

data = {
    'month': ['January', 'February', 'March', 'April', 'May'],
    'products_sold': [7894, 7902, 6532, 6123, 7425],
    'dollars_collected': [12462.59, 13478.77, 11902.92, 10862.55, 13908.07],
    'visitors': [15892, 20312, 14628, 13999, 16435]}

df = pd.DataFrame(data)
print(df)

We can create DataFrames from a list of lists that contain our data, and a list that contains the column labels. The .zip() method is used to map the similar index of multiple containers so they can be used as a single entity. We wrap this in a list() call to convert the output of .zip() to a list.

Once we have a zipped list, we can use dict() to convert this to a dictionary type. Finally, we can convert the dictionary to a DataFrame. We get the same DataFrame as the one we created above.

# from a list of lists that contain our data, and a list that contains the column labels
import pandas as pd

months = ['January', 'February', 'March', 'April', 'May']
products_sold = [7894, 7902, 6532, 6123, 7425]
dollars_collected = [12462.59, 13478.77, 11902.92, 10862.55, 13908.07]
visitors = [15892, 20312, 14628, 13999, 16435]

column_labels = ['month', 'products_sold', 'dollars_collected', 'visitors']
column_values = [months, products_sold, dollars_collected, visitors]

zipped = list(zip(column_labels, column_values))

zipped_dict = dict(zipped)
df = pd.DataFrame(zipped_dict)
print(df)

It is extremely common as a data scientist to receive data sources in the form of a variety of file types: comma seperated files, tab delimited files, Excel files, etc. The pandas library gives you the ability to read that data into a DataFrame seamlessly.

Remember, if you have column labels in your data file, be sure to set the header parameter to the row number of your labels. Also, if you have a particular column that you want to use as the index of the DataFrame, you can set that explicitly with the index_col parameter.

import pandas as pd

csv_df = pd.read_csv('./data.csv', header=0)

excel_df = pd.read_excel('./data.xlsx', index_col=1)

json_df = pd.read_json('./data.json')

Inspecting a DataFrame

Once the data that needs analyzing is loaded into a DataFrame, it does not mean that it is time for modelling or analysis. In fact, unless you know exactly what the data is like when you receive it, it probably isn’t even time to begin the data cleaning process yet. The best next course of action is to inspect the DataFrame.

Pandas provides numerous attributes and methods that can help you learn more about the structure and contents of a given DataFrame. Going through this inspection process will help you to have a more organized plan when approaching the later steps of the analysis or modelling process.

The .head() and .tail() methods will give you a snapshot of the beginning or the end of a DataFrame, respectively. There is an option to pass in a parameter for the number of rows to return.

df.head()

df.tail()

The .info() method and .shape attribute will help to understand the structure and dimensions of the DataFrame you are working with. Note how .info() displays the type associated with each column. Also, it shows the counts of non-null values in each column. If this count is less than the number of entries, then we know that column contains missing values.

df.shape

df.info()

If you want to know the summary statistics for each column in the DataFrame, you can take advantage of the .describe() method. This method will compile some handy information (mean, median, max, etc.) about the DataFrame into a nice, organized output for you.

Note that by default, .describe() will include all numeric columns only. If you would also like to see statistics for other types of columns, you may specify so with the include parameter.

# list of dtypes to include 
include =['object', 'float64', 'int64'] 

df.describe(include = include)

Cleaning a DataFrame

Often, a data scientist will want to work with subsets of a DataFrame object. There are numerous ways to accomplish this. If you want to select a certain column from a DataFrame, you can use square brackets [ ] with the name of the column, or a period . with the name of the column. If you want to select multiple columns at once, you may put a list of column names into the square brackets.

sold = df['products_sold']
print(sold)

sold2 = df.products_sold
print(sold2)

sales = df[['month', 'products_sold']]
print(sales)

When you want to select a subset of one or more rows or columns, you can take advantage of the .iloc property. .iloc is used for index-based selection, so you can pass it integers, lists of integers, slice objects, or even boolean arrays for both the rows and columns.

Note that if you select one row, .iloc returns a series instead of a DataFrame. If you need the row to be selected as a DataFrame, pass the integer as a single-valued list (  like this: .iloc[[1]] ).

# get first three rows
first3 = df.iloc[0:3]

# get first three rows and first two columns
first3_first2 = df.iloc[0:3, 0:2]

# get last row
last1 = df.iloc[-1]

# get last row as a data frame
last1_df = df.iloc[[-1]]

print('--- First 3 ---')
print(first3)
print('\n', '--- First 3, First 2 ---')
print(first3_first2)
print('\n', '--- Last 1 ---')
print(last1)
print('\n', '--- Last 1 DF ---')
print(last1_df)

If you have found that your data has missing values, and records with missing data are of no use to you, you may drop any records with missing values from the DataFrame with .dropna() .

print(df_withna.shape)

df_dropna = df_withna.dropna()

print(df_dropna.shape)
print(df_dropna.head())

If you have found that your data has missing values, but you can’t drop missing data, you can take advantage of the .fillna() method which will accept a value as a parameter and assign that value to any missing ones.

print(df_withna.shape)

df_fillna = df_withna.fillna(0)

Suppose that you read in some data from a .csv file, and you realize that a column that should be an integer, is actually of type object, due to some strange unexpected values. If you want to convert the column to an integer, you may call the method .to_numeric() on that one column of the DataFrame.

If you want to use .to_numeric() (or any other method that works on a Series) but apply it to more than one column in the DataFrame, you may use the .apply() in combination with a .to_numeric() method call to achieve the expected results.

df.visitors = pd.to_numeric(df.visitors)

df[['visitors', 'products_sold']] = df[['visitors', 'products_sold']].apply(pd.to_numeric)

Sometimes, your data in your DataFrame may be structured differently than you need it to be. Perhaps it is delivered to you in a more reporting friendly version like below – and you want to transform it to be more analysis friendly like so.

The left side is analysis friendly. The right side is more reporting friendly. Basically, we need to go from right to left.

Notice that our data for month, products collected, dollars collected, and visitors is organized much like the reporting friendly data in the diagram above. Thus, in this case, you can use .melt() to turn the columns into rows.

df_melt = df.melt(id_vars='month',var_name='metric',value_name='value')

print(df_melt.head())

It is also possible that the opposite will be the case. When your data is delivered to you in a more analysis friendly version, and you need it to be more report friendly, you can use .pivot() to turn unique values into separate columns. To illustrate this, let’s pivot the DataFrame that we just melted.

df_pivot = df_melt.pivot(index='month', columns='metric', values='value')

df_pivot=df_pivot.reset_index()

print(df_pivot.head())

Remember, your data will not always come in one huge file with everything you need. Sometimes, you will need to combine data from different sources to develop the complete data source for analysis. Luckily, DataFrames make it easy to combine data in these cases.

Given that you have data that is of the same structure but segmented into different subsets (for example, five years of sales data split into multiple DataFrames), you can use .concat() to combine the five DataFrames into one complete DataFrame for analysis. Think of .concat() as a way to stitch together pieces of data that were once a single data set.

Note that if you don’t have a meaningful index in your DataFrames and forget to set ignore_index, you may have duplicate values in the index column!

df_2013 = pd.read_csv('./sales2013.csv')
df_2014 = pd.read_csv('./sales2014.csv')
df_2015 = pd.read_csv('./sales2015.csv')
df_2016_2017 = pd.read_excel('./sales2016_2017.xlsx')

df_sales = pd.concat([df_2013, df_2014, df_2015, df_2016_2017], ignore_index=True)

Finally, there may be times where you have two DataFrames that are connected in some way with a common identifier, but have different attributes. An example of this could be a DataFrame schools with information about each school, and a DataFrame teachers with information about each teacher (including which school they work at). You may need to combine each teacher’s information and the information about their school into one DataFrame for analysis purposes.

This can be accomplished by calling the .merge() method. Think of this method as the pandas equivalent of joining two tables in SQL.

schools = {
    's_id': [1, 2, 3],
    'school_name': ['East High School', 'West High School', 'North Elementary School']}

teachers = {
    't_id': [1, 2, 3],
    'teacher_name': ['John Ace', 'Jane Lay', 'James Mann'],
    'school_id': [2, 2, 3]
}

schools = pd.DataFrame(schools)
teachers = pd.DataFrame(teachers)

merged = pd.merge(left=teachers, right=schools, how='inner', on=None, left_on='school_id', right_on='s_id')

# drop unneeded columns
merged = merged.drop(columns=['school_id', 's_id'])

print(merged)

Visualizing a DataFrame

Time series data can be easily plotted with the pandas method .plot(). Given that your DataFrame has the date time set as the index, you can access the values in a series using the [] method. Once you have isolated the series of values, you can take advantage of .plot() and visualize the time series.

# this is a Python 'magic function' (yes that's real)
# include this to show plots inline within frontends 
# like Jupyter notebook

%matplotlib inline

index = pd.DatetimeIndex(['2019-01-01', '2019-02-01',
                          '2019-03-01', '2019-04-01', '2019-05-01', '2019-06-01'])
data = pd.Series([50, 62, 65, 59, 68, 77], index=index)


data.plot()

Calling plot.bar() produces a multiple bar plot; if you set the stacked parameter to True, you will receive a stacked bar plot. Boxplots are produced by calling the .plot.box() or the boxplot() method of a DataFrame. This is useful to visualize the distribution of values within each column.

# this is the example in the pandas docs
# it's just cute so I used it :)

speed = [0.1, 17.5, 40, 48, 52, 69, 88]
lifespan = [2, 8, 70, 1.5, 25, 12, 28]
index = ['snail', 'pig', 'elephant',
         'rabbit', 'giraffe', 'coyote', 'horse']
animals = pd.DataFrame({'speed': speed,
                   'lifespan': lifespan}, index=index)

fig, (ax1,ax2) = plt.subplots(1,2, figsize=(10,4))

animals.plot.bar(rot=45, title='Speed and Lifespan Bar Graph', ax=ax1)

animals.plot.box(title='Speed and Lifespan Box Plots', ax=ax2)

That’s it! That’s DataFrames 101. There is plenty more to know about DataFrames, but this is intended to give you a good jumping off point to go and learn more and perfect your skills.

Be the first to reply

Leave a Reply

Your email address will not be published. Required fields are marked *