Pandas is a python library used for working with tabular data. It is easy to use and it is fast since it uses C under the hood.
The 2 main disadvantatges is that it will try to fit all the information in the RAM memory so it has limits on the amounth of data it can work with. And it only uses one core of the computer so it might be possible to improve.
The best source for information are the pandas webpage and the offical documentation.
This post will be an overview of the basic usage of this library.
The first thing is to install it with pip:
pip install pandas
And the second one to import it:
import pandas as pd print(pd.__version__)
This tutorial uses pandas 0.25.3 since it was the latest version avaible at the time this post was written.
To create a new DataFrame from python objects use pd.DataFrame
.
It works with almost all python objects, for example:
df = pd.DataFrame({"A": range(5), "B": list("ABCDE")})
The most common reading function is pd.read_csv
.
But there are also functions to read almost everything (more info).
They are really easy to use since in general you only need the filename of the file you want to read:
df = pd.read_csv("datasets/titanic_train.csv") df_iris = pd.read_excel("datasets/iris.xlsx")
Dataframes have functions that allows you to export them to almost any format.
For example you can export it to parquet
(a really eficient file format, more info here).
df.to_parquet("datasets/output.parquet")
Pandas is usually used to handle a lot of data so it is difficult to view all of it. That is why is important to preview, describe and summarize the data it contains.
One of the most basic things to do is to view a part of the data. The three most common functions to do so are:
Function | What it does |
---|---|
df.head(n) | Show first n rows |
df.tail(n) | Show last n rows |
df.sample(n) | Show a sample of n rows |
There are some ways to describe the whole dataframe.
The most common way is by calling df.describe()
.
This will give a report of the numerical columns with some basic stats like mean, std, min, max etc.
It is possible to include all columns with:
df.describe(include='all')
The other important function is:
df.info(memory_usage='deep')
This will show the types of each column and an aproximation of how much memory is used.
A column of the dataframe in pandas is called a Series.
To best way to view a column is with df[name]
.
It is possible to call df.name
but it is not advised since it won't work with columns that have spaces or with some special names.
It is possible to use the functions to describe the dataframe to describe a column.
For example: df["age"].head()
.
There are also some functions to describe the column. The most important are:
Function | What it does |
---|---|
df[column].head(n) | Show first n values |
df[column].unique() | Show all unique values |
df["Sex"].value_counts() | Show all unique values while counting how many times they appear |
There are some functions to retrive one basic stat for the whole dataframe. Those are:
As an example they can be called with df.mean()
.
They can also be used for a column like df[col].nunique()
.
Pandas is usually used to work with a lot of information. So it is really important to know how to slice the dataframe to get a subset of it.
The best way to get one value is with the .at
method.
The first parameter would be the name of the row and the second the name of the column.
So for example:
df.at[1, "Sex"]
This would yield the value of the Sex
column on the row with index 1
.
As seen in the previous section a column can be called like df["Sex"]
.
By passing a list instead of only one name it will return some columns.
For example:
df[["Sex", "Age"]] # Return 'Age' and 'Sex' columns
To filter some rows you will need to call df[some_filter]
.
This filter could be complex by using the and operator (&
), the or (|
) or the not (~
).
You can't use the regular python operators (and
, or
, not
).
Some examples of filters would be:
df[df["Sex"] == "male"] # Only rows with male people df[(df["Age"] > 22) & (df["Age"] < 30)] # Age between 22 and 30
And of course you can first create a filter and the apply it:
mfilter = df["Age"] > 20 df[mfilter]
Apart from the basic operators there is the isin
which will give all values inside a list and the isna
which will give all null values.
So for example:
df[df["Pclass"].isin([1, 2])] # All with 'Pclass' 1 or 2 df[df["Cabin"].isna()] # All that does not have 'Cabin' informed
You can use loc
when using rows/columns names or iloc
when using the position (number) for rows/columns.
Some examples:
df.loc[df["Age"] > 30, "Survived"] # First the filter for rows, then the column or columns df.loc[df["Age"] > 30, ["Sex", "Survived"]].head() # You can always concatenate functions df.iloc[0:5, 1:3] # Slice for rows, slice for columns
This is one of the most powerful ways of working with pandas. So you must really get comfortable with it!
You might find the function
ix
but it has been deprecated in favor ofloc/iloc
. Please do not useix
.
It is important to see the data but it is usually not enough.
In general you can assing values by using the same functions that were used to acces data. For example:
df.at[1, "Sex"] = "Other" # Assign one value df["Ticket"] = "no ticket" # Assign a value to all rows df["Dummy"] = df["Age"] + df["Survived"] # sum to columns and assign the value to a new column
For columns that have string values it is possible the call .str
and use some string special functions.
As an example:
df["Name"].str.replace("Mr", "Hello") # This only shows the modification but it does not replace the original value df["Name"] = df["Name"].str.replace("Mr", "Hello") # This updates the values df["Name"].str[:10] # First 10 chars of each string
There are also some special functions for date columns by calling .dt
.
But to use them first you would need to transform the column to datetime.
df["key"] = pd.to_datetime(df["key"]) # Transform the column to datetime
This function is really smart and usually there is no need to specify the format since it will get guessed.
The only parameter I find really useful is to pass dayfirst=True
to work with formats like dd/mm/yyyy
.
After transforming the column to date some of the things that you could do are:
df["key"].dt.day # Gets the day df["key"].dt.strftime("%Y-%m-01") # Get first day of the month
The function fillna
will fill the missing values.
In order to update the original values you need to set inplace to true (fillna.(fill_value, inplace=True)
)
As an example:
df["Cabin"].fillna("Unknown")
It is also possible to drop rows with missing values with df[col].dropna()
This is really straight forward:
df.sort_values("Cabin", ascending=False) # Default ascending=True
It is possible to drop rows that have duplicates with:
df.drop_duplicates(["Pclass"]) # Discard the duplicates based on certain columns
You can apply custom functions. Remeber that usually functions have names but you can define unamed functions using lambdas. As an example:
def sum_1(x): return x + 1 sum_2 = lambda x: x + 2
You have three functions that will work when applying custom functions:
DataFrame.apply
operates on entire rows or columns at a time.DataFrame.applymap
, Series.apply
, and Series.map
operate on one element at time.So for example:
df[numerical_columns].apply(sum_1)
Another example would be:
def custom_categories(x): """ This needs to be applied to one element at a time """ if x > 5: return "L" if x < 1: return "S" return "M" df["SL"].apply(custom_categories)
It is posible to apply changes to only part of the dataframe using loc
.
For example:
df.loc[df["Sex"] == "male", "Pclass"] = 77 # Set 'Pclass' to 77 for all male passangers
And this is where creating a filter before calling .loc
is really useful. For example:
mfilter = df["Embarked"] == "S" df.loc[mfilter, "Age"] = df.loc[mfilter, "Age"]*2
This will double the Age for all rows where Embarked = S
.
This is done with df.drop
. To delete rows set axis=0
and axis=1
for columns.
For example:
df.drop("Dummy", axis=1) df.drop(["Dummy", "Fare", "Cabin"], axis=1, inplace=True) # inplace=True to update the original dataframe
The function df.columns
retrive the names of the columns.
To rename them simply pass a list with the desired names. For example:
df.columns = ['id', 'survived', 'pclass', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'embarked']
A really common transformation for tables is to group them by one or more columns. Once the aggrupation is done you can call one or more aggregation function. For example:
df.groupby("pclass").sum() # One aggregation df.groupby("pclass").agg(["min", "max"]) # More than one aggregation
With pandas 0.25
or higher you can specify what to do to what column and the name of the output column.
This is done with the following syntax: agg.(new_name=(old_name, aggregation_function))
.
For example:
df.groupby("pclass").agg(age_avg=("age", "mean"), age_min=("age", "min"))
And that would be a really good overview of the pandas library. In the future I will writte another post with more advanced pandas tricks.