
In this tutorial, we will learn how to use Pandas’ `pipe` method to build end-to-end data science pipelines. The pipeline includes various steps like data ingestion, data cleaning, data analysis, and data visualization. To highlight the benefits of this approach, we will also compare pipeline-based code with non-pipeline alternatives, giving you a clear understanding of the differences and advantages.
The Pandas `pipe` method is a powerful tool that allows users to chain multiple data processing functions in a clear and readable manner. This method can handle both positional and keyword arguments, making it flexible for various custom functions.
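In short, the Pandas `pipe` method passes the DataFrame as the first argument to the function you give it and returns that function's result, which is what makes chaining possible.
Here is a code example of the `pipe` method. It applies two Python functions, `clean` and `analysis`, to a Pandas DataFrame: the chain first cleans the data, then runs the analysis on the cleaned data, and returns the output.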
(
df.pipe(clean)
.pipe(analysis)
)
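To make the chain above concrete, here is a minimal sketch that also shows how `pipe` forwards positional and keyword arguments. The bodies of `clean` and `analysis`, their `column` and `agg` parameters, and the sample data are all assumptions made for illustration; the tutorial does not define them.

```python
import pandas as pd

# Hypothetical stand-ins for the `clean` and `analysis` functions named above.
def clean(data):
    """Drop duplicate rows and rows with missing values."""
    return data.drop_duplicates().dropna().reset_index(drop=True)

def analysis(data, column="units", agg="mean"):
    """Aggregate one column; `column` and `agg` are illustrative parameters."""
    return data[column].agg(agg)

# Made-up sample data, for illustration only.
df = pd.DataFrame({"units": [2, 2, 5, None], "price": [10.0, 10.0, 4.0, 7.5]})

# Same shape as the chain in the text: each step receives the previous result.
default = df.pipe(clean).pipe(analysis)

# pipe passes the DataFrame as the first argument and forwards everything else,
# here one positional argument and one keyword argument.
piped = df.pipe(clean).pipe(analysis, "units", agg="sum")

# The same computation written as nested calls, which must be read inside-out.
nested = analysis(clean(df), "units", agg="sum")

print(default, piped, nested)
```

The piped and nested forms compute the same value; the difference is that the piped form reads top to bottom in the order the steps run.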
First, we will write a simple data analysis script without `pipe`, so that we have a baseline to compare against when we later use `pipe` to simplify the same data processing steps.
For this tutorial, we will be using the Online Sales Dataset – Popular Marketplace Data from Kaggle that contains information about online sales transactions across different product categories.
import pandas as pd
df = pd.read_csv('/work/Online Sales Data.csv')
df.head(3)
# data cleaning
df = df.drop_duplicates()
df = df.dropna()
df = df.reset_index(drop=True)
# convert types
df['Product Category'] = df['Product Category'].astype('str')
df['Product Name'] = df['Product Name'].astype('str')
df['Date'] = pd.to_datetime(df['Date'])
# data analysis
df['month'] = df['Date'].dt.month
new_df = df.groupby('month')['Units Sold'].mean()
# data visualization
new_df.plot(kind='bar', figsize=(10, 5), title="Average Units Sold by Month");
This is quite simple, and if you are a data scientist or even a data science student, you will know how to perform most of these tasks.
To create an end-to-end data science pipeline, we first have to convert the above code into a proper format using Python functions.
We will create Python functions for loading the data, cleaning it, converting column types, analyzing it, and visualizing the result:
def load_data(path):
    return pd.read_csv(path)

def data_cleaning(data):
    data = data.drop_duplicates()
    data = data.dropna()
    data = data.reset_index(drop=True)
    return data

def convert_dtypes(data, types_dict=None):
    # skip the cast when no type mapping is given
    if types_dict:
        data = data.astype(dtype=types_dict)
    # convert the date column to datetime
    data['Date'] = pd.to_datetime(data['Date'])
    return data

def data_analysis(data):
    data['month'] = data['Date'].dt.month
    new_df = data.groupby('month')['Units Sold'].mean()
    return new_df

def data_visualization(new_df, vis_type="bar"):
    new_df.plot(kind=vis_type, figsize=(10, 5), title="Average Units Sold by Month")
    # return the data so the chain can continue after plotting
    return new_df
We will now use the `pipe` method to chain all of the above Python functions in series. In the code that follows, we pass the file path to the `load_data` function, a dictionary of data types to the `convert_dtypes` function, and the visualization type to the `data_visualization` function. Instead of a bar chart, we will use a line chart.
Building a data pipeline this way lets us experiment with different scenarios, such as a different file, type mapping, or chart type, by changing only the arguments rather than the overall code. It also standardizes the code and makes it more readable.
path = "/work/Online Sales Data.csv"
df = (
    pd.DataFrame()
    .pipe(lambda x: load_data(path))
    .pipe(data_cleaning)
    .pipe(convert_dtypes, {'Product Category': 'str', 'Product Name': 'str'})
    .pipe(data_analysis)
    .pipe(data_visualization, 'line')
)
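The chain above starts from an empty DataFrame and uses a lambda only to call the loader. As a possible alternative, the sketch below calls the loader directly and passes the type mapping by keyword. It is not the tutorial's code: the CSV rows are made up, the functions are compact restatements so the block runs on its own, and the plotting step is left out so that it does not need a plotting backend.

```python
import io
import pandas as pd

# Made-up rows standing in for the Kaggle CSV; column names follow the tutorial.
csv_text = """Date,Product Category,Product Name,Units Sold
2024-01-05,Electronics,Phone,2
2024-01-05,Electronics,Phone,2
2024-01-20,Books,Novel,4
2024-02-10,Books,Atlas,
2024-02-15,Clothing,Jacket,6
"""

def load_data(source):
    # read_csv accepts a path or a file-like object
    return pd.read_csv(source)

def data_cleaning(data):
    return data.drop_duplicates().dropna().reset_index(drop=True)

def convert_dtypes(data, types_dict=None):
    if types_dict:
        data = data.astype(types_dict)
    # assign returns a new DataFrame instead of modifying the input
    return data.assign(Date=pd.to_datetime(data["Date"]))

def data_analysis(data):
    with_month = data.assign(month=data["Date"].dt.month)
    return with_month.groupby("month")["Units Sold"].mean()

monthly = (
    load_data(io.StringIO(csv_text))
    .pipe(data_cleaning)
    .pipe(convert_dtypes, types_dict={"Product Category": "str", "Product Name": "str"})
    .pipe(data_analysis)
)
print(monthly)
```

Passing `types_dict=` by keyword makes each step's options self-describing, which can help once a function takes more than one extra argument.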
The end result is a line chart of the average units sold per month, produced by a single readable chain.
In this short tutorial, we learned about the Pandas `pipe` method and how to use it to build and execute end-to-end data science pipelines. A pipeline makes your code more readable, reproducible, and better organized. By integrating the `pipe` method into your workflow, you can streamline your data processing tasks and make your projects easier to maintain. Additionally, some users report significantly faster execution when they replace the `.apply()` method with functions passed to `pipe`. This is plausible when the piped function is vectorized, because `pipe` calls it once on the whole DataFrame rather than once per row or column.
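The closing remark about `pipe` and `.apply()` can be illustrated with a small sketch that counts function calls. The data and function names here are made up for illustration, and no timings are claimed; the point is only that any speed difference comes from the function being vectorized, not from `pipe` itself.

```python
import pandas as pd

# Made-up data, for illustration only.
df = pd.DataFrame({"price": [10.0, 4.0, 7.5], "units": [2, 5, 1]})

calls = {"pipe": 0, "apply": 0}

def revenue_vectorized(data):
    # Receives the whole DataFrame and works on entire columns at once.
    calls["pipe"] += 1
    return data["price"] * data["units"]

def revenue_per_row(row):
    # Receives one row at a time when used with apply(axis=1).
    calls["apply"] += 1
    return row["price"] * row["units"]

via_pipe = df.pipe(revenue_vectorized)
via_apply = df.apply(revenue_per_row, axis=1)

# Both approaches give the same values; only the number of calls differs.
print(calls)
```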
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master’s degree in technology management and a bachelor’s degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.