Image by Author | DALLE-3 & Canva
While pandas is mainly used for data manipulation and analysis, it can also provide basic data visualization capabilities. However, plain dataframes can make the information look cluttered and overwhelming. So, what can be done to make it better? If you’ve worked with Excel before, you know that you can highlight important values with different colors, font styles, etc. The idea of using these styles and colors is to communicate the information in an effective way. You can do similar work with pandas dataframes too, using conditional formatting and the Styler object.
In this article, we will see what conditional formatting is and how to use it to enhance your data readability.
Conditional Formatting
Conditional formatting is a feature in pandas that allows you to format the cells based on some criteria. You can easily highlight the outliers, visualize trends, or emphasize important data points using it. The Styler object in pandas provides a convenient way to apply conditional formatting. Before covering the examples, let’s take a quick look at how the Styler object works.
What is the Styler Object & How Does It Work?
You can control the visual representation of a dataframe through its style property. This property returns a Styler object, which is responsible for styling the dataframe. The Styler object lets you set CSS properties on the dataframe's cells to create a visually appealing and informative display. Styling methods are called on df.style and fall into two groups:
- Built-in Styles: To apply quick formatting styles to your dataframe
- Custom Stylization: Create your own formatting rules and pass them to the Styler object through one of the following methods: Styler.applymap (element-wise) or Styler.apply (column-/row-/table-wise). In pandas 2.1 and later, Styler.applymap is deprecated in favor of Styler.map, which takes the same arguments.
Now, we will cover some examples of both approaches to help you enhance the visualization of your data.
Examples: Built-in-Styles
Let’s create a dummy dataset with columns for Date, Cost Price, Satisfaction Score, and Sales Amount to demonstrate the examples below:
import pandas as pd
import numpy as np
data = {'Date': ['2024-03-05', '2024-03-06', '2024-03-07', '2024-03-08', '2024-03-09', '2024-03-10'],
'Cost Price': [100, 120, 110, 1500, 1600, 1550],
'Satisfaction Score': [90, 80, 70, 95, 85, 75],
'Sales Amount': [1000, 800, 1200, 900, 1100, None]}
df = pd.DataFrame(data)
df
Output:
Original Unformatted Dataframe
1. Highlighting Maximum and Minimum Values
We can use the highlight_max and highlight_min functions to highlight the maximum and minimum values in each column or row. To find them per column, set axis=0 like this:
# Highlighting Maximum and Minimum Values
cols = ['Cost Price', 'Satisfaction Score', 'Sales Amount']
df.style.highlight_max(color="green", axis=0, subset=cols).highlight_min(color="red", axis=0, subset=cols)
Output:
Max & Min Values
2. Applying Color Gradients
Color gradients are an effective way to visualize the magnitude of the values in your data. In this case, we will apply a gradient to the satisfaction scores using the 'viridis' colormap, which ranges from purple (low values) to yellow (high values). Note that background_gradient relies on Matplotlib, so it needs to be installed. Here is how you can do this:
# Applying Color Gradients
df.style.background_gradient(cmap='viridis', subset=['Satisfaction Score'])
Output:
Colormap - viridis
3. Highlighting Null or Missing Values
When we have large datasets, it becomes difficult to identify null or missing values. You can use the built-in df.style.highlight_null function for this purpose. For example, in this case, the sales amount of the 6th entry is missing. You can highlight this information like this:
# Highlighting Null or Missing Values
df.style.highlight_null('yellow', subset=['Sales Amount'])
Output:
Highlighting Missing Values
Examples: Custom Stylization Using apply() & applymap()
1. Conditional Formatting for Outliers
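Suppose that we have a housing dataset with a price for each neighborhood, and we want to highlight the neighborhoods with outlier prices (i.e., prices that are significantly higher or lower than the rest). A common rule flags values that lie more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile. This can be done as follows: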
import pandas as pd
import numpy as np
# House prices dataset
df = pd.DataFrame({
'Neighborhood': ['H1', 'H2', 'H3', 'H4', 'H5', 'H6', 'H7'],
'Price': [50, 300, 360, 390, 420, 450, 1000],
})
# Calculate Q1 (25th percentile), Q3 (75th percentile) and Interquartile Range (IQR)
q1 = df['Price'].quantile(0.25)
q3 = df['Price'].quantile(0.75)
iqr = q3 - q1
# Bounds for outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
# Custom function to highlight outliers
def highlight_outliers(val):
    if val < lower_bound or val > upper_bound:
        return 'background-color: yellow; font-weight: bold; color: black'
    else:
        return ''
# On pandas 2.1+, df.style.map is the preferred name for applymap
df.style.applymap(highlight_outliers, subset=['Price'])
Output:
Highlighting Outliers
2. Highlighting Trends
Consider that you run a company and are recording your sales daily. To analyze the trends, you want to highlight the days when your daily sales increase by more than 5%. You can achieve this using a custom function and the apply method of the Styler object. With axis=1, the function receives one row at a time and returns one CSS string per cell in that row. Here’s how:
import pandas as pd
# Dataset of Company's Sales
data = {'date': ['2024-02-10', '2024-02-11', '2024-02-12', '2024-02-13', '2024-02-14'],
'sales': [100, 105, 110, 115, 125]}
df = pd.DataFrame(data)
# Daily percentage change
df['pct_change'] = df['sales'].pct_change() * 100
# Highlight the day if sales increased by more than 5%
def highlight_trend(row):
    style = 'background-color: green; border: 2px solid black; font-weight: bold'
    return [style if row['pct_change'] > 5 else '' for _ in row]
df.style.apply(highlight_trend, axis=1)
Output:
Highlight >5% Increase in Sales
3. Highlighting Correlated Columns
Correlated columns are important because they show relationships between different variables. For example, if we have a dataset containing age, income, and spending habits and our analysis shows a high correlation (close to 1) between age and income, then it suggests that older people generally have higher incomes. Highlighting correlated columns helps to visually identify these relationships. This approach becomes extremely helpful as the dimensionality of your data increases. Let's explore an example to better understand this concept:
import pandas as pd
# Dataset of people
data = {
'age': [30, 35, 40, 45, 50],
'income': [60000, 66000, 70000, 75000, 100000],
'spending': [10000, 15000, 20000, 18000, 12000]
}
df = pd.DataFrame(data)
# Calculate the correlation matrix
corr_matrix = df.corr()
# Highlight highly correlated columns
def highlight_corr(val):
    if val != 1.0 and abs(val) > 0.5:  # Exclude self-correlation
        return 'background-color: blue; text-decoration: underline'
    else:
        return ''
# On pandas 2.1+, corr_matrix.style.map is the preferred name for applymap
corr_matrix.style.applymap(highlight_corr)
Output:
Correlated Columns
Wrapping Up
These examples are only a starting point for improving how your dataframes look. You can apply similar techniques to other problems, such as highlighting duplicate rows, grouping rows into categories with a different format for each category, or highlighting peak values. There are also many other CSS options to explore in the official pandas documentation; you can even define different properties on hover, like magnifying text or changing color. Check out the "Fun Stuff" section of the documentation for more cool ideas. This article is part of my Pandas series, so if you enjoyed this, there's plenty more to explore. Head over to my author page for more tips, tricks, and tutorials.
Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She's also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.