How to Perform Memory-Efficient Operations on Large Datasets with Pandas
Let’s learn how to perform memory-efficient operations in Pandas with large datasets.

 

Preparation

 
As this tutorial uses the Pandas package, you should have it installed. We will also use the NumPy package, so install them both.
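If they are missing, both can be installed with pip:

pip install pandas numpy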

 

Then, let’s get into the central part of the tutorial.
 

Perform Memory-Efficient Operations with Pandas

 

Pandas is not typically known for processing large datasets, as memory-intensive operations with the Pandas package can take too much time or even consume all of your RAM. However, there are ways to improve the efficiency of Pandas operations.

In this tutorial, we will walk through several ways to improve your experience with large datasets in Pandas.

First, try loading the dataset with memory-optimizing parameters. Specify memory-friendly data types for columns where possible, and read only the columns you actually need.

import pandas as pd

# Read only the columns we need, and store col1 as a 32-bit integer
# instead of the default 64-bit type.
df = pd.read_csv('some_large_dataset.csv', low_memory=True,
                 dtype={'col1': 'int32'}, usecols=['col1', 'col2'])

 

Converting integer and float columns to the smallest type that still fits the data helps reduce the memory footprint. Using the category type for string columns with a small number of unique values also helps, as does keeping only the columns you need.
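As a minimal sketch, assuming the df loaded above has a numeric col1 and a repetitive string col2, the conversions could look like this:

# Downcast to the smallest integer type that fits the values.
df['col1'] = pd.to_numeric(df['col1'], downcast='integer')

# Store repeated strings as a single set of categories instead of
# one Python object per row.
df['col2'] = df['col2'].astype('category')

# Check the per-column memory usage (in bytes) to verify the savings.
print(df.memory_usage(deep=True))

Calling memory_usage(deep=True) before and after the conversion is a quick way to confirm how much memory you actually saved.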

Next, we can process the file in chunks to avoid loading all the data into memory at once. For example, say we want the mean of a column, but the dataset is too big to read in one go. We can process 100,000 rows at a time, accumulate the sum and row count of each chunk, and combine them at the end.

total_sum = 0
total_rows = 0

chunksize = 100000
for chunk in pd.read_csv('some_large_dataset.csv', chunksize=chunksize):
    # Accumulate the running sum and row count; averaging the
    # per-chunk means would be skewed when the last chunk is
    # smaller than the others.
    total_sum += chunk['target_column'].sum()
    total_rows += len(chunk)

final_result = total_sum / total_rows

 

Additionally, avoid using the apply method with lambda functions where you can; row-wise apply can be slow and memory-intensive. It’s better to use vectorized operations, which work on whole columns at once.

# Vectorized: multiplies the whole column at once, with no Python-level loop.
df['new_column'] = df['existing_column'] * 2

 

For conditional operations in Pandas, it’s also faster to use np.where rather than a lambda function with .apply.

import numpy as np

# Vectorized condition: 1 where the column is positive, 0 otherwise.
df['new_column'] = np.where(df['existing_column'] > 0, 1, 0)

 

Then, using inplace=True in many Pandas operations is more memory-efficient than assigning the result back to the DataFrame. Assigning the result back creates a separate DataFrame before it is bound to the same variable.

# Drop the column without creating an intermediate copy of the DataFrame.
df.drop(columns=['column_to_drop'], inplace=True)

 

Lastly, filter the data as early as possible, before any heavy operations. This limits the amount of data we process downstream.

threshold = 100  # placeholder cutoff; choose a value that fits your data
df = df[df['filter_column'] > threshold]

 

Try to master these tips to improve your Pandas experience with large datasets.

 


Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and written media. Cornellius writes on a variety of AI and machine learning topics.


