5 Python Tips for Data Efficiency and Speed



 

Writing efficient Python code is important for optimizing performance and resource usage, whether you’re working on data science projects, building web apps, or tackling other programming tasks.

Using Python’s powerful features and best practices, you can reduce computation time and improve the responsiveness and maintainability of your applications.

In this tutorial, we’ll explore five essential tips to help you write more efficient Python code, with coding examples for each. Let’s get started.

 

1. Use List Comprehensions Instead of Loops

 

You can use list comprehensions to create lists from existing lists and other iterables like strings and tuples. They are generally more concise and faster than regular loops for list operations.

Let’s say we have a dataset of user information, and we want to extract the names of users who have a score greater than 85.

Using a Loop

First, let’s do this using a for loop and if statement:

data = [{'name': 'Alice', 'age': 25, 'score': 90},
        {'name': 'Bob', 'age': 30, 'score': 85},
        {'name': 'Charlie', 'age': 22, 'score': 95}]

# Using a loop
result = []
for row in data:
    if row['score'] > 85:
        result.append(row['name'])

print(result)

 

You should get the following output:

Output >>> ['Alice', 'Charlie']

 

Using a List Comprehension

Now, let’s rewrite using a list comprehension. You can use the generic syntax [output for input in iterable if condition] like so:

data = [{'name': 'Alice', 'age': 25, 'score': 90},
        {'name': 'Bob', 'age': 30, 'score': 85},
        {'name': 'Charlie', 'age': 22, 'score': 95}]

# Using a list comprehension
result = [row['name'] for row in data if row['score'] > 85]

print(result)

 

Which should give you the same output:

Output >>> ['Alice', 'Charlie']

 

As seen, the list comprehension version is more concise and easier to maintain. You can try out other examples and profile your code with timeit to compare the execution times of loops vs. list comprehensions.
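As a rough sketch of such a comparison, the snippet below times both versions with timeit. The 100,000-row dataset and the function names with_loop and with_comprehension are made up for illustration, and the timings you see will depend on your machine and Python version.

```python
import timeit

# Hypothetical dataset: 100,000 rows with the same keys used in the example
data = [{'name': f'user_{i}', 'score': i % 100} for i in range(100_000)]

def with_loop():
    result = []
    for row in data:
        if row['score'] > 85:
            result.append(row['name'])
    return result

def with_comprehension():
    return [row['name'] for row in data if row['score'] > 85]

# Both versions should produce identical results before timing means anything
assert with_loop() == with_comprehension()

# Run each version 20 times and report the total time
loop_time = timeit.timeit(with_loop, number=20)
comprehension_time = timeit.timeit(with_comprehension, number=20)

print(f"Loop: {loop_time:.4f} seconds")
print(f"List comprehension: {comprehension_time:.4f} seconds")
```

The comprehension is usually somewhat faster because it avoids the repeated result.append lookup and call, but the gap is typically modest, so it is worth measuring on your own data.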

List comprehensions, therefore, let you write more readable and efficient Python code, especially in transforming lists and filtering operations. But be careful not to overuse them. Read Why You Should Not Overuse List Comprehensions in Python to learn why overusing them may become too much of a good thing.

 

2. Use Generators for Efficient Data Processing

 

You can use generators in Python to iterate over large datasets and sequences without storing them all in memory up front. This is particularly useful in applications where memory efficiency is important.

Unlike regular Python functions, which use the return keyword to hand back an entire sequence at once, generator functions use yield and return a generator object. You can then loop over this object to get the individual items on demand, one at a time.
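To make the on-demand behavior concrete, here is a minimal sketch. The squares function is a made-up example and not part of any library.

```python
from typing import Generator

def squares(n: int) -> Generator[int, None, None]:
    """Yield the squares of 0..n-1 one at a time."""
    for i in range(n):
        yield i * i

gen = squares(3)     # creates the generator object; nothing is computed yet
first = next(gen)    # runs the function body up to the first yield
rest = list(gen)     # consumes the remaining items

print(first)  # 0
print(rest)   # [1, 4]
```

Once a generator is exhausted it cannot be rewound. To iterate again, call the generator function again.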

Suppose we have a large CSV file with user data, and we want to process each row—one at a time—without loading the entire file into memory at once.

Here’s the generator function for this:

import csv
from typing import Generator, Dict

def read_large_csv_with_generator(file_path: str) -> Generator[Dict[str, str], None, None]:
    # newline='' is what the csv module docs recommend when opening CSV files
    with open(file_path, 'r', newline='') as file:
        reader = csv.DictReader(file)
        for row in reader:
            yield row

# Path to a sample CSV file
file_path = "large_data.csv"

for row in read_large_csv_with_generator(file_path):
    print(row)

 

Note: Remember to replace ‘large_data.csv’ with the path to your file in the above snippet.

As you can already tell, using generators is especially helpful when working with streaming data or when the dataset size exceeds available memory.
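One rough way to see the memory difference is to compare the size of a list with that of an equivalent generator expression. This is only a sketch: sys.getsizeof reports the size of the container object itself, not of the elements it refers to, and the variable names are illustrative.

```python
import sys

n = 1_000_000

squares_list = [i * i for i in range(n)]   # builds all values up front
squares_gen = (i * i for i in range(n))    # produces values on demand

list_size = sys.getsizeof(squares_list)    # grows with the number of items
gen_size = sys.getsizeof(squares_gen)      # small and independent of n

print(f"List: {list_size} bytes")
print(f"Generator: {gen_size} bytes")

# The generator can still be consumed item by item, for example by sum()
total = sum(squares_gen)
print(total)
```

The trade-off is that a generator can be traversed only once and does not support indexing or len().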

For a more detailed review of generators, read Getting Started with Python Generators.

 

3. Cache Expensive Function Calls

 

Caching can significantly improve performance by storing the results of expensive function calls and reusing them when the function is called with the same inputs again.

Suppose you’re coding the k-means clustering algorithm from scratch and want to cache the Euclidean distances computed. Here’s how you can cache function calls with the @cache decorator (available in Python 3.9 and later):


from functools import cache
from typing import Tuple
import numpy as np

@cache
def euclidean_distance(pt1: Tuple[float, float], pt2: Tuple[float, float]) -> float:
    return np.sqrt((pt1[0] - pt2[0]) ** 2 + (pt1[1] - pt2[1]) ** 2)

def assign_clusters(data: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    clusters = np.zeros(data.shape[0])
    for i, point in enumerate(data):
        distances = [euclidean_distance(tuple(point), tuple(centroid)) for centroid in centroids]
        clusters[i] = np.argmin(distances)
    return clusters

 

Let’s take the following sample function call:

data = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [8.0, 9.0], [9.0, 10.0]])
centroids = np.array([[2.0, 3.0], [8.0, 9.0]])

print(assign_clusters(data, centroids))

 

Which outputs:

Output >>> [0. 0. 0. 1. 1.]
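To check whether a cache is actually being hit, you can inspect cache_info() on the decorated function. The sketch below uses a made-up slow_square function and a call counter for illustration. Note that cached functions need hashable arguments, which is why the k-means example converts NumPy rows to tuples before calling euclidean_distance.

```python
from functools import cache

call_count = 0  # counts how often the function body actually runs

@cache
def slow_square(x: int) -> int:
    global call_count
    call_count += 1
    return x * x

# Only two distinct inputs, so the body should run just twice
results = [slow_square(x) for x in [2, 3, 2, 2, 3]]

print(results)     # [4, 9, 4, 4, 9]
print(call_count)  # 2
print(slow_square.cache_info())  # reports 3 hits and 2 misses
```

Because @cache never evicts entries, functools.lru_cache with a maxsize may be a better fit when the set of distinct inputs can grow without bound.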

 

To learn more, read How To Speed Up Python Code with Caching.

 

4. Use Context Managers for Resource Handling

 

In Python, context managers ensure that resources, such as files, database connections, and subprocesses, are properly released after use, even if an error occurs.

Say you need to query a database and want to ensure the connection is properly closed after use:

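import sqlite3
from contextlib import closing

def query_database(db_path, query):
    # Using the sqlite3 connection itself in a with statement only commits or
    # rolls back the transaction; closing() ensures the connection is closed
    with closing(sqlite3.connect(db_path)) as conn:
        cursor = conn.cursor()
        cursor.execute(query)
        for row in cursor.fetchall():
            yield row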

 

You can now try running queries against the database:

query = "SELECT * FROM users"
for row in query_database('people.db', query):
    print(row)
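You can also write your own context managers. The following is a minimal sketch using contextlib.contextmanager. The helper name open_connection and the users table are hypothetical, and ':memory:' is used so that no database file is needed.

```python
import sqlite3
from contextlib import contextmanager

@contextmanager
def open_connection(db_path):
    """Hypothetical helper: open a SQLite connection and always close it."""
    conn = sqlite3.connect(db_path)
    try:
        yield conn
    finally:
        conn.close()  # runs even if the with-block raises an exception

# ':memory:' creates a temporary in-memory database
with open_connection(':memory:') as conn:
    conn.execute("CREATE TABLE users (name TEXT, score INTEGER)")
    conn.execute("INSERT INTO users VALUES ('Alice', 90)")
    rows = conn.execute("SELECT * FROM users").fetchall()

print(rows)  # [('Alice', 90)]
```

The try/finally inside the helper is what guarantees cleanup. Everything before yield acts as setup, and the finally clause acts as teardown.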

 

To learn more about the uses of context managers, read 3 Interesting Uses of Python’s Context Managers.

 

5. Vectorize Operations Using NumPy

 

NumPy allows you to perform element-wise operations on arrays—as operations on vectors—without the need for explicit loops. This is often significantly faster than loops because NumPy uses C under the hood.

Say we have two large arrays representing scores from two different tests, and we want to calculate the average score for each student. Let’s do it using a loop:

import numpy as np

# Sample data
scores_test1 = np.random.randint(0, 100, size=1000000)
scores_test2 = np.random.randint(0, 100, size=1000000)

# Using a loop
average_scores_loop = []
for i in range(len(scores_test1)):
    average_scores_loop.append((scores_test1[i] + scores_test2[i]) / 2)

print(average_scores_loop[:10])

 

Here’s how you can rewrite this with NumPy’s vectorized operations:

# Using NumPy vectorized operations
average_scores_vectorized = (scores_test1 + scores_test2) / 2

print(average_scores_vectorized[:10])
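Before comparing speed, it helps to confirm that both approaches give the same answer. Here is a small sketch with made-up three-element arrays so the results are easy to inspect by hand:

```python
import numpy as np

# Small hypothetical score arrays
test1 = np.array([80, 90, 70])
test2 = np.array([90, 70, 75])

# Loop version: one Python-level iteration per element
loop_result = [(test1[i] + test2[i]) / 2 for i in range(len(test1))]

# Vectorized version: a single expression over the whole arrays
vectorized_result = (test1 + test2) / 2

print(vectorized_result.tolist())                   # [85.0, 80.0, 72.5]
print(np.allclose(loop_result, vectorized_result))  # True
```

np.allclose compares with a small tolerance, which is a safer check than exact equality when the results are floating-point values.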

 

Loops vs. Vectorized Operations

Let’s measure the execution times of the loop and the NumPy versions using timeit:

import timeit

setup = """
import numpy as np

scores_test1 = np.random.randint(0, 100, size=1000000)
scores_test2 = np.random.randint(0, 100, size=1000000)
"""

loop_code = """
average_scores_loop = []
for i in range(len(scores_test1)):
    average_scores_loop.append((scores_test1[i] + scores_test2[i]) / 2)
"""

vectorized_code = """
average_scores_vectorized = (scores_test1 + scores_test2) / 2
"""

loop_time = timeit.timeit(stmt=loop_code, setup=setup, number=10)
vectorized_time = timeit.timeit(stmt=vectorized_code, setup=setup, number=10)

print(f"Loop time: {loop_time:.6f} seconds")
print(f"Vectorized time: {vectorized_time:.6f} seconds")

 

As seen, vectorized operations with NumPy are much faster than the loop version (exact timings will vary by machine):

Output >>>
Loop time: 4.212010 seconds
Vectorized time: 0.047994 seconds

 

Wrapping Up

 

That’s all for this tutorial!

We reviewed the following tips—using list comprehensions over loops, leveraging generators for efficient processing, caching expensive function calls, managing resources with context managers, and vectorizing operations with NumPy—that can help optimize your code’s performance.

If you’re looking for tips specific to data science projects, read 5 Python Best Practices for Data Science.

 

 

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.




