Random data consists of values generated without a predictable pattern. Although each individual value is unpredictable, how often values occur depends on the probability distribution from which they are drawn.
There are many benefits to using random data in our experiments, including simulating real-world data, producing synthetic data for machine learning training, and statistical sampling.
NumPy is a powerful package that supports many mathematical and statistical computations, including random data generation. From single values to multi-dimensional arrays and matrices, NumPy can cover most random data generation needs.
This article discusses how to generate random data with NumPy. So, let’s get into it.
Random Data Generation with NumPy
You need to have the NumPy package installed in your environment. If you haven’t done that yet, you can install it with pip (for example, pip install numpy).
When the package has been successfully installed, we will move on to the main part of the article.
First, we set a seed for reproducibility. Random numbers produced by a computer are pseudo-random: they look random, but the sequence is fully deterministic once we know its starting point, which is called the seed.
To set the seed in NumPy, we will use the following code:
import numpy as np
np.random.seed(101)
You can pass any non-negative integer (up to 2**32 - 1) as the seed, which becomes our starting point. The np.random module is the main tool we use throughout this article.
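To make the role of the seed concrete, here is a minimal sketch (the variable names `first` and `second` are illustrative only). Resetting the same seed replays exactly the same sequence of pseudo-random numbers:

```python
import numpy as np

np.random.seed(101)
first = np.random.rand(3)   # three draws after seeding

np.random.seed(101)         # reset to the same starting point
second = np.random.rand(3)  # the same three draws again

print(np.array_equal(first, second))  # True

# Reset once more so the draws that follow start from seed 101 again
np.random.seed(101)
```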
With the seed set, we can start generating random numbers with NumPy. Let’s generate five random floats.
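A minimal sketch of this step, assuming the legacy np.random.rand function (np.random.random would serve equally well). It draws five uniform floats from the interval [0, 1):

```python
import numpy as np

np.random.seed(101)         # same seed as set earlier
floats = np.random.rand(5)  # five uniform floats in [0, 1)
print(floats)
```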
Output>>
array([0.51639863, 0.57066759, 0.02847423, 0.17152166, 0.68527698])
NumPy can also produce multi-dimensional arrays. For example, the following code returns a 3×3 array filled with random floats.
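A minimal sketch of this step, again assuming np.random.rand, which takes the array dimensions as separate arguments (the name `matrix` is illustrative only):

```python
import numpy as np

matrix = np.random.rand(3, 3)  # 3x3 array of uniform floats in [0, 1)
print(matrix)
```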
Output>>
array([[0.26618856, 0.77888791, 0.89206388],
       [0.0756819 , 0.82565261, 0.02549692],
       [0.5902313 , 0.5342532 , 0.58125755]])
Next, we can generate random integers from a given range. In the code below, the lower bound (1) is inclusive and the upper bound (1000) is exclusive:
np.random.randint(1, 1000, size=5)
Output>>
array([974, 553, 645, 576, 937])
All the data generated so far follows the uniform distribution, which means every value in the range has an equal chance of occurring. If we repeated the generation process a very large number of times, the frequency of each value would be close to equal.
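The equal-frequency claim can be checked empirically. The sketch below is an illustration under assumptions: it simulates rolls of a six-sided die (a hypothetical example, not from the article) and uses a local np.random.RandomState with an arbitrary seed so the globally seeded stream is left untouched.

```python
import numpy as np

# Local generator; seed 0 is arbitrary and does not affect np.random.seed(101)
rng = np.random.RandomState(0)

# 60,000 die rolls: integers 1..6 (the upper bound 7 is exclusive)
draws = rng.randint(1, 7, size=60_000)

# Count how often each face appears and convert to relative frequencies
values, counts = np.unique(draws, return_counts=True)
frequencies = counts / draws.size

for value, frequency in zip(values, frequencies):
    print(value, round(frequency, 3))  # each frequency should be near 1/6
```

With more draws, the frequencies settle closer to 1/6 for every face.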
We can also generate random data from other distributions. Here, we draw ten values from the standard normal distribution.
np.random.normal(0, 1, 10)
Output>>
array([-1.31984116,  1.73778011,  0.25983863, -0.317497  ,  0.0185246 ,
       -0.42062671,  1.02851771, -0.7226102 , -1.17349046,  1.05557983])
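The code above draws ten values from the normal distribution with mean zero and standard deviation one (the first two parameters). Values from this standard normal distribution can be read directly as Z-scores.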
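As a rough illustration of the Z-score reading, the sketch below draws a large standard normal sample and rescales it to a hypothetical variable with mean 170 and standard deviation 10 (these numbers are assumptions chosen for the example). It uses a local np.random.RandomState so the global stream is not disturbed.

```python
import numpy as np

rng = np.random.RandomState(0)  # local generator; seed 0 is arbitrary

# Large sample from the standard normal distribution (mean 0, std 1)
z = rng.normal(0, 1, 100_000)

# Rescale the Z-scores: value = mean + std * z (hypothetical mean 170, std 10)
heights = 170 + 10 * z

print(z.mean(), z.std())              # close to 0 and 1
print(heights.mean(), heights.std())  # close to 170 and 10
```

The same rescaled sample could be drawn directly by passing the mean and standard deviation as the first two arguments of np.random.normal.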
We can generate random data following other distributions. Here is how we use the Poisson distribution to generate random data.
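A minimal sketch of the call this step describes, assuming the average rate of 5 mentioned in the text and ten draws to match the length of the output (the name `counts` is illustrative only):

```python
import numpy as np

# lam is the average number of events per interval; size is the number of draws
counts = np.random.poisson(lam=5, size=10)
print(counts)
```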
Output>>
array([10, 6, 3, 3, 8, 3, 6, 8, 3, 3])
The samples drawn from the Poisson distribution simulate the number of events that occur in a fixed interval at a specific average rate (5). The individual counts vary around that average.
We can also generate random data following the binomial distribution.
np.random.binomial(10, 0.5, 10)
Output>>
array([5, 7, 5, 4, 5, 6, 5, 7, 4, 7])
The code above simulates repeated experiments that follow the binomial distribution. Imagine flipping a coin ten times (the first parameter) with a 0.5 probability of heads (the second parameter) and counting how many times it shows heads. We repeat that experiment ten times (the third parameter), so the output holds ten head counts.
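The coin-flip reading can be spelled out by simulating the individual flips. This is a sketch under assumptions: it encodes heads as 1 and tails as 0, and uses a local np.random.RandomState with an arbitrary seed rather than the global stream.

```python
import numpy as np

rng = np.random.RandomState(0)  # local generator; seed 0 is arbitrary

# 10 experiments (rows) of 10 fair coin flips each (columns); 1 = heads
flips = rng.randint(0, 2, size=(10, 10))
heads_per_experiment = flips.sum(axis=1)
print(heads_per_experiment)  # ten head counts, each between 0 and 10

# Over many experiments, the average head count approaches n * p = 10 * 0.5 = 5
many_counts = rng.binomial(10, 0.5, 100_000)
print(many_counts.mean())
```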
Let’s try the exponential distribution next. The following code draws ten values from it.
np.random.exponential(1, 10)
Output>>
array([0.7916478 , 0.59574388, 0.1622387 , 0.99915554, 0.10660882,
       0.3713874 , 0.3766358 , 1.53743068, 1.82033544, 1.20722031])
The exponential distribution models the time between events. For example, the code above can be read as the waiting time for a bus to enter the station: each wait takes a random amount of time, but the average is 1 minute (the scale parameter).
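To check the waiting-time interpretation, the sketch below draws a large sample with a local np.random.RandomState (arbitrary seed, so the global stream is untouched). The 2-minute threshold is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)  # local generator; seed 0 is arbitrary

# 100,000 simulated waiting times with an average of 1 minute
waits = rng.exponential(scale=1, size=100_000)

print(waits.mean())        # close to the scale parameter, 1
print((waits > 2).mean())  # share of waits longer than 2 minutes
print(np.exp(-2))          # theoretical share for scale=1: exp(-2)
```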
For more advanced generation, you can combine results from several distributions to create sample data that follows a custom distribution. For example, 70% of the random data generated below follows a normal distribution, while the rest follows an exponential distribution.
def combined_distribution(size=10):
    # Split the sample size 70/30; the remainder goes to the exponential part
    # so the two counts always add up to size
    n_normal = int(0.7 * size)
    n_exponential = size - n_normal
    # Normal distribution
    normal_samples = np.random.normal(loc=0, scale=1, size=n_normal)
    # Exponential distribution
    exponential_samples = np.random.exponential(scale=1, size=n_exponential)
    # Combine the samples
    combined_samples = np.concatenate([normal_samples, exponential_samples])
    # Shuffle the samples in place
    np.random.shuffle(combined_samples)
    return combined_samples

samples = combined_distribution()
samples
Output>>
array([-1.42085224, -0.04597935, -1.22524869,  0.22023681,  1.13025524,
        0.74561453,  1.35293768,  1.20491792, -0.7179921 , -0.16645063])
Custom distributions like this are much more flexible, especially when we want simulated data to resemble real-world data (which is usually messier).
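A related variant, sketched here as an alternative and not as the approach used above, decides the source distribution for each sample at random instead of fixing the split at exactly 70/30. The function name `mixture_samples` and its parameters are hypothetical, and a local np.random.RandomState keeps the global stream untouched.

```python
import numpy as np

def mixture_samples(size, weight_normal=0.7, seed=0):
    """Draw each sample from N(0, 1) with probability weight_normal,
    otherwise from an exponential distribution with scale 1."""
    rng = np.random.RandomState(seed)  # local generator; seed is arbitrary
    # Boolean mask: True where the sample should come from the normal part
    use_normal = rng.rand(size) < weight_normal
    normal_part = rng.normal(loc=0, scale=1, size=size)
    exponential_part = rng.exponential(scale=1, size=size)
    # Pick element-wise from one array or the other
    return np.where(use_normal, normal_part, exponential_part), use_normal

samples, use_normal = mixture_samples(10_000)
print(samples.shape, use_normal.mean())  # the share of normal draws is near 0.7
```

Here the proportion of normal samples itself varies from run to run, which may be closer to how mixed populations arise in real data.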
Conclusion
NumPy is a powerful Python package for mathematical and statistical computation. It can generate random data for many purposes, such as data simulation, synthetic data for machine learning, and many others.
In this article, we have discussed how to generate random data with NumPy, including seeding for reproducibility, sampling from several distributions, and combining them into a custom distribution.
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.