This is an excerpt from the Python Data Science Handbook by Jake VanderPlas; Jupyter notebooks are available on GitHub.

The text is released under the CC-BY-NC-ND license, and code is released under the MIT license. If you find this content useful, please consider supporting the work by buying the book!

Agregaciones: Min, Max, etc.

Sumando los Valores de un Array

In [1]:
import numpy as np
In [2]:
L = np.random.random(100)
sum(L)
Out[2]:
55.61209116604941
In [3]:
np.sum(L)
Out[3]:
55.612091166049424
In [4]:
big_array = np.random.rand(1000000)
%timeit sum(big_array)
%timeit np.sum(big_array)
10 loops, best of 3: 104 ms per loop
1000 loops, best of 3: 442 µs per loop

Minimo y Maximo

In [5]:
min(big_array), max(big_array)
Out[5]:
(1.1717128136634614e-06, 0.9999976784968716)
In [6]:
np.min(big_array), np.max(big_array)
Out[6]:
(1.1717128136634614e-06, 0.9999976784968716)
In [7]:
%timeit min(big_array)
%timeit np.min(big_array)
10 loops, best of 3: 82.3 ms per loop
1000 loops, best of 3: 497 µs per loop
In [8]:
print(big_array.min(), big_array.max(), big_array.sum())
1.17171281366e-06 0.999997678497 499911.628197

Agregaciones multi dimensionales

In [9]:
M = np.random.random((3, 4))
print(M)
[[ 0.8967576   0.03783739  0.75952519  0.06682827]
 [ 0.8354065   0.99196818  0.19544769  0.43447084]
 [ 0.66859307  0.15038721  0.37911423  0.6687194 ]]
In [10]:
M.sum()
Out[10]:
6.0850555667307118
In [11]:
M.min(axis=0)
Out[11]:
array([ 0.66859307,  0.03783739,  0.19544769,  0.06682827])
In [12]:
M.max(axis=1)
Out[12]:
array([ 0.8967576 ,  0.99196818,  0.6687194 ])

Otras funciones de agregación

The following table provides a list of useful aggregation functions available in NumPy:

Function Name NaN-safe Version Description
np.sum np.nansum Compute sum of elements
np.prod np.nanprod Compute product of elements
np.mean np.nanmean Compute mean of elements
np.std np.nanstd Compute standard deviation
np.var np.nanvar Compute variance
np.min np.nanmin Find minimum value
np.max np.nanmax Find maximum value
np.argmin np.nanargmin Find index of minimum value
np.argmax np.nanargmax Find index of maximum value
np.median np.nanmedian Compute median of elements
np.percentile np.nanpercentile Compute rank-based statistics of elements
np.any N/A Evaluate whether any elements are true
np.all N/A Evaluate whether all elements are true

We will see these aggregates often throughout the rest of the book.

Ejemplo: Altura promedio de los presidentes

In [1]:
!head -4 data/president_heights.csv
order,name,height(cm)
1,George Washington,189
2,John Adams,170
3,Thomas Jefferson,189
In [14]:
import pandas as pd
data = pd.read_csv('data/president_heights.csv')
heights = np.array(data['height(cm)'])
print(heights)
[189 170 189 163 183 171 185 168 173 183 173 173 175 178 183 193 178 173
 174 183 183 168 170 178 182 180 183 178 182 188 175 179 183 193 182 183
 177 185 188 188 182 185]
In [15]:
print("Mean height:       ", heights.mean())
print("Standard deviation:", heights.std())
print("Minimum height:    ", heights.min())
print("Maximum height:    ", heights.max())
Mean height:        179.738095238
Standard deviation: 6.93184344275
Minimum height:     163
Maximum height:     193
In [16]:
print("25th percentile:   ", np.percentile(heights, 25))
print("Median:            ", np.median(heights))
print("75th percentile:   ", np.percentile(heights, 75))
25th percentile:    174.25
Median:             182.0
75th percentile:    183.0
In [17]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set()  # set plot style
In [18]:
plt.hist(heights)
plt.title('Height Distribution of US Presidents')
plt.xlabel('height (cm)')
plt.ylabel('number');