This is an excerpt from the Python Data Science Handbook by Jake VanderPlas; Jupyter notebooks are available on GitHub.

The text is released under the CC-BY-NC-ND license, and code is released under the MIT license. If you find this content useful, please consider supporting the work by buying the book!

Missing Data

Missing Data in Pandas

None

In [1]:
import numpy as np
import pandas as pd
In [2]:
vals1 = np.array([1, None, 3, 4])
vals1
Out[2]:
array([1, None, 3, 4], dtype=object)
In [3]:
for dtype in ['object', 'int']:
    print("dtype =", dtype)
    %timeit np.arange(1E6, dtype=dtype).sum()
    print()
dtype = object
10 loops, best of 3: 78.2 ms per loop

dtype = int
100 loops, best of 3: 3.06 ms per loop

In [4]:
vals1.sum()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-749fd8ae6030> in <module>()
----> 1 vals1.sum()

/Users/jakevdp/anaconda/lib/python3.5/site-packages/numpy/core/_methods.py in _sum(a, axis, dtype, out, keepdims)
     30 
     31 def _sum(a, axis=None, dtype=None, out=None, keepdims=False):
---> 32     return umr_sum(a, axis, dtype, out, keepdims)
     33 
     34 def _prod(a, axis=None, dtype=None, out=None, keepdims=False):

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

NaN

In [5]:
vals2 = np.array([1, np.nan, 3, 4]) 
vals2.dtype
Out[5]:
dtype('float64')
In [6]:
1 + np.nan
Out[6]:
nan
In [7]:
0 * np.nan
Out[7]:
nan
In [8]:
vals2.sum(), vals2.min(), vals2.max()
Out[8]:
(nan, nan, nan)
In [9]:
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)
Out[9]:
(8.0, 1.0, 4.0)
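
The cells above show NaN propagating through arithmetic and aggregates. As a minimal sketch of why a plain equality test cannot locate NaN values, and how a boolean mask from np.isnan reproduces what the NaN-aware aggregates do, consider the following. The names vals and mask are chosen here for illustration only.

```python
import numpy as np

vals = np.array([1, np.nan, 3, 4])

# NaN is not equal to anything, including itself, so == cannot find it
print(np.nan == np.nan)        # False

# np.isnan builds a boolean mask marking the missing entries
mask = np.isnan(vals)
print(mask.tolist())           # [False, True, False, False]

# Summing only the valid entries gives the same result as np.nansum
print(vals[~mask].sum())       # 8.0
print(np.nansum(vals))         # 8.0

# Other NaN-aware aggregates follow the same pattern, e.g. the mean
valid_mean = np.nanmean(vals)
```

Note that np.isnan only works on numeric arrays; it is not a substitute for the Pandas null checks on object data.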

NaN and None in Pandas

In [10]:
pd.Series([1, np.nan, 2, None])
Out[10]:
0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64
In [11]:
x = pd.Series(range(2), dtype=int)
x
Out[11]:
0    0
1    1
dtype: int64
In [12]:
x[0] = None
x
Out[12]:
0    NaN
1    1.0
dtype: float64
Typeclass   Conversion When Storing NAs   NA Sentinel Value
floating    No change                     np.nan
object      No change                     None or np.nan
integer     Cast to float64               np.nan
boolean     Cast to object                None or np.nan

Keep in mind that Pandas stores string data with an object dtype by default in the version used here; later releases also provide a dedicated string dtype.
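
The upcasting rules in the table can be checked directly. The following is a small sketch using default dtype inference at construction time; the variable names ints, bools, and floats are illustrative, and the dtypes noted in the comments are those expected with the default (non-nullable) dtypes.

```python
import numpy as np
import pandas as pd

# Integer data with a missing value is upcast to float64 (sentinel: NaN)
ints = pd.Series([1, 2, None])
print(ints.dtype)      # float64

# Boolean data with a missing value falls back to object
bools = pd.Series([True, False, None])
print(bools.dtype)     # object

# Floating-point data keeps its dtype; None is stored as NaN
floats = pd.Series([1.5, None])
print(floats.dtype)    # float64
```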

Operating on Null Values

  • isnull(): Generate a boolean mask indicating missing values
  • notnull(): Opposite of isnull()
  • dropna(): Return a filtered version of the data
  • fillna(): Return a copy of the data with missing values filled or imputed
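
As a quick side-by-side sketch of the four methods listed above, applied to one small Series (the name s is illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

print(s.isnull().tolist())     # [False, True, False]
print(s.notnull().tolist())    # [True, False, True]
print(s.dropna().tolist())     # [1.0, 3.0]
print(s.fillna(0).tolist())    # [1.0, 0.0, 3.0]

# Each call returns a new object; s itself still holds its NaN
print(s.isnull().sum())        # 1
```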

Detecting null values

In [13]:
data = pd.Series([1, np.nan, 'hello', None])
In [14]:
data.isnull()
Out[14]:
0    False
1     True
2    False
3     True
dtype: bool
In [15]:
data[data.notnull()]
Out[15]:
0        1
2    hello
dtype: object
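
The same masks work on a DataFrame, where they are often combined with a sum to count missing entries. This is a minimal sketch on a made-up frame; frame, missing_per_column, and complete_rows are hypothetical names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1, np.nan, 3], "b": [np.nan, np.nan, 6]})

# isnull() returns a boolean frame; summing counts the True values per column
missing_per_column = frame.isnull().sum()
print(missing_per_column["a"], missing_per_column["b"])   # 1 2

# Keep only the rows in which every column is non-null
complete_rows = frame[frame.notnull().all(axis=1)]
print(complete_rows.index.tolist())                       # [2]
```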

Dropping null values

In [16]:
data.dropna()
Out[16]:
0        1
2    hello
dtype: object
In [17]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df
Out[17]:
     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6
In [18]:
df.dropna()
Out[18]:
     0    1  2
1  2.0  3.0  5
In [19]:
df.dropna(axis='columns')
Out[19]:
   2
0  2
1  5
2  6
In [20]:
df[3] = np.nan
df
Out[20]:
     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN
In [21]:
df.dropna(axis='columns', how='all')
Out[21]:
     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6
In [22]:
df.dropna(axis='rows', thresh=3)
Out[22]:
     0    1  2   3
1  2.0  3.0  5 NaN
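
To make the thresh argument concrete: it is the minimum number of non-null values a row (or column) must contain to be kept. The sketch below rebuilds the same frame, counts the non-null values per row, and also shows the subset argument of dropna, which does not appear in the cells above; non_null_per_row is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df[3] = np.nan

# Number of non-null values in each row
non_null_per_row = df.notnull().sum(axis=1)
print(non_null_per_row.tolist())                # [2, 3, 2]

# Only row 1 has at least 3 non-null values
print(df.dropna(thresh=3).index.tolist())       # [1]

# Every row has at least 2, so nothing is dropped
print(df.dropna(thresh=2).index.tolist())       # [0, 1, 2]

# subset restricts the null check to the listed columns
print(df.dropna(subset=[0]).index.tolist())     # [0, 1]
```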

Filling null values

In [23]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data
Out[23]:
a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64
In [24]:
data.fillna(0)
Out[24]:
a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64
In [25]:
# forward-fill
data.fillna(method='ffill')
Out[25]:
a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64
In [26]:
# back-fill
data.fillna(method='bfill')
Out[26]:
a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64
In [27]:
df
Out[27]:
     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN
In [28]:
df.fillna(method='ffill', axis=1)
Out[28]:
     0    1    2    3
0  1.0  1.0  2.0  2.0
1  2.0  3.0  5.0  5.0
2  NaN  4.0  6.0  6.0
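
Pandas also exposes forward and backward filling as dedicated ffill() and bfill() methods, and recent releases deprecate passing method= to fillna in their favor. The following sketch assumes a pandas version that provides these methods and rebuilds the same Series; filling with the mean is an additional illustration, not something taken from the cells above.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

# Forward-fill: propagate the previous valid value
print(data.ffill().tolist())               # [1.0, 1.0, 2.0, 2.0, 3.0]

# Back-fill: propagate the next valid value
print(data.bfill().tolist())               # [1.0, 2.0, 2.0, 3.0, 3.0]

# Fill with a statistic of the valid values, here the mean (2.0)
print(data.fillna(data.mean()).tolist())   # [1.0, 2.0, 2.0, 2.0, 3.0]
```

As in the last output above, a forward fill leaves a value as NaN when no earlier valid value exists along the chosen axis.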