This is an excerpt from the Python Data Science Handbook by Jake VanderPlas; Jupyter notebooks are available on GitHub.

The text is released under the CC-BY-NC-ND license, and code is released under the MIT license. If you find this content useful, please consider supporting the work by buying the book!

Data Indexing and Selection

Data Selection in Series

Series as a dictionary

In [1]:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data
Out[1]:
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
In [2]:
data['b']
Out[2]:
0.5
In [3]:
'a' in data
Out[3]:
True
In [4]:
data.keys()
Out[4]:
Index(['a', 'b', 'c', 'd'], dtype='object')
In [5]:
list(data.items())
Out[5]:
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
In [6]:
data['e'] = 1.25
data
Out[6]:
a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64
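
The cells above treat a Series as a mapping from index labels to values. As a minimal sketch of that idea, the following rebuilds the same Series and checks that membership tests look at labels rather than values; the variable names `has_label`, `has_value`, and `as_dict` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same values and labels as the Series defined in the cells above.
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Membership tests consult the index labels, like keys of a dict.
has_label = 'a' in data      # 'a' is a label
has_value = 0.25 in data     # 0.25 is a value, not a label

# Assigning to a new label extends the Series, like adding a dict key.
data['e'] = 1.25

# Converting to a plain dict makes the label -> value mapping explicit.
as_dict = data.to_dict()
print(has_label, has_value)
print(as_dict)
```

To test whether a value is present, compare against `data.values` instead of the Series itself.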

Series as a one-dimensional array

In [7]:
# slicing by explicit index
data['a':'c']
Out[7]:
a    0.25
b    0.50
c    0.75
dtype: float64
In [8]:
# slicing by implicit integer index
data[0:2]
Out[8]:
a    0.25
b    0.50
dtype: float64
In [9]:
# masking
data[(data > 0.3) & (data < 0.8)]
Out[9]:
b    0.50
c    0.75
dtype: float64
In [10]:
# fancy indexing
data[['a', 'e']]
Out[10]:
a    0.25
e    1.25
dtype: float64
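
The outputs above suggest that a label slice includes its stop label while a positional slice excludes its stop position. The sketch below makes that contrast explicit on the same data; it writes the positional slice with `iloc` to avoid ambiguity, and the names `by_label`, `by_position`, `mask`, and `masked` are illustrative.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0, 1.25],
                 index=['a', 'b', 'c', 'd', 'e'])

# Label-based slice: the stop label 'c' is included.
by_label = data['a':'c']

# Position-based slice, written with iloc to make the intent explicit:
# the stop position 2 is excluded, as in a Python list or NumPy array.
by_position = data.iloc[0:2]

# A mask is itself a boolean Series aligned on the same labels.
mask = (data > 0.3) & (data < 0.8)
masked = data[mask]

print(list(by_label.index))     # ['a', 'b', 'c']
print(list(by_position.index))  # ['a', 'b']
print(list(masked.index))       # ['b', 'c']
```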

Indexers: loc, iloc, and ix

In [11]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
Out[11]:
1    a
3    b
5    c
dtype: object
In [12]:
# explicit index when indexing
data[1]
Out[12]:
'a'
In [13]:
# implicit index when slicing
data[1:3]
Out[13]:
3    b
5    c
dtype: object
In [14]:
data.loc[1]
Out[14]:
'a'
In [15]:
data.loc[1:3]
Out[15]:
1    a
3    b
dtype: object
In [16]:
data.iloc[1]
Out[16]:
'b'
In [17]:
data.iloc[1:3]
Out[17]:
3    b
5    c
dtype: object
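
With an integer index, plain brackets mix two conventions: indexing uses labels and slicing uses positions. A short side-by-side sketch, using the same Series as above, shows how `loc` and `iloc` remove that ambiguity; the variable names are illustrative.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc always refers to the explicit index labels.
label_item = data.loc[1]         # label 1
label_slice = data.loc[1:3]      # labels 1 through 3, stop label included

# iloc always refers to the implicit integer positions.
position_item = data.iloc[1]     # position 1 (second element)
position_slice = data.iloc[1:3]  # positions 1 and 2, stop position excluded

print(label_item, list(label_slice.index))        # a [1, 3]
print(position_item, list(position_slice.index))  # b [3, 5]
```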

Data Selection in DataFrame

DataFrame as a dictionary

In [18]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data
Out[18]:
              area       pop
California  423967  38332521
Florida     170312  19552860
Illinois    149995  12882135
New York    141297  19651127
Texas       695662  26448193
In [19]:
data['area']
Out[19]:
California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64
In [20]:
data.area
Out[20]:
California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64
In [21]:
data.area is data['area']
Out[21]:
True
In [22]:
data.pop is data['pop']
Out[22]:
False
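
The `False` result above comes from a name collision: `pop` is also a DataFrame method, so attribute access returns the method rather than the column. The sketch below illustrates this on a small two-row frame built from the figures in the text; `df`, `pop_is_method`, `pop_column`, and `same_area` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662],
                   'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])

# 'pop' collides with the DataFrame.pop method, so attribute access
# returns the bound method instead of the column.
pop_is_method = callable(df.pop)

# Dictionary-style access always returns the column.
pop_column = df['pop']

# 'area' has no such collision, so both spellings give the same data.
same_area = df.area.equals(df['area'])

print(pop_is_method, same_area)
```

Dictionary-style access is the safer default whenever a column name might match a method name or is not a valid Python identifier.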
In [23]:
data['density'] = data['pop'] / data['area']
data
Out[23]:
              area       pop     density
California  423967  38332521   90.413926
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763
New York    141297  19651127  139.076746
Texas       695662  26448193   38.018740

DataFrame as a two-dimensional array

In [24]:
data.values
Out[24]:
array([[  4.23967000e+05,   3.83325210e+07,   9.04139261e+01],
       [  1.70312000e+05,   1.95528600e+07,   1.14806121e+02],
       [  1.49995000e+05,   1.28821350e+07,   8.58837628e+01],
       [  1.41297000e+05,   1.96511270e+07,   1.39076746e+02],
       [  6.95662000e+05,   2.64481930e+07,   3.80187404e+01]])
In [25]:
data.T
Out[25]:
           California       Florida      Illinois      New York         Texas
area     4.239670e+05  1.703120e+05  1.499950e+05  1.412970e+05  6.956620e+05
pop      3.833252e+07  1.955286e+07  1.288214e+07  1.965113e+07  2.644819e+07
density  9.041393e+01  1.148061e+02  8.588376e+01  1.390767e+02  3.801874e+01
In [26]:
data.values[0]
Out[26]:
array([  4.23967000e+05,   3.83325210e+07,   9.04139261e+01])
In [27]:
data['area']
Out[27]:
California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64
In [28]:
data.iloc[:3, :2]
Out[28]:
              area       pop
California  423967  38332521
Florida     170312  19552860
Illinois    149995  12882135
In [29]:
data.loc[:'Illinois', :'pop']
Out[29]:
              area       pop
California  423967  38332521
Florida     170312  19552860
Illinois    149995  12882135
In [30]:
# .ix was removed in pandas 1.0; select rows by position, then columns by label
data.iloc[:3].loc[:, :'pop']
Out[30]:
              area       pop
California  423967  38332521
Florida     170312  19552860
Illinois    149995  12882135
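
The hybrid selection above (rows by position, columns by label) was historically written with the `ix` indexer, which was removed in pandas 1.0. A minimal sketch of two equivalent spellings for current pandas follows; the frame is rebuilt from the values in the text, and `df`, `stop`, `hybrid`, and `chained` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {'area': [423967, 170312, 149995, 141297, 695662],
     'pop': [38332521, 19552860, 12882135, 19651127, 26448193]},
    index=['California', 'Florida', 'Illinois', 'New York', 'Texas'])
df['density'] = df['pop'] / df['area']

# Option 1: convert the column label to a position, then use iloc for
# both axes. The +1 keeps the 'pop' column inside the slice.
stop = df.columns.get_loc('pop') + 1
hybrid = df.iloc[:3, :stop]

# Option 2: chain the indexers, rows by position and then columns by label.
chained = df.iloc[:3].loc[:, :'pop']

print(hybrid.equals(chained))
print(list(hybrid.columns), list(hybrid.index))
```

The single `iloc` call is usually preferable when assigning values, since chained indexing may operate on a copy.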
In [31]:
data.loc[data.density > 100, ['pop', 'density']]
Out[31]:
               pop     density
Florida   19552860  114.806121
New York  19651127  139.076746
In [32]:
data.iloc[0, 2] = 90
data
Out[32]:
              area       pop     density
California  423967  38332521   90.000000
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763
New York    141297  19651127  139.076746
Texas       695662  26448193   38.018740

Additional indexing conventions

In [33]:
data['Florida':'Illinois']
Out[33]:
            area       pop     density
Florida   170312  19552860  114.806121
Illinois  149995  12882135   85.883763
In [34]:
data[1:3]
Out[34]:
            area       pop     density
Florida   170312  19552860  114.806121
Illinois  149995  12882135   85.883763
In [35]:
data[data.density > 100]
Out[35]:
            area       pop     density
Florida   170312  19552860  114.806121
New York  141297  19651127  139.076746
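
Taken together, the cells in this section suggest a rule of thumb for plain brackets on a DataFrame: a single label refers to a column, while slices and boolean masks refer to rows. The sketch below checks that reading on a three-row frame built from values in the text; `df` and the result names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'area': [170312, 149995, 141297],
     'density': [114.806121, 85.883763, 139.076746]},
    index=['Florida', 'Illinois', 'New York'])

# A single label selects a column and returns a Series.
column = df['area']

# A label slice selects rows; the stop label is included.
rows_by_label = df['Florida':'Illinois']

# An integer slice selects rows by position; the stop is excluded.
rows_by_position = df[0:2]

# A boolean mask selects the rows where the mask is True.
rows_by_mask = df[df['density'] > 100]

print(type(column).__name__)         # Series
print(list(rows_by_label.index))     # ['Florida', 'Illinois']
print(list(rows_by_position.index))  # ['Florida', 'Illinois']
print(list(rows_by_mask.index))      # ['Florida', 'New York']
```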