This is an excerpt from the Python Data Science Handbook by Jake VanderPlas; Jupyter notebooks are available on GitHub.

The text is released under the CC-BY-NC-ND license, and code is released under the MIT license. If you find this content useful, please consider supporting the work by buying the book!

Vectorized String Operations

Introducing Pandas String Operations

In [1]:
import numpy as np
x = np.array([2, 3, 5, 7, 11, 13])
x * 2
Out[1]:
array([ 4,  6, 10, 14, 22, 26])
In [2]:
data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]
Out[2]:
['Peter', 'Paul', 'Mary', 'Guido']
In [3]:
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
[s.capitalize() for s in data]
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-fc1d891ab539> in <module>()
      1 data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
----> 2 [s.capitalize() for s in data]

<ipython-input-3-fc1d891ab539> in <listcomp>(.0)
      1 data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
----> 2 [s.capitalize() for s in data]

AttributeError: 'NoneType' object has no attribute 'capitalize'
In [4]:
import pandas as pd
names = pd.Series(data)
names
Out[4]:
0    peter
1     Paul
2     None
3     MARY
4    gUIDO
dtype: object
In [5]:
names.str.capitalize()
Out[5]:
0    Peter
1     Paul
2     None
3     Mary
4    Guido
dtype: object
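
The contrast between In [3] and In [5] can be sketched side by side. This is a minimal illustration, assuming the same `data` list as above: the plain list comprehension needs an explicit guard for the missing entry, while the `str` accessor passes missing values through. The name `guarded` is introduced here only for illustration.

```python
import pandas as pd

data = ['peter', 'Paul', None, 'MARY', 'gUIDO']

# Plain Python: an explicit guard is needed to skip the missing entry
guarded = [s.capitalize() if s is not None else None for s in data]

# Pandas: the .str accessor applies the method element-wise
# and leaves the missing entry as a missing value
names = pd.Series(data)
capitalized = names.str.capitalize()

print(guarded)
print(capitalized)
```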

Tables of Pandas String Methods

In [6]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

Methods Similar to Python String Methods

Methods

len() | lower() | translate() | islower()
ljust() | upper() | startswith() | isupper()
rjust() | find() | endswith() | isnumeric()
center() | rfind() | isalnum() | isdecimal()
zfill() | index() | isalpha() | split()
strip() | rindex() | isdigit() | rsplit()
rstrip() | capitalize() | isspace() | partition()
lstrip() | swapcase() | istitle() | rpartition()

In [7]:
monte.str.lower()
Out[7]:
0    graham chapman
1       john cleese
2     terry gilliam
3         eric idle
4       terry jones
5     michael palin
dtype: object
In [8]:
monte.str.len()
Out[8]:
0    14
1    11
2    13
3     9
4    11
5    13
dtype: int64
In [9]:
monte.str.startswith('T')
Out[9]:
0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool
In [10]:
monte.str.split()
Out[10]:
0    [Graham, Chapman]
1       [John, Cleese]
2     [Terry, Gilliam]
3         [Eric, Idle]
4       [Terry, Jones]
5     [Michael, Palin]
dtype: object
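
A few more methods from the table above can be tried the same way. The following is a small sketch, assuming the `monte` Series defined in In [6]; the variable names `upper`, `space_pos`, and `parts` are chosen here for illustration only.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# upper() returns a Series of strings
upper = monte.str.upper()

# find() returns the position of the first match in each element (-1 if absent)
space_pos = monte.str.find(' ')

# partition() splits each element at the first separator;
# by default the three parts are expanded into DataFrame columns 0, 1, 2
parts = monte.str.partition(' ')

print(upper)
print(space_pos)
print(parts)
```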

Methods Using Regular Expressions

Method | Description
match() | Call re.match() on each element, returning a boolean
extract() | Call re.search() on each element, returning the groups of the first match as strings
findall() | Call re.findall() on each element
replace() | Replace occurrences of pattern with some other string
contains() | Call re.search() on each element, returning a boolean
count() | Count occurrences of pattern
split() | Equivalent to str.split(), but accepts regexps
rsplit() | Equivalent to str.rsplit(), but accepts regexps
In [11]:
monte.str.extract('([A-Za-z]+)', expand=False)
Out[11]:
0     Graham
1       John
2      Terry
3       Eric
4      Terry
5    Michael
dtype: object
In [12]:
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')
Out[12]:
0    [Graham Chapman]
1                  []
2     [Terry Gilliam]
3                  []
4       [Terry Jones]
5     [Michael Palin]
dtype: object
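
The remaining regular-expression methods from the table follow the same pattern. This is a brief sketch, assuming the `monte` Series from In [6] and a pandas version whose `str.replace` accepts the `regex` keyword; the patterns are illustrative choices, not taken from the original text.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# contains(): boolean mask of names starting with 'T'
starts_with_t = monte.str.contains(r'^T')

# count(): number of lowercase vowels in each name
vowel_counts = monte.str.count(r'[aeiou]')

# replace(): swap first and last name using capture groups
swapped = monte.str.replace(r'^(\w+) (\w+)$', r'\2, \1', regex=True)

print(starts_with_t)
print(vowel_counts)
print(swapped)
```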

Miscellaneous Methods

Method | Description
get() | Index each element
slice() | Slice each element
slice_replace() | Replace slice in each element with passed value
cat() | Concatenate strings
repeat() | Repeat values
normalize() | Return Unicode form of string
pad() | Add whitespace to left, right, or both sides of strings
wrap() | Split long strings into lines with length less than a given width
join() | Join strings in each element of the Series with passed separator
get_dummies() | Extract dummy variables as a DataFrame
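
Several of the methods in this table are not exercised elsewhere in this section. The sketch below tries `cat()`, `pad()`, `join()`, and `repeat()` on a shortened, three-name version of `monte`; the separator, width, and fill character are arbitrary choices made for illustration.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam'])

# cat(): with no other Series given, concatenates all elements into one string
joined = monte.str.cat(sep='; ')

# pad(): pad each string on the left up to a width of 16 characters
padded = monte.str.pad(16, side='left', fillchar='*')

# join(): each element is a list after split(); join its items with '_'
underscored = monte.str.split().str.join('_')

# repeat(): repeat the first two characters of each name twice
repeated = monte.str[:2].str.repeat(2)

print(joined)
print(padded)
print(underscored)
print(repeated)
```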

Vectorized Item Access and Slicing

In [13]:
monte.str[0:3]
Out[13]:
0    Gra
1    Joh
2    Ter
3    Eri
4    Ter
5    Mic
dtype: object
In [14]:
monte.str.split().str.get(-1)
Out[14]:
0    Chapman
1     Cleese
2    Gilliam
3       Idle
4      Jones
5      Palin
dtype: object
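
The indexing syntax used in In [13] and In [14] has method equivalents, `slice()` and `get()`. A short sketch, assuming the `monte` Series from In [6], checks that both spellings give the same result.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Slicing: str[0:3] and str.slice(0, 3) select the same characters
first3_index = monte.str[0:3]
first3_method = monte.str.slice(0, 3)

# Item access: str[-1] and str.get(-1) both pick the last list item
last_index = monte.str.split().str[-1]
last_method = monte.str.split().str.get(-1)

print(first3_index.equals(first3_method))
print(last_index.equals(last_method))
```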

Indicator Variables

In [15]:
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C',
                                    'B|D', 'B|C', 'B|C|D']})
full_monte
Out[15]:
    info            name
0  B|C|D  Graham Chapman
1    B|D     John Cleese
2    A|C   Terry Gilliam
3    B|D       Eric Idle
4    B|C     Terry Jones
5  B|C|D   Michael Palin
In [16]:
full_monte['info'].str.get_dummies('|')
Out[16]:
   A  B  C  D
0  0  1  1  1
1  0  1  0  1
2  1  0  1  0
3  0  1  0  1
4  0  1  1  0
5  0  1  1  1
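
Once the codes are split into indicator columns, they can be used as a filter. The sketch below, which rebuilds `full_monte` from In [15], selects the names whose `info` field contains the code `C`. The cast to `bool` is a defensive assumption so that the mask works whatever dtype `get_dummies()` returns.

```python
import pandas as pd

full_monte = pd.DataFrame({
    'name': ['Graham Chapman', 'John Cleese', 'Terry Gilliam',
             'Eric Idle', 'Terry Jones', 'Michael Palin'],
    'info': ['B|C|D', 'B|D', 'A|C', 'B|D', 'B|C', 'B|C|D'],
})

# One indicator column per code found in the '|'-separated field
dummies = full_monte['info'].str.get_dummies('|')

# Use the 'C' indicator as a boolean mask over the original rows
has_c = full_monte.loc[dummies['C'].astype(bool), 'name']

print(has_c)
```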

Example: Recipe Database

In [17]:
# !curl -O http://openrecipes.s3.amazonaws.com/recipeitems-latest.json.gz
# !gunzip recipeitems-latest.json.gz
In [18]:
try:
    recipes = pd.read_json('recipeitems-latest.json')
except ValueError as e:
    print("ValueError:", e)
ValueError: Trailing data
In [19]:
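from io import StringIO
with open('recipeitems-latest.json') as f:
    line = f.readline()
# wrap the line in StringIO: recent pandas versions expect a file-like object
pd.read_json(StringIO(line)).shape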
Out[19]:
(2, 12)
In [20]:
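from io import StringIO
# read the entire file and rebuild it as a single JSON array
with open('recipeitems-latest.json', 'r') as f:
    # strip the whitespace around each line (one JSON object per line)
    data = (line.strip() for line in f)
    # join the objects with commas and wrap them in brackets to form a JSON list
    data_json = "[{0}]".format(','.join(data))
# parse the result; StringIO is used because recent pandas versions expect a file-like object
recipes = pd.read_json(StringIO(data_json))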
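
An alternative to rebuilding the JSON string by hand is the `lines=True` option of `pd.read_json`, which reads one JSON object per line. The sketch below uses a hypothetical two-record sample in place of the real file; the record contents and the names `sample` and `recipes_small` are assumptions made for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample in the same layout as the file: one JSON object per line
sample = (
    '{"name": "Recipe A", "cookTime": "PT30M"}\n'
    '{"name": "Recipe B", "cookTime": "PT20M"}\n'
)

# Without lines=True the parser stops after the first object and raises ValueError
error = None
try:
    pd.read_json(StringIO(sample))
except ValueError as e:
    error = e

# With lines=True each line is parsed as its own record
recipes_small = pd.read_json(StringIO(sample), lines=True)

print(recipes_small.shape)
```

For the real file, the same call would take the filename in place of the `StringIO` object.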
In [21]:
recipes.shape
Out[21]:
(173278, 17)
In [22]:
recipes.iloc[0]
Out[22]:
_id                                {'$oid': '5160756b96cc62079cc2db15'}
cookTime                                                          PT30M
creator                                                             NaN
dateModified                                                        NaN
datePublished                                                2013-03-11
description           Late Saturday afternoon, after Marlboro Man ha...
image                 http://static.thepioneerwoman.com/cooking/file...
ingredients           Biscuits\n3 cups All-purpose Flour\n2 Tablespo...
name                                    Drop Biscuits and Sausage Gravy
prepTime                                                          PT10M
recipeCategory                                                      NaN
recipeInstructions                                                  NaN
recipeYield                                                          12
source                                                  thepioneerwoman
totalTime                                                           NaN
ts                                             {'$date': 1365276011104}
url                   http://thepioneerwoman.com/cooking/2013/03/dro...
Name: 0, dtype: object
In [23]:
recipes.ingredients.str.len().describe()
Out[23]:
count    173278.000000
mean        244.617926
std         146.705285
min           0.000000
25%         147.000000
50%         221.000000
75%         314.000000
max        9067.000000
Name: ingredients, dtype: float64
In [24]:
recipes.name[recipes.ingredients.str.len().idxmax()]
Out[24]:
'Carrot Pineapple Spice &amp; Brownie Layer Cake with Whipped Cream &amp; Cream Cheese Frosting and Marzipan Carrots'
In [25]:
recipes.description.str.contains('[Bb]reakfast').sum()
Out[25]:
3524
In [26]:
recipes.ingredients.str.contains('[Cc]innamon').sum()
Out[26]:
10526
In [27]:
recipes.ingredients.str.contains('[Cc]inamon').sum()
Out[27]:
11
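
The two searches in In [26] and In [27] can be combined into one pattern that accepts both spellings. This is a small sketch on hypothetical ingredient strings, not on the recipe data itself; `n?` makes the second "n" optional, and `na=False` treats missing entries as non-matches.

```python
import pandas as pd

# Hypothetical ingredient strings, including a misspelling and a missing value
ingredients = pd.Series(['1 tsp cinnamon', '2 sticks Cinamon', '1 cup flour', None])

# 'inn?' matches both "inn" and "in", so either spelling is caught
mask = ingredients.str.contains(r'[Cc]inn?amon', na=False)

print(mask.sum())
```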

A Simple Recipe Recommender

In [28]:
spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',
              'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']
In [29]:
import re
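# Note: passed positionally, re.IGNORECASE fills the `case` parameter, so this
# match stays case-sensitive; use flags=re.IGNORECASE to ignore case
spice_df = pd.DataFrame(dict((spice, recipes.ingredients.str.contains(spice, re.IGNORECASE))
                             for spice in spice_list))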
spice_df.head()
Out[29]:
   cumin  oregano  paprika  parsley  pepper  rosemary   sage   salt  tarragon  thyme
0  False    False    False    False   False     False   True  False     False  False
1  False    False    False    False   False     False  False  False     False  False
2   True    False    False    False    True     False  False   True     False  False
3  False    False    False    False   False     False  False  False     False  False
4  False    False    False    False   False     False  False  False     False  False
In [30]:
selection = spice_df.query('parsley & paprika & tarragon')
len(selection)
Out[30]:
10
In [31]:
recipes.name[selection.index]
Out[31]:
2069      All cremat with a Little Gem, dandelion and wa...
74964                         Lobster with Thermidor butter
93768      Burton's Southern Fried Chicken with White Gravy
113926                     Mijo's Slow Cooker Shredded Beef
137686                     Asparagus Soup with Poached Eggs
140530                                 Fried Oyster Po’boys
158475                Lamb shank tagine with herb tabbouleh
158486                 Southern fried chicken in buttermilk
163175            Fried Chicken Sliders with Pickles + Slaw
165243                        Bar Tartine Cauliflower Salad
Name: name, dtype: object
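
The whole recommender pipeline can be tried end to end on a toy table. The sketch below uses three hypothetical recipes (the names and ingredient strings are invented for illustration) and passes `flags=re.IGNORECASE` by keyword so that the match is case-insensitive.

```python
import re

import pandas as pd

# Hypothetical recipes standing in for the full dataset
recipes_small = pd.DataFrame({
    'name': ['Herb Chicken', 'Plain Rice', 'Spiced Stew'],
    'ingredients': ['Parsley, Paprika, tarragon, chicken',
                    'rice, water, salt',
                    'paprika, cumin, beef'],
})

spice_list = ['parsley', 'paprika', 'tarragon']

# One boolean column per spice; flags is given by keyword to ignore case
spice_df = pd.DataFrame({
    spice: recipes_small.ingredients.str.contains(spice, flags=re.IGNORECASE)
    for spice in spice_list
})

# Keep only the rows where all three spices are present
selection = spice_df.query('parsley & paprika & tarragon')
matches = recipes_small.name[selection.index]

print(matches)
```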