< Agregaciones y agrupaciones | Contenido | Series de tiempo >

Introducción a las operaciones sobre texto en Pandas¶

In [1]:

import numpy as np
x = np.array([2, 3, 5, 7, 11, 13])
x * 2

Out[1]:

array([ 4,  6, 10, 14, 22, 26])

In [2]:

data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]

Out[2]:

['Peter', 'Paul', 'Mary', 'Guido']

In [3]:

data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
[s.capitalize() for s in data]

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-fc1d891ab539> in <module>()
      1 data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
----> 2 [s.capitalize() for s in data]

<ipython-input-3-fc1d891ab539> in <listcomp>(.0)
      1 data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
----> 2 [s.capitalize() for s in data]

AttributeError: 'NoneType' object has no attribute 'capitalize'

In [4]:

import pandas as pd
names = pd.Series(data)
names

Out[4]:

0    peter
1     Paul
2     None
3     MARY
4    gUIDO
dtype: object

In [5]:

names.str.capitalize()

Out[5]:

0    Peter
1     Paul
2     None
3     Mary
4    Guido
dtype: object

Métodos para texto en Pandas¶

In [6]:

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

Métodos similares a los de Python (para texto)¶

Método

In [7]:

monte.str.lower()

Out[7]:

0    graham chapman
1       john cleese
2     terry gilliam
3         eric idle
4       terry jones
5     michael palin
dtype: object

In [8]:

monte.str.len()

Out[8]:

0    14
1    11
2    13
3     9
4    11
5    13
dtype: int64

In [9]:

monte.str.startswith('T')

Out[9]:

0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool

In [10]:

monte.str.split()

Out[10]:

0    [Graham, Chapman]
1       [John, Cleese]
2     [Terry, Gilliam]
3         [Eric, Idle]
4       [Terry, Jones]
5     [Michael, Palin]
dtype: object

Métodos que usan exrpresiones regulares¶

Método	Descripción
`match()`	Call `re.match()` on each element, returning a boolean.
`extract()`	Call `re.match()` on each element, returning matched groups as strings.
`findall()`	Call `re.findall()` on each element
`replace()`	Replace occurrences of pattern with some other string
`contains()`	Call `re.search()` on each element, returning a boolean
`count()`	Count occurrences of pattern
`split()`	Equivalent to `str.split()`, but accepts regexps
`rsplit()`	Equivalent to `str.rsplit()`, but accepts regexps

In [11]:

monte.str.extract('([A-Za-z]+)', expand=False)

Out[11]:

0     Graham
1       John
2      Terry
3       Eric
4      Terry
5    Michael
dtype: object

In [12]:

monte.str.findall(r'^[^AEIOU].*[^aeiou]$')

Out[12]:

0    [Graham Chapman]
1                  []
2     [Terry Gilliam]
3                  []
4       [Terry Jones]
5     [Michael Palin]
dtype: object

Métodos varios¶

Método	Descripción
`get()`	Index each element
`slice()`	Slice each element
`slice_replace()`	Replace slice in each element with passed value
`cat()`	Concatenate strings
`repeat()`	Repeat values
`normalize()`	Return Unicode form of string
`pad()`	Add whitespace to left, right, or both sides of strings
`wrap()`	Split long strings into lines with length less than a given width
`join()`	Join strings in each element of the Series with passed separator
`get_dummies()`	extract dummy variables as a dataframe

Acceso vectorizado a elementos y rebanadas de texto¶

In [13]:

monte.str[0:3]

Out[13]:

0    Gra
1    Joh
2    Ter
3    Eri
4    Ter
5    Mic
dtype: object

In [14]:

monte.str.split().str.get(-1)

Out[14]:

0    Chapman
1     Cleese
2    Gilliam
3       Idle
4      Jones
5      Palin
dtype: object

Variables de indicación¶

In [15]:

full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C',
                                    'B|D', 'B|C', 'B|C|D']})
full_monte

Out[15]:

	info	name
0	B\|C\|D	Graham Chapman
1	B\|D	John Cleese
2	A\|C	Terry Gilliam
3	B\|D	Eric Idle
4	B\|C	Terry Jones
5	B\|C\|D	Michael Palin

In [16]:

full_monte['info'].str.get_dummies('|')

Out[16]:

	A	B	C	D
0	0	1	1	1
1	0	1	0	1
2	1	0	1	0
3	0	1	0	1
4	0	1	1	0
5	0	1	1	1

Ejemplo: Base de datos de recetas¶

In [17]:

# !curl -O http://openrecipes.s3.amazonaws.com/recipeitems-latest.json.gz
# !gunzip recipeitems-latest.json.gz

In [18]:

try:
    recipes = pd.read_json('recipeitems-latest.json')
except ValueError as e:
    print("ValueError:", e)

ValueError: Trailing data

In [19]:

with open('recipeitems-latest.json') as f:
    line = f.readline()
pd.read_json(line).shape

Out[19]:

(2, 12)

In [20]:

# read the entire file into a Python array
with open('recipeitems-latest.json', 'r') as f:
    # Extract each line
    data = (line.strip() for line in f)
    # Reformat so each line is the element of a list
    data_json = "[{0}]".format(','.join(data))
# read the result as a JSON
recipes = pd.read_json(data_json)

In [21]:

recipes.shape

Out[21]:

(173278, 17)

In [22]:

recipes.iloc[0]

Out[22]:

_id                                {'$oid': '5160756b96cc62079cc2db15'}
cookTime                                                          PT30M
creator                                                             NaN
dateModified                                                        NaN
datePublished                                                2013-03-11
description           Late Saturday afternoon, after Marlboro Man ha...
image                 http://static.thepioneerwoman.com/cooking/file...
ingredients           Biscuits\n3 cups All-purpose Flour\n2 Tablespo...
name                                    Drop Biscuits and Sausage Gravy
prepTime                                                          PT10M
recipeCategory                                                      NaN
recipeInstructions                                                  NaN
recipeYield                                                          12
source                                                  thepioneerwoman
totalTime                                                           NaN
ts                                             {'$date': 1365276011104}
url                   http://thepioneerwoman.com/cooking/2013/03/dro...
Name: 0, dtype: object

In [23]:

recipes.ingredients.str.len().describe()

Out[23]:

count    173278.000000
mean        244.617926
std         146.705285
min           0.000000
25%         147.000000
50%         221.000000
75%         314.000000
max        9067.000000
Name: ingredients, dtype: float64

In [24]:

recipes.name[np.argmax(recipes.ingredients.str.len())]

Out[24]:

'Carrot Pineapple Spice &amp; Brownie Layer Cake with Whipped Cream &amp; Cream Cheese Frosting and Marzipan Carrots'

In [25]:

recipes.description.str.contains('[Bb]reakfast').sum()

Out[25]:

In [26]:

recipes.ingredients.str.contains('[Cc]innamon').sum()

Out[26]:

In [27]:

recipes.ingredients.str.contains('[Cc]inamon').sum()

Out[27]:

Un sistema de recomendación de recetas¶

In [28]:

spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',
              'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']

In [29]:

import re
spice_df = pd.DataFrame(dict((spice, recipes.ingredients.str.contains(spice, re.IGNORECASE))
                             for spice in spice_list))
spice_df.head()

Out[29]:

	cumin	oregano	paprika	parsley	pepper	rosemary	sage	salt	tarragon	thyme
0	False	False	False	False	False	False	True	False	False	False
1	False	False	False	False	False	False	False	False	False	False
2	True	False	False	False	True	False	False	True	False	False
3	False	False	False	False	False	False	False	False	False	False
4	False	False	False	False	False	False	False	False	False	False

In [30]:

selection = spice_df.query('parsley & paprika & tarragon')
len(selection)

Out[30]:

In [31]:

recipes.name[selection.index]

Out[31]:

2069      All cremat with a Little Gem, dandelion and wa...
74964                         Lobster with Thermidor butter
93768      Burton's Southern Fried Chicken with White Gravy
113926                     Mijo's Slow Cooker Shredded Beef
137686                     Asparagus Soup with Poached Eggs
140530                                 Fried Oyster Po’boys
158475                Lamb shank tagine with herb tabbouleh
158486                 Southern fried chicken in buttermilk
163175            Fried Chicken Sliders with Pickles + Slaw
165243                        Bar Tartine Cauliflower Salad
Name: name, dtype: object

< Agregaciones y agrupaciones | Contenido | Series de tiempo >