Text operations
Introduction to text operations in Pandas¶
In [1]:
import numpy as np
x = np.array([2, 3, 5, 7, 11, 13])
x * 2
Out[1]:
In [2]:
data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]
Out[2]:
In [3]:
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
# This raises AttributeError: None has no .capitalize() method
[s.capitalize() for s in data]
In [4]:
import pandas as pd
names = pd.Series(data)
names
Out[4]:
In [5]:
names.str.capitalize()
Out[5]:
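The contrast between the failing list comprehension and the `.str` accessor can be sketched side by side. This is a minimal illustration, not part of the original notebook; the names `manual` and `vectorized` are chosen here for the example.

```python
import pandas as pd

data = ['peter', 'Paul', None, 'MARY', 'gUIDO']

# Plain Python needs an explicit guard, because None has no .capitalize()
manual = [s.capitalize() if s is not None else None for s in data]

# The .str accessor applies the method element-wise and skips missing values
vectorized = pd.Series(data).str.capitalize()

print(manual)
print(vectorized.isna().tolist())  # True only where the input was missing
```

How the missing entry is displayed (`None` or `NaN`) may depend on the pandas version and dtype, so the sketch checks it with `isna()` rather than comparing against a specific value.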
Text methods in Pandas¶
In [6]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
Methods similar to Python string methods¶
Method | | | |
---|---|---|---|
len() | lower() | translate() | islower() |
ljust() | upper() | startswith() | isupper() |
rjust() | find() | endswith() | isnumeric() |
center() | rfind() | isalnum() | isdecimal() |
zfill() | index() | isalpha() | split() |
strip() | rindex() | isdigit() | rsplit() |
rstrip() | capitalize() | isspace() | partition() |
lstrip() | swapcase() | istitle() | rpartition() |
In [7]:
monte.str.lower()
Out[7]:
In [8]:
monte.str.len()
Out[8]:
In [9]:
monte.str.startswith('T')
Out[9]:
In [10]:
monte.str.split()
Out[10]:
Methods using regular expressions¶
Method | Description |
---|---|
match() | Call re.match() on each element, returning a boolean. |
extract() | Call re.search() on each element, returning the groups of the first match as strings. |
findall() | Call re.findall() on each element. |
replace() | Replace occurrences of a pattern with another string. |
contains() | Call re.search() on each element, returning a boolean. |
count() | Count occurrences of a pattern. |
split() | Equivalent to str.split(), but accepts regexps. |
rsplit() | Equivalent to str.rsplit(), but accepts regexps. |
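The difference between anchored and unanchored matching is easy to miss, so here is a small sketch on a two-element Series. The Series `s` and the result names are made up for this illustration. Note that `extract()` finds its pattern anywhere in the string, which is why the surname pattern below succeeds even though it does not match at the start.

```python
import pandas as pd

s = pd.Series(['Terry Jones', 'Eric Idle'])

# match() anchors the pattern at the start of each string
starts_with_jones = s.str.match(r'Jones')

# contains() searches anywhere in the string
has_jones = s.str.contains(r'Jones')

# extract() returns the capture group of the first match found anywhere
surname = s.str.extract(r'\s(\w+)$', expand=False)

# count() tallies non-overlapping matches of the pattern
n_lower_e = s.str.count(r'e')

print(starts_with_jones.tolist())  # [False, False]
print(has_jones.tolist())          # [True, False]
print(surname.tolist())            # ['Jones', 'Idle']
print(n_lower_e.tolist())          # [2, 1]
```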
In [11]:
monte.str.extract('([A-Za-z]+)', expand=False)
Out[11]:
In [12]:
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')
Out[12]:
Miscellaneous methods¶
Method | Description |
---|---|
get() | Index each element. |
slice() | Slice each element. |
slice_replace() | Replace slice in each element with passed value. |
cat() | Concatenate strings. |
repeat() | Repeat values. |
normalize() | Return Unicode form of string. |
pad() | Add whitespace to left, right, or both sides of strings. |
wrap() | Split long strings into lines with length less than a given width. |
join() | Join strings in each element of the Series with passed separator. |
get_dummies() | Extract dummy variables as a DataFrame. |
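A few of the methods in the table above can be tried on a shortened copy of the `monte` Series. This is only a sketch; the padding width and fill character are arbitrary choices made for the example.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam'])

# get(): index into each string
first_char = monte.str.get(0)

# slice(): take a substring of each element
prefix = monte.str.slice(0, 3)

# pad(): fill on the left up to a total width of 15 characters
padded = monte.str.pad(15, side='left', fillchar='.')

# join(): glue the list elements produced by split() back together
joined = monte.str.split().str.join('_')

# cat(): with only a separator, collapse the whole Series into one string
all_names = monte.str.cat(sep=', ')

print(first_char.tolist())  # ['G', 'J', 'T']
print(prefix.tolist())      # ['Gra', 'Joh', 'Ter']
print(padded.tolist())      # ['.Graham Chapman', '....John Cleese', '..Terry Gilliam']
print(joined.tolist())      # ['Graham_Chapman', 'John_Cleese', 'Terry_Gilliam']
print(all_names)            # Graham Chapman, John Cleese, Terry Gilliam
```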
Vectorized item access and slicing¶
In [13]:
monte.str[0:3]
Out[13]:
In [14]:
monte.str.split().str.get(-1)
Out[14]:
Indicator variables¶
In [15]:
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C',
                                    'B|D', 'B|C', 'B|C|D']})
full_monte
Out[15]:
In [16]:
full_monte['info'].str.get_dummies('|')
Out[16]:
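To make it clearer what `get_dummies()` computes, the sketch below rebuilds the same indicator table by hand for three of the codes. The `manual` construction is an illustrative equivalent written for this example, not how pandas implements the method.

```python
import pandas as pd

info = pd.Series(['B|C|D', 'B|D', 'A|C'])

# Built-in: one 0/1 column per distinct code found between the separators
dummies = info.str.get_dummies('|')

# Hand-rolled equivalent: split each entry, then test membership per code
codes = ['A', 'B', 'C', 'D']
parts = info.str.split('|')
manual = pd.DataFrame({c: parts.apply(lambda p: int(c in p)) for c in codes})

print(dummies.columns.tolist())               # ['A', 'B', 'C', 'D']
print(dummies.astype(int).values.tolist())    # [[0, 1, 1, 1], [0, 1, 0, 1], [1, 0, 1, 0]]
```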
Example: Recipe database¶
In [17]:
# !curl -O http://openrecipes.s3.amazonaws.com/recipeitems-latest.json.gz
# !gunzip recipeitems-latest.json.gz
In [18]:
try:
    recipes = pd.read_json('recipeitems-latest.json')
except ValueError as e:
    print("ValueError:", e)
In [19]:
from io import StringIO  # newer pandas expects a file-like object, not a raw JSON string

with open('recipeitems-latest.json') as f:
    line = f.readline()
pd.read_json(StringIO(line)).shape
Out[19]:
In [20]:
from io import StringIO

# Read the entire file and combine its lines into one JSON array
with open('recipeitems-latest.json', 'r') as f:
    # Extract each line
    data = (line.strip() for line in f)
    # Reformat so each line is an element of a list
    data_json = "[{0}]".format(','.join(data))
# Read the result as JSON (StringIO avoids passing a literal JSON string)
recipes = pd.read_json(StringIO(data_json))
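The manual join works because the file holds one JSON object per line. As a hedged alternative, `pd.read_json` also accepts `lines=True` for this layout. The sketch below compares both approaches on two made-up records held in memory; `raw` and its contents are hypothetical and are not taken from the recipe file.

```python
from io import StringIO

import pandas as pd

# Two hypothetical records, one JSON object per line (not the real recipe data)
raw = (
    '{"name": "Toast", "ingredients": "bread, butter"}\n'
    '{"name": "Tea", "ingredients": "water, tea leaves"}\n'
)

# Manual approach: join the lines into a single JSON array, then parse it
joined = "[{0}]".format(",".join(line.strip() for line in raw.splitlines()))
manual = pd.read_json(StringIO(joined))

# Built-in alternative: let pandas parse one record per line
direct = pd.read_json(StringIO(raw), lines=True)

print(manual.shape, direct.shape)
```

On the real file, the equivalent call would presumably be `pd.read_json('recipeitems-latest.json', lines=True)`, assuming every line is a complete JSON object.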
In [21]:
recipes.shape
Out[21]:
In [22]:
recipes.iloc[0]
Out[22]:
In [23]:
recipes.ingredients.str.len().describe()
Out[23]:
In [24]:
recipes.name.iloc[np.argmax(recipes.ingredients.str.len())]
Out[24]:
In [25]:
recipes.description.str.contains('[Bb]reakfast').sum()
Out[25]:
In [26]:
recipes.ingredients.str.contains('[Cc]innamon').sum()
Out[26]:
In [27]:
# Deliberate misspelling: count recipes that spell the ingredient as "cinamon"
recipes.ingredients.str.contains('[Cc]inamon').sum()
Out[27]:
A recipe recommendation system¶
In [28]:
spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',
              'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']
In [29]:
import re
# flags must be passed by keyword: the second positional parameter of str.contains is `case`
spice_df = pd.DataFrame({spice: recipes.ingredients.str.contains(spice, flags=re.IGNORECASE)
                         for spice in spice_list})
spice_df.head()
Out[29]:
In [30]:
selection = spice_df.query('parsley & paprika & tarragon')
len(selection)
Out[30]:
In [31]:
recipes.name[selection.index]
Out[31]:
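The recommender steps above can be tried without downloading the dataset. The sketch below uses a hypothetical three-row table (`toy`) in place of `recipes`; the recipe names and ingredient strings are invented for illustration.

```python
import re

import pandas as pd

# Hypothetical mini recipe table standing in for the real dataset
toy = pd.DataFrame({
    'name': ['Herb chicken', 'Plain rice', 'Spiced stew'],
    'ingredients': ['Parsley, PAPRIKA, tarragon, chicken',
                    'rice, water, salt',
                    'paprika, cumin, beef'],
})
spices = ['parsley', 'paprika', 'tarragon', 'salt']

# One boolean column per spice, matched case-insensitively
flags_df = pd.DataFrame({
    spice: toy['ingredients'].str.contains(spice, flags=re.IGNORECASE)
    for spice in spices
})

# query() evaluates the boolean expression against the column names
selection = flags_df.query('parsley & paprika & tarragon')

# Use the surviving index labels to look up the recipe names
matches = toy.loc[selection.index, 'name'].tolist()
print(matches)  # ['Herb chicken']
```

Only the first row contains all three spices once case is ignored, so it is the only recipe selected.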