Feature Engineering
Categorical Features
In [1]:
data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}
]
In [2]:
# A naive integer coding like this wrongly implies an ordering of the neighborhoods
{'Queen Anne': 1, 'Fremont': 2, 'Wallingford': 3};
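A linear model would read such a coding as Wallingford being "three times" Queen Anne. The standard fix is one-hot encoding: one binary column per category, which is exactly what DictVectorizer does below. As a point of comparison, here is a minimal sketch of the same idea using pandas.get_dummies (pandas is only imported later in this notebook, so the import is repeated here for self-containment):

import pandas as pd

# Each neighborhood becomes its own 0/1 indicator column
df = pd.DataFrame(data)
pd.get_dummies(df, columns=['neighborhood'])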
In [3]:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False, dtype=int)
vec.fit_transform(data)
In [4]:
vec.get_feature_names_out()   # get_feature_names() was removed in scikit-learn 1.2
In [5]:
vec = DictVectorizer(sparse=True, dtype=int)
vec.fit_transform(data)
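When the categorical values are already isolated in an array rather than in dicts, scikit-learn's OneHotEncoder performs the same encoding. A minimal sketch on just the neighborhood values (the reshape(-1, 1) is needed because the encoder expects a 2D array, and it returns a sparse matrix by default, hence the toarray()):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

neighborhoods = np.array([d['neighborhood'] for d in data]).reshape(-1, 1)
enc = OneHotEncoder()
enc.fit_transform(neighborhoods).toarray()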
Text Features
In [6]:
sample = ['problem of evil',
          'evil queen',
          'horizon problem']
In [7]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
X = vec.fit_transform(sample)
X
In [8]:
import pandas as pd
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
X = vec.fit_transform(sample)
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
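One problem with raw counts is that words occurring frequently across all documents dominate the representation. The tf-idf weighting corrects for this by multiplying each term count by a measure of how rare the term is in the corpus. With scikit-learn's defaults (smooth_idf=True), the weight of term $t$ in document $d$ is

$$\text{tf-idf}(t, d) = \mathrm{tf}(t, d)\left(1 + \ln\frac{1 + n}{1 + \mathrm{df}(t)}\right),$$

where $n$ is the number of documents and $\mathrm{df}(t)$ is the number of documents containing $t$; each row of the resulting matrix is then normalized to unit Euclidean length.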
Image Features
For the digits data in Introduciendo Scikit-Learn, we simply used the pixel values themselves as features. But depending on the application, such an approach may not be optimal. A more involved example can be found in Feature Engineering: Working with Images.
Derived Features
In [10]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
x = np.array([1, 2, 3, 4, 5])
y = np.array([4, 2, 1, 3, 7])
plt.scatter(x, y);
In [11]:
from sklearn.linear_model import LinearRegression
X = x[:, np.newaxis]   # reshape to the 2D [n_samples, n_features] form scikit-learn expects
model = LinearRegression().fit(X, y)
yfit = model.predict(X)
plt.scatter(x, y)
plt.plot(x, yfit);
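The straight line clearly does not describe these points well. Rather than reaching for a more complicated model, we can give linear regression more flexibility by deriving new features from the input, here polynomial terms.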
In [12]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3, include_bias=False)
X2 = poly.fit_transform(X)   # columns are x, x^2, x^3
print(X2)
In [13]:
model = LinearRegression().fit(X2, y)
yfit = model.predict(X2)
plt.scatter(x, y)
plt.plot(x, yfit);
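This idea of improving a model not by changing the model but by transforming its inputs is the essence of basis function regression, and the pipeline machinery at the end of this notebook makes chaining such transformations straightforward.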
Imputation of Missing Data
In [14]:
from numpy import nan
X = np.array([[ nan, 0,   3 ],
              [ 3,   7,   9 ],
              [ 3,   5,   2 ],
              [ 4,   nan, 6 ],
              [ 8,   8,   1 ]])
y = np.array([14, 16, -1, 8, -5])
In [15]:
from sklearn.impute import SimpleImputer   # Imputer was removed from sklearn.preprocessing in 0.22
imp = SimpleImputer(strategy='mean')
X2 = imp.fit_transform(X)
X2
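The column mean is only one choice of fill value; SimpleImputer also supports strategies such as 'median', 'most_frequent', and 'constant'. A minimal sketch with the median, which is more robust to outliers:

from sklearn.impute import SimpleImputer

imp_median = SimpleImputer(strategy='median')
imp_median.fit_transform(X)   # nan entries replaced by each column's median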
In [16]:
model = LinearRegression().fit(X2, y)
model.predict(X2)
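Note that the model is predicting on the same data it was trained on, so close agreement with y here measures fit, not generalization to new data.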
Automating Feature Extraction: Pipelines
With examples like the ones above, applying each step by hand quickly becomes tedious. For example, we might want a processing pipeline that does the following:
- Impute missing values using the mean
- Transform features to quadratic
- Fit a linear regression
In [17]:
from sklearn.pipeline import make_pipeline
model = make_pipeline(SimpleImputer(strategy='mean'),
                      PolynomialFeatures(degree=2),
                      LinearRegression())
In [18]:
model.fit(X, y) # X with missing values, from above
print(y)
print(model.predict(X))
All the steps of the pipeline are applied automatically: fit imputes the missing values, expands the features to quadratic terms, and fits the regression in sequence, and predict pushes data through the same transformations before predicting.
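As a usage note, make_pipeline names each step after its lowercased class name, so the fitted pieces remain inspectable afterwards. A minimal sketch, assuming the fit above succeeded:

# The imputer's learned column means and the regression coefficients
print(model.named_steps['simpleimputer'].statistics_)
print(model.named_steps['linearregression'].coef_)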