This is an excerpt from the Python Data Science Handbook by Jake VanderPlas; Jupyter notebooks are available on GitHub.

The text is released under the CC-BY-NC-ND license, and code is released under the MIT license. If you find this content useful, please consider supporting the work by buying the book!

Series de tiempo

Fechas y horas: representación del tiempo en Python

Python natico: datetime y dateutil

In [1]:
from datetime import datetime
datetime(year=2015, month=7, day=4)
Out[1]:
datetime.datetime(2015, 7, 4, 0, 0)
In [2]:
from dateutil import parser
date = parser.parse("4th of July, 2015")
date
Out[2]:
datetime.datetime(2015, 7, 4, 0, 0)
In [3]:
date.strftime('%A')
Out[3]:
'Saturday'

Arrays de tipo tiempo: datetime64 de NumPy

In [4]:
import numpy as np
date = np.array('2015-07-04', dtype=np.datetime64)
date
Out[4]:
array(datetime.date(2015, 7, 4), dtype='datetime64[D]')
In [5]:
date + np.arange(12)
Out[5]:
array(['2015-07-04', '2015-07-05', '2015-07-06', '2015-07-07',
       '2015-07-08', '2015-07-09', '2015-07-10', '2015-07-11',
       '2015-07-12', '2015-07-13', '2015-07-14', '2015-07-15'], dtype='datetime64[D]')
In [6]:
np.datetime64('2015-07-04')
Out[6]:
numpy.datetime64('2015-07-04')
In [7]:
np.datetime64('2015-07-04 12:00')
Out[7]:
numpy.datetime64('2015-07-04T12:00')
In [8]:
np.datetime64('2015-07-04 12:59:59.50', 'ns')
Out[8]:
numpy.datetime64('2015-07-04T12:59:59.500000000')
Code Meaning Time span (relative) Time span (absolute)
Y Year ± 9.2e18 years [9.2e18 BC, 9.2e18 AD]
M Month ± 7.6e17 years [7.6e17 BC, 7.6e17 AD]
W Week ± 1.7e17 years [1.7e17 BC, 1.7e17 AD]
D Day ± 2.5e16 years [2.5e16 BC, 2.5e16 AD]
h Hour ± 1.0e15 years [1.0e15 BC, 1.0e15 AD]
m Minute ± 1.7e13 years [1.7e13 BC, 1.7e13 AD]
s Second ± 2.9e12 years [ 2.9e9 BC, 2.9e9 AD]
ms Millisecond ± 2.9e9 years [ 2.9e6 BC, 2.9e6 AD]
us Microsecond ± 2.9e6 years [290301 BC, 294241 AD]
ns Nanosecond ± 292 years [ 1678 AD, 2262 AD]
ps Picosecond ± 106 days [ 1969 AD, 1970 AD]
fs Femtosecond ± 2.6 hours [ 1969 AD, 1970 AD]
as Attosecond ± 9.2 seconds [ 1969 AD, 1970 AD]

Fechas y horas en Pandas: lo mejor de los dos mundos

In [9]:
import pandas as pd
date = pd.to_datetime("4th of July, 2015")
date
Out[9]:
Timestamp('2015-07-04 00:00:00')
In [10]:
date.strftime('%A')
Out[10]:
'Saturday'
In [11]:
date + pd.to_timedelta(np.arange(12), 'D')
Out[11]:
DatetimeIndex(['2015-07-04', '2015-07-05', '2015-07-06', '2015-07-07',
               '2015-07-08', '2015-07-09', '2015-07-10', '2015-07-11',
               '2015-07-12', '2015-07-13', '2015-07-14', '2015-07-15'],
              dtype='datetime64[ns]', freq=None)

Series de Tiempo en Pandas: Indexado por el tiempo

In [12]:
index = pd.DatetimeIndex(['2014-07-04', '2014-08-04',
                          '2015-07-04', '2015-08-04'])
data = pd.Series([0, 1, 2, 3], index=index)
data
Out[12]:
2014-07-04    0
2014-08-04    1
2015-07-04    2
2015-08-04    3
dtype: int64
In [13]:
data['2014-07-04':'2015-07-04']
Out[13]:
2014-07-04    0
2014-08-04    1
2015-07-04    2
dtype: int64
In [14]:
data['2015']
Out[14]:
2015-07-04    2
2015-08-04    3
dtype: int64

Estructuras de Series de tiempo en Pandas

  • For time stamps, Pandas provides the Timestamp type. As mentioned before, it is essentially a replacement for Python's native datetime, but is based on the more efficient numpy.datetime64 data type. The associated Index structure is DatetimeIndex.
  • For time Periods, Pandas provides the Period type. This encodes a fixed-frequency interval based on numpy.datetime64. The associated index structure is PeriodIndex.
  • For time deltas or durations, Pandas provides the Timedelta type. Timedelta is a more efficient replacement for Python's native datetime.timedelta type, and is based on numpy.timedelta64. The associated index structure is TimedeltaIndex.
In [15]:
dates = pd.to_datetime([datetime(2015, 7, 3), '4th of July, 2015',
                       '2015-Jul-6', '07-07-2015', '20150708'])
dates
Out[15]:
DatetimeIndex(['2015-07-03', '2015-07-04', '2015-07-06', '2015-07-07',
               '2015-07-08'],
              dtype='datetime64[ns]', freq=None)
In [16]:
dates.to_period('D')
Out[16]:
PeriodIndex(['2015-07-03', '2015-07-04', '2015-07-06', '2015-07-07',
             '2015-07-08'],
            dtype='int64', freq='D')
In [17]:
dates - dates[0]
Out[17]:
TimedeltaIndex(['0 days', '1 days', '3 days', '4 days', '5 days'], dtype='timedelta64[ns]', freq=None)

Sucesiones regulares: pd.date_range()

In [18]:
pd.date_range('2015-07-03', '2015-07-10')
Out[18]:
DatetimeIndex(['2015-07-03', '2015-07-04', '2015-07-05', '2015-07-06',
               '2015-07-07', '2015-07-08', '2015-07-09', '2015-07-10'],
              dtype='datetime64[ns]', freq='D')
In [19]:
pd.date_range('2015-07-03', periods=8)
Out[19]:
DatetimeIndex(['2015-07-03', '2015-07-04', '2015-07-05', '2015-07-06',
               '2015-07-07', '2015-07-08', '2015-07-09', '2015-07-10'],
              dtype='datetime64[ns]', freq='D')
In [20]:
pd.date_range('2015-07-03', periods=8, freq='H')
Out[20]:
DatetimeIndex(['2015-07-03 00:00:00', '2015-07-03 01:00:00',
               '2015-07-03 02:00:00', '2015-07-03 03:00:00',
               '2015-07-03 04:00:00', '2015-07-03 05:00:00',
               '2015-07-03 06:00:00', '2015-07-03 07:00:00'],
              dtype='datetime64[ns]', freq='H')
In [21]:
pd.period_range('2015-07', periods=8, freq='M')
Out[21]:
PeriodIndex(['2015-07', '2015-08', '2015-09', '2015-10', '2015-11', '2015-12',
             '2016-01', '2016-02'],
            dtype='int64', freq='M')
In [22]:
pd.timedelta_range(0, periods=10, freq='H')
Out[22]:
TimedeltaIndex(['00:00:00', '01:00:00', '02:00:00', '03:00:00', '04:00:00',
                '05:00:00', '06:00:00', '07:00:00', '08:00:00', '09:00:00'],
               dtype='timedelta64[ns]', freq='H')

Frecuencias e intervalos (offsets)

Code Description Code Description
D Calendar day B Business day
W Weekly
M Month end BM Business month end
Q Quarter end BQ Business quarter end
A Year end BA Business year end
H Hours BH Business hours
T Minutes
S Seconds
L Milliseonds
U Microseconds
N nanoseconds
Code Description Code Description
MS Month start BMS Business month start
QS Quarter start BQS Business quarter start
AS Year start BAS Business year start
  • Q-JAN, BQ-FEB, QS-MAR, BQS-APR, etc.
  • A-JAN, BA-FEB, AS-MAR, BAS-APR, etc.
  • W-SUN, W-MON, W-TUE, W-WED, etc.
In [23]:
pd.timedelta_range(0, periods=9, freq="2H30T")
Out[23]:
TimedeltaIndex(['00:00:00', '02:30:00', '05:00:00', '07:30:00', '10:00:00',
                '12:30:00', '15:00:00', '17:30:00', '20:00:00'],
               dtype='timedelta64[ns]', freq='150T')
In [24]:
from pandas.tseries.offsets import BDay
pd.date_range('2015-07-01', periods=5, freq=BDay())
Out[24]:
DatetimeIndex(['2015-07-01', '2015-07-02', '2015-07-03', '2015-07-06',
               '2015-07-07'],
              dtype='datetime64[ns]', freq='B')

Muestreo, Cambios y Ventanas (resample, shift and windows)

In [25]:
from pandas_datareader import data

goog = data.DataReader('GOOG', start='2004', end='2016',
                       data_source='google')
goog.head()
Out[25]:
Open High Low Close Volume
Date
2004-08-19 49.96 51.98 47.93 50.12 NaN
2004-08-20 50.69 54.49 50.20 54.10 NaN
2004-08-23 55.32 56.68 54.47 54.65 NaN
2004-08-24 55.56 55.74 51.73 52.38 NaN
2004-08-25 52.43 53.95 51.89 52.95 NaN
In [26]:
goog = goog['Close']
In [27]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set()
In [28]:
goog.plot();

Muestres y conversión de frecuencias

In [29]:
goog.plot(alpha=0.5, style='-')
goog.resample('BA').mean().plot(style=':')
goog.asfreq('BA').plot(style='--');
plt.legend(['input', 'resample', 'asfreq'],
           loc='upper left');
In [30]:
fig, ax = plt.subplots(2, sharex=True)
data = goog.iloc[:10]

data.asfreq('D').plot(ax=ax[0], marker='o')

data.asfreq('D', method='bfill').plot(ax=ax[1], style='-o')
data.asfreq('D', method='ffill').plot(ax=ax[1], style='--o')
ax[1].legend(["back-fill", "forward-fill"]);

Cambios de tiempos (time-shifts)

In [31]:
fig, ax = plt.subplots(3, sharey=True)

# apply a frequency to the data
goog = goog.asfreq('D', method='pad')

goog.plot(ax=ax[0])
goog.shift(900).plot(ax=ax[1])
goog.tshift(900).plot(ax=ax[2])

# legends and annotations
local_max = pd.to_datetime('2007-11-05')
offset = pd.Timedelta(900, 'D')

ax[0].legend(['input'], loc=2)
ax[0].get_xticklabels()[2].set(weight='heavy', color='red')
ax[0].axvline(local_max, alpha=0.3, color='red')

ax[1].legend(['shift(900)'], loc=2)
ax[1].get_xticklabels()[2].set(weight='heavy', color='red')
ax[1].axvline(local_max + offset, alpha=0.3, color='red')

ax[2].legend(['tshift(900)'], loc=2)
ax[2].get_xticklabels()[1].set(weight='heavy', color='red')
ax[2].axvline(local_max + offset, alpha=0.3, color='red');
In [32]:
ROI = 100 * (goog.tshift(-365) / goog - 1)
ROI.plot()
plt.ylabel('% Return on Investment');

Ventanas móviles (rolling windows)

In [33]:
rolling = goog.rolling(365, center=True)

data = pd.DataFrame({'input': goog,
                     'one-year rolling_mean': rolling.mean(),
                     'one-year rolling_std': rolling.std()})
ax = data.plot(style=['-', '--', ':'])
ax.lines[0].set_alpha(0.3)

Información adicional

Ejemplo: Visualización del número de bicicletas en Seattle

In [34]:
# !curl -o FremontBridge.csv https://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD
In [35]:
data = pd.read_csv('FremontBridge.csv', index_col='Date', parse_dates=True)
data.head()
Out[35]:
Fremont Bridge West Sidewalk Fremont Bridge East Sidewalk
Date
2012-10-03 00:00:00 4.0 9.0
2012-10-03 01:00:00 4.0 6.0
2012-10-03 02:00:00 1.0 1.0
2012-10-03 03:00:00 2.0 3.0
2012-10-03 04:00:00 6.0 1.0
In [36]:
data.columns = ['West', 'East']
data['Total'] = data.eval('West + East')
In [37]:
data.dropna().describe()
Out[37]:
West East Total
count 35752.000000 35752.000000 35752.000000
mean 61.470267 54.410774 115.881042
std 82.588484 77.659796 145.392385
min 0.000000 0.000000 0.000000
25% 8.000000 7.000000 16.000000
50% 33.000000 28.000000 65.000000
75% 79.000000 67.000000 151.000000
max 825.000000 717.000000 1186.000000

Visualizando los datos

In [38]:
%matplotlib inline
import seaborn; seaborn.set()
In [39]:
data.plot()
plt.ylabel('Hourly Bicycle Count');
In [40]:
weekly = data.resample('W').sum()
weekly.plot(style=[':', '--', '-'])
plt.ylabel('Weekly bicycle count');
In [41]:
daily = data.resample('D').sum()
daily.rolling(30, center=True).sum().plot(style=[':', '--', '-'])
plt.ylabel('mean hourly count');
In [42]:
daily.rolling(50, center=True,
              win_type='gaussian').sum(std=10).plot(style=[':', '--', '-']);

Profundizando en los datos

In [43]:
by_time = data.groupby(data.index.time).mean()
hourly_ticks = 4 * 60 * 60 * np.arange(6)
by_time.plot(xticks=hourly_ticks, style=[':', '--', '-']);
In [44]:
by_weekday = data.groupby(data.index.dayofweek).mean()
by_weekday.index = ['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun']
by_weekday.plot(style=[':', '--', '-']);
In [45]:
weekend = np.where(data.index.weekday < 5, 'Weekday', 'Weekend')
by_time = data.groupby([weekend, data.index.time]).mean()
In [46]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 2, figsize=(14, 5))
by_time.ix['Weekday'].plot(ax=ax[0], title='Weekdays',
                           xticks=hourly_ticks, style=[':', '--', '-'])
by_time.ix['Weekend'].plot(ax=ax[1], title='Weekends',
                           xticks=hourly_ticks, style=[':', '--', '-']);