Dates and times: representing time in Python

Native Python: `datetime` and `dateutil`

In [1]:

from datetime import datetime
datetime(year=2015, month=7, day=4)

Out[1]:

datetime.datetime(2015, 7, 4, 0, 0)

In [2]:

from dateutil import parser
date = parser.parse("4th of July, 2015")
date

Out[2]:

datetime.datetime(2015, 7, 4, 0, 0)

In [3]:

date.strftime('%A')

Out[3]:

'Saturday'

Typed arrays of times: NumPy's `datetime64`

In [4]:

import numpy as np
date = np.array('2015-07-04', dtype=np.datetime64)
date

Out[4]:

array('2015-07-04', dtype='datetime64[D]')

In [5]:

date + np.arange(12)

Out[5]:

array(['2015-07-04', '2015-07-05', '2015-07-06', '2015-07-07',
       '2015-07-08', '2015-07-09', '2015-07-10', '2015-07-11',
       '2015-07-12', '2015-07-13', '2015-07-14', '2015-07-15'], dtype='datetime64[D]')

In [6]:

np.datetime64('2015-07-04')

Out[6]:

numpy.datetime64('2015-07-04')

In [7]:

np.datetime64('2015-07-04 12:00')

Out[7]:

numpy.datetime64('2015-07-04T12:00')

In [8]:

np.datetime64('2015-07-04 12:59:59.50', 'ns')

Out[8]:

numpy.datetime64('2015-07-04T12:59:59.500000000')

Code	Meaning	Time span (relative)	Time span (absolute)
`Y`	Year	± 9.2e18 years	[9.2e18 BC, 9.2e18 AD]
`M`	Month	± 7.6e17 years	[7.6e17 BC, 7.6e17 AD]
`W`	Week	± 1.7e17 years	[1.7e17 BC, 1.7e17 AD]
`D`	Day	± 2.5e16 years	[2.5e16 BC, 2.5e16 AD]
`h`	Hour	± 1.0e15 years	[1.0e15 BC, 1.0e15 AD]
`m`	Minute	± 1.7e13 years	[1.7e13 BC, 1.7e13 AD]
`s`	Second	± 2.9e11 years	[2.9e11 BC, 2.9e11 AD]
`ms`	Millisecond	± 2.9e8 years	[2.9e8 BC, 2.9e8 AD]
`us`	Microsecond	± 2.9e5 years	[290301 BC, 294241 AD]
`ns`	Nanosecond	± 292 years	[ 1678 AD, 2262 AD]
`ps`	Picosecond	± 106 days	[ 1969 AD, 1970 AD]
`fs`	Femtosecond	± 2.6 hours	[ 1969 AD, 1970 AD]
`as`	Attosecond	± 9.2 seconds	[ 1969 AD, 1970 AD]
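
The trade-off in the table follows from `datetime64` storing each value as a 64-bit integer count of the chosen unit: the finer the unit, the shorter the representable range. The sketch below is illustrative only; it recomputes the nanosecond span under the assumption of a 365.25-day year and shows how the unit is inferred or cast. The variable names are made up for the example.

```python
import numpy as np

# Largest signed 64-bit integer, interpreted as a count of nanoseconds.
max_ns = np.iinfo(np.int64).max

# Assumption: a year is taken as 365.25 days for this rough estimate.
seconds_per_year = 365.25 * 24 * 60 * 60
span_years = max_ns / 1e9 / seconds_per_year
print(round(span_years))          # 292

# The unit is inferred from the string: day precision here...
day = np.datetime64('2015-07-04')
# ...and minute precision here.
minute = np.datetime64('2015-07-04 12:00')
print(day.dtype, minute.dtype)    # datetime64[D] datetime64[m]

# Casting to a finer unit keeps the same instant.
second = minute.astype('datetime64[s]')
print(second)                     # 2015-07-04T12:00:00
```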

Dates and times in Pandas: the best of both worlds

In [9]:

import pandas as pd
date = pd.to_datetime("4th of July, 2015")
date

Out[9]:

Timestamp('2015-07-04 00:00:00')

In [10]:

date.strftime('%A')

Out[10]:

'Saturday'

In [11]:

date + pd.to_timedelta(np.arange(12), 'D')

Out[11]:

DatetimeIndex(['2015-07-04', '2015-07-05', '2015-07-06', '2015-07-07',
               '2015-07-08', '2015-07-09', '2015-07-10', '2015-07-11',
               '2015-07-12', '2015-07-13', '2015-07-14', '2015-07-15'],
              dtype='datetime64[ns]', freq=None)

Pandas time series: indexing by time

In [12]:

index = pd.DatetimeIndex(['2014-07-04', '2014-08-04',
                          '2015-07-04', '2015-08-04'])
data = pd.Series([0, 1, 2, 3], index=index)
data

Out[12]:

2014-07-04    0
2014-08-04    1
2015-07-04    2
2015-08-04    3
dtype: int64

In [13]:

data['2014-07-04':'2015-07-04']

Out[13]:

2014-07-04    0
2014-08-04    1
2015-07-04    2
dtype: int64

In [14]:

data['2015']

Out[14]:

2015-07-04    2
2015-08-04    3
dtype: int64

Pandas time series data structures

For time stamps, Pandas provides the `Timestamp` type. As mentioned before, it is essentially a replacement for Python's native `datetime`, but is based on the more efficient `numpy.datetime64` data type. The associated index structure is `DatetimeIndex`.
For time periods, Pandas provides the `Period` type. This encodes a fixed-frequency interval based on `numpy.datetime64`. The associated index structure is `PeriodIndex`.
For time deltas or durations, Pandas provides the `Timedelta` type. `Timedelta` is a more efficient replacement for Python's native `datetime.timedelta` type, and is based on `numpy.timedelta64`. The associated index structure is `TimedeltaIndex`.
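
The three scalar types described above can be built directly. This is a minimal sketch with arbitrary example values (the dates and variable names are not from the original notebook), showing how an instant, an interval, and a duration relate to each other.

```python
import pandas as pd

ts = pd.Timestamp('2015-07-04 12:00')      # a single instant
period = pd.Period('2015-07', freq='M')    # the whole month of July 2015
delta = pd.Timedelta(days=1, hours=12)     # a duration

# A Period knows where its interval starts.
print(period.start_time)                   # 2015-07-01 00:00:00

# Timestamp + Timedelta gives another Timestamp.
later = ts + delta
print(later)                               # 2015-07-06 00:00:00

# A Timestamp can be tested against the bounds of a Period.
inside = period.start_time <= ts <= period.end_time
print(inside)                              # True
```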

In [15]:

dates = pd.to_datetime([datetime(2015, 7, 3), '4th of July, 2015',
                       '2015-Jul-6', '07-07-2015', '20150708'])
dates

Out[15]:

DatetimeIndex(['2015-07-03', '2015-07-04', '2015-07-06', '2015-07-07',
               '2015-07-08'],
              dtype='datetime64[ns]', freq=None)

In [16]:

dates.to_period('D')

Out[16]:

PeriodIndex(['2015-07-03', '2015-07-04', '2015-07-06', '2015-07-07',
             '2015-07-08'],
            dtype='int64', freq='D')

In [17]:

dates - dates[0]

Out[17]:

TimedeltaIndex(['0 days', '1 days', '3 days', '4 days', '5 days'], dtype='timedelta64[ns]', freq=None)

Regular sequences: `pd.date_range()`

In [18]:

pd.date_range('2015-07-03', '2015-07-10')

Out[18]:

DatetimeIndex(['2015-07-03', '2015-07-04', '2015-07-05', '2015-07-06',
               '2015-07-07', '2015-07-08', '2015-07-09', '2015-07-10'],
              dtype='datetime64[ns]', freq='D')

In [19]:

pd.date_range('2015-07-03', periods=8)

Out[19]:

DatetimeIndex(['2015-07-03', '2015-07-04', '2015-07-05', '2015-07-06',
               '2015-07-07', '2015-07-08', '2015-07-09', '2015-07-10'],
              dtype='datetime64[ns]', freq='D')

In [20]:

pd.date_range('2015-07-03', periods=8, freq='H')

Out[20]:

DatetimeIndex(['2015-07-03 00:00:00', '2015-07-03 01:00:00',
               '2015-07-03 02:00:00', '2015-07-03 03:00:00',
               '2015-07-03 04:00:00', '2015-07-03 05:00:00',
               '2015-07-03 06:00:00', '2015-07-03 07:00:00'],
              dtype='datetime64[ns]', freq='H')

In [21]:

pd.period_range('2015-07', periods=8, freq='M')

Out[21]:

PeriodIndex(['2015-07', '2015-08', '2015-09', '2015-10', '2015-11', '2015-12',
             '2016-01', '2016-02'],
            dtype='int64', freq='M')

In [22]:

pd.timedelta_range(0, periods=10, freq='H')

Out[22]:

TimedeltaIndex(['00:00:00', '01:00:00', '02:00:00', '03:00:00', '04:00:00',
                '05:00:00', '06:00:00', '07:00:00', '08:00:00', '09:00:00'],
               dtype='timedelta64[ns]', freq='H')

Frequencies and offsets

Code	Description	Code	Description
`D`	Calendar day	`B`	Business day
`W`	Weekly
`M`	Month end	`BM`	Business month end
`Q`	Quarter end	`BQ`	Business quarter end
`A`	Year end	`BA`	Business year end
`H`	Hours	`BH`	Business hours
`T`	Minutes
`S`	Seconds
`L`	Milliseconds
`U`	Microseconds
`N`	Nanoseconds

Code	Description	Code	Description
`MS`	Month start	`BMS`	Business month start
`QS`	Quarter start	`BQS`	Business quarter start
`AS`	Year start	`BAS`	Business year start

Q-JAN, BQ-FEB, QS-MAR, BQS-APR, etc.
A-JAN, BA-FEB, AS-MAR, BAS-APR, etc.

W-SUN, W-MON, W-TUE, W-WED, etc.
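
The suffixed codes listed above anchor a frequency to a particular weekday or month. A small sketch of how two of them behave, using arbitrary start dates chosen for illustration (not taken from the original notebook):

```python
import pandas as pd

# Weekly frequency anchored on Wednesdays; 2015-07-01 is itself a Wednesday.
wednesdays = pd.date_range('2015-07-01', periods=3, freq='W-WED')
print(wednesdays.strftime('%Y-%m-%d').tolist())
# ['2015-07-01', '2015-07-08', '2015-07-15']

# Quarter starts for quarters anchored on March: Mar, Jun, Sep, Dec.
quarter_starts = pd.date_range('2015-01-01', periods=4, freq='QS-MAR')
print(quarter_starts.month.tolist())
# [3, 6, 9, 12]
```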

In [23]:

pd.timedelta_range(0, periods=9, freq="2H30T")

Out[23]:

TimedeltaIndex(['00:00:00', '02:30:00', '05:00:00', '07:30:00', '10:00:00',
                '12:30:00', '15:00:00', '17:30:00', '20:00:00'],
               dtype='timedelta64[ns]', freq='150T')

In [24]:

from pandas.tseries.offsets import BDay
pd.date_range('2015-07-01', periods=5, freq=BDay())

Out[24]:

DatetimeIndex(['2015-07-01', '2015-07-02', '2015-07-03', '2015-07-06',
               '2015-07-07'],
              dtype='datetime64[ns]', freq='B')

Resampling, shifting, and windowing

In [25]:

from pandas_datareader import data

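# Note: the 'google' data source may no longer be available in recent
# pandas-datareader releases; another supported source may be needed.
goog = data.DataReader('GOOG', start='2004', end='2016',
                       data_source='google')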
goog.head()

Out[25]:

	Open	High	Low	Close	Volume
Date
2004-08-19	49.96	51.98	47.93	50.12	NaN
2004-08-20	50.69	54.49	50.20	54.10	NaN
2004-08-23	55.32	56.68	54.47	54.65	NaN
2004-08-24	55.56	55.74	51.73	52.38	NaN
2004-08-25	52.43	53.95	51.89	52.95	NaN

In [26]:

goog = goog['Close']

In [27]:

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set()

In [28]:

goog.plot();

Resampling and converting frequencies

In [29]:

goog.plot(alpha=0.5, style='-')
goog.resample('BA').mean().plot(style=':')
goog.asfreq('BA').plot(style='--');
plt.legend(['input', 'resample', 'asfreq'],
           loc='upper left');

In [30]:

fig, ax = plt.subplots(2, sharex=True)
data = goog.iloc[:10]

data.asfreq('D').plot(ax=ax[0], marker='o')

data.asfreq('D', method='bfill').plot(ax=ax[1], style='-o')
data.asfreq('D', method='ffill').plot(ax=ax[1], style='--o')
ax[1].legend(["back-fill", "forward-fill"]);
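
The difference between `resample()` and `asfreq()` used in the plots above is easier to see on a tiny series. A minimal sketch with a hypothetical six-value series (the data and names are invented for illustration): `resample()` aggregates every value in each bin, while `asfreq()` only selects the value at each step.

```python
import pandas as pd

# Hypothetical toy series: six consecutive daily values.
s = pd.Series([1, 2, 3, 4, 5, 6],
              index=pd.date_range('2015-07-01', periods=6, freq='D'))

# resample: average of all values falling in each 2-day bin.
binned = s.resample('2D').mean()
print(binned.tolist())     # [1.5, 3.5, 5.5]

# asfreq: just the value found at each 2-day step.
selected = s.asfreq('2D')
print(selected.tolist())   # [1, 3, 5]
```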

Time-shifts

In [31]:

fig, ax = plt.subplots(3, sharey=True)

# apply a frequency to the data
goog = goog.asfreq('D', method='pad')

goog.plot(ax=ax[0])
goog.shift(900).plot(ax=ax[1])
# tshift() was removed in pandas 2.0; shift(..., freq=...) shifts the index instead
goog.shift(900, freq='D').plot(ax=ax[2])

# legends and annotations
local_max = pd.to_datetime('2007-11-05')
offset = pd.Timedelta(900, 'D')

ax[0].legend(['input'], loc=2)
ax[0].get_xticklabels()[2].set(weight='heavy', color='red')
ax[0].axvline(local_max, alpha=0.3, color='red')

ax[1].legend(['shift(900)'], loc=2)
ax[1].get_xticklabels()[2].set(weight='heavy', color='red')
ax[1].axvline(local_max + offset, alpha=0.3, color='red')

ax[2].legend(['tshift(900)'], loc=2)
ax[2].get_xticklabels()[1].set(weight='heavy', color='red')
ax[2].axvline(local_max + offset, alpha=0.3, color='red');
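
Shifting the data and shifting the index, the two operations compared in the cell above, can be told apart on a toy series. A minimal sketch with a hypothetical three-point series (names and values are illustrative): `shift(n)` moves the values against a fixed index, while `shift(n, freq=...)` moves the index and keeps the values, which is the behaviour `tshift` provided.

```python
import pandas as pd

s = pd.Series([10, 20, 30],
              index=pd.date_range('2015-07-01', periods=3, freq='D'))

# Same index; values move one step later and the first slot becomes NaN.
shifted_values = s.shift(1)
print(shifted_values.tolist())            # [nan, 10.0, 20.0]

# Same values; every index label moves forward by one day.
shifted_index = s.shift(1, freq='D')
print(shifted_index.index[0])             # 2015-07-02 00:00:00
```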

In [32]:

# shift(..., freq='D') replaces tshift(), which was removed in pandas 2.0
ROI = 100 * (goog.shift(-365, freq='D') / goog - 1)
ROI.plot()
plt.ylabel('% Return on Investment');

Rolling windows

In [33]:

rolling = goog.rolling(365, center=True)

data = pd.DataFrame({'input': goog,
                     'one-year rolling_mean': rolling.mean(),
                     'one-year rolling_std': rolling.std()})
ax = data.plot(style=['-', '--', ':'])
ax.lines[0].set_alpha(0.3)

Where to learn more

See the "Time Series/Date" section of the Pandas documentation.

Example: visualizing Seattle bicycle counts

In [34]:

# !curl -o FremontBridge.csv https://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD

In [35]:

data = pd.read_csv('FremontBridge.csv', index_col='Date', parse_dates=True)
data.head()

Out[35]:

	Fremont Bridge West Sidewalk	Fremont Bridge East Sidewalk
Date
2012-10-03 00:00:00	4.0	9.0
2012-10-03 01:00:00	4.0	6.0
2012-10-03 02:00:00	1.0	1.0
2012-10-03 03:00:00	2.0	3.0
2012-10-03 04:00:00	6.0	1.0

In [36]:

data.columns = ['West', 'East']
data['Total'] = data.eval('West + East')

In [37]:

data.dropna().describe()

Out[37]:

	West	East	Total
count	35752.000000	35752.000000	35752.000000
mean	61.470267	54.410774	115.881042
std	82.588484	77.659796	145.392385
min	0.000000	0.000000	0.000000
25%	8.000000	7.000000	16.000000
50%	33.000000	28.000000	65.000000
75%	79.000000	67.000000	151.000000
max	825.000000	717.000000	1186.000000

Visualizing the data

In [38]:

%matplotlib inline
import seaborn; seaborn.set()

In [39]:

data.plot()
plt.ylabel('Hourly Bicycle Count');

In [40]:

weekly = data.resample('W').sum()
weekly.plot(style=[':', '--', '-'])
plt.ylabel('Weekly bicycle count');

In [41]:

daily = data.resample('D').sum()
daily.rolling(30, center=True).sum().plot(style=[':', '--', '-'])
plt.ylabel('30-day rolling bicycle count');

In [42]:

daily.rolling(50, center=True,
              win_type='gaussian').sum(std=10).plot(style=[':', '--', '-']);

Digging into the data

In [43]:

by_time = data.groupby(data.index.time).mean()
hourly_ticks = 4 * 60 * 60 * np.arange(6)
by_time.plot(xticks=hourly_ticks, style=[':', '--', '-']);

In [44]:

by_weekday = data.groupby(data.index.dayofweek).mean()
by_weekday.index = ['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun']
by_weekday.plot(style=[':', '--', '-']);

In [45]:

weekend = np.where(data.index.weekday < 5, 'Weekday', 'Weekend')
by_time = data.groupby([weekend, data.index.time]).mean()

In [46]:

import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 2, figsize=(14, 5))
# .ix was removed in pandas 1.0; .loc selects the outer index level by label
by_time.loc['Weekday'].plot(ax=ax[0], title='Weekdays',
                            xticks=hourly_ticks, style=[':', '--', '-'])
by_time.loc['Weekend'].plot(ax=ax[1], title='Weekends',
                            xticks=hourly_ticks, style=[':', '--', '-']);