Understanding Time Series Cross-Validation: A Guide to Enhancing Prediction in Sequential Data
Introduction
Standard cross-validation techniques commonly used in machine learning fall short when applied to time series data. The reason is the sequential nature of such data: observations depend on what came before them, so the order of data points matters. Random splits, which are typical in conventional cross-validation, break these temporal connections and let the model "see the future" during training. Simply stated, the timing is crucial.
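To make the contrast concrete, here is a minimal sketch using scikit-learn's built-in KFold and TimeSeriesSplit splitters (they are not used in the rest of this article, which builds its own splitters instead): a shuffled KFold scatters test indices across the whole timeline, while TimeSeriesSplit always keeps every training index before every test index.

import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

toy = np.arange(12).reshape(-1, 1)  # twelve observations in time order

# Shuffled KFold: test indices are scattered across the timeline
for train_idx, test_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(toy):
    print("KFold test indices:        ", test_idx)

# TimeSeriesSplit: every training index comes before every test index
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(toy):
    print("TimeSeriesSplit train/test:", train_idx, test_idx)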
1. Time Series Cross-Validation
Having established the significance of Time Series Cross-Validation (tsCV) in the previous section, we will now explore a practical example to illustrate how to implement tsCV in a real-world context.
To start, we will generate a synthetic dataset using Python's pandas and numpy libraries. This dataset will feature a single predictor variable, X, and a target variable, y, which are linearly correlated. This straightforward example allows us to concentrate on the mechanics of tsCV without the added intricacies of real-world data.
In our initial examination of Time Series Cross-Validation (tsCV), we will begin with a simple approach that does not include gaps between the training and testing sets. This strategy is especially useful for grasping the basic principles of tsCV and serves as a foundation for more advanced methods, such as introducing gaps, which will be discussed later.
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

# Creating a dataset where X and y are linearly correlated
np.random.seed(42)
size = 100
X_values = np.linspace(0, 10, size)
y_values = 3 * X_values + np.random.normal(0, 2, size)

# Constructing a DataFrame
data = pd.DataFrame({'X': X_values, 'y': y_values})

# Function for time series cross-validation without gaps
def time_series_cv(data, n_splits):
n_samples = len(data)
fold_size = n_samples // n_splits
for i in range(n_splits):
test_start = i * fold_size
test_end = test_start + fold_size if i < n_splits - 1 else n_samples
train = data[:test_start]
test = data[test_start:test_end]
yield train, test
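Before fitting anything, it can help to see what this generator actually yields. The following quick check is a sketch added purely for illustration; it prints the size of each training window and the index range of each test window, making both the expanding training set and the empty first training set visible:

for fold, (train_index, test_index) in enumerate(time_series_cv(data.index, 5), start=1):
    print(f"Fold {fold}: train size = {len(train_index)}, test indices {test_index.min()} to {test_index.max()}")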
# Linear regression model
model = LinearRegression()

# DataFrame for storing results
cv_results_df = pd.DataFrame(columns=['X', 'y', 'y_pred'])
n_splits = 5

# Applying time series cross-validation
for train_index, test_index in time_series_cv(data.index, n_splits):
X_train, X_test = data.loc[train_index, 'X'].values.reshape(-1, 1), data.loc[test_index, 'X'].values.reshape(-1, 1)
y_train, y_test = data.loc[train_index, 'y'].values, data.loc[test_index, 'y'].values
# Fitting the model and predicting
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Storing results in DataFrame
fold_results = pd.DataFrame({
'X': X_test.squeeze(),
'y': y_test,
'y_pred': y_pred
})
cv_results_df = pd.concat([cv_results_df, fold_results], ignore_index=True)
cv_results_df.head()
Note: The code above raises a ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required by LinearRegression. The error occurs because the training set in the first fold is empty: the first test block starts at the very beginning of the series, so there is no earlier data left to train on. You can fix this by adjusting the cross-validation function so that every fold keeps at least one observation for training.
# Adjusted function for time series cross-validation without gaps
def time_series_cv_adjusted(data, n_splits):
n_samples = len(data)
fold_size = n_samples // n_splits
for i in range(n_splits):
test_start = i * fold_size
if test_start == 0: # Ensure at least one sample in train set
    continue
test_end = test_start + fold_size if i < n_splits - 1 else n_samples
train = data[:test_start]
test = data[test_start:test_end]
yield train, test
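With the adjusted generator in place, the fitting loop needs to be run again so that cv_results_df actually holds predictions before it is printed and plotted. The snippet below is a sketch of that re-run; it simply repeats the earlier loop with time_series_cv_adjusted swapped in and the results DataFrame reset first:

cv_results_df = pd.DataFrame(columns=['X', 'y', 'y_pred'])

for train_index, test_index in time_series_cv_adjusted(data.index, n_splits):
    X_train, X_test = data.loc[train_index, 'X'].values.reshape(-1, 1), data.loc[test_index, 'X'].values.reshape(-1, 1)
    y_train, y_test = data.loc[train_index, 'y'].values, data.loc[test_index, 'y'].values
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    fold_results = pd.DataFrame({'X': X_test.squeeze(), 'y': y_test, 'y_pred': y_pred})
    cv_results_df = pd.concat([cv_results_df, fold_results], ignore_index=True)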
print(cv_results_df)
Next, we proceed with plotting the actual vs. predicted values:
import matplotlib.pyplot as plt
# Plotting X against y and y_pred
plt.figure(figsize=(12, 6))
plt.plot(cv_results_df['X'], cv_results_df['y'], label='Actual y', color='blue', marker='o')
plt.plot(cv_results_df['X'], cv_results_df['y_pred'], label='Predicted y', color='red', linestyle='--')
plt.title('Actual vs Predicted y')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.grid(True)
plt.show()
Next, we will visualize the cross-validation process through a series of plots:
fig, axes = plt.subplots(n_splits, 1, figsize=(12, 2 * n_splits))
for i, (train_index, test_index) in enumerate(time_series_cv_adjusted(data.index, n_splits)):
X_train, X_test = data.loc[train_index, 'X'].values.reshape(-1, 1), data.loc[test_index, 'X'].values.reshape(-1, 1)
y_train, y_test = data.loc[train_index, 'y'].values, data.loc[test_index, 'y'].values
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
ax = axes[i]
ax.plot(data['X'], data['y'], label='Full Data', color='grey', alpha=0.3)
ax.scatter(X_train, y_train, color='blue', label='Train Data')
ax.scatter(X_test, y_test, color='green', label='Test Data')
ax.plot(X_test, y_pred, color='red', label='Predicted on Test', linestyle='--')
ax.set_title(f"Fold {i + 1}")
ax.legend()
plt.tight_layout()
plt.show()
2. Time Series Cross-Validation with Gaps
In practice you may not want the test window to begin immediately after the last training observation: leaving a gap of a few time steps between them mimics the delay between training a model and actually using its forecasts, and it reduces leakage from observations adjacent to the test period. The splitter below extends the adjusted function with a gap parameter: each test block starts gap observations later, and those skipped observations belong to neither set, so the training data always ends gap steps before the test data begins. A short usage sketch follows the definition.
# Adjusted function for time series cross-validation with a gap
def time_series_cv_with_gap_adjusted(data, n_splits, gap=0):
n_samples = len(data)
fold_size = n_samples // n_splits
for i in range(n_splits):
test_start = i * fold_size + gap
if test_start >= n_samples: # Skip if test_start is beyond the data length
    continue
test_end = test_start + fold_size if i < n_splits - 1 else n_samples
train = data[:max(1, test_start - gap)] # Ensure at least one sample in train
test = data[test_start:test_end]
yield train, test
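To see the effect of the gap, this function can be plugged into the same kind of loop as before. The short sketch below uses a gap of 5 observations, an arbitrary value chosen purely for illustration, and prints the boundaries of each fold:

for fold, (train_index, test_index) in enumerate(time_series_cv_with_gap_adjusted(data.index, n_splits, gap=5), start=1):
    print(f"Fold {fold}: train ends at index {train_index.max()}, test covers indices {test_index.min()} to {test_index.max()}")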
3. Time Series Cross-Validation with Lagged Features
In our continued exploration of Time Series Cross-Validation (tsCV), we now examine an advanced approach that enhances the model’s capacity to capture temporal dynamics: incorporating lagged features. This technique is particularly advantageous in situations where past observations can predict future outcomes, a common trait in many time series datasets.
Lagged features represent previous time steps of a variable, serving as additional predictors in the model. For instance, when predicting a daily stock price, the price from the previous day (lag 1), two days ago (lag 2), etc., can be valuable indicators for today’s price.
Integrating these lagged features enables the model to identify patterns and dependencies over time, enhancing prediction accuracy, especially in time series data exhibiting autocorrelation.
def create_lagged_features(df, n_lags=1):
"""
Create lagged features for a time series data.
Parameters:
df (pd.DataFrame): Original DataFrame with 'y' column.
n_lags (int): Number of lagged features to create.
Returns:
pd.DataFrame: DataFrame with lagged features.
"""
for lag in range(1, n_lags + 1):
    df[f'y_lag_{lag}'] = df['y'].shift(lag)
return df
# Adding lagged features to the dataset
n_lags = 3  # Number of lagged features
data_with_lags = create_lagged_features(data.copy(), n_lags)
# Dropping rows with NaN values that were created due to lagging data_with_lags.dropna(inplace=True)
data_with_lags.head() # Displaying the first few rows with lagged features
4. Applying Time Series Cross-Validation to the Lagged Dataset
We will now define a new time series cross-validation function that utilizes the DataFrame with lagged features.
def time_series_cv_with_lags(data, n_splits, gap=0):
n_samples = len(data)
fold_size = n_samples // n_splits
for i in range(n_splits):
test_start = i * fold_size + gap
if test_start >= n_samples: # Skip if test_start is beyond the data length
    continue
test_end = test_start + fold_size if i < n_splits - 1 else n_samples
train = data.iloc[:max(1, test_start - gap)] # Ensure at least one sample in train
test = data.iloc[test_start:test_end]
yield train, test
# Applying time series cross-validation with the new dataset
cv_results_lags_df = pd.DataFrame(columns=['X', 'y', 'y_pred'])

feature_cols = ['X', 'y_lag_1', 'y_lag_2', 'y_lag_3']
target_col = 'y'
gap = 2  # the gap was never set earlier; 2 is an illustrative choice

for train, test in time_series_cv_with_lags(data_with_lags, n_splits, gap):
X_train = train[feature_cols]
y_train = train[target_col]
X_test = test[feature_cols]
y_test = test[target_col]
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Append results to DataFrame
test_with_pred = test.copy()
test_with_pred['y_pred'] = y_pred
cv_results_lags_df = pd.concat([cv_results_lags_df, test_with_pred[['X', 'y', 'y_pred']]], ignore_index=True)
cv_results_lags_df.head() # Displaying the first few rows of the results DataFrame
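The walkthrough stops at collecting predictions, but a natural next step is to score them. The following sketch, which assumes both results DataFrames have been populated as shown above, uses scikit-learn's mean_squared_error to compute an error for the plain and the lagged-feature runs:

from sklearn.metrics import mean_squared_error

mse_plain = mean_squared_error(cv_results_df['y'].astype(float), cv_results_df['y_pred'].astype(float))
mse_lagged = mean_squared_error(cv_results_lags_df['y'].astype(float), cv_results_lags_df['y_pred'].astype(float))
print(f"MSE without lagged features: {mse_plain:.3f}")
print(f"MSE with lagged features:    {mse_lagged:.3f}")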
5. Conclusion
This article walked through several implementations of a time series cross-validation function: a basic expanding-window splitter, an adjusted version that guarantees a non-empty training set, a variant that leaves a gap between the training and test windows, and a version applied to a dataset enriched with lagged features. I hope this serves as a useful introduction to the topic. Feel free to leave comments or feedback if you found it helpful. Thank you!