Important Rules of Time Series Forecasting in Python

Forecasting is one of the most common tasks in applied data science, but it is also one of the easiest to misunderstand. In my experience in finance and economics, we produced forecasts all the time.

At first glance, time series forecasting appears to be a simple prediction problem. We observe a sequence of values over time, estimate a model, and use that model to predict future values. In practice, the problem is more subtle. A forecast is not merely a machine learning output. It is a conditional statement about the future, based on the information available at a specific point in time.

This distinction matters.

When we forecast next month’s revenue, tomorrow’s electricity demand, next week’s sales, or future inflation, we are not only asking for a numerical estimate. We are asking whether historical regularities are sufficiently stable to support inference about an unobserved future period. That is an empirical question, not just a programming exercise.

So my objective here is to show how to structure a forecasting problem correctly, avoid common methodological errors, and build a defensible baseline machine learning model.

A time series is a sequence of observations indexed by time. Formally, we may observe a variable ( y_t ) at time ( t ), where ( t = 1, 2, …, T ).

A forecasting problem asks us to estimate a future value such as ( y_{T+1} ), ( y_{T+h} ), or an entire future path.

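A useful way to think about a forecast is:

( \hat{y}_{T+h \mid T} = E[ y_{T+h} \mid I_T ] ), where ( I_T ) denotes the information set available at time ( T ).

In plain language, this means: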

The forecast for period ( T+h ), made at time ( T ), should only use information available at time ( T ).

This information set (the last part) is central to scientific forecasting. It includes past observed values, known calendar variables, and any external predictors that would genuinely be known when the forecast is produced.

It does not include future sales, future prices, future demand shocks, or target-derived features accidentally calculated using future observations.

This is where many forecasting projects fail. The model appears accurate during evaluation because the feature engineering pipeline has allowed information from the future to enter the training data. This is called data leakage. In time series forecasting, leakage is especially dangerous because it often produces deceptively strong results.

The first rule is therefore simple:

A forecast must be evaluated under the same information constraints that would exist in production.

In real applications, time series data often comes from databases, APIs, ERP systems, financial terminals, transaction logs, or public statistical agencies. The source depends on the domain and on the purpose of the analysis.

For this article, we will use a synthetic daily sales series so that the code is fully reproducible. If you want to see how to parse real datasets, feel free to check out my other articles (e.g. https://medium.com/dev-genius/parsing-ecb-exchange-rate-data-in-r-a-comprehensive-guide-a6b1d4f78429).

The artificial series will include trend, weekly seasonality, annual seasonality, and random noise. This is not intended to represent a complete economic data-generating process. It is a controlled example that allows us to demonstrate the mechanics of forecasting.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(seed=1)

dates = pd.date_range(start="2021-01-01", periods=2000, freq="D")
t = np.arange(len(dates))

trend = 0.06 * t
weekly_pattern = 12 * np.sin(2 * np.pi * dates.dayofweek / 7)
annual_pattern = 20 * np.sin(2 * np.pi * dates.dayofyear / 365.25)
noise = rng.normal(loc=0, scale=6, size=len(dates))

sales = 200 + trend + weekly_pattern + annual_pattern + noise

df = pd.DataFrame({
    "date": dates,
    "sales": sales
}).set_index("date")

df.head()
```

This code creates a synthetic daily sales dataset from 2021 onward by combining a linear upward trend, weekly seasonality, annual seasonality, and random noise, then stores it as a date-indexed pandas DataFrame.

Now plot the series to see what it looks like.

```python
plt.figure(figsize=(12, 5))
plt.plot(df.index, df["sales"])
plt.title("Synthetic Daily Sales Series")
plt.xlabel("Date")
plt.ylabel("Sales")
plt.tight_layout()
plt.show()
```

Before estimating any model, we should inspect the time series visually. ALWAYS. This is not cosmetic; this is not a step you skip when you have no time; this is a must-do exercise. It is part of the empirical workflow.

A time series plot helps us identify trends, seasonality, volatility changes, extreme values, discontinuities, and possible structural breaks. These characteristics influence both the choice of model and the interpretation of the forecast.

In economic and business data, stability should never be assumed casually. Consumer behavior changes. Prices change. Regulations change. Supply chains break. Competitors enter and exit. Macroeconomic regimes shift. A forecasting model is only as reliable as the assumption that the historical patterns it has learned remain relevant for the forecast horizon.

You need to understand what you analyse.

Now let's talk a little about machine learning. People with strong machine learning skills often come to time series forecasting and carry over habits that do not hold up in the time series world.

In standard supervised learning, it is common to randomly split the dataset into training and test sets. For time series forecasting, that is usually inappropriate.

A random split destroys the temporal order of the data. Observations from the future may appear in the training set, while earlier observations appear in the test set. The resulting evaluation does not represent a real forecasting situation.
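As a minimal sketch of the problem, the snippet below compares a random split with a chronological one on a hypothetical 100-day index (the names `toy_dates`, `rng_split`, and the 80/20 proportions are illustrative assumptions, not part of the article's dataset). It simply checks whether any training date falls after the earliest test date.

```python
import numpy as np
import pandas as pd

# Hypothetical toy index: 100 consecutive days (not the article's dataset).
toy_dates = pd.date_range("2024-01-01", periods=100, freq="D")

# Random split: shuffle the positions, keep 80 for training and 20 for testing.
rng_split = np.random.default_rng(0)
shuffled = rng_split.permutation(len(toy_dates))
random_train_dates = toy_dates[shuffled[:80]]
random_test_dates = toy_dates[shuffled[80:]]

# Chronological split: first 80 days for training, last 20 days for testing.
chrono_train_dates = toy_dates[:80]
chrono_test_dates = toy_dates[80:]

# Does any training observation lie after the earliest test observation?
random_leaks = random_train_dates.max() > random_test_dates.min()
chrono_leaks = chrono_train_dates.max() > chrono_test_dates.min()

print("Random split trains on the future:", random_leaks)
print("Chronological split trains on the future:", chrono_leaks)
```

With a shuffled split, training dates almost always extend past the start of the test set, which is exactly the situation a real forecaster never faces.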

In a real forecasting problem, we train on the past and predict the future. Our validation design must reflect that.

Let us reserve the final 90 days as a test period.

```python
test_size = 90

train = df.iloc[:-test_size].copy()
test = df.iloc[-test_size:].copy()

print("Training period:")
print(train.index.min(), "to", train.index.max())

print("Test period:")
print(test.index.min(), "to", test.index.max())
```

This chronological split is more realistic. The model is estimated using historical observations and evaluated on later observations that were not available during training.

The test period should correspond to the practical forecast horizon. If the business problem requires a 30-day forecast, evaluate a 30-day horizon. If the decision requires quarterly planning, evaluate quarterly performance. A model that performs well one day ahead may not perform well 90 days ahead.

A forecasting model should not be judged in isolation. It should be compared against a simple benchmark.

This is particularly important in applied work because many sophisticated models add little value relative to simple rules. For seasonal business data, a naive seasonal forecast can be surprisingly difficult to beat.

A weekly seasonal naive forecast assumes that the future value will resemble the value from the same day in the previous week. I recommend always including some reasonable naive forecast as a benchmark for your main model.

For daily sales, this is a reasonable benchmark because weekdays and weekends often differ systematically.

```python
last_week = train["sales"].iloc[-7:].values
repeats = int(np.ceil(len(test) / 7))

test["seasonal_naive"] = np.tile(last_week, repeats)[:len(test)]
```

Now define evaluation metrics.

```python
def evaluate_forecast(y_true, y_pred):
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100

    return {
        "MAE": mae,
        "RMSE": rmse,
        "MAPE": mape
    }

baseline_metrics = evaluate_forecast(
    y_true=test["sales"],
    y_pred=test["seasonal_naive"]
)

baseline_metrics
```

Each metric has a different interpretation.

Mean absolute error, or MAE, measures the average absolute forecast error in the same units as the target variable. Root mean squared error, or RMSE, penalizes large errors more strongly. Mean absolute percentage error, or MAPE, expresses error as a percentage, although it can behave poorly when actual values are close to zero.
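To make these differences concrete, here is a small worked example with made-up numbers (the arrays below are illustrative assumptions, not results from the sales series). The first case shows how a single near-zero actual value inflates MAPE; the second shows how one large error raises RMSE while MAE stays the same.

```python
import numpy as np

# Hypothetical actuals and forecasts, chosen only to illustrate the metrics.
actual = np.array([100.0, 200.0, 2.0])
forecast = np.array([110.0, 190.0, 12.0])

errors = actual - forecast              # every forecast misses by exactly 10 units
mae = np.mean(np.abs(errors))           # 10.0
rmse = np.sqrt(np.mean(errors ** 2))    # 10.0, equal to MAE because all misses have the same size
ape = np.abs(errors / actual) * 100     # 10%, 5%, and 500% for the near-zero actual
mape = np.mean(ape)                     # about 171.7%, dominated by the last observation

# Same MAE, but concentrated in one large miss: RMSE reacts, MAE does not.
uneven_errors = np.array([0.0, 0.0, 30.0])
uneven_mae = np.mean(np.abs(uneven_errors))        # 10.0
uneven_rmse = np.sqrt(np.mean(uneven_errors ** 2)) # about 17.3

print(mae, rmse, round(mape, 1), uneven_mae, round(uneven_rmse, 1))
```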

There is no universally correct metric. The appropriate metric depends on the decision context. In inventory management, underforecasting may be more costly than overforecasting. In energy markets, large errors during peak demand may matter more than average error. In finance, directional performance, volatility, and tail risk may be more relevant than point accuracy.

A scientific forecasting workflow makes the loss function explicit.
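One way to make the loss function explicit is to write it down as code. The sketch below is an assumption for illustration only: the function `asymmetric_cost` and its per-unit costs (`under_cost`, `over_cost`) are hypothetical and would need to come from the actual business problem.

```python
import numpy as np

def asymmetric_cost(y_true, y_pred, under_cost=3.0, over_cost=1.0):
    """Average cost when underforecasting and overforecasting are priced differently.

    under_cost and over_cost are hypothetical per-unit costs.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    shortfall = np.clip(y_true - y_pred, 0, None)  # units by which the forecast was too low
    surplus = np.clip(y_pred - y_true, 0, None)    # units by which the forecast was too high
    return float(np.mean(under_cost * shortfall + over_cost * surplus))

demand = [100, 100, 100, 100]
low_forecast = [90, 90, 90, 90]        # always 10 units too low
high_forecast = [110, 110, 110, 110]   # always 10 units too high

cost_low = asymmetric_cost(demand, low_forecast)    # 30.0
cost_high = asymmetric_cost(demand, high_forecast)  # 10.0
print(cost_low, cost_high)
```

Both forecasts have the same MAE of 10, yet under this assumed cost structure the low forecast is three times as expensive.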

```python
plt.figure(figsize=(12, 5))
plt.plot(train.index[-120:], train["sales"].iloc[-120:], label="Training data")
plt.plot(test.index, test["sales"], label="Actual test values")
plt.plot(test.index, test["seasonal_naive"], label="Seasonal naive forecast")
plt.title("Weekly Seasonal Naive Forecast")
plt.xlabel("Date")
plt.ylabel("Sales")
plt.legend()
plt.tight_layout()
plt.show()
```

Numerical metrics are necessary, but they are not sufficient. A forecast plot can reveal systematic errors that a single metric hides.

For example, a forecast may have an acceptable average error while repeatedly missing peaks. It may track the level but fail to capture turning points. It may perform well during normal periods but fail during volatile periods.

Applied forecasting is not simply about minimizing an error metric. It is about producing forecasts that are useful for decisions under uncertainty.

Most machine learning algorithms require a tabular structure. They expect feature columns ( X ) and a target variable ( y ). A time series must therefore be transformed into a supervised learning problem.

For forecasting, common features include lagged values of the target, rolling-window statistics, and calendar variables.

The critical constraint is that features must be available at the time of prediction.

For example, a rolling average must be shifted before it is used as a predictor. If we calculate a rolling average that includes the current target value and then use it to predict that same target value, we have introduced leakage.
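The difference is easy to see on a tiny example. The sketch below uses a hypothetical five-day series (the `toy` frame and its values are assumptions chosen so the arithmetic is obvious) and computes a 3-day rolling mean with and without the shift.

```python
import pandas as pd

# Hypothetical toy series; the values are chosen so the leak is easy to see.
toy = pd.DataFrame(
    {"sales": [10.0, 20.0, 30.0, 40.0, 50.0]},
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

# Leaky: the window ending on day t includes the target for day t itself.
toy["leaky_mean_3"] = toy["sales"].rolling(window=3).mean()

# Safe: shift first, so the window for day t covers days t-3 to t-1 only.
toy["safe_mean_3"] = toy["sales"].shift(1).rolling(window=3).mean()

print(toy)
```

On the last day, where sales are 50, the leaky feature equals 40 (the mean of 30, 40, and 50), while the safe feature equals 30 (the mean of 20, 30, and 40). Only the second could have been computed before the day's sales were known.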

The following function creates calendar, lag, and rolling-window features while respecting the timing structure.

```python
def create_features(data):
    data = data.copy()

    # Calendar features: known in advance for any future date
    data["dayofweek"] = data.index.dayofweek
    data["month"] = data.index.month
    data["quarter"] = data.index.quarter
    data["dayofyear"] = data.index.dayofyear

    # Lag features: past values of the target only
    data["lag_1"] = data["sales"].shift(1)
    data["lag_7"] = data["sales"].shift(7)
    data["lag_14"] = data["sales"].shift(14)
    data["lag_28"] = data["sales"].shift(28)

    # Rolling features: shift(1) first so each window ends at the previous day
    data["rolling_mean_7"] = data["sales"].shift(1).rolling(window=7).mean()
    data["rolling_mean_14"] = data["sales"].shift(1).rolling(window=14).mean()
    data["rolling_std_7"] = data["sales"].shift(1).rolling(window=7).std()

    return data
```

Apply the transformation.

```python
feature_df = create_features(df)
feature_df = feature_df.dropna()

feature_df.head()
```

The missing values at the beginning are expected because lag and rolling features require historical observations. In this example, dropping those rows is acceptable.

In real projects, missing values require closer attention. They may represent holidays, system failures, delayed reporting, stockouts, non-trading days, or genuine zeros. Treating all missing observations as equivalent is often an empirical mistake.

We now split the feature-enhanced dataset into training and test periods.

```python
train_features = feature_df.iloc[:-test_size].copy()
test_features = feature_df.iloc[-test_size:].copy()

feature_cols = [
    "dayofweek",
    "month",
    "quarter",
    "dayofyear",
    "lag_1",
    "lag_7",
    "lag_14",
    "lag_28",
    "rolling_mean_7",
    "rolling_mean_14",
    "rolling_std_7"
]

X_train = train_features[feature_cols]
y_train = train_features["sales"]

X_test = test_features[feature_cols]
y_test = test_features["sales"]
```

For demonstration, we will use a random forest regressor. A random forest is not inherently a time series model. It does not understand temporal ordering by itself. However, once lagged and calendar features are constructed, it can learn nonlinear relationships between recent history, seasonal effects, and the target variable.

```python
model = RandomForestRegressor(
    n_estimators=300,
    max_depth=10,
    random_state=1,
    n_jobs=-1
)

model.fit(X_train, y_train)

test_features["rf_forecast"] = model.predict(X_test)
```

Now evaluate the model.

```python
rf_metrics = evaluate_forecast(
    y_true=test_features["sales"],
    y_pred=test_features["rf_forecast"]
)

rf_metrics
```

Compare the model with the benchmark.

```python
results = pd.DataFrame([
    {"model": "Seasonal naive", **baseline_metrics},
    {"model": "Random forest", **rf_metrics}
])

results
```

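We see that the random forest beats the seasonal naive forecast we created earlier. This comparison is more important than the algorithm itself. Keep in mind that it is not strictly like for like: the random forest's lag features are built from actual observed values throughout the test period, so it is effectively forecasting one step ahead, whereas the seasonal naive forecast repeats the last training week across all 90 days.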

A complex forecasting model has to justify its complexity. If it cannot outperform a transparent seasonal benchmark, it may not be adding empirical value. In many organizations, the best forecasting system is not the most advanced system. It is the system that is accurate enough, stable enough, interpretable enough, and maintainable enough for repeated use.

```python
plt.figure(figsize=(12, 5))
plt.plot(train.index[-120:], train["sales"].iloc[-120:], label="Training data")
plt.plot(test_features.index, test_features["sales"], label="Actual test values")
plt.plot(test_features.index, test_features["rf_forecast"], label="Random forest forecast")
plt.title("Random Forest Forecast")
plt.xlabel("Date")
plt.ylabel("Sales")
plt.legend()
plt.tight_layout()
plt.show()
```

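A useful forecast should be inspected from multiple angles. Ask whether it tracks the level of the series, whether it captures peaks and turning points, and whether it holds up during volatile periods as well as normal ones.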

A model can have a better average error and still be unsuitable for operational use. For example, if it performs poorly during high-volume periods, the business cost may be large even if the average metric looks acceptable.

Tree-based models provide a simple measure of feature importance. This is not a complete explanation of the model, but it is a useful diagnostic.

```python
importance = pd.DataFrame({
    "feature": feature_cols,
    "importance": model.feature_importances_
}).sort_values("importance", ascending=False)

importance
```

```python
plt.figure(figsize=(10, 5))
plt.barh(importance["feature"], importance["importance"])
plt.gca().invert_yaxis()
plt.title("Random Forest Feature Importance")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.tight_layout()
plt.show()
```

Feature importance should be interpreted carefully. It can be affected by correlated predictors, model structure, and the scale of the problem. Still, it can help identify whether the model is relying on lagged values, calendar variables, or rolling statistics.

As the chart above shows, the random forest relies overwhelmingly on recent weekly patterns. The lag_7 feature alone accounts for most of the model’s predictive signal, followed by lag_14. This is consistent with a daily sales series in which the same weekday in previous weeks is highly informative. Rolling averages add some value by smoothing recent demand, but calendar variables such as month, quarter, and day of week contribute very little once lagged sales values are included. In other words, the model is learning seasonality mainly through historical sales behavior rather than through explicit date-based features.

If a suspicious feature dominates the model, investigate it. In forecasting, unusually strong predictors are often worth auditing because they may contain hidden leakage.

The example above resembles a one-step-ahead forecasting setup where recent observed values are available for feature construction.

Multi-step forecasting is more difficult.

Suppose we are standing at time ( T ) and want to forecast the next 30 days. For ( T+1 ), we know the lagged values from the historical data. But for ( T+20 ), some lagged features would depend on values between ( T+1 ) and ( T+19 ), which have not yet been observed.

There are several ways to handle this.

The first is recursive forecasting. The model predicts the next step, then uses that prediction as an input for the following step. This is practical but can accumulate errors.
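A minimal sketch of the recursive loop is shown below. It is self-contained and deliberately simple: the toy series, the two-lag feature set, and the use of `LinearRegression` are all assumptions made for illustration, not the random forest pipeline used earlier.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy series with a weekly cycle (not the article's dataset).
rng_rec = np.random.default_rng(0)
n_obs = 200
series = 50 + 5 * np.sin(2 * np.pi * np.arange(n_obs) / 7) + rng_rec.normal(0, 1, n_obs)

lags = [1, 7]            # assumed lag set, kept small for readability
max_lag = max(lags)

# Supervised layout: one column per lag, aligned with the target at time t.
X_rec = np.column_stack([series[max_lag - lag : n_obs - lag] for lag in lags])
y_rec = series[max_lag:]
rec_model = LinearRegression().fit(X_rec, y_rec)

# Recursive loop: each prediction is appended to the history and
# becomes a lag input for later steps.
horizon = 14
history = list(series)
recursive_forecast = []
for _ in range(horizon):
    features = np.array([[history[-lag] for lag in lags]])
    next_value = float(rec_model.predict(features)[0])
    recursive_forecast.append(next_value)
    history.append(next_value)

print(np.round(recursive_forecast, 2))
```

From the second step onward, lag_1 is a prediction rather than an observation, and from the eighth step onward lag_7 is too. That is where the error accumulation comes from.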

The second is direct forecasting. Separate models are trained for different horizons. For example, one model predicts day ( T+1 ), another predicts day ( T+7 ), and another predicts day ( T+30 ). This can reduce recursive error propagation but requires more models.
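The direct strategy can be sketched as one model per horizon, each trained to map the same observed history to a different future target. The helper `make_direct_dataset`, the toy series, and the choice of `LinearRegression` with seven lags are hypothetical and serve only to illustrate the layout.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy series with a weekly cycle (not the article's dataset).
rng_dir = np.random.default_rng(0)
n_obs = 300
series = 50 + 5 * np.sin(2 * np.pi * np.arange(n_obs) / 7) + rng_dir.normal(0, 1, n_obs)

n_lags = 7               # assumed: the last 7 observations are the features
horizons = [1, 7, 30]    # the horizons named in the text

def make_direct_dataset(values, n_lags, h):
    """Rows: the n_lags values up to time t. Target: the value at time t + h."""
    X, y = [], []
    for t in range(n_lags - 1, len(values) - h):
        X.append(values[t - n_lags + 1 : t + 1])
        y.append(values[t + h])
    return np.array(X), np.array(y)

# One separately trained model per horizon.
direct_models = {}
for h in horizons:
    X_h, y_h = make_direct_dataset(series, n_lags, h)
    direct_models[h] = LinearRegression().fit(X_h, y_h)

# Forecast from the end of the sample: every horizon uses observed values only.
last_window = series[-n_lags:].reshape(1, -1)
direct_forecast = {h: float(m.predict(last_window)[0]) for h, m in direct_models.items()}
print(direct_forecast)
```

No prediction is ever fed back as an input here, at the price of training and maintaining one model for each horizon.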

The third is multi-output forecasting, where a model predicts several future horizons at once.
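As a rough sketch of the multi-output idea, the target can be arranged as a matrix with one column per future step. The example below assumes a toy series, 14 lags, and a 7-day path; it relies on the fact that scikit-learn's `LinearRegression` accepts a two-dimensional target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy series with a weekly cycle (not the article's dataset).
rng_mo = np.random.default_rng(0)
n_obs = 300
series = 50 + 5 * np.sin(2 * np.pi * np.arange(n_obs) / 7) + rng_mo.normal(0, 1, n_obs)

n_lags = 14     # assumed history length used as features
horizon = 7     # assumed number of future steps predicted at once

X_rows, Y_rows = [], []
for t in range(n_lags - 1, n_obs - horizon):
    X_rows.append(series[t - n_lags + 1 : t + 1])     # history up to time t
    Y_rows.append(series[t + 1 : t + 1 + horizon])    # the next `horizon` values
X_mo = np.array(X_rows)
Y_mo = np.array(Y_rows)

# A single fit call with a 2D target: one output column per future step.
mo_model = LinearRegression().fit(X_mo, Y_mo)
path_forecast = mo_model.predict(series[-n_lags:].reshape(1, -1))[0]
print(np.round(path_forecast, 2))
```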

The correct choice depends on the decision problem. Retail replenishment, macroeconomic forecasting, traffic prediction, and financial risk monitoring do not necessarily require the same horizon design.

A serious forecasting project should specify the forecast horizon before model training begins.

A single train-test split is informative, but it may not be sufficient. The selected test period may be unusually stable, unusually volatile, or seasonally unrepresentative.

Time series cross-validation provides a more rigorous evaluation by simulating multiple historical forecasting exercises.

The logic is simple: train on an initial historical window, test on a later window, then move forward through time and repeat the process.

```python
from sklearn.model_selection import TimeSeriesSplit

X = feature_df[feature_cols]
y = feature_df["sales"]

tscv = TimeSeriesSplit(n_splits=5)

cv_results = []

for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    X_train_cv = X.iloc[train_idx]
    X_test_cv = X.iloc[test_idx]
    y_train_cv = y.iloc[train_idx]
    y_test_cv = y.iloc[test_idx]

    cv_model = RandomForestRegressor(
        n_estimators=300,
        max_depth=10,
        random_state=42,
        n_jobs=-1
    )

    cv_model.fit(X_train_cv, y_train_cv)
    preds = cv_model.predict(X_test_cv)

    mae = mean_absolute_error(y_test_cv, preds)
    rmse = np.sqrt(mean_squared_error(y_test_cv, preds))

    cv_results.append({
        "fold": fold,
        "MAE": mae,
        "RMSE": rmse
    })

cv_results = pd.DataFrame(cv_results)

cv_results
```

Then summarize the results.

```python
cv_results[["MAE", "RMSE"]].mean()
```

This approach is closer to the actual forecasting problem because the model is repeatedly evaluated on future periods relative to each training sample.
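To see what "future periods relative to each training sample" means in terms of row positions, the sketch below prints the fold boundaries that `TimeSeriesSplit` produces on a hypothetical sample of 12 ordered observations (the array `toy_X` and the choice of three splits are illustrative assumptions).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical toy sample of 12 time-ordered observations.
toy_X = np.arange(12).reshape(-1, 1)

fold_bounds = []
for fold_train_idx, fold_test_idx in TimeSeriesSplit(n_splits=3).split(toy_X):
    # Store (last training position, first test position, last test position).
    fold_bounds.append(
        (int(fold_train_idx.max()), int(fold_test_idx.min()), int(fold_test_idx.max()))
    )
    print(
        "train: 0 to", fold_train_idx.max(),
        "| test:", fold_test_idx.min(), "to", fold_test_idx.max(),
    )
```

In every fold the training window ends before the test window begins, and the training window grows as the folds move forward through time.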

But what can we say about these results? Our cross-validation results are fairly stable across the five folds. The average MAE is about 8.77, meaning the model is typically off by roughly 9 sales units. The average RMSE is about 11.25, indicating that larger errors are present but not extreme. More importantly, the fold-level errors do not vary dramatically, which suggests that the model’s forecasting performance is reasonably consistent across different historical test periods.

In applied economics and business analytics, this is similar in spirit to pseudo-out-of-sample evaluation. We are not trying to fit the historical data well. Our aim is to evaluate whether **the model would have produced useful forecasts at earlier points in time.** This is the key.

It is important to distinguish forecasting from causal analysis.

A forecasting model asks:

Given the available information, what value is likely to occur?

A causal model asks:

What would happen if one variable were changed while other relevant conditions were held constant?

These are different questions.

A variable can be useful for prediction without having a causal interpretation. For example, calendar variables may help forecast demand, but “month” is not a policy lever. Similarly, lagged sales may be predictive because they summarize persistence, seasonality, or omitted factors, not because yesterday’s sales mechanically cause today’s sales.

This distinction matters when forecasts are used for decision-making. If the goal is to predict demand, a forecasting model may be sufficient. If the goal is to estimate the effect of a price change, advertising campaign, interest rate shock, or policy intervention, a causal research design is required.

A good applied economist does not confuse predictive usefulness with causal identification.

Time series models depend on historical regularities. When those regularities change, forecast accuracy can deteriorate quickly.

This is especially relevant in economic and financial settings. Structural breaks may occur because of recessions, inflation shocks, regulatory changes, geopolitical events, technological adoption, supply disruptions, or changes in consumer preferences.

A model trained during a stable expansion may fail during a crisis. A demand forecast estimated before a price regime change may become biased after the change. A financial model trained on low-volatility periods may underestimate risk during market stress.

This does not mean forecasting is useless. It means forecasting should be treated as a monitored empirical system rather than a one-time modeling exercise.

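A production forecasting system should include ongoing monitoring of forecast errors, so that deterioration is detected when historical regularities change.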

Of course, forecasts are conditional estimates, not guarantees, but this is what we do.
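As one possible illustration of monitoring, the sketch below tracks a rolling MAE over a log of forecast errors and raises a flag when it drifts far above a reference level. Everything here is an assumption made for the example: the simulated error log, the level shift after day 90, the 14-day window, and the "twice the baseline" alert rule.

```python
import numpy as np
import pandas as pd

rng_mon = np.random.default_rng(0)
monitor_dates = pd.date_range("2024-01-01", periods=120, freq="D")

# Hypothetical daily forecast errors: stable for 90 days, then a level shift.
simulated_errors = rng_mon.normal(0, 5, size=120)
simulated_errors[90:] += 20
error_log = pd.Series(simulated_errors, index=monitor_dates)

# Reference level from an early, stable period.
baseline_mae = error_log.iloc[:60].abs().mean()

# Rolling 14-day MAE, recomputed as new errors arrive.
rolling_mae = error_log.abs().rolling(window=14).mean()

# Assumed alert rule: rolling MAE more than twice the baseline.
alerts = rolling_mae > 2 * baseline_mae
first_alert = alerts[alerts].index.min()

print("Baseline MAE:", round(baseline_mae, 2))
print("First alert date:", first_alert)
```

In practice the threshold, window length, and response (retraining, review, fallback to a benchmark) would depend on the decision the forecast supports.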

Point forecasts are often overemphasized. A single predicted value can create the illusion of precision.

In many applied settings, uncertainty is more important than the point estimate. A retailer may need to know not only expected demand but also the risk of stockouts. A utility company may care about peak demand under adverse weather. A central bank may consider inflation scenarios rather than a single path.

A more complete forecast communicates a range of plausible outcomes.

Some forecasting models provide prediction intervals directly. Others require simulation, bootstrapping, quantile regression, Bayesian methods, or conformal prediction techniques. The appropriate method depends on the model class and the assumptions one is willing to make.

Even when a simple point forecast is used, it is worth reporting historical forecast errors. Decision-makers should understand the typical magnitude of uncertainty.
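A simple way to turn historical errors into a rough interval is to take empirical quantiles of past forecast errors and add them to the point forecast. The sketch below assumes simulated errors and a hypothetical point forecast of 250; it also assumes that future errors will resemble past ones, which is exactly the kind of stability that should be checked rather than taken for granted.

```python
import numpy as np

rng_pi = np.random.default_rng(0)

# Hypothetical past forecast errors (actual minus forecast) from a validation period.
past_errors = rng_pi.normal(loc=0.0, scale=8.0, size=500)

point_forecast = 250.0   # hypothetical point forecast for the next period

# Empirical 5th and 95th percentiles of past errors give a rough 90% band.
lower_q, upper_q = np.quantile(past_errors, [0.05, 0.95])
interval = (point_forecast + lower_q, point_forecast + upper_q)

# Share of past errors that fall inside the band (in-sample check only).
coverage = np.mean((past_errors >= lower_q) & (past_errors <= upper_q))

print("Approximate 90% interval:", np.round(interval, 1))
print("In-sample coverage:", round(float(coverage), 3))
```

This is not a substitute for model-based intervals or conformal methods, but it already tells a decision-maker how wide the typical miss has been.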

A forecast without an error distribution is incomplete.

The most common mistakes in time series forecasting are not syntax errors. They are design errors.

The first mistake is using random train-test splits. This breaks the temporal structure and often produces optimistic results.

The second mistake is leaking future information through feature engineering. Rolling means, scalers, imputers, target encoders, and external variables can all introduce leakage if they are not constructed according to the forecast date.
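Preprocessing steps leak in the same way as rolling features. The sketch below uses a hypothetical trending feature and scikit-learn's `StandardScaler` to contrast fitting on the full sample with fitting on the training period only; the 80/20 split is an assumption for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical trending feature: later values are larger than earlier ones.
feature = np.arange(100, dtype=float).reshape(-1, 1)
train_part, test_part = feature[:80], feature[80:]

# Leaky: the mean and variance are computed from train and test together.
leaky_scaler = StandardScaler().fit(feature)

# Safe: statistics come from the training period only,
# then the fitted scaler is applied to the test period.
safe_scaler = StandardScaler().fit(train_part)
test_scaled = safe_scaler.transform(test_part)

print("Mean seen by leaky scaler:", leaky_scaler.mean_[0])  # 49.5
print("Mean seen by safe scaler:", safe_scaler.mean_[0])    # 39.5
```

The leaky scaler already "knows" that the feature keeps rising in the test period, which is information that would not exist at the forecast date.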

The third mistake is skipping the benchmark. A machine learning model should be compared against simple alternatives such as naive, seasonal naive, or moving-average forecasts.

The fourth mistake is evaluating the wrong horizon. A model that predicts one day ahead is not automatically valid for a 30-day planning problem.

The fifth mistake is using one metric without considering the cost structure. Forecast errors are rarely symmetric in economic terms. Underforecasting and overforecasting often have different consequences.

The sixth mistake is treating the model as permanent. Forecasting models require monitoring because the data-generating environment can change.

The seventh mistake is confusing prediction with explanation. A variable that improves forecast accuracy does not necessarily identify a causal mechanism.

A disciplined forecasting workflow should follow a clear empirical sequence.

First, define the target variable and the decision problem. Forecasting “sales” is not precise enough. We need to know the horizon, frequency, unit of observation, and business use case.

Second, inspect the time series. Identify trend, seasonality, missing values, outliers, and unstable periods.

Third, construct a chronological validation strategy. The evaluation should reproduce the information constraints of the actual forecasting problem.

Fourth, build a simple benchmark. Without a benchmark, model performance has no meaningful reference point.

Fifth, create features using only information available at the time of prediction.

Sixth, train candidate models and compare them against the benchmark.

Seventh, evaluate performance across multiple periods, not only one convenient test window.

Eighth, inspect forecast errors visually and statistically.

Ninth, communicate uncertainty and limitations.

Tenth, monitor the model after deployment.

This workflow is slower than immediately fitting a model, but it produces more reliable empirical results.

Time series forecasting in Python is not mainly about choosing the most sophisticated algorithm. It is about designing a credible forecasting experiment.

The central question is not: “Which model has the best name?”

The central question is:

Given the information available at the forecast date, does this method produce more reliable predictions than a simple benchmark under realistic out-of-sample evaluation?

That question forces us to think scientifically. It requires chronological validation, careful feature construction, explicit benchmarks, appropriate error metrics, and humility about uncertainty.

But the scientific value does not come from your Python code alone. It comes from the discipline of the workflow.

A useful forecast is not just a number produced by a model. It is an empirical claim about the future, made under uncertainty, constrained by the information available today, and tested against evidence from the past.

That is what makes time series forecasting both technically interesting and economically important.

Important Rules of Time Series Forecasting in Python was originally published in Dev Genius on Medium, where people are continuing the conversation by highlighting and responding to this story.
