
Mastering Time Series Forecasting with PyTorch: From LSTM to Transformers

Learn how to build accurate, production-ready time series forecasting models using PyTorch, LSTM, and Temporal Fusion Transformers.


Let’s talk about time. More specifically, let’s talk about predicting the future using data from the past. This is the core of time series forecasting, and it’s something I’ve been thinking about a lot lately. Why? Because whether it’s planning energy grids, managing supply chains, or anticipating market trends, accurate forecasts create stability and opportunity. My goal here is to share a clear, practical guide for building robust forecasting models with PyTorch. Stick with me, and by the end, you’ll have the tools to turn historical data into reliable predictions. If you find this helpful, I’d be grateful if you’d like, share, or comment with your own experiences.

Why start with this now? The shift from statistical models to deep learning has opened new doors. Models can now learn complex patterns we might miss, handling messy, real-world data with multiple influences. But this power comes with new challenges. How do we prepare our data? Which architecture should we choose? And how do we move from a working notebook to a system that runs reliably in production? These are the questions we’ll answer together.

First, we need to understand our data. Time series isn’t just a list of numbers; it’s a sequence with a story. There’s often a trend—a general direction over years. There’s seasonality—repeating cycles like daily sales spikes or weekly energy use. And there’s always noise—random fluctuations. Capturing these elements is the first step to a good model.

So, how do we get our data ready for a neural network? We need a solid preprocessing pipeline. Think about scaling values to a common range, handling missing points, and creating informative features. The date itself is a treasure trove of information. Is it a Monday or a weekend? What month or quarter is it? Converting these into numerical features helps the model learn calendar-based patterns. We can also add lagged values—yesterday’s sales figure is often a strong hint for today’s.

But what does this look like in code? Let’s build a simple preprocessing class. This example handles the scaling; the calendar and lag features come in the usage sketch that follows it.

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

class BasicTimeSeriesPreprocessor:
    def __init__(self):
        # One scaler for the target, plus a separate scaler per numeric feature.
        self.target_scaler = StandardScaler()
        self.feature_scalers = {}

    def fit(self, data, target_col):
        # Fit on training data only, so the same statistics are reused at inference time.
        self.target_scaler.fit(data[[target_col]])
        numeric_features = data.select_dtypes(include=[np.number]).columns.drop(target_col)
        for col in numeric_features:
            scaler = StandardScaler()
            scaler.fit(data[[col]])
            self.feature_scalers[col] = scaler
        return self

    def transform(self, data, target_col):
        # Apply the stored scalers; never refit on new data.
        data = data.copy()
        data[target_col] = self.target_scaler.transform(data[[target_col]])
        for col, scaler in self.feature_scalers.items():
            if col in data.columns:
                data[col] = scaler.transform(data[[col]])
        return data
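
To see it in action, here's a minimal usage sketch. The data frame and the sales column are hypothetical stand-ins, and this is where the calendar and lag features mentioned above come in:

# Hypothetical example: daily sales data with a DatetimeIndex.
dates = pd.date_range("2022-01-01", periods=365, freq="D")
df = pd.DataFrame({"sales": np.random.rand(365) * 100}, index=dates)

# Calendar features: let the model see where it is in the week and the year.
df["day_of_week"] = df.index.dayofweek
df["month"] = df.index.month

# Lagged values: yesterday and the same day last week are strong hints.
df["sales_lag_1"] = df["sales"].shift(1)
df["sales_lag_7"] = df["sales"].shift(7)
df = df.dropna()

preprocessor = BasicTimeSeriesPreprocessor().fit(df, target_col="sales")
df_scaled = preprocessor.transform(df, target_col="sales")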

With clean data, we can build our first model. A great starting point is the Long Short-Term Memory network, or LSTM. It’s designed to remember information over long sequences, which makes it a natural fit for time series. But have you ever wondered why a simple LSTM sometimes struggles with very long-term dependencies? That’s a hint that we might need more advanced tools later.

An LSTM takes in a window of past data and predicts the next step, or several steps. The key is how we structure this data. We create overlapping windows from our time series. For example, if we use the past 30 days to predict the next 7, we slide this 30-day window across our dataset. PyTorch’s Dataset and DataLoader classes are perfect for this.
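
Below is a minimal sketch of that windowing logic as a PyTorch Dataset. The class name and the 30-in, 7-out window sizes are just illustrative choices:

import torch
from torch.utils.data import Dataset, DataLoader

class SlidingWindowDataset(Dataset):
    """Turns a 2D array of shape (time, features) into (input window, target window) pairs."""
    def __init__(self, series, input_len=30, horizon=7, target_col=0):
        self.series = torch.as_tensor(series, dtype=torch.float32)
        self.input_len = input_len
        self.horizon = horizon
        self.target_col = target_col

    def __len__(self):
        return len(self.series) - self.input_len - self.horizon + 1

    def __getitem__(self, idx):
        # Past window with all features, future window with the target column only.
        x = self.series[idx : idx + self.input_len]
        y = self.series[idx + self.input_len : idx + self.input_len + self.horizon, self.target_col]
        return x, y

# Reusing df_scaled from the preprocessing sketch above (hypothetical data).
loader = DataLoader(SlidingWindowDataset(df_scaled.values), batch_size=32, shuffle=True)

Shuffling the windows is fine for training, because each window already carries its own temporal order inside it.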

Here’s a straightforward LSTM model in PyTorch. Notice how it returns both the final prediction and the hidden state, which lets you carry state forward when predicting across consecutive windows.

import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size, dropout=0.2):
        super(LSTMForecaster, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, 
                            batch_first=True, dropout=dropout if num_layers>1 else 0)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x, hidden=None):
        batch_size = x.size(0)
        if hidden is None:
            h0 = torch.zeros(self.num_layers, batch_size, self.hidden_size).to(x.device)
            c0 = torch.zeros(self.num_layers, batch_size, self.hidden_size).to(x.device)
            hidden = (h0, c0)
        out, hidden = self.lstm(x, hidden)
        out = self.fc(out[:, -1, :])  # Use the last time step's output
        return out, hidden

Training this model involves defining a loss function, like Mean Squared Error, and an optimizer. We feed it sequences, get predictions, and adjust the weights to minimize the difference between our predictions and the actual future values. But here’s a problem: what if we need to predict multiple things at once, like electricity demand and price? Or what if we need to understand why the model made a certain prediction? This is where we go beyond LSTMs.
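
Before we go there, here is roughly what that training loop looks like: a minimal sketch reusing the LSTMForecaster above and the loader from the windowing sketch, with placeholder hyperparameters rather than tuned values.

# input_size=5 matches the five columns in the earlier sketch; output_size=7 is the horizon.
model = LSTMForecaster(input_size=5, hidden_size=64, num_layers=2, output_size=7)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(20):
    model.train()
    epoch_loss = 0.0
    for x_batch, y_batch in loader:
        optimizer.zero_grad()
        preds, _ = model(x_batch)        # ignore the returned hidden state here
        loss = criterion(preds, y_batch)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    print(f"epoch {epoch + 1}: loss {epoch_loss / len(loader):.4f}")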

Enter the Temporal Fusion Transformer. The TFT is a newer architecture that excels at multi-horizon forecasting and, crucially, provides insights into its own decisions. It uses attention mechanisms to learn which past time steps are most important for the future prediction. This interpretability is a game-changer for business use, where explaining a forecast is as important as its accuracy.

Imagine you’re predicting sales. The TFT can tell you that last year’s holiday spike and a recent marketing campaign were the two main drivers of its forecast for next month. How many traditional models can offer that level of clarity? The TFT structure handles static metadata (like a store ID), known future inputs (like a planned holiday), and observed past inputs all together.
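
One way to picture that split is simply as three groups of tensors. The names and shapes below are purely illustrative, not any particular library's API:

# Purely illustrative grouping of TFT-style inputs; not a specific library's API.
batch = {
    "static": torch.tensor([[3]]),            # e.g. a store ID, constant over time
    "known_future": torch.zeros(1, 37, 2),    # e.g. holiday flags and planned promotions,
                                              # known for the 30 past and 7 future steps
    "observed_past": torch.randn(1, 30, 4),   # e.g. past sales, price, weather: history only
}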

Building a full TFT from scratch is complex, but we can look at one of its core components: the multi-head attention block. This allows the model to focus on different parts of the input sequence simultaneously.

class MultiHeadAttentionBlock(nn.Module):
    def __init__(self, d_model, num_heads, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.attention = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, query, key, value):
        attn_output, attn_weights = self.attention(query, key, value)
        output = self.norm(query + self.dropout(attn_output))
        return output, attn_weights  # We return weights for interpretation
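
As a quick sanity check, you can push a dummy batch through the block and inspect the attention weights; each row of the weight matrix shows how strongly one time step attends to every other step:

block = MultiHeadAttentionBlock(d_model=32, num_heads=4)
x = torch.randn(8, 30, 32)        # batch of 8 sequences, 30 time steps, 32 features
out, weights = block(x, x, x)     # self-attention: query, key, and value are the same
print(out.shape, weights.shape)   # torch.Size([8, 30, 32]) torch.Size([8, 30, 30])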

The real strength comes from combining this attention with other mechanisms to handle different data types and temporal patterns. So, you might start with an LSTM for a simpler, faster project, and graduate to a TFT when you need robust, interpretable forecasts on complex data. But a model is only as good as its deployment. How do we take this from a research script to a production system?

Production readiness means thinking about consistency and reliability. Your preprocessing steps must be saved and applied identically to new data. Model versioning is essential—you need to know which model made which prediction. Finally, consider quantile regression. Instead of predicting a single future value, predict a range. This gives you a measure of confidence, like forecasting that sales will be between 100 and 120 units with 90% probability. This is invaluable for risk-aware planning.
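
Quantile regression needs only a small change to the loss function. Here is a minimal pinball-loss sketch, assuming the model emits one output per requested quantile:

def quantile_loss(preds, target, quantiles=(0.1, 0.5, 0.9)):
    """Pinball loss: preds has shape (batch, horizon, n_quantiles), target (batch, horizon)."""
    losses = []
    for i, q in enumerate(quantiles):
        errors = target - preds[..., i]
        losses.append(torch.max(q * errors, (q - 1) * errors))
    return torch.mean(torch.stack(losses))

Train with this instead of MSE, and the 0.1 and 0.9 outputs bracket the forecast, giving you exactly that "between 100 and 120 units" kind of statement.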

The journey from raw time-stamped data to a trustworthy forecast is detailed, but each step is manageable. Start simple, understand your data’s rhythm, choose an appropriate model, and always plan for how it will be used in the real world. I’ve walked you through the core ideas and code to begin. What’s the first time series problem you’ll solve with these tools?

I hope this guide has made the path to production-ready time series forecasting clearer. The blend of LSTM reliability and TFT sophistication gives you powerful options. If this exploration of time and prediction was useful, please consider liking or sharing this article. I’m also keen to hear your thoughts or questions in the comments—what challenges have you faced in your forecasting projects? Let’s learn from each other.





