Building load prediction is a solved problem in the academic literature. You can find gradient boosting models, LSTM architectures, and physics-informed neural networks applied to building energy forecasting going back more than a decade. The published accuracy numbers look impressive. What the papers rarely tell you is how those models behave when you actually try to deploy them in a real building, with real sensor gaps, with a facility manager who doesn't share your assumptions about data quality.
We started building Voltpathio's prediction pipeline with reasonable confidence that the hard part was behind us — the ML community had essentially solved this. Two years in, we think the hard part is almost entirely in the feature engineering and data reliability layers, not the model architecture. Here is what we actually learned.
Weather Normalization Is the Foundation, Not the Afterthought
The single biggest predictor of building electricity consumption is dry-bulb temperature. That's well known. What we underestimated initially was how much the relationship between temperature and load varies, not just by building type but by building orientation, equipment vintage, and local microclimate effects. A model trained on weather station data from an airport 20 miles away can have systematically higher RMSE than one trained on a grid-point National Weather Service (NWS) forecast for the exact facility location. The model architecture isn't wrong in that case; the input data has a persistent geographic bias.
Beyond dry-bulb, wet-bulb temperature and dew point add meaningful prediction power for cooling-dominated buildings — particularly in humid climates where latent load (the energy required to dehumidify air) can rival sensible load (the energy to change air temperature). A model that uses only dry-bulb will systematically underpredict load on humid days and overpredict it on hot-dry days. Solar irradiance matters for buildings with significant glazing. Wind speed matters for older buildings with poor envelope performance.
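As a rough illustration of the humidity inputs described above, dew point can be derived from dry-bulb temperature and relative humidity with the Magnus approximation. This is a minimal sketch, not a description of any particular pipeline: the function names and the feature dictionary are hypothetical, and the coefficients are one commonly used Magnus parameter set for temperatures in degrees Celsius.

```python
import math

# One commonly used Magnus parameter set (water vapor over liquid water, deg C).
MAGNUS_B = 17.62
MAGNUS_C = 243.12


def dew_point_c(dry_bulb_c: float, rel_humidity_pct: float) -> float:
    """Approximate dew point (deg C) from dry-bulb temperature and relative humidity."""
    if not 0.0 < rel_humidity_pct <= 100.0:
        raise ValueError("relative humidity must be in (0, 100]")
    gamma = math.log(rel_humidity_pct / 100.0) + MAGNUS_B * dry_bulb_c / (MAGNUS_C + dry_bulb_c)
    return MAGNUS_C * gamma / (MAGNUS_B - gamma)


def humidity_features(dry_bulb_c: float, rel_humidity_pct: float) -> dict:
    """Hypothetical feature row: dry-bulb plus two humidity-aware inputs."""
    td = dew_point_c(dry_bulb_c, rel_humidity_pct)
    return {
        "dry_bulb_c": dry_bulb_c,
        "dew_point_c": td,
        # Small depression means humid air, i.e. more latent load at the same dry-bulb.
        "dew_point_depression_c": dry_bulb_c - td,
    }
```

Two days with the same dry-bulb reading but different humidity produce different feature rows here, which is exactly the distinction a dry-bulb-only model cannot see.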
The counterintuitive finding from our feature importance analysis: getting weather inputs right matters more than model complexity. A well-engineered linear model with properly processed weather inputs outperforms a poorly configured gradient boosting model with raw station data. We spent the first several months chasing architecture improvements when we should have been spending that time on weather data sourcing and preprocessing.
Occupancy Signal: High Value, Hard to Get Right
After weather, occupancy is the next most important predictor. The gap between occupied and unoccupied periods explains 15–30% of load variance in commercial buildings after weather normalization. Getting this feature right requires solving a data collection problem that most buildings haven't fully addressed.
The easiest occupancy signal is a calendar — business hours, known holidays, scheduled events. This is often available through integration with facilities scheduling systems or even a simple structured data feed. Calendar-based occupancy works reasonably well for the average case but fails at the tails: the Friday afternoon when everyone leaves two hours early, the holiday week when 20% of the building is occupied anyway, the building-wide event that drives double-normal occupancy for three days.
Badge reader data, where available and where privacy constraints permit aggregated use, is significantly better — it gives you actual building population at 15–30 minute resolution rather than a binary occupied/unoccupied estimate. We've seen prediction RMSE improvements of 8–12% from switching from calendar-only occupancy signals to actual badge count aggregates on commercial office buildings.
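To make the badge-count idea concrete, here is a minimal sketch of turning anonymous entry and exit events into a headcount series at 15-minute resolution. The event format (a timestamp plus +1 for entry or -1 for exit), the function name, and the assumption that the building is empty at the start of the window are all hypothetical choices for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def occupancy_by_interval(events, start, end, minutes=15):
    """Return [(interval_start, headcount)] from (timestamp, delta) badge events.

    delta is +1 for an entry and -1 for an exit. headcount is the estimated
    population at the end of each interval. Assumes an empty building at `start`.
    """
    step = timedelta(minutes=minutes)
    net = Counter()
    for ts, delta in events:
        if start <= ts < end:
            net[(ts - start) // step] += delta  # net entries per interval index

    series = []
    headcount, t, i = 0, start, 0
    while t < end:
        # Floor at zero: unbadged exits or tailgating can push the raw sum negative.
        headcount = max(0, headcount + net[i])
        series.append((t, headcount))
        t += step
        i += 1
    return series
```

Only aggregate counts leave this function, which fits the constraint that badge data be used in aggregated form.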
Plug load sensing — aggregate power consumption on lighting and small equipment circuits — provides an indirect occupancy signal that's often more accessible than badge data. When the desk lamps and monitors are drawing power, people are present. This is a useful proxy for buildings where direct occupancy data isn't available.
One thing we learned: don't try to infer occupancy from the same load data you're trying to predict. That circularity causes problems: you end up with a model that's encoding load patterns rather than explaining them, which hurts generalization badly when operational patterns change.
Equipment Telemetry: Where the Model Gets Personalized
The third feature layer — equipment telemetry — is what makes a load prediction model specific to a particular building rather than a generic approximation. Two buildings with identical weather conditions, identical square footage, and identical occupancy profiles can have very different load curves because their equipment mix, operating conditions, and maintenance state differ.
The most impactful telemetry features for our models have been: chiller staging state (how many chillers are running and at what load ratio), AHU supply air temperature and flow rates, and compressor cycling frequency. These signals tell the model what the building's thermal systems are actually doing, not just what the weather and occupancy inputs suggest they should be doing.
Equipment telemetry also captures maintenance effects that pure load modeling misses. A chiller whose COP has degraded 15% due to fouled heat exchanger surfaces will draw more electricity for the same cooling output — that shows up in the load data, but without knowing it's the chiller causing it, the model can't generalize the pattern correctly. With chiller performance data, the degradation can be isolated as a feature and the predictions remain accurate even as equipment ages.
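The arithmetic behind the degradation example is worth spelling out, because electrical input scales with the inverse of COP: a 15% drop in COP raises electricity use for the same cooling output by roughly 17.6%, not 15%. The sketch below uses illustrative numbers (1,000 kW of cooling, a baseline COP of 5.0) that are assumptions, not measurements from any building.

```python
def chiller_power_kw(cooling_load_kw: float, cop: float) -> float:
    """Electrical input needed to deliver a given cooling output at a given COP."""
    if cop <= 0:
        raise ValueError("COP must be positive")
    return cooling_load_kw / cop


def extra_power_fraction(cop_degradation: float) -> float:
    """Fractional increase in electrical input when COP falls by `cop_degradation`."""
    if not 0.0 <= cop_degradation < 1.0:
        raise ValueError("degradation must be in [0, 1)")
    return 1.0 / (1.0 - cop_degradation) - 1.0


# Illustrative values only: 1,000 kW of cooling at a baseline COP of 5.0.
baseline_kw = chiller_power_kw(1000.0, 5.0)
degraded_kw = chiller_power_kw(1000.0, 5.0 * (1.0 - 0.15))
```

With these assumed numbers the chiller goes from 200 kW to about 235 kW of electrical input for the same cooling delivered.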
The challenge with telemetry features is data reliability. HVAC sensors fail, BACnet connections drop, meter points get relabeled after equipment replacements. A model that becomes dependent on a sensor that then goes offline loses meaningful prediction accuracy at exactly the moment you need it most — during a high-load event when accurate forecasting matters. Our solution is to train models with explicit sensor availability conditioning: the model learns to degrade gracefully when inputs are missing, falling back on weather-plus-occupancy predictions rather than producing invalid outputs.
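One common way to get this kind of graceful degradation is to give the model an explicit availability flag for each telemetry input and to blank sensors at random during training so it sees outages before they happen in production. The sketch below illustrates that general technique under stated assumptions; the function names, the neutral fill value, and the dropout probability are hypothetical and not taken from the pipeline described here.

```python
import random


def with_availability_flags(row, telemetry_keys, fill_value=0.0):
    """Replace missing telemetry with a neutral fill and add a 0/1 flag per sensor."""
    out = dict(row)
    for key in telemetry_keys:
        present = row.get(key) is not None
        out[key] = row[key] if present else fill_value
        out[f"{key}_available"] = 1.0 if present else 0.0
    return out


def simulate_dropouts(rows, telemetry_keys, drop_prob=0.1, seed=0):
    """Training-time augmentation: randomly blank telemetry so the model sees outages."""
    rng = random.Random(seed)
    augmented = []
    for row in rows:
        masked = dict(row)
        for key in telemetry_keys:
            if rng.random() < drop_prob:
                masked[key] = None  # pretend this sensor was offline
        augmented.append(with_availability_flags(masked, telemetry_keys))
    return augmented
```

Weather and occupancy columns pass through untouched, so when every telemetry flag is zero the model is effectively left with the weather-plus-occupancy inputs.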
The Model Architecture Decision
After extensive testing, we landed on an ensemble approach: a gradient boosting model (XGBoost) handles the baseline pattern recognition, and a shallow LSTM handles the short-range temporal autocorrelation — specifically, the fact that current load depends meaningfully on load in the preceding 1–3 hours. The two model outputs are combined with a meta-learner that weights them based on forecast horizon and current input completeness.
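A very simple stand-in for such a meta-learner is a single least-squares blend weight fitted separately for each bucket of forecast horizon and input completeness. This is a sketch of that simplified idea only, not the meta-learner described above: the bucket labels, function names, and the choice of a closed-form weight clipped to [0, 1] are assumptions made for illustration.

```python
def fit_blend_weight(y_true, pred_gbm, pred_lstm):
    """Least-squares weight w on the LSTM in w*lstm + (1 - w)*gbm, clipped to [0, 1]."""
    num = sum((y - g) * (s - g) for y, g, s in zip(y_true, pred_gbm, pred_lstm))
    den = sum((s - g) ** 2 for g, s in zip(pred_gbm, pred_lstm))
    if den == 0.0:
        return 0.5  # the two models agree everywhere, so the weight is irrelevant
    return min(1.0, max(0.0, num / den))


def fit_weights_by_bucket(records):
    """records: iterable of (bucket, y_true, pred_gbm, pred_lstm) from held-out data."""
    grouped = {}
    for bucket, y, gbm, lstm in records:
        ys, gs, ss = grouped.setdefault(bucket, ([], [], []))
        ys.append(y)
        gs.append(gbm)
        ss.append(lstm)
    return {bucket: fit_blend_weight(*cols) for bucket, cols in grouped.items()}


def blend(pred_gbm, pred_lstm, weight):
    """Combine the two model outputs using the weight fitted for the current bucket."""
    return weight * pred_lstm + (1.0 - weight) * pred_gbm
```

A bucket key might look like ("0-6h", "telemetry_ok"). Fitting the weights on held-out data rather than on the base models' training data keeps the blend from simply rewarding whichever model overfits more.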
We're not saying this is the optimal architecture for every building type. For some buildings — particularly those with very stable operating patterns and good data quality — a well-tuned gradient boosting model alone does nearly as well and is significantly easier to maintain and debug. The LSTM adds most of its value in buildings with more volatile load patterns, where knowing recent load trajectory matters for predicting the next 2–6 hours accurately.
What we're fairly confident about is that model sophistication has diminishing returns past a certain threshold of data quality. If you're working with 15-minute interval data that has 10% missing values and weather inputs from the wrong geographic location, moving from XGBoost to a Transformer architecture buys you almost nothing. Fixing the data issues first is where the improvement lives.
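Before comparing architectures, it helps to measure how much of the interval data is actually missing. The following is a minimal sketch of such a check for fixed-interval meter readings; the function name and the convention of counting a slot as present if at least one reading falls inside it are assumptions.

```python
from datetime import datetime, timedelta


def missing_fraction(timestamps, start, end, minutes=15):
    """Fraction of expected interval slots in [start, end) that have no reading."""
    step = timedelta(minutes=minutes)
    expected = (end - start) // step
    if expected <= 0:
        raise ValueError("end must be at least one interval after start")
    # A slot counts as present if any reading lands in it; duplicates collapse.
    seen = {(ts - start) // step for ts in timestamps if start <= ts < end}
    return 1.0 - len(seen) / expected
```

Running this per meter point and per month gives a quick picture of where data repair is likely to pay off more than model changes.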
Calibration Is More Important Than Accuracy for Operational Use
This is the lesson that surprised us most. In academic benchmarks, model evaluation is typically done with point accuracy metrics: RMSE, MAPE, CV(RMSE). These measure how close your predictions are to actuals on average. For operational use in a facility optimization context, calibration matters more — specifically, whether the model's uncertainty estimates reflect its actual prediction error distribution.
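For reference, the point metrics named above and a basic calibration check can be written in a few lines. This sketch uses the simple form of CV(RMSE), with n in the denominator and normalization by mean actual load; some guidelines use a degrees-of-freedom correction instead. The function names are illustrative.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error, in the units of the load data."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Actuals must be nonzero."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def cv_rmse(actual, predicted):
    """RMSE normalized by mean actual load, in percent."""
    return 100.0 * rmse(actual, predicted) / (sum(actual) / len(actual))


def interval_coverage(actual, lower, upper):
    """Fraction of actuals that fall inside their predicted [lower, upper] interval."""
    hits = sum(lo <= a <= hi for a, lo, hi in zip(actual, lower, upper))
    return hits / len(actual)
```

The first three functions say how close the point forecasts are. Only the last one says anything about calibration: a nominal 90% interval that covers 70% of actuals is overconfident regardless of how good the RMSE looks.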
A model that says "tomorrow's peak load will be 2,400 kW ± 50 kW" is more useful to a facility operator than one that says "tomorrow's peak load will be 2,400 kW" with no uncertainty bound, even if the second model has a slightly lower RMSE. The operator can match their pre-conditioning strategy to the uncertainty: with a narrow prediction interval, act aggressively; with a wide one, act conservatively.
We spent a significant chunk of development time on conformal prediction intervals that track calibrated uncertainty as a function of forecast horizon and input quality. A 48-hour forecast for a high-humidity summer day in an unfamiliar building should have wider uncertainty bounds than a 6-hour forecast for a clear spring morning in a well-characterized facility. Getting that right is harder than improving mean accuracy — but it's what makes the forecast actually useful for schedule optimization decisions rather than just impressive in a benchmark table.
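One way to make interval width depend on horizon and input quality is split conformal prediction applied separately within buckets. The sketch below shows that general technique under stated assumptions: the bucket labels, function names, and the use of absolute residuals as the conformity score are illustrative choices, and the coverage guarantee relies on calibration and future errors within a bucket being exchangeable.

```python
import math


def conformal_half_width(abs_residuals, alpha=0.1):
    """Split-conformal half-width from held-out absolute residuals |actual - predicted|."""
    n = len(abs_residuals)
    rank = math.ceil((n + 1) * (1.0 - alpha))  # finite-sample corrected quantile rank
    if rank > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(abs_residuals)[rank - 1]


def half_widths_by_bucket(calibration, alpha=0.1):
    """calibration: iterable of (bucket, actual, predicted) from a held-out period."""
    residuals = {}
    for bucket, actual, predicted in calibration:
        residuals.setdefault(bucket, []).append(abs(actual - predicted))
    return {bucket: conformal_half_width(r, alpha) for bucket, r in residuals.items()}


def interval(predicted, bucket, widths):
    """Symmetric prediction interval using the half-width calibrated for this bucket."""
    q = widths[bucket]
    return predicted - q, predicted + q
```

A bucket such as ("48h", "telemetry_missing") will naturally end up with a wider half-width than ("6h", "all_inputs") if its held-out errors are larger, and a bucket with too little calibration data returns an infinite width rather than a falsely narrow one.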