
Zero-Shot Prediction Error from a Pre-Trained Transformer Model as a Building Energy Diagnostic: A Hierarchical Framework Beyond Annual EUI

Prometheus Redaktion

Open Access Article

by Hyun-Ho Yang 1 and Jeong-Uk Kim 2,*

1 Department of Energy Grid, Sangmyung University, Jongno-gu, Seoul 03016, Republic of Korea
2 Department of Electrical Engineering, Sangmyung University, Jongno-gu, Seoul 03016, Republic of Korea
* Author to whom correspondence should be addressed.

Buildings 2026, 16(12), 2290; https://doi.org/10.3390/buildings16122290 (registering DOI)

Submission received: 6 April 2026 / Revised: 27 May 2026 / Accepted: 4 June 2026 / Published: 7 June 2026

Abstract

Current building energy benchmarking relies on annual Energy Use Intensity (EUI), which cannot detect temporal operational anomalies, such as after-hours equipment operation or irregular scheduling, that represent actionable efficiency opportunities. We demonstrate that 64.7% of ENERGY STAR-certifiable buildings exhibit temporal irregularities invisible to annual EUI. To capture these hidden patterns, we propose a hierarchical, three-level evaluation framework that pairs EUI with the zero-shot prediction error (CVRMSE) from a population-trained Transformer model (TransformerWithGaussian-L, pre-trained on 900,000 simulated buildings). Applied to 611 real buildings from the Building Data Genome Project 2 (9,247,992 observation–prediction pairs), we show that EUI and CVRMSE are near-orthogonal (r = −0.029), confirming they measure fundamentally distinct performance dimensions. The framework proceeds through three diagnostic levels: (L1) EUI × CVRMSE quadrant classification, (L2) decomposition of prediction error into inherent variability versus genuine atypicality (R² = 0.700), and (L3) NMBE directional analysis identifying over- versus under-consuming buildings.
Requiring only hourly metered energy and geographic coordinates, this framework enables temporal pattern diagnostics applicable to large building portfolios.

1. Introduction

Buildings account for approximately 30% of global final energy consumption and 26% of energy-related greenhouse gas emissions [1,2,3], making operational energy performance assessment a critical lever for decarbonization. Over the past two decades, building energy benchmarking systems, from the U.S. ENERGY STAR Portfolio Manager [4] to the EU Energy Performance of Buildings Directive (EPBD) [5] and ASHRAE Building EQ [6], have become standard tools for portfolio-level performance evaluation. These systems predominantly rely on annual aggregate metrics, most commonly Energy Use Intensity (EUI), to rank buildings against peer populations or regulatory thresholds.

Existing regulatory frameworks reinforce this annual-metric paradigm. ENERGY STAR [4], the EPBD [5], and ASHRAE Standard 100 all evaluate buildings based on annual aggregate energy use without assessing whether hourly or daily operational patterns are consistent with efficient operation, creating a systematic blind spot for temporal anomalies.

Despite the proliferation of benchmarking frameworks, a fundamental gap persists: current systems evaluate how much energy a building consumes over a year but remain structurally blind to how that energy is consumed across temporal dimensions. A building that offsets daytime efficiency gains with nighttime waste, or one that follows erratic scheduling patterns invisible in annual aggregates, receives the same score as a genuinely well-operated peer. This paper addresses this gap by proposing a hierarchical diagnostic framework that integrates annual efficiency assessment with temporal pattern evaluation using zero-shot prediction error from a population-trained forecasting model.

1.1. The Limitations of Annual Energy Benchmarking

The ENERGY STAR framework [4] compresses dynamic operational behavior into a single annual scalar and is therefore structurally incapable of detecting buildings that achieve low annual EUI through offsetting patterns or that appear inefficient despite consistent operations. Subsequent advances have partially addressed these gaps: EnergyStar++ [7] with gradient-boosted trees and SHAP interpretability, the Benchmark 8760 initiative [8] advocating hourly benchmarking, and Piscitelli et al.’s [9] multi-KPI framework on BDG-2. None, however, provides a single temporal pattern metric benchmarked against population norms.

1.2. The Two-Dimensional EUI × CV Framework and Its Limitations

Combining EUI with the Coefficient of Variation (CV = σ/μ) in a two-dimensional framework [10] introduces a temporal variability dimension. However, CV is self-referential: it measures variability relative to a building’s own statistics, providing no information about whether observed variability is typical or atypical relative to comparable buildings. A sports stadium with high CV due to event scheduling is indistinguishable from an office with anomalous HVAC cycling.

These limitations motivate replacing self-referenced CV with a population-referenced pattern consistency metric, one that measures deviation not from a building’s own history but from what buildings of its type, size, and climate context typically do. This is precisely what zero-shot prediction error from a population-trained model provides.

1.3. Foundation Models as Population-Referenced Pattern Benchmarks

We recognize a critical and previously unexploited property of zero-shot prediction error from such a population-trained model: when the model, trained to represent the joint distribution of building energy patterns across a diverse, statistically calibrated population, fails to predict a building’s consumption, that failure is proportional to the building’s deviation from population-representative operational norms. A building whose consumption pattern lies well within the learned manifold of typical buildings will have low CVRMSE; a building exhibiting genuinely unusual patterns will have high CVRMSE, not because the model is poorly calibrated, but because the building’s patterns are population-atypical.

1.4. Research Objectives and Contributions

Against this backdrop, this paper makes the following contributions:

Theoretical framing: We formally establish zero-shot CVRMSE from a population-trained forecasting model as a population-referenced pattern consistency metric that is theoretically and empirically distinct from both CV (self-referenced variability) and EUI (aggregate efficiency). We develop the conceptual justification for interpreting model prediction error as building-level diagnostic information rather than model accuracy information.

Empirical independence demonstration: We demonstrate across 611 real buildings (BDG-2) that EUI and CVRMSE are near-orthogonal (r = −0.029 for all buildings; r = −0.082, R² = 0.007 for 583 CBECS-mapped buildings), establishing their validity as independent diagnostic dimensions and confirming that neither is a proxy for the other.
Hierarchical diagnostic framework: We develop and validate a three-level hierarchical evaluation framework that progressively refines diagnosis from building-level classification to actionable, hourly-resolution intervention recommendations: (L1) EUI × CVRMSE quadrant classification, (L2) CVRMSE decomposition into CV-driven and genuinely atypical components, and (L3) NMBE directional analysis.

ENERGY STAR blind spot quantification: Benchmarking against CBECS 2018 [15] Table C14 median EUI thresholds, we empirically quantify that 64.7% of ENERGY STAR-certifiable buildings (EUI Score ≥ 75, n = 85) exhibit temporal operational irregularities undetectable by annual EUI alone. Conversely, a substantial proportion of buildings with above-median EUI exhibit consistent temporal patterns, indicating structural rather than operational inefficiency. We characterize four systematic types of evaluation reversal when CVRMSE is added to the assessment.

2. Related Work

2.1. Building Energy Benchmarking Systems: From Annual to Temporal

Building energy benchmarking has evolved from simple EUI league tables to regression-based systems that control for building characteristics. The ENERGY STAR Portfolio Manager [4] uses WLS regression on CBECS survey data to predict building-type-specific EUI given floor area, climate zone, operating hours, worker density, and plug load intensity, producing percentile scores adjusted for these confounders. The framework has achieved substantial policy uptake: over 240,000 U.S. commercial properties are benchmarked annually, and ENERGY STAR certification is mandatory for large buildings in many U.S. cities under building performance standards.

However, fundamental critiques of single-indicator benchmarking have accumulated. Bordass [16] argued that single indicators systematically mislead because they reduce multidimensional performance to an oversimplified scalar that obscures the distinct contributions of building systems, occupancy, and operations.
Scofield [17] demonstrated empirically that LEED certification does not reliably predict metered energy savings in practice, partly because annual EUI masks occupancy variability and operational dynamics. This critique applies equally to any single-metric annual benchmarking system, including ENERGY STAR. ASHRAE’s Building EQ [6] and the EU’s Smart Readiness Indicator [5] have moved toward multi-dimensional assessment, but these require extensive manual data collection that constrains deployment at scale.

Attempts to introduce temporal resolution into building performance assessment have followed several streams. The IMT’s Benchmark 8760 initiative [8] explicitly called for 8760 h benchmarking as necessary for capturing demand flexibility, grid interaction, and occupant comfort, but without prescribing a specific methodology. Granderson et al. [18] developed statistical methods for 15 min to hourly baseline modeling under ASHRAE Guideline 14 [19] for measurement and verification (M&V) applications, establishing CVRMSE and NMBE as standard accuracy metrics in that context. Our work repurposes these M&V metrics in a benchmarking context, inverting the interpretation: rather than measuring how accurately a model predicts a building (M&V view), we use the prediction error magnitude to characterize how atypical a building is relative to a population (benchmarking view).

2.3. Foundation Models for Building Energy Time Series

The BuildingsBench evaluation framework includes transfer learning benchmarks where pre-trained models are fine-tuned on target buildings with limited data [14], and broader applications of foundation models in energy systems, including anomaly detection via prediction residuals and occupancy estimation, are an active area of research. However, none of this work uses the zero-shot prediction error magnitude itself as a building performance benchmarking metric; this is the distinctive contribution of the present study.

2.4. Positioning and Distinction from FDD

Compared to ENERGY STAR (annual EUI, CBECS regression, requires area/hours/workers) [4], EnergyStar++ (annual EUI, full metadata) [7], Benchmark 8760 (hourly, no metric specified) [8], EUI × CV quadrant (annual + self-referential CV) [10], and Piscitelli et al. (hourly multi-KPI, peer-relative, expert design) [9], the proposed framework uniquely combines hourly resolution, a 900K-building population reference, minimal inputs (hourly load + lat/lon), and four-dimensional diagnostics (efficiency + pattern + cause + direction).

The proposed framework is distinct from fault detection and diagnosis (FDD) approaches [22,23], which diagnose specific equipment-level faults using detailed BMS subsystem data. The framework operates at the whole-building portfolio screening level using only aggregate meter data, identifying which buildings exhibit population-atypical temporal patterns to prioritize for detailed investigation. The two approaches are complementary.

3. Data

3.1. Evaluation Dataset: Building Data Genome Project 2 (BDG-2)

3.1.1. Dataset Overview

The Building Data Genome Project 2 (BDG-2) [24] is one of the largest publicly available collections of real building energy meter data, released in conjunction with the ASHRAE Great Energy Predictor III (GEPIII) competition. The full dataset contains 3053 meters from 1636 buildings across multiple sites in North America and Europe, covering electricity, chilled water, hot water, and steam meters. For this study, the following inclusion criteria are applied:

Electricity meters only: Electricity is the only energy carrier available for all buildings; other carriers have partial coverage and require load conversion assumptions.

Valid zero-shot predictions: Buildings where Box–Cox normalization succeeded (positive mean load) and the model produced no NaN or infinite prediction sequences.
Sufficient prediction timesteps: A minimum of 8000 valid hourly observation-prediction pairs per building, ensuring reliable CVRMSE and NMBE estimation (equivalent to approximately 333 days).

Floor area availability: Required for EUI computation (Level 1).

These criteria yield a final study dataset of 611 buildings with 9,247,992 total observation-prediction timestep pairs, spanning four geographic sites (Bear, Fox, Rat, Panther) across North American climate zones 2C through 6A.

3.1.2. Study Dataset Characteristics

The dataset is dominated by education buildings (41.6%), reflecting the composition of the BDG-2 sites. All four sites are in North America, providing climate diversity (marine, humid subtropical, humid continental, subarctic) without requiring cross-continental generalization claims. Floor area (100.0%) and building type (98.7%) are available for nearly all buildings. Operating hours, a required input for ENERGY STAR’s WLS regression, are entirely unavailable in BDG-2, motivating the CBECS population-referenced z-score approach for EUI scoring, which requires no building-specific metadata beyond lat/lon.

4. Methodology: Three-Level Hierarchical Evaluation Framework

4.1. Pre-Trained Model: BuildingsBench TransformerWithGaussian-L

4.1.1. Architecture

TransformerWithGaussian-L is a causal Transformer encoder [25] from the BuildingsBench model family [14] that outputs Gaussian predictive distributions over hourly building load. The architecture employs multi-head self-attention with causal masking, enabling it to process variable-length historical context sequences and produce 24-step-ahead probabilistic forecasts. Geographic context (latitude, longitude) is provided as static auxiliary features concatenated to the learned time series embeddings, enabling the model to implicitly condition on climate zone without explicit climate variable inputs.

4.1.2. Training Data: Buildings-900K

The model was pre-trained on Buildings-900K, a corpus of 900,000 one-year (8760 h) hourly load profiles generated using EnergyPlus building energy simulation software. The corpus is parameterized to reflect the statistical distribution of the U.S. commercial building stock as characterized by CBECS [26]:

Building types: A total of 11 commercial building archetypes (office, retail, warehouse, education, hotel, healthcare, food service, food sales, strip mall, religious worship, miscellaneous).

Climate zones: A total of 16 U.S. climate zones (ASHRAE 169-2013, spanning 2A through 8A).

Vintage: Representative construction vintages from pre-1980 through 2018, with HVAC system types and envelope properties calibrated accordingly.

Schedules: ASHRAE Standard 90.1 reference schedules for each building type, providing typical occupancy, lighting, and equipment load profiles.

CBECS sampling: Building geometry, floor area, number of floors, and system type assignments are drawn according to CBECS survey weights to ensure the 900K corpus represents the actual U.S. commercial stock distribution.

This training procedure means the model learns not individual building behaviors but the statistical manifold of energy patterns consistent with CBECS-characterized U.S. commercial building stock: the joint distribution over diurnal shape, seasonal variation, climate response, and magnitude-to-variability relationships. This is the theoretical foundation for interpreting CVRMSE as population-referenced atypicality.

4.1.3. Zero-Shot Inference Configuration

All predictions in this study are generated in strict zero-shot mode, with no fine-tuning on BDG-2 data whatsoever. Model configuration:

Input context length: 168 h (7 days of historical consumption).

Prediction horizon: 24 h.

Sliding window stride: 24 h.

Input features: (1) Energy consumption time series, (2) latitude and longitude as static auxiliary inputs.
Load normalization: Box–Cox transformation with λ estimated independently per building from its historical data, applied before model input and inverted for post-prediction metric computation.

Output: Gaussian predictive distribution (μ, σ²) at each prediction step; the mean prediction μ is used for CVRMSE and NMBE computation.

4.2. Metric Definitions

Energy Use Intensity (EUI): Annual electricity consumption divided by gross floor area (sqft), yielding kWh/sqft/year. Computed as (mean hourly load in kWh × 8760 h)/sqft. Annual totals are computed by summing available hourly meter readings; for buildings with fewer than 365 days of data, values are pro-rated assuming uniform seasonal distribution.

Data Consistency Requirement: Electricity-Only Evaluation

This framework evaluates building performance using Advanced Metering Infrastructure (AMI) electricity meter data, which is the most widely deployed high-frequency energy data source in commercial building portfolios. To ensure methodologically consistent evaluation, all EUI calculations and reference benchmarks use electricity consumption only, excluding natural gas, district heat, fuel oil, and other fuel sources.

Rationale for electricity-only evaluation:

AMI data availability: Smart electricity meters (AMI) provide hourly or sub-hourly resolution at scale. Gas metering is typically monthly or daily, and district energy/fuel oil are rarely sub-metered at all. A framework requiring multi-fuel hourly data would exclude more than 80% of buildings, which have only electricity AMI.

Foundation model training data: The BuildingsBench TransformerWithGaussian-L model was trained on electricity load profiles from the Buildings-900K dataset [14], which contains simulated hourly electricity consumption. The model has learned electricity consumption patterns, not total site energy patterns. Feeding total site EUI to an electricity-trained model introduces category mismatch.
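The windowing configuration of Section 4.1.3 and the EUI definition above can be sketched as follows. This is a minimal illustration, not the authors’ code: the function names are hypothetical, and the window alignment (first window starting at hour 0, incomplete trailing windows dropped) is an assumption, since the text specifies only the 168 h context, the 24 h horizon, the 24 h stride, and the EUI formula.

```python
from statistics import fmean

CONTEXT_H = 168        # input context length from the text (7 days)
HORIZON_H = 24         # prediction horizon from the text
STRIDE_H = 24          # sliding window stride from the text
HOURS_PER_YEAR = 8760


def sliding_windows(n_hours):
    """Return (context_slice, target_slice) pairs over a series of n_hours.

    Assumption: the first window starts at hour 0 and incomplete trailing
    windows are dropped; the source does not spell out the alignment.
    """
    windows = []
    start = 0
    while start + CONTEXT_H + HORIZON_H <= n_hours:
        context = slice(start, start + CONTEXT_H)
        target = slice(start + CONTEXT_H, start + CONTEXT_H + HORIZON_H)
        windows.append((context, target))
        start += STRIDE_H
    return windows


def annual_eui(hourly_kwh, floor_area_sqft):
    """EUI in kWh/sqft/year: (mean hourly load in kWh x 8760 h) / sqft.

    Scaling the mean to 8760 h pro-rates records shorter than a full year.
    """
    return fmean(hourly_kwh) * HOURS_PER_YEAR / floor_area_sqft
```

Under these assumptions, a single 8760 h year yields 358 windows, that is 8592 predicted hours, which clears the 8000-pair minimum stated in the inclusion criteria.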
Coefficient of Variation (CV): Standard deviation of hourly load divided by mean hourly load over the observation period:

$$\mathrm{CV}_i = \frac{\sigma_i}{\mu_i}$$

CV is computed from measured consumption data only; no model prediction is involved. CV characterizes a building’s inherent load variability relative to its own mean, which makes it a self-referential metric.

Coefficient of Variation in Root Mean Square Error (CVRMSE): Standardized measure of zero-shot model prediction error:

$$\mathrm{CVRMSE}_i = \frac{\sqrt{\frac{1}{n_i}\sum_{t=1}^{n_i}\left(y_{i,t}-\hat{y}_{i,t}\right)^{2}}}{\bar{y}_i}$$

where $y_{i,t}$ is measured consumption at timestep $t$ for building $i$, $\hat{y}_{i,t}$ is the model’s zero-shot point prediction, $\bar{y}_i$ is mean measured consumption, and $n_i$ is the number of valid timesteps. CVRMSE normalizes RMSE by the building’s mean load, enabling cross-building comparison of pattern deviation regardless of absolute consumption magnitude.

Normalized Mean Bias Error (NMBE): Systematic directional bias in model predictions:

$$\mathrm{NMBE}_i = \frac{\frac{1}{n_i}\sum_{t=1}^{n_i}\left(y_{i,t}-\hat{y}_{i,t}\right)}{\bar{y}_i}$$

Positive NMBE indicates the building systematically consumes more than the model predicts (model under-predicts actual; the building is over-consuming relative to population expectations). Negative NMBE indicates the building systematically consumes less (the building is under-consuming relative to expectations, potentially indicating efficient operational practices). The NMBE formulation follows ASHRAE Guideline 14 [19] conventions, ensuring consistency with M&V literature.

Excess CVRMSE (defined in Section 4.5): The portion of a building’s CVRMSE that exceeds what would be predicted from its inherent load variability (CV) alone, capturing the model’s unique contribution to pattern atypicality diagnosis.

4.3. Conceptual Architecture

The proposed framework is designed around three diagnostic questions, each answered with increasing specificity:

Level 1: Is this building performing well on both the aggregate efficiency dimension (EUI) and the temporal pattern consistency dimension (CVRMSE)? → Four-quadrant classification.

Level 2: For buildings with irregular temporal patterns: Is this irregularity inherent to the building’s use type, or does it represent genuine population-level anomaly? → CVRMSE decomposition.

Level 3: For genuinely anomalous buildings: Does the anomaly manifest as over-consumption or under-consumption relative to population expectations? → NMBE directional analysis.

The necessity of each level can be illustrated through diagnostic failures that arise when levels are omitted. A building with EUI Score = 82 but CVRMSE = 35% and NMBE = +12% would be certified as efficient by EUI alone, yet it systematically over-consumes relative to its temporal context. Conversely, a Quadrant D building with CVRMSE = 28% appears problematic, but Level 2 decomposition reveals its high CV places it in the CV_DRIVEN category (Excess CVRMSE = 2 pp), requiring no intervention. Each level is designed to address a diagnostic ambiguity that the prior level cannot capture.

The framework is strictly hierarchical: each level’s question is only meaningful given the prior level’s answer. Applying NMBE analysis to all buildings regardless of CVRMSE classification would produce noisy, uninterpretable signals (a point validated statistically in Section 5.4).

4.4. Level 1: EUI × CVRMSE Quadrant Classification

4.4.1. EUI Score Computation

To enable absolute, population-referenced EUI evaluation, EUI Score is obtained through z-score normalization against CBECS 2018 Table C14 electricity consumption intensity benchmarks [15].
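A minimal sketch of this z-score scoring is given here for orientation, assuming the IQR-based standard deviation estimate (IQR / 1.35) that this subsection goes on to specify. The function name is hypothetical, and the quartile values used in the note after the block are placeholders, not CBECS figures.

```python
from statistics import NormalDist


def eui_score(eui, median, p25, p75):
    """Percentile-style score in [0, 100]; lower EUI gives a higher score.

    median, p25, p75: electricity-intensity quartiles (kWh/sqft) for the
    building's type, taken from a reference table such as CBECS 2018 C14.
    """
    sigma = (p75 - p25) / 1.35          # IQR-based standard deviation estimate
    z = (eui - median) / sigma          # deviation from the type median
    return NormalDist().cdf(-z) * 100   # sign flip: efficient buildings score high
```

With placeholder quartiles of 6.625, 10.0, and 13.375 kWh/sqft, a building at the median scores 50 and a building at the 75th percentile scores about 25, which is the behavior one would expect from a percentile-style score.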
For building $i$ of type $k$, the EUI Score is computed as:

$$z_i^k = \frac{\mathrm{EUI}_i - \mu^{k}_{\mathrm{CBECS,elec}}}{\sigma^{k}_{\mathrm{CBECS,elec}}}$$

$$\text{EUI Score}_i = \Phi\left(-z_i^k\right) \times 100$$

where $\Phi$ is the standard normal cumulative distribution function, $\mu^{k}_{\mathrm{CBECS,elec}}$ is the CBECS 2018 Table C14 median electricity EUI for building type $k$ (see Section 4.4.1), and $\sigma^{k}_{\mathrm{CBECS,elec}}$ is the estimated standard deviation. The negative sign in $\Phi(-z)$ ensures that lower EUI (more efficient) yields higher scores.

The electricity consumption intensity reference thresholds are taken directly from CBECS 2018 Table C14. CBECS 2018 is used (rather than 2012) because the evaluation dataset (BDG-2) contains buildings with 2016–2017 meter data, making CBECS 2018 the temporally appropriate reference population.

CBECS 2018 Table C14 (“Electricity consumption and expenditure intensities”) provides comprehensive quartile distributions of electricity intensity by principal building activity, including median, 25th percentile, and 75th percentile values in kWh/sqft. Unlike aggregated consumption tables, C14 directly publishes the intensity distributions needed for benchmarking, eliminating the need for estimation. These values are used directly:

Median (kWh/sqft): Threshold for Score = 50.

Standard deviation: Estimated from the interquartile range using $\hat{\sigma} = \mathrm{IQR}/1.35 = (P_{75} - P_{25})/1.35$.

The IQR-based standard deviation estimate is robust and assumes an approximately normal distribution within the central 50% of the data, which is appropriate for population-level EUI distributions that are typically right-skewed but well-behaved in the interquartile range.

Building type coverage: 583 of 611 BDG-2 buildings map to C14 categories. A total of 28 buildings (Parking, Technology, Utility, Other without specific C14 categories) are excluded from EUI scoring but retain Pattern Score evaluation.

4.4.2. Pattern Score Computation

Within-type z-score normalization of CVRMSE:

$$z_i^k = \frac{\mathrm{CVRMSE}_i - \mu^{k}_{\mathrm{CVRMSE}}}{\sigma^{k}_{\mathrm{CVRMSE}}}$$

Pattern Score is mapped from z-score to percentile via the standard normal CDF, with sign inversion so that lower CVRMSE (more consistent pattern) yields higher Pattern Score:

$$\text{Pattern Score}_i = \Phi\left(-z_i^k\right) \times 100$$

The z-score normalization removes between-type CVRMSE differences driven by inherent building type characteristics: parking garages have structurally low CVRMSE (simple, predictable 24/7 or daytime-only patterns), while public assembly buildings have structurally high CVRMSE (event-driven, highly irregular consumption). The Pattern Score reflects within-type relative pattern consistency, directly analogous to ENERGY STAR’s type-conditional percentile scoring.

4.4.3. Four-Quadrant Classification

Buildings are classified using Score = 50 as the threshold on both axes, with inclusive inequality (≥50): Pattern Score ≥ 50 (temporally consistent); Pattern Score < […]

[…] > 5 percentage points are classified as ATYPICAL (genuine pattern deviation beyond what inherent variability would predict); buildings with Excess CVRMSE ≤ 5 pp are classified as CV_DRIVEN (elevated CVRMSE primarily explained by inherent load variability).

The 5 pp threshold is selected based on convergent evidence: the IQR outlier fence (Q3 + 1.5 × IQR = 11.6 pp) places the boundary well above the selected threshold; Cohen’s d for |NMBE| separation reaches the “large” effect size (d = 0.88) at 5 pp; and 5 pp is the highest threshold maintaining n ≥ 10 in all Level 3 subcategories. This threshold is sample-specific and requires recalibration on other datasets. A comprehensive sensitivity analysis varying the Excess CVRMSE threshold from 3 to 10 pp confirms that Cohen’s d exceeds the “large” effect size (d = 0.8) for all thresholds at or above 4 pp, with d = 0.88 at the selected 5 pp threshold (Appendix D, Figure A1).
Buildings with Pattern Score ≥ 50 (Quadrants A and C) are classified as NORMAL; no CVRMSE decomposition is needed because their pattern consistency is already established.

4.6. Level 3: NMBE Directional Analysis

For ATYPICAL buildings, the model’s systematic prediction error direction (NMBE) provides actionable guidance on the nature of the deviation:

OVER-CONSUMING (NMBE > +2%): The model systematically under-predicts actual consumption. The building consumes more than population-representative expectations for its temporal context. This may indicate after-hours equipment operation, HVAC scheduling failures, plug load proliferation, or other patterns where operational adjustments could reduce consumption toward model expectations.

UNDER-CONSUMING (NMBE < […]

[…] > 20%, three distinct (non-mutually exclusive) causal mechanisms contribute: high inherent variability (87.6%), small mean load inflating the normalized metric (47.6%), and genuine pattern deviation (33.5%). The detailed causal decomposition is provided in Appendix B.

5.2. EUI and CVRMSE Are Empirically Independent

The Pearson correlation between raw EUI and raw CVRMSE across all 611 buildings is r = −0.029 (p = 0.48); on the 583 CBECS-mapped subset, r = −0.082 (p = 0.047, R² = 0.007). Although the latter is marginally significant, EUI explains less than 1% of CVRMSE variance, confirming near-independence. The correlation between EUI Score (C14 median reference) and Pattern Score across 583 CBECS-mapped buildings is r = −0.291 (p < […]

[…] > 0.999) and Pattern Score rankings (ρ > 0.985), confirming that the framework’s results are not sensitive to the choice of transformation parameter.

6.5. Practical Deployment Pathway

The framework’s minimal data requirement (hourly metered energy consumption plus latitude/longitude) aligns naturally with smart meter infrastructure being deployed globally under mandatory benchmarking programs.
In jurisdictions with building disclosure requirements (New York Local Law 97, California AB 802, EU Energy Performance of Buildings Directive), the required data is already collected annually or continuously.

A practical concern for deployment is interpretability: while the framework can flag a building as ATYPICAL, facility managers need guidance on the nature of the anomaly. The hourly prediction residuals provide a first level of temporal localization, revealing patterns such as concentrated evening or weekend over-consumption, systematic weekday business-hour savings, or phase mismatches between actual and population-typical scheduling. While this temporal localization identifies when anomalies occur, it does not explain the underlying mechanistic causes, which would require model-internal analysis such as SHAP-based feature attribution or attention map analysis.

Framework Recalibration Protocol

To facilitate transfer to new building portfolios, the CV–CVRMSE regression must be re-fitted on the target portfolio, as the slope (α) and intercept (β) are population-dependent, and the Excess CVRMSE threshold must be re-derived using IQR-based outlier fencing and Cohen’s d analysis. A minimum of 30 buildings per type is recommended for stable within-type normalization; rare types should be aggregated or evaluated against the all-type pooled distribution. When a new foundation model replaces the forecasting backbone, the entire calibration sequence should be repeated. A Leave-One-Site-Out cross-validation across the four BDG-2 sites indicates that the fitted parameters and resulting classifications are stable across sites (α range: 0.47–0.61; overall agreement: 91.4%, Cohen’s κ = 0.837; Appendix E).

6.6. Limitations and Boundary Conditions

Several limitations relate to the framework’s input data scope. The EUI Score computation uses a CBECS population-referenced z-score approach that does not adjust for operating hours, worker count, or plug load intensity within building types, unlike ENERGY STAR’s regression-based adjustment.
This may introduce noise in the EUI Score but does not affect the CVRMSE-based Pattern Score, and the core EUI–CVRMSE independence finding is robust across alternative EUI scoring methods (Appendix A). Buildings with mean hourly load below 5 kWh/h (23 of 611 in BDG-2) are excluded from ATYPICAL and NMBE analysis due to denominator inflation risk in normalized metrics. The framework currently uses latitude and longitude as static proxies for climate zone; while the model has internalized 16 ASHRAE climate zones through training, actual hourly dry-bulb temperature and humidity data would provide finer-grained meteorological context and help distinguish operational anomalies from weather-driven variability. Building envelope characteristics and cooling set-point strategies also influence temporal consumption patterns in ways not captured by meter data alone [27]. Furthermore, the framework is limited to electricity because the BuildingsBench model was trained exclusively on electricity load profiles, and hourly gas and thermal metering data are rarely available at scale. Extension to multi-fuel diagnostics awaits foundation models trained on diverse energy carriers.

The CV–CVRMSE decomposition regression (CVRMSE = α · CV + β, Section 4.5) and the 5 pp Excess CVRMSE threshold are fitted on BDG-2’s 611 buildings across four North American sites; these parameters will differ for other building populations, climate contexts, or meter types, and require recalibration on new datasets (see the Framework Recalibration Protocol in Section 6.5). Validation on independent datasets (the full ASHRAE GEPIII dataset, urban energy disclosure datasets, European building portfolios) is necessary to confirm generalizability. All metrics in this study derive from a single model, TransformerWithGaussian-L; while the framework’s diagnostic logic is model-agnostic, concordance across alternative foundation models remains to be established.
The model’s training corpus (Buildings-900K) reflects CBECS 2012-era building stock; modern technologies such as heat pump electrification, rooftop solar, and post-pandemic work patterns may cause technologically progressive buildings to be misclassified as ATYPICAL, making periodic retraining on updated CBECS data essential. The 168 h (7-day) context window, a design parameter of the TransformerWithGaussian-L architecture, is sufficient for capturing weekly operational cycles but may not fully represent broader seasonal transitions or multi-week operational patterns. During shoulder seasons where heating and cooling mode transitions occur, the limited temporal memory may generate elevated CVRMSE not attributable to operational anomalies; the framework’s annual aggregation of CVRMSE across all prediction window

www.mdpi.com
