3.1 Comparing to Measurements: Timezone and Time Period alignment

In this example, our measurement data is:

5 minute averages
timestamps labelled with the period start
unknown timezone

which we want to compare to 30 minute Solcast data.

(N.B. if you do not know whether your measurement data is labelled with the start or end of the period, the best approach is to try assuming either way and seeing which gives you the best result)

In [1]:

Copied!





import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

from solcast import historic
from solcast.unmetered_locations import UNMETERED_LOCATIONS
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

from solcast import historic
from solcast.unmetered_locations import UNMETERED_LOCATIONS

Loading Data¶

In [2]:

Copied!





site = UNMETERED_LOCATIONS["Goodwin Creek"]

# NOTE: this will consume an API request!
resp = historic.radiation_and_weather(
    latitude=site["latitude"], 
    longitude=site["longitude"],
    # we get a buffer for this example, because we don't
    # know the timezone of our measurement data
    start="2021-05-14 00:00", 
    end="2021-05-18 00:00",
    output_parameters=["ghi"],
    period="PT30M",
)
solcast = resp.to_pandas()
# convert to tz-naive since we will be comparing 
# to measurements of unknown time zone.
solcast.index = solcast.index.tz_convert(None)  

solcast.head()
site = UNMETERED_LOCATIONS["Goodwin Creek"]

# NOTE: this will consume an API request!
resp = historic.radiation_and_weather(
    latitude=site["latitude"], 
    longitude=site["longitude"],
    # we get a buffer for this example, because we don't
    # know the timezone of our measurement data
    start="2021-05-14 00:00", 
    end="2021-05-18 00:00",
    output_parameters=["ghi"],
    period="PT30M",
)
solcast = resp.to_pandas()
# convert to tz-naive since we will be comparing 
# to measurements of unknown time zone.
solcast.index = solcast.index.tz_convert(None)  

solcast.head()

Out[2]:

	ghi
period_end
2021-05-14 00:30:00	63
2021-05-14 01:00:00	8
2021-05-14 01:30:00	0
2021-05-14 02:00:00	0
2021-05-14 02:30:00	0

In [3]:

Copied!





# load raw measurements
meas_raw = pd.read_csv("./data/3.1_sample_measurements.csv", index_col=0, parse_dates=True)

# clean up any clearly wrong values 
meas_raw.loc[meas_raw["ghi"]<0, "ghi"] = 0

display(meas_raw.head())
meas_raw.plot()
# load raw measurements
meas_raw = pd.read_csv("./data/3.1_sample_measurements.csv", index_col=0, parse_dates=True)

# clean up any clearly wrong values 
meas_raw.loc[meas_raw["ghi"]<0, "ghi"] = 0

display(meas_raw.head())
meas_raw.plot()

	ghi
period_start
2021-05-15 02:00:00	0.0
2021-05-15 02:05:00	0.0
2021-05-15 02:10:00	0.0
2021-05-15 02:15:00	0.0
2021-05-15 02:20:00	0.0

Out[3]:

<Axes: xlabel='period_start'>

No description has been provided for this image

Resampling Measurement Data¶

Here we resample measurement data to 30 minute averages, labelled with period_end

In [4]:

Copied!





# In this example we are targeting 30 minute periods
target_minutes = 30
# we have 5 minute measurement data, so we expect
# 6 measurements per 30 minute period, allowing one to be missing:
minimum_measurements_per_period = 5
# if the raw_measurements are averaged by:
#   - period start, then closed="left"
#   - period end, then closed="right"
#   - instantaneous, then closed="right" 
#     (but you should ensure you have at least one-minute data!)
# for conversion to period_end
closed = "left"              

# label="right" to make the timestamp timestamps labelled with the period end
# (label="left" would make the timestamps labelled with the period start)
groupby = meas_raw.resample(f"{target_minutes}T", label="right", closed=closed)
meas_resampled = groupby.mean()
measurements_count = groupby.count()
meas_resampled = meas_resampled.where(measurements_count >= minimum_measurements_per_period)
meas_resampled.index.name = "period_end"
# In this example we are targeting 30 minute periods
target_minutes = 30
# we have 5 minute measurement data, so we expect
# 6 measurements per 30 minute period, allowing one to be missing:
minimum_measurements_per_period = 5
# if the raw_measurements are averaged by:
#   - period start, then closed="left"
#   - period end, then closed="right"
#   - instantaneous, then closed="right" 
#     (but you should ensure you have at least one-minute data!)
# for conversion to period_end
closed = "left"              

# label="right" to make the timestamp timestamps labelled with the period end
# (label="left" would make the timestamps labelled with the period start)
groupby = meas_raw.resample(f"{target_minutes}T", label="right", closed=closed)
meas_resampled = groupby.mean()
measurements_count = groupby.count()
meas_resampled = meas_resampled.where(measurements_count >= minimum_measurements_per_period)
meas_resampled.index.name = "period_end"

Plotting up the results

some times have been dropped to due incomplete measurements
the PT15M timeseries is shifted to the right - this is expected, because multiple PT5M timestamps go into a single resampled timestamp, e.g.

resampled PT15M period_end | raw measurements PT5M period_starts 
12:45                      | [12:30, 12:35, 12:40]
13:00                      | [12:45, 12:50, 12:55]
...

In [5]:

Copied!





# plot up results
meas_raw["ghi"].plot(label="PT5M, period_start")
meas_resampled["ghi"].plot(label="PT30M, period_end")
plt.xlabel("timestamp")
plt.ylabel("GHI [W/m^2]")
plt.title("Resampling Timeseries")
plt.legend()
plt.show()
# plot up results
meas_raw["ghi"].plot(label="PT5M, period_start")
meas_resampled["ghi"].plot(label="PT30M, period_end")
plt.xlabel("timestamp")
plt.ylabel("GHI [W/m^2]")
plt.title("Resampling Timeseries")
plt.legend()
plt.show()

Timezone Shifting Solcast Data¶

Here we detect the timezone of the measurements, and shift the Solcast Data to match this:

In [6]:

Copied!





def detect_timezone_shift(ghi_to_be_shifted: pd.Series, target_ghi: pd.Series) -> float:
    """Computes the necessary timezone shift to best match the supplied ghi timeseries.
    
    Here we shift the timeseries by 15 min each time,
    and pick the shift that gives the lowest error.
    This works quite well because of the day/night signature
    in the timeseries, so even a moderate amount of corrupt
    data will generally not break this check.

    However, there are certaintly pathological corrupt measurements
    that would cause this timezone shift detection to fail - use with caution!
    
    """
    src = ghi_to_be_shifted.copy()
    errors = {}
    for shift in np.arange(-13, 13.25, 0.25):
        shifted = src.copy()
        shifted.index = shifted.index + pd.Timedelta(hours=shift)
        errors[shift] = (shifted - target_ghi).abs().mean()
    errors = pd.Series(errors)
    return errors.idxmin()
def detect_timezone_shift(ghi_to_be_shifted: pd.Series, target_ghi: pd.Series) -> float:
    """Computes the necessary timezone shift to best match the supplied ghi timeseries.
    
    Here we shift the timeseries by 15 min each time,
    and pick the shift that gives the lowest error.
    This works quite well because of the day/night signature
    in the timeseries, so even a moderate amount of corrupt
    data will generally not break this check.

    However, there are certaintly pathological corrupt measurements
    that would cause this timezone shift detection to fail - use with caution!
    
    """
    src = ghi_to_be_shifted.copy()
    errors = {}
    for shift in np.arange(-13, 13.25, 0.25):
        shifted = src.copy()
        shifted.index = shifted.index + pd.Timedelta(hours=shift)
        errors[shift] = (shifted - target_ghi).abs().mean()
    errors = pd.Series(errors)
    return errors.idxmin()

In [7]:

Copied!





tz = detect_timezone_shift(solcast["ghi"], meas_resampled["ghi"])
print(f"Detected timezone shift: {tz} hours")

# create a combined Solcast + Measurements DataFrame
df = solcast.copy()
df.index = df.index + pd.Timedelta(hours=tz)
df = df.add_suffix("_solcast").join(meas_resampled.add_suffix("_meas"))
df = df.loc[meas_resampled.index]

ax = df.plot(figsize=(9,4), title="Timezone shifted Irradiance")
# plot 
(
    solcast["ghi"]
    .loc[slice(
        df.index.min() - pd.Timedelta(hours=tz),
        df.index.max() - pd.Timedelta(hours=tz)
    )]
    .plot(ax=ax, linestyle="dashed", color="grey", label="ghi_solcast[UTC]")
)
plt.legend(loc="lower right")
display(df.head())
tz = detect_timezone_shift(solcast["ghi"], meas_resampled["ghi"])
print(f"Detected timezone shift: {tz} hours")

# create a combined Solcast + Measurements DataFrame
df = solcast.copy()
df.index = df.index + pd.Timedelta(hours=tz)
df = df.add_suffix("_solcast").join(meas_resampled.add_suffix("_meas"))
df = df.loc[meas_resampled.index]

ax = df.plot(figsize=(9,4), title="Timezone shifted Irradiance")
# plot 
(
    solcast["ghi"]
    .loc[slice(
        df.index.min() - pd.Timedelta(hours=tz),
        df.index.max() - pd.Timedelta(hours=tz)
    )]
    .plot(ax=ax, linestyle="dashed", color="grey", label="ghi_solcast[UTC]")
)
plt.legend(loc="lower right")
display(df.head())

Detected timezone shift: -5.0 hours

	ghi_solcast	ghi_meas
period_end
2021-05-15 02:30:00	0	0.0
2021-05-15 03:00:00	0	0.0
2021-05-15 03:30:00	0	0.0
2021-05-15 04:00:00	0	0.0
2021-05-15 04:30:00	0	0.0

In [ ]: