data_processing

frequenz.lib.notebooks.solar.maintenance.data_processing

Handles all data processing and transformation tasks for the solar maintenance project.

The module contains functions to preprocess solar power production data, calculate statistical metrics, segment and analyse the data, and transform weather features.

Functions
frequenz.lib.notebooks.solar.maintenance.data_processing._log_outliers

```python
_log_outliers(
    data: DataFrame,
    outlier_mask: tuple[DataFrame, ...],
    bounds: tuple[float, float],
) -> None
```

Log useful information about the detected outliers.

| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `data` | `DataFrame` | DataFrame with a datetime index. |
| `outlier_mask` | `tuple[DataFrame, ...]` | A tuple of DataFrames with boolean values indicating any detected outliers. |
| `bounds` | `tuple[float, float]` | A tuple of lower and upper bound values to replace outliers with. |
Source code in frequenz/lib/notebooks/solar/maintenance/data_processing.py
frequenz.lib.notebooks.solar.maintenance.data_processing.calculate_stats

```python
calculate_stats(
    df: DataFrame, exclude_zeros: bool = False
) -> DataFrame
```

Calculate statistical metrics for a given DataFrame and resampling rule.

| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `df` | `DataFrame` | DataFrame with solar power production data and a datetime index. |
| `exclude_zeros` | `bool` | A boolean flag to exclude zero values from the calculation. |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | A new DataFrame with the calculated statistics, or a DataFrame with NaN values if the input data (or the data after excluding zeros) is empty. |
Source code in frequenz/lib/notebooks/solar/maintenance/data_processing.py
frequenz.lib.notebooks.solar.maintenance.data_processing.is_more_frequent

```python
is_more_frequent(freq1: str, freq2: str) -> bool
```

Compare two frequency strings to determine if the first is more frequent.

| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `freq1` | `str` | Frequency string (e.g., '15min'). |
| `freq2` | `str` | Frequency string (e.g., '10min'). |

| RETURNS | DESCRIPTION |
|---|---|
| `bool` | True if freq1 is more frequent than freq2, False otherwise. |

```python
assert is_more_frequent("15min", "10min") == False
assert is_more_frequent("30min", "1h") == True
assert is_more_frequent("1D", "24h") == False
assert is_more_frequent("1h", "1h") == False
assert is_more_frequent("1h", "60min") == False
```
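The behaviour shown by the assertions can be sketched by comparing fixed durations (a minimal sketch with a hypothetical helper, assuming pandas-compatible fixed-length frequency strings; calendar-dependent frequencies such as month offsets would need pandas offset objects instead):

```python
import pandas as pd


def is_more_frequent_sketch(freq1: str, freq2: str) -> bool:
    """Return True if freq1 denotes a shorter interval (i.e. a higher frequency) than freq2."""
    # A shorter time span between samples means the frequency is higher.
    return pd.Timedelta(freq1) < pd.Timedelta(freq2)
```

Equal intervals expressed in different units (e.g. '1D' vs. '24h') compare as not-more-frequent, matching the assertions above.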
Source code in frequenz/lib/notebooks/solar/maintenance/data_processing.py
frequenz.lib.notebooks.solar.maintenance.data_processing.outlier_detection_iqr

Detect outliers in the input data based on the interquartile range.

Method: Interquartile range (IQR) method to detect any data points outside the range defined by (Q1 - threshold * IQR, Q3 + threshold * IQR) as outliers.

| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `data` | `DataFrame` | DataFrame with a datetime index. |
| `threshold` | `float` | The threshold value to detect outliers. |

| RETURNS | DESCRIPTION |
|---|---|
| `tuple[DataFrame, DataFrame]` | A tuple of two DataFrames with boolean values indicating outliers. The first DataFrame contains the lower outliers, and the second DataFrame contains the upper outliers. |

References:

- https://en.wikipedia.org/wiki/Interquartile_range
- https://en.wikipedia.org/wiki/Robust_measures_of_scale
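The IQR fences can be computed per column with plain pandas (a minimal sketch with a hypothetical helper; the default threshold of 1.5 is the conventional IQR rule and an assumption here, not necessarily the library's default):

```python
import pandas as pd


def iqr_outlier_masks(data: pd.DataFrame, threshold: float = 1.5):
    """Flag points outside (Q1 - threshold * IQR, Q3 + threshold * IQR), per column."""
    q1 = data.quantile(0.25)
    q3 = data.quantile(0.75)
    iqr = q3 - q1
    lower = data < (q1 - threshold * iqr)  # points below the lower fence
    upper = data > (q3 + threshold * iqr)  # points above the upper fence
    return lower, upper


# Small demo: 100.0 lies far above the upper fence of this series.
demo = pd.DataFrame({"p": [1.0, 2.0, 3.0, 4.0, 100.0]})
low_mask, up_mask = iqr_outlier_masks(demo)
```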
Source code in frequenz/lib/notebooks/solar/maintenance/data_processing.py
frequenz.lib.notebooks.solar.maintenance.data_processing.outlier_detection_min_max

```python
outlier_detection_min_max(
    data: DataFrame,
    min_value: float = -inf,
    max_value: float = inf,
) -> tuple[DataFrame, DataFrame]
```

Detect outliers in the input data based on the min-max threshold values.

Method: Min-max method to detect any data points outside the unbounded range (min_value, max_value) as outliers.

| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `data` | `DataFrame` | DataFrame with a datetime index. |
| `min_value` | `float` | The minimum threshold value to detect outliers. |
| `max_value` | `float` | The maximum threshold value to detect outliers. |

| RETURNS | DESCRIPTION |
|---|---|
| `tuple[DataFrame, DataFrame]` | A tuple of two DataFrames with boolean values indicating outliers. The first DataFrame contains the lower outliers, and the second DataFrame contains the upper outliers. |

References:

- https://en.wikipedia.org/wiki/Min-max_scaling
Source code in frequenz/lib/notebooks/solar/maintenance/data_processing.py
frequenz.lib.notebooks.solar.maintenance.data_processing.outlier_detection_modified_z_score

```python
outlier_detection_modified_z_score(
    data: DataFrame,
    threshold: float = 3.5,
    verbose: bool = False,
) -> tuple[DataFrame]
```

Detect outliers in the input data based on the modified z-score.

Method: Modified z-score method to detect any data points with absolute modified z-scores greater than the given threshold as outliers.

| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `data` | `DataFrame` | DataFrame with a datetime index. |
| `threshold` | `float` | The threshold value to detect outliers. |
| `verbose` | `bool` | A boolean flag to print additional information. |

| RETURNS | DESCRIPTION |
|---|---|
| `tuple[DataFrame]` | A tuple with a DataFrame containing boolean values indicating outliers. |

References:

- https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm
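The NIST reference defines the modified z-score as 0.6745 * (x - median) / MAD, where MAD is the median absolute deviation. A minimal per-column sketch (hypothetical helper; assumes a non-zero MAD):

```python
import pandas as pd


def modified_z_score_mask(data: pd.DataFrame, threshold: float = 3.5):
    """Flag |modified z-score| > threshold, per column (Iglewicz-Hoaglin / NIST form)."""
    median = data.median()
    mad = (data - median).abs().median()  # median absolute deviation (assumed non-zero)
    mod_z = 0.6745 * (data - median) / mad
    return (mod_z.abs() > threshold,)  # single-element tuple, mirroring the documented return type


# Small demo: 100.0 has a very large modified z-score in this series.
demo = pd.DataFrame({"p": [1.0, 2.0, 3.0, 4.0, 100.0]})
(mask,) = modified_z_score_mask(demo)
```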
Source code in frequenz/lib/notebooks/solar/maintenance/data_processing.py
frequenz.lib.notebooks.solar.maintenance.data_processing.outlier_detection_z_score

Detect outliers in the input data based on the z-score.

Method: Z-score method to detect any data points with absolute z-scores greater than the given threshold as outliers.

| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `data` | `DataFrame` | DataFrame with a datetime index. |
| `threshold` | `float` | The threshold value to detect outliers. |

| RETURNS | DESCRIPTION |
|---|---|
| `tuple[DataFrame]` | A tuple with a DataFrame containing boolean values indicating outliers. |

References:

- https://en.wikipedia.org/wiki/Standard_score
Source code in frequenz/lib/notebooks/solar/maintenance/data_processing.py
frequenz.lib.notebooks.solar.maintenance.data_processing.outlier_removal

```python
outlier_removal(
    data: DataFrame,
    columns: list[str],
    bounds: tuple[float, float],
    method: str = "min_max",
    verbose: bool = False,
    **kwargs: float
) -> DataFrame
```

Replace outliers in the input data based on the given values.

| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `data` | `DataFrame` | DataFrame with a datetime index. |
| `columns` | `list[str]` | List of column names to consider for outlier detection. |
| `bounds` | `tuple[float, float]` | A tuple of lower and upper bound values, index 0 and 1 respectively, to replace outliers with. Both or one of the lower and upper bounds must be provided, depending on the outlier detection method used. |
| `method` | `str` | The outlier detection method to use. |
| `verbose` | `bool` | A boolean flag to print additional information. |
| `**kwargs` | `float` | Additional keyword arguments for the outlier detection method. |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | A DataFrame with outliers replaced by the given values. |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | |
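To illustrate the replacement semantics, a simplified sketch of the 'min_max' path only (hypothetical helper; the library function dispatches to several detection methods and accepts extra keyword arguments):

```python
import pandas as pd


def replace_outliers_min_max(
    data: pd.DataFrame,
    columns: list[str],
    bounds: tuple[float, float],
    min_value: float,
    max_value: float,
) -> pd.DataFrame:
    """Replace values outside (min_value, max_value) with the given bound values."""
    out = data.copy()
    for col in columns:
        out.loc[out[col] < min_value, col] = bounds[0]  # replace lower outliers
        out.loc[out[col] > max_value, col] = bounds[1]  # replace upper outliers
    return out


# Demo: clip implausible production readings into [0.0, 500.0].
demo = pd.DataFrame({"p": [-5.0, 10.0, 900.0]})
cleaned = replace_outliers_min_max(demo, ["p"], (0.0, 500.0), 0.0, 500.0)
```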
Source code in frequenz/lib/notebooks/solar/maintenance/data_processing.py
frequenz.lib.notebooks.solar.maintenance.data_processing.preprocess_data

```python
preprocess_data(
    df: DataFrame,
    *,
    ts_col: str = "ts",
    power_cols: list[str] | tuple[str, ...] = ("p",),
    power_unit: str = "kW",
    energy_units: list[str] | tuple[str, ...] = (
        "kWh",
        "MWh",
    ),
    name_suffixes: list[str] | tuple[str, ...] = (
        "midDefault",
    ),
    datetime_format: str | None = None,
    in_place: bool = False
) -> DataFrame
```

Preprocess by converting power to the required unit and calculating energy consumed.

Details: The function converts the power column to the required unit and calculates the energy consumed based on the power and the time difference between consecutive timestamps.

| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `df` | `DataFrame` | Input DataFrame. |
| `ts_col` | `str` | The name of the timestamp column. |
| `power_cols` | `list[str] \| tuple[str, ...]` | Power column names. |
| `power_unit` | `str` | The unit to convert power into ('kW', 'MW', etc.). |
| `energy_units` | `list[str] \| tuple[str, ...]` | Units to calculate energy ('kWh', 'MWh', etc.). |
| `name_suffixes` | `list[str] \| tuple[str, ...]` | Suffixes to add to the power and the corresponding energy column names. The strings should be unique and descriptive (e.g. midX to reflect the microgrid ID X) and should match the length of power_cols. |
| `datetime_format` | `str \| None` | Optional datetime format if auto-parsing fails. |
| `in_place` | `bool` | Modify the DataFrame in-place or return a new DataFrame. |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | The transformed DataFrame. |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | |
| `KeyError` | If required columns are missing in the input data. |

Example:

```python
df = pd.read_csv('data.csv', parse_dates=['timestamp'])
processed_df = preprocess_data(df, ts_col='timestamp', power_cols=['power'])
print(processed_df.head())
```
Source code in frequenz/lib/notebooks/solar/maintenance/data_processing.py
frequenz.lib.notebooks.solar.maintenance.data_processing.segment_and_align

```python
segment_and_align(
    data: DataFrame,
    grouping_freq: str | list[Any] | None = None,
    resamp_freq: str | None = None,
    verbose: bool = False,
) -> dict[Any, DataFrame]
```

Segment the input data into periods based on the given frequency.

Notes:

- If the resampling frequency is higher than the inferred frequency, the data is resampled to the higher frequency. Otherwise, the data is used as is.
- Linear interpolation is used to fill missing values when upsampling.
- The function does not downsample the data but rather segments it based on the grouping frequency.

| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `data` | `DataFrame` | DataFrame with a datetime index. |
| `grouping_freq` | `str \| list[Any] \| None` | Frequency string (e.g., '15min', '1h', '1D') to group the data. If a list is provided, the grouping is done based on the list elements. The elements can be either strings that correspond to column labels, or values of any one type (e.g. datetime.time). For the latter case, the length of the list must be equal to that of the selected axis (index by default) and the values are used as-is to determine the groups. For more details see the pandas.groupby documentation. |
| `resamp_freq` | `str \| None` | Frequency string (e.g., '15min', '1h', '1D') to resample the data. If None, the frequency is inferred from the data and used. |
| `verbose` | `bool` | A boolean flag to print additional information. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict[Any, DataFrame]` | A dictionary with the segmented data for each period. |
Source code in frequenz/lib/notebooks/solar/maintenance/data_processing.py
frequenz.lib.notebooks.solar.maintenance.data_processing.segment_and_analyse

```python
segment_and_analyse(
    data: DataFrame,
    *,
    grouping_freq_list: str | list[Any],
    resamp_freq_list: list[str | None],
    group_labels: list[str],
    exclude_zeros: list[bool],
    verbose: bool = False
) -> dict[str, DataFrame]
```

Process data by segmenting and calculating statistics for given frequencies.

| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `data` | `DataFrame` | DataFrame with a datetime index. |
| `grouping_freq_list` | `str \| list[Any]` | A list of elements that define how to group the data. Each element can be one of the following: a frequency string (e.g., '15min', '1h', '1D'), a list of strings that correspond to column labels, or a list of values of any one type (e.g. datetime.time). For more details see the pandas.groupby documentation. |
| `resamp_freq_list` | `list[str \| None]` | List of frequency strings (e.g., '15min', '1h', '1D') to resample the data. A list element can also be None, in which case the frequency is inferred from the data. Note that only up-sampling is supported. See the segment_and_align function for more details. |
| `group_labels` | `list[str]` | List of labels to use in the output dictionary for the segment statistics for each grouping frequency. |
| `exclude_zeros` | `list[bool]` | List of boolean flags to exclude zero values from the calculation for each grouping frequency. |
| `verbose` | `bool` | A boolean flag to print additional information. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict[str, DataFrame]` | A dictionary with the segment statistics for each grouping frequency. |
Source code in frequenz/lib/notebooks/solar/maintenance/data_processing.py
frequenz.lib.notebooks.solar.maintenance.data_processing.transform_weather_features

```python
transform_weather_features(
    data: DataFrame,
    column_label_mapping: dict[str, str],
    time_zone: ZoneInfo = ZoneInfo("UTC"),
    verbose: bool = False,
) -> tuple[DataFrame, bool]
```

Transform weather data by mapping features to new columns.

Features are mapped to new columns with values from the 'value' column. Creates a new column to show the time difference between 'validity_ts' and 'creation_ts'. Expects the time columns to be in UTC and converts them to the provided timezone.

| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `data` | `DataFrame` | DataFrame with weather data. |
| `column_label_mapping` | `dict[str, str]` | Dictionary that maps 'feature' entries to new column names with row entries obtained from the corresponding row of the 'value' column. |
| `time_zone` | `ZoneInfo` | The timezone to convert the time columns to. Should be a valid zoneinfo.ZoneInfo object. |
| `verbose` | `bool` | A boolean flag to print additional information. |

| RETURNS | DESCRIPTION |
|---|---|
| `tuple[DataFrame, bool]` | A tuple of the transformed DataFrame and a boolean flag indicating if any missing or invalid date entries were found in 'validity_ts'. |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If required columns are missing in the input data. |
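A rough illustration of the mapping (the column layout 'feature'/'value'/'validity_ts'/'creation_ts' follows the description above; the feature names, the pivot-based approach, and the target timezone are assumptions for the demo):

```python
import pandas as pd

# Hypothetical long-format weather data: one row per (timestamp, feature) pair.
raw = pd.DataFrame(
    {
        "validity_ts": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 00:00"], utc=True),
        "creation_ts": pd.to_datetime(["2023-12-31 12:00", "2023-12-31 12:00"], utc=True),
        "feature": ["temperature_2m", "surface_solar_radiation"],
        "value": [5.2, 130.0],
    }
)

# Map each 'feature' entry to its own column, filled from the 'value' column.
wide = raw.pivot_table(index="validity_ts", columns="feature", values="value")

# New column: time difference between validity and creation timestamps.
raw["forecast_horizon"] = raw["validity_ts"] - raw["creation_ts"]

# Convert the UTC timestamps to a target timezone (Europe/Berlin as an example).
wide.index = wide.index.tz_convert("Europe/Berlin")
```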
Source code in frequenz/lib/notebooks/solar/maintenance/data_processing.py