Evaluation Methods
OpenMOA provides a comprehensive evaluation framework for online / stream learning. It supports multiple evaluation paradigms, task types, and metric sets — all built on MOA's high-performance Java backend with a clean Python API.
1. Evaluation in OpenMOA
OpenMOA's evaluation framework follows the prequential (test-then-train) paradigm, which is the standard approach for evaluating learning algorithms on data streams. In this paradigm, each instance is first used for prediction (testing), then used to update the model (training) — ensuring every instance contributes to both evaluation and learning.
Evaluations produce two complementary views:
Cumulative Metrics
Aggregate performance over all instances seen so far. Returns a single scalar value — the overall performance across the full stream.
Windowed Metrics
Performance within the most recent fixed-size window (default: 1000 instances). Returns a list of values — one per completed window — capturing how performance evolves over time.
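In code, the two views are read from the same results object. A minimal sketch, assuming results was produced by any of the evaluation functions below:

```python
# Cumulative view: a single scalar over everything seen so far
overall_acc = results.cumulative.accuracy()

# Windowed view: one value per completed window, in stream order
acc_per_window = results.windowed.accuracy()
```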
2. prequential_evaluation
supervised
Standard evaluation for a single supervised learner (classification or regression).
```python
from openmoa.evaluation import prequential_evaluation

results = prequential_evaluation(
    stream,
    learner,
    max_instances=None,       # None = full stream
    window_size=1000,
    store_predictions=False,
    store_y=False,
    optimise=True,            # Uses fast MOA evaluation loop when compatible
    restart_stream=True,
    progress_bar=False,
    batch_size=1,
)
```
Key Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| stream | Stream | required | Data stream to evaluate on |
| learner | Classifier / Regressor | required | The learner to evaluate |
| max_instances | int or None | None | Max instances to process; None = full stream |
| window_size | int | 1000 | Size of evaluation windows |
| store_predictions | bool | False | Whether to store all predictions |
| store_y | bool | False | Whether to store all ground truth labels |
| optimise | bool | True | Use fast MOA native loop when possible |
| restart_stream | bool | True | Reset stream to beginning before evaluation |
| batch_size | int | 1 | Number of instances per training batch |
Returns: PrequentialResults object
Example
```python
from openmoa.classifier import NaiveBayes
from openmoa.datasets import Electricity
from openmoa.evaluation import prequential_evaluation

stream = Electricity()
learner = NaiveBayes(schema=stream.get_schema())

results = prequential_evaluation(stream=stream, learner=learner, max_instances=10000)
print(results.cumulative.accuracy())  # e.g., 76.34
print(results.windowed.accuracy())    # list of values per window
```
3. prequential_evaluation_multiple_learners
multi-learner
Evaluates multiple learners on the same stream simultaneously, iterating the stream only once for efficiency.
```python
from openmoa.evaluation import prequential_evaluation_multiple_learners

results_dict = prequential_evaluation_multiple_learners(
    stream,
    learners,                 # Dict[str, Learner]
    max_instances=None,
    window_size=1000,
    store_predictions=False,
    store_y=False,
    progress_bar=False,
)
```
Key Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| learners | Dict[str, Learner] | required | Dictionary mapping learner names to learner objects |
Returns: Dict[str, PrequentialResults] — one result per learner name
```python
from openmoa.classifier import NaiveBayes, HoeffdingTree
from openmoa.datasets import Electricity
from openmoa.evaluation import prequential_evaluation_multiple_learners

stream = Electricity()
learners = {
    "NaiveBayes": NaiveBayes(schema=stream.get_schema()),
    "HoeffdingTree": HoeffdingTree(schema=stream.get_schema()),
}

results = prequential_evaluation_multiple_learners(stream, learners)
for name, res in results.items():
    print(f"{name}: {res.cumulative.accuracy():.2f}%")
```
4. prequential_ssl_evaluation
semi-supervised
Evaluation for semi-supervised learning, where only a fraction of instances are labeled.
```python
from openmoa.evaluation import prequential_ssl_evaluation

results = prequential_ssl_evaluation(
    stream,
    learner,
    max_instances=None,
    window_size=1000,
    initial_window_size=0,
    delay_length=0,
    label_probability=0.01,   # 1% of instances are labeled
    random_seed=1,
    store_predictions=False,
    store_y=False,
    optimise=True,
    restart_stream=True,
    progress_bar=False,
    batch_size=1,
)
```
Key Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| label_probability | float | 0.01 | Fraction of instances that provide a label (0.0 – 1.0) |
| initial_window_size | int | 0 | Number of fully labeled instances at the start |
| delay_length | int | 0 | Delay (in instances) before a label becomes available |
Returns: PrequentialResults object
label_probability=0.01 means roughly 1 in 100 instances will carry
a label during evaluation. Use initial_window_size to warm-start the model with a small
fully-labeled window before entering the semi-supervised regime.
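A minimal end-to-end sketch, reusing the Electricity stream and NaiveBayes learner from the earlier example (the label budget and warm-start size here are illustrative values):

```python
from openmoa.classifier import NaiveBayes
from openmoa.datasets import Electricity
from openmoa.evaluation import prequential_ssl_evaluation

stream = Electricity()
learner = NaiveBayes(schema=stream.get_schema())

results = prequential_ssl_evaluation(
    stream,
    learner,
    label_probability=0.05,   # 5% of instances arrive labeled
    initial_window_size=200,  # first 200 instances are fully labeled
)
print(results.cumulative.accuracy())
```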
5. prequential_evaluation_anomaly
anomaly detection
Evaluation for anomaly detection tasks, using AUC-based metrics.
```python
from openmoa.evaluation import prequential_evaluation_anomaly

results = prequential_evaluation_anomaly(
    stream,
    learner,
    max_instances=None,
    window_size=1000,
    optimise=True,
    store_predictions=False,
    store_y=False,
    progress_bar=False,
)
```
Returns: PrequentialResults object with anomaly detection metrics (AUC, sAUC)
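A minimal sketch; HalfSpaceTrees and the openmoa.anomaly module path are assumptions here, so substitute whichever anomaly detector your installation provides, along with a stream carrying a binary normal/anomalous label:

```python
from openmoa.anomaly import HalfSpaceTrees  # assumed detector; substitute your own
from openmoa.evaluation import prequential_evaluation_anomaly

detector = HalfSpaceTrees(schema=stream.get_schema())
results = prequential_evaluation_anomaly(stream, detector, max_instances=10000)

print(results.cumulative.auc())  # AUC over the whole run
print(results.windowed.auc())    # AUC per window
```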
6. Reading Evaluation Results
All evaluation functions return a PrequentialResults object. The two primary views are
.cumulative (scalar metrics over all instances) and .windowed (list of metrics,
one per window).
Attributes & Methods
| Attribute / Method | Returns | Description |
|---|---|---|
| .cumulative | Evaluator | Cumulative evaluator — metrics as scalars over all instances |
| .windowed | Evaluator | Windowed evaluator — metrics as lists, one value per window |
| .wallclock() | float | Total wall clock time in seconds |
| .cpu_time() | float | Total CPU time in seconds |
| .max_instances() | int | Total number of instances evaluated |
| .predictions() | list | All stored predictions (requires store_predictions=True) |
| .ground_truth_y() | list | All stored ground truth labels (requires store_y=True) |
| .metrics_per_window() | DataFrame | Pandas DataFrame of all windowed metrics |
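The windowed metrics are easiest to inspect as a DataFrame. A short sketch (column names follow the metric names and may differ by version; "accuracy" is assumed here):

```python
df = results.metrics_per_window()
print(df.head())             # one row per window, one column per metric
print(df["accuracy"].max())  # best-performing window (column name assumed)
```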
Export Results
```python
results.write_to_file(path="./output", directory_name="my_experiment")
# Writes cumulative and windowed metrics to CSV files
```
7. Metrics by Task Type
Classification Metrics
Available on ClassificationEvaluator and ClassificationWindowedEvaluator:
| Metric | Method | Description |
|---|---|---|
| Accuracy | .accuracy() | Percentage of correct predictions |
| Kappa Statistic | .kappa() | Cohen's Kappa (chance-corrected accuracy) |
| Kappa T | .kappa_t() | Temporal Kappa (compared to no-change predictor) |
| Kappa M | .kappa_m() | Kappa relative to a majority-class predictor |
| F1 Score | .f1_score() | Harmonic mean of precision and recall |
| Precision | .precision() | Positive predictive value |
| Recall | .recall() | True positive rate |
| Per-class F1 | .f1_score_N() | F1 score for class N |
| Per-class Precision | .precision_N() | Precision for class N |
| Per-class Recall | .recall_N() | Recall for class N |
| Instances | .instances() | Total number of instances processed |
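For example, with the results from the NaiveBayes run above (the per-class calls substitute the class index for N; the suffix form shown here is assumed, so check your version's API):

```python
ev = results.cumulative
print(ev.accuracy(), ev.kappa())       # overall metrics
print(ev.f1_score_0(), ev.recall_0())  # class-0 metrics (assumed _N naming)
```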
Regression Metrics
Available on RegressionEvaluator and RegressionWindowedEvaluator:
| Metric | Method | Description |
|---|---|---|
| MAE | .mae() | Mean Absolute Error |
| RMSE | .rmse() | Root Mean Squared Error |
| RMAE | .rmae() | Relative Mean Absolute Error |
| RRMSE | .rrmse() | Relative Root Mean Squared Error |
| R² | .r2() | Coefficient of Determination |
| Adjusted R² | .adjusted_r2() | R² adjusted for number of features |
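prequential_evaluation (section 2) handles regression learners as well. A sketch with placeholder names (Fried and KNNRegressor are assumptions; substitute a regression dataset and regressor from your installation):

```python
from openmoa.datasets import Fried          # placeholder regression dataset
from openmoa.regressor import KNNRegressor  # placeholder regressor
from openmoa.evaluation import prequential_evaluation

stream = Fried()
learner = KNNRegressor(schema=stream.get_schema())

results = prequential_evaluation(stream, learner, max_instances=10000)
print(results.cumulative.mae(), results.cumulative.rmse())
```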
Prediction Interval Metrics
Available on PredictionIntervalEvaluator:
| Metric | Method | Description |
|---|---|---|
| Coverage | .coverage() | % of true values falling within predicted intervals |
| Average Length | .average_length() | Mean width of prediction intervals |
| NMPIW | .nmpiw() | Normalized Mean Prediction Interval Width |
Anomaly Detection Metrics
Available on AnomalyDetectionEvaluator:
| Metric | Method | Description |
|---|---|---|
| AUC | .auc() | Area Under the ROC Curve |
| sAUC | .s_auc() | Sensitivity-adjusted AUC |
8. Evaluator Classes
For advanced use cases, evaluator classes can be instantiated and updated manually — giving you full control over the evaluation loop.
Classification Evaluator
```python
from openmoa.classifier import NaiveBayes
from openmoa.datasets import Electricity
from openmoa.evaluation import ClassificationEvaluator

stream = Electricity()
learner = NaiveBayes(schema=stream.get_schema())
evaluator = ClassificationEvaluator(schema=stream.get_schema(), window_size=1000)

for x, y in stream:
    y_pred = learner.predict(x)  # test first...
    evaluator.update(y_target_index=y, y_pred_index=y_pred)
    learner.train(x, y)          # ...then train

print(evaluator.accuracy())      # Cumulative accuracy
print(evaluator.metrics_dict())  # All metrics as dict
```
Regression Evaluator
```python
from openmoa.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(schema=stream.get_schema(), window_size=1000)

# Inside your test-then-train loop:
evaluator.update(y=actual_value, y_pred=predicted_value)
print(evaluator.mae(), evaluator.rmse())
```
Prediction Interval Evaluator
```python
from openmoa.evaluation import PredictionIntervalEvaluator

evaluator = PredictionIntervalEvaluator(schema=stream.get_schema())

# y_pred must be a 3-element sequence: [lower_bound, prediction, upper_bound]
evaluator.update(y=actual, y_pred=[lower, pred, upper])
print(evaluator.coverage(), evaluator.nmpiw())
```
Clustering Evaluator
```python
from openmoa.evaluation import ClusteringEvaluator

evaluator = ClusteringEvaluator(schema=stream.get_schema())
evaluator.update(clusterer=my_clusterer)
macro, micro = evaluator.get_measurements()
```
9. Specialized Evaluations
Online Continual Learning (OCL) Metrics
Located in openmoa.ocl.evaluation, the OCLMetrics dataclass provides
evaluation for multi-task continual learning streams:
- Anytime Accuracy — measured during training within each task
- Forward Transfer — how learning a task improves future tasks
- Backward Transfer — how learning new tasks affects performance on past tasks
- Per-task and aggregate accuracy matrices
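For reference, the transfer metrics follow the standard continual-learning formulation; the sketch below illustrates the definitions, not OpenMOA's internal code. R is an accuracy matrix where R[i][j] is the accuracy on task j after training on task i, and b holds baseline accuracies:

```python
def backward_transfer(R):
    # Mean change in accuracy on earlier tasks after training on all T tasks;
    # negative values indicate forgetting.
    T = len(R)
    return sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)

def forward_transfer(R, b):
    # Mean accuracy on each task just before training on it, relative to a
    # random-initialization baseline b[j] for task j.
    T = len(R)
    return sum(R[j - 1][j] - b[j] for j in range(1, T)) / (T - 1)
```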
Drift Detection Evaluation
Located in openmoa.drift.eval_detector, the EvaluateDetector class evaluates
drift detector performance against known ground-truth drift points:
| Metric | Description |
|---|---|
| Mean Time to Detect | Average delay between true drift and detection |
| Missed Detection Ratio | Fraction of true drifts that were missed |
| Mean Time Between False Alarms | Average gap between false positive detections |
```python
from openmoa.drift.eval_detector import EvaluateDetector

# Instance indices where the detector fired, and the ground-truth drift points
detected_positions = [1030, 2510, 4200]    # example values
true_drift_positions = [1000, 2500, 4000]  # example values

evaluator = EvaluateDetector(max_delay=200)
results = evaluator.calc_performance(
    preds=detected_positions,
    trues=true_drift_positions,
)
```
10. Fast Evaluation Mode
When optimise=True (default), OpenMOA automatically detects whether a stream–learner pair
is compatible with MOA's native EfficientEvaluationLoops. When compatible, this
significantly speeds up evaluation by running the full evaluation loop in Java rather than Python.
The fast path is skipped and evaluation falls back to the standard Python loop when any of the following applies:
- A progress bar is requested (progress_bar=True)
- A custom MOA evaluator is provided
- The stream or learner is Python-native (not a MOA wrapper)
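Assuming stream and learner are defined as in the earlier examples, the fast path can be kept or given up explicitly:

```python
from openmoa.evaluation import prequential_evaluation

# Fast Java loop (default) when the stream/learner pair is compatible
results = prequential_evaluation(stream, learner)

# Requesting a progress bar falls back to the Python loop; optimise=False forces it
results = prequential_evaluation(stream, learner, progress_bar=True)
results = prequential_evaluation(stream, learner, optimise=False)
```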
11. How Prequential Evaluation Works
Every instance passes through the same three-step cycle before the next instance arrives:

1. Test: the current model predicts the instance's label.
2. Update metrics: the prediction updates both the cumulative view (scalar, all instances) and the windowed view (list, one value per window).
3. Train: the model is updated with the labeled instance.