# Performance Evaluation

Now that we know how to generate counterfactual explanations in Julia, you may have a few follow-up questions: How do I know if the counterfactual search has been successful? How good is my counterfactual explanation? What does โgoodโ even mean in this context? In this tutorial, we will see how counterfactual explanations can be evaluated with respect to their performance.

## Default Measures

Numerous evaluation measures for counterfactual explanations have been proposed. In what follows, we will cover some of the most important measures.

### Single Measure, Single Counterfactual

One of the most important measures is validity, which simply determines whether or not a counterfactual explanation $x^{\prime}$ is valid in the sense that it yields the target prediction: $M(x^{\prime})=t$. We can evaluate the validity of a single counterfactual explanation ce using the Evaluation.evaluate function as follows:

using CounterfactualExplanations.Evaluation: evaluate, validity
evaluate(ce; measure=validity)
1-element Vector{Vector{Float64}}:
[1.0]

For a single counterfactual explanation, this evaluation measure can only take two values: it is either equal to 1, if the explanation is valid or 0 otherwise. Another important measure is distance, which relates to the distance between the factual $x$ and the counterfactual $x^{\prime}$. In the context of Algorithmic Recourse, higher distances are typically associated with higher costs to individuals seeking recourse.

using CounterfactualExplanations.Objectives: distance
evaluate(ce; measure=distance)
1-element Vector{Vector{Float32}}:
[3.2273161]

By default, distance computes the L2 (Euclidean) distance.

### Multiple Measures, Single Counterfactual

You might be interested in computing not just the L2 distance, but various LP norms. This can be done by supplying a vector of functions to the measure key argument. For convenience, all default distance measures have already been collected in a vector:

using CounterfactualExplanations.Evaluation: distance_measures
distance_measures
4-element Vector{Function}:
distance_l0 (generic function with 1 method)
distance_l1 (generic function with 1 method)
distance_l2 (generic function with 1 method)
distance_linf (generic function with 1 method)

We can use this vector of evaluation measures as follows:

evaluate(ce; measure=distance_measures)
4-element Vector{Vector{Float32}}:
[2.0]
[3.2273161]
[2.7737978]
[2.7285953]

If no measure is specified, the evaluate method will return all default measures,

evaluate(ce)
3-element Vector{Vector}:
[1.0]
Float32[3.2273161]
[0.0]

which include:

CounterfactualExplanations.Evaluation.default_measures
3-element Vector{Function}:
validity (generic function with 1 method)
distance (generic function with 1 method)
redundancy (generic function with 1 method)

### Multiple Measures and Counterfactuals

We can also evaluate multiple counterfactual explanations at once:

generator = DiCEGenerator()
ces = generate_counterfactual(x, target, counterfactual_data, M, generator; num_counterfactuals=5)
evaluate(ces)
3-element Vector{Vector}:
[1.0]
Float32[3.1955845]
[[0.0, 0.0, 0.0, 0.0, 0.0]]

By default, each evaluation measure is aggregated across all counterfactual explanations. To return individual measures for each counterfactual explanation you can specify report_each=true

evaluate(ces; report_each=true)
3-element Vector{Vector}:
BitVector[[1, 1, 1, 1, 1]]
Vector{Float32}[[3.3671722, 3.1028512, 3.2829392, 3.0728922, 3.1520686]]
[[0.0, 0.0, 0.0, 0.0, 0.0]]

## Custom Measures

A measure is just a method that takes a CounterfactualExplanation as its only positional argument and agg::Function as a key argument specifying how measures should be aggregated across counterfactuals. Defining custom measures is therefore straightforward. For example, we could define a measure to compute the inverse target probability as follows:

my_measure(ce::CounterfactualExplanation; agg=mean) = agg(1 .- CounterfactualExplanations.target_probs(ce))
evaluate(ce; measure=my_measure)
1-element Vector{Vector{Float32}}:
[0.41711217]

## Tidy Output

By default, evaluate returns vectors of evaluation measures. The optional key argument output_format::Symbol can be used to post-process the output in two ways: firstly, to return the output as a dictionary, specify output_format=:Dict:

evaluate(ces; output_format=:Dict, report_each=true)
Dict{Symbol, Vector} with 3 entries:
:validity   => BitVector[[1, 1, 1, 1, 1]]
:redundancy => [[0.0, 0.0, 0.0, 0.0, 0.0]]
:distance   => Vector{Float32}[[3.36717, 3.10285, 3.28294, 3.07289, 3.15207]]

Secondly, to return the output as a data frame, specify output_format=:DataFrame.

evaluate(ces; output_format=:DataFrame, report_each=true)

By default, data frames are pivoted to long format using individual counterfactuals as the id column. This behaviour can be suppressed by specifying pivot_longer=false.

## Multiple Counterfactual Explanations

It may be necessary to generate counterfactual explanations for multiple individuals.

Below, for example, we first select multiple samples (5) from the non-target class and then generate counterfactual explanations for all of them.

# Factual and target:
ids = rand(findall(predict_label(M, counterfactual_data) .== factual), n_individuals)
xs = select_factual(counterfactual_data, ids)
ces = generate_counterfactual(xs, target, counterfactual_data, M, generator; num_counterfactuals=5)
evaluation = evaluate(ces)
15ร4 DataFrame
Row โ sample  num_counterfactual  variable    value
โ Int64   Int64               String      Any
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ      1                   1  distance    3.35118
2 โ      1                   1  redundancy  [0.0, 0.0, 0.0, 0.0, 0.0]
3 โ      1                   1  validity    1.0
4 โ      2                   1  distance    2.64059
5 โ      2                   1  redundancy  [0.0, 0.0, 0.0, 0.0, 0.0]
6 โ      2                   1  validity    1.0
7 โ      3                   1  distance    2.93501
8 โ      3                   1  redundancy  [0.0, 0.0, 0.0, 0.0, 0.0]
9 โ      3                   1  validity    1.0
10 โ      4                   1  distance    3.53484
11 โ      4                   1  redundancy  [0.0, 0.0, 0.0, 0.0, 0.0]
12 โ      4                   1  validity    1.0
13 โ      5                   1  distance    3.9374
14 โ      5                   1  redundancy  [0.0, 0.0, 0.0, 0.0, 0.0]
15 โ      5                   1  validity    1.0

This leads us to our next topic: Performance Benchmarks.