Human Evaluation

Human Evaluation Standardization

To make human evaluation rigorous and consistent across raters, we define the criteria for each measurement as follows:

Measurement Criteria

Semantic Consistency (SC)
- Possible scores: [0, 0.5, 1]
- Description: Measures how well the generated image is consistent with the conditions provided (e.g. prompt, subject token, etc.).
Perceptual Quality (PQ)
- Possible scores: [0, 0.5, 1]
- Description: Measures how visually convincing the generated image is and how natural it looks.

Meaning of Semantic Consistency (SC) score

SC=0: The image does not follow one or more of the conditions at all (e.g. it ignores the prompt entirely, changes the background in an editing task, shows the wrong subject in a subject-driven task, etc.)
SC=0.5: Every condition is followed at least partly, but not every condition is mostly followed.
SC=1: The rater agrees that the overall idea is correct.

Meaning of Perceptual Quality (PQ) score

PQ=0: The rater spots obvious distortion or artifacts at first glance, and those distortions make the objects unrecognizable.
PQ=0.5: The rater finds that the image gives off an unnatural sense, or spots some minor artifacts while the objects remain recognizable.
PQ=1: The rater agrees that the resulting image looks genuine.

Examples of artifacts and of unusual sense:

Artifacts:
- Distortion
- Watermark
- Scratches
- Blurred faces
- Unusual body parts
- Subjects not harmonized
Unusual Sense:
- Wrong sense of distance (subject too big or too small compared to others)
- Wrong shadow
- Wrong lighting, etc.

Implementation of Human Evaluation

In practice, we require raters to strictly follow the tables below when rating.

Each image is rated as a list value [SC, PQ].
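
As a minimal sketch, a rating of this form could be checked before it is recorded. The names validate_rating and ALLOWED_SCORES are hypothetical and are not part of any existing tool; the only thing taken from this guide is that SC and PQ must each be 0, 0.5, or 1.

```python
# Hypothetical helper: checks that a rating has the [SC, PQ] shape described above.
ALLOWED_SCORES = (0, 0.5, 1)


def validate_rating(rating):
    """Return (SC, PQ) as floats, or raise ValueError if the rating is malformed."""
    if not isinstance(rating, (list, tuple)) or len(rating) != 2:
        raise ValueError(f"expected [SC, PQ], got {rating!r}")
    sc, pq = rating
    for name, value in (("SC", sc), ("PQ", pq)):
        # bool is rejected explicitly because True == 1 and False == 0 in Python.
        if isinstance(value, bool) or value not in ALLOWED_SCORES:
            raise ValueError(f"{name} must be one of {ALLOWED_SCORES}, got {value!r}")
    return float(sc), float(pq)
```

For example, validate_rating([1, 0.5]) accepts the rating, while validate_rating([0.3, 1]) raises ValueError because 0.3 is not an allowed score.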

SC Rating Table

Condition 1	Condition 2 (if applicable)	Condition 3 (if applicable)	SC rating
no following at all	Any	Any	0
Any	no following at all	Any	0
Any	Any	no following at all	0
following some part	following some or most part	following some or most part	0.5
following some or most part	following some part	following some or most part	0.5
following some or most part	following some or most part	following some part	0.5
following most part	following most part	following most part	1
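
The SC table above can be read as a simple rule: any condition that is not followed at all gives 0, all conditions mostly followed gives 1, and everything in between gives 0.5. The sketch below encodes that reading. The function name sc_rating and the short labels "none", "some", and "most" (standing for "no following at all", "following some part", and "following most part") are assumptions made for illustration.

```python
# Hypothetical shorthand for the three per-condition levels used in the SC table.
LEVELS = ("none", "some", "most")


def sc_rating(*levels):
    """Combine one level per applicable condition into a single SC score."""
    if not levels:
        raise ValueError("at least one condition is required")
    for level in levels:
        if level not in LEVELS:
            raise ValueError(f"unknown level {level!r}, expected one of {LEVELS}")
    if "none" in levels:
        return 0.0  # any condition not followed at all
    if all(level == "most" for level in levels):
        return 1.0  # every condition mostly followed
    return 0.5      # every condition followed at least partly, at least one only partly
```

Because the function takes a variable number of arguments, the same rule covers tasks with one, two, or three conditions, e.g. sc_rating("most", "some") for a two-condition task.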

PQ Rating Table

Objects in image	Artifacts	Unusual sense	PQ rating
Unrecognizable	serious	Any	0
Recognizable	some	Any	0.5
Recognizable	Any	some	0.5
Recognizable	none	little or None	1
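
The PQ table can be sketched as a lookup function in the same way. The name pq_rating and its argument labels are hypothetical. Note that the table does not list every combination (for instance, recognizable objects with serious artifacts and no unusual sense), so this sketch raises an error for unlisted combinations rather than guessing a score.

```python
def pq_rating(recognizable, artifacts, unusual_sense):
    """Look up the PQ score for one image.

    recognizable:  True if the objects in the image are recognizable
    artifacts:     "none", "some", or "serious"
    unusual_sense: "none", "little", or "some"
    """
    if artifacts not in ("none", "some", "serious"):
        raise ValueError(f"unknown artifacts level {artifacts!r}")
    if unusual_sense not in ("none", "little", "some"):
        raise ValueError(f"unknown unusual-sense level {unusual_sense!r}")

    if not recognizable and artifacts == "serious":
        return 0.0  # row 1: unrecognizable objects, serious artifacts
    if recognizable and artifacts == "none" and unusual_sense in ("none", "little"):
        return 1.0  # row 4: clean and natural-looking
    if recognizable and (artifacts == "some" or unusual_sense == "some"):
        return 0.5  # rows 2 and 3: some artifacts or some unusual sense
    # Assumption: combinations the table does not list are left to the rater.
    raise ValueError("combination not covered by the PQ rating table")
```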

Collecting Human Evaluation Data

The results folder should contain a dataset_lookup.csv file. Raters record their [SC, PQ] ratings in this file, one row per image and one column per model:

uid	TheModel
sample_1.jpg	[0, 1]
sample_2.jpg	[1, 1]
sample_3.jpg	[1, 0.5]
…	…
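
A minimal sketch of reading such a table back into Python is shown below. It assumes a comma-delimited file in which each rating cell is quoted (because the cell itself contains a comma) and in which the first column is named uid; the actual delimiter and quoting of dataset_lookup.csv may differ, so the delimiter is left as a parameter. The function name load_ratings is hypothetical.

```python
import ast
import csv
import io


def load_ratings(stream, delimiter=","):
    """Map each uid to {model_name: (SC, PQ)} from a dataset_lookup-style table."""
    ratings = {}
    for row in csv.DictReader(stream, delimiter=delimiter):
        uid = row.pop("uid")
        ratings[uid] = {
            # literal_eval turns the text "[1, 0.5]" into the list [1, 0.5].
            model: tuple(ast.literal_eval(cell))
            for model, cell in row.items()
            if cell  # skip cells that have not been filled in yet
        }
    return ratings


# In-memory example using the rows shown above (assumed quoting).
example = io.StringIO(
    'uid,TheModel\n'
    'sample_1.jpg,"[0, 1]"\n'
    'sample_2.jpg,"[1, 1]"\n'
    'sample_3.jpg,"[1, 0.5]"\n'
)
ratings = load_ratings(example)
```

With a real file, the stream would come from open(path, newline="") instead of io.StringIO.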

Task-specific Guidelines

Some predefined conditions:

Condition A: Does the generated image follow the prompt?
Condition B: Does the edited image follow the instruction?
Condition C: Is the edit minimal, leaving the background unchanged?
Condition D_i: Does Object_i in the image follow the subject token?
Condition E: Does the image follow the control guidance?
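
Each task in the sections that follow uses a subset of these conditions. As a sketch, that assignment can be kept in a lookup table so a rating script knows which questions to show for a task. The names CONDITIONS, TASK_CONDITIONS, and conditions_for are hypothetical, and the short condition descriptions are paraphrases rather than official wording.

```python
# Short paraphrases of the predefined conditions (not official wording).
CONDITIONS = {
    "A": "follows the prompt",
    "B": "follows the editing instruction",
    "C": "minimal edit, background unchanged",
    "D": "object follows the subject token",
    "E": "follows the control guidance",
}

# Conditions used by each task, as listed in the task-specific sections.
TASK_CONDITIONS = {
    "Text-guided Image Generation": ("A",),
    "Mask-guided Image Editing": ("B", "C"),
    "Text-guided Image Editing": ("B", "C"),
    "Subject-driven Image Generation": ("A", "D"),
    "Subject-driven Image Editing": ("C", "D"),
    "Multi-concept Image Composition": ("A", "D_1", "D_2"),
    "Control-guided Image Generation": ("A", "E"),
}


def conditions_for(task):
    """Return (condition id, description) pairs for a task; D_1, D_2 share D's text."""
    return [(cid, CONDITIONS[cid.split("_")[0]]) for cid in TASK_CONDITIONS[task]]
```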

Text-guided Image Generation (also known as Text-to-Image)

Condition A: Does the generated image follow the prompt?

SC Table:

Condition A	SC rating
no following at all	0
following some part	0.5
following most part	1

PQ Table:

Objects in image	Artifacts	Unusual sense	PQ rating
Unrecognizable	serious	Any	0
Recognizable	some	Any	0.5
Recognizable	Any	some	0.5
Recognizable	none	little or None	1

Mask-guided Image Editing

Condition B: Does the edited image follow the instruction?
Condition C: Is the edit minimal, leaving the background unchanged?

SC Table:

Condition B	Condition C	SC rating
no following at all	Any	0
Any	background completely changed	0
following some part	with a few overedit or mostly minimal	0.5
following some or most part	with a few overedit	0.5
following most part	mostly minimal	1

PQ Table:

Objects in image	Artifacts	Unusual sense	PQ rating
Unrecognizable	serious	Any	0
Recognizable	some	Any	0.5
Recognizable	Any	some	0.5
Recognizable	none	little or None	1

Text-guided Image Editing

Condition B: Does the edited image follow the instruction?
Condition C: Is the edit minimal, leaving the background unchanged?

SC Table:

Condition B	Condition C	SC rating
no following at all	Any	0
Any	background completely changed	0
following some part	with a few overedit or mostly minimal	0.5
following some or most part	with a few overedit	0.5
following most part	mostly minimal	1

PQ Table:

Objects in image	Artifacts	Unusual sense	PQ rating
Unrecognizable	serious	Any	0
Recognizable	some	Any	0.5
Recognizable	Any	some	0.5
Recognizable	none	little or None	1

Subject-driven Image Generation

Condition A: Does the generated image follow the prompt?
Condition D: Does the object in the image follow the subject token?

SC Table:

Condition A	Condition D	SC rating
no following at all	Any	0
Any	no following at all	0
following some part	following some or most part	0.5
following some or most part	following some part	0.5
following most part	following most part	1

PQ Table:

Objects in image	Artifacts	Unusual sense	PQ rating
Unrecognizable	serious	Any	0
Recognizable	some	Any	0.5
Recognizable	Any	some	0.5
Recognizable	none	little or None	1

Subject-driven Image Editing

Condition C: Is the edit minimal, leaving the background unchanged?
Condition D: Does the object in the image follow the subject token?

SC Table:

Condition C	Condition D	SC rating
no following at all	Any	0
Any	no following at all	0
following some part	following some or most part	0.5
following some or most part	following some part	0.5
following most part	following most part	1

PQ Table:

Objects in image	Artifacts	Unusual sense	PQ rating
Unrecognizable	serious	Any	0
Recognizable	some	Any	0.5
Recognizable	Any	some	0.5
Recognizable	none	little or None	1

Multi-concept Image Composition

Condition A: Does the generated image follow the prompt?
Condition D_i: Does Object_i in the image follow the subject token?

SC Table:

Condition A	Condition D_1	Condition D_2	SC rating
no following at all	Any	Any	0
Any	no following at all	Any	0
Any	Any	no following at all	0
following some part	following some or most part	following some or most part	0.5
following some or most part	following some part	following some or most part	0.5
following some or most part	following some or most part	following some part	0.5
following most part	following most part	following most part	1

PQ Table:

Objects in image	Artifacts	Unusual sense	PQ rating
Unrecognizable	serious	Any	0
Recognizable	some	Any	0.5
Recognizable	Any	some	0.5
Recognizable	none	little or None	1

Control-guided Image Generation

Condition A: Does the generated image follow the prompt?
Condition E: Does the image follow the control guidance?

SC Table:

Condition A	Condition E	SC rating
no following at all	Any	0
Any	no following at all	0
following some part	following some or most part	0.5
following some or most part	following some part	0.5
following most part	following most part	1

PQ Table:

Objects in image	Artifacts	Unusual sense	PQ rating
Unrecognizable	serious	Any	0
Recognizable	some	Any	0.5
Recognizable	Any	some	0.5
Recognizable	none	little or None	1

Statistical Tools for Human Evaluation Data

(Under Construction)