Human Evaluation

Human Evaluation Standardization

To standardize how a rigorous human evaluation is conducted, we stipulate the criteria for each measurement as follows:

Measurement Criteria

  • Semantic Consistency (SC)

    • Score values: one of [0, 0.5, 1]

    • Description: Measures how coherent the generated image is with the conditions provided (e.g. prompt, subject token).

  • Perceptual Quality (PQ)

    • Score values: one of [0, 0.5, 1]

    • Description: Measures how visually convincing and natural the generated image looks.

Meaning of Semantic Consistency (SC) score

  • SC=0: The image does not follow one or more of the conditions at all (e.g. not following the prompt at all, a different background in an editing task, the wrong subject in a subject-driven task).

  • SC=0.5: The image follows every condition at least partly, but follows at least one of them only partly.

  • SC=1: The rater agrees that the overall idea is correct, i.e. the image follows most of every condition.

Meaning of Perceptual Quality (PQ) score

  • PQ=0: The rater spotted obvious distortion or artifacts at first glance, and those distortions make the objects unrecognizable.

  • PQ=0.5: The rater found that the image looks unnatural, or spotted some minor artifacts, while the objects are still recognizable.

  • PQ=1: The rater agrees that the resulting image looks genuine.

Artifacts and unusual sense refer to the following:

  • Artifacts:

    • Distortion

    • Watermark

    • Scratches

    • Blurred faces

    • Unusual body parts

    • Subjects not harmonized

  • Unusual Sense:

    • Wrong sense of distance (subject too big or too small compared to others)

    • Wrong shadow

    • Wrong lighting, etc.

Implementation of Human Evaluation

  • In practice, we require raters to strictly follow the tables below when rating.

Each image is rated as a list value [SC, PQ].

SC Rating Table

| Condition 1 | Condition 2 (if applicable) | Condition 3 (if applicable) | SC rating |
|---|---|---|---|
| no following at all | Any | Any | 0 |
| Any | no following at all | Any | 0 |
| Any | Any | no following at all | 0 |
| following some part | following some or most part | following some or most part | 0.5 |
| following some or most part | following some part | following some or most part | 0.5 |
| following some part or more | following some or most part | following some part | 0.5 |
| following most part | following most part | following most part | 1 |
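The SC Rating Table can be read as a simple rule: any condition that is not followed at all gives 0, all conditions mostly followed gives 1, and every other combination gives 0.5. The sketch below encodes that reading in Python. The level names `none`, `some`, `most` and the function `sc_rating` are hypothetical and are not part of the guideline itself.

```python
from typing import Sequence

# Hypothetical labels for how well a single condition is followed:
# "none" = no following at all, "some" = following some part,
# "most" = following most part.
LEVELS = ("none", "some", "most")


def sc_rating(levels: Sequence[str]) -> float:
    """Map per-condition levels to an SC score, following the SC Rating Table."""
    if not levels:
        raise ValueError("at least one condition is required")
    for level in levels:
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level!r}")
    if "none" in levels:
        # Any condition not followed at all forces SC = 0.
        return 0.0
    if all(level == "most" for level in levels):
        # Every condition is mostly followed.
        return 1.0
    # At least one condition is only partly followed.
    return 0.5
```

For example, `sc_rating(["most", "some"])` falls into the middle band, while a single `"none"` among otherwise well-followed conditions still yields 0.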

PQ Rating Table

| Objects in image | Artifacts | Unusual sense | PQ rating |
|---|---|---|---|
| Unrecognizable | serious | Any | 0 |
| Recognizable | some | Any | 0.5 |
| Recognizable | Any | some | 0.5 |
| Recognizable | none | little or None | 1 |
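The PQ Rating Table can likewise be treated as a first-match lookup, with "Any" acting as a wildcard. The following Python sketch is one way to express it. The names `PQ_ROWS` and `pq_rating` and the string labels are hypothetical. Combinations the table does not list (for example, recognizable objects with serious artifacts and no unusual sense) raise an error here rather than guessing a score.

```python
ANY = None  # wildcard, mirrors "Any" in the table

# Each row: (objects recognizable?, artifacts, unusual sense, PQ rating).
# Hypothetical labels: artifacts in {"none", "some", "serious"},
# unusual sense in {"little_or_none", "some"}.
PQ_ROWS = [
    (False, "serious", ANY, 0.0),
    (True, "some", ANY, 0.5),
    (True, ANY, "some", 0.5),
    (True, "none", "little_or_none", 1.0),
]


def pq_rating(recognizable: bool, artifacts: str, unusual_sense: str) -> float:
    """Return the PQ score of the first table row matching the observation."""
    observed = (recognizable, artifacts, unusual_sense)
    for *pattern, score in PQ_ROWS:
        if all(p is ANY or p == o for p, o in zip(pattern, observed)):
            return score
    # The table does not cover this combination; leave it to the rater.
    raise ValueError(f"combination not covered by the PQ table: {observed}")
```

Raising on uncovered combinations is a design choice of this sketch: it keeps the code from inventing ratings that the table does not define.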

Collecting Human Evaluation Data

The results folder should contain a dataset_lookup.csv file. Edit it to record the [SC, PQ] rating of each image, for example:

| uid | TheModel |
|---|---|
| sample_1.jpg | [0, 1] |
| sample_2.jpg | [1, 1] |
| sample_3.jpg | [1, 0.5] |
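Once the ratings are filled in, they can be loaded and validated programmatically. The sketch below is a minimal illustration in Python, assuming the file is comma-separated with a `uid` column followed by one column per model, and that each rating cell is a quoted list such as `"[1, 0.5]"`. The exact on-disk layout is an assumption, and `load_ratings` is a hypothetical helper, not part of any existing tool.

```python
import csv
import io
import json

# Assumed file contents, mirroring the example table above.
SAMPLE_CSV = '''uid,TheModel
sample_1.jpg,"[0, 1]"
sample_2.jpg,"[1, 1]"
sample_3.jpg,"[1, 0.5]"
'''

ALLOWED_SCORES = {0, 0.5, 1}


def load_ratings(handle):
    """Return {model: {uid: (sc, pq)}} from a dataset_lookup-style CSV."""
    ratings = {}
    for row in csv.DictReader(handle):
        uid = row.pop("uid")
        for model, cell in row.items():
            sc, pq = json.loads(cell)  # each cell holds a list [SC, PQ]
            if sc not in ALLOWED_SCORES or pq not in ALLOWED_SCORES:
                raise ValueError(f"invalid rating for {uid}/{model}: {cell}")
            ratings.setdefault(model, {})[uid] = (sc, pq)
    return ratings


# Read from an in-memory string here; a real file would be opened with
# open(path, newline="") instead.
ratings = load_ratings(io.StringIO(SAMPLE_CSV))
```

Validating against the three allowed scores at load time catches typos such as `0.25` before any later analysis.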

Task-specific Guidelines

Some predefined conditions:

  • Condition A: Is the Image generation following the prompt?

  • Condition B: Is the Image editing following the instruction?

  • Condition C: Is the Image performing minimal edit without changing the background?

  • Condition D_i: Is the Object_i in the Image following the token subject?

  • Condition E: Is the Image following the control guidance?
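Each task in this section checks a subset of these conditions. As a quick reference, the task-to-condition mapping used by the per-task SC tables can be written as a lookup table. The Python names `TASK_CONDITIONS` and `conditions_for` below are hypothetical and serve only as an illustration.

```python
# Conditions checked by each task's SC table, as listed in this section.
TASK_CONDITIONS = {
    "Text-guided Image Generation": ("A",),
    "Mask-guided Image Editing": ("B", "C"),
    "Text-guided Image Editing": ("B", "C"),
    "Subject-driven Image Generation": ("A", "D"),
    "Subject-driven Image Editing": ("C", "D"),
    "Multi-concept Image Composition": ("A", "D_1", "D_2"),
    "Control-guided Image Generation": ("A", "E"),
}


def conditions_for(task: str) -> tuple:
    """Return the condition labels a rater must judge for the given task."""
    try:
        return TASK_CONDITIONS[task]
    except KeyError:
        raise ValueError(f"unknown task: {task!r}") from None
```

A rating interface could use such a mapping to show raters only the conditions relevant to the task at hand.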

Text-guided Image Generation (also known as Text-to-Image)

  • Condition A: Is the Image generation following the prompt?

SC Table:

| Condition A | SC rating |
|---|---|
| no following at all | 0 |
| following some part | 0.5 |
| following most part | 1 |

PQ Table:

| Objects in image | Artifacts | Unusual sense | PQ rating |
|---|---|---|---|
| Unrecognizable | serious | Any | 0 |
| Recognizable | some | Any | 0.5 |
| Recognizable | Any | some | 0.5 |
| Recognizable | none | little or None | 1 |

Mask-guided Image Editing

  • Condition B: Is the Image editing following the instruction?

  • Condition C: Is the Image performing minimal edit without changing the background?

SC Table:

| Condition B | Condition C | SC rating |
|---|---|---|
| no following at all | Any | 0 |
| Any | background completely changed | 0 |
| following some part | with a few overedit or mostly minimal | 0.5 |
| following some or most part | with a few overedit | 0.5 |
| following most part | mostly minimal | 1 |

PQ Table:

| Objects in image | Artifacts | Unusual sense | PQ rating |
|---|---|---|---|
| Unrecognizable | serious | Any | 0 |
| Recognizable | some | Any | 0.5 |
| Recognizable | Any | some | 0.5 |
| Recognizable | none | little or None | 1 |

Text-guided Image Editing

  • Condition B: Is the Image editing following the instruction?

  • Condition C: Is the Image performing minimal edit without changing the background?

SC Table:

| Condition B | Condition C | SC rating |
|---|---|---|
| no following at all | Any | 0 |
| Any | background completely changed | 0 |
| following some part | with a few overedit or mostly minimal | 0.5 |
| following some or most part | with a few overedit | 0.5 |
| following most part | mostly minimal | 1 |

PQ Table:

| Objects in image | Artifacts | Unusual sense | PQ rating |
|---|---|---|---|
| Unrecognizable | serious | Any | 0 |
| Recognizable | some | Any | 0.5 |
| Recognizable | Any | some | 0.5 |
| Recognizable | none | little or None | 1 |

Subject-driven Image Generation

  • Condition A: Is the Image generation following the prompt?

  • Condition D: Is the Object in the Image following the token subject?

SC Table:

| Condition A | Condition D | SC rating |
|---|---|---|
| no following at all | Any | 0 |
| Any | no following at all | 0 |
| following some part | following some or most part | 0.5 |
| following some or most part | following some part | 0.5 |
| following most part | following most part | 1 |

PQ Table:

| Objects in image | Artifacts | Unusual sense | PQ rating |
|---|---|---|---|
| Unrecognizable | serious | Any | 0 |
| Recognizable | some | Any | 0.5 |
| Recognizable | Any | some | 0.5 |
| Recognizable | none | little or None | 1 |

Subject-driven Image Editing

  • Condition C: Is the Image performing minimal edit without changing the background?

  • Condition D: Is the Object in the Image following the token subject?

SC Table:

| Condition C | Condition D | SC rating |
|---|---|---|
| no following at all | Any | 0 |
| Any | no following at all | 0 |
| following some part | following some or most part | 0.5 |
| following some or most part | following some part | 0.5 |
| following most part | following most part | 1 |

PQ Table:

| Objects in image | Artifacts | Unusual sense | PQ rating |
|---|---|---|---|
| Unrecognizable | serious | Any | 0 |
| Recognizable | some | Any | 0.5 |
| Recognizable | Any | some | 0.5 |
| Recognizable | none | little or None | 1 |

Multi-concept Image Composition

  • Condition A: Is the Image generation following the prompt?

  • Condition D_i: Is the Object_i in the Image following the token subject?

SC Table:

| Condition A | Condition D_1 | Condition D_2 | SC rating |
|---|---|---|---|
| no following at all | Any | Any | 0 |
| Any | no following at all | Any | 0 |
| Any | Any | no following at all | 0 |
| following some part | following some or most part | following some or most part | 0.5 |
| following some or most part | following some part | following some or most part | 0.5 |
| following some part or more | following some or most part | following some part | 0.5 |
| following most part | following most part | following most part | 1 |

PQ Table:

| Objects in image | Artifacts | Unusual sense | PQ rating |
|---|---|---|---|
| Unrecognizable | serious | Any | 0 |
| Recognizable | some | Any | 0.5 |
| Recognizable | Any | some | 0.5 |
| Recognizable | none | little or None | 1 |

Control-guided Image Generation

  • Condition A: Is the Image generation following the prompt?

  • Condition E: Is the Image following the control guidance?

SC Table:

| Condition A | Condition E | SC rating |
|---|---|---|
| no following at all | Any | 0 |
| Any | no following at all | 0 |
| following some part | following some or most part | 0.5 |
| following some or most part | following some part | 0.5 |
| following most part | following most part | 1 |

PQ Table:

| Objects in image | Artifacts | Unusual sense | PQ rating |
|---|---|---|---|
| Unrecognizable | serious | Any | 0 |
| Recognizable | some | Any | 0.5 |
| Recognizable | Any | some | 0.5 |
| Recognizable | none | little or None | 1 |

Statistical Tools for Human Evaluation Data

(Under Construction)