# Human Evaluation

## Human Evaluation Standardization

To standardize rigorous human evaluation, we define the criteria for each measurement as follows:

### Measurement Criteria

- **Semantic Consistency (SC)**
  - **Score in Range:** [0, 0.5, 1]
  - **Description:** Measures how well the generated image agrees with the provided conditions (e.g. prompt, subject token).
- **Perceptual Quality (PQ)**
  - **Score in Range:** [0, 0.5, 1]
  - **Description:** Measures how visually convincing and natural the generated image looks.

### Meaning of Semantic Consistency (SC) score

- **SC=0**: The image does not follow one or more of the conditions at all (e.g. it ignores the prompt, has a different background in an editing task, or shows the wrong subject in a subject-driven task).
- **SC=0.5**: Every condition is followed at least in part, but not all of them are mostly followed.
- **SC=1**: The rater agrees that the overall idea is correct.

### Meaning of Perceptual Quality (PQ) score

- **PQ=0**: The rater spots obvious distortion or artifacts at first glance, and these distortions make the objects unrecognizable.
- **PQ=0.5**: The rater finds that the image looks unnatural, or spots minor artifacts while the objects remain recognizable.
- **PQ=1**: The rater agrees that the resulting image looks genuine.

**Artifacts and unusual sense are defined as follows:**

- **Artifacts**:
  - Distortion
  - Watermark
  - Scratches
  - Blurred faces
  - Unusual body parts
  - Subjects not harmonized
- **Unusual Sense**:
  - Wrong sense of distance (subject too big or too small compared to others)
  - Wrong shadow
  - Wrong lighting, etc.

## Implementation of Human Evaluation

* During evaluation, raters must strictly follow the rating tables below. Each image is rated as a list value `[SC, PQ]`.

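
The SC and PQ definitions above can be read as two small decision rules. The sketch below is one possible encoding, not an official tool: the level labels `none`, `some`, `most` and the function names `sc_rating` and `pq_rating` are hypothetical. It assumes that SC is limited by the worst-followed condition, and that any case the guideline does not spell out (for example, recognizable objects with serious artifacts) falls back to 0.5.

```python
from typing import Sequence

# Hypothetical labels for how well a single condition is followed.
LEVEL_SCORE = {"none": 0.0, "some": 0.5, "most": 1.0}


def sc_rating(levels: Sequence[str]) -> float:
    """Return SC as the score of the worst-followed condition."""
    if not levels:
        raise ValueError("at least one condition is required")
    return min(LEVEL_SCORE[level] for level in levels)


def pq_rating(recognizable: bool, artifacts: str, unusual_sense: bool) -> float:
    """Return PQ from recognizability, artifact severity, and unusual sense.

    `artifacts` is one of "none", "some", "serious" (hypothetical labels).
    """
    if artifacts not in {"none", "some", "serious"}:
        raise ValueError(f"unknown artifact level: {artifacts!r}")
    if not recognizable:
        return 0.0
    # Assumption: any artifact or unnatural impression caps the score at 0.5.
    if artifacts != "none" or unusual_sense:
        return 0.5
    return 1.0


# One image judged on two conditions, giving a `[SC, PQ]` rating.
rating = [sc_rating(["most", "some"]), pq_rating(True, "none", False)]
```

Taking the minimum keeps the rule identical for one, two, or three conditions, so the same function can serve every task type.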
### SC Rating Table

| Condition 1                 | Condition 2 (if applicable) | Condition 3 (if applicable) | SC rating |
|-----------------------------|-----------------------------|-----------------------------|-----------|
| no following at all         | Any                         | Any                         | 0         |
| Any                         | no following at all         | Any                         | 0         |
| Any                         | Any                         | no following at all         | 0         |
| following some part         | following some or most part | following some or most part | 0.5       |
| following some or most part | following some part         | following some or most part | 0.5       |
| following some part or more | following some or most part | following some part         | 0.5       |
| following most part         | following most part         | following most part         | 1         |

### PQ Rating Table

| Objects in image | Artifacts | Unusual sense  | PQ rating |
|------------------|-----------|----------------|-----------|
| Unrecognizable   | serious   | Any            | 0         |
| Recognizable     | some      | Any            | 0.5       |
| Recognizable     | Any       | some           | 0.5       |
| Recognizable     | none      | little or None | 1         |

### Collecting Human Evaluation Data

In the `results` folder, there should be a `dataset_lookup.csv` file to edit.

| uid          | TheModel |
|--------------|----------|
| sample_1.jpg | [0, 1]   |
| sample_2.jpg | [1, 1]   |
| sample_3.jpg | [1, 0.5] |
| ...          | ...      |

## Task-specific Guidelines

Some predefined conditions:

* Condition A: Is the Image generation following the prompt?
* Condition B: Is the Image editing following the instruction?
* Condition C: Is the Image performing minimal edit without changing the background?
* Condition D_i: Is the Object_i in the Image following the token subject?
* Condition E: Is the Image following the control guidance?

### Text-guided Image Generation (known as Text-to-Image)

* Condition A: Is the Image generation following the prompt?


SC Table:

| Condition A         | SC rating |
|---------------------|-----------|
| no following at all | 0         |
| following some part | 0.5       |
| following most part | 1         |

PQ Table:

| Objects in image | Artifacts | Unusual sense  | PQ rating |
|------------------|-----------|----------------|-----------|
| Unrecognizable   | serious   | Any            | 0         |
| Recognizable     | some      | Any            | 0.5       |
| Recognizable     | Any       | some           | 0.5       |
| Recognizable     | none      | little or None | 1         |

### Mask-guided Image Editing

* Condition B: Is the Image editing following the instruction?
* Condition C: Is the Image performing minimal edit without changing the background?

SC Table:

| Condition B                 | Condition C                           | SC rating |
|-----------------------------|---------------------------------------|-----------|
| no following at all         | Any                                   | 0         |
| Any                         | background completely changed         | 0         |
| following some part         | with a few overedit or mostly minimal | 0.5       |
| following some or most part | with a few overedit                   | 0.5       |
| following most part         | mostly minimal                        | 1         |

PQ Table:

| Objects in image | Artifacts | Unusual sense  | PQ rating |
|------------------|-----------|----------------|-----------|
| Unrecognizable   | serious   | Any            | 0         |
| Recognizable     | some      | Any            | 0.5       |
| Recognizable     | Any       | some           | 0.5       |
| Recognizable     | none      | little or None | 1         |

### Text-guided Image Editing

* Condition B: Is the Image editing following the instruction?
* Condition C: Is the Image performing minimal edit without changing the background?


SC Table:

| Condition B                 | Condition C                           | SC rating |
|-----------------------------|---------------------------------------|-----------|
| no following at all         | Any                                   | 0         |
| Any                         | background completely changed         | 0         |
| following some part         | with a few overedit or mostly minimal | 0.5       |
| following some or most part | with a few overedit                   | 0.5       |
| following most part         | mostly minimal                        | 1         |

PQ Table:

| Objects in image | Artifacts | Unusual sense  | PQ rating |
|------------------|-----------|----------------|-----------|
| Unrecognizable   | serious   | Any            | 0         |
| Recognizable     | some      | Any            | 0.5       |
| Recognizable     | Any       | some           | 0.5       |
| Recognizable     | none      | little or None | 1         |

### Subject-driven Image Generation

* Condition A: Is the Image generation following the prompt?
* Condition D: Is the Object in the Image following the token subject?

SC Table:

| Condition A                 | Condition D                 | SC rating |
|-----------------------------|-----------------------------|-----------|
| no following at all         | Any                         | 0         |
| Any                         | no following at all         | 0         |
| following some part         | following some or most part | 0.5       |
| following some or most part | following some part         | 0.5       |
| following most part         | following most part         | 1         |

PQ Table:

| Objects in image | Artifacts | Unusual sense  | PQ rating |
|------------------|-----------|----------------|-----------|
| Unrecognizable   | serious   | Any            | 0         |
| Recognizable     | some      | Any            | 0.5       |
| Recognizable     | Any       | some           | 0.5       |
| Recognizable     | none      | little or None | 1         |

### Subject-driven Image Editing

* Condition C: Is the Image performing minimal edit without changing the background?
* Condition D: Is the Object in the Image following the token subject?


SC Table:

| Condition C                 | Condition D                 | SC rating |
|-----------------------------|-----------------------------|-----------|
| no following at all         | Any                         | 0         |
| Any                         | no following at all         | 0         |
| following some part         | following some or most part | 0.5       |
| following some or most part | following some part         | 0.5       |
| following most part         | following most part         | 1         |

PQ Table:

| Objects in image | Artifacts | Unusual sense  | PQ rating |
|------------------|-----------|----------------|-----------|
| Unrecognizable   | serious   | Any            | 0         |
| Recognizable     | some      | Any            | 0.5       |
| Recognizable     | Any       | some           | 0.5       |
| Recognizable     | none      | little or None | 1         |

### Multi-concept Image Composition

* Condition A: Is the Image generation following the prompt?
* Condition D_i: Is the Object_i in the Image following the token subject?

SC Table:

| Condition A                 | Condition D_1               | Condition D_2               | SC rating |
|-----------------------------|-----------------------------|-----------------------------|-----------|
| no following at all         | Any                         | Any                         | 0         |
| Any                         | no following at all         | Any                         | 0         |
| Any                         | Any                         | no following at all         | 0         |
| following some part         | following some or most part | following some or most part | 0.5       |
| following some or most part | following some part         | following some or most part | 0.5       |
| following some part or more | following some or most part | following some part         | 0.5       |
| following most part         | following most part         | following most part         | 1         |

PQ Table:

| Objects in image | Artifacts | Unusual sense  | PQ rating |
|------------------|-----------|----------------|-----------|
| Unrecognizable   | serious   | Any            | 0         |
| Recognizable     | some      | Any            | 0.5       |
| Recognizable     | Any       | some           | 0.5       |
| Recognizable     | none      | little or None | 1         |

### Control-guided Image Generation

* Condition A: Is the Image generation following the prompt?
* Condition E: Is the Image following the control guidance?


SC Table:

| Condition A                 | Condition E                 | SC rating |
|-----------------------------|-----------------------------|-----------|
| no following at all         | Any                         | 0         |
| Any                         | no following at all         | 0         |
| following some part         | following some or most part | 0.5       |
| following some or most part | following some part         | 0.5       |
| following most part         | following most part         | 1         |

PQ Table:

| Objects in image | Artifacts | Unusual sense  | PQ rating |
|------------------|-----------|----------------|-----------|
| Unrecognizable   | serious   | Any            | 0         |
| Recognizable     | some      | Any            | 0.5       |
| Recognizable     | Any       | some           | 0.5       |
| Recognizable     | none      | little or None | 1         |

## Statistical Tools for Human Evaluation Data

(Under Construction)
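
One simple summary of the collected ratings is the per-model mean of SC and PQ. The sketch below is offered only as an illustration and not as the project's official tooling. It assumes the CSV has a `uid` column plus one column per model, and that each cell stores the `[SC, PQ]` list as quoted text such as `"[1, 0.5]"`, so the comma inside the list does not split the column. The function name `mean_scores` and the inline sample data are hypothetical.

```python
import csv
import io
import json
from statistics import mean


def mean_scores(csv_file):
    """Return {model: (mean SC, mean PQ)} from a ratings CSV file object."""
    ratings = {}
    for row in csv.DictReader(csv_file):
        for model, cell in row.items():
            # Skip the image id column, overflow fields, and unrated cells.
            if model in (None, "uid") or not cell:
                continue
            sc, pq = json.loads(cell)  # cell text looks like "[1, 0.5]"
            ratings.setdefault(model, []).append((sc, pq))
    return {
        model: (mean(sc for sc, _ in pairs), mean(pq for _, pq in pairs))
        for model, pairs in ratings.items()
    }


# Inline stand-in for `dataset_lookup.csv`, mirroring the example rows.
sample = io.StringIO(
    "uid,TheModel\n"
    'sample_1.jpg,"[0, 1]"\n'
    'sample_2.jpg,"[1, 1]"\n'
    'sample_3.jpg,"[1, 0.5]"\n'
)
scores = mean_scores(sample)
```

With a real file, the same function can be called on `open("results/dataset_lookup.csv", newline="")`, assuming the file follows the layout described above.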