BiSeNet
- class face_crop_plus.models.bise.BiSeNet(attr_groups=None, mask_groups=None, max_batch_size=8)[source]
Bases: Module, LoadMixin
Face attribute parser.
This class is capable of predicting scores for 19 attributes for face images. After it identifies the closest attribute for each pixel, it can also assign the whole face image to a corresponding attribute or mask group.
The 19 attributes are as follows (attributes are indicated from the person's own perspective, meaning, for instance, the left eye is the eye on the right-hand side of the picture; however, sides are not always accurate):
0 - neutral
1 - skin
2 - left eyebrow
3 - right eyebrow
4 - left eye
5 - right eye
6 - eyeglasses
7 - left ear
8 - right ear
9 - earing
10 - nose
11 - mouth
12 - upper lip
13 - lower lip
14 - neck
15 - necklace
16 - clothes
17 - hair
18 - hat
Some examples of grouping by attributes:
- 'glasses': [6] - this will put each face image that contains pixels labeled as 6 to a category called 'glasses'.
- 'earings_and_necklace': [9, 15] - this will put each image that contains pixels labeled as 9 and also contains pixels labeled as 15 to a category called 'earings_and_necklace'.
- 'no_accessories': [-6, -9, -15, -18] - this will put each face image that does not contain pixels labeled as 6, 9, 15, or 18 to a category called 'no_accessories'.
Some examples of grouping by mask:
- 'nose': [10] - this will put each face image that contains pixels labeled as 10 to a category called 'nose' and generate a corresponding mask.
- 'eyes_and_eyebrows': [2, 3, 4, 5] - this will put each image that contains pixels labeled as 2, 3, 4, or 5 (or any combination of them) to a category called 'eyes_and_eyebrows' and generate a corresponding mask.
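The grouping rules above can be sketched in plain numpy. The `matches` helper and the `parse_map` label map are hypothetical illustrations (not the library's actual code), and the threshold of 5 mirrors the `attr_threshold` default described further below:

```python
import numpy as np

# Hypothetical label map for one face (normally produced by the model);
# values are the attribute indices listed above.
parse_map = np.zeros((64, 64), dtype=np.int64)
parse_map[10:20, 10:20] = 6   # a patch of "eyeglasses" pixels
parse_map[30:32, 30:32] = 17  # some "hair" pixels

attr_groups = {
    "glasses": [6],
    "earings_and_necklace": [9, 15],
    "no_accessories": [-6, -9, -15, -18],
}

def matches(parse_map, attrs, threshold=5):
    """True if the face satisfies every attribute in the list (joined by and).

    Positive indices must be present (more than `threshold` pixels);
    negative indices must be absent.
    """
    for a in attrs:
        count = int((parse_map == abs(a)).sum())
        present = count > threshold
        if (a >= 0) != present:
            return False
    return True

groups = [name for name, attrs in attr_groups.items() if matches(parse_map, attrs)]
print(groups)  # ['glasses']
```

Here only 'glasses' matches: the map has eyeglasses pixels, lacks earring/necklace pixels, and the eyeglasses pixels rule out 'no_accessories'.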
This class also inherits the load method from the LoadMixin class. The method takes a device on which to load the model and loads it with a default state dictionary read from the WEIGHTS_FILENAME file. It sets the model to eval mode and disables gradients. For more information on how the BiSeNet model works, see this repo: Face Parsing PyTorch. Most of the code was taken from that repository.
Note
Whenever an input shape is mentioned, N corresponds to batch size, C corresponds to the number of channels, H - to input height, and W - to input width.
By default, this class initializes the following attributes, which can be changed after initialization of the class (but, typically, should not be changed):
- attr_join_by_and
Whether a face image is added to an attribute group only if the face meets all the specified attributes in the list (joined by and), or if it meets at least one of the attributes (joined by or). Please read the definition of attr_groups to get a clearer picture. In most cases, this should be set to True - if the attributes in a group list are negative, this ensures the selected face matches none of the specified attributes. Also, if you want to join the attributes by or (any), separate single-attribute groups can be created and manually merged into one. Defaults to True.
- Type:
bool
- attr_threshold
Threshold, based on which an attribute is determined as present in the face image. For instance, if the threshold is 5, then at least 6 pixels must be labeled with the same attribute for that attribute to be considered present in the face image. Defaults to 5.
- Type:
int
- mask_threshold
Threshold, based on which a mask is considered a proper mask. For instance, if the threshold is 15, then face images with 15 or fewer pixels matching a specified mask group (face attributes) will be ignored, and no image-mask pair will be generated for that mask category. Defaults to 15.
- Type:
int
- mean
The list of mean values for each input channel. The pixel values should be shifted by those quantities during inference since this normalization was applied during training. Defaults to [0.485, 0.456, 0.406].
- Type:
list[float]
- std
The list of standard deviation values for each input channel. The pixel values should be scaled by those quantities during inference since this normalization was applied during training. Defaults to [0.229, 0.224, 0.225].
- Type:
list[float]
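The mean and std attributes describe the channel-wise normalization applied during training. A minimal numpy sketch of that transform (whether the library applies it internally or expects pre-normalized input is not stated here; the array shapes are illustrative):

```python
import numpy as np

# Channel-wise normalization constants from the class attributes above.
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])

# Hypothetical RGB face batch with float values in [0, 255], shape (N, 3, H, W).
images = np.full((1, 3, 4, 4), 127.5)

# Scale to [0, 1], then shift by the per-channel mean and divide by the
# per-channel std, matching the normalization applied during training.
normalized = (images / 255.0 - mean[None, :, None, None]) / std[None, :, None, None]
```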
- WEIGHTS_FILENAME = 'bise_parser.pth'
The constant specifying the name of the .pth file from which the weights for this model should be loaded. Defaults to "bise_parser.pth".
- Type:
str
- __init__(attr_groups=None, mask_groups=None, max_batch_size=8)[source]
Initializes BiSeNet model.
First, it assigns the passed values as attributes. Then it initializes the BiSeNet layers required for face parsing, i.e., labeling face parts.
Note
Check the class definition for the possible face attribute values and examples of groups. Also note that the variables specified here are mainly relevant only for predict().
- Parameters:
attr_groups (Optional[dict[str, list[int]]]) – Dictionary specifying attribute groups, based on which the face images should be grouped. Each key represents an attribute group name, e.g., 'glasses', 'earings_and_necklace', 'no_accessories', and each value represents attribute indices, e.g., [6], [9, 15], [-6, -9, -15, -18], each index mapping to some attribute. Since this model labels face image pixels, if there are enough pixels with the specified values in the list, the whole face image is put into that attribute category. For negative values, it is checked that the labeled face image does not contain those (absolute) values. If None, there is no grouping according to attributes. Defaults to None.
mask_groups (Optional[dict[str, list[int]]]) – Dictionary specifying mask groups, based on which the face images and their masks should be grouped. Each key represents a mask group name, e.g., 'nose', 'eyes_and_eyebrows', and each value represents attribute indices, e.g., [10], [2, 3, 4, 5], each index mapping to some attribute. Since this model labels face image pixels, a mask is created with ones at pixels that match the specified attributes and zeros elsewhere. Note that negative values would make no sense here and would cause an error. If None, there is no grouping according to mask groups. Defaults to None.
max_batch_size (int) – The maximum batch size used when performing inference. There may be many faces in a single batch, thus splitting it into sub-batches for inference and then merging the predictions back is a way to deal with memory errors. This is a convenience variable because batch size typically corresponds to the number of images for a single inference, but the input given to predict() might have a larger batch size because it represents the number of faces, many of which can come from just a single image. Defaults to 8.
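The sub-batching behavior implied by max_batch_size can be sketched with a hypothetical helper (`split_into_sub_batches` is not part of the API; it only illustrates how a face batch larger than max_batch_size could be chunked):

```python
def split_into_sub_batches(num_faces, max_batch_size=8):
    """Illustrative helper: yields (start, end) index ranges that cover
    num_faces samples in chunks of at most max_batch_size."""
    for start in range(0, num_faces, max_batch_size):
        yield start, min(start + max_batch_size, num_faces)

# E.g., 19 detected faces with the default max_batch_size of 8:
print(list(split_into_sub_batches(19)))  # [(0, 8), (8, 16), (16, 19)]
```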
- forward(x)[source]
Performs forward pass.
Takes an input batch and performs inference based on the modules it has. The input is a batch of face images and the output is a corresponding batch of pixel-wise attribute scores.
- Parameters:
x (Tensor) – The input tensor of shape (N, 3, H, W).
- Return type:
Tensor
- Returns:
An output tensor of shape (N, 19, H, W) where each channel corresponds to a specific attribute and each value at (H, W) is an unbounded confidence score.
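Since the output holds an unbounded score per attribute channel, the per-pixel attribute label is obtained by taking the argmax over the channel dimension. A numpy sketch with a stand-in score tensor (the random scores are placeholders, not real model output):

```python
import numpy as np

# Stand-in for forward()'s output: unbounded scores of shape (N, 19, H, W).
scores = np.random.default_rng(0).normal(size=(2, 19, 4, 4))

# The predicted attribute at each pixel is the channel with the highest
# score, yielding integer labels in 0..18 with shape (N, H, W).
parse_preds = scores.argmax(axis=1)
```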
- group_by_attributes(parse_preds, attr_groups, offset)[source]
Groups parse predictions by face attributes.
Takes parse predictions for each face, where each pixel's integer value indicates an attribute, and extends the groups in the attribute dictionary to include more samples that match each group.
- Parameters:
parse_preds (Tensor) – Face parsing predictions of shape (N, H, W) with integer values indicating pixel categories.
attr_groups (dict[str, list[int]]) – The dictionary with keys corresponding to attribute group names (they match self.attr_groups keys) and values corresponding to indices that map face images from all batches of parse_preds to the corresponding group. This is the dictionary that is extended and returned.
offset (int) – The offset to add to each index. Originally, the indices correspond only to the face parsings in the current parse_preds batch; the offset generalizes each index by shifting it by the number of previously processed face parsings, i.e., the offset is the number of previous batches (of parse_preds) times the batch size.
- Return type:
dict[str, list[int]]
- Returns:
Similar to
attr_groups
, it is the dictionary with the same keys but values (which are lists of indices) may be extended with additional indices.
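An illustrative re-implementation of this grouping step in numpy (an assumption about the logic, not the library's code; `specs`, `extend_groups`, and the threshold of 5 are all hypothetical stand-ins):

```python
import numpy as np

# Attribute specs (what self.attr_groups holds) and the accumulator
# dictionary that gets extended with global sample indices.
specs = {"glasses": [6], "hat": [18]}
groups = {"glasses": [], "hat": []}
threshold = 5

def extend_groups(parse_preds, groups, offset):
    """Appends offset + local index for every face whose label map
    satisfies all attributes of a group (joined by and)."""
    for i, pred in enumerate(parse_preds):
        for name, attrs in specs.items():
            matches_all = all(
                (int((pred == abs(a)).sum()) > threshold) == (a >= 0)
                for a in attrs
            )
            if matches_all:
                groups[name].append(offset + i)
    return groups

batch = np.zeros((2, 16, 16), dtype=np.int64)
batch[0, :4, :4] = 6    # first face: 16 "eyeglasses" pixels
batch[1, :4, :4] = 18   # second face: 16 "hat" pixels
extend_groups(batch, groups, offset=8)  # pretend one batch of 8 came before
print(groups)  # {'glasses': [8], 'hat': [9]}
```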
- group_by_masks(parse_preds, mask_groups, offset)[source]
Groups parse predictions by face mask attributes.
Takes parse predictions for each face, where each pixel's integer value indicates a parse/mask group, and extends the groups in the mask dictionary to include more samples that match each group.
- Parameters:
parse_preds (Tensor) – Face parsing predictions of shape (N, H, W) with integer values indicating pixel categories.
mask_groups (dict[str, tuple[list[int], list[ndarray]]]) – The dictionary with keys corresponding to mask group names (they match self.mask_groups keys) and values corresponding to tuples where the first element is a list of indices that map face images from all batches of parse_preds to the corresponding group and the second element is a list of corresponding masks as numpy arrays of shape (H, W) of type numpy.uint8 with 255 at pixels that match the mask group specification and 0 elsewhere. This is the dictionary that is extended and returned.
offset (int) – The offset to add to each index. Originally, the indices correspond only to the face parsings in the current parse_preds batch; the offset generalizes each index by shifting it by the number of previously processed face parsings, i.e., the offset is the number of previous batches (of parse_preds) times the batch size.
- Return type:
dict[str, tuple[list[int], list[ndarray]]]
- Returns:
Similar to
mask_groups
, it is the dictionary with the same keys but values (which are tuples of a list of indices and a list of masks) may be extended with additional indices and masks.
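The mask construction described above can be sketched in numpy (an assumption about the behavior, not the library's code): a uint8 mask with 255 where the pixel label is in the mask group and 0 elsewhere, kept only if it clears the mask_threshold of 15 described earlier.

```python
import numpy as np

mask_attrs = [2, 3, 4, 5]  # e.g., an 'eyes_and_eyebrows' mask group
mask_threshold = 15

# Hypothetical (H, W) label map for one face.
pred = np.zeros((32, 32), dtype=np.int64)
pred[5:10, 5:10] = 4  # 25 "left eye" pixels

# 255 at pixels matching the mask group, 0 elsewhere.
mask = np.where(np.isin(pred, mask_attrs), 255, 0).astype(np.uint8)

# The image-mask pair is generated only if enough pixels matched.
keep = int((mask == 255).sum()) > mask_threshold  # True here (25 > 15)
```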
- predict(images)[source]
Predicts attribute and mask groups for face images.
This method takes a batch of face images and groups them according to the specifications in self.attr_groups and self.mask_groups. For more information on how it works, see this class' specification BiSeNet. It returns 2 group maps - one for grouping face images into different attribute categories, e.g., 'with glasses', 'no accessories', and the other for grouping images into different mask groups, e.g., 'nose', 'lips and mouth'.
- Parameters:
images (Tensor | list[Tensor]) – Image batch of shape (N, 3, H, W) in RGB form with float values from 0.0 to 255.0. It must be on the same device as this model. A list of tensors can also be provided; however, they all must have the same spatial dimensions so they can be stacked into a single batch.
- Returns:
attr_groups - each key represents an attribute category and each value is a list of indices indicating which samples from the images batch belong to that category. It can be None if self.attr_groups is None.
mask_groups - each key represents an attribute (mask) category and each value is a tuple where the first element is a list of indices indicating which samples from the images batch belong to that mask group and the second element is a corresponding batch of masks of shape (N, H, W) of type numpy.uint8 with values of either 0 or 255. The masks appear in the same order as the indices that indicate which face images to take for that mask group. It can be None if self.mask_groups is None.
- Return type:
A tuple of 2 dictionaries (either can be None)
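A hedged sketch of consuming the two returned group maps. The dictionary contents below are made-up placeholders standing in for real predict() output; only the shapes and the index/mask alignment follow the description above:

```python
import numpy as np

# Placeholder return values shaped like predict()'s output.
attr_groups = {"glasses": [0, 3]}
mask_groups = {"nose": ([1, 2], np.full((2, 64, 64), 255, dtype=np.uint8))}

# Stand-in for the input face batch given to predict().
images = np.zeros((4, 3, 64, 64), dtype=np.float32)

# Select the face images belonging to an attribute category by index list:
with_glasses = images[attr_groups["glasses"]]  # shape (2, 3, 64, 64)

# For mask groups, the index list and the mask batch are aligned:
nose_indices, nose_masks = mask_groups["nose"]
nose_faces = images[nose_indices]  # i-th face pairs with nose_masks[i]
```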