Retina Face

class face_crop_plus.models.retinaface.RetinaFace(strategy='all', vis=0.6)[source]

Bases: Module, LoadMixin

RetinaFace face detector and 5-point landmark predictor.

This class is capable of predicting 5-point landmarks from a batch of images and filtering them based on a strategy, e.g., “all landmarks in the image”, “a single set of landmarks per image of the largest face”. For more information, see the main method of this class predict(). For main attributes, see __init__().

This class also inherits the load method from the LoadMixin class. The method takes a device on which to load the model and loads it with a default state dictionary read from the WEIGHTS_FILENAME file. It sets the model to eval mode and disables gradients.

For more information on how RetinaFace model works, see this repo: PyTorch Retina Face. Most of the code was taken from that repository.
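For instance, a minimal usage sketch (assuming the weights file named by WEIGHTS_FILENAME is available where LoadMixin.load expects it):

    import torch
    from face_crop_plus.models.retinaface import RetinaFace

    # Initialize the detector and load its weights.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = RetinaFace(strategy="best", vis=0.6)
    model.load(device)  # loads weights, sets eval mode, disables gradients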

Note

Whenever an input shape is mentioned, N corresponds to the batch size, C to the number of channels, H to the input height, and W to the input width. out_dim corresponds to the total number of guesses (the number of priors) the model makes about each sample. Within those guesses, there typically exists at least 1 actual face, but there can be more. By default, out_dim should be 43,008.

By default, this class initializes the following attributes, which can be changed after initialization of the class (but, typically, should not be changed):

nms_threshold

The threshold based on which multiple bounding box or landmark predictions for the same face are suppressed so that only one remains. Defaults to 0.4.

Type:

float

variance

The variances used to undo the encoding of the coordinates of raw bounding box and landmark predictions.

Type:

list[float]

WEIGHTS_FILENAME = 'retinaface_detector.pth'

The constant specifying the name of the .pth file from which the weights for this model should be loaded. Defaults to “retinaface_detector.pth”.

Type:

str

__init__(strategy='all', vis=0.6)[source]

Initializes RetinaFace model.

This method initializes the ResNet-50 backbone and the further layers required for face detection and bbox/landm predictions.

Parameters:
  • strategy (str) –

    The strategy used to retrieve the landmarks when predict() is called. The available options are:

    • “all” - landmarks for all faces per single image (single batch entry) will be considered.

    • “best” - landmarks for a single face with the highest confidence score per image will be considered.

    • “largest” - landmarks for the single largest face per image will be considered.

    The most efficient option is “best” and the least efficient is “largest”. Defaults to “all”.

  • vis (float) – The visual threshold, i.e., the minimum confidence score, for a prediction to be considered an actual face. Lower values allow more faces to be detected per image but can result in non-actual faces, e.g., random surfaces somewhat resembling faces. Higher values prevent detecting faulty faces but may result in only a few faces detected where there could be more, e.g., they can prevent the detection of blurry faces. Defaults to 0.6.

decode_bboxes(loc, priors)[source]

Decodes bounding boxes from predictions.

Takes the predicted bounding boxes (locations) and undoes the encoding for offset regression used at training time.

Parameters:
  • loc (Tensor) – Bounding box (location) predictions for loc layers of shape (N, out_dim, 4).

  • priors (Tensor) – Prior boxes in center-offset form of shape (out_dim, 4).

Return type:

Tensor

Returns:

A tensor of shape (N, out_dim, 4) representing decoded bounding box predictions where the last dim can be interpreted as x1, y1, x2, y2 coordinates - the start and the end corners defining the face box.
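As a rough illustration, a sketch of the standard SSD-style offset decoding used by the PyTorch RetinaFace repository; the helper name and the variance values here are assumptions, not taken from this class:

    import torch

    def decode_bboxes_sketch(loc, priors, variance=(0.1, 0.2)):
        # Priors are (cx, cy, w, h); loc holds offsets relative to them.
        centers = priors[..., :2] + loc[..., :2] * variance[0] * priors[..., 2:]
        sizes = priors[..., 2:] * torch.exp(loc[..., 2:] * variance[1])
        x1y1 = centers - sizes / 2  # start corner
        x2y2 = x1y1 + sizes         # end corner
        return torch.cat([x1y1, x2y2], dim=-1)  # (N, out_dim, 4)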

decode_landms(pre, priors)[source]

Decodes landmarks from predictions.

Takes the predicted landmarks (pre) and undoes the encoding for offset regression used at training time.

Parameters:
  • pre (Tensor) – Landmark predictions for loc layers of shape (N, out_dim, 10).

  • priors (Tensor) – Prior boxes in center-offset form of shape (out_dim, 4).

Return type:

Tensor

Returns:

A tensor of shape (N, out_dim, 10) representing decoded landmark predictions where the last dim can be interpreted as x1, y1, …, x10, y10 coordinates - one for each of the 5 landmarks.
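Analogously, a sketch of the landmark decoding under the same assumptions, where each of the 5 points is decoded like a box center:

    def decode_landms_sketch(pre, priors, variance=(0.1, 0.2)):
        # Reshape (N, out_dim, 10) -> (N, out_dim, 5, 2) to decode per point.
        points = pre.reshape(*pre.shape[:-1], 5, 2)
        # Offset each point from the prior center, scaled by the prior size.
        decoded = priors[..., None, :2] + points * variance[0] * priors[..., None, 2:]
        return decoded.reshape(pre.shape)  # back to (N, out_dim, 10)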

filter_preds(scores, bboxes, landms)[source]

Filters predictions for identified faces for each sample.

This method works as follows:

  1. First, it filters out bad predictions based on self.vis_threshold.

  2. Then it gathers all the remaining predictions across the batch dimension, i.e., the batch dimension becomes not the number of samples but the number of predictions that passed the filter.

  3. It loops over each sample’s set of filtered predictions, sorting each set of confidence scores from best to worst.

  4. For each set of confidence scores, it identifies distinct faces and keeps a record of which indices to keep. At this stage, it uses self.nms_threshold to remove duplicate face predictions.

  5. Finally, it applies the kept indices for each person (each face) to select corresponding bounding boxes and landmarks.
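A rough per-sample re-implementation of these steps (a sketch, not the actual implementation; it uses torchvision.ops.nms in place of the internal duplicate removal):

    import torch
    from torchvision.ops import nms

    def filter_preds_sketch(scores, bboxes, landms, vis=0.6, nms_thr=0.4):
        kept_bboxes, kept_landms, indices = [], [], []
        for n in range(scores.shape[0]):
            mask = scores[n] > vis                 # step 1: confidence filter
            s, b, l = scores[n][mask], bboxes[n][mask], landms[n][mask]
            keep = nms(b, s, nms_thr)              # steps 3-4: sort + de-duplicate
            kept_bboxes.append(b[keep])
            kept_landms.append(l[keep])
            indices += [n] * len(keep)             # step 5: map faces to samples
        return torch.cat(kept_bboxes), torch.cat(kept_landms), indices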

Parameters:
  • scores (Tensor) – The confidence score predictions of shape (N, out_dim).

  • bboxes (Tensor) – The bounding boxes for each face of shape (N, out_dim, 4) where the last 4 numbers correspond to start and end coordinates - x1, y1, x2, y2.

  • landms (Tensor) – The landmarks for each face of shape (N, out_dim, num_landmarks * 2) where the last dim corresponds to landmark coordinates x1, y1, … . By default, num_landmarks is 5.

Return type:

tuple[Tensor, Tensor, list[int]]

Returns:

A tuple where the first element is a torch tensor of shape (num_faces, 4), the second element is a torch tensor of shape (num_faces, num_landmarks * 2), and the third element is a list of length num_faces. The first and second elements correspond to bounding boxes and landmarks for each face across all samples, and the third element provides, for each bounding box/set of landmarks, an index identifying which sample that box/set (that face) was extracted from (because each sample can have multiple faces).

forward(x)[source]

Performs forward pass.

Takes an input batch and performs inference based on the modules it has. Returns an unfiltered tuple of scores, bounding boxes, and landmarks for all the possible detected faces. The predictions are encoded in the form used to conveniently compute the loss during training and thus should be decoded to actual coordinates - see decode_bboxes() and decode_landms().

Parameters:

x (Tensor) – The input tensor of shape (N, 3, H, W).

Return type:

tuple[Tensor, Tensor, Tensor]

Returns:

A tuple of torch tensors where the first element is confidence scores for each prediction of shape (N, out_dim, 2) with values between 0 and 1 representing probabilities, the second element is bounding boxes of shape (N, out_dim, 4) with unbounded values and the last element is landmarks of shape (N, out_dim, 10) with unbounded values.
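For shape intuition, a small sketch using the model loaded in the example above (the 640x640 input size is an assumption):

    # Dummy batch through a loaded model (see the load example above).
    x = torch.rand(2, 3, 640, 640, device=device) * 255
    with torch.no_grad():
        scores, bboxes, landms = model(x)
    # scores: (2, out_dim, 2); bboxes: (2, out_dim, 4); landms: (2, out_dim, 10)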

predict(images)[source]

Predict the sets of landmarks from the image batch.

This method takes a batch of images, detects all visible faces, predicts bounding boxes and landmarks for each face, and then filters those faces according to a specific strategy - see take_by_strategy() for more info. Finally, it returns the selected sets of landmarks and the corresponding indices that map each set to the specific image where the face was originally detected.

The predicted sets of landmarks are 5-point coordinates. They are specified from an observer’s viewpoint, meaning that, for instance, the left eye is the eye on the left-hand side of the image rather than the left eye of the person to whom it belongs:

  1. (x1, y1) - coordinate of the left eye

  2. (x2, y2) - coordinate of the right eye

  3. (x3, y3) - coordinate of the nose tip

  4. (x4, y4) - coordinate of the left mouth corner

  5. (x5, y5) - coordinate of the right mouth corner

The coordinates are with respect to the sizes of the images (typically padded) provided as an input to this method.

Parameters:

images (Tensor) – Image batch of shape (N, 3, H, W) in RGB form with float values from 0.0 to 255.0. It must be on the same device as this model.

Return type:

tuple[ndarray, list[int]]

Returns:

A tuple where the first element is a numpy array of shape (num_faces, 5, 2) representing the selected sets of landmark coordinates and the second element is a list of corresponding indices mapping each face to an image it comes from.
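A minimal end-to-end sketch, continuing the examples above (the image tensor here is random; real images should be RGB floats in the 0-255 range on the model’s device):

    # Predict landmarks for a batch of 4 images using the model loaded above.
    images = torch.rand(4, 3, 640, 640, device=device) * 255
    landmarks, indices = model.predict(images)
    # landmarks: numpy array of shape (num_faces, 5, 2);
    # indices[i] identifies which of the 4 images face i was detected in.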

take_by_strategy(landms, bboxes, idx)[source]

Filters landmarks according to strategy.

This method takes a batch of landmarks and bounding boxes (one set for each face) and filters the landmarks according to a specific strategy. The strategy cases are as follows:

  • “all” - effectively, nothing is done and the passed values landms and idx are returned without any changes.

  • “best” - the very first set of landmarks for each image is returned (the first set is the best set because the landmarks were sorted when duplicates were filtered out in filter_preds()). This means the returned indices list is unique, e.g., it goes from [0, 0, 0, 1, 1, 2, 3, 3] to [0, 1, 2, 3].

  • “largest” - similar to “best”, except that this strategy requires additional computation to find the largest face based on the area of the bounding boxes. The length of the idx list (which equals the number of sets of landmarks) is thus the same as for the “best” strategy, except that the landmarks of the largest face per image, rather than the first (best) face, are returned.

Note

Strategy “best” is the most memory efficient and strategy “largest” is the least time efficient. Strategy “all” is as fast as “best” but takes up more space.

Parameters:
  • landms (Tensor) – Landmarks batch of shape (num_faces, num_landm * 2).

  • bboxes (Tensor) – Bounding boxes batch of shape (num_faces, 4).

  • idx (list[int]) – Indices where each index maps to an image from which some face prediction (landmarks and bounding box) was retrieved. For instance, if the 2nd element of idx is 1, that means that the 2nd element of landms and the 2nd element of bboxes correspond to the 1st image. This list is ascending, meaning the elements are grouped and non-decreasing, for example: [0, 0, 1, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6].

Raises:

ValueError – If the strategy is not supported.

Return type:

tuple[Tensor, list[int]]

Returns:

A tuple where the first element is a torch tensor of shape (num_faces, num_landm * 2) representing the selected sets of landmarks and the second element is a list of indices where each index maps a corresponding set of landmarks (face) to the image identified by that index.
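For illustration, a hypothetical helper mimicking the “best” strategy, relying only on idx being grouped and ascending as documented above:

    import torch

    def take_best_sketch(landms, idx):
        # Keep the first (highest-score) face of each group of indices.
        keep = [i for i, v in enumerate(idx) if i == 0 or idx[i - 1] != v]
        return landms[keep], [idx[i] for i in keep]

    landms = torch.arange(16.0).reshape(8, 2)  # 8 toy faces, 2 values each
    best, uniq = take_best_sketch(landms, [0, 0, 0, 1, 1, 2, 3, 3])
    print(uniq)  # [0, 1, 2, 3]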