From the figure above we can clearly see that when VGG16 is used as the backbone, the fully-connected layers after conv5 are dropped and replaced with 1024 x 3 x 3 and 1024 x 1 x 1 convolutional layers. The max-pooling layer in front of conv4-1 is given ceil_mode=True, which makes the output feature map 38 x 38. In addition, the max-pooling layer after conv5-3 uses $(\text{kernel size}=3, \text{stride}=1, \text{padding}=1)$, so it no longer downsamples. Attaching four more convolutional layers for multi-scale feature extraction after fc7 then completes the SSD network. The corresponding code from ssd.py is shown below; add_extras builds those four extra layers:
```python
def add_extras(cfg, i, batch_norm=False):
    # Extra layers added to VGG for feature scaling
    layers = []
    in_channels = i
    flag = False  # flag toggles kernel_size between 1 and 3
    for k, v in enumerate(cfg):
        if in_channels != 'S':
            if v == 'S':
                layers += [nn.Conv2d(in_channels, cfg[k + 1],
                                     kernel_size=(1, 3)[flag],
                                     stride=2, padding=1)]
            else:
                layers += [nn.Conv2d(in_channels, v,
                                     kernel_size=(1, 3)[flag])]
            flag = not flag
        in_channels = v
    return layers
```
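For completeness, the fc6/fc7 replacement described above happens at the end of the vgg() function in the same file. The sketch below condenses the relevant lines from the upstream ssd.pytorch repository; note that conv6 uses dilation=6 to enlarge its receptive field:

```python
# end of vgg() in ssd.py (condensed from upstream ssd.pytorch):
# pool5 keeps the spatial size; fc6/fc7 become 1024-channel convolutions
pool5 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)  # dilated fc6
conv7 = nn.Conv2d(1024, 1024, kernel_size=1)                        # fc7
layers += [pool5, nn.ReLU(inplace=True),
           conv6, nn.ReLU(inplace=True),
           conv7, nn.ReLU(inplace=True)]
```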
This was covered in the earlier article on the principles of SSD, but it is worth recalling here. Starting from conv4_3 of the modified VGG16, SSD uses 6 feature maps of different sizes: (38,38), (19,19), (10,10), (5,5), (3,3) and (1,1), with a different number of prior boxes (Anchors) placed on each one. A prior box is specified by two things: a scale and an aspect ratio. The scales follow $s_{k}=s_{\min }+\frac{s_{\max }-s_{\min }}{m-1}(k-1), k \in[1, m]$, where $m$ is the number of feature maps, here 5 because the Anchors of the first layer, conv4_3, are set separately, and $s_{k}$ is the size of the prior box relative to the input image. Finally, $s_{\min}$ and $s_{\max}$ are the minimum and maximum of this ratio, taken as 0.2 and 0.9 in the paper. For the first feature map the scale ratio is set to $s_{\min}/2=0.1$, so its size is 300 x 0.1 = 30. The remaining feature maps plug into the formula; keeping two decimal places, the step $\frac{s_{\max }-s_{\min }}{m-1}$ is 0.17, e.g. 300 x (0.17 x 0 + 0.2) = 60 and 300 x (0.17 x 1 + 0.2) = 111, so the min_size scales $s_{k}$ of the remaining 5 feature maps are 60, 111, 162, 213, 264. Altogether, the min_size scales $s_{k}$ of the 6 feature maps are 30, 60, 111, 162, 213, 264. With the scales fixed, the aspect ratios are set next; the paper generally uses $a_{r}=1,2,3, \frac{1}{2}, \frac{1}{3}$, and from the area and aspect ratio the width and height of a prior box follow as $w_{k}^{a}=s_{k} \sqrt{a_{r}},\ h_{k}^{a}=s_{k} / \sqrt{a_{r}}$. A few details here are worth noting, and the numbers are easy to verify, as in the sketch below.
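This standalone snippet (not part of ssd.py; the variable names are mine) reproduces the min_size arithmetic above:

```python
# reproduce the SSD300 min_size scales derived above (standalone sketch)
image_size = 300
s_min, s_max, m = 0.2, 0.9, 5               # conv4_3 is handled separately
step = round((s_max - s_min) / (m - 1), 2)  # 0.17, kept to two decimals
min_sizes = [int(image_size * s_min / 2)]   # first map: s_min / 2 -> 30
min_sizes += [round(image_size * (s_min + step * k)) for k in range(m)]
print(min_sizes)  # [30, 60, 111, 162, 213, 264]
```

The PriorBox class below turns these scales and aspect ratios into the actual priors: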
```python
class PriorBox(object):
    """Compute priorbox coordinates in center-offset form for each source
    feature map.
    """
    def __init__(self, cfg):
        super(PriorBox, self).__init__()
        self.image_size = cfg['min_dim']
        # number of priors for feature map location (either 4 or 6)
        self.num_priors = len(cfg['aspect_ratios'])
        self.variance = cfg['variance'] or [0.1]
        self.feature_maps = cfg['feature_maps']
        self.min_sizes = cfg['min_sizes']
        self.max_sizes = cfg['max_sizes']
        self.steps = cfg['steps']
        self.aspect_ratios = cfg['aspect_ratios']
        self.clip = cfg['clip']
        self.version = cfg['name']
        for v in self.variance:
            if v <= 0:
                raise ValueError('Variances must be greater than 0')

    def forward(self):
        # product comes from itertools, sqrt from math (module-level imports)
        mean = []
        # iterate over the multi-scale feature maps: [38, 19, 10, 5, 3, 1]
        for k, f in enumerate(self.feature_maps):
            # iterate over every cell of the k-th feature map
            for i, j in product(range(f), repeat=2):
                f_k = self.image_size / self.steps[k]  # k-th feature map size
                # center coordinates of each box, normalized to [0, 1]
                cx = (j + 0.5) / f_k
                cy = (i + 0.5) / f_k
                # aspect ratio 1: scale s_k, plus the extra scale
                # sqrt(s_k * s_(k+1))
                s_k = self.min_sizes[k] / self.image_size
                s_k_prime = sqrt(s_k * (self.max_sizes[k] / self.image_size))
                mean += [cx, cy, s_k, s_k]
                mean += [cx, cy, s_k_prime, s_k_prime]
                # remaining ratios: w = s_k*sqrt(ar), h = s_k/sqrt(ar), flipped
                for ar in self.aspect_ratios[k]:
                    mean += [cx, cy, s_k * sqrt(ar), s_k / sqrt(ar)]
                    mean += [cx, cy, s_k / sqrt(ar), s_k * sqrt(ar)]
        # back to torch land
        output = torch.Tensor(mean).view(-1, 4)
        if self.clip:
            output.clamp_(max=1, min=0)
        return output
```
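As a sanity check, the total number of priors this produces for SSD300 is 8732: 4 boxes per cell on the first, fifth and sixth maps (aspect ratios 1, 2, 1/2 plus the extra scale), 6 on the others (adding 3 and 1/3). A standalone computation:

```python
# standalone check: total number of priors for SSD300
feature_maps = [38, 19, 10, 5, 3, 1]
boxes_per_cell = [4, 6, 6, 6, 4, 4]  # 2 + 2 * len(aspect_ratios[k])
print(sum(f * f * n for f, n in zip(feature_maps, boxes_per_cell)))  # 8732
```

This 8732 is exactly the num_priors that shows up in the shape comments of SSD.forward below.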
```python
class SSD(nn.Module):
    """Single Shot Multibox Architecture
    The network is composed of a base VGG network followed by the
    added multibox conv layers.  Each multibox layer branches into
        1) conv2d for class conf scores
        2) conv2d for localization predictions
        3) associated priorbox layer to produce default bounding
           boxes specific to the layer's feature map size.
    See: https://arxiv.org/pdf/1512.02325.pdf for more details.

    Args:
        phase: (string) Can be "test" or "train"
        size: input image size
        base: VGG16 layers for input, size of either 300 or 500
        extras: extra layers that feed to multibox loc and conf layers
        head: "multibox head" consists of loc and conf conv layers
    """

    def forward(self, x):
        """Applies network layers and ops on input image(s) x.

        Args:
            x: input image or batch of images. Shape: [batch,3,300,300].

        Return:
            Depending on phase:
            test:
                Variable(tensor) of output class label predictions,
                confidence score, and corresponding location predictions
                for each object detected. Shape: [batch,topk,7]
            train:
                list of concat outputs from:
                    1: confidence layers, Shape: [batch*num_priors,num_classes]
                    2: localization layers, Shape: [batch,num_priors*4]
                    3: priorbox layers, Shape: [2,num_priors*4]
        """
        sources = list()
        loc = list()
        conf = list()

        # apply vgg up to conv4_3 relu
        for k in range(23):
            x = self.vgg[k](x)
        # L2-normalize the conv4_3 feature map before using it as a source
        s = self.L2Norm(x)
        sources.append(s)

        # apply vgg up to fc7 (conv4_3 to fc7)
        for k in range(23, len(self.vgg)):
            x = self.vgg[k](x)
        sources.append(x)

        # apply extra layers and cache source layer outputs
        for k, v in enumerate(self.extras):
            x = F.relu(v(x), inplace=True)
            if k % 2 == 1:
                # every second extra layer feeds the multi-scale sources
                sources.append(x)

        # apply multibox head (per-scale regression and classification)
        for (x, l, c) in zip(sources, self.loc, self.conf):
            loc.append(l(x).permute(0, 2, 3, 1).contiguous())
            conf.append(c(x).permute(0, 2, 3, 1).contiguous())

        loc = torch.cat([o.view(o.size(0), -1) for o in loc], 1)
        conf = torch.cat([o.view(o.size(0), -1) for o in conf], 1)
        if self.phase == "test":
            output = self.detect(
                loc.view(loc.size(0), -1, 4),                  # loc preds
                self.softmax(conf.view(conf.size(0), -1,
                                       self.num_classes)),     # conf preds
                self.priors.type(type(x.data))                 # default boxes
            )
        else:
            output = (
                # localization output, size: (batch, 8732, 4)
                loc.view(loc.size(0), -1, 4),
                # confidence output, size: (batch, 8732, 21)
                conf.view(conf.size(0), -1, self.num_classes),
                # all generated priors, size: (8732, 4)
                self.priors
            )
        return output

    # load model weights
    def load_weights(self, base_file):
        other, ext = os.path.splitext(base_file)
        if ext == '.pkl' or ext == '.pth':
            print('Loading weights into state dict...')
            self.load_state_dict(torch.load(base_file,
                                 map_location=lambda storage, loc: storage))
            print('Finished!')
        else:
            print('Sorry only .pth and .pkl files supported.')
```
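To make the shapes concrete, here is a minimal smoke test, assuming the repository's build_ssd factory from ssd.py is importable:

```python
# minimal smoke test, assuming ssd.py's build_ssd factory is on the path
import torch
from ssd import build_ssd

net = build_ssd('train', size=300, num_classes=21)
loc, conf, priors = net(torch.randn(1, 3, 300, 300))
print(loc.shape, conf.shape, priors.shape)
# expected: (1, 8732, 4), (1, 8732, 21), (8732, 4)
```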
```python
class MultiBoxLoss(nn.Module):
    """SSD Weighted Loss Function
    Compute Targets:
        1) Produce Confidence Target Indices by matching ground truth boxes
           with (default) 'priorboxes' that have jaccard index > threshold
           parameter (default threshold: 0.5).
        2) Produce localization target by 'encoding' variance into offsets of
           ground truth boxes and their matched 'priorboxes'.
        3) Hard negative mining to filter the excessive number of negative
           examples that comes with using a large number of default bounding
           boxes. (default negative:positive ratio 3:1)
    Objective Loss:
        L(x,c,l,g) = (Lconf(x, c) + αLloc(x,l,g)) / N
        Where, Lconf is the CrossEntropy Loss and Lloc is the SmoothL1 Loss
        weighted by α which is set to 1 by cross val.
        Args:
            c: class confidences,
            l: predicted boxes,
            g: ground truth boxes
            N: number of matched default boxes
        See: https://arxiv.org/pdf/1512.02325.pdf for more details.
    """
```
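Step 3, the 3:1 hard negative mining, is the least obvious part of this loss. The sketch below condenses how the upstream MultiBoxLoss.forward selects negatives, assuming loss_c already holds the per-prior confidence loss with shape (batch, num_priors) and pos is the boolean mask of matched priors:

```python
# condensed hard negative mining sketch (after upstream MultiBoxLoss.forward)
# loss_c: per-prior confidence loss, shape (batch, num_priors)
# pos:    boolean mask of positive (matched) priors, same shape
negpos_ratio = 3                               # negatives kept per positive
loss_c[pos] = 0                                # positives are never mined as negatives
_, loss_idx = loss_c.sort(1, descending=True)  # order priors by loss, largest first
_, idx_rank = loss_idx.sort(1)                 # rank of each prior in that ordering
num_pos = pos.long().sum(1, keepdim=True)
num_neg = torch.clamp(negpos_ratio * num_pos, max=pos.size(1) - 1)
neg = idx_rank < num_neg.expand_as(idx_rank)   # keep only the hardest negatives
```

The double argsort is the standard trick for turning a sort order into per-element ranks, so neg picks exactly the num_neg highest-loss non-positive priors in each image.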
```python
def intersect(box_a, box_b):
    """We resize both tensors to [A,B,2] without new malloc:
    [A,2] -> [A,1,2] -> [A,B,2]
    [B,2] -> [1,B,2] -> [A,B,2]
    Then we compute the area of intersect between box_a and box_b.
    Args:
      box_a: (tensor) bounding boxes, Shape: [A,4].
      box_b: (tensor) bounding boxes, Shape: [B,4].
    Return:
      (tensor) intersection area, Shape: [A,B].
    """
    A = box_a.size(0)
    B = box_b.size(0)
    # bottom-right corners: take the element-wise minimum
    max_xy = torch.min(box_a[:, 2:].unsqueeze(1).expand(A, B, 2),
                       box_b[:, 2:].unsqueeze(0).expand(A, B, 2))
    # top-left corners: take the element-wise maximum
    min_xy = torch.max(box_a[:, :2].unsqueeze(1).expand(A, B, 2),
                       box_b[:, :2].unsqueeze(0).expand(A, B, 2))
    # clamp negatives to 0; zero width or height means an empty intersection
    inter = torch.clamp((max_xy - min_xy), min=0)
    return inter[:, :, 0] * inter[:, :, 1]


def jaccard(box_a, box_b):
    """Compute the jaccard overlap of two sets of boxes.  The jaccard overlap
    is simply the intersection over union of two boxes.  Here we operate on
    ground truth boxes and default boxes.
    E.g.:
        A ∩ B / A ∪ B = A ∩ B / (area(A) + area(B) - A ∩ B)
    Args:
        box_a: (tensor) Ground truth bounding boxes, Shape: [num_objects,4]
        box_b: (tensor) Prior boxes from priorbox layers, Shape: [num_priors,4]
    Return:
        jaccard overlap: (tensor) Shape: [box_a.size(0), box_b.size(0)]
    """
    inter = intersect(box_a, box_b)  # A ∩ B
    # areas of box_a and box_b, broadcast to [A,B]
    area_a = ((box_a[:, 2] - box_a[:, 0]) *
              (box_a[:, 3] - box_a[:, 1])).unsqueeze(1).expand_as(inter)
    area_b = ((box_b[:, 2] - box_b[:, 0]) *
              (box_b[:, 3] - box_b[:, 1])).unsqueeze(0).expand_as(inter)
    union = area_a + area_b - inter
    return inter / union  # [A,B]
```
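A quick check with two hand-made boxes (standalone, assuming intersect and jaccard as defined above):

```python
# standalone check of jaccard with boxes in (x1, y1, x2, y2) point form
import torch
a = torch.tensor([[0.0, 0.0, 2.0, 2.0]])  # area 4
b = torch.tensor([[1.0, 1.0, 3.0, 3.0]])  # area 4, overlaps a in a 1x1 square
print(jaccard(a, b))  # tensor([[0.1429]]) = 1 / (4 + 4 - 1)
```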
```python
def encode(matched, priors, variances):
    """Encode the variances from the priorbox layers into the ground truth
    boxes we have matched (based on jaccard overlap) with the prior boxes.
    Args:
        matched: (tensor) Coords of ground truth for each prior in point-form
            Shape: [num_priors, 4].
        priors: (tensor) Prior boxes in center-offset form
            Shape: [num_priors,4].
        variances: (list[float]) Variances of priorboxes
    Return:
        encoded boxes (tensor), Shape: [num_priors, 4]
    """
    # dist b/t match center and prior's center
    g_cxcy = (matched[:, :2] + matched[:, 2:]) / 2 - priors[:, :2]
    # encode variance
    g_cxcy /= (variances[0] * priors[:, 2:])
    # match wh / prior wh
    g_wh = (matched[:, 2:] - matched[:, :2]) / priors[:, 2:]
    g_wh = torch.log(g_wh) / variances[1]
    # return target for smooth_l1_loss
    return torch.cat([g_cxcy, g_wh], 1)  # [num_priors,4]
```
```python
# Adapted from https://github.com/Hakuyume/chainer-ssd
def decode(loc, priors, variances):
    """Decode locations from predictions using priors to undo the encoding
    we did for offset regression at train time.
    Args:
        loc (tensor): location predictions for loc layers,
            Shape: [num_priors,4]
        priors (tensor): Prior boxes in center-offset form.
            Shape: [num_priors,4].
        variances: (list[float]) Variances of priorboxes
    Return:
        decoded bounding box predictions
    """
    # invert encode(): recover centers and sizes, then convert to point form
    boxes = torch.cat((
        priors[:, :2] + loc[:, :2] * variances[0] * priors[:, 2:],
        priors[:, 2:] * torch.exp(loc[:, 2:] * variances[1])), 1)
    boxes[:, :2] -= boxes[:, 2:] / 2  # (cx, cy) -> (x1, y1)
    boxes[:, 2:] += boxes[:, :2]      # (w, h)   -> (x2, y2)
    return boxes
```
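encode and decode should be exact inverses, which a standalone round trip confirms (priors in center-offset form, ground truth in point form, variances [0.1, 0.2] as in the repository's config):

```python
# standalone round trip: decode(encode(gt)) should recover gt
import torch
variances = [0.1, 0.2]
priors = torch.tensor([[0.50, 0.50, 0.20, 0.20]])  # (cx, cy, w, h)
gt = torch.tensor([[0.40, 0.40, 0.62, 0.58]])      # (x1, y1, x2, y2)
print(decode(encode(gt, priors, variances), priors, variances))  # ≈ gt
```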
```python
def nms(boxes, scores, overlap=0.5, top_k=200):
    """Apply non-maximum suppression at test time to avoid detecting too many
    overlapping bounding boxes for a given object.
    Args:
        boxes: (tensor) The location preds for the img, Shape: [num_priors,4].
        scores: (tensor) The class predscores for the img, Shape:[num_priors].
        overlap: (float) The overlap thresh for suppressing unnecessary boxes.
        top_k: (int) The Maximum number of box preds to consider.
    Return:
        The indices of the kept boxes with respect to num_priors.
    """
    keep = scores.new(scores.size(0)).zero_().long()
    if boxes.numel() == 0:
        return keep, 0  # no boxes: nothing kept
    x1 = boxes[:, 0]
    y1 = boxes[:, 1]
    x2 = boxes[:, 2]
    y2 = boxes[:, 3]
    area = torch.mul(x2 - x1, y2 - y1)
    v, idx = scores.sort(0)  # sort in ascending order
    # I = I[v >= 0.01]
    idx = idx[-top_k:]  # indices of the top-k largest vals
    xx1 = boxes.new()
    yy1 = boxes.new()
    xx2 = boxes.new()
    yy2 = boxes.new()
    w = boxes.new()
    h = boxes.new()

    # keep = torch.Tensor()
    count = 0
    while idx.numel() > 0:
        i = idx[-1]  # index of current largest val
        # keep.append(i)
        keep[count] = i
        count += 1
        if idx.size(0) == 1:
            break
        idx = idx[:-1]  # remove kept element from view
        # load bboxes of next highest vals
        torch.index_select(x1, 0, idx, out=xx1)
        torch.index_select(y1, 0, idx, out=yy1)
        torch.index_select(x2, 0, idx, out=xx2)
        torch.index_select(y2, 0, idx, out=yy2)
        # store element-wise max with next highest score
        xx1 = torch.clamp(xx1, min=x1[i])
        yy1 = torch.clamp(yy1, min=y1[i])
        xx2 = torch.clamp(xx2, max=x2[i])
        yy2 = torch.clamp(yy2, max=y2[i])
        w.resize_as_(xx2)
        h.resize_as_(yy2)
        w = xx2 - xx1
        h = yy2 - yy1
        # check sizes of xx1 and xx2.. after each iteration
        w = torch.clamp(w, min=0.0)
        h = torch.clamp(h, min=0.0)
        inter = w * h
        # IoU = i / (area(a) + area(b) - i)
        rem_areas = torch.index_select(area, 0, idx)  # load remaining areas
        union = (rem_areas - inter) + area[i]
        IoU = inter / union  # store result in iou
        # keep only elements with an IoU <= overlap
        idx = idx[IoU.le(overlap)]
    return keep, count
```
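A small standalone example (three boxes, two heavily overlapping; only the first count entries of keep are valid):

```python
# standalone check: the near-duplicate box is suppressed by nms
import torch
boxes = torch.tensor([[0.00, 0.00, 1.0, 1.0],
                      [0.05, 0.05, 1.0, 1.0],   # near-duplicate of box 0
                      [2.00, 2.00, 3.0, 3.0]])  # disjoint box
scores = torch.tensor([0.9, 0.8, 0.7])
keep, count = nms(boxes, scores, overlap=0.5)
print(keep[:count])  # tensor([0, 2])
```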
```python
class Detect(Function):
    """At test time, Detect is the final layer of SSD.  Decode location preds,
    apply non-maximum suppression to location predictions based on conf
    scores and threshold to a top_k number of output predictions for both
    confidence score and locations.
    """
    def __init__(self, num_classes, bkg_label, top_k, conf_thresh, nms_thresh):
        self.num_classes = num_classes
        self.background_label = bkg_label
        self.top_k = top_k
        # Parameters used in nms.
        self.nms_thresh = nms_thresh
        if nms_thresh <= 0:
            raise ValueError('nms_threshold must be non negative.')
        self.conf_thresh = conf_thresh
        # cfg is the VOC config dict imported at the top of detection.py
        self.variance = cfg['variance']
```
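This is the self.detect that SSD.forward calls in the test phase. In the repository it is constructed with the VOC settings, roughly as follows (a sketch of the upstream usage; it is then invoked as self.detect(loc, conf, priors)):

```python
# how the test phase constructs Detect (VOC settings from the upstream repo):
# 21 classes, background label 0, at most 200 detections kept per class,
# confidence threshold 0.01, NMS IoU threshold 0.45
detect = Detect(num_classes=21, bkg_label=0, top_k=200,
                conf_thresh=0.01, nms_thresh=0.45)
```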