This article is the first in a series on deep learning for vision: object detection. Since there are quite a few algorithm families, it may run long. It mainly covers:

  • [✔️] R-CNN (slow)
  • [✔️] SPP-Net (spatial pyramid pooling; feature map + SPP layer)
  • [✔️] Fast R-CNN (single-level SPP, end-to-end)
  • [✔️] Faster R-CNN (region proposal layer, anchors)
  • [✔️] YOLO (You Only Look Once)
  • [➖] SSD
  • [✔️] Selective Search (raw proposals)
  • [✔️] Non-maximum suppression (suppressing overlaps)

At present, the main application areas of computer vision are classification, localization, detection, segmentation, and captioning:

  1. Object detection (not just classifying, but also marking each object of interest with a bounding box)
  2. Object segmentation (beyond detection, also outputting each object's contour)
  3. Image classification
  4. Image captioning
  5. Image depth estimation

I plan to walk through how DL is applied along these directions. Traditional detection typically extracts features with hand-crafted operators such as HOG or SIFT, then classifies them with SVM, AdaBoost, or similar classifiers. For example, face detection commonly uses Haar features plus AdaBoost, pedestrian detection uses HOG+SVM, and generic objects use HOG+DPM (deformable part model).
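For concreteness, here is a minimal sketch of that classic pipeline using scikit-image and scikit-learn; the random patches are stand-ins for real training windows, and window extraction plus hard-negative mining are omitted:

```python
import numpy as np
from skimage.feature import hog        # HOG descriptor
from sklearn.svm import LinearSVC      # linear SVM classifier

def hog_feature(window):
    # window: a grayscale patch, e.g. 128x64 as in classic pedestrian detection
    return hog(window, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

rng = np.random.default_rng(0)
windows = rng.random((20, 128, 64))    # dummy patches; real ones come from sliding windows
labels = rng.integers(0, 2, size=20)   # 1 = pedestrian, 0 = background

X = np.array([hog_feature(w) for w in windows])
clf = LinearSVC().fit(X, labels)
scores = clf.decision_function(X)      # at test time, slide this over all windows
```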

Window extraction was traditionally done by exhaustive sliding windows. A later paper introduced Selective Search, which sharply reduces the number of windows compared to sliding (the principle is covered below). Selective Search is used in R-CNN, SPP-Net, and Fast R-CNN; Faster R-CNN replaces it with a Region Proposal Network, which speeds up computation and yields an end-to-end pipeline rather than a separate Selective Search stage plus classifier.

Object Detection

Object detection involves two goals: localization (detection) and classification, as follows.

DL-based algorithms that perform well on object detection include:

  1. R-CNN
  2. SPP-Net
  3. Fast R-CNN
  4. Faster R-CNN
  5. Mask R-CNN (pixel-level instance segmentation)
  6. YOLO (You Only Look Once)
  7. SSD (Single Shot MultiBox Detector)

These algorithms fall roughly into two families: region-proposal-based methods and regression-based methods. The former are more accurate but slow; the latter's advantage is speed (real-time).

The main benchmark is PASCAL VOC; the chart below shows recent progress on that dataset.

2014: R-CNN

R-CNN (Regions with CNN features) is what brought CNNs into object detection and kicked off a wave of breakthroughs in the field.

Many improved algorithms derive from R-CNN.

R-CNN's goal: given an input image, correctly localize the main objects with bounding boxes.

Input: an image

Output: bounding boxes plus a label for each object

But how do we know where the boxes should be? R-CNN handles this much as we intuitively would: propose a large number of boxes in the image and check whether any of them overlap an object.

To generate these boxes, i.e. region proposals, R-CNN uses an algorithm called Selective Search. In short (a schematic of the full test path follows the list):

  1. Select roughly 2000 regions most likely to contain objects of interest.
  2. Warp each region to the input size the CNN expects.
  3. Feed the CNN features of each region (a 4096-d embedding) to SVMs for classification.
  4. Use a linear regression model to tighten each box.
  • Ad hoc training objectives:
    • Fine-tune the network with a softmax classifier (log loss)
    • Train post-hoc linear SVMs (hinge loss)
    • Train post-hoc bounding-box regressors (squared loss)
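A schematic of the R-CNN test path, with every stage passed in as a callable, since the real system wires together Selective Search, a Caffe CNN, per-class SVMs, and per-class box regressors (the names here are illustrative, not R-CNN's actual code):

```python
def rcnn_detect(image, propose, cnn_feature, svms, regressors, nms, thresh=0.5):
    """Schematic only: propose = Selective Search; cnn_feature = warp the
    crop to the CNN input size and take the 4096-d fc7 embedding."""
    detections = []
    for box in propose(image):                 # ~2000 candidate regions
        feat = cnn_feature(image, box)         # 4096-d feature per region
        for cls, svm_score in svms.items():    # one-vs-all linear SVMs
            score = svm_score(feat)
            if score > thresh:                 # drop low-confidence detections
                detections.append((cls, regressors[cls](box, feat), score))
    return nms(detections)                     # suppress duplicate boxes
```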

The details follow.

Training

  1. The SVMs are one-vs-all: N+1 linear classifiers are trained, one per class plus one for the background.
  2. Training does not feed in all 2000 proposals at once; positive RoIs (those with sufficient overlap with a ground-truth region) are used for a first round, then negative samples are added for a further round.
  3. Each linear classifier's parameters are just a 4096-d weight vector plus a bias term (see the scoring sketch after this list).
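Since each classifier is only a weight vector plus a bias, scoring all N+1 classes is a single matrix multiply; the weights below are dummies standing in for the trained SVM parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes = 21                              # e.g. 20 PASCAL classes + background
W = rng.standard_normal((n_classes, 4096))  # one weight vector per class (dummy)
b = rng.standard_normal(n_classes)          # one bias per class (dummy)

feat = rng.standard_normal(4096)            # fc7 feature of one region proposal
scores = W @ feat + b                       # one SVM score per class
pred = int(scores.argmax())                 # best-scoring class for this region
```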

The ground-truth annotations and positive samples look like this:

Testing and visualization

  1. After generating many RoIs and classifying them, detections with low confidence (a threshold of 0.5 is typical) or classified as background are discarded; non-maximum suppression is then applied so each object yields a single detection (no duplicates).

To resolve overlaps, non-maximum suppression does not merge candidate regions; it finds the candidate that best covers the object and simply discards the others.

The effect of non-maximum suppression is shown below.
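A minimal NumPy sketch of greedy non-maximum suppression for one class, with boxes as (x1, y1, x2, y2); the 0.5 IoU threshold is a typical choice, not a fixed part of the algorithm:

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_thresh=0.5):
    """Keep the best-scoring box, drop boxes overlapping it heavily, repeat."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]                # best-scoring box first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # intersection of box i with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]      # suppress heavy overlaps
    return keep                                   # indices of surviving boxes
```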

  1. The standard evaluation metric is mAP (mean Average Precision): for each class, draw its precision-recall (PR) curve and compute the area under it (AP), then average the APs over classes; see the paper for details.

A PR curve looks like this:


A possible set of results:

Results using 200 ROIs (this number is too low to get good accuracy but for demo purposes allows for fast training and scoring):

|Dataset|AP (avocado)|AP (orange)|AP (butter)|AP (champagne)|…|mAP|
|---|---|---|---|---|---|---|
|Training set|0.91|0.76|0.46|0.81|…|0.62|
|Test set|0.64|1.00|0.64|1.00|…|0.62|

Results using 2000 ROIs:

|Dataset|AP (avocado)|AP (orange)|AP (butter)|AP (champagne)|…|mAP|
|---|---|---|---|---|---|---|
|Training set|1.00|0.76|1.00|1.00|…|0.89|
|Test set|1.00|0.55|0.64|1.00|…|0.88|

The PR curve itself is obtained by sweeping the threshold at which a detection is rejected.

Regarding recall and precision, note that recall is measured against the ground-truth samples while precision is measured against the predictions, as follows.
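A sketch of exactly that sweep: sort detections by score, accumulate TPs and FPs, and integrate precision over recall. Matching detections to ground truth to decide TP/FP is assumed done, and PASCAL VOC uses slightly different interpolation variants:

```python
import numpy as np

def pr_and_ap(scores, is_tp, n_ground_truth):
    """scores / is_tp: one entry per detection across the dataset;
    is_tp marks detections matched to an unclaimed ground-truth box."""
    is_tp = np.asarray(is_tp, dtype=bool)
    order = np.argsort(-np.asarray(scores))   # sorting = lowering the threshold
    tp = np.cumsum(is_tp[order])
    fp = np.cumsum(~is_tp[order])
    recall = tp / n_ground_truth              # w.r.t. ground-truth samples
    precision = tp / (tp + fp)                # w.r.t. predictions
    # AP = area under the stepwise PR curve
    ap = np.sum(np.diff(np.concatenate(([0.0], recall))) * precision)
    return recall, precision, ap              # mAP = mean of per-class APs
```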


#### Selective Search


Goals:

1. Detect objects at any scale.
- Hierarchical algorithms are good at this.
2. Consider multiple grouping criteria.
- Detect differences in color, texture, brightness, etc.
3. Be fast.

The algorithm proceeds in three steps.

Step 1: Generate initial sub-segmentation
Goal: Generate many regions, each of which belongs to at most one object.


Step 2: Recursively combine similar regions into larger ones.
Greedy algorithm:

1. From set of regions, choose two that are most similar.
2. Combine them into a single, larger region.
3. Repeat until only one region remains.
This yields a hierarchy of successively larger regions, just like we want.
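A toy version of this greedy loop; regions, their similarity, and the merge operation are abstract callables here, and the real algorithm only compares neighbouring regions, using the combined similarity defined below:

```python
def hierarchical_grouping(regions, similarity, merge):
    """Repeatedly merge the most similar pair; every merge adds one proposal."""
    hierarchy = list(regions)                 # regions at all scales become proposals
    while len(regions) > 1:
        # pick the most similar pair among the current regions
        i, j = max(((i, j) for i in range(len(regions))
                           for j in range(i + 1, len(regions))),
                   key=lambda p: similarity(regions[p[0]], regions[p[1]]))
        merged = merge(regions[i], regions[j])
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)
        hierarchy.append(merged)
    return hierarchy                          # small-to-large region hierarchy
```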


Step 3: Use the generated regions to produce candidate object locations.


So how do we define the similarity used when merging regions?

Goals:

1. Use multiple grouping criteria.
2. Lead to a balanced hierarchy of small to large objects.
3. Be efficient to compute: should be able to quickly combine measurements in two regions.

For candidate regions, we need to define this similarity explicitly.

Two-pronged approach:

1. Choose a color space that captures interesting things.
- Different color spaces have different invariants, and different
responses to changes in color.
2. Choose a similarity metric for that space that captures everything we're interested in: color, texture, size, and shape.

First, the choice of color space:

  • RGB: a drawback is that changes in brightness affect all three channels.

  • HSV (hue, saturation, value): describes colors (hue, or tint) in terms of their shade (saturation, or amount of gray) and their brightness (value).

  • Lab uses a lightness channel and two color channels (a and b). It’s calibrated to be perceptually uniform. Like HSV, it’s also somewhat invariant to changes in brightness and shadow.

Next we define the similarity measures.

  • Color similarity


    For each color channel, build a 25-bin histogram; each region then gets a 75-dimensional color feature. Color similarity is computed by histogram intersection:

$$s_{colour}(r_i,r_j)=\sum^n_{k=1}\min(c_i^k,c_j^k)$$
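A sketch with NumPy histograms, assuming a region is given as an (N, 3) array of pixel values in [0, 1]; the histograms are L1-normalized so the intersection is comparable across region sizes:

```python
import numpy as np

def color_descriptor(pixels, bins=25):
    """75-d color feature: a 25-bin histogram per channel, L1-normalized."""
    hists = [np.histogram(pixels[:, c], bins=bins, range=(0.0, 1.0))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

def color_similarity(pixels_i, pixels_j):
    """Histogram intersection of the two regions' color descriptors."""
    return np.minimum(color_descriptor(pixels_i),
                      color_descriptor(pixels_j)).sum()
```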

  • Texture similarity


    Texture similarity can be measured with HOG-like features, as follows:
  1. For each channel, compute the image's Gaussian derivatives in 8 orientations.
  2. Build a 10-bin histogram per orientation, giving a 240-dimensional feature in total (8 orientations × 10 bins × 3 channels); similarity is again measured by histogram intersection.
  • Size similarity

When merging, it is better to build large regions out of small ones so the hierarchy stays balanced:

$$s_{size}(r_i,r_j)=1-\frac{size(r_i)+size(r_j)}{size(im)}$$

This discourages merged regions from growing too large too early.

  • Fill (shape compatibility)

It is computed as:

$$s_{fill}(r_i,r_j) = 1-\frac{size(BB_{ij})-size(r_i)-size(r_j)}{size(im)}$$

That is, after merging, the two regions should together fill their joint bounding box $BB_{ij}$ as completely as possible.

Linearly combining the four measures, we have:

$$s(r_i,r_j) = a_1s_{color}(r_i,r_j)+a_2s_{texture}(r_i,r_j)+a_3s_{size}(r_i,r_j)+a_4s_{fill}(r_i,r_j)$$

Different weight combinations then yield different merging strategies.
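As a sketch, with the four cues passed in as functions (implementations of the measures above are assumed):

```python
def combined_similarity(r_i, r_j, cues, weights=(1, 1, 1, 1)):
    """cues: (s_color, s_texture, s_size, s_fill) as defined above;
    each weight a_k is typically 0 or 1, and running Selective Search with
    several weight vectors yields the different merging strategies."""
    return sum(a * s(r_i, r_j) for a, s in zip(weights, cues))
```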

Evaluation: Average Best Overlap (ABO)

$$ABO=\frac{1}{|G^c|}\sum_{g_i^c\in G^c}\max_{l_j\in L}Overlap(g_i^c,l_j)$$
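A sketch of ABO for one class, with Overlap taken to be intersection-over-union of axis-aligned boxes:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def average_best_overlap(ground_truth, proposals):
    """For every ground-truth box of the class, take its best proposal."""
    return float(np.mean([max(iou(g, l) for l in proposals)
                          for g in ground_truth]))
```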

Summary

R-CNN works well, but it is extremely slow: not only is training slow, but detecting objects in a single image with a trained model can take minutes, which limits its practical use. Hence the many follow-up improvements.

2014: SPP-Net

SPP-Net differs from R-CNN mainly in that it first maps the whole image through the convolutional layers onto a feature map (the conv5 features), then projects each Selective Search proposal directly onto that feature map, avoiding repeated convolution. Since proposals come in different sizes, their feature-map crops also differ in size; a Spatial Pyramid Pooling (SPP) layer maps them to a fixed-length feature, which then feeds a fully connected layer. Classification still uses SVMs plus a linear box regressor, so three models must still be trained separately. The overall pipeline is shown below.

SPP-Net's main improvement is a large speedup at detection time (it "makes testing fast"), but its training scheme leaves the convolutional feature extractor untrained: the conv layers cannot be fine-tuned.

Note the SPP layer itself, structured as follows:

The figure above is the SPP-Net architecture. Any image can be fed into the CNN; the convolutions produce feature maps (e.g. VGG16's last convolutional layer conv5_3 yields 512 feature maps). The window in the figure is the feature-map region that a region proposal in the original image maps to; we only need to map these differently-sized windows to the same dimensionality to use them as input to the fully connected layers, which guarantees that convolutional features are extracted only once per image. SPP-Net does this with spatial pyramid pooling: each window is divided into 4×4, 2×2, and 1×1 grids, and each grid cell is max-pooled, so every window produces a (4×4+2×2+1)×512 = 21×512-dimensional feature vector that feeds the fully connected layers.

In short, a spatial pyramid pooling layer handles inputs of varying size by converting them into a fixed-length vector for the fully connected layers.
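A NumPy sketch of the pooling itself: a (C, H, W) feature-map window goes in, a fixed-length vector comes out (this assumes H and W are at least as large as the finest grid):

```python
import numpy as np

def spp_pool(fmap, levels=(4, 2, 1)):
    """Max-pool an n x n grid per pyramid level; the output length is
    (16 + 4 + 1) * C regardless of the window's H and W."""
    C, H, W = fmap.shape
    out = []
    for n in levels:
        ys = np.linspace(0, H, n + 1).astype(int)   # n x n grid boundaries
        xs = np.linspace(0, W, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                cell = fmap[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                out.append(cell.max(axis=(1, 2)))   # one C-vector per cell
    return np.concatenate(out)

# two windows of different sizes map to the same 21 * 512 = 10752 dims:
assert spp_pool(np.random.rand(512, 13, 9)).shape == (10752,)
assert spp_pool(np.random.rand(512, 24, 17)).shape == (10752,)
```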

Mapping a Window to Feature Maps.

Note how a proposal is mapped onto the feature map: an approximation is used. Given a proposal on the original image, its top-left corner maps as $x'=\lfloor x/S\rfloor+1$ and its bottom-right corner as $x'=\lceil x/S\rceil-1$, where $S$ is the product of the strides of all preceding layers. In other words, Selective Search still runs on the original image; the proposal window is then projected onto a feature-map window, spatially pyramid-pooled to a fixed-size feature, and passed on for classification.
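The corner mapping as a sketch; S = 16 is typical for ZF/VGG-style networks, being the product of the strides up to conv5:

```python
import math

def window_to_fmap(x1, y1, x2, y2, S=16):
    """Project an image-space proposal onto feature-map coordinates."""
    return (math.floor(x1 / S) + 1, math.floor(y1 / S) + 1,   # left, top
            math.ceil(x2 / S) - 1, math.ceil(y2 / S) - 1)     # right, bottom
```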

2015: Fast R-CNN

Although SPP-Net shares the feature map to reduce computation (R-CNN's per-region forward convolutions involved heavy duplicated work), training is still slow and split across separate stages. Fast R-CNN introduces two key changes:

  1. RoI (Region of Interest) Pooling

The RoI pooling layer is essentially a simplified SPP-Net: instead of a multi-level pyramid per proposal, it downsamples each proposal to a single 7×7 grid on the feature map. For VGG16, conv5_3 has 512 feature maps, so every region proposal yields a 7×7×512-dimensional feature vector as input to the fully connected layers.
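In terms of the spp_pool sketch from the SPP-Net section, RoI pooling is just the single-level case:

```python
import numpy as np

# RoI pooling = one pyramid level with a 7x7 grid (the fmap window is the
# proposal's conv5 crop; spp_pool is the sketch defined earlier).
roi_feat = spp_pool(np.random.rand(512, 19, 11), levels=(7,))
assert roi_feat.shape == (7 * 7 * 512,)   # 25088-d input to the FC layers
```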

  2. Merging the separate models into one network

The second feature is training the CNN, the classifier, and the bounding-box regressor jointly in a single model. Previously, feature extraction needed a CNN, classification a support vector machine, and box tightening a separate regressor; Fast R-CNN performs all three tasks in one network, so the whole model can be trained by backpropagation. This improves accuracy and, notably, also speeds up training.


Fast R-CNN trains the very deep VGG16 network 9× faster than R-CNN, is 213× faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3× faster, tests 10× faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe).

In short: Fast R-CNN merges the essence of R-CNN and SPP-Net and introduces a multi-task loss, making training and testing of the whole network very convenient. Trained on the PASCAL VOC 2007 training set, it reaches 66.9% mAP on the VOC 2007 test set; trained on VOC 2007+2012 it reaches 70% (expanding the dataset substantially improves detection performance). With VGG16, each image takes about 3 s in total.

Drawback: region proposals still come from Selective Search, which dominates detection time (proposals take 2-3 s, while feature extraction and classification take only 0.32 s), ruling out real-time use; nor is it truly end-to-end in training and testing, since proposals are extracted separately by Selective Search. So could a CNN directly produce region proposals and classify them? Faster R-CNN is exactly the detection framework that meets this need.

2015: Faster R-CNN

Despite the advantages above, Fast R-CNN still has one major bottleneck: the region proposer. As we have seen, the first step in localizing objects is generating a set of candidate boxes to test, and in Fast R-CNN these come from Selective Search, a fairly slow process that bottlenecks the whole system. Faster R-CNN therefore introduces the Region Proposal Network (RPN).

The core idea of the RPN is to generate region proposals directly with a convolutional network; the underlying method is still essentially a sliding window. The design is clever: the RPN only needs to slide once over the last convolutional layer, because the anchor mechanism plus box regression yields proposals at multiple scales and aspect ratios.

Look at the RPN architecture above (using the ZF model). Given an input image (say 600×1000), convolution produces a final feature map of roughly 40×60. A 3×3 convolution (the sliding window) is applied to this feature map; since the last convolutional layer has 256 feature maps, each 3×3 window yields a 256-dimensional feature vector, followed by a cls layer and a reg layer for classification and box regression (similar to Fast R-CNN, except there are only two classes here: object vs. background). Each 3×3 window position simultaneously predicts proposals at 3 scales (128, 256, 512) and 3 aspect ratios (1:1, 1:2, 2:1) in the input image; this mapping mechanism is called an anchor. So the 40×60 feature map has about 20,000 (40×60×9) anchors, i.e. 20,000 predicted region proposals. Why is this a good design? Although it is still a sliding-window strategy, the sliding happens on the convolutional feature map, whose resolution is 16×16 times lower than the original image (after four 2×2 poolings); and the 9 anchors cover three scales and three aspect ratios, with box regression on top, so even windows outside these 9 shapes can still be regressed to a proposal close to the object. In effect this is again a partition of the original image; an anchor-generation sketch follows.
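A sketch of anchor generation; the way scale and aspect ratio define width and height below is one common convention, and the RPN then classifies and regresses each anchor:

```python
import numpy as np

def generate_anchors(fmap_h, fmap_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """9 anchors (3 scales x 3 ratios) per feature-map cell, in image
    coordinates; ~40 * 60 * 9 = 21600 boxes for a 600x1000 input."""
    anchors = []
    for y in range(fmap_h):
        for x in range(fmap_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # cell center
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)    # fixed area s^2
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return np.array(anchors)          # shape (fmap_h * fmap_w * 9, 4)
```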

Faster R-CNN works to combat the somewhat complex training pipeline that both R-CNN and Fast R-CNN exhibited. The authors insert a region proposal network (RPN) after the last convolutional layer. This network is able to just look at the last convolutional feature map and produce region proposals from that. From that stage, the same pipeline as R-CNN is used (ROI pooling, FC, and then classification and regression heads).

2016: YOLO

A likely bottleneck limiting accuracy within the R-CNN framework is that it turns detection into classification of local image regions, which cannot fully exploit the context an object has within the whole image. RBG (Ross Girshick) seems to have realized this too: his newer paper, YOLO, returns to the regression approach. It works well, reaching 63.4% mAP on VOC 2007, and it is extremely fast, processing video in real time (YouTube: https://www.youtube.com/channel/UC7ev3hNVkx4DzZ3LO19oebg). Its accuracy is below Fast R-CNN's, but far above traditional real-time methods, with plenty of headroom; YOLOv2 has since been released.

YOLO is a single convolutional network that predicts multiple box locations and classes at once, enabling end-to-end detection and recognition; its biggest advantage is speed. Detection is, at heart, regression, so a CNN that performs regression needs no elaborate design. Rather than training on sliding windows or proposals, YOLO trains directly on whole images, which helps it distinguish objects from background; by contrast, the proposal-trained Fast R-CNN often mistakes background patches for specific objects. Of course, YOLO trades some accuracy for its speed. The YOLO detection pipeline is shown below.

Walking through YOLO's detection flow above:

  1. Given an input image, first divide it into a 7×7 grid.
  2. For each cell, predict 2 bounding boxes (each with a confidence that it contains an object, plus the cell's probabilities over the classes).
  3. The previous step predicts 7×7×2 = 98 candidate windows; discard the low-probability ones by thresholding, then remove redundant windows with NMS. The whole process is very simple: no intermediate region proposals are needed to find objects, and direct regression settles both location and class.

So how can locations and classes be regressed directly at different grid cells? See the network structure above: the front of the network resembles GoogLeNet, and what matters is the last two layers, a 4096-d fully connected layer followed by a fully connected mapping to a 7×7×30 tensor. The 7×7 is the grid: each cell predicts two candidate boxes together with their confidences and the cell's class probabilities, i.e. per box 4 coordinates (center point, width, height) and 1 objectness confidence, plus 20 class probabilities (the 20 VOC classes), for (4+1)×2+20 = 30 dimensions per cell. The 4096-d whole-image feature thus regresses, at every cell, everything detection needs (box information plus class).
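A sketch of decoding that tensor; box coordinates are left in YOLO's normalized, cell-relative form, and thresholding plus NMS follow as in the steps above:

```python
import numpy as np

def decode_yolo(output, conf_thresh=0.2, S=7, B=2, C=20):
    """output: (S, S, B*5 + C). Per cell: B boxes of (x, y, w, h, conf),
    then C class probabilities shared by that cell's boxes."""
    detections = []
    for i in range(S):
        for j in range(S):
            cell = output[i, j]
            class_probs = cell[5 * B:]                 # 20 class scores
            cls = int(class_probs.argmax())
            for b in range(B):
                x, y, w, h, conf = cell[5 * b: 5 * b + 5]
                score = conf * class_probs[cls]        # class-specific confidence
                if score > conf_thresh:
                    detections.append((i, j, x, y, w, h, cls, score))
    return detections                                  # then apply NMS
```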

The full algorithm proceeds as follows.





In short: YOLO recasts detection as a regression problem, which dramatically speeds up detection, letting YOLO process 45 images per second. And because each grid prediction uses whole-image features, the false-positive rate drops sharply (ample context). But YOLO has problems too: without a region proposal mechanism, regressing on only a 7×7 grid makes precise localization difficult, which is why YOLO's detection accuracy is not very high.

2016: SSD(Single Shot MultiBox Object Detector)

As analyzed above, YOLO's whole-image regression on a coarse 7×7 grid localizes objects imprecisely. Could combining it with region-proposal ideas give more precise localization? SSD achieves this by combining YOLO's regression approach with Faster R-CNN's anchor mechanism.

The figure above is SSD's framework. SSD obtains locations and classes the same way YOLO does, by regression; but where YOLO predicts a position from whole-image features, SSD predicts a position from the features around that position (which feels more reasonable). How is a position tied to its features? As you may have guessed: Faster R-CNN's anchor mechanism. As the framework figure shows, if a feature map (figure b) is 8×8, a 3×3 sliding window extracts the feature at each position, which is then regressed to the object's coordinates and class (figure c).

Generating Image Descriptions

Combining RNNs with CNNs yields beautiful applications, such as Fei-Fei Li's paper demonstrating natural-language descriptions generated after object detection.

Such natural-language descriptions differ from the class labels we have seen so far; they are weak labels: each training image carries a descriptive sentence as its label.

Using this training data, a deep neural network "infers the latent alignment between segments of the sentences and the region that they describe" (quote from the paper).

Spatial Transformer Networks

References

Selective search
YOLO: You only look once (How it works)
ObjectDetectionUsingCntk
drone-detection
Introduction to Detection

Resources

Conferences

• CVPR - Computer Vision and Pattern Recognition
• ICCV - International Conference on Computer Vision
• ECCV - European Conference on Computer Vision
• BMVC - British Machine Vision Conference
• ICIP - IEEE International Conference on Image Processing

Textbooks

• Computer Vision: A Modern Approach (2nd Edition) by David A. Forsyth and Jean Ponce
• Computer Vision by Linda G. Shapiro and George C. Stockman
• Computer Vision: Algorithms and Applications by Richard Szeliski
• Algorithms for Image Processing and Computer Vision by J. R. Parker
• Computer Vision: Models, Learning, and Inference by Dr Simon J. D. Prince
• Computer and Machine Vision, Fourth Edition: Theory, Algorithms, Practicalities by E. R. Davies

Beginner Books

• Programming Computer Vision with Python: Tools and algorithms for analyzing images by Jan Erik Solem
• Practical Computer Vision with SimpleCV : The Simple Way to Make Technology See by Kurt Demaagd, Anthony Oliver, Nathan Oostendorp, and Katherine Scott
• OpenCV Computer Vision with Python by Joseph Howse
• Learning OpenCV: Computer Vision with the OpenCV Library by Gary Bradski and Adrian Kaehler
• OpenCV 2 Computer Vision Application Programming Cookbook by Robert Laganière
• Mastering OpenCV with Practical Computer Vision Projects by Daniel Lélis Baggio, Shervin Emami, David Millán Escrivá, Khvedchenia Ievgen, Jason Saragih, and Roy Shilkrot
• SciPy and NumPy: An Overview for Developers by Eli Bressert

Python Libraries

When I first became interested in computer vision and image search engines over eight
years ago, I had no idea where to start. I didn’t know which language to use, I didn’t
know which libraries to install, and the libraries I found I didn’t know how to use. I WISH
there had been a list like this one, detailing the best Python libraries to use for image
processing, computer vision, and image search engines.
This list is by no means complete or exhaustive. It’s just my favorite Python libraries that
I use each and every day for computer vision and image search engines. If you think that
I’ve left an important one out, please leave me an email at adrian@pyimagesearch.com.

NumPy
NumPy is a library for the Python programming language that (among other things)
provides support for large, multi-dimensional arrays. Why is that important? Using
NumPy, we can express images as multi-dimensional arrays. Representing images as
NumPy arrays is not only computational and resource efficient, but many other image
processing and machine learning libraries use NumPy array representations as well.
Furthermore, by using NumPy’s built-in high-level mathematical functions, we can
quickly perform numerical analysis on an image.

SciPy
Going hand-in-hand with NumPy, we also have SciPy. SciPy adds further support for
scientific and technical computing. One of my favorite sub-packages of SciPy is the
spatial package which includes a vast amount of distance functions and a kd-tree
implementation. Why are distance functions important? When we “describe” an image,
we perform feature extraction. Normally after feature extraction an image is represented
by a vector (a list) of numbers. In order to compare two images, we rely on distance
functions, such as the Euclidean distance. To compare two arbitrary feature vectors, we
simply compute the distance between their feature vectors. In the case of the Euclidean
distance, the smaller the distance the more “similar” the two images are.

matplotlib
Simply put, matplotlib is a plotting library. If you’ve ever used MATLAB before, you’ll
probably feel very comfortable in the matplotlib environment. When analyzing images,
we’ll make use of matplotlib, whether plotting the overall accuracy of search systems or
simply viewing the image itself, matplotlib is a great tool to have in your toolbox.

PIL and Pillow
These two packages are good at what they do: simple image manipulations, such as
resizing, rotation, etc. If you need to do some quick and dirty image manipulations
definitely check out PIL and Pillow, but if you’re serious about learning about image
processing, computer vision, and image search engines, I would highly recommend that
you spend your time playing with OpenCV and SimpleCV instead.

OpenCV
If NumPy’s main goal is large, efficient, multi-dimensional array representations, then,
by far, the main goal of OpenCV is real-time image processing. This library has been
around since 1999, but it wasn’t until the 2.0 release in 2009 did we see the incredible
NumPy support. The library itself is written in C/C++, but Python bindings are provided
when running the installer. OpenCV is hands down my favorite computer vision library,
but it does have a learning curve. Be prepared to spend a fair amount of time learning
the intricacies of the library and browsing the docs (which have gotten substantially
better now that NumPy support has been added). If you are still testing the computer
vision waters, you might want to check out the SimpleCV library mentioned below,
which has a substantially smaller learning curve.

SimpleCV
The goal of SimpleCV is to get you involved in image processing and computer vision
as soon as possible. And they do a great job at it. The learning curve is substantially
smaller than that of OpenCV, and as their tagline says, “it’s computer vision made
easy”. That all said, because the learning curve is smaller, you don’t have access to as
many of the raw, powerful techniques supplied by OpenCV. If you’re just testing the
waters, definitely try this library out.

mahotas
Mahotas, just like OpenCV and SimpleCV, relies on NumPy arrays. Much of the
functionality implemented in Mahotas can be found in OpenCV and/or SimpleCV, but in
some cases, the Mahotas interface is just easier to use, especially when it comes to
their features package.

scikit-learn
Alright, you got me, Scikit-learn isn’t an image processing or computer vision library —
it’s a machine learning library. That said, you can’t have advanced computer vision
techniques without some sort of machine learning, whether it be clustering, vector
quantization, classification models, etc. Scikit-learn also includes a handful of image
feature extraction functions as well.

scikit-image
Scikit-image is fantastic, but you have to know what you are doing to effectively use this
library – and I don’t mean this in a “there is a steep learning curve” type of way. The
learning curve is actually quite low, especially if you check out their gallery. The
algorithms included in scikit-image (I would argue) follow closer to the state-of-the-art in
computer vision. New algorithms right from academic papers can be found in
scikit-image, but in order to (effectively) use these algorithms, you need to have developed
some rigor and understanding in the computer vision field. If you already have some
experience in computer vision and image processing, definitely check out scikit-image;
otherwise, I would continue working with OpenCV and SimpleCV to start.

ilastik
I’ll be honest. I’ve never used ilastik. But through my experiences at computer vision
conferences, I’ve met a fair amount of people who do, so I felt compelled to put it in this
list. Ilastik is mainly for image segmentation and classification and is especially geared
towards the scientific community.

pprocess
Extracting features from images is inherently a parallelizable task. You can reduce the
amount of time it takes to extract features from an entire dataset by using a
multithreading/multitasking library. My favorite is pprocess, due to the simple nature I
need it for, but you can use your favorite.

h5py
The h5py library is the de-facto standard in Python to store large numerical datasets.
The best part? It provides support for NumPy arrays. So, if you have a large dataset
represented as a NumPy array, and it won’t fit into memory, or if you want efficient,
persistent storage of NumPy arrays, then h5py is the way to go. One of my favorite
techniques is to store my extracted features in a h5py dataset and then apply
scikit-learn's MiniBatchKMeans to cluster the features. The entire dataset never has to be
entirely loaded off disk at once and the memory footprint is extremely small, even for
thousands of feature vectors.