In machine learning parlance, clustering or similarity search looks for affinities in sets of data that normally don’t make such a job easy. If you wanted to compare 100 million images against each other and find the ones that looked most like each other, that’s a clustering job. The hard part is scaling well across multiple processors, where you’d get the biggest speedup.
Facebook’s AI research division (FAIR) recently unveiled, with little fanfare, a proposed solution called Faiss. It’s an open source library, written in C++ and with bindings for Python, that allows massive data sets like still images or videos to be searched efficiently for similarities.
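The core operation Faiss accelerates is nearest-neighbor search over dense vectors: items such as images are represented as numeric vectors, and a query returns the stored vectors closest to it. A minimal brute-force sketch of that operation in plain NumPy (the dimensions and data here are illustrative, not Faiss's API):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # dimensionality of each embedding vector
database = rng.standard_normal((10_000, d)).astype(np.float32)
query = rng.standard_normal(d).astype(np.float32)

# Squared L2 distance from the query to every database vector.
dists = ((database - query) ** 2).sum(axis=1)

# Indices of the 5 most similar vectors (smallest distances).
k = 5
nearest = np.argsort(dists)[:k]
print(nearest, dists[nearest])
```

A library like Faiss wraps this same add-vectors/search-top-k workflow behind index structures that avoid scanning every vector exhaustively.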
It’s also one of a growing class of machine learning tools exploring better ways to run algorithms in parallel across multiple GPUs, for speed that’s only available at scale.
A magnet for the needle in the haystack
FAIR described the project and its goals in a paper published at the end of last February. The problem wasn’t only how to run similarity searches, or “k-selection” algorithms, on GPUs, but how to run them effectively in parallel across multiple GPUs, and how to deal with data sets that don’t fit into RAM (such as terabytes of video).
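k-selection itself means extracting the k smallest of n computed distances without fully sorting all n of them. On a CPU, NumPy's `argpartition` performs that partial-selection step; this sketch only illustrates the operation, not FAIR's GPU algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)
dists = rng.random(1_000_000).astype(np.float32)
k = 10

# Partial selection: O(n) on average, versus O(n log n) for a full sort.
idx = np.argpartition(dists, k)[:k]
# Order only those k survivors by distance.
idx = idx[np.argsort(dists[idx])]
print(idx, dists[idx])
```

The paper's contribution is doing this step efficiently on GPUs, where the data-dependent branching of classic selection algorithms maps poorly onto massively parallel hardware.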
One demonstration used a library of 100 million images. Faiss was fed two images — one of a red flower, and one of a yellow flower — and instructed to find a chain of similar images between them. Searching all 100 million images for such similarities took 35 minutes on a set of four Nvidia Titan X GPUs.
FAIR claims Faiss is “8.5× faster than prior GPU state of the art” and provided some benchmarks to support its claim. When compared against two previous GPU k-selection algorithms, FAIR claimed, the Faiss algorithm was not only faster, but came a good deal closer to maximizing the available memory bandwidth for the GPU.
Another advantage with Faiss, said FAIR, was the total end-to-end time for the search — the time needed to construct the product-quantized (PQ) version of the data, plus the time needed to actually run the search. Competing solutions took days on end simply to build PQ graph data for one test; with Faiss, a “high-quality” graph can be built in “about half a day.”
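Product quantization is the compression scheme behind those PQ structures: each vector is split into m subvectors, and each subvector is replaced by the index of its nearest centroid in a small per-subspace codebook, shrinking storage dramatically. A toy sketch of the encode/decode round trip (real PQ trains its codebooks with k-means; the random codebooks here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, ksub = 64, 8, 256        # 8 subvectors, 256 centroids each
dsub = d // m                  # each subvector has 8 dimensions
vectors = rng.standard_normal((1000, d)).astype(np.float32)

# Toy codebooks; in real PQ these come from k-means training per subspace.
codebooks = rng.standard_normal((m, ksub, dsub)).astype(np.float32)

# Encode: per subvector, store the index of the nearest centroid.
# Result: 8 bytes per vector instead of 64 floats (256 bytes).
codes = np.empty((len(vectors), m), dtype=np.uint8)
for j in range(m):
    sub = vectors[:, j * dsub:(j + 1) * dsub]
    d2 = ((sub[:, None, :] - codebooks[j][None, :, :]) ** 2).sum(-1)
    codes[:, j] = d2.argmin(axis=1)

# Decode: approximate reconstruction from the stored codes.
recon = np.concatenate([codebooks[j][codes[:, j]] for j in range(m)], axis=1)
print(codes.shape, recon.shape)
```

Distances to compressed vectors can then be computed from precomputed centroid tables rather than from the raw data, which is what makes billion-scale search tractable in memory.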
Facebook has also worked on allowing smartphone-grade GPUs to perform image-recognition work.
Another likely long-term advantage of algorithms that can efficiently trade accuracy for speed is to divide labor between a local device (fast, but not as accurate) and a remote back end (more accurate, but requires more processing power). Classifications made by a local device could be used as-is or augmented with more horsepower on the back end if there’s a network connection.
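That division of labor can be sketched as a two-stage search: a cheap approximate pass stands in for the local device, and an exact re-ranking of its shortlist stands in for the back end. Everything here is an illustration of the idea, not Facebook's architecture (the rounding step is a crude stand-in for a real compressed representation):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 32
database = rng.standard_normal((50_000, d)).astype(np.float32)
query = rng.standard_normal(d).astype(np.float32)

# "Local device": search low-precision copies of the vectors, fast but lossy.
coarse = database.round(0)
coarse_d = ((coarse - query.round(0)) ** 2).sum(axis=1)
shortlist = np.argpartition(coarse_d, 100)[:100]

# "Back end": exact distances, but only over the 100-item shortlist.
exact_d = ((database[shortlist] - query) ** 2).sum(axis=1)
best = shortlist[np.argsort(exact_d)[:5]]
print(best)
```

The coarse pass bounds how much work (and bandwidth) the accurate pass must spend, which is the same accuracy-for-speed trade the article describes.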
The biggest takeaway with Faiss: There’s still plenty of work to be done in figuring out how machine learning of all stripes can further benefit from massively parallel hardware.