Since Satya Nadella took over as CEO, Microsoft's stance on open source has been much more, well, open. The company has released code for several projects, and that's now extending to technology derived from Bing.
Over the past few days, Microsoft has been releasing components of Bing's search technology, including BitFunnel. The tool allows for high-performance full-text search across a large portion of the internet.
So far only a few bits and pieces are available, but an open source just-in-time compiler could have uses far beyond search.
The BitFunnel GitHub page has three project listings on it so far. The first is the BitFunnel search tool itself, but “Workbench” and “NativeJIT” are also core to its operation.
Out of the three, NativeJIT is probably the most exciting. Here's Microsoft's own description:
“NativeJIT is an open-source cross-platform library for high-performance just-in-time compilation of expressions involving C data structures. The compiler is light weight and fast and it takes no dependencies beyond the standard C++ runtime. It runs on Linux, OSX, and Windows. The generated code is optimized with particular attention paid to register allocation.”
One of NativeJIT's most important uses in Bing is scoring the documents that contain a query's keywords. Each candidate document is scored on how well it matches the user's intent, and the results are ordered accordingly. A custom expression is created for each query, and NativeJIT then compiles it into x64 code that can be run on a large set of documents across multiple machines.
According to Microsoft, the resulting assembly code is fast, and most useful in scenarios where:
- “The expression isn't known until runtime.
- The expression will be evaluated enough times to amortize the cost of compilation.
- Latency and throughput demands require low cost for compilation.”
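The compile-once, evaluate-many pattern behind those three conditions can be sketched in a few lines of Python. This is only an analogy: it turns an expression tree into a Python closure rather than x64 machine code, and the node classes (`Field`, `Const`, `BinOp`), the `compile_expr` helper, and the document fields are all hypothetical names chosen for illustration, not part of NativeJIT.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Union


# Hypothetical expression nodes for illustration; these are not NativeJIT types.
@dataclass(frozen=True)
class Field:
    name: str


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class BinOp:
    op: str
    left: "Expr"
    right: "Expr"


Expr = Union[Field, Const, BinOp]

_OPS = {"+": operator.add, "*": operator.mul}


def compile_expr(expr: Expr) -> Callable[[dict], float]:
    """Walk the tree once and return a closure that scores one document."""
    if isinstance(expr, Field):
        name = expr.name
        return lambda doc: doc[name]
    if isinstance(expr, Const):
        value = expr.value
        return lambda doc: value
    if isinstance(expr, BinOp):
        fn = _OPS[expr.op]
        left = compile_expr(expr.left)
        right = compile_expr(expr.right)
        return lambda doc: fn(left(doc), right(doc))
    raise TypeError(f"unsupported node: {expr!r}")


# The expression is only known once the query arrives:
# score = 2.0 * title_hits + body_hits
query_expr = BinOp("+", BinOp("*", Const(2.0), Field("title_hits")), Field("body_hits"))

# Pay the compilation cost once per query...
score = compile_expr(query_expr)

# ...then reuse the compiled scorer across every candidate document.
docs = [
    {"id": "a", "title_hits": 1, "body_hits": 3},
    {"id": "b", "title_hits": 0, "body_hits": 4},
    {"id": "c", "title_hits": 2, "body_hits": 2},
]
ranked = sorted(docs, key=score, reverse=True)
print([d["id"] for d in ranked])  # ['c', 'a', 'b']
```

The tree is inspected only inside `compile_expr`; scoring a document afterwards just calls the returned closure. The more documents a query touches, the more the one-time compilation cost is spread out, which is the amortization the second bullet describes.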
The project is still in development, so it's fairly basic. However, Microsoft has plans to add extra optimizations. One example is reworking the code generator to “restrict execution [of conditionals] to either the true or the false path.”
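To see why that matters, here is a rough Python illustration of the difference between a conditional that does the work of both paths and one that only runs the path it needs. It assumes the planned change means the untaken branch does no work; the function names and the call counter are hypothetical and say nothing about how NativeJIT's code generator is actually structured.

```python
# Hypothetical call counter used only to make the work on each path visible.
calls = {"true_path": 0, "false_path": 0}


def true_path(x: float) -> float:
    calls["true_path"] += 1
    return x * 2


def false_path(x: float) -> float:
    calls["false_path"] += 1
    return 0.0


def select_both(cond: bool, x: float) -> float:
    """Evaluate both paths, then pick one result."""
    t = true_path(x)
    f = false_path(x)
    return t if cond else f


def branch_one(cond: bool, x: float) -> float:
    """Evaluate only the path selected by the condition."""
    return true_path(x) if cond else false_path(x)


def count_calls(conditional, inputs):
    """Run a conditional over the inputs and report how often each path ran."""
    calls["true_path"] = calls["false_path"] = 0
    results = [conditional(x > 0, x) for x in inputs]
    return results, dict(calls)


inputs = [1.0, -2.0, 3.0]
both_results, both_calls = count_calls(select_both, inputs)
one_results, one_calls = count_calls(branch_one, inputs)
print(both_calls)  # {'true_path': 3, 'false_path': 3}
print(one_calls)   # {'true_path': 2, 'false_path': 1}
```

Both versions return the same values, but the second does half the work on this input. In a scoring expression evaluated across millions of documents, skipping the unused path is the kind of saving such an optimization would be after.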
Workbench, on the other hand, is a package of Java and Lucene tools to prepare information for BitFunnel use. The tools allow users to convert Wikipedia database dumps into the BitFunnel corpus filetype.
First, the dump files are extracted and converted, removing any wiki markup. Then a Lucene analyzer is run for tokenization and stemming, and the data is finally encoded and written into the BitFunnel format.
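The shape of that pipeline can be sketched in Python, with heavy caveats. Workbench itself is built on Java and Lucene; the regexes below handle only a handful of markup cases, `naive_stem` is a toy suffix stripper standing in for Lucene's analysis, and the JSON-lines output is a made-up placeholder rather than the real BitFunnel corpus format. Every function name here is hypothetical.

```python
import json
import re


def strip_wiki_markup(text: str) -> str:
    """Remove a few common wiki constructs (far from a complete parser)."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)  # drop {{templates}}
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # [[target|label]] -> label
    text = re.sub(r"'{2,}", "", text)  # bold and italic quote runs
    text = re.sub(r"<[^>]+>", "", text)  # HTML-like tags
    return text


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(token: str) -> str:
    """Toy stemmer: strip one common suffix if enough of the word remains."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def encode(doc_id: int, terms: list) -> bytes:
    """Placeholder encoding (one JSON object per line), not the BitFunnel format."""
    return json.dumps({"id": doc_id, "terms": terms}).encode("utf-8") + b"\n"


raw = "'''Bing''' is a [[web search engine|search engine]] owned by [[Microsoft]].{{citation needed}}"
terms = [naive_stem(t) for t in tokenize(strip_wiki_markup(raw))]
record = encode(1, terms)
print(terms)
```

Each stage takes the previous stage's output and nothing else, so any one of them could be swapped for a more faithful implementation without touching the others.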
The team states that although initial conversion may be slow, later experiments should be quick and reliable.
What Does This Mean for Bing?
Of course, the big question is how the open-sourcing will affect Bing itself. It's unclear exactly what impact the release will have. On the one hand, aspects of the tools could be used by competitors.
On the other, an open source project could allow for faster adoption of new standards, and input from other developers may lead to further innovation. Whatever the case, it's great to see Microsoft's renewed commitment to the open-source community.
You can read and test the projects yourself by visiting the BitFunnel GitHub page.