Arrow of time
The Needle Search Server - alpha

I've written before about my Needle light-weight full-text search server. To recap: it's a full-text search server written in C++ with a FastCGI interface, using Google's LevelDB for storage, and with a pure REST API. It's available at BitBucket if you want to test it yourself!

As these things go, finding the time to work on Needle has taken a lot more effort than I expected, but I'm managing it here and there. Most importantly, the use case I need it for (hosting a searchable and subscribable index of Croatia's state gazette on a Raspberry Pi) still exists and needs Needle to progress.

In the last few months I've finalized most of Needle's internals, so it's now actually usable! The most important thing it lacks right now is some kind of smart search query syntax: currently, it returns all documents containing any of the given words, i.e. it performs a logical OR. With the infrastructure I have in place it should be reasonably easy to implement logical operators and phrase searching (i.e. words near each other), with ranking.
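To make the current OR behaviour concrete, here is a minimal Python sketch of an "any word matches" search over an in-memory inverted index. It is only an illustration of the semantics described above, not Needle's actual C++/LevelDB implementation; the names build_index and search_any and the sample documents are made up for the example.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search_any(index, query):
    """Return ids of documents containing at least one query word (logical OR)."""
    hits = set()
    for word in query.lower().split():
        # .get() avoids creating empty entries for unknown words
        hits |= index.get(word, set())
    return hits

# Hypothetical sample documents, for illustration only
docs = {
    "a1": "law on public roads",
    "b2": "decision of the court",
    "c3": "road tax decision",
}
index = build_index(docs)
print(sorted(search_any(index, "court tax")))
```

An AND operator would intersect the per-word sets instead of taking their union, which is why the same index structure can serve both.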

The full list of completed features is:

  • A pure FastCGI server in C++ with enough tweaks to run under Nginx, Apache and Lighttpd, with a REST API
  • The ability to create (and use) multiple, arbitrary databases (each database is a collection of documents), with up to 4 billion different documents stored in a database
  • Word indexing uses simple stemming, enabling the search of similar word forms. Currently, stemmers for English and Croatian are provided
  • Text document import (indexing) is completely implemented
  • Experimental support for importing JSON dictionaries that describe documents with multiple parts (e.g. title, abstract, body), each of which can have a different ranking
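As a rough idea of what "simple stemming" buys you, the toy Python sketch below strips a few English suffixes so that different forms of a word map to the same index key. This is not Needle's stemmer (its actual rules for English and Croatian are not described here); the suffix list and the toy_stem function are assumptions made purely for illustration.

```python
# Illustrative suffix list; checked in this order
SUFFIXES = ("ing", "es", "ed", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for form in ("search", "searches", "searched", "searching"):
    print(form, "->", toy_stem(form))
```

Because both the indexed words and the query words go through the same function, a query for one form finds documents containing the others.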

To run Needle, you need to:

  1. Obtain the source from BitBucket and compile it
  2. Create a config file in /etc as described below
  3. Configure a web server to interface with Needle's FastCGI server
  4. Run the created executable

If everything goes OK, you should be able to immediately run some REST API operations on the server.

Building Needle

To build Needle, your system needs the following development libraries (the names below are valid for Ubuntu):

  • libfcgi-dev
  • libjsoncpp-dev
  • libsnappy-dev
  • libboost1.54-all-dev

If all the dependencies are installed, needled can be compiled simply by running make.

The configuration file

The configuration file is named /etc/needle.conf.json on Linux, and it should contain content like the following:

{
    // Basic server configuration
    "server": {
        // Verbosity level; 0=quiet, 3=most verbose
        "verbosity": 1,
        // FastCGI socket path; use ":12345" for TCP port instead of Unix socket
        "socket_path": "/tmp/needle.sock"
    },
    // Database configuration
    "db": {
        // Parent directory for all databases
        "dir": "/tmp"
    },
    // Word manipulation
    "words": {
        // Configure stemmer ("none" | "croatian" | "english")
        "stemmer": "croatian"
    }
}

The server.socket_path entry governs where and how the FastCGI socket will be created. It can be a path to a Unix socket file, or a syntax like ":9999" for a TCP socket. The db.dir entry specifies the top-level directory under which Needle will create its databases. Note that the user running the Needle server needs permission to write to this directory. Finally, the words.stemmer entry specifies which stemmer to use when indexing documents. Currently, this is a per-server setting and influences all databases and all documents.
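If you want to read this file from your own tooling, keep in mind that the // comments are not standard JSON, so Python's json module will reject the file as-is. The sketch below is one possible way to handle that, assuming comments always sit on their own lines as in the sample above; load_needle_config and describe_socket are hypothetical helper names, not part of Needle.

```python
import json

def load_needle_config(text):
    """Parse config text after dropping lines that contain only a // comment."""
    kept = [line for line in text.splitlines()
            if not line.lstrip().startswith("//")]
    return json.loads("\n".join(kept))

def describe_socket(socket_path):
    """Classify server.socket_path as a TCP port (":12345") or a Unix socket path."""
    if socket_path.startswith(":"):
        return ("tcp", int(socket_path[1:]))
    return ("unix", socket_path)

# Shortened copy of the sample configuration, for illustration
SAMPLE = """
{
    // Basic server configuration
    "server": {
        "verbosity": 1,
        "socket_path": "/tmp/needle.sock"
    },
    "db": { "dir": "/tmp" },
    "words": { "stemmer": "croatian" }
}
"""

config = load_needle_config(SAMPLE)
print(describe_socket(config["server"]["socket_path"]))
print(describe_socket(":9999"))
```

Dropping whole comment lines, rather than everything after a "//", avoids mangling string values that happen to contain two slashes.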

Configuring Nginx

Nginx is the simplest to configure for the Needle FastCGI server, though additional configuration examples are available in the README file and in the doc directory. Simply add lines like the following to your Nginx configuration file (e.g. in one of the sites-enabled files):

location /needle {
    include         fastcgi_params;
    root            /needle;
    fastcgi_pass    unix:/tmp/needle.sock;
}

Starting Needle

You should run the needled executable with the -d argument to make it stay in the foreground. A small number of critical messages will be written to the standard output, while the majority of messages will be passed to syslog.

Testing Needle

A small number of helper Python (2.7) scripts are provided in the utils directory.

  • To create a Needle database named test, run ./ncreatedb test
  • To import a text document into the test database, run ./nimporttxt test doc_id filename.txt. The doc_id is the unique document identifier. It can be almost anything you like: it's just an opaque string which must match the regex ([a-zA-Z][a-zA-Z0-9._$%]+).
  • To search for a word, run ./nsearch test word

Of course, Needle is not meant to be used with such helper scripts, but directly from your applications by accessing its REST API. Each of the scripts prints out the URL it is using to perform its task if it's run with the -v argument.

With this option, you can easily see that, e.g., creating a database means issuing a GET request to a URL in the form of http://example.com/needle/+create/dbname, that importing a text document means a POST request to http://example.com/needle/dbname/doc_id, that searching means a GET request to http://example.com/needle/dbname?q=foo, etc.
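To show how these URL shapes could be used from an application, here is a small Python 3 sketch that builds the method and URL for each of the three operations. It only constructs the requests and does not send them; the function names are made up for the example, and percent-encoding the database name, doc_id and query is an assumption about what the server accepts. The format of the POST body is not covered here.

```python
from urllib.parse import quote, urlencode

# Base URL as used in the examples above; adjust to your own host
BASE = "http://example.com/needle"

def create_db_request(db):
    """GET /needle/+create/<dbname>"""
    return ("GET", f"{BASE}/+create/{quote(db, safe='')}")

def import_request(db, doc_id):
    """POST /needle/<dbname>/<doc_id>"""
    return ("POST", f"{BASE}/{quote(db, safe='')}/{quote(doc_id, safe='')}")

def search_request(db, query):
    """GET /needle/<dbname>?q=<query>"""
    return ("GET", f"{BASE}/{quote(db, safe='')}?{urlencode({'q': query})}")

print(create_db_request("test"))
print(import_request("test", "doc.1"))
print(search_request("test", "foo"))
```

The resulting pairs can be handed to any HTTP client, for example urllib.request from the standard library.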

Have fun, and I would very much welcome your feedback!


How about a Digital Price Tag?

How about adding intelligent information to our price tags? BitCoin and DogeCoin (and others) already have payment URLs, but we can do better than that! The easiest way to offer something for purchase using cryptocurrency is simply to link and/or describe what you want to the buyers on your web shop. You don't really need middle-men and merchants to do so: it's easy to create your own "BUY" link yourself. Option #1: payment URL's Most people know...

The Needle Search Server - pre-alpha

I've talked about my new project, the Needle Search Server before - it is supposed to be a light-weight full text search server written in C++ and using LevelDB for storage. I've arrived at a point where the code actually does something useful and I want to talk about it some more. Of course, you will need to fetch and compile the code yourself and once you get over that hurdle, you...

Starting a new project - Needle

I have recently built an "Open Government" service which takes all the documents from the official Croatian government gazette which, among other things, publishes laws, changes to laws, decisions of the Constitutional court, etc. and indexes them, offering two new services: better full-text searchability and data "push" approach, allowing users to "subscribe" to arbitrary search queries and get notified when there are new documents published which are matched by those queries. Though these documents are pro-forma published on-line at...
