Using PostgreSQL As A Search Engine For Your Site

Image provided by Stocklove

Image provided by Stocklove

Ever wonder how Google, Yahoo, and Bing do their searching? Basically, they aggregate and store massive amounts of data and then they do “full text” searches across various indexes and data storages. Of course, that’s a simplified way of looking at it but that’s it. A “full text” search, similar to what your computer would have had during the time of Windows XP, is a search that takes you across all your files and all your data. Whenever you made a search, you could request a “full text” search which would look across all your .txt, .doc files and bring up your results.

The difference between a “full text” search and Google is also an algorithm of weights as to what is important, what is not. If proximity of two search terms is important or not and so on. But outside of that, it’s pretty much the same thing.

I used a slideshare presentation from ’09 on PgSQL Full Text search as an inspiration for writing this post so you’ll see a lot of similarities between the two.

Using It For Your Site

Now, you don’t need hundreds of servers, and a billion dollar infrastructure to harness the power of full text search, you just need your database for that! Most database systems have a form of a “search engine” in one way or another.

The inefficient, “easy” way

The simplest of which is the “LIKE” statement like so:

select id from posts where content LIKE '%searchterm%'

But that’s not a very good or efficient way of doing it. It’s not only taxing (though unnoticeable on small sites) but it also skips a lot of information. What if you’re looking for a title? Or a tag? or otherwise? Right now, you’d be only searching through the content of a post. That’s crap. The wildcards present make sure that it doesn’t matter where the term is by the way. So, using this method, let’s see what we can do:

SELECT id FROM posts WHERE content LIKE '%searchterm%' OR WHERE tags LIKE '%searchterm%' OR WHERE title LIKE '%searchterm%'

Bleh, so now you’re using three inefficient ways of looking for content. Now, if you’re just searching through tags, by all means, just use “LIKE”. If you’re searching just titles, go ahead. But if you find yourself using “like” several times, and having to add multiple “OR WHERE”s and you end up unsatisfied with your results, it may be time to switch to full-text search and indexing. Sounds scary already, doesn’t it? Well, it’s actually fairly simple.

The REALLY hard way

Before I move to using your database as a search engine, there are several popular other ways of doing a good full-text search. There are entire systems built around this. So let’s review some popular ones:

  • Lucene – by far the most popular that I know of. Why do I say that? Because I’ve worked with it and I’ve worked with Solr (the PHP fork of Lucene) and they’re great but they are WAY too difficult to setup.
  • Sphinx  - another popular “search engine” for your server. Often compared to Lucene in terms of performance.
  • Search Engine Services – There are many services out there that will externally document and search through your website and data. The easiest of which is google with its custom site search. This is great for beginners. Outside of google, you should check out swiftype

Why are these part of the “hard” way? Because there are parts you have to setup, synchronize, and figure out on top of the tools you already have in your toolkit. Your results WILL be better with a dedicated “full text” search engine and an inverted index, but for the most part, unless you’re building google or a high-quality recommendation service, you won’t need it.

The Best Way

The best way is to utilize your own database and its search capabilities. I’m not going to get into the tips and tricks of messing with weights so let’s do a full-on full-text search that will include your post name, post content, and tags. To do this the most efficient way, we need to create a new column that will aggregate our search data. We’ll use a tsvector type for the column like so:

ALTER TABLE posts ADD COLUMN searchtext TSVECTOR;

A tsvector is a data format meant for full text search (ts = text search). In a normal query, you’d have to declare and change the typing on fly (meaning that we’d search through a “text” field which would be converted to tsvector during each query) which is why we’re creating a special column for this.

Next, we’ll add all of our information in a tsvector format into that column:

UPDATE posts SET searchtext = to_tsvector('english', title || '' || content || '' || tags)

Substitute with your own column names. This is great but it will not give you an awesome performance just yet. It may also take a while (not too long!). The cool thing is that this query will coalesce our data into a single column, again, giving us an additional boost. Again, if we just did this on the fly, you could have 3 columns being converted every time a row is hit. We’re specifying the language here as well which allows PgSQL to throw out some words (like “and”, “a”, “an” and others) that hold little to no meaning.

Now we have that cool but wait, there’s more! We’ll need to create an index so that the search doesn’t take 30 minutes per query ;)

CREATE INDEX searchtext_gin ON posts USING GIN(searchtext);

We’re using a GIN index (generalized inverted index), an inverted index is commonly used for full text search and is what Lucene uses. By the way, this CAN take a LONG time, depending on your database. It took my production server about 10-20 minutes on about 50K rows full of data (and by “data”, I mean lots and lots of data in each row).

We’re almost there. Now, you’ll probably update your blog, your site or whatever, so let’s create a trigger that will update our index whenever something happens:

CREATE TRIGGER ts_searchtext BEFORE INSERT OR UPDATE ON posts FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger('searchtext', 'pg_catalog.english', 'title', 'content', 'tags')

Sweet. So this trigger will (as the words say) update our searchtext column whenever something changes :) By the way, I’m using “pg_catalog.english” instead of just “english” because for some reason “english” would give me errors when updating a row. :/

Finally, let’s actually try to find something!

SELECT * FROM posts where searchtext @@ to_tsquery('keyword');

I can bet you that your search will be super quick. As you can see it’s pretty simple. My searches went down to a few milliseconds per query once I implemented all of this.

Are you using Laravel?

I’m currently using Laravel with PgSQL as the basis for my app. I’d like to show you a quick implementation for a search ability. Let’s create a “route” called “search”. By the way, I’m using laravel 3.2.12 for future reference.

Route::get('search/(:any)', array('as'=> 'search', 'do'=>function($searchquery){
	$query = urldecode($searchquery);
	$posts = Post::raw_where("searchtext @@ to_tsquery('".$query."')")->get();
	return View::make('search')->with('results', $posts);
}));

Pretty simple, right? Laravel’s “raw_where” is a way of creating your own custom raw SQL to sort through data. Sometimes Laravel’s eloquent doesn’t exactly have the right functions available. There is no way to do full text search like this for example, so we have an opportunity to still take advantage of eloquent ORM but keep PgSQL’s power intact. You’re also welcome to paginate your results.

Hope this helps you guys!

Comments

  1. thank you for writing the article but you have not mentioned rank usage anywhere which is needed for getting quality search

    • Antonin Januska says:

      Weighted search gets a bit more complicated and takes more than just these few lines of code. However, I will put this article idea on my queue list and reply back when I write it :)

  2. Hi Antonin, thanks for this tutorial, excellent!

Add Your Comment