College Conductor gives educational consultants and counselors access to the information of thousands of U.S. colleges and universities. The most natural method to find these schools is through search. Since College Conductor keeps up with the latest versions of Django, I was able to add search by using PostgreSQL’s full text search that is exposed to the Django ORM. The search features are not perfect, but they provide results quickly and avoid bringing in a more targeted tool like Elasticsearch. This post will cover some of the advantages and disadvantages of using these Django features.
The search ability of College Conductor centers around searching for university names. That means that the data has a couple of interesting characteristics:
When I reviewed the search options that Django exposes in PostgreSQL, I found two reasonable approaches: full text search and trigram similarity.
Let’s look at the one that I didn’t choose first.
Trigram similarity operates
by chunking up data
into three letter groups.
Using some computation,
PosgreSQL can compare the trigrams
and attempt to deduce
what the result should be.
This technique is great
for resolving typos
(which is sometimes called “fuzzy matching”).
if you’re searching for fruit
a trigram search would find
the expected result of
I found that trigrams failed spectacularly
for long phrases.
I liked the idea of fuzzy matching
so I started with the trigram option.
When I tested out the functionality,
it didn’t work at all.
Not only would
fail to find
University of Virginia,
but searching for
would also fail to find the school.
I believe this failure occurred
because so much of the school’s full name was missing.
Who wants to type out
Massachustets College of Pharmacy and Health Sciences
to benefit from spelling Massachusetts incorrectly?
Answer: no one.
Let’s try option two.
Getting a reasonable solution working for full text search was tricky. Again, phrases proved to be troublesome. For any level of success, I had to break up a search query into its parts and join them together with an OR operator. By taking this approach, I could avoid the granularity problem that was unavoidable with the trigram version.
Here’s the code to build the queryset:
# `search` is the user's provided search string. terms = [SearchQuery(term) for term in search.split()] # `name` is where the name of the school is stored in the model. vector = SearchVector('name') query = functools.reduce(operator.or_, terms) queryset = queryset.annotate( rank=SearchRank(vector, query)).order_by('-rank') queryset = queryset.filter(rank__gte=0.04)
Using this join strategy with an OR operator had one additional snag. There was a level of junk results that would appear that had nothing to do with the search phrase.
The result include a rank annotation
that measure of how good the match is
for the query.
Inspecting the rank
showed that all the junk was below a numeric threshold.
Because the list of U.S. schools changes
I decided to cut off anything below the threshold
(which happened to be close to
If the data set changed frequently,
I wouldn’t trust this magic number;
but I get to benefit from the unchanging nature
of my data set.
There are still problems with this method of searching that I’m not crazy about:
Virgwill not find any results. Also annoying.
In the long term, I don’t think PostgreSQL full text search will remain for College Conductor. In the short term, however, it is a great choice for getting decent search functionality cheaply.
If you want to chat about this with me, I'm @mblayman on Twitter.
Rollbar provides some excellent middleware that makes setup a snap for a Django application. But what if you're running Celery with your Django application too? There are a few bumps in the road that I'll explain how to resolve.
Matt is the lead software engineer at Storybird.
Always eager to talk about Python and other technology topics, Matt organizes Python Frederick in Frederick, Maryland (NW of Washington D.C.) and seeks to grow software skills for people in his community.