Wednesday, October 15, 2008

Build a Search Engine with 10 Open Source Software Projects

Developing a large software system is always about standing on the shoulders of giants and dodging the proverbial reinvention of the wheel. By using the following open source projects in our search engine effort, we were able to do both. This is the first in a series of technical articles about the Cruxlux architecture in which we will explore how we use exceptional free technologies. Cruxlux is not a whole-web search engine like Google or MSN, but many of the development hurdles are common across any type of search. Without further ado, and in no particular order of importance, here is the list.

Operating System: Linux

In choosing a foundation OS, there really wasn’t a choice for us. Linux provides a solid architecture, and a large share of the open source community is constantly improving it. It is comfortable to develop on, with powerful tools such as wmii, vim, gcc, and valgrind, as well as simple-to-use package managers such as synaptic. As an added bonus, there is a wide variety of heavily tested images for Amazon EC2. If you are wondering which distros, we use a mix of Ubuntu and Gentoo.

Database: MySQL

Every web service needs some kind of datastore, so we went with our favorite open source database: MySQL. It has widely tested client libraries in many languages, and a lot of exciting features and development going into it. Guha had a stellar experience with it in Folding@Home, which generated terabytes of data. We are especially interested in tracking Drizzle, given that it will be specifically tailored toward high levels of concurrency and cloud computing. We use MySQL to store the metadata that our backend leverages, our user data, and the posts in our debate infrastructure.

HTTP Client: Curl

Every information extraction system needs a powerful way to grab data from the net. Curl stands the test of time as the best HTTP networking client library out there. It provides us with very high performance, highly concurrent crawls that can easily fill our bandwidth pipe with fresh content. Its threading support is very clean, and the event based support is something that may yield ludicrous speed. We are looking forward to exploring that more.

General Purpose Library: Boost

Boost is frankly amazing. The library is well thought out and the API usage is consistent throughout, so you don’t have to make a mental context switch every time you use a different Boost tool. Inside of Boost alone, we use Build, Date Time, Filesystem, Math, Pool, Regex, Serialization, Smart Ptr, String, Test, and, last but not least, bjam to build the whole shebang.

Networking Services: Libevent

Event-based programming has intrigued everyone with its scalability, as well as with how it lets developers achieve concurrency while thinking in a single-threaded mindset. See the C10K problem. We use libevent throughout our heavily service-oriented architecture. Once you go non-blocking, you never go back.
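To make the single-threaded, non-blocking idea concrete, here is a minimal sketch. It is not our libevent code: libevent is a C library, and this example swaps in Python's standard selectors module as a stand-in for the same readiness-notification pattern. The function names serve_once and demo are made up for illustration, and each connection is assumed to send one small message.

```python
import selectors
import socket


def serve_once(connections):
    """Answer one message per connection from a single-threaded event loop."""
    selector = selectors.DefaultSelector()
    for conn in connections:
        conn.setblocking(False)  # never let one slow peer stall the loop
        selector.register(conn, selectors.EVENT_READ)

    pending = len(connections)
    while pending:
        # Sleep until at least one socket is readable, then handle only those.
        for key, _events in selector.select():
            conn = key.fileobj
            data = conn.recv(4096)
            conn.sendall(data.upper())  # the "service": uppercase the request
            selector.unregister(conn)
            pending -= 1
    selector.close()


def demo():
    """Simulate three clients with in-process socket pairs (illustration only)."""
    pairs = [socket.socketpair() for _ in range(3)]
    for i, (client, _server) in enumerate(pairs):
        client.sendall(f"request {i}".encode())
    serve_once([server for _client, server in pairs])
    replies = [client.recv(4096).decode() for client, _server in pairs]
    for client, server in pairs:
        client.close()
        server.close()
    return replies
```

One loop services every connection, so there are no threads to lock or schedule. A real service would also handle partial reads, write readiness, and disconnects.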

Hash Table: Google Sparse/Dense Hash

Don't leave home without your trusty hash table. Thankfully, Google released some of its well guarded secrets to the world, because this library is essential to anyone who wants to deal with a lot of data in memory. We use it not only to cache certain data used by our web server, but also during calculation phases in our backend. Couple it with Paul Hsieh’s speedy string hash, and you have an elegant way to quickly address a great many websites on a single machine. If you know of any other good hash functions, please let us know in the comments. Murmur Hash looks fun to play with, and we will explore it in a later article when we look into hash functions in the domain of URLs.
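As a rough illustration of addressing sites by hashed URL, here is a small sketch. It makes several assumptions: it uses 64-bit FNV-1a rather than Paul Hsieh's hash (FNV-1a is simply short enough to show in full), a plain Python dict stands in for Google's sparse/dense hash maps, and the SiteTable class is hypothetical. Hash collisions are deliberately ignored here, which a production table could not afford.

```python
FNV_OFFSET_BASIS = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV_PRIME = 0x100000001B3              # standard 64-bit FNV prime
MASK_64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte in, then multiply by the FNV prime."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & MASK_64  # keep the value within 64 bits
    return h


class SiteTable:
    """Hypothetical table keyed by a fixed-size URL hash, not the URL string."""

    def __init__(self):
        self._records = {}

    def put(self, url: str, record) -> None:
        # Assumption: two distinct URLs never share a hash (collisions ignored).
        self._records[fnv1a_64(url.encode("utf-8"))] = record

    def get(self, url: str):
        return self._records.get(fnv1a_64(url.encode("utf-8")))
```

Storing an 8-byte key instead of the full URL is what keeps the per-site memory cost small and predictable.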

Indexing Engine: Sphinx

Once you get data into a system, you gotta have a way to get it out again! We tried a lot of different indexing systems, including CLucene, Solr, and MySQL Fulltext, but Sphinx won out because of its indexing speed, powerful delta indexing, and lightweight, scalable server. Each of them has its own strengths, but Sphinx fit our bill the best. Whatever you are looking for always seems to be at your fingertips in the documentation, and the community is top notch.
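For readers new to delta indexing, the idea is to keep a large main index that is rebuilt rarely and a small delta index that absorbs new documents, then query both. The sketch below is a toy in-memory version of that scheme, not Sphinx's implementation or API. The MainDeltaIndex class and build_index helper are hypothetical names, and tokenizing is just lowercase whitespace splitting.

```python
from collections import defaultdict


def build_index(docs):
    """Build a term -> set(doc_id) inverted index from {doc_id: text}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


class MainDeltaIndex:
    """Toy main+delta index: big main index plus a small, fresh delta."""

    def __init__(self, docs):
        self.main_docs = dict(docs)
        self.delta_docs = {}
        self.main = build_index(self.main_docs)
        self.delta = build_index({})

    def add(self, doc_id, text):
        # Only the small delta is rebuilt, so new documents show up quickly.
        self.delta_docs[doc_id] = text
        self.delta = build_index(self.delta_docs)

    def search(self, term):
        # Queries consult both indexes and union the matches.
        term = term.lower()
        return sorted(self.main.get(term, set()) | self.delta.get(term, set()))

    def merge(self):
        # Occasionally fold the delta into main and start a fresh delta.
        self.main_docs.update(self.delta_docs)
        self.main = build_index(self.main_docs)
        self.delta_docs = {}
        self.delta = build_index({})
```

The payoff is that fresh content becomes searchable at the cost of reindexing only the recent documents, while the expensive full rebuild happens on a slower schedule.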

Web Server: Nginx

To continue the theme of using fast, lightweight, Russian open source projects, we went with nginx for our web server. Its solid proxying abilities let us use different app servers on the mid tier for various tasks, whether it be mongrel, merb, or a custom C search server.

Web Framework: Ruby/Rails

Ruby has a vast number of libraries and has been a very powerful tool for prototyping many of the algorithms and search features we research before porting them to C/C++. Ruby golf has become one of our hobbies. We use Rails throughout our webapp to provide a lot of the structure and additional features around search.

JavaScript Libraries: jQuery, Prototype, Scriptaculous

Core to our design is providing quick access to massive amounts of data, then using the power of modern clients to process and filter that data inside the browser. It at least gives us a chance of scaling. JavaScript, extended via jQuery, Prototype, and Scriptaculous, gives us the tools necessary to create a unique interface that will get a nice face lift over the next few weeks. jQuery plugins in our posse include sparkline, Cycle Lite, and marquee, for example.

So that’s the short list, with plenty of other open source projects sprinkled in there that we will address in future articles. Is there a project we should be using in this mix? Feel free to let us know in the comments. We’ll be doing a series of posts in the coming weeks that focus on how we use these different projects in our own service, and we hope you find them useful in your own pursuits.


Tuesday, October 14, 2008

Updated look

Hope you like the updated look of the front page, and the improved navigation. There are many more changes in the works! 
Also, note that we've moved a bit in the direction of having more clusters. So whereas before you may have seen all political (or sports) sites in one cluster, with positioning within that cluster telling you how interrelated they were, now you're more likely to see multiple clusters for a category. It appears to be simpler to understand this way for first-time users, but let us know if you like the old way.


Sunday, August 17, 2008

Cruxlux Search Launch

We’re happy to announce the launch of a powerful new search feature on our home page! Search for any topic and get back not only what Cruxlux users are talking about but also what blogs across the Internet are saying about it, graphically and intuitively presented.

Try it out!

Search for something that interests you, or just click on one of the spotlighted topics.

What do the boxes mean?

Each of the boxes in the map corresponds to a given site, whose name you can see in the top left corner of the box. The box also shows the title of an article that site has recently posted related to the terms you searched for. Click on a box to view more details on the article (in some cases the first few sentences) and a link to it, based on the site’s RSS feed. The closer together two boxes are in the map, the more related those two sites are. For example, let’s say your query is related to something political that’s been discussed a lot in the blogosphere recently. You’ll see in the results that sites with a liberal perspective will tend to cluster together. Conservative sites will not be as close to them, but will be closer to them than sites that don’t have a political view at all, and so on. All the blog relationships are computed automatically. Our algorithm seems to perform well, but it’s not perfect and sometimes it may have to “take a guess” when it doesn’t know much about a particular site, so there may be some queries where some boxes seem misplaced.

Focus your search by using the “Sites Like” field

You can focus your search by specifying a web site or blog that’s representative of the news sources you are interested in hearing from. For example, for a Hollywood search, focus your search by specifying your favorite gossip site.

What do the stacks mean?

Sometimes a box has more than one article from that site. You can cycle through the articles in a stack by first clicking on the stack and then using the links.

What are you waiting for?

Please enjoy using this new feature to gather a variety of current opinions on any topic of interest. And also, we hope you’ll jump into some discussions related to the topic, listed on the left side of the home page (or start a new one!) and help everyone get to the crux of any issue.