lucene/solr meetup, july 28

July 31, 2010

I attended the Lucene/Solr meetup this week, quite a swank event sponsored by Salesforce with tasty appetizers, beers and an incredible view of the bay. The three speakers were very knowledgeable and well spoken, and I enjoyed hearing about the different applications of Lucene and Solr. Below are my rough notes. For folks who want to learn more about Lucene and Solr, check out the upcoming conference Lucene Revolution, Oct 5-8, 2010 in Boston.

Bill Press, Salesforce

Salesforce uses Lucene 2.2 (not Solr) and shared some stats about their seriously large scale operation:

It’s a multi-tenant architecture: each org has anywhere from 1 to 100,000s of users, and there is a single codebase, which means there is just one version to support at any one time.

They use post-filtering for:

They query the database to bridge the gap caused by indexing lag.
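My notes don't say how Salesforce actually does this, so here is only a minimal sketch of the general idea: rows changed since the last index update are read straight from the database and take precedence over whatever the (stale) index says. The `Record` type, the `indexed_up_to` watermark and the in-memory dicts standing in for the index and the database are all my own assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    id: str
    text: str
    modified: int  # logical timestamp of the last change (hypothetical)


def search_with_lag_bridge(term, index, db, indexed_up_to):
    """Search the index, then correct it with rows the index has not seen yet.

    index and db are dicts of id -> Record; the index only reflects
    changes with modified <= indexed_up_to.
    """
    # Rows changed after the index was last updated come from the database.
    recent = {r.id: r for r in db.values() if r.modified > indexed_up_to}
    # Trust the index only for records that have not changed since then.
    hits = [r for r in index.values() if term in r.text and r.id not in recent]
    # Fresh rows are matched against their current text.
    hits += [r for r in recent.values() if term in r.text]
    return sorted(hits, key=lambda r: r.id)


index = {
    "a": Record("a", "acme renewal", 1),
    "b": Record("b", "globex lead", 2),
}
db = {
    "a": Record("a", "acme renewal", 1),
    "b": Record("b", "globex closed", 5),   # edited after indexing
    "c": Record("c", "acme new lead", 6),   # created after indexing
}
lead_ids = [r.id for r in search_with_lag_bridge("lead", index, db, indexed_up_to=3)]
acme_ids = [r.id for r in search_with_lag_bridge("acme", index, db, indexed_up_to=3)]
```

In this toy data the stale index entry for "b" is dropped because its current text no longer matches, and "c" shows up even though it was never indexed.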

They are faced with new search challenges driven by what the Salesforce CEO calls “the Facebook imperative.” When he started Salesforce, he used to ask “why doesn’t every enterprise app look like Amazon?” Now he asks: “why doesn’t every enterprise app look like Facebook?” (Side note: this is an echo of what many folks have been saying for a while, that social networking makes sense as a feature of an app, rather than just as destinations like Facebook and LinkedIn.)

Salesforce allows you to have a feed on a record, follow accounts, and post status updates for accounts. They index tracked changes. They need to search this rich set of data, which is people articulating their interests. Bill noted that the needs of structured data are really different from those of unstructured data.

Practical Relevance, Grant Ingersoll, Lucid Imagination

Grant Ingersoll spoke of “two tales of relevance”

Better search results = less time searching, more time acting

Other cases to consider:

Before undertaking any relevance tuning, you need to define what “better search” means to you. There are many ways to test and measure:

Capturing user feedback:

Grant notes that Lucene searches default to “or” out of the box, when “and” is typically better today. He had a list of links that he suggested we check out (sadly I couldn’t type fast enough, but here are some I wrote down):

auto-add phrases to your queries (surround the terms with quotes): automatic win
auto-add a “sloppy phrase” with a large slop factor: behaves like an AND, with a boost when the words are close together
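To make these tips concrete, here is a minimal sketch that rewrites a raw user query into Lucene's classic query-parser syntax: every term is required with `+` (AND instead of the default OR), an exact phrase is added with a boost, and a sloppy phrase is added with a smaller boost. The function name, the boost values and the slop of 50 are my own placeholders, not numbers from Grant's talk, and the sketch does not escape query-parser special characters.

```python
def expand_query(user_query, phrase_boost=10.0, sloppy_boost=5.0, slop=50):
    """Build a Lucene classic query-parser string from raw user terms."""
    terms = user_query.split()
    if not terms:
        return ""
    # '+' marks each term as required, giving AND semantics.
    required = " ".join(f"+{t}" for t in terms)
    if len(terms) < 2:
        return required  # a phrase needs at least two terms
    phrase = " ".join(terms)
    # Exact phrase: documents containing the words in order score higher.
    exact = f'"{phrase}"^{phrase_boost:g}'
    # Sloppy phrase: words within `slop` positions, closer words score higher.
    sloppy = f'"{phrase}"~{slop}^{sloppy_boost:g}'
    return f"{required} {exact} {sloppy}"


expanded = expand_query("lucene relevance tuning")
# +lucene +relevance +tuning "lucene relevance tuning"^10 "lucene relevance tuning"~50^5
```

The two phrase clauses are optional, so they never exclude a document; they only push documents where the words appear together toward the top.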

Logs, Search, Cloud, Jon Gifford, Loggly

Logfile management in the cloud (no Hadoop). Logs are painful: distributed, large, ephemeral. Most log search is highly skewed. “We’re just implementing grep across terabytes of data.” This was a compelling talk, but it took most of my attention to follow, so my notes are weak and may make sense to no one except me:

syslog + 0MQ + SolrCloud
0MQ: not traditional queuing; it fails, and when it fails we lose data, but it is very fast
Solr gives us facets, which give us graphs
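As a rough illustration of "facets give us graphs": in Solr's default JSON response format (`json.nl=flat`), facet counts for a field come back as a flat list alternating value and count. The sketch below pairs them up and draws a crude text bar chart. The `host` field and the sample numbers are made up; I have no idea what fields Loggly actually facets on.

```python
def facet_pairs(response, field):
    """Turn Solr's flat [value, count, value, count, ...] list into pairs."""
    flat = response["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))


def text_bars(pairs, width=40):
    """Render (label, count) pairs as text bars scaled to the largest count."""
    top = max((count for _, count in pairs), default=0)
    lines = []
    for label, count in pairs:
        bar = "#" * (round(width * count / top) if top else 0)
        lines.append(f"{label:<10} {bar} {count}")
    return lines


# Hypothetical response fragment, shaped like Solr's default JSON output.
sample = {
    "facet_counts": {
        "facet_fields": {"host": ["web1", 120, "web2", 60, "db1", 30]}
    }
}
pairs = facet_pairs(sample, "host")
chart = text_bars(pairs)
```

The same pairs could just as easily feed a real charting library; the point is only that a facet response is already a histogram.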

run many indexers, “hot shards” — the indexers update small shards

0MQ gives us node-specific input queues for Solr

NRT + SolrCloud = Our Nirvana

Hot shards are chilled when we stop writing to them
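Here is how I picture the hot/chilled idea, as a toy sketch and not Loggly's actual design: writes go to one small "hot" shard, and once writing to it stops it becomes a read-only "chilled" shard while a fresh hot shard takes over. The `ShardRouter` class and the "stop writing after N documents" trigger are my own assumptions; the talk did not say what decides when a shard is chilled.

```python
class ShardRouter:
    """Toy model: one small hot shard takes writes, chilled shards are read-only."""

    def __init__(self, max_docs_per_hot_shard=3):
        self.max_docs = max_docs_per_hot_shard  # assumed rollover trigger
        self.hot = []       # documents in the shard currently being written
        self.chilled = []   # read-only shards, each an immutable tuple of docs

    def add(self, doc):
        self.hot.append(doc)
        if len(self.hot) >= self.max_docs:
            # Stop writing to this shard: chill it and start a new hot one.
            self.chilled.append(tuple(self.hot))
            self.hot = []

    def search(self, term):
        # A query fans out over every chilled shard plus the hot shard.
        shards = self.chilled + [tuple(self.hot)]
        return [doc for shard in shards for doc in shard if term in doc]


router = ShardRouter(max_docs_per_hot_shard=2)
for line in ["sshd login ok", "nginx 200", "sshd login failed", "nginx 500", "cron run"]:
    router.add(line)
sshd_lines = router.search("sshd")
```

Keeping the hot shard small is what makes frequent index updates cheap; the chilled shards never change, so they can be searched without worrying about concurrent writes.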

Solr is awesome at what it does, but not so good for data mining
— plan to plug in Hadoop for large-volume analytics
Syslog is the only way in for now; adding others: HTTP, Scribe, Flume