Thursday, December 31, 2009

YouTube needs to entertain

Miguel Helft at the New York Times has a good article this morning, "YouTube's Quest to Suggest More", on how YouTube is trying "to give its users what they want, even when the users aren't quite sure what that is."

The article focuses on YouTube's "plans to rely more heavily on personalization and ties between users to refine recommendations" and "suggesting videos that users may want to watch based on what they have watched before, or on what others with similar tastes have enjoyed."

What is striking about this is how little it has to do with search. As described in the article, what YouTube needs to do is entertain people who are bored but do not entirely know what they want. YouTube wants to move users from spending "15 minutes a day on the site" closer to the "five hours in front of the television." This is entertainment, not search. Passive discovery, playlists of content, deep classification hierarchies, well-maintained catalogs, and recommendations of what to watch next will play a part; keyword search likely will play a lesser role.

And it gets back to the question of how different a problem Google is taking on with YouTube. Google is about search, keyword advertising, and finding content other people own. YouTube is about entertainment, discovery, content advertising, and cataloging and managing content it controls. While Google certainly has the talent to succeed in new areas, it seems they are only now realizing how different YouTube is.

If you are interested in more on this, please see my Oct 2006 post, "YouTube is not Googly". Also, for a little on the technical challenges behind YouTube recommendations and managing a video catalog, please see my earlier posts "Video recommendations on YouTube" and "YouTube cries out for item authority".

Monday, December 28, 2009

Most popular posts of 2009

In case you might have missed them, here is a selection of some of the most popular posts on this blog in the last year.
  1. Jeff Dean keynote at WSDM 2009
    Describes Google's architecture and computational power
  2. Put that database in memory
    Claims in-memory databases should be used more often
  3. How Google crawls the deep web
    How Google probes and crawls otherwise hidden databases on the Web
  4. Advice from Google on large distributed systems
    Extends the first post above with more of an emphasis on how Google builds software
  5. Details on Yahoo's distributed database
    A look at another large scale distributed database
  6. Book review: Introduction to Information Retrieval
    A detailed review of Manning et al.'s fantastic new book. Please see also a recent review of Search User Interfaces.
  7. Google server and data center details
    Even more on Google's architecture, this one focused on data center cost optimization
  8. Starting Findory: The end
    A summary of and links to my posts describing what I learned at my startup, Findory, over its five years.
Overall, according to Google Analytics, the blog had 377,921 page views and 233,464 unique visitors in 2009. It has about 10k regular readers subscribed to its feed. I hope everyone is finding it useful!

Wednesday, December 16, 2009

Toward an external brain

I have a post up on blog@CACM, "The Rise of the External Brain", on how search over the Web is achieving what classical AI could not, an external brain that supplements our intelligence, knowledge, and memories.

Tuesday, December 08, 2009

Personalized search for all at Google

As has been widely reported, Google is now personalizing web search results for everyone who uses Google, whether logged in or not.

Danny Sullivan at Search Engine Land has particularly good coverage. An excerpt:
Beginning today, Google will now personalize the search results of anyone who uses its search engine, regardless of whether they've opted-in to a previously existing personalization feature.

The short story is this. By watching what you click on in search results, Google can learn that you favor particular sites. For example, if you often search and click on links from Amazon that appear in Google's results, over time, Google learns that you really like Amazon. In reaction, it gives Amazon a ranking boost. That means you start seeing more Amazon listings, perhaps for searches where Amazon wasn't showing up before.

Searchers will have the ability to opt-out completely, and there are various protections designed to safeguard privacy. However, being opt-out rather than opt-in will likely raise some concerns.
There now appears to be a big push at Google for individualized targeting and personalization in search, advertising, and news. Google is going full throttle on personalization, choosing it as the way forward to improve relevance and usefulness.

With only one generic relevance rank, Google has found it increasingly difficult to improve search quality, because not everyone agrees on how relevant a particular page is to a particular search. At some point, to get further improvements, Google has to customize relevance to each person's definition of relevance. When you do that, you have personalized search.
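To make the idea concrete, here is a minimal sketch of click-based personalization in the spirit of Danny Sullivan's Amazon example: count a user's past clicks per site and give frequently clicked sites a damped ranking boost. The class, the scores, and the log-damped boost formula are all my own assumptions for illustration, not Google's actual algorithm.

```python
import math
from collections import defaultdict

class PersonalizedRanker:
    """Illustrative sketch: boost sites a user has clicked before."""

    def __init__(self):
        # user -> site -> number of past result clicks
        self.clicks = defaultdict(lambda: defaultdict(int))

    def record_click(self, user, site):
        self.clicks[user][site] += 1

    def rerank(self, user, results):
        # results: list of (site, base_relevance_score)
        def boosted(item):
            site, score = item
            # Damped boost so a heavily clicked site cannot completely
            # overwhelm the generic relevance rank.
            return score * (1.0 + 0.1 * math.log1p(self.clicks[user][site]))
        return sorted(results, key=boosted, reverse=True)

ranker = PersonalizedRanker()
for _ in range(20):
    ranker.record_click("alice", "amazon.com")
results = [("example.com", 1.0), ("amazon.com", 0.95)]
print(ranker.rerank("alice", results)[0][0])  # amazon.com rises to the top
```

Note the damping: without it, a handful of clicks would lock a user into seeing the same site everywhere, which is exactly the over-personalization concern raised by being opt-out.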

For more on recent moves to personalize news and advertising at Google, please see my posts, "Google CEO on personalized news" and "Google AdWords now personalized".

Update: Two hours later, Danny Sullivan writes a second post, "Google's Personalized Results: The 'New Normal' That Deserves Extraordinary Attention", that also is well worth reading.

Thursday, December 03, 2009

Recrawling and keeping search results fresh

A paper by three Googlers, "Keeping a Search Engine Index Fresh: Risk and Optimality in Estimating Refresh Rates for Web Pages" (not available online), is one of several recent papers looking at "the cost of a page being stale versus the cost of [recrawling]."

The core idea here is that people care a lot about some changes to web pages and don't care about others, and search engines need to respond to that to make search results relevant.

Unfortunately, our Googlers punt on the really interesting problem here, determining the cost of a page being stale. They simply assume any page that is stale hurts relevance the same amount.

That clearly is not true. Not only do some pages appear more frequently than other pages in search results, but also some changes to pages matter more to people than others.

Getting at the cost of being stale is difficult, but a good start is "The Impact of Crawl Policy on Web Search Effectiveness" (PDF), recently presented at SIGIR 2009. It uses PageRank and in-degree as a rough estimate of which pages people will see and click on in search results, then explores the impact of recrawling those pages more frequently.
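For readers who have not seen the recrawl literature, the standard machinery looks something like this: model page changes as a Poisson process, estimate each page's change rate from repeated crawls, and prioritize recrawls by change rate weighted by some visibility estimate such as PageRank. This is a sketch of the classic estimator from that literature, not necessarily the exact method in the Googlers' paper.

```python
import math

def estimated_change_rate(checks, changes_seen, interval_days):
    """Estimate a page's Poisson change rate (changes/day) from repeated
    crawls at a fixed interval: P(no change in one interval) = exp(-rate * interval)."""
    if changes_seen >= checks:
        return float("inf")  # changed every single time we looked
    p_unchanged = (checks - changes_seen) / checks
    return -math.log(p_unchanged) / interval_days

def recrawl_priority(change_rate, visibility):
    # visibility: rough estimate of how often the page is seen in results
    # (e.g., PageRank or click share) -- an assumption here, not the paper's cost model.
    return change_rate * visibility

# A page checked daily for 30 days that changed on 10 of those checks:
rate = estimated_change_rate(checks=30, changes_seen=10, interval_days=1.0)
print(round(rate, 3))  # ~0.405 changes/day
```

The point of the post, though, is that `visibility` alone is too crude a weight; it says nothing about whether anyone cares about the change.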

But that still does not capture whether the change is something people care about. Is, for example, the change below the fold on the page, so less likely to be seen? Is the change correcting a typo or changing an advertisement? In general, what is the cost of showing stale information for this page?

"Resonance on the Web: Web Dynamics and Revisitation Patterns" (PDF), recently presented at CHI, starts to explore that question, looking at the relationship between web content change and how much people want to revisit the pages, as well as thinking about the question of what is an interesting content change.

As it turns out, news is something where change matters and people revisit frequently, and there have been several attempts to treat real-time content such as news differently in search results. One recent example is "Click-Through Prediction for News Queries" (PDF), presented at SIGIR 2009, that describes one method of trying to know when people will want to see news articles for a web search query.

But, rather than coming up with rules for when content from various federated sources should be shown, I wonder if we cannot find a simpler solution. All of these works strive toward the same goal, understanding when people care about change. Relevance depends on what we want, what we see, and what we notice. Search results need only to appear fresh.

Recrawling high PageRank pages is a very rough attempt at making results appear fresh, since high PageRank means a page more likely to be shown and noticed at the top of search results, but it clearly is a very rough approximation. What we really want to know is: Who will see a change? If people see it, will they notice? If they notice, will they care?
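Those three questions can be folded into one expected-cost estimate: how often the page is shown, times the chance the stale part is actually seen, times the chance the viewer cares about that change. The factorization below is my own illustrative assumption, not taken from any of the cited papers.

```python
def staleness_cost(impressions_per_day, p_change_visible, p_user_cares):
    """Expected daily cost of leaving a page stale: impressions, times the
    probability the changed region is seen, times the probability the
    viewer cares about that kind of change."""
    return impressions_per_day * p_change_visible * p_user_cares

# A news page shown often, with above-the-fold changes people care about,
# dwarfs an equally popular page whose only change is a rotated advertisement.
news = staleness_cost(10000, 0.9, 0.8)
ad_rotation = staleness_cost(10000, 0.9, 0.01)
print(news / ad_rotation)  # 80x more valuable to recrawl
```

Under this framing, PageRank is only a proxy for the first factor; the CHI revisitation work is a start on estimating the other two.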

Interestingly, people's actions tell us a lot about what they care about. Our wants and needs, where our attention lies, all live in our movements across the Web. If we listen carefully, these voices may speak.

For more on that, please see also my older posts, "Google toolbar data and the actual surfer model" and "Cheap eyetracking using mouse tracking".

Update: One month later, an experiment shows that new content on the Web can be generally available on Google search within 13 seconds.

Thursday, November 19, 2009

Continuous deployment at Facebook

E. Michael Maximilien has a post, "Extreme Agility at Facebook", on blog@CACM. The post reports on a talk at OOPSLA by Robert Johnson (Director of Engineering at Facebook) titled "Moving Fast at Scale".

Here is an interesting excerpt on very frequent deployment of software and how it reduces downtime:
Facebook developers are encouraged to push code often and quickly. Pushes are never delayed and are applied directly to parts of the infrastructure. The idea is to quickly find issues and their impact on the rest of the system and quickly fix any bugs that result from these frequent small changes.

Second, there are limited QA (quality assurance) teams at Facebook but lots of peer review of code. Since the Facebook engineering team is relatively small, all team members are in frequent communication. The team uses various staging and deployment tools as well as strategies such as A/B testing and gradual, targeted geographic launches.

This has resulted in a site that has experienced, according to Robert, less than 3 hours of down time in the past three years.
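The "gradual targeted launches" mentioned in the excerpt are commonly implemented by hashing each user into a bucket and ramping a feature up bucket by bucket while watching error rates. This is a sketch of that general pattern, not Facebook's actual tooling; the feature name and percentages are made up.

```python
import hashlib

def in_rollout(user_id, feature, percent):
    """Deterministic gradual rollout: hash user and feature into one of
    100 buckets; users in buckets below `percent` see the new code.
    The same user always gets the same answer for the same feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# Ramp a change from 1% to 5% to 100%, watching error rates at each step.
users = [f"user{i}" for i in range(10000)]
exposed = sum(in_rollout(u, "new_feed", 5) for u in users)
print(exposed)  # roughly 5% of users
```

Because the bucketing is deterministic, a user who sees the new code keeps seeing it as the ramp widens, which keeps the experience consistent and makes A/B comparisons clean.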
For more on the benefits of deploying software very frequently, not just for Facebook but for many software companies, please see also my post on blog@CACM, "Frequent Releases Change Software Engineering".

Monday, November 16, 2009

Put that database in memory

An upcoming paper, "The Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM" (PDF), makes some interesting new arguments for shifting most databases to serving entirely out of memory rather than off disk.

The paper looks at Facebook as an example and points out that, due to aggressive use of memcached and caches in MySQL, the memory they use already is about "75% of the total size of the data (excluding images)." They go on to argue that a system designed around in-memory storage, with disk used only for archival purposes, would be much simpler, more efficient, and faster. They also look at examples of smaller databases and note that, with servers reaching 64GB of RAM and higher and most databases only a couple of terabytes, it doesn't take many servers to hold everything in memory.

An excerpt from the paper:
Developers are finding it increasingly difficult to scale disk-based systems to meet the needs of large-scale Web applications. Many people have proposed new approaches to disk-based storage as a solution to this problem; others have suggested replacing disks with flash memory devices.

In contrast, we believe that the solution is to shift the primary locus of online data from disk to random access memory, with disk relegated to a backup/archival role ... [With] all data ... in DRAM ... [we] can provide 100-1000x lower latency than disk-based systems and 100-1000x greater throughput .... [while] eliminating many of the scalability issues that sap developer productivity today.
One subtle but important point the paper makes is that the slow speed of current databases has made web applications both more complicated and more limited than they should be. From the paper:
Traditional applications expect and get latency significantly less than 5-10 μs ... Because of high data latency, Web applications typically cannot afford to make complex unpredictable explorations of their data, and this constrains the functionality they can provide. If Web applications are to replace traditional applications, as has been widely predicted, then they will need access to data with latency much closer to what traditional applications enjoy.

Random access with very low latency to very large datasets ... will not only simplify the development of existing applications, but they will also enable new applications that access large amounts of data more intensively than has ever been possible. One example is ... algorithms that must traverse large irregular graph structures, where the access patterns are ... unpredictable.
The authors point out that data access patterns currently need to be heavily optimized, carefully ordered, and must conservatively acquire extra data in case it is later needed, all things that mostly go away if you are using a database where access has microsecond latency.
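A quick back-of-the-envelope makes the paper's point vivid: count how many sequential, dependent lookups fit in a typical page-render budget at disk versus DRAM latencies. The latencies below are illustrative round numbers in the spirit of the paper, not measurements.

```python
# How many *sequential, dependent* data lookups fit in a 200 ms
# page-render budget at each storage latency?
BUDGET_S = 0.200
LATENCY = {
    "disk seek": 10e-3,       # ~10 ms per random disk access
    "networked DRAM": 10e-6,  # ~5-10 us, per the RAMCloud paper's target
}
for medium, lat in LATENCY.items():
    print(f"{medium}: {int(BUDGET_S / lat):,} dependent lookups")
# disk seek: 20 -- so access must be batched, cached, and precomputed
# networked DRAM: 20,000 -- irregular graph traversals become feasible
```

Twenty dependent lookups forces exactly the careful ordering and speculative prefetching the authors describe; twenty thousand makes "complex unpredictable explorations" of the data practical.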

While the authors do not go as far as to argue that memory-based databases are cheaper, they do argue that they are cost competitive, especially once developer time is taken into account. It seems to me that you could go a step further here and argue that very low latency databases bring such large productivity gains to developers and benefits to application users that they are in fact cheaper, but the paper does not try to do that.

If you don't have time to read the paper, slides (PDF) from a talk by one of the authors are also available and are very quick to skim.

If you can't get enough of this topic, please see my older post, "Replication, caching, and partitioning", which argues that big caching layers, such as memcached, are overdone compared to having each database shard serve most data out of memory.

HT James Hamilton for first pointing to the RAMClouds slides.

Thursday, November 12, 2009

The reality of doing a startup

Paul Graham has a fantastic article up, "What Startups Are Really Like", with the results of what happened when he asked all the founders of the Y Combinator startups "what surprised them about starting a startup."

A brief excerpt summarizing the findings:
Unconsciously, everyone expects a startup to be like a job, and that explains most of the surprises. It explains why people are surprised how carefully you have to choose cofounders and how hard you have to work to maintain your relationship. You don't have to do that with coworkers. It explains why the ups and downs are surprisingly extreme. In a job there is much more damping. But it also explains why the good times are surprisingly good: most people can't imagine such freedom. As you go down the list, almost all the surprises are surprising in how much a startup differs from a job.
There are 19 surprises listed in the essay. Below are excerpts from some of them:
Be careful who you pick as a cofounder ... [and] work hard to maintain your relationship.

Startups take over your life ... [You will spend] every waking moment either working or thinking about [your] startup.

It's an emotional roller-coaster ... How low the lows can be ... [though] it can be fun ... [But] starting a startup is fun the way a survivalist training course would be fun, if you're into that sort of thing. Which is to say, not at all, if you're not.

Persistence is the key .... [but] mere determination, without flexibility ... may get you nothing.

You have to do lots of different things ... It's much more of a grind than glamorous.

When you let customers tell you what they're after, they will often reveal amazing details about what they find valuable as well what they're willing to pay for.

You can never tell what will work. You just have to do whatever seems best at each point.

Expect the worst with deals ... Deals fall through.

The degree to which feigning certitude impressed investors .... A lot of what startup founders do is just posturing. It works.

How much of a role luck plays and how much is outside of [your] control ... Having skill is valuable. So is being determined as all hell. But being lucky is the critical ingredient ... Founders who succeed quickly don't usually realize how lucky they were.
Definitely worth reading the entire article if you are at all considering a startup.

For my personal take on some surprises I hit, please see my earlier post on Starting Findory.

Tuesday, November 10, 2009

Scary data on Botnet activity

An amusingly titled paper to be presented at the CCS 2009 conference, "Your Botnet is My Botnet: Analysis of a Botnet Takeover" (PDF), contains some not-so-funny data on how sophisticated the hijacking of computers has become, the data attackers are able to collect, and the profits that fuel the development of more and more dangerous botnets.

Extended excerpts from the paper, focusing on the particularly scary bits:
We describe our experience in actively seizing control of the Torpig (a.k.a. Sinowal, or Anserin) botnet for ten days. Torpig ... has been described ... as "one of the most advanced pieces of crimeware ever created." ... The sophisticated techniques it uses to steal data from its victims, the complex network infrastructure it relies on, and the vast financial damage that it causes set Torpig apart from other threats.

Torpig has been distributed to its victims as part of Mebroot. Mebroot is a rootkit that takes control of a machine by replacing the system's Master Boot Record (MBR). This allows Mebroot to be executed at boot time, before the operating system is loaded, and to remain undetected by most anti-virus tools.

Victims are infected through drive-by-download attacks ... Web pages on legitimate but vulnerable web sites ... request JavaScript code ... [that] launches a number of exploits against the browser or some of its components, such as ActiveX controls and plugins. If any exploit is successful ... an installer ... injects a DLL into the file manager process (explorer.exe) ... [that] makes all subsequent actions appear as if they were performed by a legitimate system process ... loads a kernel driver that wraps the original disk driver (disk.sys) ... [and] then overwrite[s] the MBR of the machine with Mebroot.

Mebroot has no malicious capability per se. Instead, it provides a generic platform that other modules can leverage to perform their malicious actions ... Immediately after the initial reboot ... [and] in two-hour intervals ... Mebroot contacts the Mebroot C&C server to obtain malicious modules ... All communication ... is encrypted.

The Torpig malware ... injects ... DLLs into ... the Service Control Manager (services.exe), the file manager, and 29 other popular applications, such as web browsers (e.g., Microsoft Internet Explorer, Firefox, Opera), FTP clients (CuteFTP, LeechFTP), email clients (e.g., Thunderbird, Outlook, Eudora), instant messengers (e.g., Skype, ICQ), and system programs (e.g., the command line interpreter cmd.exe). After the injection, Torpig can inspect all the data handled by these programs and identify and store interesting pieces of information, such as credentials for online accounts and stored passwords. ... Every twenty minutes ... Torpig ... upload[s] the data stolen.

Torpig uses phishing attacks to actively elicit additional, sensitive information from its victims, which, otherwise, may not be observed during the passive monitoring it normally performs ... Whenever the infected machine visits one of the domains specified in the configuration file (typically, a banking web site), Torpig ... injects ... an HTML form that asks the user for sensitive information, for example, credit card numbers and social security numbers. These phishing attacks are very difficult to detect, even for attentive users. In fact, the injected content carefully reproduces the style and look-and-feel of the target web site. Furthermore, the injection mechanism defies all phishing indicators included in modern browsers. For example, the SSL configuration appears correct, and so does the URL displayed in the address bar.

Consistent with the past few years' shift of malware from a for-fun (or notoriety) activity to a for-profit enterprise, Torpig is specifically crafted to obtain information that can be readily monetized in the underground market. Financial information, such as bank accounts and credit card numbers, is particularly sought after. In ten days, Torpig obtained the credentials of 8,310 accounts at 410 different institutions ... 1,660 unique credit and debit card numbers .... 297,962 unique credentials (username and password pairs) .... [in] information that was sent by more than 180 thousand infected machines.
The paper estimates the value of the data collected by this sophisticated piece of malware at between $3M and $300M per year on the black market.

[Paper found via Bruce Schneier]

Saturday, November 07, 2009

Starting Findory: The end

This is the end of my Starting Findory series.

Findory was my first startup and a nearly five year effort. Its goal of personalizing information was almost laughably ambitious, a joy to pursue, and I learned much.

I learned that cheap is good, but too cheap is bad. It does little good to avoid burning too fast only to starve yourself of what you need.

I re-learned the importance of a team, one that balances the weaknesses of some with the strengths of others. As fun as learning new things might be, trying to do too much yourself costs the startup too much time in silly errors born of inexperience.

I learned the necessity of good advisors, especially angels and lawyers. A startup needs people who can provide expertise, credibility, and connections. You need advocates to help you.

And, I learned much more, some of which is detailed in the other posts in the Starting Findory series:
  1. The series
  2. In the beginning
  3. On the cheap
  4. Legal goo
  5. Launch early and often
  6. Startups are hard
  7. Talking to the press
  8. Customer feedback
  9. Marketing
  10. The team
  11. Infrastructure and scaling
  12. Hardware go boom
  13. Funding
  14. Acquisition talks
  15. The end
I hope you enjoyed these posts about my experience trying to build a startup. If you did like this Starting Findory series, you might also be interested in my Early Amazon posts. They were quite popular a few years ago.

Wednesday, November 04, 2009

Using only experts for recommendations

A recent paper from SIGIR, "The Wisdom of the Few: A Collaborative Filtering Approach Based on Expert Opinions from the Web" (PDF), has a very useful exploration into the effectiveness of recommendations using only a small pool of trusted experts.

The results suggest that using a small pool of a couple hundred experts, possibly your own experts or experts selected and mined from the web, has quite a bit of value, especially in cases where big data from a large community is unavailable.

A brief excerpt from the paper:
Recommending items to users based on expert opinions .... addresses some of the shortcomings of traditional CF: data sparsity, scalability, noise in user feedback, privacy, and the cold-start problem .... [Our] method's performance is comparable to traditional CF algorithms, even when using an extremely small expert set .... [of] 169 experts.

Our approach requires obtaining a set of ... experts ... [We] crawled the Rotten Tomatoes web site -- which aggregates the opinions of movie critics from various media sources -- to obtain expert ratings of the movies in the Netflix data set.
The authors certainly do not claim that using a small pool of experts is better than traditional collaborative filtering.

What they do say is that using a very small pool of experts works surprisingly well. In particular, I think it suggests a good alternative to content-based methods for bootstrapping a recommender system. If you can create a high quality pool of experts, even a fairly small one, you may have good results starting with that while you work to gather ratings from the broader community.
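The mechanics of expert-based collaborative filtering are simple enough to sketch: predict a user's rating for an item as a similarity-weighted average of the experts' ratings, where similarity is measured over items both the user and the expert have rated. The similarity measure below (inverse mean absolute difference) is my own simplification for illustration; the paper uses its own similarity and confidence thresholds.

```python
def predict_rating(user_ratings, experts, item, k=0.01):
    """Predict a user's rating for `item` as a similarity-weighted
    average of expert ratings. Experts who agree with the user on
    co-rated items get higher weight."""
    weighted, weight_sum = 0.0, 0.0
    for expert in experts:
        if item not in expert:
            continue
        common = [i for i in user_ratings if i in expert and i != item]
        if not common:
            continue  # no overlap, cannot judge this expert's similarity
        mad = sum(abs(user_ratings[i] - expert[i]) for i in common) / len(common)
        sim = 1.0 / (k + mad)  # higher when the expert agrees with the user
        weighted += sim * expert[item]
        weight_sum += sim
    return weighted / weight_sum if weight_sum else None

experts = [
    {"A": 4.0, "B": 5.0, "C": 4.5},  # agrees with the user on A and B
    {"A": 1.0, "B": 2.0, "C": 1.0},  # disagrees, so gets little weight
]
user = {"A": 4.0, "B": 5.0}
print(round(predict_rating(user, experts, "C"), 2))  # close to 4.5
```

Notice what disappears relative to traditional CF: with only a couple hundred dense expert profiles, there is no data sparsity problem, no scaling problem, and no user privacy problem, which is exactly the paper's pitch.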

Thursday, October 29, 2009

Google CEO on personalized news

Google CEO Eric Schmidt has been talking quite a bit about personalization in online news recently. First, Eric said:
We and the industry ... [should] personalize the news.

At its best, the on-line version of a newspaper should learn from the information I'm giving it -- what I've read, who I am and what I like -- to automatically send me stories and photos that will interest me.
Then, Eric described how newspapers could make money using personalized advertising:
Imagine a magazine online that knew everything about you, knew what you had read, allowed you to go deep into a subject and also showed you things... that are serendipit[ous] ... popular ... highly targetable ... [and] highly advertisable. Ultimately, money will be made.
Finally, Eric claimed Google has a moral duty to help newspapers succeed:
Google sees itself as trying to make the world a better place. And our values are that more information is positive -- transparency. And the historic role of the press was to provide transparency, from Watergate on and so forth. So we really do have a moral responsibility to help solve this problem.

Well-funded, targeted professionally managed investigative journalism is a necessary precondition in my view to a functioning democracy ... That's what we worry about ... There [must be] enough revenue that ... the newspaper [can] fulfill its mission.
Eric's words come at a time when, as the New York Times reports, newspapers are cratering, with "revenue down 16.6 percent last year and about 28 percent so far this year."

For more on personalized news, please see my earlier posts, "People who read this article also read", "A brief history of Findory", and "Personalizing the newspaper".

For more on personalized advertising, please see my July 2007 post, "What to advertise when there is no commercial intent?"

Update: Some more useful references in the comments.

Update: Five weeks later, Eric Schmidt, in the WSJ, imagines a newspaper that "knows who I am, what I like, and what I have already read" and that makes sure that "like the news I am reading, the ads are tailored just for me" instead of being "static pitches for products I'd never use." He also criticizes newspapers for treating readers "as a stranger ... every time [they] return."

Wednesday, October 21, 2009

Advice from Google on large distributed systems

Google Fellow Jeff Dean gave a keynote talk at LADIS 2009 on "Designs, Lessons and Advice from Building Large Distributed Systems". Slides (PDF) are available.

Some of this talk is similar to Jeff's past talks but with updated numbers. Let me highlight a few things that stood out:

A standard Google server appears to have about 16G RAM and 2T of disk. If we assume Google has 500k servers (which seems like a low-end estimate given they used 25.5k machine years of computation in Sept 2009 just on MapReduce jobs), that means they can hold roughly 8 petabytes of data in memory and, after x3 replication, roughly 333 petabytes on disk. For comparison, a large web crawl with history, the Internet Archive, is about 2 petabytes and "the entire [written] works of humankind, from the beginning of recorded history, in all languages" has been estimated at 50 petabytes, so it looks like Google easily can hold an entire copy of the web in memory, all the world's written information on disk, and still have plenty of room for logs and other data sets. Certainly no shortage of storage at Google.
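The arithmetic behind those storage numbers is easy to check. All the inputs below are the rough estimates from the paragraph above, not official Google figures.

```python
# Back-of-envelope storage totals from the post's estimates.
servers = 500_000
ram_per_server_gb = 16
disk_per_server_tb = 2
replication = 3  # x3 replication of data on disk

ram_pb = servers * ram_per_server_gb / 1e6               # GB -> PB
disk_pb = servers * disk_per_server_tb / 1e3 / replication  # TB -> PB, post-replication
print(f"RAM: {ram_pb:.0f} PB, usable disk: {disk_pb:.0f} PB")
# RAM: 8 PB, usable disk: 333 PB
```

Against a ~2 petabyte web crawl and a ~50 petabyte estimate for all written works, both totals come out comfortably ahead, which is the point of the comparison.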

Jeff says, "Things will crash. Deal with it!" He then notes that Google's datacenter experience is that, in just one year, 1-5% of disks fail, 2-4% of servers fail, and each machine can be expected to crash at least twice. Worse, as Jeff notes briefly in this talk and expanded on in other talks, some of the servers can have slowdowns and other soft failure modes, so you need to track not just up/down states but whether the performance of the server is up to the norm. As he has said before, Jeff suggests adding plenty of monitoring, debugging, and status hooks into your systems so that, "if your system is slow or misbehaving" you can quickly figure out why and recover. From the application side, Jeff suggests apps should always "do something reasonable even if it is not all right" on a failure because it is "better to give users limited functionality than an error page."

Jeff emphasizes the importance of back of the envelope calculations on performance, "the ability to estimate the performance of a system design without actually having to build it." To help with this, on slide 24, Jeff provides "numbers everyone should know" with estimates of times to access data locally from cache, memory, or disk and remotely across the network. On the next slide, he walks through an example of estimating the time to render a page with 30 thumbnail images under several design options. Jeff stresses the importance of having at least a high-level understanding of the performance of every major system you touch, saying, "If you don't know what's going on, you can't do decent back-of-the-envelope calculations!" and later adding, "Think about how much data you're shuffling around."
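The thumbnail estimate works out like this with the commonly cited numbers (about 10 ms per disk seek, about 30 MB/s sequential disk read); the exact figures below are my arithmetic with those round numbers, not copied from the slide.

```python
# Estimate time to serve a page of 30 thumbnails (256 KB each) from disk.
SEEK_S = 10e-3        # disk seek: ~10 ms
DISK_MB_PER_S = 30.0  # sequential disk read: ~30 MB/s
n, size_mb = 30, 0.256

# Design 1: read all thumbnails serially from one disk.
serial = n * (SEEK_S + size_mb / DISK_MB_PER_S)
# Design 2: issue all 30 reads in parallel to different disks.
parallel = SEEK_S + size_mb / DISK_MB_PER_S
print(f"serial: {serial*1000:.0f} ms, parallel: {parallel*1000:.0f} ms")
# serial: ~560 ms, parallel: ~19 ms -- a 30x difference visible before
# writing a line of production code, which is the point of the exercise
```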

Jeff makes an insightful point that, when designing for scale, you should design for expected load, ensure it still works at x10, but don't worry about scaling to x100. The problem here is that x100 scale usually calls for a different and usually more complicated solution than what you would implement for x1; a x100 solution can be unnecessary, wasteful, slower to implement, and have worse performance at a x1 load. I would add that you learn a lot about where the bottlenecks will be at x100 scale when you are running at x10 scale, so it often is better to start simpler, learn, then redesign rather than jumping into a more complicated solution that might be a poor match for the actual load patterns.

The talk covers BigTable, which was discussed in previous talks but now has some statistics updated, and then goes on to talk about a new storage and computation system called Spanner. Spanner apparently automatically moves and replicates data based on usage patterns, optimizes the resources of the entire cluster, uses a hierarchical directory structure, allows fine-grained control of access restrictions and replication on the data, and supports distributed transactions for applications that need it (and can tolerate the performance hit). I have to say, the automatic replication of data based on usage sounds particularly cool; it has long bothered me that most of these data storage systems create three copies for all data rather than automatically creating more than three copies of frequently accessed head data (such as the last week's worth of query logs) and then disposing of the extra replicas when they are no longer in demand. Jeff says they want Spanner to scale to 10M machines and an exabyte (1k petabytes) of data, so it doesn't look like Google plans on cutting their data center growth or hardware spend any time soon.

Data center guru James Hamilton was at the LADIS 2009 talk and posted detailed notes. Both James' notes and Jeff's slides (PDF) are worth reviewing.

Monday, October 19, 2009

Using the content of music for search

I don't know much about analyzing music streams to find similar music, which is part of why I much enjoyed reading "Content-Based Music Information Retrieval" (PDF). It is a great survey of the techniques used, helpfully points to a few available tools, and gives several examples of interesting research projects and commercial applications.

Some extended excerpts:
At present, the most common method of accessing music is through textual metadata .... [such as] artist, album ... track title ... mood ... genre ... [and] style .... but are not able to easily provide their users with search capabilities for finding music they do not already know about, or do not know how to search for.

For example ... Shazam ... can identify a particular recording from a sample taken on a mobile phone in a dance club or crowded bar ... Nayio ... allows one to sing a query and attempts to identify the work .... [In] Musicream ... icons representing pieces flow one after another ... [and] by dragging a disc in the flow, the user can easily pick out other similar pieces .... MusicRainbow ... [determines] similarity between artists ... computed from the audio-based similarity between music pieces ... [and] the artists are then summarized with word labels extracted from web pages related to the artists .... SoundBite ... uses a structural segmentation [of music tracks] to generate representative thumbnails for [recommendations] and search.

An intuitive starting point for content-based music information retrieval is to use musical concepts such as melody or harmony to describe the content of music .... Surprisingly, it is not only difficult to extract melody from audio but also from symbolic representations such as MIDI files. The same is true of many other high-level music concepts such as rhythm, timbre, and harmony .... [Instead] low-level audio features and their aggregate representations [often] are used as the first stage ... to obtain a high-level representation of music.

Low-level audio features [include] frame-based segmentations (periodic sampling at 10ms - 1000ms intervals), beat-synchronous segmentations (features aligned to musical beat boundaries), and statistical measures that construct probability distributions out of features (bag of features models).

Estimation of the temporal structure of music, such as musical beat, tempo, rhythm, and meter ... [lets us] find musical pieces having similar tempo without using any metadata .... The basic approach ... is to detect onset times and use them as cues ... [and] maintain multiple hypotheses ... [in] ambiguous situations.

Melody forms the core of Western music and is a strong indicator for the identity of a musical piece ... Estimated melody ... [allows] retrieval based on similar singing voice timbres ... classification based on melodic similarities ... and query by humming .... Melody and bass lines are represented as a continuous temporal-trajectory representation of fundamental frequency (F0, perceived as pitch) or a series of musical notes .... [for] the most predominant harmonic structure ... within an intentionally limited frequency range.

Audio fingerprinting systems ... seek to identify specific recordings in new contexts ... to [for example] normalize large music content databases so that a plethora of versions of the same recording are not included in a user search and to relate user recommendation data to all versions of a source recording including radio edits, instrumental, remixes, and extended mix versions ... [Another example] is apocrypha ... [where] works are falsely attributed to an artist ... [possibly by an adversary after] some degree of signal transformation and distortion ... Audio shingling ... [of] features ... [for] sequences of 1 to 30 seconds duration ... [using] LSH [is often] employed in real-world systems.
The paper goes into much detail on these topics as well as covering other areas such as chord and key recognition, chorus detection, aligning melody and lyrics (for Karaoke), approximate string matching techniques for symbolic music data (such as matching noisy melody scores), and difficulties such as polyphonic music or scaling to massive music databases. There also is a nice pointer to publicly available tools for playing with these techniques if you are so inclined.
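To make the frame-based segmentation idea concrete, here is a toy sketch (my own, not from the survey) that slices a signal into fixed-length frames and computes a simple low-level feature per frame; real systems use much richer features such as MFCCs:

```python
import math

def frame_features(samples, frame_size):
    """Per-frame RMS energy over non-overlapping fixed-length frames."""
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples) - frame_size + 1, frame_size)]
    return [math.sqrt(sum(x * x for x in f) / frame_size) for f in frames]

# A fake "signal": quiet first half, loud second half. A bag-of-features
# model would then summarize the track as a distribution over these
# per-frame values rather than keeping their order.
signal = [0.1] * 100 + [0.9] * 100
feats = frame_features(signal, frame_size=50)
print(feats)  # low energy for the first two frames, high for the last two
```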

By the way, for a look at an alternative to these kinds of automated analyses of music content, don't miss this last Sunday's New York Times Magazine article, "The Song Decoders", describing Pandora's effort to manually tag songs with fine-grained mood, genre, and style categories and then use those tags to find similar music.

Friday, October 16, 2009

An arms race in spamming social software

Security guru Bruce Schneier has a great post up, "The Commercial Speech Arms Race", on the difficulty of eliminating spam in social software. An excerpt:
When Google started penalising a site's search engine rankings for having ... link farms ... [then] people engaged in sabotage: they built link farms and left blog comment spam to their competitors' sites.

The same sort of thing is happening on Yahoo Answers. Initially, companies would leave answers pushing their products, but Yahoo started policing this. So people have written bots to report abuse on all their competitors. There are Facebook bots doing the same sort of thing.

Last month, Google introduced Sidewiki, a browser feature that lets you read and post comments on virtually any webpage ... I'm sure Google has sophisticated systems ready to detect commercial interests that try to take advantage of the system, but are they ready to deal with commercial interests that try to frame their competitors?

This is the arms race. Build a detection system, and the bad guys try to frame someone else. Build a detection system to detect framing, and the bad guys try to frame someone else framing someone else. Build a detection system to detect framing of framing, and well, there's no end, really.

Commercial speech is on the internet to stay; we can only hope that they don't pollute the social systems we use so badly that they're no longer useful.
An example that Bruce did not mention is shill reviews on Amazon and elsewhere, something that appears to have become quite a problem nowadays. The most egregious example of this is paying people using Amazon MTurk to write reviews, as CMU professor Luis von Ahn detailed a few months ago.

Some of the spam can be detected using algorithms, looking for atypical behaviors in text or actions, and using community feedback, but even community feedback can be manipulated. It is common, for example, to see negative reviews get a lot of "not helpful" votes, which, at least in some cases, appears to be the work of people who might gain from suppressing those reviews. An arms race indeed.

An alternative to detection is to go after the incentive to spam, trying to reduce the reward from spamming. The winner-takes-all effect of search engine optimization -- where being the top result for a query has enormous value because everyone sees it -- could be countered, for example, by showing different results to different people. For more on that, please see my old July 2006 post, "Combating web spam with personalization".

Monday, October 12, 2009

A relevance rank for communication

Nick Carr writes of our communication hell, starting with "the ring of your phone would butt into whatever you happened to be doing at that moment" and left you "no choice but to respond immediately", going through false saviors such as asynchronous voice mail and e-mail, and leading to "the approaching [Google] Wave [which] promises us the best of both worlds: the realtime immediacy of the phone call with the easy broadcasting capacity of email. Which is also, as we'll no doubt come to discover, the worst of both worlds."

In all of these communications, the problem is not so much the difference between synchronous and asynchronous but the lack of priority. Phone calls, voice mails, e-mails, and text messages all appear to us sorted by date. Reverse chronological order works well as a sort order when either the list is short or when we only care about the tip of the stream. Otherwise, it rapidly becomes overwhelming.

When your communication becomes overwhelming, when there is just too much to look at, you need a way to prioritize. You need a relevance rank for communication.

The closest thing I have seen to this is still an ancient project out of Microsoft Research called Priorities. This project and the work that followed ([1] [2]) tried to automate the process we all currently do manually of prioritizing incoming communication. The idea was to look at who we talk to, what we talk to them about, and add in information such as overall social capital to rank the chatter by usefulness and importance.
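As a rough illustration of that idea, here is a toy message ranker; the scoring formula and weights are invented for illustration and have nothing to do with the actual Priorities system:

```python
# Toy relevance rank for incoming messages: score each by how often we
# correspond with the sender, topic overlap with what we usually discuss,
# and recency, then sort by score instead of by date. Purely illustrative.
def score(msg, contact_freq, my_topics):
    sender_weight = contact_freq.get(msg["sender"], 0)
    topic_overlap = len(set(msg["words"]) & my_topics)
    recency = 1.0 / (1 + msg["age_hours"])
    return 2.0 * sender_weight + topic_overlap + recency

contact_freq = {"alice": 5, "bob": 1}   # how often we talk to each person
my_topics = {"findory", "ranking", "news"}

inbox = [
    {"sender": "stranger", "words": ["offer", "deal"], "age_hours": 1},
    {"sender": "alice", "words": ["ranking", "news"], "age_hours": 20},
]
inbox.sort(key=lambda m: score(m, contact_freq, my_topics), reverse=True)
print([m["sender"] for m in inbox])  # alice first despite the older message
```

A date-sorted inbox would put the stranger's fresh message on top; a relevance-ranked one surfaces the familiar sender and familiar topics first.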

Going one step further, not only do we need a relevance rank for communication, but also for all the information streaming at us in our daily lives. We need personalized rankers for news, communications, events, and shopping. Information streams need to be transformed from a flood of noise to a trickle of relevance. Information overload must be tamed.

For more on that, please also see my Dec 2005 post, "E-mail overload, social sorting, and EmailRank", and my Mar 2005 post, "A relevance rank for news and weblogs".

Monday, September 28, 2009

Starting Findory: Acquisition talks

[I wrote a draft of this post nearly two years ago as part of my Starting Findory series, but did not publish it at the time; it seemed inappropriate given my position at Microsoft and the economic downturn. Recently, Google and Microsoft both announced ([1] [2]) that they intend to make 12-15 acquisitions a year, which makes this much more timely.]

At various points when I was running Findory, I approached or was approached by other firms about acquisition. For the most part, these talks went well, but, as with many experiences at Findory, my initial expectations proved naive. It gave me much to contemplate, both for what startups should think about when entering these talks and changes bigger companies might consider in how they do acquisitions.

For a startup, acquisition talks can be a major distraction. They take time away from building features for customers. They create legal bills that increase burn rate. They distract the startup with nervous flutters of uncertainty about the future, potential distant payouts, and the complexity of a move.

Acquisition talks also can be dangerous for a startup. Some companies might start due diligence, extract all the information they can, then decide to try to build it themselves.

There is some disagreement on this last point. For example, Y-Combinator's Paul Graham wrote, "What protects little companies from being copied by bigger competitors is ... the thousand little things the big company will get wrong if they try." Paul is claiming that big companies have such poor ability to execute that the danger of telling them everything is low.

However, big companies systematically underestimate the risk of failure and cost of increased time to market. For internal teams, which often already are jealous of the supposedly greener grass of the startup life, the perceived fun of trying to build it themselves means they are unreasonably likely to try to do so. Paul is right that a big company likely will get it wrong when they try, but they also are likely to try, which means the startup got nothing from their talks but a distraction.

There are other things to watch out for in acquisition talks. At big companies, acquisitions of small startups often are channeled into the same slow, bureaucratic process as an acquisition of a 300-person company. Individual incentives at large firms often reward lack of failure more than success, creating a bias toward doing nothing over doing something. In fact, acquiring companies usually feel little sense of urgency until an executive is spooked by an immediate competitive threat, at which point they panic like a wounded beast, suddenly motivated by fear.

Looking back, my biggest surprise was that companies show much less interest than I expected in seeking out very small, early-stage companies to acquire.

As Paul Graham argued in his essay, "Hiring is obsolete", for the cost of what looks like a large starting bonus, the companies get experience, passion, and proven ability to deliver. In an early-stage acquisition, a company buys only talent and technology. There is no overhead, no markup for financiers, and no investment in building a brand.

Moreover, as the business literature shows, it is the small acquisitions that usually bring value to a company and the large acquisitions that destroy value. On average, companies should prefer doing 100 $2-5M acquisitions over one large $200-500M one, but business development groups at large companies are not set up that way.

There is a missed opportunity here. Bigger companies could treat startups like external R&D, letting those that fail fail at no cost, scooping up the talent that demonstrates ingenuity, passion, and ability to execute. It would be a different way of doing acquisitions, one that looks more like hiring than a merging of equals, but also one that is likely to yield much better results.

For more on that, see also Paul Graham's essay, "The Future of Web Startups", especially his third point under the header "New Attitudes to Acquisition".

Thursday, September 17, 2009

Book review: Search User Interfaces

UC Berkeley Professor Marti Hearst has a great new book out, "Search User Interfaces".

The book is a survey of recent work in search, but with an unusual focus on the impact of interface design on searchers' perceptions of the quality and usefulness of the search results.

Marti writes with the opinionated authority of an expert in the field, usefully pointing at techniques which have shown promise while dismissing others as consistently confusing to users. Her book is a guide to what works and what does not in search, warning of paths that likely lead into the weeds and counseling us toward better opportunities.

To see what I mean, here are some extended excerpts. First, on why web search result pages still are so simple and spartan in design:
[The] search results page from Google in 2007 [and] ... Infoseek in 1997 ... are nearly identical. Why is the standard interface so simple?

Search is a means towards some other end, rather than a goal in itself. When a person is looking for information, they are usually engaged in some larger task, and do not want their flow of thought interrupted ... The fewer distractions while reading, the more usable the interface.

Almost any feature that a designer might think is intuitive and obvious is likely to be mystifying to a significant proportion of Web users.
On the surprising importance of small and subtle design tweaks:
Small [design] details can make [a big] difference ... For example, Franzen and Karlgren, 2000 found that showing study participants a wider entry form encouraged them to type longer queries.

[In] another example ... [in] an early version of the Google spelling suggestions ... searchers generally did not notice the suggestion at the top of the page of results ... Instead, they focused on the search results, scrolling down to the bottom of the page scanning for a relevant result but seeing only the very poor matches to the misspelled words. They would then give up and complain that the engine did not return relevant results ... [The solution] was to repeat the spelling suggestion at the bottom of the page.

[More generally,] Hotchkiss, 2007b attributed [higher satisfaction on Google] not to the quality of the search results, but rather to [the] design ... A Google VP confirmed that the Web page design is the result of careful usability testing of small design elements.

Hotchkiss, 2007b also noted that Google is careful to ensure that all information in ... the upper left hand corner ... where users tend to look first for search results ... is of high relevance ... He suggested that even if the result hits for other search engines are equivalent in quality to Google's, they sometimes show ads that are not relevant at the top of the results list, thus degrading the user experience.
Speedy search results are important for staying on task, getting immediate feedback, and rapidly iterating:
Rapid response time is critical ... Fast response time for query reformulation allows the user to try multiple queries rapidly. If the system responds with little delay, the user does not feel penalized for [experimenting] ... Research suggests that when rapid responses are not available, search strategies change.
So how do people search? Marti summarizes many models, but here are excerpts on my favorites, berry-picking and information foraging:
The berry-picking model of information seeking ... [assumes] the searchers' information needs, and consequently their queries, continually shift.

Information encountered at one point in a search may lead in a new, unanticipated direction. The original goal may become partly fulfilled, thus lowering the priority of one goal in favor of another ... [The] searchers' information needs are not satisfied by a single, final retrieved set of documents, but rather by a series of selections and bits of information found along the way.

[Similarly,] information foraging theory ... assumes that search strategies evolve toward those that maximize the ratio of valuable information gained to unit of cost for searching and reading.

The berry-picking model is supported by a number of observational studies (Ellis, 1989, Borgman, 1996b ... O'Day and Jeffries, 1993) .... A commonly-observed search strategy is one in which the information seeker issues a quick, imprecise query in the hopes of getting into approximately the right part of the information space, and then doing a series of local navigation operations to get closer to the information of interest (Marchionini, 1995, Bates, 1990).

One part of ... information foraging theory discusses the notion of information scent: cues that provide searchers with concise information about content that is not immediately perceptible. Pirolli, 2007 notes that small perturbations in the accuracy of information scent can cause qualitative shifts in the cost of browsing; improvements in information scent are related to more efficient foraging .... Search results listings must provide the user with clues about which results to click.
What can we do to help people forage for information? Let's start with providing strong information scent in our search result snippets:
[It is important to] display ... a summary that takes the searcher's query terms into account. This is referred to as keyword-in-context (KWIC) extractions.

[It] is different than a standard abstract, whose goal is to summarize the main topics of the document but might not contain references to the terms within the query. A query-oriented extract shows sentences that summarize the ways the query terms are used within the document.

Visually highlighting query terms ... helps draw the searcher's attention to the parts of the document most likely to be relevant to the query, and to show how closely the query terms appear to one another in the text. However, it is important not to highlight too many terms, as the positive effects of highlighting will be lost.

The prevalence of query-biased summaries is relatively recent ... [when] Google began storing full text of documents, making them visible in their cache and using their content for query-biased summaries. Keyword-in-context summaries [now] have become the de facto standard for web search engine result displays.

There is an inherent tradeoff between showing long, informative summaries and minimizing the screen space required by each search hit. There is also a tension between showing fragments of sentences that contain all or most of the query terms and showing coherent stretches of text containing only some of the query terms. Research is mixed about how and when chopped-off sentences are preferred and when they harm usability (Aula, 2004, Rose et al., 2007). Research also shows that different results lengths are appropriate depending on the type of query and expected result type (Lin et al., 2003, Guan and Cutrell, 2007, Kaisser et al., 2008), although varying the length of results has not been widely adopted in practice.
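To make the keyword-in-context idea concrete, here is a toy query-biased snippet generator (my own simplification, not from the book): pick the sentence containing the most query terms and highlight them.

```python
# Minimal sketch of a keyword-in-context (query-biased) snippet: choose
# the sentence containing the most distinct query terms and bold those
# terms, as web search engines do. Purely illustrative.
def kwic_snippet(document: str, query: str) -> str:
    terms = set(query.lower().split())
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    # Score each sentence by how many distinct query terms it contains.
    def hits(sentence):
        return len(terms & set(sentence.lower().split()))
    best = max(sentences, key=hits)
    # Highlight matched terms in the winning sentence.
    return " ".join(
        f"<b>{w}</b>" if w.lower() in terms else w for w in best.split())

doc = ("Findory launched in 2004. It personalized news using reading "
       "history. The site shut down in 2007.")
print(kwic_snippet(doc, "personalized news"))
# It <b>personalized</b> <b>news</b> using reading history
```

A standard abstract would summarize the whole document; this query-biased extract instead pulls out the passage where the query terms actually appear.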
Next, let's help people iterate on their searches:
Roughly 50% of search sessions involve some kind of query reformulation .... Term suggestion tools are used roughly 35% of the time that they are offered to users.

Usability studies are generally positive as to the efficacy of term suggestions when users are not required to make relevance judgements and do not have to choose among too many terms ... Negative results... seem to stem from problems with the presentation interface.

[When used,] search results should be shown immediately after the initial query, alongside [any] additional search aids .... A related recent development in rapid and effective user feedback is an interface that suggests a list of query terms dynamically, as the user types the query.

10-15% of queries contain spelling or typographical errors .... [Searchers] may prefer [a spelling] correction to be made automatically to avoid the need for an extra click ... [perhaps] with [the] guess of the correct spelling interwoven with others that contain the original, most likely incorrect spelling.

Web search ... query spelling correction [is] a harder problem than traditional spelling correction because of the prevalence of proper names, company names, neologisms, multi-word phrases, and very short contexts ... A key insight for improving spelling suggestions on the Web was that query logs often show not only the misspelling, but also the corrections that users make in subsequent queries.
Avoid the temptation to prioritize anything other than very fast delivery of very relevant results into the top 3 positions. That is all most people will see. Beyond that, we've likely missed our shot, and we probably should focus on helping people iterate:
Searchers rarely look beyond the first page of search results. If the searcher does not find what they want in the first page, they usually either give up or reformulate their query ... Web searchers expect the best answer to be among the top one or two hits in the results listing.
The book has much advice on designs and interfaces that appear to be helpful as well as those that do not. Here is some of the advice on what appears to be helpful:
Numerous studies show that an important search interface design principle is to show users some search results immediately after their initial query ... This helps searchers understand if they are on the right track or not, and also provides them with suggestions of related words that they might use for query reformulation. Many experimental systems make the mistake of requiring the user to look at large amounts of helper information, such as query refinement suggestions or category labels, before viewing results directly.

Taking [query term] order and proximity into account ... [in ranking can] improve the results without confusing users despite the fact that they may not be aware of or understand those transformations .... In general, proximity information can be quite effective at improving precision of searches (Hearst, 1996, Clarke et al., 1996, Tao and Zhai, 2007).

Research shows that people are highly likely to revisit information they have viewed in the past and to re-issue queries that they have written in the past (Jones et al., 2002, Milic-Frayling et al., 2004) .... A good search history interface should substantially improve the search experience for users.

Recent work has explored how to use implicit relevance judgements from multiple users to improve search results rankings ... [For example] Joachims et al., 2005 conducted experiments to assess the reliability of clickthrough data ... [and found] several new effective ways for generating relative signals from this implicit information ... Agichtein et al., 2006b built upon this work and showed even more convincingly that clickthrough and other forms of implicit feedback are useful when gathered across large numbers of users.

A general rule of thumb for search usability is to avoid showing the user empty results sets .... eBay introduced an interesting ... [technique where] when no results are found for a query ... the searcher is shown a view indicating how many results would be brought back if only k out of n terms were included in the query.

The wording above the text box can influence the kind of information that searchers type in .... A hint [in] the search box [can] indicate what kind of search the user should do .... Short query forms lead to short queries.

It is important not to force the user to make selections before offering a search box.

Graphical or numerical displays of relevance scores have fallen out of favor ... [Studies] tend to find that users do not prefer them.

Stemming is useful, but removing stopwords can be hazardous.

Diversity... [of] the first few results displayed ... [is] important for ambiguous queries.
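One concrete technique in the excerpts above, eBay's handling of empty result sets, is easy to sketch; the tiny inverted index here is purely illustrative:

```python
from itertools import combinations

# Toy inverted index: term -> set of matching document ids. Illustrative
# only; any real system would use a proper index.
index = {
    "red": {1, 2, 3}, "vintage": {2, 3}, "fender": {3, 9}, "banjo": {7},
}

def results(terms):
    """Documents matching ALL terms (conjunctive search)."""
    sets = [index.get(t, set()) for t in terms]
    out = sets[0].copy()
    for s in sets[1:]:
        out &= s
    return out

def relax(terms):
    """If the full query is empty, count hits for each (n-1)-term subset."""
    if results(terms):
        return {}
    return {subset: len(results(list(subset)))
            for subset in combinations(terms, len(terms) - 1)}

# "red fender banjo" matches nothing, so show the searcher how many
# results each relaxed query would bring back.
print(relax(["red", "fender", "banjo"]))
```

Instead of a dead-end "no results" page, the searcher sees which term to drop to get back on track.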
Three ideas -- universal search including multimedia, faceted search, and implicit personalization -- appear to be helpful only in some cases:
Web search engines are increasingly blending search results from multiple information sources ... Multimedia results [are] best placed a few positions down in the search results list ... When placed just above the "fold" (above where scrolling is needed) they can increase clickthrough.

Eye-tracking studies suggest that even when placed lower down, an image often attracts the eye first (Hotchkiss et al., 2007). It is unclear if information-rich layouts ... are desirable or if this much information is too overwhelming for users on a daily basis.

Hierarchical faceted metadata ... [allows] users to browse information collections according to multiple categories simultaneously ... [by selecting] a set of category hierarchies, each of which corresponds to a different facet (dimension or feature type) ... Most documents discuss several different topics simultaneously ... Faceted metadata provides a usable solution to the problems with navigation of strict hierarchies ... A disadvantage of category systems is that they require the categories to be assigned by hand or by an algorithm.

Usability results suggest that this kind of interface is highly usable for navigation of information collections with somewhat homogeneous content (English et al., 2001, Hearst et al., 2002, Yee et al., 2003). [People] like and are successful using hierarchical faceted metadata for navigating information collections, especially for browsing tasks ... [but] there are some deficiencies ... If the facets do not reflect a user's mental model of the space, or if items are not assigned facet labels appropriately, the interface will suffer ... The facets should not be too wide nor too deep ... and the interface must be designed very carefully to avoid clutter, dead ends, and confusion. This kind of interface is heavily used on Web sites today, including shopping and specialized product sites, restaurant guides, and online library catalogs.

One site that had some particularly interesting design choices is the eBay Express online shopping interface ... The designers determined in advance which subset of facets were of most interest to most users for each product type (shoes, art, etc.), and initially exposed only [those] ... After the user selected a facet, one of the compressed facets from the list below was expanded and moved up ... [They also] had a particularly interesting approach to handling keyword queries. The system attempted to map the user-entered keywords into the corresponding facet label, and simply added that label to the query breadcrumb. For example, a search on "Ella Fitzgerald" created a query consisting of the Artists facet selected with the Ella Fitzgerald label. Search within results was accomplished by nesting an entry form within the query region.

Most personalization efforts make use of preference information that is implicit in user actions .... A method of gathering implicit preference information that seems especially potent is recording which documents the user examines while trying to complete an extended search task ... The information "trails" that users leave behind them as a side-effect of doing their tasks have been used to suggest term expansions (White et al., 2005, White et al., 2007), automatically re-rank search results (Teevan et al., 2005b), predict next moves (Pitkow and Pirolli, 1999), make recommendations of related pages (Lieberman, 1995), and determine user satisfaction (Fox et al., 2005) .... Individual-based personalized rankings seem to work best on highly ambiguous queries.
Several search interfaces, despite being popular in the literature, repeatedly have been shown to either hurt or yield no improvement in usability when applied to web search. Here are brief excerpts on boolean queries, thumbnails of result pages, clustering, pseudo-relevance feedback, explicit personalization, and visualizations of query refinements and search results:
Studies have shown time and again that most users have difficulty specifying queries in Boolean format and often misjudge what the results will be ... Boolean queries ... strict interpretation tends to yield result sets that are either too large, because the user includes many terms in a disjunct, or are empty, because the user conjoins terms in an effort to reduce the result set.

This problem occurs in large part because the user does not know the contents of the collection or the role of terms within the collection ... Most people find the basic semantics counter-intuitive. Many English-speaking users assume everyday meanings are associated with Boolean operators when expressed using the English words AND and OR, rather than their logical equivalents ... Most users are not familiar with the use of parentheses for nested evaluation, nor with the notions associated with operator precedence.

Despite the generally poor usability of Boolean operators, most search engines support [the] notation.

One frequently suggested idea is to show search results as thumbnail images ... but [no attempts] have shown a proven advantage for search results viewing .... The downsides of thumbnails are that the text content in the thumbnails is difficult to see, and text-heavy pages can be difficult to distinguish from one another. Images also take longer to generate and download than text .... The extreme sensitivity of searchers to delays of even 0.5 seconds suggests that such highly interactive and visual displays need to have a clear use-case advantage over simple text results before they will succeed.

In document clustering, similarity is typically computed using associations and commonalities among features, where features are usually words and phrases (Cutting et al., 1992); the centroids of the clusters determine the themes in the collections .... Clustering methods ... [are] fully automatable, and thus applicable to any text collection, but ... [have poor] consistency, coherence, and comprehensibility.

Despite its strong showing in artificial or non-interactive search studies, the use of classic relevance feedback in search engine interfaces is still very rare (Croft et al., 2001, Ruthven and Lalmas, 2003), suggesting that in practice it is not a successful technique. There are several possible explanations for this. First, most of the earlier evaluations assumed that recall was important, and relevance feedback's strength mainly comes from its ability to improve recall. High recall is no longer the standard assumption when designing and assessing search results; in more recent studies, the ranking is often assessed on the first 10 search results. Second, relevance feedback results are not consistently beneficial; these techniques help in many cases but hurt results in other cases (Cronen-Townsend et al., 2004, Marchionini and Shneiderman, 1988, Mitra et al., 1998a). Users often respond negatively to techniques that do not produce results of consistent quality. Third, many of the early studies were conducted on small text collections. The enormous size of the Web makes it more likely that the user will find relevant results with fewer terms than is the case with small collections. And in fact there is evidence that relevance feedback results do not significantly improve over web search engine results (Teevan et al., 2005b).

But probably the most important reason for the lack of uptake of relevance feedback is that the method requires users to make relevance judgements, which is an effortful task (Croft et al., 2001, Ruthven and Lalmas, 2003) ... Users often struggle to make relevance judgements (White et al., 2005), especially when they are unfamiliar with the domain (Vakkari, 2000b, Vakkari and Hakala, 2000, Spink et al., 1998) ... The evidence suggests it is more cognitively taxing to mark a series of relevance judgements than to scan a results listing and type in a reformulated query.
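Classic relevance feedback is usually implemented as some variant of Rocchio's algorithm, which nudges the query vector toward documents the user judged relevant and away from those judged non-relevant. This minimal sketch is my own, with arbitrary alpha/beta/gamma weights and a made-up "jaguar" example; it also shows why the technique demands the explicit relevance judgements the excerpt complains about:

```python
# Rocchio relevance feedback: move the query vector toward relevant
# documents and away from non-relevant ones.
from collections import Counter

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    new_q = Counter()
    for term, w in query.items():
        new_q[term] += alpha * w
    for doc in relevant:
        for term, w in doc.items():
            new_q[term] += beta * w / len(relevant)
    for doc in nonrelevant:
        for term, w in doc.items():
            new_q[term] -= gamma * w / len(nonrelevant)
    # Negative weights are conventionally clipped to zero.
    return Counter({t: w for t, w in new_q.items() if w > 0})

# The user marked an animal page relevant and a car page non-relevant.
query = Counter({"jaguar": 1.0})
relevant = [Counter({"jaguar": 2.0, "cat": 1.0, "habitat": 1.0})]
nonrelevant = [Counter({"jaguar": 1.0, "car": 2.0, "dealer": 1.0})]

expanded = rocchio(query, relevant, nonrelevant)
print(expanded)  # "cat" and "habitat" now pull results toward the animal sense
```

The catch, as the excerpt notes, is that none of this works until the user has done the effortful part: marking documents relevant or not.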

The evidence suggests that manual creation of ... [personalization] profiles does not work very well. [In] Yang and Jeh, 2006 ... [most] participants ... [said it] required too much effort ... Several researchers have studied ... [instead] allowing users to modify ... profiles after they are created by or augmented by machine learning algorithms. Unfortunately, the outcome of these studies tends to be negative. For example ... Ahn et al., 2007 examined whether allowing users to modify a machine-generated profile for news recommendations could improve the results ... [and] found that allowing them to adjust the profiles significantly worsened the results.

Applying visualization to textual information is quite challenging ... When reading text, one is focused on that task; it is not possible to read and visually perceive something else at the same time. Furthermore, the nature of text makes it difficult to convert it to a visual analogue.


Many text visualizations have been proposed that place icons representing documents on a 2-dimensional or 3-dimensional spatial layout ... Adjacency on maps like these is meant to indicate semantic similarity along an abstract dimension, but this dimension does not have a spatial analogue that is easily understood. Usability results for such displays tend not to be positive.

Nodes-and-link diagrams, also called network graphs, can convey relationships ... [but] do not scale well to large sizes -- the nodes become unreadable and the links cross into a jumbled mess. Another potential problem with network graphs is that there is evidence that lay users are not particularly comfortable with nodes-and-links views (Viégas and Donath, 2004) ... [and they] have not been shown to work well to aid the standard search process .... Kleiboemer et al., 1996 [for example found] that graphical depictions (representing clusters with circles and lines connecting documents) were much harder to use than textual representations ... [and] Swan and Allan, 1998 implement ... [a] node-and-link network based on inter-document similarity ... [but] results of a usability study were not positive.

Applications of visualization to general search have not been widely accepted to date, and few usability results are positive. For example, Chen and Yu, 2000 conducted a meta-analysis of information visualization usability studies ... [and] found no evidence that visualization improved search performance. This is not to say that advanced visual representations cannot help improve search; rather that there are few proven successful ideas today.
Finally, three ideas -- social search, dialogue-based interfaces, and sensemaking -- may be fertile ground:
[An] idea that has been investigated numerous times is that of allowing users to explicitly comment on or change the ranking produced by the search engine ... [For example] Google has recently introduced SearchWiki which allows the user to move a search hit to the top of the rankings, remove it from the rankings, and comment on the link, and the actions are visible to other users of the system ... Experimental results on this kind of system have not been strongly positive in the past (Teevan et al., 2005b), but have not been tried on a large scale in this manner.

Another variation on the idea of social ranking is to promote web pages that people in one's social network have rated highly in the past, as seen in the Yahoo MyWeb system ... Small studies have suggested that using a network of one's peers can act as a kind of personalization to bias some search results to be more effective (Joachims, 2002, Mislove et al., 2006).

There [also] is an increasing trend in HCI to examine how to better support collaboration among users of software systems, and this has recently extended to collaborative or cooperative search. At least three studies suggest that people often work together when performing searches, despite the limitations of existing software (Twidale et al., 1997, Morris, 2008, Evans and Chi, 2008) ... Pickens et al., 2008 ... found much greater gains in collaboration on difficult tasks than on simple ones.

Dialogue-based interfaces have been explored since the early days of information retrieval research, in an attempt to mimic the interaction provided by a human search intermediary (e.g., a reference librarian) ... Dialogue-style interactions have not yet become widely used, most likely because they are still difficult to develop for robust performance.

[We can] divide the entire information access process into two main components: information retrieval through searching and browsing, and analysis and synthesis of results. This [second] process is often referred to ... as sensemaking.

The standard Web search interface does not do a good job of supporting the sensemaking process .... A more supportive search tool would ... help [people] keep track of what they had already viewed ... suggest what to look for next ... find additional documents similar to those already found ... allow for aliasing of terms and concepts ... [and] flexibly arrange, re-arrange, group, and name and re-name groups of information ... A number of research and commercial tools have been developed that attempt to mimic physical arrangement of information items in a virtual representation.
I only touched on the material in the book here. There is an interesting section on mobile search, more surveys of past academic work, screen shots from various clever but crazy visualization interfaces, and many other worthwhile goodies in the full text.

Marti was kind enough to make her book available free online -- a great resource -- but this is too good a book for a casual skim. I'd recommend picking up a copy so you can sit down for a more thorough read.

If you think you might enjoy Marti's new book, I also can't recommend strongly enough the recently published "Introduction to Information Retrieval". Earlier this year, I posted a review of that book with extended excerpts.

[Full disclosure: I offered comments on a draft of one of the chapters in Marti's book prior to publication; I have no other involvement.]

Monday, September 14, 2009

Experiments and performance at Google and Microsoft

Although people from Google and Microsoft frequently appear together at conferences, it is fairly rare to see them publicly debate technology and technique. A recent talk on A/B testing at Seattle Tech Startups is a fun exception.

In the barely viewable video of the talk, the action starts at the Q&A around 1:28:00. The presenters of the two talks, Googler Sandra Cheng and Microsoft's Ronny Kohavi, aggressively debate the importance of performance when running weblabs, with others chiming in as well. Oddly, it appears to be Microsoft, not Google, arguing for faster performance.

Making this even more amusing is that both Sandra and Ronny cut their experimenting teeth at Amazon. Sandra Cheng now is the product manager in charge of Google Website Optimizer. Ronny Kohavi now runs the experimentation team at Microsoft. Amazon is Experimentation U, it seems.

By the way, if you have not seen it, the paper "Online Experimentation at Microsoft" (PDF) that was presented at a workshop at KDD 2009 has great tales of experimentation woe at the Redmond giant. Section 7 on "Cultural Challenges" particularly is worth a read.
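For readers who have not run weblabs like the ones Sandra and Ronny describe, the core significance check behind most A/B tests is a short computation. Here is a sketch, with made-up conversion counts, of the standard two-proportion z-test comparing a control arm against a treatment arm:

```python
# Two-proportion z-test, the usual significance check for an A/B test.
from math import sqrt, erf

def ab_test(conv_a, n_a, conv_b, n_b):
    # Conversion rates for each arm and the pooled rate under the null.
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Made-up numbers: 2.0% vs. 2.4% conversion on 50,000 users per arm.
z, p = ab_test(conv_a=1000, n_a=50000, conv_b=1200, n_b=50000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The statistics are the easy part; as both speakers would likely agree, the hard parts are instrumentation, performance, and the cultural issues around acting on the results.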

Friday, September 11, 2009

Google AdWords now personalized

It has been a long time coming, but Google finally started personalizing their AdWords search advertising based on searchers' past behavior:
When determining which ads to show on a Google search result page, the AdWords system evaluates some of the user's previous queries during their search session as well as the current search query. If the system detects a relationship, it will show ads related to these other queries, too.

It works by generating similar terms for each search query based on the content of the current query and, if deemed relevant, the previous queries in a user's search session.
There have been hints of this coming for some time. Last year, there were suggestions that this feature was being A/B tested. Earlier, Google did a milder form of personalized ad targeting if the immediately previous query could be found in the referrer. Now, they finally have launched real personalized advertising using search history out to everyone.
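The mechanism in Google's description above can be sketched roughly as follows. This is my own illustration of that kind of logic, not Google's actual system; the Jaccard relatedness test and its threshold are arbitrary stand-ins for whatever Google uses to decide whether a previous query is relevant:

```python
# Sketch of session-aware ad matching in the spirit of Google's description:
# if a previous query in the session looks related to the current query,
# fold its terms into the set used to select ads.

def related(q1, q2, threshold=0.2):
    # Crude relatedness test: Jaccard overlap of the query terms.
    a, b = set(q1.split()), set(q2.split())
    return len(a & b) / len(a | b) >= threshold

def ad_terms(session, current):
    terms = set(current.split())
    for prev in session:
        if related(prev, current):
            terms |= set(prev.split())
    return terms

session = ["italy vacation", "rome hotels"]
print(ad_terms(session, "cheap flights to rome"))
```

Here "rome hotels" is judged related to the current query, so hotel ads become eligible alongside flight ads, while the unrelated "italy vacation" query is ignored.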

For more on Google's personalized advertising efforts, you might be interested in my earlier post, "Google launches personalized advertising", on the interest-based behavioral targeted advertising Google recently launched for AdSense.

Please see also my July 2007 post, "What to advertise when there is no commercial intent?"

Thursday, September 10, 2009

Rapid releases and rethinking software engineering

I have a new post up at blog@CACM, "Frequent releases change software engineering", on why software companies should consider deploying software much more frequently than they usually do.

Here is an excerpt, the last two paragraphs, as a teaser:
Frequent releases are desirable because of the changes they force in software engineering. They discourage risky, expensive, large projects. They encourage experimentation, innovation, and rapid iteration. They reduce the cost of failure while also minimizing the risk of failure. It is a better way to build software.

The constraints on software deployment have changed. Our old assumptions on the cost, consistency, and speed of software deployments no longer hold. It is time to rethink how we do software engineering.
This CACM article expands on some of the discussion in an earlier post, "The culture at Netflix", on this blog and in the comments to that post. By the way, if you have not yet seen Reed Hastings' slides on the culture at Netflix, they are worth a look.

Sunday, August 23, 2009

The culture at Netflix

Netflix CEO Reed Hastings has a very interesting presentation, "Our Freedom & Responsibility Culture", with some thought-provoking ideas on how to run a company.

Some excerpts:
Imagine if every person [you worked with] is someone you respect and learn from ... In creative work, the best are x10 better than the average ... [A] great workplace [is made] of stunning colleagues.

Responsible people thrive on freedom and are worthy of freedom ... [They are] self-motivating, [pick] up the trash lying on the floor, [and behave] like an owner ... Our model is to increase employee freedom as we grow rather than limit it ... Avoid chaos as you grow with ever more high performance people, not with rules.

Pay at the top of the market is core to high performance culture. One outstanding employee gets more done and costs less than two adequate employees ... We pay at the top of the market ... Give people big salaries ... no bonuses ... no stock options ... [and a] great health plan ... [Everyone feels] they are getting paid well relative to their other options ... Nearly all ex-employees will take a step down in comp for their next job.

We try to get rid of rules when we can .... The Netflix vacation tracking policy [is that] there is no policy or tracking. There also is no clothing policy at Netflix, but no one has come to work naked lately ... Netflix policy for expensing ... [is] five words long ... "act in Netflix's best interest" ... You don't need detailed policies for everything.
Reed also makes a great point about how to organize large companies, saying he prefers to align groups on goals and strategies while minimizing meetings over tactics. He contrasts this with "tightly-coupled monoliths" where everything is inefficiently controlled (usually from the top down) and "independent silos" where groups (e.g. engineering and marketing) work so independently that "alienation and suspicion" creep in.

There are some suggestions I disagree with. First, Reed claims Netflix should fire, with "generous severance", anyone a manager would not "fight hard to keep at Netflix" if they threatened to leave. That conflicts with his later advice that managers should not blame someone who "does something dumb" but rather ask themselves what "context [the manager] failed to set." Personally, when someone I manage is not doing well, I blame myself, not them, and I think Reed should have emphasized finding people the right challenge rather than suggesting just giving them the boot.

Second, I think Reed's advice to push new software to the website every two weeks is not nearly frequent enough -- I prefer at least daily -- and I also see this as at odds with his later claim that he wants "rapid innovation", "excellent execution", and "to be big and fast and flexible".

But, overall, a great presentation with excellent food for thought. It is a must-read for anyone thinking about how to use organizational culture to help manage a company, from little startups to bloated corporate empires.

Please see also my old 2006 post, "Management and incentives at Google", that discusses Google's corporate culture.

[Netflix slides found via Ruben Ortega, TechCrunch, and Hacking Netflix]

Update: Scott Berkun has some good thoughts on the slide deck, including nice references to Zappos' "pay to quit" idea and the "Lefferts law of management".

Tuesday, August 18, 2009

Rapid innovation using online experiments

Erik Brynjolfsson and Michael Schrage at MIT Sloan Management Review have an interesting take on the value of A/B tests in their article, "The New, Faster Face of Innovation".

Some excerpts:
Technology is transforming innovation at its core, allowing companies to test new ideas at speeds -- and prices -- that were unimaginable even a decade ago. They can stick features on Web sites and tell within hours how customers respond. They can see results from in-store promotions, or efforts to boost process productivity, almost as quickly.

The result? Innovation initiatives that used to take months and megabucks to coordinate and launch can often be started in seconds for cents.

That makes innovation, the lifeblood of growth, more efficient and cheaper. Companies are able to get a much better idea of how their customers behave and what they want ... Companies will also be willing to try new things, because the price of failure is so much lower.
The article goes on to discuss Google, Wal-mart, and Amazon as examples and to talk about the cultural changes necessary for rapid experimentation and innovation, such as switching to a bottom-up, data-driven organization and reducing management control.

I am briefly quoted in the article, making the point that even failed experiments have value because failures teach us about what paths might lead to success.

Monday, August 17, 2009

Can we make all advertising useful, relevant, and helpful?

I have a new post at blog@CACM titled, "Is advertising inherently deceptive?"

It discusses some of the moral and ethical qualms I have when working on personalized advertising. It attempts to start a discussion around the question of whether personalized advertising will be used for good.

An excerpt:
Let's say we build more personalization techniques and tools that allow advertisers and publishers to understand people's interests and individually target ads. How will our tools be used? Will they be used to provide better information to people about useful products and services? Or will they be used for deeper and trickier forms of deception?

Is advertising an industry fundamentally fueled by deception? Or is advertising better understood as a stream of information that, if well directed, can help people?
If you have thoughts on this topic, please contribute to the discussion, either here or over on the full post at blog@CACM.

Update: About one month later, in the October 12 issue of the New Yorker, Ken Auletta has an article, "Searching for Trouble", that describes a 2003 conflict between the COO of Viacom and the founders of Google on exactly this issue, deception in advertising. An excerpt:
[You want] salesmanship, emotion, and mystery. [Viacom COO Karmazin said], "You don't want to have people know what works. When you know what works or not, you tend to charge less money than when you have this aura and you're selling this mystique."

The Google executives thought Karmazin's method manipulated emotions and cheated advertisers.

Thursday, July 30, 2009

Microsoft and Yahoo, kissing behemoths

It looks like we have a Microsoft-Yahoo deal. If you smack two amorous giants together enough times, I guess you are going to get a love child.

I doubt Google has much to fear from the laggard that likely will result. Just as two wrongs don't make a right, combining two struggling organizations is unlikely to fix dysfunction.

I hope I am wrong. If the deal provides focus rather than distractions, if it allows both organizations to rapidly iterate on, develop, and deploy products people actually want, it has some chance of succeeding.

But, as separate groups, both organizations barely can control the internal squabbling of hordes of product managers, "none f-ing getting anything done", as Carol Bartz colorfully put it. Combined, someone will have to keep these beasts from pulling on and tripping over each other while they desperately pursue the leader of the pack.

Update: Excellent commentary on the deal by Danny Sullivan, Jason Calacanis, and Saul Hansell.

Wednesday, July 29, 2009

Facebook versus Google?

Greg Sterling has some good thoughts on why Facebook and Google are not in competition:
Currently the use cases for Facebook and for search are quite different.

Facebook is entertaining, Facebook is fun, Facebook kills time, Facebook enables me to keep in touch with people. But Facebook, generally speaking, is not "useful" in the sense that Google is.

For its part, Google delivers information efficiently but is generally not "entertaining" or "fun."

It's very likely that the two sites will simply co-exist fulfilling different types of needs and interests ... Neither can be expected to fundamentally undermine the core business of the other.
But what are Facebook and Google's core businesses?

It is true that the uses of Facebook and Google differ. People mostly seem to go to Facebook because they find it entertaining. People mostly go to Google because it is useful.

But, the core business of both, where they get their revenue, is from advertising. And, while Google's search advertising does quite well, they have struggled much more in non-search advertising. And non-search advertising is the problem Facebook needs to solve.

Toward the end of Fred Vogelstein's Wired article (which Greg Sterling references), Fred pinpoints the critical area of conflict:
Facebook [is] confronted with a difficult challenge: turning [their] massive user base into a sustainable business.

[Google] inked a disastrous $900 million partnership with MySpace in 2006, a failure that taught them how hard it is to make money from social networking. And privately, [Googlers] don't think Facebook's staff has the brainpower to succeed where they have failed.

"If [Facebook] found a way to monetize all of a sudden, sure, that would be a problem," says one highly placed Google executive. "But they're not going to."

Monday, July 27, 2009

Google's thin client distraction

Recently, Chris O'Brien at the San Jose Mercury News wrote:
It's getting harder every day to articulate what Google is. Is it a Web company? A software company? Something else entirely?

It's not just that it's hard to see how [Google's operating systems] fit into Google's stated mission. It's also that it's hard to explain to someone exactly what they are, or why they might, or might not, want to use them. Or to communicate why they are different from or better than any other things out there.

These new products have the whiff of engineers building things for other engineers, rather than you and me.
Even worse, these new products have the whiff of executives being unable to let go of their past battles.

For decades, Google CEO Eric Schmidt led Sun and Novell in mostly failed attempts to build thin client computers. At Google, Eric appears to be doing it again.

But Google is not a computer company. It is an advertising company. Google makes its money from advertising.

It is not as if there isn't enough to do in advertising. Despite Google's success in making search advertising more useful and helpful, most other advertising remains awful.

Fixing advertising not only would be lucrative, but also it directly fits into Google's mission to "organize the world's information and make it universally accessible and useful." At their best, ads provide useful information about interesting products and services. Right now, most contextual and display advertisements are more annoying than useful. It doesn't have to be that way.

If Google could be the solution to annoying advertising, it could reap all the rewards. Instead, Google is being led off by its generals to fight the last war.

Tuesday, July 14, 2009

Time effects in recommendations

The best paper award at the recent KDD 2009 conference went to Yehuda Koren's "Collaborative Filtering with Temporal Dynamics" (PDF).

The paper is a great read, not only because Yehuda is part of the team currently winning the Netflix Prize, but also because it has some surprising conclusions about how to deal with changing preferences and interests over time.

In particular, it is common in recommender systems to favor recent activity, such as more recent ratings by a user, either by only using the last N data points or by weighting more recent data more heavily. But Yehuda found that ineffective on the Netflix data:
The consistent finding was that prediction quality improves as we moderate ... time decay, reaching [the] best quality when there is no delay at all. This is despite the fact that users do change their taste and rating scale over the years.

Underweighting past action loses too much signal along with the lost noise, which is detrimental given the scarcity of data per user .... We require an accurate modeling of each point in the past, which will allow us to distinguish between persistent signal that should be captured and noise that should be isolated .... for understanding the customer ... [and] modeling other customers.
As in some of Yehuda's past work, he combines two models, one a latent factor model, the other an item-item approach. The models yielded "the best results published so far" on the Netflix data set by capturing temporal effects: stronger relationships between items rated close together in time, the tendency to give higher ratings to older movies (when people bother to rate them at all), drift in the average rating an individual gives over time, and the tendency to use the same rating for multiple items rated in a short timeframe.
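To make the decay discussion concrete, here is a sketch (my own, not Yehuda's model) of the exponential time-decay weighting that the paper found counterproductive. Setting decay_rate to zero recovers the no-decay scheme that worked best on the Netflix data; the sample ratings and decay rate are made up:

```python
# Time-decayed average rating: each rating is weighted by exp(-rate * age),
# so older ratings count less. With decay_rate = 0, every rating counts
# equally, the setting Koren found gave the best prediction quality.
from math import exp

def weighted_mean_rating(ratings, decay_rate):
    # ratings: list of (rating, age_in_days) pairs for one user.
    weights = [exp(-decay_rate * age) for _, age in ratings]
    return sum(r * w for (r, _), w in zip(ratings, weights)) / sum(weights)

ratings = [(4.0, 0), (5.0, 30), (2.0, 400), (3.0, 800)]

print(weighted_mean_rating(ratings, decay_rate=0.0))   # plain mean: 3.5
print(weighted_mean_rating(ratings, decay_rate=0.01))  # old ratings nearly ignored
```

With even a modest decay rate, the two old ratings contribute almost nothing, which is exactly the "losing signal along with the noise" problem the excerpt describes for users with sparse data.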

The paper is full of other cute tidbits too, like that they tried to detect day of the week effects -- do people rate lower on Mondays? -- but could not. They also discovered an unusual jump in the average rating in the data in 2004, which they hypothesize was due to features launched on the site that started showing people more movies they liked. Definitely worth a read.