ML Illustrated aims to help those who are working on (and often struggling with) applying AI/ML research in the “real world”, as in into the hands of users and customers. I have been operating in this liminal space for the better part of 20 years, starting off as an AI/ML researcher in my graduate student days, when writing software was a means to an end for getting out publications. That is to say, code quality and software best practices took a back seat to hitting those state-of-the-art metrics and getting papers published.

While the code quality of projects open-sourced by researchers has improved dramatically over the past decade, these projects are still very much about research reproducibility (very important!) and not well suited for deployment in production environments. For those like me who have been working on transferring AI/ML research into the real world, the process feels more like developing pharmaceuticals than software! As an antidote, here’s my personal journey.

After finishing my thesis, another graduate student (Alex N.) and I were foolish enough to launch a startup, with the notion that we could combine our expertise, mine in statistical natural language processing (NLP) and Alex’s in information retrieval, to create a better search engine by indexing the web based on AI-extracted concepts and not just keywords. Mind you, this was in 2003/2004, when Bing hadn’t even launched, Ask Jeeves (remember them?) was a viable option, and Teoma had been unveiled and quickly acquired.

Our engine got some early interest, but needless to say, it didn’t take over the market. However, having had to build a web-scale, distributed, fault-tolerant, and, of course, fast search engine on a bootstrapped budget in less than two years taught me a great deal about building AI-powered software systems. Many of those lessons are still very relevant today, even if the underlying AI and ML algorithms and software infrastructure have improved by leaps and bounds since then.

My AI/ML journey didn’t end when our search engine failed to gain sufficient traction. Instead, with the help of another friend, Nikos I., we repurposed the technology for the enterprise market, in the form of a SaaS semantic search and discovery engine. We gained traction, generated revenue, raised money, and built out a development team. This was when the “s*” hit the fan: real users were interacting with our service 24/7, and screw-ups meant our customers’ websites grinding to a halt.

One could say we had commercialized our AI/ML technology, and that’s when software development took a front-and-center role, as did the entire deployment pipeline, so we didn’t end up with angry customers every time new code was deployed into production. (We still got our share of angry emails and calls, but I’m very proud of the team for how rarely that happened.)

Along the way, the startup branched out from semantic search using NLP into adjacent areas, including content recommendations, big data (for analyzing clickstreams), and extraction of content intelligence from video streams, all using some combination of AI, ML, and good ol’ algorithms. It was this last phase, video analysis and understanding using deep neural nets, that led to the company’s acquisition by a leading provider of a SaaS platform for delivering videos online.

Once I joined this much larger company, the process and complexity of developing and deploying AI/ML software got elevated by another notch (or two). As team size increases, the role of AI/ML in the grand scheme of things becomes narrower, an inverse relationship that may seem paradoxical. It’s not that AI/ML gets simpler as teams and companies grow in size, but rather that the rest of the system for delivering AI/ML services grows quickly.

The short explanation is that the more people are involved in building and maintaining a system (AI/ML or otherwise), the more necessary it is to ensure the entire system is nearly unbreakable, ideally bullet-proof. As such, the kernel of the system, the AI/ML model, may be the same size in the research and production versions, but the rest of the system can be orders of magnitude larger to build and operate.

Depending on one’s background in AI/ML, this situation can be puzzling or frustrating to work within. That reaction is natural, and it can be alleviated by a better understanding of the many ways complex systems can go wrong. Different teams use different techniques and safeguards to minimize failures, not to flummox AI/ML practitioners, but to ensure robust management of change over time (code, requirements, staff, etc.). Hopefully, with improved understanding between practitioners and the rest of the team, the boundaries of roles and responsibilities can be drawn so that AI/ML projects progress (and launch!) more smoothly.

My journey hasn’t concluded yet, but the tie that binds it thus far is the desire to put AI and ML into the real world to solve real-world problems in a better way. A lot has improved since I started my travails, but unfortunately the path from AI/ML research into production is still fraught with perils and pitfalls. This blog won’t solve all of these issues by a long shot, but hopefully some of its hard-earned lessons can save others from a needless “wandering the desert” phase while trying to get their AI/ML projects off the ground, or from those soul-crushing dead ends where some esoteric limitation kills an entire project after months of work. If this blog can save some team somewhere from such agony and help them deliver their AI/ML project into production, it will have served its purpose.

— Gerald