ShopRunner is an e-commerce company that receives feeds of product data from over 140 retailer partners, including large department stores and retailers that specialize in clothing, electronics, appliances, nutritional products, and more. We present this marketplace to our members through both a web-based platform and our new mobile app, District. To provide an excellent user experience, we need a single, easy-to-navigate way of classifying products that combines all of these feeds into one coherent shopping experience. As part of that experience, we want to make it as easy as possible to drill down into very specific details like color, pattern, fabric type, and neckline by using computer vision and NLP.
From our retailer partners, we receive at least one image and multiple text fields (including name and description) for each product. One method to categorize everything could be to pass each product to a human labeler. However, this process obviously does not scale well to a large marketplace like ShopRunner, which has a continuously shifting catalog of millions of active products at any given time. Instead, we’re having machine learning do the work for us.
[Related article: Essential NLP Tools, Code, and Tips]
This situation raises a lot of questions, including:
- How do we deal with the possibility that the data provided may be insufficient for machine learning?
- How do we deal with new categories, where we do not yet have enough training data?
- Should we use separate, simpler models for overall classification vs. specific attributes, or one larger model that handles everything at once?
- How do we best combine predictions from images and from text, given that some product attributes are better learned from one than from the other?
- How can we monitor model performance over time?
Classification with images and text
The ShopRunner taxonomy has several levels of increasing specificity. For example, a product could be classified as “womens:clothing:bottoms:shorts:denim-shorts.” In this case, as in most, both the image and the text are important: an image-only model could confuse men’s, women’s, and unisex denim shorts, and it could similarly confuse men’s vs. women’s running shoes or slacks vs. colored jeans. On the other hand, the text data that we receive tends to be far messier and more limited, and in some cases either the image or the text may be corrupted.
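As a concrete illustration (a minimal sketch rather than our production code, with hypothetical level names), a colon-delimited taxonomy path like the one above can be split into one classification target per level:

```python
# Split a hierarchical taxonomy path into per-level targets.
# The level names below are illustrative, not ShopRunner's actual schema.
taxonomy_path = "womens:clothing:bottoms:shorts:denim-shorts"
level_names = ["department", "category", "subcategory", "product_type", "style"]

targets = dict(zip(level_names, taxonomy_path.split(":")))
print(targets)
# {'department': 'womens', 'category': 'clothing', 'subcategory': 'bottoms',
#  'product_type': 'shorts', 'style': 'denim-shorts'}
```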
For all of these reasons, we choose a combined model. We pass the image data and text data through separate computer vision and natural language processing models to condense each down to an embedding vector; these embeddings are then concatenated and passed into a final modeling stage. With this architecture, we can also fold in any additional information we want, for example behavioral data that quantifies which types of members have been purchasing each product.
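The sketch below shows one way such a combined architecture could look in PyTorch. It is an assumption-laden illustration rather than our actual model: the image branch is a ResNet-50, the text branch is assumed to already produce a fixed-size embedding (e.g. a pooled transformer output), and the dimensions and class counts are hypothetical.

```python
import torch
import torch.nn as nn
from torchvision import models


class CombinedProductClassifier(nn.Module):
    """Fuses an image embedding and a text embedding into a single classifier."""

    def __init__(self, text_embedding_dim=768, n_classes=200):
        super().__init__()
        # Pretrained ResNet-50 with its final classification layer removed,
        # leaving a 2048-dimensional image embedding after global pooling.
        resnet = models.resnet50(weights="IMAGENET1K_V2")
        self.image_encoder = nn.Sequential(*list(resnet.children())[:-1])
        self.image_dim = 2048
        # Final modeling stage operating on the concatenated embeddings.
        self.head = nn.Sequential(
            nn.Linear(self.image_dim + text_embedding_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, n_classes),
        )

    def forward(self, image, text_embedding):
        img_emb = self.image_encoder(image).flatten(1)          # (batch, 2048)
        combined = torch.cat([img_emb, text_embedding], dim=1)  # (batch, 2048 + text_dim)
        return self.head(combined)                              # (batch, n_classes)


# Example usage with random tensors: a batch of four 224x224 RGB images and
# 768-dimensional text embeddings.
model = CombinedProductClassifier(text_embedding_dim=768, n_classes=200)
logits = model(torch.randn(4, 3, 224, 224), torch.randn(4, 768))
```

Because the fusion happens on embedding vectors, additional signals (such as a behavioral-data embedding) could simply be concatenated alongside the image and text embeddings before the final head.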
[Related article: Create Your First Face Detector in Minutes Using Deep Learning]
Tagging auxiliary attributes
In a similar vein, combining images with text helps us automatically detect the finer details that guide customers as they drill down into our products. Some of these are better predicted from images, such as color, pattern, and sleeve length. Others are better predicted from text, such as material and petite vs. plus sizing. These problems are particularly well-suited to multi-task models (see this excellent blog post from one of our team’s senior data scientists), where a single model is trained to solve a series of related problems at once.
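To make the multi-task idea concrete, here is a minimal PyTorch-style sketch (not our production model) of a shared trunk with one output head per attribute; the attribute names and class counts are hypothetical.

```python
import torch
import torch.nn as nn


class MultiTaskAttributeModel(nn.Module):
    """Shared backbone with one classification head per attribute task.

    `backbone` is any module that maps the fused image/text input to a
    fixed-size embedding; each head predicts one attribute.
    """

    def __init__(self, backbone, embedding_dim, task_classes):
        super().__init__()
        self.backbone = backbone
        self.heads = nn.ModuleDict(
            {task: nn.Linear(embedding_dim, n) for task, n in task_classes.items()}
        )

    def forward(self, x):
        emb = self.backbone(x)
        # One set of logits per task, all computed from the shared embedding.
        return {task: head(emb) for task, head in self.heads.items()}


def multi_task_loss(outputs, targets, criterion=nn.CrossEntropyLoss()):
    """Sum per-task losses; individual tasks could also be weighted."""
    return sum(criterion(outputs[task], targets[task]) for task in outputs)


# Illustrative usage with hypothetical attribute tasks and a toy backbone.
task_classes = {"color": 20, "pattern": 12, "sleeve_length": 5}
model = MultiTaskAttributeModel(nn.Linear(128, 64), embedding_dim=64, task_classes=task_classes)
outputs = model(torch.randn(8, 128))
targets = {task: torch.randint(0, n, (8,)) for task, n in task_classes.items()}
loss = multi_task_loss(outputs, targets)
```

Sharing the backbone lets related attributes reinforce each other during training, while each head stays free to specialize in its own label space.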
Editor’s note: Want to learn more about computer vision and NLP in-person? Attend ODSC East 2020 this April 13-17 in Boston and learn from the experts who define the field!