A framework for building and evaluating data products

23 August 2024

In this episode of the Data Show, I spoke with Grace Huang, data science lead at Pinterest. With its combination of a large social graph, enthusiastic users, and multimedia data, I’ve long regarded Pinterest as a fascinating lab for data science. Huang described the challenge of building a sustainable content ecosystem and shared lessons from the front lines of machine learning product launches. We also discussed recommenders, the emergence of deep learning as a technique used within Pinterest, and the role of data science within the company.

Here are some highlights from our conversation:

Using machine learning to strengthen content ecosystems

Pinterest content is a giant, complicated corpus that has very rich metadata associated with it. If you build a recommendation system where there’s a lot of bias in it, over time you can start showing just a particular corner of that corpus to the world, because you think your user might find a piece of that corner of content particularly engaging. This is an issue when you’re basing your algorithms only on your existing users.

When Pinterest first started out, we had a very strong user base around particular user demographics. That part of the content corpus becomes very well curated, which makes those content pieces rank really high in our machine learning products. Then we had to start consciously thinking about how to combat that problem because otherwise, over time, you’re just going to build a product that only appeals to that segment of users.

From the user perspective, you want to make sure you’re creating a corpus that covers enough in terms of topics and interests, in terms of different languages people speak, in terms of different cultural backgrounds. Then, I think on the content side, we have the same problem where fresher, newer content may have trouble competing with older content that’s been around for a long time and has really good historical performance.

Maintaining this healthy ecosystem involves creating mechanisms to jump-start new content so we can show it enough times to quickly learn whether or not it’s high quality, and whether or not it might be relevant for certain segments of users. We then want to be able to use that information very efficiently to drive our downstream products.
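One common way to give new content a fair number of impressions is to add an exploration bonus to its ranking score. The sketch below is only an illustration of that general idea, using a UCB1-style bonus; it is not a description of how Pinterest does this, and the function names, item IDs, and counts are all hypothetical.

```python
import math


def score_with_exploration(clicks, impressions, total_impressions, bonus_weight=1.0):
    """UCB1-style score: observed engagement rate plus a bonus that
    shrinks as an item accumulates impressions."""
    if impressions == 0:
        return float("inf")  # items that were never shown are tried first
    rate = clicks / impressions
    bonus = bonus_weight * math.sqrt(2 * math.log(total_impressions) / impressions)
    return rate + bonus


def rank(items, bonus_weight=1.0):
    """items maps item_id -> (clicks, impressions). Returns ids, best first."""
    total = max(sum(imp for _, imp in items.values()), 1)
    return sorted(
        items,
        key=lambda item_id: score_with_exploration(*items[item_id], total, bonus_weight),
        reverse=True,
    )


# Hypothetical counts: one established item, one fresh item, one unseen item.
items = {
    "old_popular": (900, 10_000),
    "new_pin": (1, 20),
    "unseen_pin": (0, 0),
}

print(rank(items))                     # with the exploration bonus
print(rank(items, bonus_weight=0.0))   # ranking on historical rate alone
```

With the bonus, the fresh item ranks above the established one until it has been shown enough times for its observed rate to dominate; with `bonus_weight=0.0`, the established item's historical performance wins. The weight is a tuning choice that trades short-term engagement for faster learning about new content.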

Building data products: Three anti-patterns

The first one is, do not build a model only for users today. You have to think about your users tomorrow as well. Second, it’s really easy to build a system where the rich get richer. There are a lot of techniques out there to prevent that from happening; it’s often not by design. It’s very subtle, and it takes a long time to observe this rich-get-richer effect and for it to build up. You have to be very vigilant about it. … The third anti-pattern is that you might find yourself optimizing not quite the right thing. You can get exactly what you wish for with a machine learning system. It’s very good at optimizing a goal that you specify. But that goal may not necessarily correlate with the ultimate goal. Keeping your ultimate goal in mind and evaluating your products with the ultimate goal, instead of your intermediate goal, is really important. For example, I think short-term metrics are easier to optimize toward. But they may or may not correlate with a long-term goal like retention.
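Because the rich-get-richer effect builds up slowly, one way to stay vigilant is to track a simple concentration metric over time. The sketch below computes the share of impressions going to the most-shown items; the metric, the function name `top_share`, and the weekly counts are illustrative assumptions, not something described in the interview.

```python
def top_share(impressions, top_fraction=0.1):
    """Fraction of all impressions that went to the most-shown
    `top_fraction` of items (at least one item)."""
    if not impressions:
        raise ValueError("impressions must be non-empty")
    counts = sorted(impressions, reverse=True)
    k = max(1, int(len(counts) * top_fraction))
    total = sum(counts)
    return sum(counts[:k]) / total if total else 0.0


# Hypothetical weekly impression counts for the same ten items.
week_1 = [120, 110, 100, 100, 90, 90, 80, 80, 70, 60]
week_8 = [700, 60, 40, 30, 20, 20, 10, 10, 5, 5]

print(round(top_share(week_1), 2))  # 0.13
print(round(top_share(week_8), 2))  # 0.78
```

A steadily rising value across weeks suggests that exposure is concentrating on a small set of items, which is a cue to look at how the ranking system treats newer or less-shown content.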

Related resources:

Peeking into the black box: Lessons from the front lines of machine-learning product launches – A 2017 Strata Data Conference keynote by Grace Huang
Recommending 1+ billion items to 100+ million users in real time—Harnessing the structure of the user-to-object graph to extract ranking signals at scale: A 2017 Strata Data Conference presentation by Pinterest’s chief scientist, Jure Leskovec
When is data science a house of cards? Replicating data science conclusions: A 2017 Strata Data Conference presentation by Frances Haugen and June Andrews of Pinterest
Data preparation in the age of deep learning: Featuring Crowdflower co-founder Lukas Biewald

Post topics: AI & ML, Data, O’Reilly Data Show Podcast

Post tags: Podcast
