Every week, we talk about important data and analytics topics with data science leaders from around the world on Facebook Live. You can subscribe to the DataTalk podcast on iTunes, Google Play, Stitcher, SoundCloud and Spotify.
This data science video and podcast series is part of Experian’s effort to help people understand how data-powered decisions can help organizations develop innovative solutions and drive more business.
Here’s the transcript:
Mike Delgado: Hello, and welcome to Experian’s weekly #DataTalk, a show where we talk to data science leaders from around the world. Today, we are talking about how data science is improving e-commerce, and we are honored and excited to have Dr. LiangJie Hong, who is the head of data science at Etsy, which is one of my favorite online stores. He previously served as the Senior Manager of Research at Yahoo! He received his Ph.D. in computer science from Lehigh University. It’s an honor, Dr. Hong, to have you today.
LiangJie Hong: Thank you for having me.
Mike Delgado: Can you share with our community your path that led you to start to work in data science?
LiangJie Hong: Sure. I first studied machine learning and data mining in graduate school. Then I slowly developed an interest in machine learning and how it can be applied to real-world problems. At that time, probably 10 years ago, social networks were very popular, so the majority of my dissertation work is about how to apply those techniques to social networks. Then I came to Yahoo! Research, where I spent quite a bit of time applying cutting-edge machine learning techniques to a wider range of problems. Then I came to Etsy, where I spend a lot of time studying those problems for e-commerce as well as how to interact with product design, etc.
Mike Delgado: Very cool. Did you always know you wanted to work in data science? I remember even 10 years ago, the term “data scientist” wasn’t around.
LiangJie Hong: Data science, this buzzword, came around in 2011, 2012. That was funny because during that time I was an intern at LinkedIn. I believe LinkedIn was one of the first couple of places where people coined the terms “data scientist” and “data science.”
So yes, before that I was interested in data mining and machine learning in general. Those are more toward the algorithm part. I think the beauty and the passion of data science is that we bring models and real-world problems together and try to help the business that way.
Mike Delgado: Now that you’re head of data science at Etsy, can you talk about the work you’re doing?
LiangJie Hong: Absolutely. We have roughly 15 data scientists on the team. We have half our team in San Francisco, half our team in Brooklyn, New York, which is our headquarters.
We are part of an engineering organization where the goal of our teams is to build engineering-quality, end-to-end machine learning solutions for a lot of our products inside Etsy. For example, we build machine learning solutions for search ranking, where you type a keyword and we want to return the most relevant results to you, and we also develop algorithms and solutions for our recommendations.
If you come to Etsy, you see different modules that recommend the most relevant results to you. This is a mixture of engineering and product work, because we work very closely with our product managers, designers and UX researchers to flesh out the best user experience to present to users.
Mike Delgado: And for those who have never gone to Etsy.com, you have to check it out. I was browsing yesterday and there were over 9 million items, just in the home and living section. It boggles my mind.
LiangJie Hong: Exactly. We have 39 million active listings. We also have more than 40 million active buyers and more than 3 million active sellers.
It is a large-scale marketplace. Of course, compared to Amazon or eBay, we’re still small, but in terms of unique goods, handcrafted goods, we are definitely a very large marketplace.
Mike Delgado: With so many millions of items, how does machine learning help people find what they need?
LiangJie Hong: This is an ongoing challenge for us. We process hundreds of thousands of new listings per day, and we have to tag them, give them labels and put them into different categories. For example, this is a female, small-size wedding dress, versus a stylish ring or something. We use a lot of machine learning techniques to process the data. A lot of data is not perfect, and a majority of the data is very noisy. So we have to make a lot of effort there.
After that, it’s basically feeding into search engines or recommendation engines, where it is really a challenging job to find the things you would like to interact with in the future.
In either case, search or recommendation, we need to search among millions of listings and narrow them down to several thousand. Let’s say 1,000 or 2,000 candidates in the pool. Then we use more advanced machine learning to rank those candidates and make sure that we recommend the best five or six items for you. It is a challenging problem.
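The two-stage flow described here, a cheap pass that narrows the catalog to a candidate pool followed by a costlier ranking of that pool, can be sketched roughly as follows. This is a toy illustration rather than Etsy’s actual system: the `Listing` fields, the term-overlap filter, the 0.7/0.3 score weights and the sample catalog are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    listing_id: int
    title: str
    popularity: float  # hypothetical quality signal in [0, 1]


def retrieve_candidates(query, listings, pool_size=2000):
    """Stage 1: cheap filter that keeps listings sharing a term with the query."""
    terms = set(query.lower().split())
    matches = [l for l in listings if terms & set(l.title.lower().split())]
    return matches[:pool_size]


def rank(query, candidates, top_k=5):
    """Stage 2: score only the small pool and keep the top_k best."""
    terms = set(query.lower().split())

    def score(listing):
        words = set(listing.title.lower().split())
        overlap = len(terms & words) / len(terms)
        # Assumed blend of text match and popularity; a real ranker
        # would be a learned model with many more features.
        return 0.7 * overlap + 0.3 * listing.popularity

    return sorted(candidates, key=score, reverse=True)[:top_k]


def search(query, listings, pool_size=2000, top_k=5):
    """Run retrieval, then ranking, and return the final short list."""
    return rank(query, retrieve_candidates(query, listings, pool_size), top_k)


# Made-up catalog for illustration only.
CATALOG = [
    Listing(1, "silver wedding ring", 0.9),
    Listing(2, "lace wedding dress", 0.4),
    Listing(3, "oak dining chair", 0.8),
    Listing(4, "vintage silver ring", 0.5),
]

results = search("silver ring", CATALOG, top_k=2)
print([listing.title for listing in results])
```

The point of the split is cost: the first stage must be cheap enough to run over millions of listings, while the second stage can afford a heavier model because it only sees the pool.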
Mike Delgado: If you’re trying to sell something, you can write your own descriptions and maybe categorize and tag. How well do humans tag versus machine learning tagging?
LiangJie Hong: Human tagging is very useful because sellers need to describe what their items are. They need to give us their input: size, color, material, etc.
Machine learning helps solve the scalability issue, because only a very tiny portion of the data can be tagged by human beings. In general, for that portion, the human quality is very high. Of course, machine learning algorithms can only be as good as the data (provided by people). But for the majority of listings, the ones without human tags, that’s where the machine learning models can be put into play. That’s why we use these models. It’s not that they’re necessarily more accurate, but that they can be applied at large scale.
Mike Delgado: You mentioned that there’s a lot of noisy data. Can you explain what you mean by that?
LiangJie Hong: From our side, we want to know more about your product. You upload a product or product listing to Etsy, we want to know what kind of materials you have, what’s the color, what’s the size, where some of the raw materials are coming from… there are all kinds of aspects. In fact, we probably have hundreds of such aspects, or such attributes.
Mike Delgado: Wow.
LiangJie Hong: But imagine you are a seller. You just want to upload a photo and list it on the website. Filling in all of those attributes is a very cumbersome process. We need a balance between user experience and the quality of the data.
Usually we ask for some key elements that sellers need to fill in, and they can leave the rest blank. Then we can ask for some crowdsourcing support to help us tag. But crowdsourced workers are not the owners or the sellers of those products, so they can misunderstand things. That’s one of the places that the noisy data comes from.
Mike Delgado: I can’t imagine the amount of data you’re having to collect to help make the user experience better, especially with all the human involvement of people categorizing their own content, writing descriptions. And then on top of that you have the various machine learning algorithms helping to sort through that. For data scientists who are interested in getting involved in e-commerce, can you talk about some popular machine learning algorithms or techniques that are used in e-commerce?
LiangJie Hong: That’s a great question. Machine learning in e-commerce is extremely challenging. And the interesting part is that there are not too many off-the-shelf algorithms or models you can use. The beauty of the work is that you keep exploring. There are a lot of things you can borrow, of course, from traditional domains, but there are also a lot of things where you need to innovate. I usually give this example:
Let’s say you want to recommend movies to users. Let’s say you watch House of Cards 1, and you say, “Let me recommend House of Cards 2, House of Cards 3.” That’s OK for Netflix. Some kind of Netflix recommendation system will do that. But imagine recommendations for e-commerce. You just bought a camera, and then we start to show you all the other cameras. Then people might complain. In fact, we had a situation where a customer from Britain purchased a wedding dress. Then we kept showing wedding dresses. This person complained and wrote an email to us: “Stop showing me the wedding dress.” But if you buy chairs, then we’re welcome to show you more chairs. That’s the phenomenon of machine learning in e-commerce. We are dealing with a lot of problems like that.
Mike Delgado: What a huge challenge, because for certain items, like a wedding dress, you just want one.
LiangJie Hong: For a very short time.
Mike Delgado: Yeah, and you don’t need to see any more after you choose that. I guess for certain types of products there have to be rules in place. If someone buys this, probably not a good idea to show other similar items.
LiangJie Hong: Even coming up with those rules is challenging. We’re talking about 40 million buyers, and all buyers are different. Some of them might be resellers. In fact, we have highly engaged, high-volume buyers buying wedding-related stuff.
If you have a hard logic — say, you purchase wedding-related things — and then we just stop showing you for the next two weeks or next two months, those users may say, “What’s wrong with your algorithm? I want to see similar things. I think I showed this store enough of my personal preferences. Why don’t you take my personal preferences into account?” It’s an extremely difficult part of all this.
Mike Delgado: I can see how complex the work is for you and your team.
LiangJie Hong: Another challenging problem with e-commerce is the sparsity of the data. I guess a lot of users go to Amazon or eBay on a daily basis, so they give those sites a lot of opportunities to learn their personal preferences.
For Etsy, a lot of people come here to buy gifts, to buy things for their special occasions. We do have a lot of buyers who show up at Thanksgiving or the holiday season, disappear for the whole year, then show up again in the next holiday season. You can imagine we don’t have too many data points. The last batch was last year. Are you willing to utilize those data points, or do you say those data points are outdated? So we don’t have much information about these buyers. It’s a very difficult situation, and yet we need to provide personalized and engaging experiences for these users.
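One common way to handle the question of whether year-old data points are still useful is to keep them but give them less weight as they age. The sketch below uses exponential decay with a half-life. The function name, the `(category, day)` event format and the 180-day half-life are hypothetical choices for illustration, not something described in the interview.

```python
from collections import defaultdict


def decayed_category_scores(events, now_day, half_life_days=180.0):
    """Sum per-category interest, halving an event's weight every half-life.

    events: iterable of (category, day) pairs, where day is a day index.
    """
    scores = defaultdict(float)
    for category, day in events:
        age_days = now_day - day
        scores[category] += 0.5 ** (age_days / half_life_days)
    return dict(scores)


# Made-up history: two purchases last holiday season, one purchase today.
history = [("ornaments", 0), ("ornaments", 0), ("jewelry", 360)]
print(decayed_category_scores(history, now_day=360))
```

With a short half-life, last year’s holiday purchases fade almost completely. With a long one, they still shape what a returning seasonal buyer sees. Choosing that value is the judgment call raised here.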
Mike Delgado: Obviously there are people who do trolling and do inappropriate things. How do you help protect and care for the Etsy community by making sure there’s no offensive content?
LiangJie Hong: That’s a great question. We have dedicated teams to vet a lot of the shops and sellers and a lot of companies that we have on our site. We also have machine learning algorithms to scan for fraud and even money laundering. There’s a mixture of a lot of human investment as well as machine learning algorithms.
Mike Delgado: Can you talk a little bit about that? Machine learning being used to prevent fraud. I’m curious about that.
LiangJie Hong: We have teams that look at user activities and at how people might want to exploit the site or the rules that we put in place, and we use those behaviors to train our models so that we can detect those things.
It’s a never-ending process because people change their behavior. They invent new games and then we have to catch up. But yes, we utilize ML for fraud detection problems.
Mike Delgado: I’m also curious about how machine learning is helping shoppers when they’re looking through their mobile devices. I think behaviors are sometimes different on how we use mobile devices versus desktop computers.
LiangJie Hong: Yep. Half of the traffic to Etsy is from mobile devices. And we also understand that your behaviors on mobile devices are very different from desktop. One thing that’s probably special for Etsy is that people tend to browse and explore on their mobile devices but eventually check out from desktop machines. One reason is that a lot of items are expensive. They’re not commodities. If you buy a painting from the UK, that’s probably going to be $70 or $80 or more. So a lot of people want to make sure that transactions and things like that are done on their desktop. But mobile devices are driving more and more traffic.
Mike Delgado: When I watch my wife shop, she loves to browse. She’ll browse Etsy on her mobile device and add things to her cart that look interesting, but then she’ll go to her desktop to make the purchase. Is that what you see a lot?
LiangJie Hong: Yes. And that’s a very common pattern. On the other side, we are trying to improve the checkout procedure on mobile devices such that people feel comfortable to check out on their mobile devices.
Mike Delgado: When I shop online, usually I’ll do my research, look at product reviews and then I’ll buy within a short period. I shop fast, kind of like when I go to a physical store. I’m set on what I want to buy. My wife, on the other hand, likes taking her time. She will spend a lot of time thinking before she will buy something. How does machine learning adjust what different people see based on their shopping patterns?
LiangJie Hong: That’s a very good question. Let’s say you go to a shopping mall. I would say not everybody is wanting to buy something. A lot of people are exploring and just walking around, and they also enjoy the atmosphere, the environment. And from the shop perspective, they also understand that not everybody is interested in buying instantly. They want to inspire you, and maybe you make a purchase there next time. That’s very normal in our offline shopping experience.
The challenge is how can we mimic that experience online? We’re doing a reasonably good job for the folks who know exactly what they want to buy. They have keywords, and they just type them in and check out. With the actual checkout, it’s all very straightforward. But we have challenges … I think that’s not only for Etsy, but for e-commerce across the board. How can we model a discovery process? Say I come to the site, I have 10 minutes to kill, I don’t have anything specific in mind to buy. What is the inspirational kind of process such that we can inspire people to purchase things?
That’s where machine learning comes into play. A lot of innovations in machine learning and e-commerce should happen. Right now we are at the very early stage of this, because nobody has defined it yet. There’s no such thing as an e-commerce discovery model or an e-commerce inspiration model. That will change the way people shop online.
Mike Delgado: It’s fascinating hearing about all the different ways you’re leveraging machine learning for e-commerce and all the challenges involved. I was on your personal website and I saw you were at a big data meetup recently, and you talked about optimizing gross merchandise value in e-commerce. Can you talk about that?
LiangJie Hong: That’s one example of what I mentioned a little earlier, that we need to adapt traditional models to e-commerce. Take traditional information retrieval, or traditional search. The classic example is Google, where they optimize static relevance. Say you want to search “Barack Obama,” then you have … you know, there’s a Wikipedia page probably, and it jumps to the top, and you have some other sites. And these rankings are basically golden for every single person. But e-commerce is different. Let’s say you search “Harry Potter” and you want to buy some magic sneaker or something, and I search it because I want to buy a T-shirt. So the notion of relevance is personalized in general in e-commerce search.
Relevance is one way to look at things, but we also want to optimize revenue, which is called gross merchandise value. We want to optimize it when people search for things. It’s not only that we want to provide the most relevant result, but also the result that can generate the most revenue. Then we need to model how likely you are to click on an item and, after you click on it, how likely you are to purchase it. And we also need to take the pricing into account. Do we recommend things that have a higher conversion rate but a lower price, or a low conversion rate but a very high price? You see all these trade-offs and compromises that we need to make so that we can adapt the traditional model to optimize revenue in the e-commerce area.
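The trade-off described here can be made concrete by scoring each item with its expected revenue per impression: the probability of a click, times the probability of a purchase given a click, times the price. The sketch below is a minimal illustration under that assumption. The `Candidate` fields, the item names and every number are invented, and a production system would estimate the probabilities with learned models and would likely blend this score with relevance rather than rank on it alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    p_click: float                 # estimated probability of a click
    p_purchase_given_click: float  # estimated probability of purchase after a click
    price: float                   # listing price in dollars


def expected_revenue(candidate):
    """Expected revenue from showing this item once."""
    return candidate.p_click * candidate.p_purchase_given_click * candidate.price


def rank_by_expected_revenue(candidates):
    """Order items from highest to lowest expected revenue."""
    return sorted(candidates, key=expected_revenue, reverse=True)


# Invented numbers: a cheap item that converts well versus
# an expensive item that rarely converts.
ITEMS = [
    Candidate("handmade mug", 0.20, 0.10, 25.0),
    Candidate("oil painting", 0.05, 0.02, 800.0),
    Candidate("sticker pack", 0.30, 0.20, 4.0),
]

for item in rank_by_expected_revenue(ITEMS):
    print(f"{item.name}: {expected_revenue(item):.2f}")
```

In this made-up data the painting ranks first even though it converts least often, because its price outweighs its low conversion rate. The sticker pack ranks last despite converting best.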
Mike Delgado: That’s amazing. I never even thought about how you’re placing products or recommending to different people based on different conversion rates and different costs. This item generates more revenue for the company, but lower conversion rate. Are you doing a lot of testing?
LiangJie Hong: We run hundreds of A/B tests. Offline we also do a lot of testing to make sure that all the algorithms, all the models that we put out there have measurable effects. For every single one we put out there, we know the increment of revenue it’s generating and the increment of user engagement it’s generating.
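A common way to check whether a variant in an A/B test has a measurable effect on conversion is a two-proportion z-test. The sketch below uses only the Python standard library. The function name and the sample counts are hypothetical, and the interview does not say which statistical tests Etsy actually uses.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of control (a) and treatment (b).

    Returns (absolute lift, z statistic, two-sided p-value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both arms convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        # No variation at all (every visit converted, or none did).
        return p_b - p_a, 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return p_b - p_a, z, p_value


# Made-up counts: 10,000 visits per arm.
lift, z, p_value = two_proportion_z_test(200, 10_000, 260, 10_000)
print(f"lift={lift:.4f} z={z:.2f} p={p_value:.4f}")
```

The same pattern extends to revenue or engagement metrics, although those are usually continuous and call for a different test than this one.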
Mike Delgado: One of the questions I always love to ask data science leaders is … when you’re hiring someone for your team, what skill sets and personality types are important to you for someone who’s going to be good to work in a machine learning team specialized in e-commerce?
LiangJie Hong: I always get such questions going to meetups and conferences and so on. I want to emphasize something that probably is not super emphasized.
One is the ability to formulate the real-world problems into machine learning settings. A lot of students, a lot of people are very interested in the field. They tend to think machine learning or data science is a basket of models, a basket of techniques, and I need to learn these 20 models, I need to learn these five programming languages, so on and so forth. Those are definitely important. Those are the hard skills you need to have. But one very important thing is the soft skills because we talk to product managers, designers, and others who don’t necessarily have the machine learning and data science background.
Then we have to translate their requirements and the way they think into a machine learning setup. This is a very difficult skill because there are too many possibilities. One scenario, you can translate into five different setups. And these five different setups might mean different things and have different consequences, and so on, so forth. So how you think about this is very important for us. It’s key for data scientists, because this is where the science part really comes into play. The other part that’s very similar to this is communication skills.
So again, you invented this fantastic model, you came up with these really good solutions. But how can you communicate with the stakeholders? Again, we are talking about stakeholders who have very diverse backgrounds: product managers, designers, company executives. All sides are studying this. How can you make sure the things that you put up there can be summarized in plain English? This is a very important skill, and as data scientists grow I think it’s going to help them evolve along the way.
Mike Delgado: The soft skills are so key. Because if you can’t communicate it well, you’re not going to get buy-in or it’s not going to be very easy to sell it within the organization.
LiangJie Hong: Right.
Mike Delgado: I’m glad you touched on that. Because a lot of times people will focus on the hard skill, the models, and the background and stats, or different programming languages. But to your point, to get anything done in an organization, you’ve got to have that soft side.
LiangJie Hong: Yep. Right now the hard side is already emphasized enough. We all agree what the hard side is. But the way I look at this is that the more successful data scientists I see have many more soft skills. They can maneuver inside the organization and can make data science and machine learning a driving force in the organization. That’s why I emphasize the soft skills.
Mike Delgado: Before we end, I always like to ask a series of questions, and the first one is what is your favorite programming language?
LiangJie Hong: I like Python. I think it’s a very flexible and a very good tool for data science.
Mike Delgado: And the last question is, what advice do you have for our community who are interested in getting started in a data science career?
LiangJie Hong: Have patience and keep learning. A good example is we recently got a candidate who submitted things for our full-time data scientist position, and that person is from The Juilliard School.
Mike Delgado: Really?
LiangJie Hong: His major … actually he’s getting the Master of Piano Playing and all his reference letters are from performance centers. I sent an email to this guy that said, “You’re not the fit for the data scientist role that I’m looking for, but if you really think you have a passion for that …”
Because this person also attached his repository to his résumé. Obviously, this person has bimodal interests: in the daytime probably a musician, but in his free time a data scientist. I sent an email to this person encouraging him to pursue that path. That’s the advice I want to give folks: even though you may not be able to tap into this industry today, just keep your interests alive and one day there will be some good outcome out there.
Mike Delgado: That’s a cool story. That’s on another level, super smart, complex thinker, amazing pianist, and then actually wanting to pursue data science. That is amazing.
LiangJie Hong: I was shocked when I looked at that résumé.
Mike Delgado: That was brilliant. Okay. Before we end, where can everyone learn more about you?
LiangJie Hong: I have a personal website. Just search my name and it’s basically the top Google result. There you can check out what we are looking for, the job descriptions, and so on. We also list a bunch of papers and blog posts, and we post often.
Mike Delgado: Awesome. And I’ll make sure to put links to your LinkedIn profile so that people can follow you there and then also links to your website on our blog. And for those who are listening to the podcast, the short URL is just ex.pn/datatalk40, and that’ll bring you over to the website where we’ll have this interview in video format, along with the podcast episode and a full transcription and links to where you can connect with LiangJie.
Thank you so much, Dr. Hong.
LiangJie Hong: No problem.
Mike Delgado: Take care. We’ll see you all next week on #DataTalk.
Liangjie Hong is Head of Data Science at Etsy Inc., managing a group of data scientists to deliver cutting-edge scientific solutions for: Search and Discovery, Personalization and Recommendation, and Computational Advertising. Previously, he was Senior Manager of Research at Yahoo Research from 2013 to 2016, leading science efforts for Personalization and Search Sciences.
Liangjie has published papers in all major international conferences in data mining, machine learning and information retrieval, such as SIGIR, WWW, KDD, CIKM, AAAI, WSDM, RecSys and ICML, winning WWW 2011 Best Poster Paper Award, WSDM 2013 Best Paper Nominated and RecSys 2014 Best Paper Award, as well as serving as a program committee member in KDD, WWW, SIGIR, WSDM, AAAI, EMNLP, ICWSM, ACL, CIKM, IJCAI and several workshops.
In addition, he constantly reviews articles in prestigious journals such as DMKD, TKDD, TIST, TIS, and TKDE. Liangjie co-founded the User Engagement Optimization Workshop, which has been held in conjunction with CIKM 2013 and KDD 2014. Prior to Yahoo Research, he obtained his Ph.D. (2013) and M.S. (2010) from Lehigh University and B.S. (2007) from Beijing University of Chemical Technology, all in Computer Science.