We believe that big data is good. Good for our economy, good for consumers, and good for society.
In a recent Experian #DataTalk, we had a chance to talk with Seth Stephens-Davidowitz about his latest book: Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are.
Mike Delgado: Hello and welcome to Experian’s Weekly Data Talk, a show featuring some of the smartest people working in data science. Today we’re very excited to feature Seth Stephens-Davidowitz, who is the author of Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are. Seth is also a New York Times op-ed writer. He’s a lecturer at Wharton, and former data scientist at Google. He earned his Bachelor’s degree in philosophy from Stanford and then went on to earn his PhD in economics from Harvard. Seth, it’s an honor to have you in our chat today.
Seth S.: Thanks so much for having me.
Mike Delgado: Seth, can you talk a little bit about the journey that led you from philosophy to economics, from the humanities to a math-based curriculum?
Seth S.: There aren’t really jobs in philosophy so you have to transfer to something. I got very excited by the idea I could use data to learn who we are and people’s secrets, so I got into it and just decided to try that for a while.
Mike Delgado: It’s awesome that you made that transition. I was English lit and I stayed away from all mathematics, but I think that it’s great that your background in philosophy also helped you with your questioning and helps you with your data science.
Seth S.: Yeah, I think that’s right. My philosophy professors said, “It teaches you how to think,” which kind of forces you to be very rigorous in your logic so I think it was useful training. I believe undergrad in general, it’s good to have more meta training than learning particular things. Learning how to think is probably more useful as an undergrad than learning facts.
Mike Delgado: Seth, your book was something I couldn’t put down. I read it straight through within a week and I’ve been recommending it to all my friends, colleagues at work. It was mind-blowing. You spent so much time going through so much data, and then your narrative of how you told the story was just fascinating. It kept me glued, and then you’re also very funny.
You have all these little things throughout the book that kept me going. In fact, the last chapter of your book cracked me up, the way that you kind of ended it. I thought that was really clever, and you even had me reading through your notes at the tail end. That’s how good your book is, so I just want to recommend that everyone check it out.
But the theme of your book, which is the topic of today’s chat, is you talk about the Google search box serving as kind of a confessional for us, and we sometimes ask questions or type in topics that we maybe don’t want others to know about, and I’m really curious about what led you to start to analyze Google search data.
Seth S.: Well first, thank you for the very kind words about my book. I worked hard on it, so I hoped people would enjoy it, but you never know until you put it out there in the world.
I was doing my PhD in economics and I was kind of burnt out. I wasn’t that into traditional economics. Never really have been, and I thought, “What have I done with my life?” I was struggling to work, and then I don’t remember how I learned about it, but I somehow learned about Google Trends, this tool that allowed you to see where and when people made searches. Maybe it’s because I have a devious mind, but I had talked about this with my friends that Google knows everything about them. It was on my mind and I knew from my experience in the social sciences that a lot of the traditional data sources aren’t that good and you can’t really trust what they tell you. I thought, “Ah, this is very interesting.”
I think I started with some very basic stuff, but soon I was typing in inappropriate stuff. I quickly moved in that direction. I thought it was an important tool for social sciences on these questions. I said, “Oh, I won’t get bored of this for like the next five or ten years.” I guess I am interested in what people are really like, kind of the dark underbelly of society. I thought there were important insights in social science and that it could keep my interest for a while which led me in that direction pretty fast and I was researching sex and racism and insecurity and sexism and all these kinds of shady topics I guess.
Mike Delgado: Yeah, you touch on a lot of very hot topics. Sexual preference, Islamophobia, as you mentioned racism. For those of you that don’t have the book, at least get the Kindle preview so you can at least read the introduction, because I think if you just read the introduction of Seth’s book, you’ll want to buy the book, because it’s that good. The amount of insight that Seth shares about these queries. As you started to dive into these subjects that are very dark, you’re also showing humanity and how Google’s kind of showing us not only who we are, but that we’re also not alone. But I’m curious. As you started to go through this data, was there anything that surprised you as you were digging in?
Seth S.: Yeah, all kinds of things. The first thing that surprised me, and this is just because I was naive and because of the time period I started in which is 2012, shortly after Obama had been elected the first time. Everyone said, “We live in a post-racial society,” and I was just shocked by how many racist searches Americans were making, and where they ended up, where these searches were most frequently made. Right off the bat I was like, “Oh wow.” It was surprising to me. I started giving lectures and there were African-Americans in the audience, and when I started by showing quotes that social scientists said about how racism is not a big factor anymore in America, they just broke out laughing. They were not surprised by any of this research, but me personally, I was very surprised by the level of racism against African-Americans that was revealed in Google searches.
Sometimes it’s not necessarily surprising but it just confirms some things you suspect, such as sexual preferences. You kind of see that in places where it’s hard to be gay, there clearly are a lot of men in the closet. Again it’s not surprising, necessarily. Or when you make it hard to get an abortion, more people search for “self-induced abortion.” Okay, not so shocking, but then every once in a while, you get something that’s totally out of nowhere that you would never guess. Like the top search starting “my husband wants” in India is, “my husband wants me to breast-feed him.” That’s just one I would have never guessed. There are things in this data set that are just totally out of nowhere, which is cool.
Mike Delgado: One of the things that I thought was interesting as a parent was when you were looking at parent behavior and how they search for things about sons versus daughters, and you pointed out that people would start off their search, “is my son gifted?” That’s the number one search, and when it comes to daughters it’s about, “is my daughter overweight?”
Seth S.: Well, it’s not the number one search. Any time there’s a search around intelligence, gifted or genius, it’s more common for sons and any time there’s something about physical appearance, overweight, how to lose weight, it’s more about daughters.
Mike Delgado: Yeah, and so that’s one of those hidden prejudices that you kind of uncovered for parents.
Seth S.: Yeah, and also I think that’s one where parents probably aren’t aware of these prejudices, but if you look at everybody’s aggregate, you see them, so it hopefully makes people think twice before thinking about their daughter’s weight or just thinking about their son’s intelligence or giftedness.
Mike Delgado: I like how you pointed out what an odd query it is when considering that girls naturally are speaking earlier, they’re usually the first to be accepted into gifted programs, and so it was interesting that you pointed that out. I’m curious from a data science angle, did you have any challenges dealing with such large data sets?
Seth S.: The data I deal with is much smaller because the machine learning and AI people at Google, they’re the ones who do the hardcore technical stuff of getting it into Magna, getting rid of the spam and reducing the data set to a manageable size. Then they curate it to a point that I’m just dealing with much smaller data sets, so it’s not a technically huge challenge on my end. It plays into my laziness as well.
Mike Delgado: Your very first chapter when you’re dealing with the racism that was going on and the use of the N-word with words like “jokes” and “slurs” and things like that, I was shocked at that data. Then as you also revealed that your gut instinct originally, which was mine, was that the majority of that search behavior would be in a predominately white deep south area, and as you pointed out in your data, that’s definitely there, but you also said it’s not really north/south. It’s more east/west.
Seth S.: Yeah, Michigan and western Pennsylvania, eastern Ohio, upstate New York. Also really high there. It’s kind of eerie. If you’re African-American in this country, a decent percent of your neighbors are saying all the right things, they’re being nice to you, they’re smiling and they’re going home and searching “N jokes” or something. It’s kind of disturbing.
Mike Delgado: Yeah, that was really surprising to me when you shared that, and then also when you correlated that with polling in the book, where you looked at Obama’s polling in certain areas versus Kerry’s.
Seth S.: Yeah, it was clear as day that Obama did worse in places that made a lot of these racist searches. Even though everyone’s telling pollsters that they didn’t care that Obama was black, the people who were searching “N jokes” were not voting for Obama even if they’d voted for previous Democratic candidates. You’d like to think that we’ve moved beyond that a little bit. I don’t know. I guess at least now we can measure it better, so we can learn how to control it a little more.
Mike Delgado: Yeah. The other disturbing data that you pointed out was that chart in your book where you were showing how different types of people are being searched for in Google and you had things like, if you were gay or if you are Christian or Muslim, these are the words most associated with you?
Seth S.: Yeah. Again, just the thoughts that people don’t tell you. So, the Google searches are, “Are Jews evil?” Or “Are Jews greedy?” Or “Jews are cheap,” or whatever, and “Mexicans are lazy.” These are all the kinds of stereotypes that people have that they don’t necessarily share publicly, but they do turn to the internet. I think just in general people see race a lot. There’s kind of this idea that we pretend that we don’t see it, but a lot of people see it and even if they’re not searching “N jokes,” they do maybe like have questions about a group. They might have a Mexican employee who they think is not the best worker, and they’re going to go to Google and say, “Why are Mexicans lazy?” Or something like that. It’s like people do extrapolate to the group level from one example. Again, I think it’s good that we don’t pretend that this doesn’t exist, but some of it is hard to. I kind of wanted to make it a little bit uncomfortable at points.
Parts of it, when it became disturbing, I threw in some jokes. I think one of the powers of this search data is it’s so rich, it gets more evocative than if you ask someone in a survey. It doesn’t bring an emotional response necessarily, but if you see people in Mississippi searching, “How to use a coat hanger for an abortion,” it’s just very jarring. I think in a good way. You’re just like, “Oh. That person’s in trouble.” You have more compassion for what African-Americans are going through, because you know that their neighbors are searching “N-word jokes.” That can’t be easy to go through life with neighbors who are searching this stuff, so I think in a way it does make you more compassionate, because the data is very evocative.
Mike Delgado: Yeah. I think also it shows how fake, and you point this out, our social media lives are, right?
Seth S.: Yeah. If you look at the data on social media it never lines up with reality. The National Enquirer sells more copies than Atlantic Monthly, but the Atlantic Monthly is 45 times more popular on Facebook because everyone wants their friends to think they’re informed.
Mike Delgado: Right.
Seth S.: Nobody’s sharing porn on social media. I talk about how people discuss their husbands on social media. “My best friend, so cute, adorable, amazing.” And then on Google searches when they’re alone it’s like, “My husband is a jerk, annoying, so mean.”
Mike Delgado: Yeah, that’s right.
Seth S.: To kind of go back to my background, I’ve also been a little bit angry at society. The whole idea like “everybody lies,” it’s like, “Shut up. I know you are.” I kind of always felt like people were lying to me and I’ve kind of been pissed off about it, so I think this is a little bit of my revenge. Like, “Shut up. I know all of you are not thinking all day how your husbands are so cute and amazing and all this stuff. I’ve got the data now to prove it too.”
Mike Delgado: I love it. This should be like a Curb Your Enthusiasm episode. You can’t trust a word.
I’m curious. I’ve never been too skeptical of surveys in the past, but after reading your book, every single time I see a headline with, “86% of consumers say this,” I’m now skeptical of any survey data and I’m kind of curious about your take on that.
Seth S.: You can take it too far. I think that was the sexiness of my book, that surveys are what social scientists use to understand humanity, so it’s like, “Surveys suck and Google is awesome.” It’s not as simple as that. Even in this previous election where the surveys got it wrong, they only got it wrong by like 2 percentage points which is not that horrible. It’s very hard to get exactly how many votes Trump would ultimately get, so there are a lot of complications. I think a certain skepticism is warranted. I wouldn’t totally discount everything in surveys.
If there’s a change, say a president used to have a high approval rating and now he has a low approval rating, that probably tells you something. Or when Bush’s approval went to 90% after September 11th, that probably told us something about where the country was at. So I don’t think you can just say, “You can’t trust surveys at all.” But yeah, there were times where I was talking to someone in the food industry and they do these surveys asking what consumers want and they always want healthier options. You give them healthier options and they never take them. They never buy them. So there definitely are situations where you should have some skepticism and in general, focus groups also.
There’s this famous example of, as you mentioned, pretty much Curb Your Enthusiasm or Seinfeld had horrible focus groups when it first came out. I think part of it is everyone said, “Oh, these characters are so unlikable.” That’s what you’re supposed to say and you’re supposed to say, “Oh, I don’t want to watch this show with unlikable characters or narcissistic characters debating the minutiae of life,” but for whatever reason, people are drawn to it. You should have some skepticism about the things people say and always pay more attention to the things people do, what they click on, buy or search. That’s always, I think, more reliable.
Mike Delgado: Other data that I found interesting in your book was some Facebook data around marriages, and you pointed out that people who have different social circles tend to stay married longer.
Seth S.: Yeah, that’s one of the cool things. In general, also, the book’s just about what the internet means for the social sciences, which I think is a huge revolution in social sciences, the data that now exists. We haven’t had a previous data set on all the relationships in the world or most of the relationships in the world. When people are single, when they’re in a relationship, when they’re married, when they’re divorced. A lot of qualities about them. Do they have the same taste in music, the same taste in movies, do they have the same interests? Do they have the same social circle? And already we’re starting to see things in this data that are surprising. One of them is that if you share the same social circle, you’re more likely to break up so it’s better for a relationship’s longevity to get separate social circles.
Mike Delgado: Was that more significant than a shared religion or anything like that?
Seth S.: I don’t think they did that. They didn’t test that in the Google data.
Mike Delgado: Oh, okay.
Seth S.: That would be interesting to test all these things and put them in a Google search and see what seems to matter.
Mike Delgado: You also write about the limitations of big data and the problems of using data for making decisions, which I thought was really intriguing and I love your line, “Numbers can be seductive. We can grow fixated with them, and in so doing we can lose sight of more important considerations.” I’m wondering if you can kind of talk a little bit about that, because that seems to go against everything I’ve ever heard about big data and science.
Seth S.: Well actually, I was recently reading this book by Charlie Munger, who works with Warren Buffett on picking companies and he said one of their big advantages is that in their models, if they have no measurement of something but from their experience think it’s really important, they still adjust their models based on that. They consider everything, even things that can’t be measured, and I think that’s important. Data scientists tend to just love things that are measured and forget about things that can’t be measured, and that’s when you start maximizing clicks and giving everybody click bait and then people lose interest in your website because all the stories don’t live up to the promise you offered them. So, I think in general you want to be careful and think through what your model might be missing.
Mike Delgado: Time is almost up, so Seth, I was wondering if you could share some advice for those in our data science community that are looking to become data scientists and any tips you’d have for them.
Seth S.: I’d always suggest starting a blog, because it’ll just get you in the habit of producing, even if nobody reads your blog or only your mom reads your blog. Just get in the habit, let’s say every week, of working with public data sets like Google Trends, or learn how to scrape websites and put the data together. Force yourself to look at what data’s available and every week try to produce an interesting finding. I think you’ll learn a lot in that process if you force yourself to do it every week, so I recommend that just about every day.
Mike Delgado: Seth, thank you so much for being our guest. I want to remind everybody that we’ve been talking about Seth’s latest book called Everybody Lies. You can find it everywhere. Barnes and Noble, Amazon. Highly recommend you at least check out the Kindle version and download the introduction. I guarantee you’ll be hooked. This is one of those books you can’t put down. Especially if you’re on an airplane flight, this is a good book to have with you. Yes, a lot of dark stuff is in here, a lot of dark data, but Seth is also very funny and provides great narration, so I definitely recommend this book. We’ll have a link to this book in the “about” section of the YouTube video, as well as in the comments of this Facebook Live, so you can get it directly from Amazon or you can go directly to Seth’s website at sethsd.com.
Seth, thank you so much for being part of today’s conversation. I want to thank everyone for watching, for all your hearts, for all your likes, for all your shares, and we’ll see you all next week.
Seth S.: Thanks so much for having me.
Seth Stephens-Davidowitz has used data from the internet — particularly Google searches — to get new insights into the human psyche. A book summarizing his research, Everybody Lies, was published in May 2017 by HarperCollins.
Seth has used Google searches to measure racism, self-induced abortion, depression, child abuse, hateful mobs, the science of humor, sexual preference, anxiety, son preference, and sexual insecurity, among many other topics.
He worked for one-and-a-half years as a data scientist at Google and is currently a contributing op-ed writer for the New York Times. He is designing and teaching a course about his research at The Wharton School at the University of Pennsylvania, where he will be a visiting lecturer.
Seth received his BA in philosophy, Phi Beta Kappa, from Stanford, and his PhD in economics from Harvard.