Every week, we talk about important data and analytics topics with data science leaders from around the world on Facebook Live. You can subscribe to the DataTalk podcast on iTunes, Google Play, Stitcher, SoundCloud and Spotify.
This data science video and podcast series is part of Experian’s effort to help people understand how data-powered decisions can help organizations develop innovative solutions and drive more business.
To keep up with upcoming events, join our Data Science Community on Facebook or check out the archive of recent data science videos. To suggest future data science topics or guests, please contact Mike Delgado.
This week, we talked with Bill Vorhies about the ethical implications of artificial intelligence.
Here is a full transcript:
Mike Delgado: Welcome to Experian’s Weekly Data Talk, a show featuring some of the smartest people working in data science. Today we’re talking with Bill Vorhies, the Editorial Director at Data Science Central and the President and Chief Data Scientist at Data-Magnum.
Bill, thank you for being part of today’s broadcast. Can you share a bit about your background and what got you started working in data science?
Bill Vorhies: Thanks for having me today. In the ’90s, I came out of industry and went into management consulting. I ran the consulting shop at JD Power and Associates. Then, I moved over into big four consulting at Ernst and Young and PricewaterhouseCoopers. During those experiences, I had a couple of projects that required linear regression as part of the solution, which was pretty magical at the time.
In 2001, a partner and I started our own company called Predictive Modeling LLC, PREMO. I must say it’s possible to be almost too early in a market. There really wasn’t much business going on. There might have been 10 or 15 independent consulting companies like mine at the time. It was really thin. There wasn’t much air to breathe. But we persevered until big data came along in 2007. Since then, we’ve been focusing on predictive analytics, big data, streaming, IoT and AI.
For the last two-and-a-half years, I’ve been very honored to be the Editorial Director of DataScienceCentral.com, which allows me to stay abreast of the whole industry in a way I wasn’t able to before.
Mike Delgado: For those who haven’t been reading DataScienceCentral.com, I highly recommend it. Bill does an outstanding job of not only writing and curating articles, but it’s a huge resource for anybody who’s interested in what’s happening in the industry. We’ll make sure to put the link to DataScienceCentral.com in the About section of this YouTube video, as well as in the comments of the Facebook video, because it’s outstanding. In fact, one of the reasons Bill is joining us today is because of an article he wrote a couple weeks ago about artificial intelligence and the ethical dilemma we’re now unexpectedly facing.
Bill, there are several things about your article that caught our attention. First is the issue of ethics associated with AI. This is something not a lot of us have been thinking about, especially those of us who are new to the field or just watching the field. You describe this problem of ethics in AI as something that’s urgent. I wonder if you can share a little bit about what urged you to write this article, because it is one of the most popular articles right now on DataScienceCentral.com.
Bill Vorhies: Right. And believe me, I was caught off-guard by this too. There were two studies that caught my attention. One of them is almost a year old now, from November 2016. Two research academics out of universities in Canada and China, working together, published a peer-reviewed study that showed they could distinguish criminals from non-criminals based solely on facial recognition with 89.5 percent accuracy. That blew me away. At the time I wrote an article that asked if we’ve gone too far.
Then just about a month ago, two research academics from Stanford published a peer-reviewed article claiming they could tell sexual preference among men and women with 91% accuracy and 83% accuracy, respectively. One of the reasons people have to be suspicious of AI has a lot to do with online privacy and bias. But here are two examples that clearly go way beyond that, into much more personal issues than just online bias and privacy.
Now, to the issue of ethics. Ethics and moral philosophy are pretty much the same thing, and they’re about defining or recommending concepts of right and wrong conduct. Missing from that definition is the implicit fact these are human judgments about human behaviors. You might say ethics is a human interpretation of the consequences of our actions toward others. This may be the first time in history we found it necessary to apply restrictions of right and wrong conduct to nonhuman entities, because it’s the first time a mechanical or a nonhuman entity has been able to interact with us in a way that causes societal harm. That obviously is also the reason why it’s urgent.
Mike Delgado: No doubt. When I first read some of those studies, I was shocked as well. I was surprised that simply through facial recognition an AI could be that accurate in determining sexual preference or criminal activity. It was shocking. Like you said, this is now an ethical dilemma around personal privacy.
Bill Vorhies: Yes.
Mike Delgado: Are there any other studies you were looking at that concerned you?
Bill Vorhies: Yes. But in the spirit of fairness, let me get more detailed about the two studies themselves — the criminality study and the sexual orientation study. I think it’s important the audience understands that these really were peer-reviewed and good data science conducted in an academic environment.
In this criminality study, these are two professors: one out of McMaster University and the other out of a university in Shanghai. They looked at 1,856 ID photos of Chinese males 18 to 55. The subjects had no facial hair, no facial scars or other markings, and were known to have been convicted of violent or nonviolent crimes. The researchers compared those with ID photos of 1,126 non-criminals with similar socioeconomic backgrounds.
In facial recognition, you use a variety of points on the face to derive a feature set. Once they had that feature set and the two classes defined, they used several different types of supervised classifiers: logistic regression, support vector machines and convolutional neural networks. As you might guess, the CNNs and the SVMs gave them the strongest results.
We all know that CNNs can sometimes be confused by random noise. To their credit, they injected 3 percent random noise into their test, and they still got good results. You can’t fault them on the data science.
In the case of the Stanford folks, Professors Wang and Kosinski have been getting a lot of pushback on this study. First of all, we should be clear that the 91 percent accuracy they claimed for men and the 83 percent for women required them to have five images of each person. If they had just one, the result was much less clear.
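As an illustration of the workflow described for the criminality study (derive features, train a supervised classifier, then re-test with a small amount of injected noise), here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the data are synthetic two-feature points rather than facial features, the classifier is a plain logistic regression, and "3 percent noise" is interpreted as Gaussian noise with a standard deviation equal to 3 percent of each feature's spread. The researchers' actual procedure may well have differed.

```python
import math
import random

def make_data(n_per_class, rng):
    """Synthetic two-feature samples: class 0 centred at (-2, -2), class 1 at (2, 2)."""
    data = []
    for label, centre in ((0, -2.0), (1, 2.0)):
        for _ in range(n_per_class):
            data.append(([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)], label))
    return data

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(data, epochs=200, lr=0.1):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w, b = [0.0, 0.0], 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in data:
            err = sigmoid(w[0] * x[0] + w[1] * x[1] + b) - y
            grad_w[0] += err * x[0]
            grad_w[1] += err * x[1]
            grad_b += err
        w = [w[0] - lr * grad_w[0] / n, w[1] - lr * grad_w[1] / n]
        b -= lr * grad_b / n
    return w, b

def accuracy(model, data):
    """Fraction of samples whose predicted class matches the label."""
    w, b = model
    hits = sum(
        (sigmoid(w[0] * x[0] + w[1] * x[1] + b) >= 0.5) == (y == 1)
        for x, y in data
    )
    return hits / len(data)

def add_noise(data, fraction, rng):
    """Add Gaussian noise whose std is `fraction` of each feature's std (assumed meaning)."""
    stds = []
    for j in range(2):
        col = [x[j] for x, _ in data]
        mean = sum(col) / len(col)
        stds.append(math.sqrt(sum((v - mean) ** 2 for v in col) / len(col)))
    return [
        ([x[j] + rng.gauss(0.0, fraction * stds[j]) for j in range(2)], y)
        for x, y in data
    ]

rng = random.Random(0)  # fixed seed so the run is repeatable
train, test = make_data(200, rng), make_data(200, rng)
model = train_logistic(train)
clean_acc = accuracy(model, test)
noisy_acc = accuracy(model, add_noise(test, 0.03, rng))
print(f"clean accuracy: {clean_acc:.3f}, with 3% noise: {noisy_acc:.3f}")
```

Because the two synthetic classes are well separated by construction, both accuracies come out high. The sketch only shows the shape of the check (score the same model on clean and perturbed test data) and says nothing about how separable real faces are.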
As a society, we joke about how we think we can spot those folks who have different sexual preferences from ours. But these two actually ran a controlled study and showed that only about 61 percent of the time for men and 54 percent of the time for women could people simply guess personal preference from a picture. That’s not much better than a coin toss. Again, they used a facial recognition program that’s well-known, and they had 130,000 images of men and 170,000 images of women. They were 18 to 40. The images were from dating websites where the people self-reported as heterosexual or homosexual. The group was made up of Caucasians in the United States.
Once again, they used a deep neural net facial recognition program. They got about 4,000 attributes. They used a simple logistic regression. While the Chinese researchers didn’t comment about the importance of their research, the Stanford researchers at least noted the chance of this information being abused.
Those are the two I reference in the article, but you asked about other studies. There are two others I’ve come across recently. One is a study in a psychology journal, ELOS, that used the digitized voice characteristics of couples in marriage counseling to determine whether their relationships would succeed or fail after two years.
This is a fairly long longitudinal study, and they got strong results. The question is, “Are we going to go to a marriage counselor who’s going to record our conversations and say, ‘Go home. It’s not going to work’?” There was a much less friendly example in the past year. In Russia, there is a Facebook equivalent that lets you use facial characteristics to locate people in social media. Well, apparently these folks, these trolls, used pictures of sex workers to cross-identify sex workers with their actual identities on social media and then outed them. Of course, it did a lot of harm, and there was also a lot of inaccuracy. But that’s kind of the other end of the spectrum here that we’re really concerned about.
Mike Delgado: No doubt. The cases you’re bringing up are things I’ve never heard of. The earliest ethical problems with AI were about autonomous vehicles and the decisions those vehicles have to make, right?
Bill Vorhies: The trolley problem is a classical philosophical problem, and it has to do with who you’re going to hurt. The origin of the name comes from an example about a rail-mounted trolley where the operator had the opportunity to either pull the lever and injure his passengers or not pull the lever and injure bystanders. It’s one that comes up in philosophy classes over and over again, and it doesn’t have any good answer. But it is a reason to wonder about autonomous vehicles and what rules have been built into them.
I mean, am I going to get into a car that has been programmed to say, “Oh, 40 percent probability; I better hurt the passenger”?
Mike Delgado: There’s also that recent example where an AI Twitter account became racist and prejudiced based on people tweeting back to it within 24 hours.
Bill Vorhies: Right, that was Tay, and that’s a really good example. The point we really need to make here is that our AI as it exists today is just a machine. It does not understand what it’s saying. It does not have the ability to act of its own accord to try to act against us as human beings. A lot of what we read about bias and prejudice in AI today follows two memes. One of them is “The system’s biased against me. It’s going to try to do something that I don’t like.” The other is that robot overlords are on the way and so we should be concerned.
Let’s deconstruct this a little bit. While there are some concerns that we should probably have over online and geographic tracking, privacy, my own take on that is that it’s fairly minor. I think you want to remember that that type of tracking increases the efficiency in time and also reduces cost of advertising by presenting stuff that, frankly, is most aligned with our interests.
If you want to protect yourself from that stuff, go ahead. If you’re motivated to withhold your information, that’s fine. You’ll be giving up a little convenience and economy. But the second case is much closer to your question about Tay. That is, are these robot overlords going to be able to do something to us that we really don’t want?
Let’s talk about chatbots, of which Tay was an example. By the way, it was just in the news this week that there is now a chatbot you can call up and get psychological counseling from. But no matter how sophisticated it is, that chatbot has no understanding of the conversation it’s having with you. It’s a combination of some very clever natural language processing and a very, very large data set of previously successful conversations, and it has simply learned what is successful relative to your question.
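To make the "no understanding" point concrete, here is a deliberately tiny sketch of a retrieval-style chatbot in Python. It is an assumption-laden toy, not how Tay or any production chatbot was built: the class name `RetrievalBot`, the word-overlap (Jaccard) scoring, and the sample conversations are all hypothetical. The bot simply returns the stored response whose prompt shares the most words with the incoming message, and it will repeat whatever users teach it.

```python
import re

def tokens(text):
    """Lowercase a message and split it into a set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))

class RetrievalBot:
    """Toy chatbot: answers by word overlap with stored prompts, nothing more."""

    FALLBACK = "Sorry, I don't have an answer for that."

    def __init__(self, pairs):
        # Each entry is (set of prompt words, canned response).
        self.pairs = [(tokens(prompt), response) for prompt, response in pairs]

    def reply(self, message):
        words = tokens(message)
        best_reply, best_score = self.FALLBACK, 0.0
        for prompt_words, response in self.pairs:
            union = words | prompt_words
            # Jaccard similarity: shared words divided by all distinct words.
            score = len(words & prompt_words) / len(union) if union else 0.0
            if score > best_score:
                best_reply, best_score = response, score
        return best_reply

    def learn(self, prompt, response):
        """Store a new pair exactly as given; there is no filtering or judgment."""
        self.pairs.append((tokens(prompt), response))

bot = RetrievalBot([
    ("how do I reset my password", "Use the reset link on the sign-in page."),
    ("what are your opening hours", "We are open 9 to 5 on weekdays."),
])

answer = bot.reply("I forgot my password, how can I reset it?")
print(answer)

# Anything users teach the bot comes straight back out.
bot.learn("what do you think of mondays", "Mondays are the worst.")
taught = bot.reply("What do you think of Mondays?")
print(taught)
```

The bot never models meaning. It matches words and echoes stored text, which is why feeding such a system hostile examples changes what it says.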
So Tay is a perfect example. In 2016, Microsoft unleashed two versions of this chatbot, one in Japan and one in the United States. The one in Japan got rave reviews, and pretty soon there were young single men in Japan actually pledging their love to Tay because she was so sensitive and so responsive, and the thinking was, “Wow, that should work.” In the United States, unfortunately, some clever but ill-meaning folks figured out that Tay learned from what you told her, so in the space of 16 hours they got Tay to start spewing very sexual and anti-Semitic, frankly Nazi, language.
Microsoft obviously had to pull it, but it’s a good example. You can compare it to the behavior of a three- or four-year-old child raised in a home that’s filled with hate speech. That child is looking for approval and love from those adults who have that point of view, and that’s why the child continues to say those things. Chatbots aren’t the same. They have no human component. As a matter of fact, it’s important the audience remembers these AIs are actually very brittle. If you change their sensors, if you change their actuators, if you let their body of knowledge contain outdated or incorrect information, they’ll just fail outright.
Also, these systems can’t learn from one system and apply it to another. Even if they could learn from one system and apply it to another, it would be humans who would have to tell them what the goal of the game is. In interaction with humans, it’s about answering your customer service questions, but it’s not about manipulating your feelings to the benefit of the AI.
Mike Delgado: I’m glad you mentioned the overlord scenario because it’s definitely gotten a lot of buzz in the news. Proponents like Elon Musk and others have talked about their concerns about AI in the future and the ability for AI to do devious things. Do you have any concerns about that?
Bill Vorhies: I think Elon is thinking way out in the future, so first of all, let’s take a breath. Remember to look more at the donut and not at the hole. We’ve got some time. Also, words like ethics and privacy and even artificial intelligence are … I think everybody knows who Marvin Minsky was, and he said these words were suitcase words. That is, words that carry so many meanings that each person looking at the problem is going to see them from a completely different point of view.
From the wizard’s side of the curtain, if you can say that about data science, I think we need to help the public and the press not to start with their default of hyperventilating, which is so much the case these days. But we should be concerned about who’s going to use and who’s going to misuse this stuff, like the trolls in Russia. Also in the press this week, Turkey is having an LGBT crackdown. Are they going to look to this software and start using public cameras? Of course, they have not said that and I hope they’re not, and they would be wrong if they did that.
A couple of things to always keep in mind. First, data science is correlation, not causation, especially with respect to criminality. We’re not re-creating Minority Report here. We’re not going to arrest people because they look like they might be criminals. Second, even the most transparent of models have error rates associated with them. There are no models that are 100 percent accurate. So you always have to be aware of false positives and even false negatives, but particularly the potential for any of these models to predict that someone is a criminal or that someone has a sexual orientation you don’t like, and to be wrong in a significant percentage of cases.
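The point about false positives can be made concrete with a little arithmetic. The Python sketch below is illustrative only and rests on assumptions that are not from the studies: it treats the reported 89.5 percent accuracy as both the sensitivity and the specificity of a classifier, and it assumes a hypothetical population of 100,000 people in which 1 percent truly belong to the group being predicted. The function name `flagged_breakdown` is made up for this example.

```python
def flagged_breakdown(population, base_rate, sensitivity, specificity):
    """Expected outcome counts when a binary classifier screens a population."""
    actual_pos = population * base_rate          # people truly in the target group
    actual_neg = population - actual_pos         # everyone else
    true_pos = actual_pos * sensitivity          # correctly flagged
    false_neg = actual_pos - true_pos            # missed
    false_pos = actual_neg * (1.0 - specificity) # wrongly flagged
    precision = true_pos / (true_pos + false_pos)
    return {
        "true_pos": true_pos,
        "false_pos": false_pos,
        "false_neg": false_neg,
        "precision": precision,
    }

# Hypothetical inputs: 1% base rate, 89.5% sensitivity and specificity.
result = flagged_breakdown(
    population=100_000, base_rate=0.01, sensitivity=0.895, specificity=0.895
)
print(f"correctly flagged: {result['true_pos']:.0f}")
print(f"wrongly flagged:   {result['false_pos']:.0f}")
print(f"share of flagged people who are true positives: {result['precision']:.1%}")
```

Under these assumed numbers, about 895 people are correctly flagged and about 10,395 are wrongly flagged, so only around 8 percent of the people the model flags actually belong to the group. When the trait being predicted is rare, a model with high headline accuracy can still be wrong about most of the individuals it singles out.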
What should we be concerned about? I’m concerned about how pervasive physical tracking has become. Not necessarily the clicks or even the geographic location stuff, but the video cameras that capture our images everywhere. Now we know voice recordings can be used in kind of the same way — or for that matter whatever else can be captured that we can’t change. Could be our DNA. I suppose you could even imagine they could capture your breath and digitize it and do some sort of analysis on it.
Mike Delgado: I was reading some studies recently about how some recruiters are looking to interview candidates through webcams. There’s now technology available to help them determine how truthful someone’s being through the conversation, based on all these different factors. It is interesting. Someone may be totally truthful in the interview and trying to do their best, but for whatever reason there could be false positives. There could be some issues with the AI that’s determining they’re lying because their eyes are going in a certain direction, right?
Bill Vorhies: Exactly.
Mike Delgado: We just have to be very mindful of that, because that could hurt someone’s ability to get a job.
Bill Vorhies: Yes.
Mike Delgado: As you know, we have a community of data scientists and people who are interested in entering the field. Can you share some advice for those who are students in data science looking to pursue a career and just not sure where to start, what programming languages to learn? I’m curious about your advice for them.
Bill Vorhies: I have published a variety of career-oriented articles on DataScienceCentral.com. If you go there and use our search function to look for phrases like “so, you want to be a data scientist” or “midcareer switching into data science,” you’ll find a lot of deeper thought than I can give you here today, but a couple of things to keep in mind. I often get the question, “Am I going to be able to learn this OJT?” or “Am I going to be able to learn this via MOOC?” or “How do I learn data science in six weeks?” Well, the answer is probably none of the above.
You’re looking at somewhere between 12 and 24 months of concentrated study to reach the level of a competent entry-level data scientist. Now, MOOCs may be a help for some folks, but when you go to get a job, there is a huge amount of bias in favor of degrees from accredited universities. For the most part these days, that’s the master of science with a data science focus.
If you don’t have a bachelor’s degree, you’re not left out, because there are now bachelor’s programs that are teaching exactly the same skills at the bachelor’s level you would learn at the master’s level. What you need to watch out for here is that the course you take and the degree you seek need to say specifically data science and not something more generic like data analytics or computer science.
Once you’ve committed to a comprehensive curriculum, those instructors will take care of making sure you have all the basics, whether it’s Python or R, or SAS, or SPSS. But more important is how to prep the data, which techniques are most appropriate, how to formulate a data science question out of a business question, and how to identify and implement a data science algorithm to actually produce benefits for your company. Those are the really important things, quite beyond the issue of whether or not you can program in Python or R.
Mike Delgado: I like your advice because it goes beyond just learning the programming language. It’s about being somebody who’s an analyst but also someone who can develop those really smart questions. Being curious is crucial because there is an art to data science. Before we go, I have one last question. Do you have any advice for senior leaders looking to build a great data science team?
Bill Vorhies: We know that data scientists in general and great data scientists specifically are still in demand. There’s not enough supply to fulfill the demand. For example, what I see in organizations that have more than a half dozen or a dozen data scientists is an increased focus on efficiency and effectiveness. A lot of times, I’ll get a new graduate or someone who’s about to graduate in data science who comes up to me and talks about what they’ve done, and they go into great detail about their R code or their Python code.
I have to remind them that in the real business world, you don’t have an infinite amount of time to work on these projects. One of the things everybody needs to keep in mind, and I think this is particularly true of business leaders, is what’s rapidly emerging as automated machine learning. If you look at folks like DataRobot or PurePredictive (there are probably eight or nine of them now that have largely automated the front end of machine learning), what’s really going on here is that these are packages that work quite well, and they allow just a few data scientists to do the work that took a lot of data scientists in the past.
It gives you two things. First of all, it gives you the efficiency and effectiveness, but it also gives you a common platform. I think the other thing that senior leaders struggle with is if everybody is freelancing in R or Python, it’s pretty difficult to control for quality or even to debug. While I’m not in favor of fully automated ML, I think these high-efficiency platforms are indeed the way to the future.
Mike Delgado: Thank you, Bill. For viewers and listeners who want to be better informed on AI, deep learning, machine learning and also the ethical implications that go along with that, can you let everyone know about DataScienceCentral.com?
Bill Vorhies: Yes. We are the premier aggregator of all things data science and data engineering. It’s DataScienceCentral.com, no spaces. Membership is free. We have about 800,000 data scientists and data engineers who visit our site every month, which testifies to the quality of the content we’re able to put out there. It’s written content, and we also have quite an in-depth video and webinar resource that you can tap into, as well as resources for finding jobs or simply posing questions to the community. We try to be one-stop shopping for everything you need in data science.
Mike Delgado: Bill, you always have wonderful writers and great content. For the data scientist listening who may be interested in trying to write for DataScienceCentral.com, how can they apply?
Bill Vorhies: They can just drop me a note on the site or an email. You can find my email all over the site and all over the material I’ve written. It’s email@example.com. If you tell me that you’re interested in writing, I will send you a complete set of guidelines for doing that. Many of our members contribute material, so we are able to turn over 20 to 40 new high-quality articles a week. I believe that’s one of our greatest strengths, and if you email me I’ll send you the guidelines. We don’t ask for prior approval of topics. Your submission comes in and goes to our editorial board. We take a look at it, and if it’s of good quality, you have a chance to be featured on our site.
Mike Delgado: That’s awesome. Bill, it’s been wonderful talking with you. It’s been an honor. Thank you so much for sharing your insights with us on the ethical dilemma with AI and what we’re facing in the future.
I want to thank the viewers and also the listeners of the podcast. As always, thank you so much for being part of this and making our community better by your comments, and by sharing topics that are interesting to you. If you’d like to learn more about upcoming podcast episodes as well as upcoming videos or past videos, you can always go to experian.com/datatalk. We do these shows every week. We’ll see you next week. Bill, thank you again.
Bill Vorhies is the Editorial Director at Data Science Central and the President & Chief Data Scientist at Data-Magnum. He also serves as a Member of the Board of Directors at the American Institute of Big Data Professionals.
Make sure to follow Bill Vorhies’ blog on Data Science Central and check out his article about AI’s Ethical Dilemma.