Every week, we talk about important data and analytics topics with data science leaders from around the world on Facebook Live. You can subscribe to the DataTalk podcast on iTunes, Google Play, Stitcher, SoundCloud and Spotify.
This data science video and podcast series is part of Experian’s effort to help people understand how data-powered decisions can help organizations develop innovative solutions and drive more business.
To keep up with upcoming events, join our Data Science Community on Facebook or check out the archive of recent data science videos. To suggest future data science topics or guests, please contact Mike Delgado.
In this #DataTalk, we talked about data modeling techniques with Dr. Brandeis Marshall, Chair of Computer and Information Sciences at Spelman College.
Mike Delgado: Hello, and welcome to our weekly #DataTalk, a show where we feature data science leaders from around the world. Today we are honored to have Dr. Brandeis Marshall. She is the Chair of computer and information sciences at Spelman College. Before that, she served as professor of computer and information technology at Purdue University. We have these chats every week. If you’re watching live, we welcome your comments and questions. Today’s topic is data modeling techniques and strategies for getting the most out of your data. Dr. Marshall, it’s an honor to have you in our chat today.
Dr. Marshall: Thank you for having me, Mike. And please call me Brandeis.
Mike Delgado: Okay. For those listening to the podcast, if you want to read a transcription, watch the video or get any resources that Brandeis mentioned, the short URL is just ex.pn/marshall. That’s the URL you can go to.
So Brandeis, I always love to kick these things off with you telling us your journey, like what led you to begin studying data science and then eventually teaching it.
Dr. Marshall: It started in grad school. I took a database class, and I loved the order of the data and being able to construct some sense out of what is happening with all this information. Of course the information I was looking at was just text. But then I started getting interested in multimedia. I was starting to get interested in what’s happening with audio, and I love music. So what’s happening now with this blend of audio-visual type of data. So that’s where I really started on my journey.
After I finished graduate school, I was teaching, and I’ve been teaching databases for almost 10 years now to undergraduates and graduate students. So yeah, I just have a passion for data. I’m a data nerd, data geek. I like calling myself a data geek. But I just enjoy trying to figure out and problem solve how to make sense of data. How do we do it? What are the constructs that prohibit us from doing it? How you do it with messy data. How you do it with missing information. All that stuff energizes me and inspires me to learn more. Because it has not only an impact for an organization, but also an impact societally and socially. There are a lot of new data sets coming out, and it’s now being noted. There’s lack of representation for people of color, for women, for people with disabilities. And these issues need to be on the forefront of our minds when we are constructing, looking at and deep diving into this data.
Mike Delgado: Brandeis, you touched on some really important issues. So before we dig in to data modeling, which all goes back to what data you have, and you were just sharing that it starts with the data before you can model it. And if you don’t have accurate data or well-represented data, it’s going to mess up your entire decisioning, right?
Dr. Marshall: Right.
Mike Delgado: So speak to some of these challenges, these problems, you’re seeing in society right now.
Dr. Marshall: Well, I touched on them a little bit. Right now, I see there’s a lack of representation for women, for people of color. And then different socioeconomic types of challenges. Let me back up a little bit. Just when you get a data set, it typically is a CSV file, a ZIP file, and it might contain other data sets and sheets within them that are in Excel or CSV. You don’t even know what you’re looking at; you don’t know if the column names are correct. Do they represent what the information is supposed to be inside of there? Are you able to take that information and then put it into a system? That’s assuming you have the right format. So there’s all these initial issues when you receive some set of data. So modeling it then becomes an extra challenge because then you have to understand what you’re looking at and be able to interpret. So you’re problem solving before you even get to the problem. That’s the first issue.
Then once you’ve made certain assumptions and interpretations, you move on from there. That is where you get into this tricky situation of how to now make decisions based upon my assumptions. Assuming my assumptions are correct. How do I validate my own assumptions? And that’s me as an individual researcher, that’s me as in a group of researchers, that’s me with my students. How do I know what I’m looking at actually is correct, and how do I then move forward? It’s so rich with questions and challenges.
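A minimal sketch of that first look at an unfamiliar CSV file, using only Python's standard library. The sample data, the column names, and the `profile_csv` helper are hypothetical and are not taken from the interview. The idea is to check whether each column's contents match what its header suggests before any modeling begins.

```python
import csv
import io


def profile_csv(text):
    """Count, per column, how many cells are missing and how many parse as numbers."""
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames or []
    summary = {name: {"missing": 0, "numeric": 0, "total": 0} for name in columns}
    for row in reader:
        for name in columns:
            # Short rows yield None for absent cells, so normalize to an empty string.
            value = (row.get(name) or "").strip()
            stats = summary[name]
            stats["total"] += 1
            if value == "":
                stats["missing"] += 1
                continue
            try:
                float(value)
                stats["numeric"] += 1
            except ValueError:
                pass  # Present but not numeric, which is worth a closer look.
    return summary


# Hypothetical sample: the "age" header suggests numbers,
# but one cell holds text and another is blank.
SAMPLE = "name,age,city\nAda,36,Atlanta\nGrace,unknown,\nKatherine,,Hampton\n"

report = profile_csv(SAMPLE)
for column, stats in report.items():
    print(column, stats)
```

A column where `numeric` plus `missing` is less than `total`, despite a numeric-sounding header, is one of the assumptions to validate before moving on.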
Mike Delgado: And that gets into the ethics right there, right? Because what biases we bring to the data set will possibly just ruin whatever it is that … The data might be speaking one thing, but because you come to the table with a certain bias, a certain assumption is now going to twist it to tell your own story.
Dr. Marshall: Exactly. Because you can make data say anything. Even if it’s modeled correctly. Let’s say that it is accurate. You have the right column names, the contents of all your data are correct, but you can manipulate that data. You can make a positive seem negative; you can have a negative seem to be a positive with enough finesse. There’s definitely this notion of how you critically think through the data, and then what your assumptions are about what that information coming out is really going to tell you. What story are you trying to arrive at, and are you manipulating the data so you arrive at the story you anticipate? Or are you letting the data tell you the truth? That’s really the ethics part of it. And you would think data modeling is very easy, it’s very structured. Okay, we have relational databases, we have object-oriented, we have time series type of databases. But at the core of it, it’s someone, a designer, who is trying to now put some parameters or some guardrails around how this data is being placed into a system. Whether that system is PHP or JSON, or something as large as Bigtable, or anything that’s like Apache Spark.
So no matter what, there’s always this human intervention when it comes to data modeling. There’s just human intervention when it comes to interpreting the data. And then what are those outcomes and results that you hope to achieve, and then what is the surprising outcome of results as well? And what do you do with that?
Mike Delgado: So, Brandeis, as you’ve been teaching for over 10 years, training upcoming statisticians and data scientists, how do you help them with this challenge of coming to a data set objectively and trying to put away assumptions? How do you train your students or teach them this process?
Dr. Marshall: Well, the first part is always collecting all the data. So there are two different veins. You can be collecting data or you can be using secondary or tertiary type data. I try to have students do their own data collection so they understand the challenges of what decisions they need to be making and what assumptions they’re having. So that’s the first part. They collect their own data before I move them into, “Let’s use someone else’s data and see what happens.”
The other thing I do is give them a lot of case studies. There are a lot of books available. I happen to use Hoffer’s textbook, Modern Database Management. It’s very well-known within the database community. But they have a lot of large problems; they’re called field exercises. Those tend to work very well. And so I try to piece and parse those out. As well as talking with the students about these problems. I’ll provide an example and then start working through it, because one of the challenges in training up individuals in this space is you have to know what the problem is. So that means you really have to read critically, and you have to make sure you have some type of advanced reading comprehension. So, what are you reading, how are you interpreting what’s being read, and then how are you going to represent that?
When they have this data set, they now know they don’t know everything, and that’s difficult for a lot of novice people to realize and understand. But at the end of the day, they then appreciate it. I always have students in my database class do some sort of semester-long project where they’re either constructing their own database, a system, using some company that already exists, or they’re modifying someone else’s version of a company’s database structure. That way they start to see if these business rules are being implemented correctly, what type of business rules would make more sense, what business rules can be implemented in a database, what business rules need to be done through queries. And they start to piece out how they need to think through the data model and what the limitations of it are.
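The split between business rules that live in the database and rules that have to be checked through queries can be sketched with Python's built-in `sqlite3` module. The `orders` table, the "quantity must be positive" rule, and the per-customer limit of 10 are all hypothetical examples, not rules from the course.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Rule 1 is declared in the schema: a quantity must be positive.
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL,
        quantity INTEGER NOT NULL CHECK (quantity > 0)
    )
""")
conn.execute("INSERT INTO orders (customer, quantity) VALUES ('acme', 5)")
conn.execute("INSERT INTO orders (customer, quantity) VALUES ('acme', 7)")
conn.execute("INSERT INTO orders (customer, quantity) VALUES ('globex', 2)")

# The database itself refuses a row that breaks rule 1.
try:
    conn.execute("INSERT INTO orders (customer, quantity) VALUES ('acme', 0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Rule 2 spans several rows (a per-customer total), so a column constraint
# cannot express it. Here it is checked with a query instead.
LIMIT = 10  # hypothetical threshold
over_limit = conn.execute(
    """
    SELECT customer, SUM(quantity)
    FROM orders
    GROUP BY customer
    HAVING SUM(quantity) > ?
    ORDER BY customer
    """,
    (LIMIT,),
).fetchall()

print("zero-quantity row rejected:", rejected)
print("customers over the limit:", over_limit)
```

Single-row rules fit naturally into the schema, while rules about groups of rows tend to end up in queries or application code. That is one of the data model limitations students run into.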
So hopefully when they get to their industry job or they move on to graduate studies, they then can see data in … See it for what it is. And it’s a hot mess most of the time. But at the very least they have the resources in order to tackle the hot mess and put some constructs around it to make it manageable for them. And that’s what I think every company’s trying to do within data science is trying to manage the crazy, trying to manage the chaos of their data. They don’t know what it is, where they’ve gotten it, how it all fits together. But that is what makes data science beautiful, because everyone can come in and have an opinion and have justifications that they can work toward those interpretations to make sense for the company.
Mike Delgado: I love it. It is a hot mess. Because you think about all the data and every year companies are getting more and more data, whether they’re purchasing it or just collecting it naturally. And I’ve heard this example from another broadcast where someone was saying it’s kind of like those Where’s Waldo? books, where you open up the pages and there’s all these people and you’re like, “Where is Waldo?” Where is the data you actually need that’s going to actually be important to the business? So yeah, it is a hot mess.
Dr. Marshall: It really is. There’s no other term. I know it’s so technical, but it really … Okay, there is a technical term for it called messy data but …
Mike Delgado: I like hot mess better.
Dr. Marshall: But hot mess is really what I see. With my students, we’ve been working with Black Twitter. As you know, there has been a lot of conversation about the Oscars and the representation of people of color being nominated and then being awarded Oscars. We started this work in 2016, when there was no representation of black people, to 2017, when there was a lot of representation, to 2018, where it was very much in the spirit of the #MeToo movement. That intersectionality between gender and race was a conversation. So my students have been gathering Twitter data during the Oscars.
Mike Delgado: Oh, that’s awesome.
Dr. Marshall: And that has been an experience for them, and for me, that has been wonderful. Because they now have a reason and a purpose to collect information. They have a reason and a purpose to try to figure out what the trends are, what words are being spoken, who is coming up in the Twitter feeds, what countries, what cities, who the influencers in this area are. Having them really take a forefront and be the leaders in spearheading and pioneering that type of work is awesome, especially because they’re undergraduates. That doesn’t happen very often, but they are on the forefront of that. And I’m just glad I thought of the idea, but they have been able to roll with it and take all of the “I don’t know what this data’s telling me. I don’t know what this graph is telling me.” And then we talk through it. We don’t need all these different columns, and we’re parsing through it. “Oh, there’s so much data, how do we harness this data? How do we put it into Python?” “How do we use pandas effectively?” These are great questions. Things that everyone in the data science world needs to know.
Mike Delgado: What a wonderful way to get your students involved and get them passionate and get them curious about different ways of working with the data, especially with the political climate we’re in, the racism that’s going on, and then having your students get involved with actually looking at the Twitter data, looking for racism, looking for maybe even predictions about who might win the Oscar. That’s awesome you’re getting them involved. And I think it’s amazing that you get your students with this huge project to actually build their own data set. What a massive project that must be, and what a great learning experience for them.
Dr. Marshall: Yeah. I really am enjoying it. They are really scraping the Twitter data themselves; they’re creating their own data sets and then having to parse through those and try to understand those. But that’s what it takes in data science. It’s practice, right? So your original question was, “What do you do?” It’s a lot of practice at the core of it. You have to get your hands dirty, roll up your sleeves, and you have to fail. You have to get to a place where you’re like, “I don’t know what I’m doing, but let me try to figure it out.” And then you figure one thing out, and you might take two steps back. But that is what it takes within data science, because the data itself doesn’t provide any valid quality information. That’s up to the humans to do. And so you have to continue to press forward in that persistence. There’s a lot of language around grit. So I think that’s what it takes within data science is this grit to continue to persevere, to have a passion for trying to understand and provide quality information about what is being housed in these black boxes that we are calling data sets. Right?
Mike Delgado: Yeah. That’s awesome. And I love that you used the word grit because when you have a huge data set and you don’t know what you’re looking at, you’ve got to be gritty, you’ve got to be curious to start digging through to figure out what the labels should be. Are the labels accurate? What am I looking for? And then also if you’re working for a business, what is the business goal, what is the challenge, the pain point, right?
Dr. Marshall: Yeah.
Mike Delgado: They are trying to solve …
Dr. Marshall: And sometimes you don’t know that in a business. In a business you just know that your ROI is too high or too low, right? Like one department has too many applicants, another department doesn’t have enough applicants. How do we balance this out? Sometimes you don’t know what it is, but that discovery and that curiosity you’re mentioning is very much on point — very, very much on point.
Mike Delgado: So as you’re teaching these classes and helping these young professionals work with data, how do you guide them for which models they should be using? And maybe even before we get to that, what are some of the common database models you encourage them to look at?
Dr. Marshall: I encourage them always to start with relational databases. And that is the most structured data model that’s out there. And the reason why I focus on that fundamental, not just because that’s where I entered into the data world, but also because a lot of organizations still use relational databases. So you want to be relevant and you want to make sure you’re able to enter a company and be able to understand their model.
So relational databases that include a MySQL, a Postgres, or an Oracle if you’re at a very large organization. So that’s where we start. Then there are other models they can easily translate their knowledge of relational databases to. So that could be NoSQL and Cassandra. There could be more distributed types of databases; then you can move into big data databases like those housed within Spark and Hadoop and things like that. So you can look at how you’re going to understand what’s important in your data. If you care about the timing, then you might want to do maybe a NoSQL Cassandra. It has its pros and cons.
But if you want to look at more structured, then you might want to do a relational. If you care more about web interfacing, then you definitely want to do a JSON. There’s Node.js and other versions that are aligned with Java-type libraries. But you have to have an idea of how you’re going to use the data to then decide what type of database you want to house it in. And it might be a combination. You might be storing information inside a relational database as an archival-type system. But then when you do the processing, you’re using JSON. That happens all the time. Because you have to translate the data. That’s all you’re doing. Eighty percent of the work is really in cleaning the data. That does sometimes mean it’s moving it from one system to another system construct when it comes to the data model. That’s where I start the students, and that’s the conversation I have with them.
Mike Delgado: And that’s the #hotmess, cleaning up that data.
Dr. Marshall: Right. Because it’s interesting in how students will, and even myself, I get into this problem of assuming, “Oh, once I have it inside of a database, whatever that database is, I can easily translate it into JSON.” Why do I make that assumption every time? I’ve been in this field too long. Why do I make that assumption? Because not everything is a one-to-one matching. So you always are learning something new, because there’s always a new update, there’s always a new method that’s being shared online via Twitter, via Facebook and different data communities. So you’re like, “Oh, that’s deprecated. All right, I can’t use that method. We’re going to use something else. Oh, that library is no longer the trending hot thing to use anymore. It has these x, y, z limitations. I need to move over to another suite of libraries.” That is always that learning objective you need to have, that curiosity and that grit I spoke to earlier.
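One way to see why moving from a relational database to JSON is not always a one-to-one matching is the sketch below, which assumes a hypothetical two-table schema (`authors` and `tweets`) in an in-memory SQLite database. A join returns one flat row per tweet, so the rows have to be regrouped into nested documents, and an author with no tweets needs explicit handling.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE tweets (
        id INTEGER PRIMARY KEY,
        author_id INTEGER NOT NULL REFERENCES authors(id),
        body TEXT NOT NULL
    );
    INSERT INTO authors VALUES (1, 'ada'), (2, 'grace');
    INSERT INTO tweets VALUES (10, 1, 'first'), (11, 1, 'second');
""")


def authors_as_documents(connection):
    """Regroup flat join rows into one nested document per author."""
    rows = connection.execute("""
        SELECT a.id, a.name, t.id, t.body
        FROM authors AS a
        LEFT JOIN tweets AS t ON t.author_id = a.id
        ORDER BY a.id, t.id
    """)
    documents = {}
    for author_id, name, tweet_id, body in rows:
        doc = documents.setdefault(
            author_id, {"id": author_id, "name": name, "tweets": []}
        )
        # The LEFT JOIN yields NULLs for authors without tweets; skip those.
        if tweet_id is not None:
            doc["tweets"].append({"id": tweet_id, "body": body})
    return list(documents.values())


docs = authors_as_documents(conn)
print(json.dumps(docs, indent=2))
```

The author repeated across two join rows becomes a single JSON object with a list of tweets. That regrouping step is the part that is easy to assume away.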
Mike Delgado: I love that. Now what are your recommendations for professionals who are looking at a data set that is all unstructured data, maybe even like Twitter data that has video, pictures? Where would you begin with choosing a model?
Dr. Marshall: Wow. So the first thing I probably would do, and I have done this, is I ignore all the emojis and all the video and audio information. And I do that …
Mike Delgado: No hands? No hand emojis?
Dr. Marshall: No hands, no thumbs up. We’re getting rid of all of those, because those are all being translated into text as computer language.
Mike Delgado: I’m crying now. I’m crying.
Dr. Marshall: Yeah, you’re crying. So I remove all of those, only because it does tend to cloud the judgment. So what I do want to keep in the data set are the URLs. That is very important. The emojis not so much right now. Even though more and more people are just speaking in emojis, which is hurting my heart a little bit because I do like words. I am an educator. So it does vex my spirit a little bit that we’re only speaking in emojis.
Mike Delgado: 100.
Dr. Marshall: But I do want to use natural language processing techniques. And there are so many that are readily available. I think that’s the first step is just trying to remove all the impediments or blockades to you actually getting to valuable content. So text first, and then you can build upon there. Where do you start outside of that? I am a Python person. There are basically two camps in data science: you’re either Python or you’re R. That’s about 90 percent of it. I’m a Python person, so I would say Python, pandas, you’re going to use NumPy, matplotlib as far as those packages in order to create reusable code.
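A rough sketch of that "text first" cleanup in plain Python: strip emojis, keep the URLs. The `split_tweet` helper and the sample tweet are hypothetical, and the emoji pattern is an approximation that covers common Unicode emoji blocks rather than every emoji. A URL followed directly by an emoji with no space between them would also need extra handling.

```python
import re

URL_RE = re.compile(r"https?://\S+")

# Approximate emoji coverage; not an exhaustive list of emoji code points.
EMOJI_RE = re.compile(
    "["
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F\u200D"           # variation selector and zero-width joiner
    "]+"
)


def split_tweet(text):
    """Return (text without emojis, list of URLs found in the original text)."""
    urls = URL_RE.findall(text)
    without_emoji = EMOJI_RE.sub("", text)
    cleaned = " ".join(without_emoji.split())  # collapse leftover whitespace
    return cleaned, urls


# Hypothetical tweet containing a thumbs-up and a crying-face emoji.
sample = "Best Picture goes to... \U0001F44D\U0001F62D https://example.com/oscars #Oscars"
cleaned, urls = split_tweet(sample)
print(cleaned)
print(urls)
```

The cleaned text can then be handed to whichever natural language processing toolkit is being used, with the URLs kept alongside it as a separate field.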
Now let’s say you’re not a coder. You haven’t looked at code, you don’t want to deal with being a designer, you want to really be using tools. I would say bringing it into an established tool like Tableau would be your entry point. But if you’re curious about where you might land on that spectrum, there are different resources you can try. Berkeley has a course called Foundations of Data Science. It is available. Most of the materials are available so you can start working through some of these techniques in understanding linear regression, you get a little bit of Python introduction. There are other tools in other companies, nonprofits such as Data Carpentry. They provide online lessons you can download, and they help set you up. And they move forward. They even provide instructors if you want to create workshops, one-, two-day workshops.
I’ve also used a relatively new organization called DataCamp. That’s something a lot of professionals and companies like to use to help scaffold their employees who need to work with data and are trying to hone their data skills. There is a lot out there outside of just Coursera and free courses. But it’s a landscape that is having more and more companies and organizations trying to figure out how to teach and how to make sure that there is retention in these data skills. But those are just a few.
Mike Delgado: Wonderful. We just got a question on Facebook from Joan, who says, “What do you think about —”. Is that KNIME?
Dr. Marshall: Yeah, KNIME.
Mike Delgado: Have you heard of that?
Dr. Marshall: Yes, that is a wonderful platform. That is really meant for someone who has some understanding of computer programming, algorithmic design more specifically, who is quite familiar with how to deal with code. So yes, I think that’s a wonderful platform.
Mike Delgado: Well, I can’t believe we’re almost out of time. I do have a couple last questions for you. So the first one …
Dr. Marshall: Oh, no. It’s only half an hour.
Mike Delgado: I know. Do you have any final tips for those who are just starting out with data modeling, just some tips, suggestions to help them?
Dr. Marshall: I think it’s really about your mentality when you are approaching data modeling. It’s a lot of not being frustrated. Everyone goes through this. Resources are plentiful. There are a lot of database books in order to help you get some understanding, and I can share those with you afterward. I mentioned the Hoffer textbook. That’s just one. But there are several other ones that are pretty much staples in the database world and hence the data science world. And one day at a time. Just try to learn one thing at a time. And practice. Last thing is going to be practice, practice, practice. Practice, practice, practice. Collaboration. Collaboration is necessary. So if you do have accountability buddies, more than one, I think that is awesome because you can now talk through some of the challenges. Because that’s what you do when you are constructing a data model. You have to talk to the client multiple times, and that client sometimes is you, sometimes it is someone in the organization. But you need to collaborate. A big misnomer about computing in general, but also data science, is that you have to talk to people multiple times to really fine-tune and hone what they’re looking for and why they’re looking at this data a certain way and that interpretation to really help move that conversation forward.
Mike Delgado: There are a lot of people in our community who are in graduate school; they want to become a data scientist. Or some are just starting college and are like, “I think data science is the path for me.” What is your advice for them on the whole process of getting started with their career?
Dr. Marshall: I think there are a number of things you can do. It’s never hopeless. Even if you’re at an organization or a school that does not have a data science program. There are courses that you can take and learning outside the classroom that you can participate in. So let me give a couple of examples.
If you’re at a university, institution of higher ed, statistics courses, computer science programming courses are always helpful in this conversation. That’s the fundamentals, because data science really is a blend of mathematics, statistics, computer science. And then there is the discipline-specific science or field that you are completely in. So if you have at least the mathematics and the statistics and the computer science courses under your belt, that’s an introduction to programming, that is statistics, that is precal, calculus: those are all helping to build your logic reasoning gates in order to now be able to enter into this data world informed and able to make the proper interpretations. Or at least be able to ask the right questions to get you to the end result.
That’s the main advice I have for any college students or graduate students who want to enter this space. And then it’s a matter of doing some research about what type of programs exist. There are a lot of programs coming out now that are blending data science with every discipline. With economics, business analytics it tends to be called, or with journalism. That’s where you have digital media and digital humanities. There’s blending data science with English because of all this transcription. You mentioned you’re going to be transcribing it.
Mike Delgado: That’s right.
Dr. Marshall: So there you go. There’s a whole field. And how do you now parse through this set of data, which happens to be all these words that are being said over this past half-hour, and how do you work with that set of information? Then it’s a matter of trying to understand where your passions might lie as you work within this data science field, which is incredibly broad. So be patient, be persistent, and you’ll be just fine.
Mike Delgado: I am loving this interview. I am so sad it has to end. I have just one last question, and I promise I’ll let you go.
Dr. Marshall: This happens all the time. All students say, “I have one question for you,” and I go, “No, you have more than one.” It’s okay. It’s completely fine.
Mike Delgado: I need to have you back …
Dr. Marshall: I don’t need to eat lunch at all. It’s fine. There’s no worries.
Mike Delgado: Yeah. Advice for data science leaders who are listening in right now who are building up their data science teams, they’re looking in to bringing in the best candidates. What is your advice for the interviewing process, questions to ask, what to be looking for to bring in a good variety of people who can help work on data?
Dr. Marshall: I think this goes back to what most companies want, which is people who can communicate effectively. That’s the number one talent that everyone is trying to acquire. It’s not just about what they know; it’s about whether or not they’re able to communicate. Are they able to collaborate? So when it comes to interviewing, it’s all about the conversation, it’s all about what type of problems you are giving that allow the candidate to expand and express what they know. And is what they’re expressing something that you as a company interviewer can work with? And then how does that interview let you know of their potential? Because in data science, there’s no one who’s going to come in ready to do exactly what you want. Let’s push that to the side. Let’s really talk about how you can grow the data skills within each of your employees. And are those individuals you’re hiring amenable to that growth and alignment with the company’s goals and objectives?
Mike Delgado: I love that. I’m going to leave that right there. And Brandeis, I’ve got to have you back at some point because at the beginning of this conversation you were talking about music and I go, “Oh, I want to have a data science chat about music.”
Dr. Marshall: Yes.
Mike Delgado: So I’ve got to have you back.
Dr. Marshall: Most definitely.
Mike Delgado: For those who joined late, this is Dr. Brandeis Marshall. She is the Chair of computer and information sciences at Spelman College. If you want to get links to her LinkedIn profile so you can follow her to her website, you can go to the Experian blog, where there will be a full transcription at the end of this week of today’s episode. The short URL is just ex.pn/marshall, M-A-R-S-H-A-L-L. And if you are new to the #DataTalk show, we have a whole list or whole archive of all past shows over at ex.pn/datatalk. We have about 27 live interviews now that are recorded there. That’s where you can go to get that. Dr. Marshall, thank you again so much for your time and sharing your insights. I love talking with you. And thank you for sharing your insights with our community.
Dr. Marshall: Oh, thank you. This has been fantastic. And I look forward to doing this again.
Mike Delgado: Awesome. Can’t wait. Okay, everybody, thanks for joining us. We’ll see you all next week.
Dr. Marshall: All right. Bye-bye.
Dr. Brandeis Marshall earned her Bachelor of Science degree in Computer Science from the University of Rochester and her Master of Science degree and PhD in Computer Science from Rensselaer Polytechnic Institute. She previously served as a professor in Computer and Information Technology at Purdue University and is now the Chair of Computer and Information Sciences at Spelman College.