Every week, we talk about important data and analytics topics with data science leaders from around the world on Facebook Live. You can subscribe to the DataTalk podcast on iTunes, Google Play, Stitcher, SoundCloud and Spotify.
This data science video and podcast series is part of Experian’s effort to help people understand how data-powered decisions can help organizations develop innovative solutions and drive more business.
To keep up with upcoming events, join our Data Science Community on Facebook or check out the archive of recent data science videos. To suggest future data science topics or guests, please contact Mike Delgado.
In this week’s #DataTalk, we talked with Dr. Victor Pankratius about how he is using machine learning to find distant planets. Dr. Pankratius leads the Data Science for Astro & Geoinformatics group at MIT.
Here’s a full transcript:
Mike: Hello, and welcome to Experian’s weekly #DataTalk, a show featuring data science leaders from around the world. Today, friends, is a unique topic, something we’ve never covered before. We’re talking about how scientists are using machine learning to find planets. This is super fascinating, and we’re super excited to talk with Victor Pankratius, who is the principal research scientist at MIT’s Haystack Observatory.
Also, I want to give a shout-out to MIT’s Department of Earth, Atmospheric and Planetary Sciences for helping set up today’s chat in the Earth Resource Lab at MIT. Victor, thank you so much for being our guest today.
Victor: Thank you so much for having me.
Mike: Do you want to give a shout-out to the team who helped set up today’s event?
Victor: Yes. Thanks very much to Josh Castle here at the Earth Resource Laboratory at MIT. I’m very happy to be hosted here, and it’s very interesting to reach out and learn new things. Maybe the next podcast will be about machine learning for finding natural hazards.
Mike: Very cool. I love that. That’ll be another great topic. We have an amazing group of data scientists in our community on Facebook, and they’re always curious about what led our guests to become scientists and start working in the sciences. Can you share your path with them?
Victor: Absolutely. I’m a computer scientist by training, and I have a background in parallel computing and software engineering. I came to data science based on the big data challenges that I saw. I started to talk to different scientists and they said, “We have all this big data coming up,” and of course everybody asked, “How big is this data?” We’re starting to move toward petabytes per second of data, for example in radio astronomy with the Square Kilometre Array. That’s gonna happen over the next decade. I’m like, “This is really fascinating.” Then, of course, you start asking these questions: How are we gonna process this data? How are we gonna generate deeper insight?
I realized early on that there’s a need for computer-aided discovery, and that’s not just in one field. It’s across the sciences. While we’re at it, it’s also relevant for business and industry applications, because they’re also facing much higher data rates these days. This is a very fascinating topic where we can develop new algorithms, new directions, and work with real data. This is not artificial data; this is real data. And while we’re making progress in computer science, we can also make new discoveries. So, this interaction I found highly fascinating.
Mike: That’s awesome. Tell me about the data, because you mentioned petabytes per second, so much data is coming in, you’re collecting and using. Can you talk a little bit about what this data looks like?
Victor: Sure. The petabytes per second, just to make it clear, this is gonna happen over the next decade.
Of course, if you add together everything that we would get from sensor networks, satellite images, and so on, maybe you could look at it that way from a data fusion perspective. But realize we are collecting very different types of data. You have images on the one hand that come from satellites, telescopes and so on, but we also have sensor networks that collect time series of information, be it on Earth or making observations looking into space.
This is yet another challenge for data scientists: how to cope with these different types of data. Then once you get these images — suppose you’re looking into the universe looking at the sky — then you have billions and billions of potential candidates for interesting phenomena. That’s only the starting point for the actual science.
Mike: Wow. So this data that you’re using, sensor data, you’re getting so many different types of data that’s coming in. What’s the process for even beginning to label the data?
Victor: There are very different types of approaches we have to consider in this context. If you’re looking at AI and classical machine learning, and particularly supervised learning, then you would think about how to label this data. But as you can imagine, that’s a challenge, so we have to think about alternative ways. One thing we did as a starting point, in a specific study, was to leverage crowdsourced data for machine learning.
This is a recent publication we just got out on the exoplanet and debris disk topic, because how do you train these machine learning algorithms if you have no labeled data? NASA had a project called Disk Detective where they involved humans in labeling images. Volunteers were shown images of objects that, if they are candidates for planets or debris disks, have to show certain features that you can identify visually.
Then people were asked to answer specific questions on categories of what they would see on certain images, and that constituted an early approach to try to find different kinds of things. In this case, it was related to planet search. We were looking at this and said, “This is very interesting because now we can use this as labeled data to train machine learning algorithms and scale beyond what crowdsourcing can do.”
Mike: The crowdsourcing part … Were there any challenges with getting that set up? Sometimes people make mistakes. How did you handle that?
Victor: This is actually an approach where we leveraged what NASA’s teams had already done. NASA had the Disk Detective project by Marc Kuchner. They set up basically this whole experiment carefully thinking about what kind of categories do we want, what kind of data do we want to look at, do we want to look in infrared or optical, or what wavelengths do we need to see the things we’re looking for?
They set this experiment up and, of course, it is not a perfect thing when you have humans involved. You can have false positives or false negatives. Or you can have things where something goes wrong, but you can have other people in this crowdsourcing process check on other people’s results. You can get some kind of validation without having to manually reprocess everything yourself. But, yes, it has pros and cons in doing it this way, but this is where we are right now in terms of what’s being tried out for this direction.
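To illustrate the cross-checking idea described above, here is a minimal Python sketch of majority voting over volunteer labels. It is not the Disk Detective pipeline; the function name `aggregate_votes`, the `min_agreement` threshold, and the sample votes are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def aggregate_votes(votes, min_agreement=0.6):
    """Return (label, agreement) for one image.

    The label is None when the most common answer does not reach the
    (assumed) agreement threshold, so the image can be sent for review.
    """
    if not votes:
        return None, 0.0
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    return (label if agreement >= min_agreement else None), agreement


# Hypothetical volunteer answers for three images.
crowd_votes = {
    "img_001": ["disk", "disk", "disk", "not_disk"],
    "img_002": ["disk", "not_disk"],
    "img_003": ["not_disk", "not_disk", "not_disk"],
}

consensus = {image: aggregate_votes(v) for image, v in crowd_votes.items()}
```

Images where volunteers split evenly, such as `img_002` here, end up without a label, which is one simple way to keep uncertain cases out of a training set.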
Mike: Can you talk a little bit about how you are training artificial intelligence to distinguish planets from the other astronomical objects out there?
Victor: Sure. If you’re thinking of this problem in a general sense, it’s a classification problem. There’s an image and there are sets of pixels in this image, and you need to decide whether a certain set of pixels or an object is a planet, yes or no. It’s a binary classification. We’ve tried out different techniques that require labeled training sets, and that’s how we leverage the crowdsourcing part from the NASA project.
The other thing is that there are different techniques for finding planets. If you’re thinking about this very generally, you know, how would you see a planet with your eye if you were able to? You would look at a star, and suppose there’s only one planet orbiting the star. Then there’s a method called the transit method that looks at the brightness of the star over time, and if a planet passes between you and the star, along your line of sight, you would see a slight dip in brightness.
So ideally you get a time series that is essentially flat with a small dip in it. But as you can imagine, in the real world, there’s a lot of noise. There are various things that produce this pattern that are not planets. Stars themselves can pulsate, as variable stars do. So it’s very difficult to distinguish.
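The transit idea above can be sketched in a few lines of Python. This is a toy illustration on simulated data, not the method used in the research discussed here: the light curve, the noise level, the dip depth, and the threshold of five robust standard deviations are all assumptions.

```python
import random
import statistics


def simulate_light_curve(n_points=200, transit_start=90, transit_end=110,
                         depth=0.01, noise=0.001, seed=0):
    """Simulate normalized stellar brightness with one box-shaped transit."""
    rng = random.Random(seed)
    flux = []
    for i in range(n_points):
        value = 1.0 + rng.gauss(0.0, noise)   # star brightness plus noise
        if transit_start <= i < transit_end:
            value -= depth                    # planet blocks a little light
        flux.append(value)
    return flux


def find_dip(flux, n_sigma=5.0):
    """Return indices far below the median, using a robust noise estimate."""
    median = statistics.median(flux)
    mad = statistics.median([abs(f - median) for f in flux])
    sigma = 1.4826 * mad                      # MAD scaled to a Gaussian sigma
    return [i for i, f in enumerate(flux) if f < median - n_sigma * sigma]


flux = simulate_light_curve()
dip = find_dip(flux)
```

On real data a simple threshold like this would also flag variable stars and instrument glitches, which is exactly the difficulty described above.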
The other way to do this is to use indirect methods. One in particular we explored in a paper recently published in Astronomy and Computing with our students and co-authored by Professor Seager, who works in exoplanet research. We were looking at debris disks. These debris disks are essentially indicators of planet formation because, if you look at how planets are formed, they typically form where there are debris disks.
That’s the hypothesis, and there have been recent discoveries of exoplanets that have debris disks. Then you could say, “Let’s look for debris disks because they’re bigger and they also show up in survey data that we already have.” For example, in the infrared survey from the WISE satellite, which is run by NASA JPL. This is how you can approach the problem: looking at different things in different data sets and then trying to find indications that help a classifier decide whether something can be a planet or not.
What we did is we took the WISE survey data and trained a classifier using the crowdsourced labeled data from NASA, and then we found candidates for debris disks. We validated that against the NASA Exoplanet Archive, which contains a list of all the known exoplanets so far. That’s how you can tell how good you are if you’re trying out a new method. It seemed to work pretty well. So now we have this technique that allows you to say, “I found candidates for debris disks,” and then potentially that is an indicator where you can look for more planets.
This is a piece in this chain of missions that we have running right now, missions that are planned for the future, because it helps you determine where you want to look. And then with follow-up missions, you can take a closer look at all these debris disks and try to see if there’s real exoplanets or not. This is an endeavor that is going to take a longer time, and we’re doing piece by piece these little techniques that can help us get there.
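The train-then-validate workflow described above can be sketched with a very small binary classifier. The example below uses a nearest-centroid rule purely as a stand-in; the interview does not say which classifier was used, and the two features and all numbers are hypothetical.

```python
import math


def centroid(rows):
    """Mean feature vector of a list of equally long feature tuples."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def train(features, labels):
    """Compute one centroid per class from labeled feature vectors."""
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)
    return {y: centroid(rows) for y, rows in by_class.items()}


def predict(model, x):
    """Assign x to the class whose centroid is closest."""
    return min(model, key=lambda y: math.dist(model[y], x))


# Hypothetical features per object: (infrared excess, roundness).
# Label 1 = good debris disk candidate, 0 = bad candidate.
train_x = [(0.9, 0.8), (0.8, 0.9), (0.85, 0.7),
           (0.1, 0.3), (0.2, 0.1), (0.15, 0.4)]
train_y = [1, 1, 1, 0, 0, 0]
model = train(train_x, train_y)

# Held-out objects with known labels play the role of the validation set.
held_out_x = [(0.7, 0.9), (0.3, 0.2)]
held_out_y = [1, 0]
predictions = [predict(model, x) for x in held_out_x]
accuracy = sum(p == t for p, t in zip(predictions, held_out_y)) / len(held_out_y)
```

The important part is the split: labels from the crowd are used only for training, and an independent catalog of known objects is used to check the result.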
Mike: Through this process, how many … I don’t even know if this is the right question to ask, but as you are looking for a debris disk to help with this process of finding planets, how many potential planets do you think you have found with the data that you used?
Victor: That is hard to say because right now we have to validate with the sets and the planets that we know. I can refer you to our paper in Astronomy and Computing that contains this data. You have to look at this in a very specific way because also the training set that we used was preselected. So potentially, if you were to apply this to the entire universe, there might be a bias just through this preselection and the small number of stars.
For example, I can outline some of the numbers that we had because that’s the data we had. It’s 114 stars with locations in the Southern Hemisphere, and those were determined to be good debris disk candidates by Disk Detective users. Then, we had 13 from the literature that other researchers deemed to be promising candidates. It’s a total of 127, and then we had a sample of 138 bad candidates. These are the kind of numbers we are talking about.
Then, for validation, we had the NASA Exoplanet Archive with over 2,000 known planet host stars. This is how we validated it, and so far we got a total accuracy score of 0.97 on these specific data sets. But I want to caution you a little bit about these numbers because the data sets are highly selective. The techniques that we developed can now be used to generate a list of candidates for future missions, one that carries a little more information, based on the training set we have, about what we believe could be the real candidates.
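For readers who want to see how an accuracy score like the one above is computed, here is a short Python sketch. The class totals (114 + 13 good, 138 bad) come from the conversation; the split into correct and incorrect predictions is invented, chosen only so the arithmetic lands near the quoted 0.97, and it is not a result from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


good_candidates = 114 + 13   # Disk Detective picks plus literature picks
bad_candidates = 138

# Invented confusion counts, for illustration only.
tp, fn = 124, 3              # good candidates classified right / wrong
tn, fp = 133, 5              # bad candidates classified right / wrong

metrics = classification_metrics(tp, fp, tn, fn)
```

With a sample this small, a handful of misclassified stars moves the score noticeably, which is one reason for the caution about these numbers.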
We’re generating a list of targets and where to look with future telescopes. This is because you need more information to give you a more precise answer. Right now, just the sample size that we have is not very big, but that’s gonna change.
Mike: It sounds like you have an awesome start because you’re collecting and finding places where planets may be, and it’s really cool. This is the first big step in that process. Now one of the things that makes the headlines all the time is habitable planets. I think I saw a while ago there was … They said it’s another Earth-like planet many, many light years out there. I know that this is not part of your research, but can you talk about those people who are trying to find habitable life on other planets?
Victor: Sure. I can outline for you my perspective as a computer scientist on how you can approach this and potentially automate these things. This is how computer scientists think. Typically, what you’re looking for is what makes a planet habitable, and that’s typically several criteria.
So far the community has converged on looking at so-called habitable zones, which means the planet is not too far away from and not too close to its star, the temperature is okay, and everything that could be useful for life as we know it on Earth is within the known parameters. These criteria have to apply to the star as well. If you have stellar flares, which are the counterpart of solar flares on another star, and they’re so big that they could wipe out life on a nearby planet, that’s another criterion you want to factor in.
That’s why you need all these observations, not only about planets, but also about stars, in understanding how they develop over time. That’s why, from a data science perspective, it’s not just a snapshot you’re looking at. You have to look at all these things over time.
So, yeah, the key question is, are you in the habitable zone or not? That’s difficult to answer because there’s so many parameters. For most of the planets that have been discovered so far, we don’t know all the parameters. We know maybe the mass or we know the orbital parameters and then we can infer potentially what can be there or not, but there’s so many more things where we need better observations in order to give a more clear answer.
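A heavily simplified version of the "are you in the habitable zone or not" check can be sketched as follows. It assumes that the zone boundaries scale with the square root of the star's luminosity, and it uses rough boundary values of 0.95 and 1.37 astronomical units for a Sun-like star. Both are simplifying assumptions for illustration; a real assessment needs many more parameters, as noted above.

```python
import math

# Assumed boundaries for a star with the Sun's luminosity, in AU.
HZ_INNER_AU = 0.95
HZ_OUTER_AU = 1.37


def habitable_zone(luminosity_solar):
    """Return (inner, outer) zone edges in AU for a given stellar luminosity.

    Luminosity is in units of the Sun's luminosity. The received flux falls
    off with distance squared, so the edges scale with sqrt(luminosity).
    """
    scale = math.sqrt(luminosity_solar)
    return HZ_INNER_AU * scale, HZ_OUTER_AU * scale


def in_habitable_zone(orbit_au, luminosity_solar):
    """True if a circular orbit of the given radius lies inside the zone."""
    inner, outer = habitable_zone(luminosity_solar)
    return inner <= orbit_au <= outer


earth_like = in_habitable_zone(1.0, 1.0)        # Earth-like orbit, Sun-like star
mercury_like = in_habitable_zone(0.39, 1.0)     # much closer orbit, same star
dim_star_planet = in_habitable_zone(0.2, 0.04)  # close orbit around a dim star
```

This covers only the distance criterion. Flares, atmosphere, and how the star changes over time are not modeled here.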
Now I want to comment a little bit on how I believe computationally we can come closer to understanding maybe life elsewhere. This is a bit speculative, so bear with me for a moment. It’s more like a …
Mike: I love this. This is great.
Victor: Even the question of how you find life on Earth is very difficult. There was an experiment back in the Voyager days where you fly out with Voyager, you’re turning around looking at Earth, and you ask the questions: Suppose we don’t know if there’s life or not? Can you tell us if there’s life on Earth just by what you measure? That’s very difficult. Then, you can start thinking about what you would measure. I know there’s something about the atmosphere and the composition and what’s in there.
Then, essentially, what you end up doing is a computational model. This is how I’m looking at this. You can look at metabolic processes and biomarkers and molecules that are related to life or used by life or produced by life. You can look at this as a big graph theoretical problem. You have processes that can be described as potentially a very big complex graph, which is maybe incomplete in the way we have it right now.
Now you can try and do this in a similar fashion for other planets because you can look at their atmosphere with spectroscopy. So, you can look at the light and potentially infer what kind of molecules are there. Then you would try to construct a metabolic graph for potential planets. Then, you could look at structural similarities with graph metrics. Is something that’s out there similar to what we have on Earth or not?
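One of the simplest graph metrics that could play this role is the Jaccard similarity between the edge sets of two graphs. The sketch below is only a toy: the "reactions" are made-up directed edges between molecule names, not real metabolic networks, and Jaccard similarity is just one assumed choice among many possible structural measures.

```python
def jaccard(edges_a, edges_b):
    """Share of edges common to both graphs, between 0.0 and 1.0."""
    a, b = set(edges_a), set(edges_b)
    if not a and not b:
        return 1.0  # two empty graphs are treated as identical
    return len(a & b) / len(a | b)


# Toy directed edges (input molecule, output molecule); purely illustrative.
earth_like = {
    ("CO2", "glucose"),
    ("H2O", "O2"),
    ("glucose", "CO2"),
    ("O2", "H2O"),
}
candidate = {
    ("CO2", "glucose"),
    ("H2O", "O2"),
    ("CH4", "CO2"),
}

similarity = jaccard(earth_like, candidate)
```

Here the two graphs share two of five distinct edges. A graph centered on elements other than carbon would share few or no edges with the Earth graph, so a comparison like this only captures similarity to life as we know it.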
That’s a bit speculative because there’s so many assumptions. One of which is we’re looking for life as we know it; and life as we know it has maybe big parts of this graph centered around carbon. But potentially, this graph could be centered around other elements in the periodic table. But computationally, this is something that we could explore. What if we had other scenarios? What would be those metrics? What would they look like?
Potentially, you can run simulations. Controversially, I would say maybe the search for life is a computational problem, and then you would validate it with the empirical observations. But whatever we do, the empirical observations are not enough. We need to link them to models in order to gain deeper understanding. This is a piece of the puzzle that I personally find very interesting to think about.
Mike: Yeah. It’s super exciting. The research you’re doing is a huge step into that. It’s so cool to see the work that you’re doing around that. Can you talk a little bit about the coding languages that you’re using with these different models?
Victor: Sure. We mostly use Python. We also use C and C++ when it’s important to have good performance. But I can generally tell you that the science community now seems to be switching to Python. So Python becomes our lingua franca. Python and Jupyter Notebooks for interfaces. We’ve adopted this language and frameworks in Python just because our customers or users are using it. Aside from that, I think Python is a very beautiful language. There are criticisms in terms of performance, but there are ways around that. That’s where we would use wrappers around high-performance C and C++ kernels that are good for clustering or other methods required for the kind of data science that we do.
On the other hand, there’s also advances in Just-in-Time compilation for Python where you can add essentially decorators or annotations, and then a Just-in-Time compiler takes your function and compiles it to assembler, redirects your control flow to this particular piece and then it goes back to your Python.
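As an example of the decorator style described above, here is a sketch using Numba's `njit` decorator. Numba is one such Just-in-Time compiler for Python; the interview does not name the compiler the team uses, so treating it as Numba is an assumption. The fallback branch keeps the example runnable as plain Python when Numba is not installed.

```python
try:
    from numba import njit  # real Numba decorator, used when available
except ImportError:
    def njit(func):
        """Fallback when Numba is missing: leave the function uncompiled."""
        return func


@njit
def sum_of_squares(n):
    """Numeric loop of the kind that benefits from compilation."""
    total = 0.0
    for i in range(n):
        total += i * i
    return total


result = sum_of_squares(10)
```

With Numba present, the first call triggers compilation and later calls run the compiled version. The calling code stays the same either way.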
Then there’s other approaches, but these are some of the main techniques we’ve been using successfully. To scale on a large scale, we’re using web services in the cloud. This is where we created our own frameworks to be able to offload data processing pipelines transparently. That involved developing our own container technology so we can essentially put AI or other things into these containers and then the framework knows how to replicate them in the cloud. The scientist doesn’t have to care where a certain pipeline stage is being processed, how data is being partitioned on a certain node and so on.
Mike: I want to ask about the team. When you’re looking to bring on new data scientists, what sort of skill sets are really important to you, what personality types are important to you? When I look at the DataLabs that we have here at Experian, we have data scientists with all different types of backgrounds, physics, statistics, sciences, etc. I’m curious about working in the space that you are, very specific, finding planets. What sort of skill sets are really important?
Victor: I would say data science is interdisciplinary, because you need to know something about computer science, AI, the technical skills to solve a problem, but you also need to know a lot about the domain. If you’re looking just at the paper that we published on the exoplanet search, we have co-authors with different kinds of expertise. We have Tam Nguyen; she is a physicist, basically. Then we have Laura Eckman; she’s a computer scientist. We have Professor Seager, who’s a planetary scientist, and then I come in as a computer scientist. None of this could have happened without the knowledge of this entire team. You need to bring this together.
This is something that maybe is not really discussed in today’s discourse on AI because if you want to apply anything on a problem like this, you need to know a lot about the domain, you need to know what is noise and what isn’t, you need to know something about the features that constitute the phenomena you’re looking for. Typically, these are so highly detailed that you cannot just take something out of a box and apply and expect that you’re getting meaningful results. You will always get results, but the question is, are they useful and interpretable for the end user, the scientists who want to use that for making a new discovery?
This is also an approach that I followed in my own team. The six postdocs in my team have very different backgrounds, from Ph.D.s in computer science to astrophysics to geophysics to backgrounds in signal processing. This is what created many of the successes that we had. For the community, it’s important to understand that data science is typically teamwork. It’s very hard to do it by yourself, because you need to know all these things.
Mike: I love that you brought in so many different disciplines in surrounding yourself with people who have special knowledge in different areas. You bring these teams together and you’re working on a problem. As you’re working on something, do you encourage debate within the team? Different people have different ideas on how to approach a problem. Can you explain how your team operates on different issues?
Victor: Absolutely. Debates are really important. I remember we had situations where, even for implementations for parsing certain algorithms, even drilling down to that, we ended up having several versions of different implementations to evaluate and figure out which one is best, which one should we go for. We very often have an early prototype, so we’re trying to evaluate which way to go.
But also, the domain expertise can cut short in many of these discussions because people would say, “This is a great idea, but a geoscientist would never do it like this,” or “An astrophysicist would never be interested in this kind of result,” or “This algorithm works and scales, but this is not the type of thing you really want in the community.” Or “We want more explainability on how your methods work.” And that makes it very difficult for many machine learning applications that have no explainability in their inner workings to be acceptable for physicists. That’s why many physicists are still rejecting discoveries made by machine learning, because you need to have this explainability.
One technique that we converged on is we need to infuse domain knowledge into our algorithms when it comes to computer-aided discovery, because then you’re able to explain inner workings much better. Plus, the algorithms become more efficient because the candidates that are being presented to you are already vetted with basic physics knowledge, where the candidates that don’t make sense in terms of physics are already excluded.
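A small sketch can show what vetting candidates with basic physics might look like. It borrows the transit example from earlier in the conversation: the fractional dip in brightness is approximately the square of the planet-to-star radius ratio, so a very deep dip implies an object too large to be a planet. The cutoff of 0.2 and the candidate list are hypothetical, and this is not the team's actual vetting logic.

```python
import math

# Assumed cutoff: companions larger than 20% of the star's radius are
# treated as too big to be planets. Illustrative value only.
MAX_RADIUS_RATIO = 0.2


def implied_radius_ratio(transit_depth):
    """Planet-to-star radius ratio from depth ~ (R_planet / R_star) ** 2."""
    return math.sqrt(transit_depth)


def physically_plausible(candidate):
    """Keep only real dips whose implied size is compatible with a planet."""
    depth = candidate["depth"]
    return depth > 0.0 and implied_radius_ratio(depth) <= MAX_RADIUS_RATIO


# Hypothetical candidates as produced by some upstream detector.
candidates = [
    {"name": "cand_a", "depth": 0.0001},
    {"name": "cand_b", "depth": 0.01},
    {"name": "cand_c", "depth": 0.25},    # implies half the star's radius
    {"name": "cand_d", "depth": -0.002},  # brightening, not a dip
]

vetted = [c["name"] for c in candidates if physically_plausible(c)]
```

Because each exclusion is tied to a stated physical rule, the filter also explains itself: every rejected candidate has a reason a physicist can check.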
Mike: That’s awesome. I always like to ask a couple questions before we end every #DataTalk episode. The first one is, there are a lot of unknowns and fears about how AI might change society, and there’s always those scary headlines that come up about automation and robotics replacing jobs. I’m curious about your view of the future of AI.
Victor: I think we are at a junction because there are two sides of this coin. You can do very bad things with AI, but you can also do very good things with AI. Then, I think it’s the interesting potential about it. It’s like fire. You can use fire, do something bad, but you can also use it to warm yourself up or cook a meal. Just looking at that potential with AI applied in different areas of our life, suppose we can increase our GDP, and I’ll speculatively say a number. Suppose we increase it by a million times. So our wealth becomes so much more than we’ve ever had. We enable totally new applications that were impossible before. Just think of healthcare or other applications where you can be diagnosed much earlier with a disease and then get better treatment, and so on.
This can have profound changes in our life. It transforms essentially society. What it means is, at the same time we’re innovating on AI, we probably also have to innovate on the way we want our society to work. Because technology by itself is not the problem. The problem is, how do we organize ourselves as a society to benefit from what AI has to offer?
I believe we can make further progress, not only with the techniques that we have today, which mostly focus on detecting new things (feature detection, neural networks, classification), but also with the inference part of AI. Being able to create more sophisticated theories from data is something that can help us understand the world and the universe in unprecedented ways.
Just think of it as a teaser. Suppose you know nothing about the world. You take all this data. Can you infer the theory of relativity or can you infer something that explains the world even better, if it’s possible at all? Or can you make a machine win a Nobel Prize just because a machine could generate theories that are so nuanced and so much better potentially than what our cognitive abilities are as humans?
These are all great potentials, and they could help us advance society in unprecedented ways. I’m looking at this as a chance, but it’s contingent on whether we are able as a world to organize our society in such a way that it’s for the good and it benefits everybody.
Mike: I love the answer. Victor, thank you so much for your time today. This has been a fascinating discussion. We’ve never covered it before, and I learned a ton, took a ton of notes while listening to you talk. Where can people get in contact with you, where they can learn more about you?
Victor: I have a website, VictorPankratius.com, if somebody is interested in learning more about the techniques that we’ve developed in computer-aided discovery or the type of machine learning we’ve applied to astronomy or geoscience problems. I’m also tweeting at @VPankratius.
My current goal is to expand our techniques beyond just the astro and geo domains. I’m very interested to talk to biologists, chemists, financial people, all domains that could benefit from computer-aided discovery and domain-aware artificial intelligence. I think this is something that’s very interesting to pursue. If somebody wants to reach out to me, I’m more than happy to chat.
Mike: Wonderful. For those listening to the podcast, if you’d like to see the video or get the links to the research that Victor has done and also links to his website and social profiles, you can go to the Experian blog. The short URL is just ex.pn/datatalk47. That’ll bring you over to the full transcription of today’s chat and all the links that Victor mentioned. There’s a whole batch of good articles and also the research that he’s recently done on finding planets with machine learning.
Victor, thank you so much for being our guest. It’s awesome having you. Would love to have you back whenever possible. I’m also grateful and thankful to MIT for setting this all up, having an awesome camera, great sound. They did a terrific job.
Victor: Thank you very much. I also want to thank all my collaborators and students and my hosts here, MIT. This is a great setting and a fruitful ground for this kind of research, which as you can imagine, if you’re starting something like this with a cross-disciplinary focus, is very difficult to do, but MIT has provided this fantastic environment so we could make lots of progress. Thank you very much.
Mike: Wonderful. Thank you, Victor. Take care.
Victor: Thank you.
Victor Pankratius leads the Data Science for Astro & Geoinformatics group at MIT. He is a computer scientist who is passionate about advancing data science through novel computational methods involving domain-aware artificial intelligence, scalable parallel computing, and software engineering for artificial intelligence systems. He serves as principal investigator in NASA and NSF projects, and his interdisciplinary research spans collaborations in multiple departments including MIT’s Haystack Observatory, MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), the MIT Kavli Institute for Astrophysics and Space Research, and MIT’s Department of Earth, Atmospheric, and Planetary Sciences (EAPS). He is tweeting @VPankratius.