Man vs Machine Learning: Criminal Justice in the 21st Century | Jens Ludwig | TEDxPennsylvaniaAvenue

Translator: Delia Cohen
Reviewer: Ellen Maloney

I have an idea for how to make
the world a better place, and like all truly good ideas,
this one starts with a road trip. The protagonists on the road trip
are me, a University of Chicago professor who studies crime
and the criminal justice system, and a friend of mine who’s a professor
at an Ivy League medical school. And so, there we were, two at least
moderately distinguished academics driving up Route 95. (Laughter) We had several hours to kill on the way
from New York City to New England, and so I tried to use some
of those conversational skills that professors are famous for. I turned to my friend and I said, “Tell me about the biggest mistake
that you’ve ever made in your life.” He paused, then turned to me and said, “I’ve got an idea:
why don’t you go first?” (Laughter) So, I said to my friend “I’m the one who asked the question.
Why don’t you go first?” So, my friend told me about a time
that he was working in the ER, and a patient came in
complaining of chest pain. The standard protocol
in situations like that is to do a cardiac enzyme test to see what’s in the blood
to try and predict whether the patient is having
a heart attack at the time. The patient comes in,
and they administer the test. The level of the enzyme
is above the threshold, and usually the default would be to take
the patient into the intensive care unit. But before they do that, my friend goes
into the waiting room to see the patient in person first, and the patient is sitting there
snacking on a watermelon. My friend talks to him for a few minutes, goes back out to meet
the rest of the team. Now, the rest of the team – the doctors and nurses
on duty in the ER – they haven’t seen the patient. All they’ve seen are the data in the chart
and the test level above the threshold. So they start saying, “We’ve got to go;
let’s get this guy up to the ICU.” My friend says, “No, no, no. I went and met with the patient;
he’s totally chill. He’s sitting there, he’s having a snack. I think he’s okay;
let’s leave him where he is.” And then, a half hour later, the guy goes into cardiac arrest, and they
have to race him to the operating room. That is an illustration of a lesson
that we’ve learned from behavioral economics and psychology
about how easy it is for the human brain to get distracted by irrelevant
but very salient information. That got me thinking of a problem
that the research center I help run at the University
of Chicago, the Crime Lab, has been working on for several years, which is the problem of the jail
system in the United States. Millions and millions of times a year, judges have to make a decision
about when someone is arrested, where that person awaits trial: do they get to go home
or do they have to sit in jail? And by law, that decision is supposed to hinge
on the judge’s prediction of what the defendant
would do if they were released: Is that person a flight risk? Is that person a public safety risk? This is an enormously
high-stakes decision. If the judge puts you in jail,
you will on average sit there for two to three months or longer,
sometimes much, much longer. The flip side of this is: if the judge releases someone
who goes on to commit a new crime, that could be horrible in its own way. And this is a decision
that’s very difficult for the judge for the same reason
that the emergency room decision is difficult for my doctor friend. ER doctors at least have the benefit
of something like a cardiac enzyme test to help them make
those sorts of decisions. We give judges a stack of manila files
with some information about what the person was arrested for and the person’s prior criminal record, and then the judge has to make
a decision in their head. To think about how crazy this is, consider that the very same judge
who spends all day reading through folders and making predictions that will change
the course of people’s lives, they go home and they want to relax
at the end of the day by watching a movie on TV, and for help with that critical decision, the judge gets access to Netflix, which uses some of the most
sophisticated machine learning technology on the planet, to help predict what movie
the judge is going to like. Why aren’t we using some
of these technologies that have been deployed so productively
in the commercial sector to help us solve these really important
public policy problems as well? Now, to think about whether this
would actually be helpful or not, I think it’s useful for starters to have
a little bit more sense for what machine learning is
and how it works. Let me talk you through briefly a canonical problem in computer science,
which is called sentiment analysis. So here’s what that is: It is basically taking a snippet of text and trying to determine
what the author’s affect was: is the author trying to convey
a positive or negative emotion? So, here’s how that looks for a more or less randomly
selected consumer product, the Hutzler 571 Banana Slicer. (Laughter) Now, here’s a review by Thrifty E: “I bought this in order to speed up
cutting up a banana for my cereal. Any time I saved in that endeavor
was spent cleaning this implement.” (Laughter) “It is not easy to clean. You have to scrub between
every rung to thoroughly clean it.” Now, we read that; it’s trivially easy
for us to tell that is a negative review. And we can confirm our assessment
by looking at the star rating, merely two out of five stars. Here’s another one
by Uncle Pookie who says “Great gift.” (Laughter) “Once I figured out I had to peel
the banana before using it,” (Laughter) “it works much better.” Five-star review. Here’s one by Q-Tip: “Confusing. There’s no way to tell if this
is a standard or metric banana slicer.” (Laughter) “Additional markings on it
would help greatly.” And here’s one more by J. Anderson: “Angle is wrong. I tried this banana slicer
and found it unacceptable. As shown in the picture,
the slice is curved from left to right and all of my bananas
are bent the other way.” (Laughter) Now, reading through these text reviews, you realize that it is
very, very easy for us to do this, and that gave the early
computer scientists an idea about how to get computers to do this. Why don’t we just introspect
on how we do this, and then try and program the computer
to do exactly what we’re doing? Here are the results of a study
that tries to do sentiment analysis using what’s called a programming
approach for movie reviews. The data set that we have of movie reviews it’s half positive reviews,
half negative reviews. And so, an accuracy rate of 50% would
basically be just like random guessing. And so, you get a bunch of programmers, they sort of introspect on what words
you would expect to see in a positive review,
in a negative review. Here are some of the positive words that you think you would expect
to see in a good review and some of the words you’d see
in a negative review. And when you do this, you get
an accuracy rate on the order of 60%. Now, that’s better than random guessing, but not much better. This is the challenge that the computer
scientists kept running into in this area. Even with pretty basic problems, it turned out to be very, very hard to program computers up
to do what we’re doing and get good performance. The reason for that is that it turns out
to be much more difficult for us to fully introspect and figure out what
we are doing when we do these tasks. My psychology friends call that
the “introspection illusion.” Progress in this area really only came once the computer scientists realized
that we needed to just completely forget that we knew how to do
these things ourselves and turned these tasks into just
brute force data exercises. In the movie review analysis,
here’s what that would look like: You would take a large sample
of movie reviews where you know whether
they’re good or bad reviews by the star rating and you would let the computer learn which words tend
to come up in good reviews and which words tend
to come up in bad reviews. Okay? And then use those words
as your prediction algorithm for future reviews. And once you adopt
that data-driven approach, these are the words
that the computer learns, that the machine learns, are indicative of positive
and negative reviews. Now we can get up to accuracy rates
on the order of 95%. This, I think, is really the magic
behind machine learning, and you can see how you
would apply this then to something like pre-trial
release decisions. Let the computer learn
what case characteristics, or combination of case characteristics, are actually most predictive
of flight risk or public safety risk. I’ve been working as part of a research
team for the last several years trying to build a prediction
algorithm for pre-trial release to see if we can be helpful to judges. We’ve been doing this with data
from a large, anonymous American city of 8.5 million people. (Laughter) What we discovered is that it’s not so hard to actually
build the algorithm. You can download free
software off the internet and figure out how to do that. The hard part here
is testing the algorithm and seeing whether it will actually
make the world a better place or not. For Netflix, this is not
such a hard problem. Everything that Netflix does is in this sort of self-contained
online environment. But testing an algorithm in the real world
for public policy applications is often much more complicated. This is a difficult problem to solve, absent the ability
to do a randomized trial, and it’s a difficult
social science problem, not a difficult computer science problem that we run into. And it’s so difficult that many of the people
who are now thinking about taking these machine learning tools and bringing them
into the public policy arena are tempted to just give up
on the testing stage and take tools right from the drawing
board of the computer into the real world. And I think that would be a mistake. It is very possible
to inadvertently build a tool that can wind up making the world
a worse place, not a better place. For the project that we’ve been working on, the hardest part for us has been
to figure out how to test the tool and make sure it’s actually helpful. The way that we have come up with
to test the tool builds on two insights. Notice why this problem is difficult
in the pre-trial case. We build an algorithmic rule
to inform pre-trial release that says let’s prioritize the people
with highest predicted risk for jailing and let everybody else go. That algorithmic rule will inevitably
want to release someone that the judge jailed. And when the algorithm wants to do that, we can’t see what that person
would have done had they been released because the judge actually jailed them. So, we have this very difficult
missing data problem. On the flip side, though, if the algorithm wants to jail
someone that the judge released, we don’t have an evaluation problem because we know what the effect
of putting someone in jail is on their flight risk
or their public safety risk. Being in jail eliminates the risk
that you won’t show up in court or get re-arrested. That’s insight number one: this missing data
challenge is one-sided. And the second insight that helps us here is that in the big city
in which we’ve been working, cases are more or less
randomly assigned to judges. What that means, then, is
that we have a sample of judges who are hearing very similar caseloads. The judges turn out to differ a lot with respect to their
strictness and leniency. So, here’s what we can do in that case. Imagine that we have two judges: a lenient judge that releases
90% of the cases and a stricter judge that releases 80%. We can basically compare
how the judges perform when they become stricter compared to how the algorithm
would choose to become stricter, as a fair test of the
algorithm’s performance. Here’s what that would look like. Here’s the lenient judge
who releases 90% of their cases. We can observe all the outcomes
for the people that judge releases. And the algorithm would say
if we wanted to become stricter and go from a 90% to an 80% release rate, the algorithm would just say: let’s identify the highest risk
10% of people in the judge’s caseload and prioritize them for jail. Now we’re down to an 80% release rate, and we can observe what the crime rate
would be that we could get, and then we could compare that
to how the stricter judge did in getting us down from a 90%
to an 80% release rate. This gives us a way to fairly compare the algorithm’s performance
against the judge’s on a comparable set of cases, focusing on the algorithmic task where we don’t have
this missing data problem, where the algorithm is just
selecting people to jail from among the pool of people
that the judges let go. Now, having solved
the evaluation problem, the testing problem, we can do some policy simulations
to suggest what would happen if we actually followed
the algorithm’s rule instead of standard practice
in the criminal justice system. What we find is that if you follow
the recommendations of the algorithm, you’d be able to reduce
the crime rate by fully 25% without having to put
a single additional person in jail. Alternatively, you could reduce
the jail population by fully 42% without any increase
in the crime rate at all. And the reason that the algorithm
is capable of giving us such big gains over the status quo
criminal justice system is we can see in the data that the judges, just like my ER doctor friend, are getting distracted by irrelevant
but very salient information about these cases. And that’s especially true
among the highest-risk cases in the defendant pool. So, what I’ve just done is I’ve shown you the upside of applying machine
learning to these policy problems. There’s a potential downside as well, which is the possibility
that these algorithms, once we apply them to policy problems, maybe especially
criminal justice problems, might get us gains on some outcomes but compromise other things
that we care about like fairness. You can see why people
are worried about this. In the city in which we are working, fully 89% of people
in jail are minorities – in a city where I can assure you
the overall city population is not anywhere like 89% minority. The people who are concerned
about the use of machine learning for these problems, I think, are right
in a way in that we have discovered that if you build an algorithm and a release rule
that ignores this issue entirely, it is indeed possible to build
a tool that makes this problem, if anything, a little bit worse. But what we’ve also found
is that if you build an algorithm paying attention to this problem, you can design a decision aid that would simultaneously
let you reduce crime, reduce jail populations,
and reduce racial disparities in the criminal justice system as well. How does the algorithm let you do that? Well, what is race, after all, but an irrelevant but highly salient
piece of information in the courtroom? What is an implicit bias other than a version
of the introspection illusion? The algorithm is not prone
to those challenges to human judgment and decision making. I think what’s particularly
exciting about bail is that it is just one illustration of a larger class
of public policy problems that hinge on a prediction
that a human being is currently making, but in principle could be informed
by machine learning algorithms. There is an active debate underway about whether it’s a good idea
or a bad idea to take these algorithms from the commercial sector and bring
them into the public policy arena. Should we do that or not? I think that that actually
is the wrong way to frame the debate
and frame the question, and here’s a thought exercise about why. Imagine that I could
magically transport you back to the beginning of the 20th century. You would arrive telling people
about this new technology that was on the horizon that would very quickly become
one of the leading causes of death and have massively adverse
impacts on the environment. And yet, I think relatively few of us here would argue that we shouldn’t have adopted
the internal combustion engine automobile. Imagine what life
would be like without cars. We wouldn’t have had anything
like the economic growth we’ve seen over the last 100 years. Our lives would be impoverished
in countless ways, and we wouldn’t have road trips. (Laughter) And so I think the right conversation to be having about
the use of machine learning for policy applications
over the next ten years is not whether to adopt
these new technologies but how. Thank you very much. (Applause)
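
The data-driven approach to sentiment analysis described in the talk (let the computer learn which words tend to come up in good and bad reviews, then use those words to predict future reviews) can be illustrated with a minimal sketch. This is not the method from the study mentioned in the talk: it is a simple smoothed word log-odds scorer, and the tiny training set, the function names learn_word_scores and predict, and the zero threshold are all assumptions made up for illustration.

```python
import math
from collections import Counter

# Toy, made-up labeled reviews (1 = positive, 0 = negative); not real data.
TRAIN = [
    ("great gift works much better", 1),
    ("love it works great", 1),
    ("not easy to clean", 0),
    ("angle is wrong found it unacceptable", 0),
]


def learn_word_scores(examples):
    """Learn a smoothed log-odds score per word from labeled text.

    Positive scores mean the word shows up more often in positive reviews.
    """
    pos, neg = Counter(), Counter()
    for text, label in examples:
        (pos if label == 1 else neg).update(text.split())
    vocab = set(pos) | set(neg)
    # Add-one smoothing so a word unseen in one class never yields log(0).
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    return {
        w: math.log((pos[w] + 1) / pos_total) - math.log((neg[w] + 1) / neg_total)
        for w in vocab
    }


def predict(scores, text):
    """Return 1 (positive) if the summed word scores are above zero, else 0."""
    total = sum(scores.get(w, 0.0) for w in text.split())
    return 1 if total > 0 else 0


scores = learn_word_scores(TRAIN)
print(predict(scores, "works great"))        # learned as positive words
print(predict(scores, "not easy to clean"))  # learned as negative words
```

Nothing here encodes a human guess about which words matter; the scores come entirely from the labeled examples, which is the contrast with the hand-picked word lists of the programming approach.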

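The evaluation idea from the talk (start from the lenient judge's released defendants, let the algorithm jail the highest-risk ones until the release rate matches the stricter judge's, then compare crime rates) can be sketched as follows. This is a hedged illustration, not the research team's code: the function name crime_rate_after_contraction, the risk scores, the outcomes, and the stricter judge's crime rate are all invented numbers chosen only to show the arithmetic.

```python
def crime_rate_after_contraction(released, caseload_size, target_release_rate):
    """Simulate the algorithm making a lenient judge stricter.

    released: list of (predicted_risk, committed_crime) pairs for the
        defendants the lenient judge released, whose outcomes we can observe.
    caseload_size: total number of defendants the judge heard.
    target_release_rate: release rate to reach, e.g. 0.80.

    Returns crimes per defendant in the full caseload after jailing the
    highest-risk released defendants.
    """
    n_keep = round(target_release_rate * caseload_size)
    if n_keep > len(released):
        raise ValueError("target release rate exceeds the judge's release rate")
    by_risk = sorted(released, key=lambda d: d[0])  # lowest predicted risk first
    still_released = by_risk[:n_keep]               # the rest are jailed
    crimes = sum(1 for _, committed in still_released if committed)
    return crimes / caseload_size


# Hypothetical lenient judge: 10 cases, 9 released (90% release rate).
lenient_released = [
    (0.05, False), (0.10, False), (0.15, False),
    (0.20, True), (0.25, False), (0.30, False),
    (0.40, False), (0.60, True), (0.90, True),
]

algorithm_rate = crime_rate_after_contraction(lenient_released, 10, 0.80)
strict_judge_rate = 0.30  # made-up observed rate for a judge releasing 80%

print(f"algorithm at 80% release: {algorithm_rate:.2f}")
print(f"stricter judge at 80% release: {strict_judge_rate:.2f}")
```

The sketch only ever jails people the lenient judge released, so every outcome it needs is observed; that is the one-sided missing data point from the talk. The comparison is fair only if the two judges hear similar caseloads, which the talk attributes to more or less random assignment of cases.
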
3 Responses

  1. Ken Litton says:

    This presentation is most fly.

  2. Adnane ARHARBI says:

    very good subject, so we can design referral systems for judges or lawyers, or we can even control the quality of judgments

  3. TheSpecialAlpha says:

    One of the best talks I've ever listened to
