Posts Tagged 'machine learning'

Libraries in a computational age

(After a year-long hiatus from external speaking engagements, I accepted an invitation to speak in Madrid at an event celebrating the 20th anniversary of the Madrono Consorcio. Below is the text of my talk.)

It is a privilege to be able to speak with you and to share with you my thoughts on the future of libraries, and some of what we are doing at MIT to reimagine what a research library can and should be and do in a computational age.

I am particularly happy to be talking to you on the 20th anniversary of this consortium, which is committed to the same kind of sharing and collaboration across libraries and universities that we will need to do even more of now and into the future.

I think that the best way research libraries can meet the challenges of the future, and support our universities in educating students and producing research that will allow us to solve some of the big global problems that are looming, is by working together and sharing our experiences.

So let me follow my own advice and share my experience at MIT and tell you a bit about our context.

MIT is probably best known as one of the world’s leading technology and engineering schools. We are ranked #1 in the world, but paradoxically only #3 in the US. MIT has just over 1000 tenure-track faculty, nearly 5000 undergrads (almost ½ women), and 7000 graduate students. Faculty and students are spread out over 5 schools – Architecture and Planning; Engineering; Humanities, Arts and Social Sciences; Management; and Science.

Probably more important than the facts and numbers, MIT is known culturally for at least 3 things: a hands-on approach to learning, openness, and a relentless pursuit of innovation.

The hands-on approach (learning by doing) is reflected in the MIT motto – mens et manus; mind and hand. There is a very real emphasis at MIT on learning by doing – reflected both in the project-based teaching approach across the curriculum and in the fact that 90% of MIT undergrads participate in a research project before they graduate.

Openness is also a very important part of our culture and a widely shared value at MIT. We are one of the few private universities in the US with an open campus, including libraries that are open to all visitors. We are also committed to openly sharing our educational and research materials with the world.

MIT created Open Courseware in 2000, “a simple but bold idea that MIT should publish all of our course materials online and make them widely available to everyone.” To date Open Courseware has over 2 million visitors/month, and hosts 2400 courses.

In 2009, the MIT faculty passed, by a unanimous vote, one of the first campus-wide open access policies in the US. MIT turned to the libraries to implement the policy, and because of a commitment to provide adequate staffing and resources for collecting faculty research, we now share 45% of MIT faculty journal articles written since 2009 openly with the world through our OA repository.

The 3rd important component of MIT culture is that it is a place obsessed with innovation across the curriculum. MIT is where gravitational waves were detected, and where Guitar Hero was invented.

MIT’s most recent innovation was to reinvent itself and how we think about computing across the curriculum and in every discipline. This year, MIT launched a new college of computing – a $1 billion effort that will eventually add 50 new faculty to MIT. The goal of the new college is to address the global opportunities and challenges presented by the prevalence of computing and the rise of artificial intelligence (AI) by infusing computational thinking throughout every department and discipline AND to ensure computer science, and especially machine learning and AI are informed by work in other disciplines – especially social sciences and humanities.

I give you all this background about MIT because to my mind MIT is a perfect place to develop a bold and ambitious vision for the future of research libraries. I came to MIT in 2015, after many years working at the Stanford libraries, and I could see right away that this was a place that was ready to think about libraries as more than books and buildings.

We do have books and buildings of course. We operate 5 libraries on campus, a reading room for distinctive collections, and a storage annex. We have a collection of 2.3 million print items. We get over ½ million visitors to the library spaces/year (that figure has been growing slightly in recent years). 87,000 of our 2.3 million physical items were checked out last year, and like most academic libraries, our circulation of print is declining.

Use of our vast digital collections is significantly higher and growing. There were over 80 million searches on our online databases last year, and 2.3 million downloads of the open access articles we disseminate via DSpace@MIT.

While I know that the size of our library at MIT may seem large to many of you, compared to the US research universities that are widely considered our peers (Harvard, Yale, Princeton, and Stanford), MIT has a small library – at least by the traditional measures of size of print collection, or budget, or staff. Harvard has a physical collection 10 times ours (22 million) and a staff of 700, compared to our 160. Yale, Princeton, and Stanford all have collections over 10 million and staffs of over 300. (Note: I am well aware that size is relative, and that by almost any measure, MIT and the MIT libraries are extremely well resourced. I add the HYPS comparison because I find many folks assume MIT Libraries are roughly the same size as those peers.)

We are small(ish) but mighty, and we are mighty in ways that are relevant to current and future research and that align with MIT’s core missions: open scholarship, and computational and algorithmic access to collections.

Our vision is to be an open, interactive and computational library.

Let me back up a bit now, and tell you how we arrived at that vision, and hopefully convince you that a focus on openness, interaction, and machine access to collections is a good direction for the future of research libraries more generally.

Shortly after I arrived at MIT, the Provost asked me to lead a conversation across campus about what the future of libraries should be. So I convinced him to create a task force on the Future of Libraries. I volunteered to co-chair the task force, and we ended up with 30 members, mostly faculty from all across MIT – engineers, computer scientists, business school faculty, historians, biologists, etc. Importantly, the task force also included staff from libraries, the MIT Press and central information technology.

Membership ranged from folks who relied heavily on library collections and librarians in their teaching and research, to faculty who claimed at the start that they “never used the library”.

I asked this group to think about the future of libraries as a kind of research question. I wasn’t interested in how well the current library was serving their needs, and how we might improve or expand that a little bit. I was asking them to think critically about what a research library can and should be in a digital age, and now a computational or algorithmic age.

The report from the group is online, but I want to share highlights here.

The first conclusion was that although the initial digital turn in libraries was not yet complete, we were already on the cusp of a second, potentially more profound one. The first, original digital shift in libraries was from print to digital plus print, and was brought about by the internet, Google, and e-books/journals.

In that first digital turn, the library went from being a place where individuals came to find physical books and journal articles (and manuscripts, and images, and lots of other stuff) so that they could read those books and articles themselves, to libraries being a service that individuals use to gain online access to journal articles, and e-books, and digital images and manuscripts and more so that they can read and use those things on their own digital devices.

Slide with text “print … to digital” and image of person taking book off full bookshelves, and image of a tablet device showing a digital bookshelf

Although this was a HUGE shift, it did not open up access to scholarly content the way many of us hoped it would. In large part because of the market power of many large commercial publishers, the advent of online journals did not democratize access to knowledge, and the potential for the rise of the internet and of online information and scholarship to create information equality has been stunted. Nonetheless, the first digital turn in libraries and scholarly communication did make research and reading arguably more efficient for those who had access.

In describing the next evolution of libraries, the MIT future of libraries task force emphasized not only the technological shift, but also the importance of combining this shift with a renewed commitment to open science and open scholarship. What is the next shift? It is an evolution of libraries from service to platform, and from being just digital and physical to also being computational.

The Future of Libraries task force described this by calling for the libraries to operate as an open global platform. A platform is something scholars and patrons build on, and it is a way of thinking about libraries not just as physical and digital repositories of content, but as vehicles for interacting with content, tools, and expertise, both to consume information and to create new knowledge in many forms – text, images, data, maps, multi-media, interactive and dynamic. And for a library as platform to be truly effective for current and future patrons, it must be committed to openness and to serving a global community.

“The MIT Libraries must operate as an open, trusted, durable, interdisciplinary, interoperable content platform that provides a foundation for the entire life cycle of information for collaborative global research and education.”

Future of Libraries TF, 2016


One of the key features of the library of the future – the library in a computational age — is that it should be a library accessible by machines and algorithms, not just by people. In a computational age, we have to realize that humans are not our only patrons. In fact, I have argued before that we would be wise to start thinking now about machines and algorithms as a new kind of patron  — a patron that doesn’t replace human patrons, but has some different needs and might require a different set of skills and a different way of thinking about how our resources could be used.

Drawing of a robot, holding a book, with thought bubble of “I love reading”


When I think of AI and machine learning in the context of libraries, I think of computer programs and algorithms that can extract and derive meaning and patterns from data, make predictions and inferences about and with new data, and in doing so, solve problems at scales not possible by humans only.

I said earlier that the MIT Libraries vision is to be an open, interactive and computational library. Let me explain those a bit:

  1. Open is about more than open access – it is about being a library that is open and inclusive of a range of ideas, and types of knowledge.
  2. Interactive means that we collaborate with scholars as partners, because in a computational world our understanding of how information is organized, how data is managed, and where and how bias creeps in becomes more important than ever.
  3. Computational is about ensuring our collections are accessible in formats optimized for text mining and other computer analyses; and that patrons can design and code their own ways to access and analyze our collections.

The computational part of this vision is what I think is really interesting, and where libraries can play a unique and important role in this age of AI. There are several ways to think about the roles of research libraries in a computational age:

We all know that there are problems with bias in algorithms, and especially in terms of the data used to train algorithms. When the data used in computational research is not diverse, not inclusive, and/or is described in ways that reflect societal prejudices and inequalities then those problems and biases will be reflected and amplified in the findings, conclusions and applications of that research. Librarians know better than most people how information is collected, assembled and organized; so we know where things can go wrong. Library folks who understand data and metadata, can help ensure scholars are aware of the shortcomings of the data they use, and can help mitigate those impacts.

We can also use machine learning in our own work. One of the most interesting applications of machine learning is in assigning subjects to books. MIT Press is using a machine learning tool called Unearth to ‘read’ all of the books it publishes and extract subjects that human readers might miss.

We can also do what libraries have always done and be a centralized, accessible, and inclusive resource for our communities. We can provide centralized access to computational tools and resources as a way to equalize access to machine learning across our campuses. Some libraries might do this by providing AI labs in the library, with access to hardware, software and training tools to get students started using machine learning and AI. We might also maintain online libraries of training data and basic algorithms that students can use and modify as they learn.

But IMO, the most important thing that libraries can do is work to ensure that the knowledge and research products we already collect, curate, and disseminate are openly available and that the scholarly record is as diverse and as inclusive as possible. Because it is the combination of truly open access to lots and lots of content – text, data, code, images – analyzed with powerful computational tools and methods where really interesting things can happen. New findings and understandings and new discoveries in sciences and humanities have thus far mostly occurred when we build on prior knowledge, and make new and creative connections between facts, data, knowledge and insights. Computational access to open collections of knowledge means that can happen at a speed and a scale that most of us can’t even really imagine. Certainly, the choice of topics and problems, the interpretation and application of results requires human imagination – but machine learning tools can speed up the process and, when combined with open access, equalize the ability of people to make use of the knowledge we have already accumulated.

The connection between open science and computational analysis is why this topic is so exciting and important to me. One of the recommendations of the MIT task force on the future of libraries, was that MIT convene another task force – this one on open access and open science. I am co-chairing that task force now and we released draft recommendations in March 2019, and expect to release final recommendations and a report this fall.

As our task force has engaged faculty and students across MIT in the cause of open science, we have emphasized that open access to research is good for research and is critical to enhancing our ability to collectively solve big global challenges. That is a compelling and true argument, but the argument that seems to resonate most strongly at MIT comes when we explain to scholars that locking their work behind publisher paywalls means that their research will likely be left out of the scholarly conversation and the progress of science, because those conversations and that progress are increasingly computational. Researchers who are looking for data to use for machine learning and computational analysis are looking for data that is easily and openly available. Publishers want to sell not just reading access but also computational access to scholarly content, but I believe the integrity of science depends on educational institutions maintaining control over their own scholarly output – disseminating and preserving it in institutionally owned and operated repositories.

Imagine the progress we could make as a society if the output of researchers was openly and computationally available in interoperable repository platforms operated by and for the academy? (Here is a good place to put in a hearty personal endorsement of the Invest in Open Infrastructure initiative.)

It is possible that the most important thing libraries can do in a computational age is to continue to fight for open science and open scholarship – based on academic values and served via academy-owned infrastructure.

The combination of open + computational + academy-owned is the future that I think libraries are uniquely well suited to pursue, and that I think is what our universities and our communities need us to pursue.


What happens to libraries and librarians when machines can read all the books?

Revised text of talk I gave for the Harvard Library Leadership in a Digital Age program.

The description of this course promises that “you will identify fundamental changes occurring in the field of knowledge management and consider their implications for libraries, information services, and library leadership.”

I think my session maybe breaks the rules a bit (which is my first leadership tip for you: when it feels like the right thing to do, break the damn rules!).

One of the things I think is important for library leaders is that we look at fundamental changes outside of knowledge management and consider their implications for libraries and the work we do.

I think looking outside of changes in our own field is essential if we want to be active, effective leaders who don’t merely respond to change, but who create and shape the change we believe is needed in libraries and archives.

So, I want to talk about AI and libraries in at least 2 ways:

  1. Substantively, I want to share with you some of my thoughts and speculations about the potential implications of AI and machine learning for libraries and librarianship.
  2. I also want to talk a bit about AI on a more meta-level – that is to say, I want to use my own commitment to learning about and thinking about AI and its implications for libraries as an example of the more general tension leaders face between tending to immediate, local challenges and thinking about, preparing for and creating the future.

So let’s start with why I’m interested in machine learning and AI.

Basically, it is because I think that it is past time for us to take digital libraries to the next level; and I think the next level is likely to involve machine learning and optimizing our collections, services, and spaces for machine learning applications.

Where are we in digital libraries right now? We are still in the midst of the initial digital shift in libraries (really from the mid-to-late 1990s to now).

In this shift, we have gone from libraries being a place where individuals came to find physical books and journal articles (and manuscripts, and images, and lots of other stuff) so that they could read those books and articles themselves, to libraries being a service that individuals use to gain online access to journal articles, and e-books, and digital images and manuscripts and more so that they can read and use those things on their own digital devices.

This ongoing digital evolution of libraries and of how students and faculty use scholarly content is significant and has arguably made research and teaching more efficient and more productive.  The advent of online and digital libraries has also made more information more accessible to more people than ever could have been possible when scholarly materials were available only in tangible, physical formats.

But if this switch, from individuals reading books and articles one at a time in print to individuals reading books and articles one at a time on their own digital device is all we get from the digital revolution, then it won’t have been much of a revolution.

In the title of the talk, I ask “what happens to libraries & librarians when machines can read all the books?” But the truth is that we are already there – or at least the machines are. So it behooves us to be ready for it – intellectually, strategically, and operationally.

I think an important part of leadership is not just responding to changes, but actually getting in front of those changes when we can.

Let’s start with some definitions.

From the MIT Press Essential Knowledge book Machine Learning:

AI is “Programming computers to do things, which, if done by humans, would be said to require ‘intelligence’.”

Machine learning is a kind of AI, where the computer is programmed to optimize a performance criterion using examples or past experience. The machine does what the data tell it to, not what a program tells it to.

In describing the advent of machine learning, Ethem Alpaydin says:

“nowadays, more and more, we see computer programs that learn – that is software that can adapt their behavior automatically to better match the requirements of the task. We now have programs that learn to recognize people from their faces, understand speech, drive a car, or recommend which movie to watch … once it used to be that the programmer who defined what the computer had to do … now, for some tasks, we do not write programs but collect data”
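To make the idea of "collecting data instead of writing a program" a bit more concrete, here is a minimal sketch in Python. It is only an illustration under loud assumptions: the example texts, the subject labels, and the function names `train` and `predict` are all invented for this sketch, and simple word counting stands in for real machine learning methods. The point is that no rule about geology or medicine appears in the code; the behavior comes entirely from the labeled examples.

```python
from collections import Counter


def train(examples):
    """Build one word-count profile per label from (text, label) pairs."""
    profiles = {}
    for text, label in examples:
        profiles.setdefault(label, Counter()).update(text.lower().split())
    return profiles


def predict(profiles, text):
    """Return the label whose profile gives the most weight to the text's words."""
    words = text.lower().split()

    def score(label):
        counts = profiles[label]
        total = sum(counts.values())
        # Counter returns 0 for words it never saw during training.
        return sum(counts[w] for w in words) / total

    return max(profiles, key=score)


# Hypothetical toy training data, invented for illustration only.
EXAMPLES = [
    ("rock formations and sediment layers", "geology"),
    ("fossils found in rock strata", "geology"),
    ("tumor cells in pathology reports", "medicine"),
    ("patient diagnosis and pathology", "medicine"),
]

profiles = train(EXAMPLES)
label = predict(profiles, "a report on rock layers")
print(label)
```

Changing what this little program does means changing the examples it is given, not the code, which is also why the quality and inclusiveness of the training data matter so much.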

Since I’m not a computer scientist or an engineer, I use the terms in relatively loose ways and often interchangeably.

When I think of AI and machine learning in the context of libraries, I think of computer programs and algorithms that can extract/derive meaning and patterns from data, make predictions and inferences about and with new data, and in doing so, solve problems at scales not possible by humans only.

At an MIT symposium a few years ago Elon Musk, CEO of Tesla, talked about the existential threat of AI and suggested a need for regulatory oversight. Specifically, he said “With artificial intelligence, we are summoning the demon.”

So let’s talk about the fears and concerns. Maybe they aren’t as existential as Musk’s (I find librarians tend to be more practical), but I’m sure we have some. I certainly do.

What are the dangers of AI, especially in relation to libraries and the things we support — especially research & learning? Here are some common concerns:

    • Robots will take our jobs – In an article in Library Journal in April 2016, Steven Bell writes about the Promise and Peril of AI for Academic Librarians – and he asks “Could artificially intelligent machines eliminate library jobs?”
    • One reason people argue that AI will not replace library or other jobs is that machines can’t replace the deeply human skills of creativity and interaction, which may mean that those skills become more valuable or could mean that AI will usher in an era where creativity and empathy are devalued and rare.
    • Another fear is that AI will eliminate the relationships between people and books, and between librarians and their community members.
    • And one concern I think is very important to take seriously is the reality that without explicit counter-measures, machine learning & AI will re-inscribe & magnify existing systems of inequality and racism, sexism, homophobia and the like.

Here’s a cautionary tale about that last concern:

Last spring, Microsoft unveiled a Twitter bot named Tay, programmed to tweet like a teen. What could go wrong, right?

Tay was backed by Artificial Intelligence algorithms that were supposed to help the bot learn how to converse naturally on Twitter. But what happened is that the bot learned quickly from the worst racist, sexist corners of Twitter – and within 24 hours Microsoft had to shut the experiment down because the bot had started tweeting all kinds of sexist, racist, homophobic, anti-Semitic garbage.

Even, or especially, with those concerns in mind, I think we need to think about AI and machine learning and the implications for libraries.

My thinking about AI, machine learning, & libraries, is guided by 3 kinds of questions:

  1. What role can libraries play in making sure we don’t summon the demon; or at least that we have the tools to control or tame the demon?
  2. How might we leverage AI in support of our missions? How might AI help us do some of our work better?
  3. How might we support AI and machine learning in ways that are consistent with and natural evolutions of the long-standing missions and functions of libraries as sources of information and the tools, resources, expertise to use that information?

Let me address the 1st issue and offer some thoughts on libraries as demon-slayers in our AI future. First, we need to accept that AI and machine learning are becoming more prevalent in our daily lives, and in many learning and research contexts.

Then we have to think about which concerns around AI libraries and librarians are maybe especially well suited to addressing, like privacy, context, authority, and ensuring the data used to train AI is inclusive and diverse and of high quality.

This last one seems to me to be especially urgent. As an example, when Apple hired a new Director of AI research, he spoke about the promise of AI as a research tool, imagining: “If I ask you something about a particular thing, can your system basically go to Wikipedia, read a few different articles, learn some facts about the world, and provide you with the right answer?” As much as we all love and use Wikipedia, I suspect that makes some of us cringe. Wouldn’t it be better to have “your system” go to the actual scholarly literature on a topic?

The 2nd area we should think about is how we can leverage AI in our work.

A typical area we think about is reference – this is Steven Bell’s concern that AI chat bots will replace reference librarians.

There is also plenty of potential around using machine learning in search – the 2 articles that were assigned reading for this session cover that ground fairly well (see list of references at end of this post).

We might also imagine leveraging AI for recommendation systems, and for cataloging and organizing our collections.

What if we turned my original question around and asked what we would do if we librarians could read all the books?


If we really could absorb all the information in our collections and make some sense of it, what would we do? What could we do if we had the capacity to read all our books, and maybe all the books in our peer libraries, and derive patterns from them?

What would we do that we can’t do now? What would we do better that we already do?

Can thinking about AI and machine learning in that way help us conceive of ways to leverage the fact that machines actually can do that now?

Finally let’s talk about how machine learning and AI might change or be changing research; and how we might start to think about optimizing our libraries to support new kinds of research made possible by text & data mining, AI and machine learning.

Let me share 2 really interesting examples:

Prof Regina Barzilay and her students and colleagues at MIT are using machine learning to extract information and predictions from the unstructured data in tens of thousands of pathology reports. The approach is faster than human review, as accurate as humans, and based on a much larger amount of data than humans have access to.

Another example, which I learned about from my colleague Sara Lester, Engineering Librarian at Stanford, is GeoDeepDive, a tool for geologists that uses machine learning to extract data about rock formations that is buried in the text, tables, and figures of journal articles and web sites (sometimes called dark data).

GeoDeepDive is based on open source code that can be repurposed for other data sources. Should libraries be exploring how tools like this could help us extract even more meaning and information from deep within our collections?

I think it is important not just that we know about these kinds of efforts, but that we proactively ask where can AI and machine learning be leveraged in the service of better science?

And how do libraries leverage our resources and skills to ensure it really works – and is infused with and informed by values we care about (inclusion, privacy, democracy, social justice, authority, etc.)?

Where can we intervene to make sure the research based on AI and machine learning is as good as it can be?

We help students find the best books and articles for their learning; so can we help programmers find the best data for their algorithms to learn on?

Can we help them think about the questions they want their machine learning applications to answer? Can we help fit the data to the question?

A final string of thoughts, provocations, and questions that keep me up at night:

As I begin to fully appreciate the fact that machines really can read all the books, and can “learn” from them; I am convinced that we need to think more rigorously about reading.

What are the different ways of reading, and what are the various goals of reading?

What can we learn best, as individuals and/or as society, through human reading? What can we learn best through machine reading?

Can we start thinking about how to design libraries to maximize the unique payoffs of many different kinds of reading?

How can texts (and images, and data) be maximized for human discovery and reading? For discovery via algorithms and reading by machine learning applications?

What does it mean to maximize our collections for humans and what does it mean to maximize them for machines and algorithms?

OK – really wrapping it up now:

Machines can already read all the books. Or at least they can read all the books (or articles) that they can read.

(sidebar about how the proliferation of AI should compel us to double-down on mass digitization and on open access)

Trying to understand a little bit about AI and machine learning has taken me way outside my cognitive comfort zone, but I think it is the kind of thinking we need to do to be effective library leaders and to be effective stewards of the future of libraries, librarianship, and for those of us in research libraries, for the future of scholarship.

I think it will be crucial that we avoid the temptation to continue to serve primarily individual human readers and let the computer scientists worry about how to apply machine learning and AI to vast libraries of resources.

I think we would be wise to start thinking now about machines and algorithms as a new kind of patron  — a patron that doesn’t replace human patrons, but has some different needs and might require a different set of skills and a different way of thinking about how our resources could be used.


For further reading:

Early reading list on machine learning

In the preliminary report from the MIT Task Force on the Future of Libraries, we make several references to the importance of optimizing library content, data, and metadata for machine learning applications.

We imagine a repository of knowledge and data that can be exploited and analyzed by humans, machines, and algorithms. This transformation will accelerate the accumulation and validation of knowledge, and will enable the creation of new knowledge and of solutions to the world’s great challenges. Libraries will no longer be geared primarily to direct readers but instead to content contributors, community curators, text-mining programs, machine-learning algorithms, and visualization tools.

I am convinced that machine learning is going to have a major impact on the advancement of knowledge in lots of ways we can’t anticipate, and I want to understand it better. I am also convinced that without the intervention of folks who understand the biases built into our collections in terms of content, organization, and description, machine learning applications will re-inscribe and reify existing inequalities.

To that end, I’m trying to put together a reading list to get smarter about what machine learning is, what it can do for libraries, and what libraries can do to support and inspire creative, productive, just and inclusive applications of machine learning. Here’s my very incomplete initial list. Additional suggestions welcome in the comments.
