July 14, 2023
Robert On

There’s been a lot of buzz around Large Language Models (LLMs) recently (see our recent blog post on this). So we thought this might be a good time to share how we have tried to leverage machine learning models in the core of our work at the Agency Fund – grantmaking.

The Agency Fund has, to date, received 3112 applications for funding, of which 109 have been accepted and funded. We try our best to carefully review each of these applications to determine which are the most promising, impactful, agency-centered programs to fund. Our process is certainly not perfect, and we’re continually trying to improve. We’ve tried various approaches that examined the consistency of our ratings between reviewers, the weights of various criteria in determining a successful application1, and, the focus of this post, building an LLM-based predictive model that uses our previous funding decisions to score future applications.

As you might guess, the success of such a predictive model will largely depend on how good our decisions were to begin with. It will also depend on the quality of the data we have available. We often have a hard time wrangling our reviewers – they don’t always reliably and completely label each application according to the criteria provided. Until we have a stable evaluation strategy, and more commitment to data labeling, this will continue to be a work-in-progress. 

Why are we doing any of this? Part of the interest is in improving the fairness and reliability of our evaluations. In practice, another pressing constraint is reviewers’ time: it takes many person-hours to determine eligibility for funding. If we could determine which applications are likely to be more promising than others, we could prioritize them over applications that are clearly not a good fit for Agency Fund funding. Enter the predictive model…

Since our applications are mostly text, we can use the tools offered to us by LLMs to make sense of our application data. We trained an algorithm that, in theory, could help us triage applications for review. We leveraged a model trained on 130GB of English news articles and built a neural network consisting of three hidden layers with 2048, 512, and 128 ReLU nodes, respectively. We trained the model on our first round of grant-making (with data consisting of application text plus a binary score provided by human reviewers). We plotted the resulting model performance on out-of-sample validation data, achieving a passable AUC of 0.8 (AUC, the area under the receiver operating characteristic curve, measures the model’s predictive ability: an AUC of 0.8 means the model ranks a randomly chosen accepted application above a randomly chosen rejected one about 80% of the time).
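For intuition, here is a minimal sketch of such a classifier, not the actual Agency Fund pipeline. It assumes the application text has already been converted into fixed-length embeddings by a pretrained language model; random vectors and labels stand in for real data, and the embedding width is a placeholder.

```python
# Hypothetical sketch: a three-hidden-layer ReLU classifier over text
# embeddings, evaluated with AUC. Data here is synthetic stand-in data.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
EMBED_DIM = 256  # placeholder; the real width depends on the embedding model

# Fake train/validation sets: one embedding vector + binary funding
# decision per application.
X_train = rng.normal(size=(200, EMBED_DIM))
y_train = rng.integers(0, 2, size=200)
X_val = rng.normal(size=(50, EMBED_DIM))
y_val = rng.integers(0, 2, size=50)

# Three hidden ReLU layers of 2048, 512, and 128 units, as in the post.
clf = MLPClassifier(hidden_layer_sizes=(2048, 512, 128),
                    activation="relu", max_iter=20, random_state=0)
clf.fit(X_train, y_train)

# Score out-of-sample applications and compute AUC: the probability that
# a random positive outranks a random negative.
scores = clf.predict_proba(X_val)[:, 1]
auc = roc_auc_score(y_val, scores)
print(f"validation AUC: {auc:.2f}")
```

With random labels the AUC here hovers around 0.5; on real, learnable data the same setup could plausibly reach the 0.8 reported above.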

We simulated an alternative evaluation workflow, in which we would prioritize the review of applications based on the model’s scores, to demonstrate how we might improve screening of applications given fixed time constraints (figure below).

In this plot, the y-axis represents the number of “accepted” (or fundable) applications we would have missed if we stopped reviewing after the number of applications specified on the x-axis. The red line signifies the policy where we read them in the order they were submitted through our portal (“created”). The blue line is the policy where we read them in order of the prediction score (“pred”). After 20 or so submissions, reading them in order of predicted score begins to pay off significantly.
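The simulation behind a plot like this is straightforward to reproduce. The sketch below is a hypothetical re-creation, not the original notebook: it generates synthetic labels and noisy scores, then counts the fundable applications missed after reviewing the first k, under the submission-order (“created”) and score-order (“pred”) policies.

```python
# Hypothetical triage simulation with synthetic data: compare missed
# fundable applications under two review orderings.
import numpy as np

rng = np.random.default_rng(1)
n_apps = 100
labels = rng.random(n_apps) < 0.15           # ~15% of applications fundable
scores = labels * 0.5 + rng.random(n_apps)   # noisy scores, correlated with labels

created_order = np.arange(n_apps)            # submission order ("created")
pred_order = np.argsort(-scores)             # highest predicted score first ("pred")

def missed_after(order, labels, k):
    """Fundable applications not yet seen after reviewing the first k."""
    return int(labels.sum() - labels[order[:k]].sum())

for k in (20, 50, 80):
    print(f"after {k} reviews: created misses "
          f"{missed_after(created_order, labels, k)}, "
          f"pred misses {missed_after(pred_order, labels, k)}")
```

Plotting `missed_after` against k for each ordering reproduces the shape of the figure: both curves fall to zero once every application is read, but the score-ordered curve drops much faster.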

Up to this point, this exercise has been largely simulated and academic. What came of this model? The entire process is captured in this notebook – right down to the automated scoring of new applications as they arrive (an alternative BERT-based model was tried as well). Are these scores being used? Occasionally, but mostly not. One reviewer used them to order applications for reading, pushing the most highly-scored submissions to the top of the pile. Her rationale was to focus her attention on proposals most likely to be fundable, while still giving every submission a fair read (even if she was pretty tired by the time she reached the lower-scored submissions). Over time, we have opted not to maintain or refine the model.

This is not an uncommon experience, especially in the social sector. Why? Perhaps because money talks. In the private sector, you might build a model to rate the quality of an advertisement based on click-through-rates. This has a clear, obvious utility: a better ad will immediately make more money. In the social sector, these types of market pressures and incentives are not so immediate. And the impacts of adopting algorithmic policies are unclear, even controversial. 

At the Agency Fund, we ultimately do intend to read every application. So the simulated gains in the figure above may not be there after all. Maybe we’re just not ready to hand these types of complex decisions over to the AI.

The lesson I’ve learned after building over a dozen of these models in the social sector is that the incentives, use case, impacts, culture, and buy-in to these types of approaches need to be figured out at the beginning. Otherwise, we’re just signaling without substance. 


 1 The estimation of these weights was limited by missing labels from reviewers.