From prototype to production: building the PDF importer with Southwark Council

Council content teams were spending days converting PDFs that nobody was ever going to make accessible manually. We built something that does it in minutes, and then we gave it away.

Screenshot of Southwark's PDFs imported into Drupal.

Council content teams have a PDF problem. Not just a lot of PDFs, thousands of them. PDFs are how services submit documents to be published. They arrive every week, they're not accessible, and converting them to HTML manually takes hours. Sometimes days. It's the kind of work that drains good people.

We'd already built the LocalGov Publications module with Hammersmith & Fulham to give councils a way to publish structured, accessible HTML content. But when councils saw it, they assumed it would import their existing PDFs. It didn't; it was for creating new publications. That gap was the brief.

We built a prototype, showed it at a LocalGov Drupal community event, and Southwark Council saw it. They wanted to fund it properly. So we got together.

The challenge

Southwark's content team didn't want to spend hours every week copy-pasting from PDFs into their CMS. The documents were rich with information but locked in a format that wasn't searchable, wasn't editable, and wasn't accessible. Multiply that across a team, across every service area that submits a PDF, and you've got a serious drain on time that could be spent on better work.

The prototype we'd built was a starting point. It extracted text and dumped it into a single node. It fixed the worst of the line-break chaos PDFs create. But it wasn't production-ready: no sensible page structure, no images, no links, no way to customise the process per council.

Southwark needed something a content editor could actually use. Not a developer tool.

What we built

A pipeline architecture

The importer is built around three plugin types: extract, transform, and save. An extract plugin pulls content from the file. Transform plugins process it — fixing line breaks, handling images, optionally calling an AI. A save plugin creates the publication nodes with the right references.

Councils can configure these into import pipelines to suit how they work. A council with a house style they want to enforce can add a transform step. A council that doesn't want AI in the loop doesn't have to use it.
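To make the shape of that concrete, here is a minimal sketch of an extract, transform, save pipeline. It is written in Python purely for illustration: the module itself is a Drupal module, and every name below (Document, Pipeline, extract_plain, fix_line_breaks, save_as_pages) is hypothetical rather than taken from its code.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Document:
    """Content moving through the pipeline (hypothetical shape)."""
    text: str
    images: List[bytes] = field(default_factory=list)


# Hypothetical plugin signatures, standing in for the three plugin types.
Extract = Callable[[str], Document]
Transform = Callable[[Document], Document]
Save = Callable[[Document], List[str]]


def extract_plain(source: str) -> Document:
    # Stand-in for a PDF extractor: treats the input string as extracted text.
    return Document(text=source)


def fix_line_breaks(doc: Document) -> Document:
    # Join lines broken mid-sentence; blank lines remain paragraph breaks.
    paragraphs = re.split(r"\n\s*\n", doc.text)
    joined = [
        " ".join(line.strip() for line in p.splitlines() if line.strip())
        for p in paragraphs
    ]
    return Document(text="\n\n".join(p for p in joined if p), images=doc.images)


def save_as_pages(doc: Document) -> List[str]:
    # Stand-in for node creation: returns one "page" per paragraph.
    return doc.text.split("\n\n")


@dataclass
class Pipeline:
    extract: Extract
    transforms: List[Transform]
    save: Save

    def run(self, source: str) -> List[str]:
        doc = self.extract(source)
        for transform in self.transforms:
            doc = transform(doc)
        return self.save(doc)


pipeline = Pipeline(extract_plain, [fix_line_breaks], save_as_pages)
pages = pipeline.run("Bin collections\nchange in May.\n\nCheck your\ncollection day.")
print(pages)
```

Because the transforms are just an ordered list, a council that wants an extra step adds one entry, and a council that wants no AI step simply leaves it out.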

Links and images that survive the import

PDFs strip structure. URLs become plain text. Images exist as page-rendered graphics with no metadata. We wrote extract logic to find URLs in the text and reattach them as real links. Images we can extract are pulled out and saved to Drupal's media library, then attached to the page as proper media entities.
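As a rough illustration of the link recovery, the sketch below finds bare URLs in extracted text and wraps them in anchor tags. It is Python rather than the module's own code, and the URL pattern, the trailing-punctuation rule, and the function name linkify are all assumptions made for the example.

```python
import html
import re

# Assumed pattern: an http(s) URL runs until whitespace, a quote, or an angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")
# Assumed rule: punctuation at the very end belongs to the sentence, not the URL.
# This is a simplification; it would also trim a closing bracket that is part of a URL.
TRAILING_PUNCTUATION = ".,;:!?)"


def linkify(text: str) -> str:
    """Return HTML-escaped text with bare URLs wrapped in anchor tags."""
    parts = []
    last = 0
    for match in URL_PATTERN.finditer(text):
        url = match.group(0).rstrip(TRAILING_PUNCTUATION)
        end = match.start() + len(url)
        # Escape the plain text that sits between the previous link and this one.
        parts.append(html.escape(text[last:match.start()]))
        safe = html.escape(url, quote=True)
        parts.append(f'<a href="{safe}">{safe}</a>')
        last = end
    parts.append(html.escape(text[last:]))
    return "".join(parts)


print(linkify("Apply at https://example.org/parking."))
```

The full stop at the end of the sentence stays outside the link, which is the kind of detail that matters once the text becomes a real web page.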

AI as an optional step

The hardest part of PDF conversion isn't getting the text out. It's that PDFs have no concept of flow. Content that looks like a heading is just bigger text. A list is just lines with similar indentation. We use an LLM to re-introduce that structure — headings, paragraphs, lists, tables — and to split the document into sensible pages.

In the prototype, we sent one AI request per page. That gave inconsistent heading hierarchies and duplicated titles because the model couldn't see the whole document. We moved to one request per document. The results are more consistent, page breaks land in sensible places, and the model can generate accurate titles for each section. For long documents it runs as a background process.
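The difference between the two approaches can be sketched as follows. This is an illustrative Python example, not the module's implementation: the page delimiter, the prompt wording, and the function names are hypothetical, and a stub stands in for the AI provider.

```python
from typing import Callable, List

PAGE_BREAK = "===PAGE==="  # hypothetical delimiter the prompt asks the model to emit


def build_document_prompt(instructions: str, pdf_pages: List[str]) -> str:
    """Combine every extracted PDF page into one prompt so the model sees it all."""
    body = "\n\n".join(
        f"[PDF page {number}]\n{text.strip()}"
        for number, text in enumerate(pdf_pages, start=1)
    )
    return (
        f"{instructions}\n"
        f"Separate the web pages you produce with a line containing only {PAGE_BREAK}.\n\n"
        f"{body}"
    )


def split_response(response: str) -> List[str]:
    """Split the single response into web pages, dropping empty chunks."""
    return [chunk.strip() for chunk in response.split(PAGE_BREAK) if chunk.strip()]


def import_document(
    pdf_pages: List[str], instructions: str, complete: Callable[[str], str]
) -> List[str]:
    # One call for the whole document, rather than one call per PDF page.
    return split_response(complete(build_document_prompt(instructions, pdf_pages)))


calls: List[str] = []


def fake_model(prompt: str) -> str:
    # Stand-in for a real provider call; records the prompt and returns a canned answer.
    calls.append(prompt)
    return f"<h1>Overview</h1>\n{PAGE_BREAK}\n<h1>Fees</h1>"


web_pages = import_document(
    ["Intro text", "Fee table"], "Convert to structured HTML.", fake_model
)
print(len(calls), web_pages)
```

Note that the number of web pages coming out is decided by the model's page breaks, not by the number of PDF pages going in.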

The AI step is optional and configurable. The default prompt was co-written with Andy Broomfield (Brighton) and Justin Ashworth (Southwark). You can change it to simplify language for an easy-read version, translate the content, or apply a style guide.

Built properly for open source

Because PDFs are chaos, we wrote a proper test suite. Fourteen real-world PDFs, imported on every test run. Test data is kept in a separate module so it doesn't land in every council's production install. The module supports any AI provider the Drupal AI module supports — which at v1.0 is twenty-plus, from Anthropic to Azure to Ollama.

The results

What took hours now takes minutes. Often under one minute.

Southwark Council's Angie Forson, Web and Digital Programme Lead:

"The AI PDF Importer wasn't just about extracting data from documents; it was about rethinking how we interact with data."

"This project has been a model example in collaborative innovation, from co-designing with frontline staff to sharing learnings with the Local Gov Drupal Community and beyond."

The project was covered by the Drupal Association, featured in UKAuthority, written up by LOTI, and picked up by Drupal4Gov EU. It was shortlisted for the AIImpact Awards 2026.

v1.0.0 shipped in early 2026. West Lindsey District Council are funding v1.1, which is coming very soon.

Our reflection

The thing we got right early was the plugin architecture. It would have been easy to build something that worked for Southwark and only Southwark. Instead we built something configurable enough that every council can adapt it without forking the codebase. That's what makes it a community contribution rather than just a client project.

Along the way we fixed a handful of bugs in smalot/pdfparser, the open source PHP library we use to extract text from PDFs. Obviously we contributed the fixes back upstream. It's a small thing, but it's how open source is supposed to work: you find a problem, you fix it, you don't keep the fix to yourself.

The AI piece is genuinely useful: not as a magic wand, but as a way to get an editor to 80% without manual effort. The remaining 20% still needs a human. That's fine. The point was never to remove editors from the process; it was to make sure they weren't spending their time on copy-paste.

The module is open source and available on Drupal.org. If you want to try it on your site, or you've got a backlog of PDFs you'd like to clear, we'd love to help.

Email: hello@wearechicken.co.uk
Code: https://www.drupal.org/project/localgov_publications_importer
