Jonathan Soma is Knight Chair in Data Journalism at Columbia University, where he is director of both the Data Journalism MS and the summer intensive Lede Program. He lectures at Columbia on everything from basic Python and data analysis to interactive visualisation and machine learning. As an educator, programmer, and designer, he focuses on making unapproachable concepts accessible. He has worked with ProPublica, The New York Times, and others.
Summer School of Investigative Reporting 2025 Sessions
Data journalism: Gain a foundational understanding of the tools and processes of data journalism
Data journalism is a broad field, covering everything from visualisation to machine learning and AI. In this session, we will examine a variety of data sources and tools that allow journalists to analyse and visualise data in all its forms, including spreadsheets, maps, and text documents. Since each data-driven story presents its own challenges and demands specific approaches, we’ll also look at how to keep building your skills once you’ve left the programme.
Advanced Data Journalism (Session 1): Data analysis and cleaning with AI and Python
Python is the ideal tool for data analysis when you’ve outgrown Excel and Google Sheets. In this workshop, we will explore how Python, with the helpful suggestions of an AI mentor, can transform your data journalism workflow.
We’ll learn to create reproducible, easy-to-understand “notebooks” that are powerful, yet simple to share and review. You’ll discover how to open various data formats – CSV files, Excel spreadsheets and web-scraped data – and master techniques to clean and convert them into analyzable formats.
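To give a flavour of what those notebook cells look like, here is a minimal sketch using the pandas library. The filenames and column names are hypothetical placeholders, not datasets from the session.

import pandas as pd

# Read a CSV file and an Excel sheet into DataFrames
inspections = pd.read_csv("inspections.csv")
budgets = pd.read_excel("budgets.xlsx", sheet_name="2024")  # Excel works the same way, via a different reader

# Typical cleaning steps: tidy column names, fix dates, turn "$1,200" into a number
inspections.columns = inspections.columns.str.strip().str.lower()
inspections["inspection_date"] = pd.to_datetime(inspections["inspection_date"], errors="coerce")
inspections["fine_amount"] = (
    inspections["fine_amount"]
    .astype(str)
    .str.replace(r"[$,]", "", regex=True)
    .astype(float)
)
inspections = inspections.dropna(subset=["inspection_date"])

# A first pass at analysis: total fines by year
print(inspections.groupby(inspections["inspection_date"].dt.year)["fine_amount"].sum())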
By the end of this session, you’ll have the skills to handle larger datasets, conduct more complex analyses, and uncover stories that might otherwise remain hidden in the data. This workshop will empower you to take your data journalism to the next level, opening up new possibilities for insightful reporting.
Advanced Data Journalism (Session 2): An introduction to scraping with Python
Much of the world’s most interesting information hides in plain sight on websites, inaccessible to simple downloads. Web scraping offers a powerful solution, automatically extracting this data and transforming it into structured, usable datasets.
In this workshop, you’ll learn to harness web scraping for journalism, covering techniques like the following (sketched in code after this list):
– Pulling data tables directly into Python
– Extracting news articles and search results
– Downloading hundreds of PDFs
– Automatically clicking “next page” dozens of times
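As an illustration of the first and last techniques, here is a rough sketch using the requests, BeautifulSoup, and pandas libraries. The URLs and HTML selectors are hypothetical stand-ins; a real site will need its own.

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Pull an HTML data table straight into a DataFrame
violations = pd.read_html("https://example.com/violations")[0]

# Walk through paginated search results, collecting headlines as we go
headlines = []
for page in range(1, 11):  # the code equivalent of clicking "next page" ten times
    response = requests.get("https://example.com/search", params={"q": "zoning", "page": page})
    soup = BeautifulSoup(response.text, "html.parser")
    for link in soup.select("h2.result a"):
        headlines.append({"title": link.text.strip(), "url": link["href"]})

print(violations.head())
print(pd.DataFrame(headlines).head())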
We’ll also explore advanced scenarios, such as collecting social media posts and setting up automated scrapers for regular data updates – daily, weekly, even hourly.
By the end of this session, you’ll have the skills to significantly enhance the reach of your investigations.
AI: Do’s and Don’ts with Jonathan Soma: How to leverage AI responsibly in a modern, data-driven investigation
While every corner of the newsroom is buzzing with the possibilities of AI, it holds special potential for investigative journalists. We’ll look at:
– Automatically sorting and categorizing documents
– Extracting key information from texts
– Transcribing interviews privately and accurately (see the sketch after this list)
– Leveraging AI to learn new skills
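As one concrete example from the list above, interviews can be transcribed entirely on your own machine with the open-source Whisper model, so the audio never leaves your laptop. This is a minimal sketch assuming the openai-whisper package is installed; the filename is a placeholder, and it is one common approach rather than the only tool we will discuss.

import whisper

model = whisper.load_model("small")          # downloads the model once, then runs offline
result = model.transcribe("interview.mp3")   # path to your own recording

print(result["text"])                        # the full transcript as plain text
for segment in result["segments"]:           # timestamped chunks, handy for finding quotes
    print(f"[{segment['start']:.1f}s] {segment['text']}")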
But for every benefit of AI, there’s always a potential pitfall. To ensure we are reporting responsibly, we’ll also examine how these tools are created and how they work, which will help us know when to set them aside in favour of traditional journalism and manual review.