I’ve spent much of the past year prototyping, productionising, and developing the efficient learning algorithm for BBC Bitesize. Our ambitions for the Bayesian framework that underpins the Bitesize quiz have always gone beyond just a quiz.

Once we got the initial prototype of the quiz up and running, we turned our attention to diversification, and specifically towards adapting the algorithm into a cold-start recommender for BBC News and Sport.

Cold Start Recommender

The term “cold start” refers to the lack of data associated with new users of a product. How do you recommend something to someone when you know nothing about them?

In this article, I’m going to give an overview of Bayesian statistics and talk about how we utilise user data to personalise and improve BBC Bitesize’s Efficient Learning GCSE quiz.


Let’s start with an example for context.

Suppose you flip a coin 3 times and get heads each time. Is the coin biased?

For most people the answer will be no, but why?
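
One way to make that intuition concrete is a Beta-Binomial update. The sketch below is illustrative only and is not taken from the article: the function name and the two priors are assumptions chosen to contrast someone with no opinion about the coin against someone who strongly expects coins to be fair.

```python
def posterior_heads_probability(prior_heads, prior_tails, heads, tails):
    """Posterior mean of P(heads) under a Beta(prior_heads, prior_tails) prior.

    The prior counts act like imaginary flips seen before the experiment.
    """
    a = prior_heads + heads
    b = prior_tails + tails
    return a / (a + b)


# Assumed priors, for illustration only.
flat = posterior_heads_probability(1, 1, heads=3, tails=0)         # no prior opinion
sceptical = posterior_heads_probability(50, 50, heads=3, tails=0)  # expects a fair coin

print(f"flat prior: {flat:.3f}")
print(f"strong fair-coin prior: {sceptical:.3f}")
```

With the flat prior, three heads pull the estimate up to about 0.8. With the strong fair-coin prior it barely moves from 0.5, which is roughly how most people reason about a coin they have no reason to distrust.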

Can we tell whether kids are in school just from how certain TV shows are viewed on BBC iPlayer? Spoiler: yes! We can also identify when schools are closed due to snow!

This was the first project I worked on at the BBC, and remains one of the most fun to have delivered!

While the information on school holidays is available on the internet, it’s presented on local government websites that vary in format from council to council, making web scraping impractical.

Use cases

The BBC produces a lot of content for children and being able to effectively and appropriately deliver this…

This was exactly the situation that our Data Science Team faced during the Covid-19 crisis in March 2020. Find out how we created an ML quiz engine with no data or questions.

To find out how we did, try it for yourself!

In the first week of the UK’s lockdown, a co-worker shared this article from the Washington Post. I was inspired by the great simulations of how diseases spread and the demonstrations of how some social distancing measures can help to reduce the spread.

After reading the article, I decided to recreate the simulations and add far more detail to them. In this article I will discuss some particular scenarios of interest. …

A beautiful data visualisation speaks a thousand words. Inspired by a talk I saw by David Spiegelhalter last year, I’ve created an easy-to-use Python package to create visually stunning dot plots.

Note from the editors: Towards Data Science is a Medium publication primarily based on the study of data science and machine learning. We are not health professionals or epidemiologists, and the opinions of this article should not be interpreted as professional advice. To learn more about the coronavirus pandemic, you can click here.

Visualising Covid-19 Symptoms

Due to the global pandemic currently affecting every part of the world, I thought…

I’ve previously discussed how term frequency-inverse document frequency (tf-idf) can be used to calculate content association metrics between individual pieces of BBC content. By using our customer browsing data, we can study the overlap in users between each pair of content items and deduce how similar they are. One challenge with such an approach is that popular content items will share a user base with many other content items. Using tf-idf weights mitigates this problem by penalising the weights associated with popular content.
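
As a rough illustration of that penalty, the sketch below scores each ordered pair of content items. It is an assumed formulation, not the method from the article: "term frequency" is taken to be the share of item A's audience that also consumed item B, and "inverse document frequency" is the log of total users over B's audience size. The function name and toy audiences are hypothetical.

```python
import math


def tfidf_association(audiences, total_users):
    """Score how strongly item a is associated with item b.

    audiences: dict mapping item name -> set of user ids who consumed it.
    Assumed weighting (illustrative): tf = overlap share, idf = log(N / |b|).
    """
    scores = {}
    for a, users_a in audiences.items():
        for b, users_b in audiences.items():
            if a == b or not users_a or not users_b:
                continue
            tf = len(users_a & users_b) / len(users_a)      # share of a's audience also on b
            idf = math.log(total_users / len(users_b))      # small when b is popular
            scores[(a, b)] = tf * idf
    return scores


# Toy data: "popular" reaches 8 of 10 users, the niche items only a few.
audiences = {
    "popular": {1, 2, 3, 4, 5, 6, 7, 8},
    "niche_a": {1, 2},
    "niche_b": {1, 2, 9},
}
scores = tfidf_association(audiences, total_users=10)
print(scores[("niche_a", "popular")], scores[("niche_a", "niche_b")])
```

Everyone who consumed `niche_a` also consumed both other items, so the raw overlap is identical. The idf factor is what ranks `niche_b` above `popular` as the more informative association.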

With the added application of Louvain clustering we can identify collections of different content enjoyed by the same people…

The BBC produces fantastic content that appeals to a mass audience, such as Killing Eve and Bodyguard on BBC iPlayer, the Danger Mouse games for CBBC, and Match of the Day on BBC Sport, to name a few. While it’s great that we can produce content that is enjoyed by so many different people, it does create data science challenges around understanding the personalities and preferences of individuals. Episode 1 of Killing Eve, for example, attracted a whopping 26% of the TV audience during first transmission.

Term frequency-inverse document frequency (tf-idf) is a metric most associated with text analysis…

BANs, or big ass numbers, are a visualisation technique that is frequently overlooked in favour of unnecessarily complex graphs. While some stakeholders might need to know sales figures going back 18 months and need lots of stats about how last week performed, for others a top-level stat that can identify good or bad weeks will suffice and have more impact. In this post I’m going to share a nice BAN-based doughnut plot that I’ve developed in Python to do just that, and I’ll apply it to two common datasets.

Air Passengers Dataset

Let’s start with the air passengers dataset. This set contains…

So you’ve done some clustering using k-means; you’ve scaled your data, applied PCA, checked the clusters using the elbow and the silhouette method and you’re pretty confident that your model is giving you the absolute best clustering you can get. But just how much trust can you place in the classification of individual data points? That’s what I’ll be discussing here.

Most metrics for assessing a clustering algorithm are measures of the global success of your model. But what if you’re not looking to classify every data point but are looking to identify specific data points that definitely fit into…

Matt Crooks

Senior Data Scientist at TypeForm
