On Technology, Privacy, and Good Governance (Part 1)

This is a short piece explaining k-anonymity and its importance for people in the civil service. The technical information is very serious, and the background information is equally very serious and absolutely not meant to reflect on any current political officials. As the fine print in many a book says, any resemblance to real people is entirely coincidental.
Wink wink.


A Totally Made-Up Scenario

Imagine that you're a civil servant - say, a special government employee. You've been brought in to do a job for the federal government - namely, tracking government waste and helping the government be more efficient with its resources (people, finances, disbursements, you name it). When you're brought in, you boldly proclaim that you're going to put together an incredible team that digs up all the information related to the bureaucracy, that the people voted for the administration to have you do just that, and that you'll be very transparent about the results you find. You announce your hypothesis that the government has trillions in debt and that the savings you uncover will balance it out.

As you're doing your work on a fine late-winter morning, you decide to share a bit of the findings with the public. After all, you promised to be transparent, and the people want this! You take to your social media platform of choice and post a screenshot of the data from one of the spreadsheets your new hires have sent you. You're pleased with yourself. The evidence is now out there, and the country can see just how amazing your work is. Your phone pings with a notification and you smile to yourself as you pull it out, because it clearly means people are reposting the evidence you've shared. You pull the notification down and raise an eyebrow; your boss's lawyer and one of the data engineers have sent you a meeting request with the subject line “URGENT”. You put on your black baseball cap, sit at your desk, and log into the Teams link.

Your employee and the lawyer are going haywire. Apparently, the screenshot of government spending you shared might have broken some privacy laws, and they need you to take the post down fast before any of the opposition parties see it and take your boss to court over it. You're confused. No one's name is in the screenshot: it's just the cities the government funding went to, approximate dollar amounts, the year the money went out, and some vague keywords/memos about what the money was for. The interns told you nothing identifiable was there, and there isn't. You say as much and cross your arms; the lawyer sighs while the data engineer removes their glasses and rubs their temple.

“Seriously?” the engineer asks. “You've never heard of k-anonymity?”

Congratulations. You've now opened a legal can of worms for the administration.

K-Anonymity, Explained

In this totally arbitrary and not at all current (as of early March 2025) scenario, the person-brought-in-to-help-government-efficiency had no idea about something called k-anonymity. If you're unfamiliar with this term, that's okay. By the end of this post, you'll have a rough understanding of what it is, why it's important, and why you should start thinking about it when working with data or sharing data publicly.

(For the moment, we won't get into the super technical aspects of modifying datasets to ensure k-anonymity is reached. If there's interest in it, maybe I'll share a post in future with an example.)

The definition

In data science, k-anonymity is a term used to describe how successfully a dataset anonymizes individuals. According to Google Cloud, “A dataset is k-anonymous if quasi-identifiers for each person in the dataset are identical to at least k - 1 other people also in the dataset”.¹

At the simplest level: the more people in a dataset who share the same combination of attributes, the harder it is to single out any one of them.

Let's break down some of the terminology used.

(A detour to understand quasi-identifiers)

In data science, quasi-identifiers are pieces of information that identify an individual only when combined with other pieces of data.² (They are contrasted with direct identifiers, which, as their name suggests, outright identify individuals).

On their own, “Month: July”, “Location: West Country”, “Year: 1980”, “Country: England”, “Hair colour: brown”, or “Date: 31” don't really mean too much. July 31 is just a date, lots of people have brown hair, and if you're an American hockey fan, 1980 is a great year because the US beat the USSR in the Miracle on Ice. Those identifiers signify very little and can stand on their own without calling attention to anyone. But the moment you put them together (and maybe, just maybe, include an identifier called “odd bodily scars” with the value “lightning bolt” in there), you start to get the impression that a certain individual fits the bill far more than others³.

Direct identifiers are the opposite of quasi-identifiers. In this very fictional example, “Harry Potter” is very identifying. Names are usually examples of direct identifiers (unless you are yet another person with the fortune of being named for The Boy Who Lived).

The importance of quasi-identifiers is that an 'adversary' - someone who would like to find out the identity of an individual in a dataset - uses them to piece together who the individual is.

This comes with assumptions (some implicit, some not so much): you could assume the adversary knows the individual and wants to figure out which other attributes in the dataset belong to them, or you could assume the adversary already knows a few identifiers and wants to use them to link a record in a large dataset back to the individual.
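That second assumption is exactly the trap our fictional efficiency guru fell into. Here's a minimal sketch in Python of what that kind of linkage can look like - the rows, column names, and values are entirely made up for illustration, loosely modelled on the screenshot from the scenario above:

    # A released dataset with "nothing identifiable" in it.
    released_rows = [
        {"city": "Springfield", "year": 2024, "amount_band": "$1M-$5M", "memo": "community grant"},
        {"city": "Springfield", "year": 2024, "amount_band": "$1M-$5M", "memo": "community grant"},
        {"city": "Shelbyville", "year": 2023, "amount_band": "$5M-$10M", "memo": "research contract"},
    ]

    # What the adversary already knows from outside the dataset
    # (say, a press release mentioning a research contract in Shelbyville).
    background_knowledge = {"city": "Shelbyville", "memo": "research contract"}

    # Keep only the rows consistent with that outside knowledge.
    matches = [
        row for row in released_rows
        if all(row[key] == value for key, value in background_knowledge.items())
    ]

    # A single surviving row means the recipient has effectively been re-identified.
    print(len(matches))  # -> 1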

Back to k-anonymity

Okay, back to k-anonymity. Remember, the definition we were working with was this:

A dataset is k-anonymous if quasi-identifiers for each person in the dataset are identical to at least k - 1 other people also in the dataset.

The minimum meaningful k value one should aim for is 2. If k were equal to 1, it would mean that the person is uniquely identifiable in the dataset (that is, they wouldn't be anonymous at all!). The larger the value of k, the more anonymous each person in the dataset is.

In simple English, here's what this means:

If you were to decompose the attributes of a bunch of people in a dataset and wanted to anonymize them in some way, then for any given person, the dataset would be k-anonymous (where k is a number) if at least k - 1 other people in the dataset shared the same quasi-identifiers.
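If you find code easier to read than prose, here's a minimal sketch of the same idea in Python (the rows and column names are hypothetical, purely for illustration): group the rows of a dataset by their quasi-identifier values and take the size of the smallest group - that smallest group size is your k.

    from collections import Counter

    # Hypothetical records: each dict is one person's row in the released dataset.
    rows = [
        {"age_band": "30-39", "region": "West Country", "hair_colour": "Brown"},
        {"age_band": "30-39", "region": "West Country", "hair_colour": "Brown"},
        {"age_band": "30-39", "region": "Devon",        "hair_colour": "Red"},
    ]

    def k_anonymity(rows, quasi_identifiers):
        """Return the size of the smallest group of rows that share
        identical values for every quasi-identifier."""
        groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
        return min(groups.values())

    # The Devon/Red row is unique, so the whole dataset is only 1-anonymous.
    print(k_anonymity(rows, ["age_band", "region", "hair_colour"]))  # -> 1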

Okay… a little clearer, but maybe it's time for an example.

In another universe far far away…

Imagine you're the Dark Lord of a parallel world in the year 1975. You've received a prophecy that a child born in the next little bit is going to be your undoing. Luckily, you have some data analysts on your team, and they're poring through scripts and birth dates to figure out which possible kid you must…uh…pay a visit to. The prophecy is vague and there's not much detail to work with, but you've been tipped off that the child will have ruddy hair. After pulling out a thesaurus, you learn that ruddy means 'red'. Your newest intern (a very blonde rich guy with a cane) mentions that his contacts allege there are three individuals it might be.

Birth Month   Birth Location   Birth Year   Birth Country   Hair Colour   Eye Colour
July          West Country     1980         England         Brown         Green
March         Devon            1980         England         Red           Blue
September     Oxfordshire      1979         England         Brown         Brown

The evidence is clear. At some point around January 1980, you and your crew need to be in the Devon area of England, keeping an eye out for parents with reddish hair (or at least a mother who is heavily pregnant). Around the third week of February, you probably want to start thinking about befriending the parents so they trust you. And on March 1, if you've done your homework, once you learn which woman has given birth, it's time to snatch the newborn and raise him in your own environment, ensuring that he never becomes your undoing and stays aligned with you for eternity. And you didn't even have to use an Unforgivable Curse!

It's a simple example, but you can probably already see what's going on. How did the Parallel Dark Lord (let's call him Lord Wall-de-art) figure out, with pinpoint certainty, which baby in the dataset was the one?

Aside from it being glaringly obvious, Lord Wall exploited the lack of k-anonymity in the data to re-identify the child…who just so happened to be Ronald Weasley's parallel-universe self.

First Name   Last Name   Birth Month   Birthday   Birth Location   Birth Year   Birth Country   Hair Colour   Eye Colour
Harry        Potter      July          31         West Country     1980         England         Brown         Green
Ron          Weasley     March         1          Devon            1980         England         Red           Blue
Hermione     Granger     September     19         Oxfordshire      1979         England         Brown         Brown

In this dataset of individuals, k-anonymity was not achieved: only ONE individual had the attribute the Dark Lord was tipped off about (red hair), so that record was unique and k was equal to 1. If the prophecy had specified brown hair instead, the Dark Lord would have had to decide between the 1979 baby and the 1980 baby (or, if he didn't find the idea too horrifying, go after both babies instead of one). With brown hair as the tipped-off attribute, there would be a group of at least 2 matching individuals in the dataset.

Still, with respect to hair colour as the quasi-identifier, the dataset as a whole is only 1-anonymous: k is set by the smallest group, and the smallest group (red hair) contains exactly one person.
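To make that arithmetic concrete, here's a quick Python sketch using the same three made-up records as the table above (quasi-identifiers only), counting the group sizes Lord Wall relied on:

    from collections import Counter

    # The three records from the table above, quasi-identifiers only.
    babies = [
        {"birth_month": "July",      "birth_location": "West Country", "birth_year": 1980, "hair_colour": "Brown"},
        {"birth_month": "March",     "birth_location": "Devon",        "birth_year": 1980, "hair_colour": "Red"},
        {"birth_month": "September", "birth_location": "Oxfordshire",  "birth_year": 1979, "hair_colour": "Brown"},
    ]

    # Group sizes when hair colour alone is treated as the quasi-identifier.
    hair_groups = Counter(row["hair_colour"] for row in babies)
    print(hair_groups)                # Counter({'Brown': 2, 'Red': 1})

    # k is the size of the smallest group - the unique red-haired record drags it down to 1.
    print(min(hair_groups.values()))  # -> 1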

Key takeaways for busy experts

  • K-anonymity is a measure of how well a dataset anonymizes the individuals in it: the bigger the k, the harder it is to single anyone out.
  • People who work with data and share it with the public should care about privacy and about anonymizing their datasets.
  • When sharing data you believe is already de-identified, it pays to take a moment to think like a malicious actor and ask yourself “Could someone do harm with the information I'm sharing?”

Wrapping Up

This post was longer than expected, and we've only just begun to scratch the surface! In the next one, we'll dive a bit deeper into k-anonymity, try to anonymize a dataset a little more with something called “suppression”, and put on our biggest wizarding hat to prevent The Dark Lord from coming after our would-be protagonists with the help of some anonymization tools.