Generating unique IDs with the Snowflake algorithm

@
12 min. read | 1337 views

Unique IDs are pretty much used everywhere and in every application. They are used to identify a user, a product, a transaction, a session, etc. That way we can, for example, assign a transaction made to a specific user, and not by mistake another one. In this article, I’ll show you how I generate unique IDs in my applications and give them a second usage as well.

How to get unique IDs?

There are several ways of getting a unique ID per user. Obviously some people probably do stuff like

Though I don’t really recommend that as it is obviously not really unique. You can, and will, end up with twice the same ID generated. Of course you can check in the database if that ID already exists and generate a new one if it’s the case, but you get my point

Auto incrementing integer

Although this is a very basic and widely used method, I’d rather not use this. An example on how this looks like in a database would be something close to that:

1 john some_hash 1667329399
2 jane some_hash 1667329404
3 jack some_hash 1667329412

Here we can clearly see the IDs incrementing every time one by one.

UUID

UUIDs are a very common way of getting a unique ID. They are 128-bit numbers that are unique. UUIDs are very useful because they are unique, yes there is a tiny chance you can get the same ID generated twice, and they are not sequential. They are also very easy to generate. We can, for example, generate one via the terminal using the following command:

As you have expected, this UUID will take space in your database but that doesn’t really matter that much, your database ends up looking like that when you use UUIDs:

45915234-74BE-499A-B7D5-D6D4882DA5B7 john some_hash 1667329399
885A26B8-C7CA-4C84-A871-0E89AF4997BD jane some_hash 1667329404
689E682B-35C8-4594-A0D7-5893626B7B1D jack some_hash 1667329412

Now let me introduce to you Spaceflake, a distributed generator to create unique IDs with ease; inspired by Twitter’s Snowflake

What is a Spaceflake?

A Spaceflake is very simple, it’s an ID based on the Snowflake algorithm of Twitter, which they are actively using, and it is slightly edited to fit my needs. The algorithm is pretty simple:

The structure of a Spaceflake

As you can see, a Spaceflake is composed of 4 main parts:

• Timestamp: The timestamp is the milliseconds elapsed since the base epoch, it is not necessarily the Unix epoch, when the ID was generated.
• Node ID: The node ID represents the ID of the node that generated the Spaceflake. It is a number between 0 and 31.
• Worker ID: The worker ID represents the ID of the worker that generated the Spaceflake. It is a number between 0 and 31.
• Sequence: The sequence is the number of Spaceflakes generated by the worker on the node. It is a number between 0 and 4095.

With these specifications of the algorithm, we can theoretically generate up to 4 million unique IDs per millisecond

Workers & Nodes = Spaceflake Network

A Spaceflake Network is a very basic concept where you have multiple independent nodes that themselves consist of multiple workers. These workers are the ones that can generate Spaceflakes.

Ideally a Spaceflake Network represents your entire application, or company. Each node represents a single service/server or application within the company, and each worker represents a single process which can generate a Spaceflake for a specific purpose. This way you can easily identify where a Spaceflake was generated by looking at its node ID and worker ID.

A Spaceflake Network

In the example above we can can consider Node 1 as being the API/backend of your application. The Worker (ID: 1) would be responsible for generating Spaceflakes for user IDs. The Worker (ID: 2) would be responsible for generating Spaceflakes for blog post IDs. That way we can easily identify where a Spaceflake was generated and for what purpose.
Then we also have the Node 2 which may be the logging system of your entire infrastructure, and the IDs generated would be generated by the Worker (ID: 1).

In the end you are free to use them as you wish, just make sure you use these nodes and workers to be able to identify the Spaceflake - which is also the idea behind it.

Why did I create Spaceflakes?

“Why on Earth did you create this???”

As a matter of facts lots of people have already asked me this…

To explain my decision, let’s go back to the example of the database above with UUIDs. I intentionally added a created_at field in the database as it is very useful to know when a user, or a post, was created. With Spaceflakes, and Snowflakes, you have the creation time of the ID built-in ((spaceflake >> 22) + BASE_EPOCH), so you don’t need to store it separately in a new column of your database. This is very useful as it removes unnecessary columns, and it also makes it easier to sort your data by creation time. Your database would then look like that:

1037425364332843009 john some_hash 1667329399
1037425458566270978 jane some_hash 1667329404
1037425546889924611 jack some_hash 1667329412

Another reason is that I use Go (duh) for all my different backends, and I wanted to implement my own way of generating unique IDs that are just like Snowflakes. The goal was to make it as easy as possible to use, and to be able to use it in any project I am working on, such as Project Absence, or any other future projects.

Some issues I’ve faced during development

Go is fast

Gotta GO fast (ba dum tss)

Yes, Go is fast; this was also my first issue. The reason behind that is that at first I made the sequence random, which was definitely not the best idea - though I’ve kept it in case I want to use it one day. The issue is that, as you’ve guessed, the sequence is from 0 to 4095, which means that it can only generate 4096 unique IDs per millisecond. This is not a lot, and it is definitely not enough to make sure to have a unique ID for each Spaceflake generated within a millisecond. So I had to change it to be incremental.

Incremental is not enough for bulk generation

So let’s suppose someone wants to migrate their database to use Spaceflakes instead of UUIDs. They would have to generate a lot of Spaceflakes, and they would have to do it as fast as possible. This is where the issue comes in. If you generate a Spaceflake every millisecond, you can only generate 4096 Spaceflakes per millisecond. This is, once again, not a lot. So I had to make the node ID and the worker ID incremental as well when using the BulkGenerate() function.
The code is a bit madness, but it’s working (at least I think so).

Now we can make a bulk generation of up to 3’935’295 Spaceflakes per millisecond, which is a lot more than 4096 without auto scaling. Once we hit that maximal amount, we need to sleep for 1 millisecond to make sure the timestamp is different, and we need to create a new node and a new worker.

That is another problem solved

Spaceflakes are big numbers

Last but not least, I learned that big numbers are pain if not handled as string within an API.
This was exactly what happened to me when using Spaceflakes for one of my API; I generated a Spaceflake - 144328692659220481 - and made an endpoint to get the user ID. When making a GET request to the endpoint with a browser, I got the following json:

Yes yes, there is a small difference - it’s right here: 144328692659220480. Now this may be a small difference, but it is a difference and we do not want that. The only solution is to make the API return the ID as a string. On the client side, we can then convert it to a number if we want to - for example by using this:

This will then return the correct sequence of the Spaceflake.

How to use Spaceflakes?

Using Spaceflakes is super simple, if you have a Go project, you can just download the package with

And then you can use it in your project by importing it and generate a Spaceflake:

There are other examples you can check out in the examples folder of the repository.