I just had my one-year anniversary working for Redgate, and I must tell you, it’s been one of the best years of my 25-year professional career. People aside (and there are a lot of really great people!), one of the reasons I’ve enjoyed this year so much is that Redgate understands and believes in helping the data community grow and learn. That’s a mission I can easily join.
As part of that mission, I have regular opportunities to learn how users in the PostgreSQL community use the database and the challenges they face. In many ways, it’s not all that different from the SQL Server community. To help the community learn how to overcome challenges with PostgreSQL, a good sample database is essential.
It turns out that good sample databases are hard to create and maintain.
There are many datasets (a plethora, even) available to import for one-off learning objectives. The real challenge is finding a database that can be used for long-term learning, that grows over time, and that uses as many features as possible. Database architecture and design are hard. Doing them well with fake but realistic data is really challenging.
But I still wanted to try. 😀
In the PostgreSQL space, one of the open-source options is a small database called Pagila. It’s based on an old MySQL sample database called Sakila, a fake DVD rental store. The community tries to keep it up to date with new features in PostgreSQL. But I wanted something more realistic if possible.
For instance, I used an open-source movie database, TMDB, to get real movie titles, movie details, production company information, and cast and crew data. With the help of Ryan Lambert, I was able to create realistic (but fake) geospatial data for customer and store addresses. In fact, Ryan will be teaching a full-day pre-con at PASS Summit on PostGIS and mapping with PostgreSQL. Most importantly, there are functions to generate continuous rental and payment data.
Over the next few weeks, I’ll start to share the database, schema, tools, and scripts I’ve used to create the database, which I’m planning to call “Bluebox” (U.S. readers will understand the nod to Redbox). This first attempt is definitely beta-quality at this point, but I’m excited about making this available to the community and seeing how others can help improve it.
So my question to you is, what are some of your go-to sample databases and datasets to learn more about your database of choice? Are there any that attempt to mimic a real, full application? What qualities do you look for in sample data to ensure you’re able to learn the server and features well?
I look forward to seeing your suggestions and comments!