I fear I won’t be able to process everything and get it in a post before I start forgetting what my notes mean! Expect rapid postings in the near future.
On to Session 2!
TL;DR: databases.
This session was about the zombie apocalypse and data structures, with Casey Rosenthal. I knew walking into this one that it would be a little over my head, but a post-session debrief with Conrad (my guide) helped frame database structures in general, and everything clicked! I’m also just now starting to really get into databases: how they work, how to work with them, and the ways you can structure different data types within them, so this ended up being a great session for me to be in. Also, it was about a zombie apocalypse, so it had to be good.
Let’s start from the basics. There are two broad types of databases: SQL databases and NoSQL databases. NoSQL databases are essentially anything that isn’t a SQL database. Many of them, including the kind this talk focused on, are distributed key-value databases. Examples of NoSQL databases include MongoDB and Riak. SQL databases include MySQL, SQLite, etc. There are a handful of differences, but the main one is that SQL databases are based on table modeling. SQL databases work well if you have similar types of information coming in all the time, but they can be harder to scale. They are also good as long as everything is on one machine: they are safe and have lots of features, but these same features are hard to implement once you are running on many machines.
Why do you need to move to many machines instead of having just one? Because you can’t read the information quickly enough when there is a lot of it (essentially, the queries start taking too long and you need to split the information across more than one machine). Finally, SQL databases enforce structure. If you use a SQL database, every row in a table has to have the same columns. In a NoSQL database, this isn’t the case.
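To make the "SQL enforces structure" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The zombies table, its columns, and the sample values are made up for illustration and are not taken from Casey's talk.

```python
import sqlite3

# In-memory database so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")

# The schema is declared up front; every row must fit these columns.
# (Table and column names are hypothetical.)
conn.execute(
    "CREATE TABLE zombies (id INTEGER PRIMARY KEY, name TEXT NOT NULL, zip TEXT NOT NULL)"
)
conn.execute("INSERT INTO zombies (name, zip) VALUES (?, ?)", ("Bob", "97201"))


def try_insert(sql, params):
    """Run an INSERT and report whether the schema accepted it."""
    try:
        conn.execute(sql, params)
        return "ok"
    except sqlite3.Error as exc:
        return f"rejected: {type(exc).__name__}"


# A column the table does not know about is refused.
unknown_column = try_insert(
    "INSERT INTO zombies (name, zip, favorite_brain) VALUES (?, ?, ?)",
    ("Ann", "97202", "cerebellum"),
)
# Leaving out a required column is refused too.
missing_column = try_insert("INSERT INTO zombies (name) VALUES (?)", ("Cal",))

row_count = conn.execute("SELECT COUNT(*) FROM zombies").fetchone()[0]
print(unknown_column)
print(missing_column)
print(row_count)
```

Only the first insert fits the declared columns, so it is the only row that makes it into the table. Adding a new kind of information would mean changing the table definition first, which is exactly what a migration does.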
NoSQL databases are designed so they can start as one thing but easily scale to become larger and hold different types of information. And because there isn’t the same table structure, it can be easier to store lots of different types of information. So, when using a NoSQL database, you don’t have to run commands like rake db:migrate because there are no schema migrations (yeah, that kinda blew my mind, and it was also where the whole rest of the talk suddenly clicked into place for me). AND there’s no schema file! A NoSQL database still groups records, but the groups don’t have a fixed structure, so they go by other names: MongoDB calls them collections and Riak calls them buckets. The downside, however, is that you can’t tell exactly what’s in the database unless you pull all the records.
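Here is a rough sketch of what "no schema" looks like in practice. A plain Python dict stands in for the key-value store; it is not a real MongoDB or Riak client, and the keys and fields are invented for the example.

```python
# A plain dict stands in for a key-value store (hypothetical, not a real client).
store = {}

# Records under different keys can have completely different fields,
# and nothing has to be declared or migrated first.
store["zombie:1"] = {"name": "Bob", "zip": "97201"}
store["zombie:2"] = {"name": "Ann", "zip": "97202", "favorite_brain": "cerebellum"}
store["sighting:1"] = {"zip": "97201", "count": 3}


def known_fields(kv):
    """Discover the field names in use by scanning every record."""
    fields = set()
    for record in kv.values():
        fields.update(record.keys())
    return sorted(fields)


fields = known_fields(store)
print(fields)
```

Note that known_fields has to walk every record to answer "what's in here?", which is the downside mentioned above: with no schema file, the data itself is the only description of the data.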
So, back to the actual talk…when thinking about what type of database to use, it’s important to think about how the information will need to be presented in the end. This was an idea that continued to surface throughout the talk.
Some important terms I learned when thinking about NoSQL databases (and databases in general) were high availability, strong consistency, and partition tolerance. High availability means you can connect to any part of the system and both read and write all the data. So, for example, if you have 3 different silos that each keep information, the databases are kept in sync with one another so that you can access all the same data from any of those silos. In other words, if a server crashes, the database still works (Casey has a great slide that shows this, so I definitely recommend looking up the talk on Confreaks when it’s up). Strong consistency means that all parts of the database remain in sync with one another, so every read sees the most recent write. Finally, partition tolerance means that if the network connection between machines is cut (the cable comes out of the wall), the database keeps running.
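A toy simulation helped me see how these three terms pull against each other. This is only an in-memory sketch under simplifying assumptions: the Replica and Cluster classes and the silo names are hypothetical, and real databases replicate data in far more careful ways.

```python
class Replica:
    """One silo holding its own copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.reachable = True  # False simulates a crash or an unplugged cable


class Cluster:
    def __init__(self, names):
        self.replicas = [Replica(n) for n in names]

    def write(self, key, value):
        """Copy the write to every reachable replica; return how many got it."""
        targets = [r for r in self.replicas if r.reachable]
        for r in targets:
            r.data[key] = value
        return len(targets)

    def read(self, replica_name, key):
        """Read from one specific replica, as a client connected to it would."""
        replica = next(r for r in self.replicas if r.name == replica_name)
        return replica.data.get(key)


cluster = Cluster(["silo_a", "silo_b", "silo_c"])
cluster.write("zombie:1", "97201")

# silo_c drops off the network; the other two keep accepting writes.
cluster.replicas[2].reachable = False
acks = cluster.write("zombie:1", "97202")

# silo_c comes back but has not been synced yet, so it serves the old value.
cluster.replicas[2].reachable = True
fresh = cluster.read("silo_a", "zombie:1")
stale = cluster.read("silo_c", "zombie:1")
print(acks, fresh, stale)
```

The cluster stayed available while one silo was unreachable, but the price was that silo_c briefly disagreed with the others. A system that insisted on strong consistency would instead have had to refuse the write or the stale read.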
When you have a distributed system, there are a few data modeling options. You can do a document-based inverted index or a term-based inverted index. The document-based inverted index means that you can write the information efficiently but that you’ll have an inefficient read process. The term-based inverted index is the opposite: you have an efficient read but an inefficient write. So again, it comes down to really thinking about the data you have and what will be more important to your business processes. Will you be reading a lot of data or writing a lot of data?
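The read/write trade-off between the two index styles can be sketched as follows. This is my own minimal illustration, assuming three partitions and a toy hash function; the class names and sample documents are hypothetical and not from the talk's source code.

```python
from collections import defaultdict

NUM_PARTITIONS = 3


def bucket(text):
    """Deterministic toy hash (sum of character codes) so runs are repeatable."""
    return sum(ord(ch) for ch in text) % NUM_PARTITIONS


class DocumentPartitionedIndex:
    """Each partition indexes whole documents: one-partition writes, fan-out reads."""

    def __init__(self):
        self.partitions = [defaultdict(set) for _ in range(NUM_PARTITIONS)]

    def write(self, doc_id, text):
        part = self.partitions[bucket(doc_id)]
        for term in set(text.split()):
            part[term].add(doc_id)
        return 1  # number of partitions touched

    def read(self, term):
        hits = set()
        for part in self.partitions:  # every partition must be asked
            hits |= part.get(term, set())
        return hits, NUM_PARTITIONS


class TermPartitionedIndex:
    """Each partition owns some terms: fan-out writes, one-partition reads."""

    def __init__(self):
        self.partitions = [defaultdict(set) for _ in range(NUM_PARTITIONS)]

    def write(self, doc_id, text):
        touched = set()
        for term in set(text.split()):
            p = bucket(term)
            self.partitions[p][term].add(doc_id)
            touched.add(p)
        return len(touched)  # number of partitions touched

    def read(self, term):
        part = self.partitions[bucket(term)]
        return set(part.get(term, set())), 1


docs = {"z1": "zombie seen near 97201", "z2": "zombie bit survivor"}
by_doc, by_term = DocumentPartitionedIndex(), TermPartitionedIndex()
for doc_id, text in docs.items():
    doc_write_cost = by_doc.write(doc_id, text)
    term_write_cost = by_term.write(doc_id, text)

doc_hits, doc_read_cost = by_doc.read("zombie")
term_hits, term_read_cost = by_term.read("zombie")
print(doc_hits == term_hits, doc_read_cost, term_read_cost)
```

Both indexes return the same documents for a search. What differs is how many partitions each operation has to touch, which is why the choice depends on whether your workload is mostly reads or mostly writes.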
He also talked about highly available (HA) systems using geohashes, plus one other type of HA system that I can’t remember. Lastly, I learned about some interesting components that are Riak-specific. Riak creates data siblings, which isn’t possible in SQL but is possible in key-value databases. Basically, when information is written to the same key (so to the same data entry) from two different places at the same time (say, two people updating the same person’s file), the system saves both entries and connects them. Other systems just take the entry with the most recent timestamp as the current data. Here, both entries are saved as siblings, and the next time that file is accessed the user is told, “hey, you’ve got two entries here. Which one is correct?”
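The sibling idea can be sketched with a small in-memory store. To be clear about the assumptions: this is not Riak's client API, and the SiblingStore class is hypothetical. Riak tracks causality with richer metadata than the single integer version used here, so treat this as a simplification of the concept only.

```python
class SiblingStore:
    """Toy key-value store that keeps conflicting writes side by side."""

    def __init__(self):
        self._values = {}    # key -> list of sibling values
        self._versions = {}  # key -> integer version counter

    def get(self, key):
        """Return (siblings, version); more than one sibling means a conflict."""
        return list(self._values.get(key, [])), self._versions.get(key, 0)

    def put(self, key, value, based_on):
        """Store value; based_on is the version the writer last read."""
        current = self._versions.get(key, 0)
        if based_on == current:
            # The writer saw the latest state, so this replaces everything.
            self._values[key] = [value]
        else:
            # The writer worked from stale data: keep both entries as siblings.
            self._values.setdefault(key, []).append(value)
        self._versions[key] = current + 1


store = SiblingStore()
store.put("zombie:1", {"zip": "97201"}, based_on=0)

# Two people read the same version, then both write an update.
_, seen = store.get("zombie:1")
store.put("zombie:1", {"zip": "97202"}, based_on=seen)
store.put("zombie:1", {"zip": "97203"}, based_on=seen)

# The next read surfaces both entries instead of silently picking one.
siblings, version = store.get("zombie:1")
print(len(siblings))

# The reader decides which entry is correct and writes it back.
store.put("zombie:1", siblings[1], based_on=version)
resolved, _ = store.get("zombie:1")
print(resolved)
```

The design choice here is that the store never guesses. A last-timestamp-wins system would have dropped one of the two updates, while this approach hands the conflict to whoever reads the key next.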
The rest of the talk went into some more detail on this information, how to locate zombies via zip code and other interesting components of the database but, for me, this is what I gleaned from the talk. For more examples, information, and the source code, you can check out zombies.samples.basho.com.