If you're working as an Internet DBA there's no way to avoid the MySQL phenomenon. Go online and look for a DBA position and most Internet companies (MSN excluded) require it as either a primary or secondary skill. This article is a primer on MySQL for the SQL Server DBA. As this is a religious war for many, I want to preface it by saying it is based on my own experience with both SQL Server and MySQL; it is not the absolute guide for comparing the two.
MySQL is a vast product, and covering all of its features and gotchas is outside the scope of this article, though perhaps a topic for another. This article focuses on the storage engines, as they are the cornerstone to understanding MySQL and to making the right decisions if you decide to use it. In SQL Server you have one choice of data engine, built on a b-tree structure; you can't change it, as it's baked into the product. In MySQL you have a plethora of choices, each with its own architecture, advantages, and disadvantages. Engines are declared at the table level, which allows you to mix and match engines within a single database. This is a huge advantage for optimizing your application, as it lets you choose the right engine for the right job. The disadvantage is that code written for one engine may not perform as expected when moved to another.
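For example, the engine is declared right in the CREATE TABLE statement, so one database can mix engines per workload (table and column names here are hypothetical):

```sql
-- Append-heavy logging table on MyISAM, transactional table on InnoDB,
-- living side by side in the same database.
CREATE TABLE page_hits (
    hit_id   INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    url      VARCHAR(255) NOT NULL,
    hit_time DATETIME NOT NULL
) ENGINE = MyISAM;

CREATE TABLE orders (
    order_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    total    DECIMAL(10,2) NOT NULL
) ENGINE = InnoDB;

-- Confirm which engine a table uses:
SHOW TABLE STATUS LIKE 'orders';
```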
NDB is the gem of MySQL. Originally developed by Ericsson to track cell phone calls, it is a shared-nothing cluster engine that stores its data in memory. This is a true cluster that supports both high availability and load balancing. It achieves this by separating its dataset into replicas, where each node holds both a primary and a backup fragment of the data. If you have 2 servers, each server is responsible for half the data as a primary copy and half as a backup copy. Primary means the copy that gets hit for select queries, while the backup mirrors the other node's primary replica. In the event of a failed node, the surviving node promotes its backup to primary and serves both. The failover process takes 1 to 5 seconds. This engine is similar to synchronous mirroring in SQL Server in that it is a 2-phase commit, the difference being the commit is done in memory at the data layer, not the log. Logs are hardened later, the theory being that since the data is committed on multiple nodes it is safe and doesn't require a log flush as part of the transaction. In 5.0, partitions are created on a hash of the primary key.
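From the SQL side, using the cluster engine is just another table declaration; the sketch below assumes a running MySQL Cluster (management and data nodes already configured), and the table is hypothetical:

```sql
-- NDB table: rows are distributed across the data nodes by a hash
-- of the primary key, with each fragment replicated to a peer node.
CREATE TABLE call_records (
    call_id  BIGINT NOT NULL PRIMARY KEY,
    caller   VARCHAR(20) NOT NULL,
    duration INT NOT NULL
) ENGINE = NDBCLUSTER;
```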
For pure performance the cluster is comparable with a single instance of SQL Server. I've found on selects it beats SQL Server slightly as long as the data on SQL Server is in memory, and the gap widens as more connections are made. Writes depend on whether SQL Server is using write-back cache on its controller: if it is, it beats NDB, due to NDB's 2-phase commit; without the controller cache, NDB beats SQL Server. This is not apples to apples, however. When compared to SQL Server synchronous mirroring, NDB wins hands down. The cost of NDB is that it resides in memory (5.1 allows you to keep non-indexed data on disk), so your dataset is limited by the amount of memory available. In addition it has trouble with multi-table joins: it doesn't support a merge join, and nested loops run considerably slower than in SQL Server. Another downside is that startup can be extremely slow; a 10 GB dataset can take as long as 45 minutes to start due to the reading and replaying of its logs. Negatives aside, if you have an application that requires redundancy and fast inserts and selects on a single table, NDB is the best product out there. We've been running it for almost 18 months and it's been rock solid; compared with our other SQL Server and MySQL implementations it has required the least attention. One final note: NDB will not run on Windows.
MyISAM is the default engine for MySQL, and as its name suggests it is built on an ISAM data structure, well, sort of. It does use a b-tree for its indexes, but unlike SQL Server, which keeps the data on the leaf pages of the tree, the leaf points into an ISAM structure. The SQL Server equivalent is a table built on a heap, where the index points to a location in the data file rather than a key. MyISAM, whether intentionally or not, is built and optimized for read-only datasets. Among the features that make this the case: it doesn't support transactions, so taking user updates would be dangerous, but you don't incur the overhead of a transaction log. It locks at the table level, not the row or page, which is not ideal for active OLTP systems, unless the system is only doing inserts. An insert-only application performs well because MyISAM can run concurrent inserts and selects: data is appended to the end of the file and no table lock is issued, so selects are not blocked. However, just like a heap table in SQL Server, once space is freed by deletes, new rows are inserted into the free pages and you lose concurrency due to the table-locking architecture. This behavior can be turned off so that all data is appended to the end of the file, but that requires constant monitoring of data space.
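The append behavior described above is controlled by a server variable; a minimal sketch of forcing end-of-file appends (the trade-off being the data file only grows until you repair it):

```sql
-- concurrent_insert controls MyISAM's concurrent insert behavior:
--   0 = disabled
--   1 = concurrent inserts only when the data file has no holes (default)
--   2 = always append at the end of the file, even when deleted
--       space exists, preserving insert/select concurrency
SET GLOBAL concurrent_insert = 2;
SHOW VARIABLES LIKE 'concurrent_insert';
```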
We've been very conventional in our use of this engine. We prep data in a staging environment using SQL Server to handle the heavy processing; once massaged, the data is pushed to MySQL in our delivery environment using SSIS. MySQL offers a .NET connector with a rich set of interfaces that make integration between MySQL and SQL Server easy. Since MyISAM is not built for DML operations, our methods for pushing data have ranged from table flips to leveraging the "Insert Delayed" statement. The delayed option puts an inserted row into a memory queue and writes it during a period of low activity. There is also the "Low_Priority" option of the insert statement, which waits until there are no lock requests (shared or exclusive) before inserting data; Low_Priority is available for deletes and updates as well. You can emulate this behavior by explicitly taking a lock with the "Low_Priority" option, which is essentially what the aforementioned statements do implicitly.
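A quick sketch of the three variations, against a hypothetical MyISAM table:

```sql
-- Queue the row in memory; the server writes it when the table
-- is idle, and the client returns immediately.
INSERT DELAYED INTO page_hits (url, hit_time)
VALUES ('/home', NOW());

-- Wait until no other clients are reading the table before writing.
INSERT LOW_PRIORITY INTO page_hits (url, hit_time)
VALUES ('/about', NOW());

-- The same behavior taken explicitly with a low-priority table lock:
LOCK TABLES page_hits LOW_PRIORITY WRITE;
INSERT INTO page_hits (url, hit_time) VALUES ('/contact', NOW());
UNLOCK TABLES;
```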
MyISAM runs on either Windows or Linux, and performance is about the same. If you're new to the open source world, running on Windows lets you focus on learning MySQL without throwing Linux into the mix. Overall, for small read-only datasets, MyISAM out of the box (downloaded from SourceForge) is an excellent choice, but it struggles as datasets grow and then requires tuning.
InnoDB is MySQL's transactional engine, and is equivalent to a scaled-down version of SQL Server. Unlike MyISAM, data is kept at the leaf level of a clustered b-tree index. If a clustered index is not specified, InnoDB clusters the data on an internal 6-byte identifier, differing from SQL Server's heap architecture when no clustered index is specified. InnoDB is a fully ACID-compliant engine, but it offers the ability to trade this away for performance by delaying transaction hardening and not flushing the log to disk on every transaction. As with SQL Server it can perform dirty reads if instructed and supports row-level locking. What differs in the locking structure is that there are no lock escalations, which causes a higher frequency of deadlocks than in SQL Server. You can take an explicit table lock if you determine a statement is too large for many row locks. Like SQL Server, it has a separate log file that supports rolling transactions forward and back on startup. From a high level InnoDB looks very similar to SQL Server, but in practice there are limitations you need to understand if you choose this engine for a transactional system.
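The durability and locking trade-offs above map to a few concrete statements; a minimal sketch, assuming a hypothetical orders table:

```sql
-- Trade durability for speed: write the log at commit but flush it
-- to disk only about once per second, so a crash can lose up to a
-- second of committed transactions.
SET GLOBAL innodb_flush_log_at_trx_commit = 2;

-- Dirty reads, comparable to SQL Server's NOLOCK hint:
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

-- Explicit table lock when a statement would otherwise take an
-- unreasonable number of row locks:
LOCK TABLES orders WRITE;
-- ... bulk maintenance here ...
UNLOCK TABLES;
```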
When talking to DBAs about transactional systems, the first question that comes up is redundancy, and this is where InnoDB falls short of SQL Server. Outside of replication, which offers functionality similar to asynchronous mirroring, there is no native high-availability solution comparable to synchronous mirroring. Instead, MySQL leverages the open source community for separate pieces to glue a system together. These include DRBD, a block-level replication system between servers, and Heartbeat, which handles failover of servers and services. There's no true backup infrastructure: a binary backup means shutting down the service, or at least locking the tables and copying the data files. Log files can be replayed from a point in time corresponding to this pseudo (not sudo) full backup, but the responsibility is on the DBA to know when the backup occurred; there is no LSN to track backup activity. There are some third-party tools and scripts that will do backups, but they're not baked into the product. As for pure performance and features, InnoDB falls short of SQL Server on an OLTP system, though by tweaking the configuration you can get much of this back.
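The hand-rolled binary backup typically looks like this; note that recording the log position is on you, since there is no LSN to consult later:

```sql
-- Block writes across all tables, then note the binary log position
-- so the log can be replayed from this point during a restore.
FLUSH TABLES WITH READ LOCK;
SHOW MASTER STATUS;   -- record the File and Position values

-- (copy the data directory at the filesystem level here)

UNLOCK TABLES;
```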
The memory engine as its name suggests resides within RAM. This engine is often used for applications that require a local data store for read-only or transient data. The Memory engine supports both hash and b-tree indexes.
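The index type can be chosen per index; a sketch with a hypothetical session table, where HASH suits point lookups and BTREE suits range scans:

```sql
CREATE TABLE session_cache (
    session_id CHAR(32) NOT NULL,
    user_id    INT NOT NULL,
    PRIMARY KEY (session_id) USING HASH,   -- HASH is the MEMORY default
    KEY ix_user (user_id) USING BTREE      -- BTREE helps range queries
) ENGINE = MEMORY;
```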
The Merge engine is a collection of MyISAM tables that lets you union tables containing different subsets of data. The union is defined at the table level, so it doesn't require a Union statement in your query. Prior to 5.1 MySQL did not support partitioning, and this design served that need. The logic for which table to insert data into is performed at the application level. One place you would use this engine is for performance and high availability. An example would be logically distributing your data geographically, where users requesting information relevant to NY go to one server and WA to another, even though both servers contain the entire dataset. This allows each server to keep the hot tables for its region in memory while still holding the other regions' data on disk. In the event the WA server fails, its users can be redirected to the NY server, which can serve both regions' queries.
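The regional example above can be sketched as two identically structured MyISAM tables unioned by a MERGE table (names hypothetical):

```sql
CREATE TABLE hits_ny (hit_id INT NOT NULL, url VARCHAR(255)) ENGINE = MyISAM;
CREATE TABLE hits_wa (hit_id INT NOT NULL, url VARCHAR(255)) ENGINE = MyISAM;

-- The union is declared on the table itself, not in the query.
CREATE TABLE hits_all (hit_id INT NOT NULL, url VARCHAR(255))
    ENGINE = MERGE
    UNION = (hits_ny, hits_wa)
    INSERT_METHOD = LAST;   -- inserts land in the last table listed

-- Reads both underlying tables with no UNION in the statement:
SELECT COUNT(*) FROM hits_all;
```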
There is also a CSV engine, which is used to query a csv file through a SQL interface. Since this is a plain text file it can be read by any application capable of reading a comma-delimited file. There is no index support, so performance is bounded by the speed at which you can perform a table (file) scan.
The Federated engine is MySQL's version of linked servers, where remote data sources can be queried locally. There's also the Archive engine, used to store large volumes of data in a compressed format, and the Blackhole engine, which doesn't store data and is used for testing and development.
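A Federated table is declared locally with a connection string pointing at the remote server; the host, credentials, and table below are illustrative only:

```sql
CREATE TABLE remote_orders (
    order_id INT NOT NULL,
    total    DECIMAL(10,2) NOT NULL
) ENGINE = FEDERATED
  CONNECTION = 'mysql://app_user:app_pass@db2.example.com:3306/sales/orders';
-- Queries against remote_orders are shipped to the remote server,
-- much like a four-part linked server query in SQL Server.
```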
MySQL is a solid product that does many tasks well, but it has yet to reach the maturity of SQL Server. You need to work with all the separate engines to determine which work best for your application. If you're moving from SQL Server to MySQL, there's quite a bit more hand-rolling you need to do to get the same experience. That said, companies such as Facebook, Yahoo, and Google are making it work, with great success.