Here's the question...
Considering 192 trillion records, what should my considerations be?
My main concern is speed.
Here's the table...
CREATE TABLE `ref` (
`id` INTEGER(13) AUTO_INCREMENT DEFAULT NOT NULL,
`rel_id` INTEGER(13) NOT NULL,
`p1` INTEGER(13) NOT NULL,
`p2` INTEGER(13) DEFAULT NULL,
`p3` INTEGER(13) DEFAULT NULL,
`s` INTEGER(13) NOT NULL,
`p4` INTEGER(13) DEFAULT NULL,
`p5` INTEGER(13) DEFAULT NULL,
`p6` INTEGER(13) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY (`s`),
KEY (`rel_id`),
KEY (`p3`),
KEY (`p4`)
);
Here's the queries...
SELECT id, s FROM ref WHERE red_id="$rel_id" AND p3="$p3" AND p4="$p4"
SELECT rel_id, p1, p2, p3, p4, p5, p6 FROM ref WHERE id="$id"
INSERT INTO rel (rel_id, p1, p2, p3, s, p4, p5, p6)
VALUES ("$rel_id", "$p1", "$p2", "$p3", "$s", "$p4", "$p5", "$p6")
Here's some notes...
- The SELECT's will be done much more frequently than the INSERT. However, occasionally I want to add a few hundred records at a time.
- Load-wise, there will be nothing for hours then maybe a few thousand queries all at once.
- Don't think I can normalize any more (need the p values in a combination)
- The database as a whole is very relational.
- This will be the largest table by far (next largest is about 900k)
UPDATE (08/11/2010)
Interestingly, I've been given a second option...
Instead of 192 trillion I could store 2.6*10^16 (15 zeros, meaning 26 Quadrillion)...
But in this second option I would only need to store one bigint(18) as the index in a table. That's it - just the one column. So I would just be checking for the existence of a value. Occasionally adding records, never deleting them.
So that makes me think there must be a better solution then mysql for simply storing numbers...
Given this second option, should I take it or stick with the first...
[edit] Just got news of some testing that's been done - 100 million rows with this setup returns the query in 0.0004 seconds [/edit]
-
shard it. at this size having one large instance is a suicide - think about possible backup restores, table space corruptions, adding new columns or any other 'house keeping' processes - all those are impossible to be done in reasonable time at this scale.
simple back of the envelope calculations - assuming 32bit integers for all columns except 64bit id; no indices included:
8*4B+8B = 40B per row [and this is very optimistic]
192 Trillion rows 40B each gives us almost 7 PB
maybe you can re-think the whole thing, summarize information for quick reporting, and store compressed records for given time intervals when someone needs to dig into deeper details.
questions to answer:
- what is acceptable downtime in case system crashes/get rebooted?
- what's accessible downtime when you need to recover backup or pull server out of production for planned maintenance.
- how often and where to do you want to backup?
random links - speed of inserts:
- http://highscalability.com/blog/2010/5/20/strategy-scale-writes-to-734-million-records-per-day-using-t.html
- http://stackoverflow.com/questions/654594/
Sarah : I agree- 7PB is pretty heavy. I'd love the re-think it and find a lighter solution, but I need find find the existence (or non-existence) of a particular combination of the p fields. Splitting out the tables crossed my mind - it's more sensible, but then it just means I've got the query each table in turn. Out of interest, how many tables would you recommend splitting into here?pQd : @Sarah - i would not only recommend splitting into tables but also machines. you can run your queries in parallel to gain performance [i do it on smaller scale]. what about file system corruptions or even routine checkup after server reboot? i'm not sure what do you mean by finding particular combination... maybe simple key-value store would help? table size - not more than few tens of GB; data on single server - not more then few TB. look at http://stackoverflow.com/questions/654594/ to know what headache to expect at much smaller scale ; use innodb_file_per_tableFrom pQd -
Aside from tuning your DB params like crazy (use mysqltuner to help) to try and keep your SELECTs cached as much as humanly possibly, one thing you might investigate is START TRANSACTION/CoMMIT (assuming InnoDB) when inserting your few hundred records to avoid the row by row locking overhead and bring your insert time down by a huge factor. I would also create the table as both MyISAM and InnoDB and run tests on it to see which is truly faster once you get caching tightened up - it's not always that MyISAM will be faster for reads - check out this:
http://www.mysqlperformanceblog.com/2007/01/08/innodb-vs-myisam-vs-falcon-benchmarks-part-1/
During your testing, the number of concurrent threads should also be varied up and down until you find he sweet spot for how much RAM you can afford on the server to dedicate to tuning the caches; you may find that while you can support more threads by the math, the DB itself may actually perform worse if the thread count goes too high.
Also, if you use MyISAM and/or InnoDB file-per-table you could investigate creating a different filesystem mount point for /var/lib/mysql that was tuned to a smaller block size and tuned the fs-type params - i.e. ext3/ext4/resiserfs you could use data=writeback for the journal and disable updating of access times on the filesystem for I/O speed.
Sarah : Good stuff - thanks for this :)pQd : myisam seems to be out of question due to transaction requirements.From troyengel -
pQd's estimate of 7PB seems reasonable, and that's a lot of data for a RDBMS. I'm not sure I've ever heard of someone doing 7PB with any shared disk system, let alone MySQL. Querying this volume of data with any shared disk system is going to be unusably slow. The fastest SAN hardware maxes out at 20GB/sec even when tuned for large streaming queries. If you can afford SAN hardware of this spec you can affort to use something better suited to the job than MySQL.
In fact, I'm struggling to conceive of a scenario where you could have a budget for a disk subsystem of this spec but not for a better DBMS platform. Even using 600GB disks (the largest 15K 'enterprise' drive currently on the market) you're up for something like 12,000 physical disk drives to store 7PB. SATA disks would be cheaper (and with 2TB disks you would need around 1/3 of the number), but quite a bit slower.
A SAN of this spec from a major vendor like EMC or Hitachi would run to many millions of dollars. Last time I worked with SAN equipment from a major vendor, the transfer cost of space on an IBM DS8000 was over £10k/TB, not including any capital allowance for the controllers.
You really need a shared nothing system like Teradata or Netezza for this much data. Sharding a MySQL database might work but I'd recommend a purpose built VLDB platform. A shared nothing system also lets you use much cheaper direct-attach disk on the nodes - take a look at Sun's X4550 (thumper) platform for one possibility.
You also need to think of your performance requirements.
- What's an acceptable run time for a query?
- How often will you query your dataset?
- Can the majority of the queries be resolved using an index (i.e. are they going to look at a small fraction - say: less than 1% - of the data), or do they need to do a full table scan?
- How quickly is data going to be loaded into the database?
- Do your queries need up-to-date data or could you live with a periodically refreshed reporting table?
In short, the strongest argument against MySQL is that you would be doing backflips to get decent query performance over 7PB of data, if it is possible at all. This volume of data really puts you into shared-nothing territory to make something that will query it reasonably quickly, and you will probably need a platform that was designed for shared-nothing operation from the outset. The disks alone are going to dwarf the cost of any reasonable DBMS platform.
Note: If you do split your operational and reporting databases you don't necessarily have to use the same DBMS platform for both. Getting fast inserts and sub-second reports from the same 7PB table is going to be a technical challenge at the least.
Given from your comments that you can live with some latency in reporting, you might consider separate capture and reporting systems, and you may not need to keep all 7PB of data in your operational capture system. Consider an operational platform such as Oracle (MySQL may do this with InnoDB) for data capture (again, the cost of the disks alone will dwarf the cost of the DBMS unless you have a lot of users) and a VLDB platform like Teradata, Sybase IQ, RedBrick, Netezza (note: proprietary hardware) or Greenplum for reporting
pQd : @ConcernedOfTunbridgeW - they can always go this way: http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/ - much more fun than SAN, only ~120-130 4U boxes needed ... but i'm not sure if 'the business' would be happy ....ConcernedOfTunbridgeWells : Essentially a Sun Thumper on a budget and really an example of an option for a node in a shared-nothing system. I'm sure I've seen other options for this as well, but I can't think of where . The question is not so much what hardware but what database platform.ConcernedOfTunbridgeWells : However, keen observers will note that any sort of direct attach based box like this is much, much cheaper per TB than anything based on a SAN, which is at least one significant argument in favour of something designed to work on a shared-nothing platform.pQd : @ConcernedOfTunbridgeWells and you can run all those queries / maintenance and anything else in parallel on multiple [otherwise power hungry] boxes.ConcernedOfTunbridgeWells : Thhe whole point of a shared nothing system like Teradata is that it's designed to do this sort of parallel query right from the ground up.Sarah : @ConcernedOfTunbridgeWells - to answer you questions... I need about 500 queries to return in under a second, if possible. I'll be doing this only a few hundred times a day. When a query runs though, the full table does need to be scanned. Also the INSERT's are a lower priority than the SELECT's, so it doesn't have to be anywhere near instant. I can wait a few hours for "new" data to get into the database.Sarah : Also backblaze does look interesting...ConcernedOfTunbridgeWells : @Sarah - You might be better off with an operational database to capture the data and a reporting database that's updated periodically from the operational database. The operational database can also be cleared out periodically, so it doesn't need to hold all of the historic data. This will proabably help with performance on your operational side as well.iconiK : @pQd, that's a horrible box to store important data. Lose one PSU and you've lost the data. And it's all home-use stuff, nothing designed to run around the clock. Besides, that box is horribly slow.pQd : @iconiK that's a bad idea [to store all that data in single DB] in the first place - see my answer below.iconiK : @pQd agreed, but the Backblaze pod is a bad idea for any usage where the data is even remotely valuable.pQd : @iconiK we get off-topic, but i think there are reasonable use cases for backblaze pods. as long as there is backup and people are prepared to handle potential slow data-recovery process - it's all fine. -
There may be another way, rather than storing quadrillions of numbers if all you want to do is see if they are in the set. Bloom filters are a probabilistic method, by hashing in multiple ways. Also, False positives are possible, but false negatives are not. (So, it might say the number is in the set - and be wrong, but it won't say it's not there, if it really was). There's also still the issue of the vast number of items to store, but at least it could bring the working dataset size down somewhat.
Sarah : Sounds interesting, although I could live with false negatives - but not the false positives :)From Alister Bulman -
Edit: Actually if it is just the existence or not of a "record" at location X in a range of integers, you could eliminate the datastore and just use the bitmap... So, 10 or so machines with 100 TB of disk space (so you have 10 copies of your bitmap for performance and backup) and if you did 128GB of RAM per server you could fit a high resolution top level blockgroup index in memory to do a first check before hitting the disk for bit X of 26 Quadrillion.
I would go for option #2 If you take:
375 machines with 64TB (32 2TB drives) each (realistically 400 machines for failures) then just map the records onto ZVOLs that are 2TB each. Then on one or more index servers, store in a Judy array or critbit array or just plain bitmap, a mapping of if you have added a record to that 1 of 26 Quadrillion locations. The index would be between 50 and 100TB and you could even have a second level index indcating if there were any records written to a certain 64k block of addresses that would fit in less than 64 GB of RAM and would provide a fast level of initial check if a certain "neighborhood" was empty or not.
Then to read that record you would first check if there is a record to find by looking at the index. If there is, then go to machine #(X)/ZOL #(Y) on that machine/record location #(Z) within that 2TB blob based on the simple index calculation. Single record look ups would be extremely fast and you could test loading some portions of the datastore into different dbs (while you use the datastore for real work) and doing performance testing to see if they were capable of supporting your whole database - or not, just use the data store that way.
A ZOL is a ZFS thing that could be thought of a sparse file in other filesystems, so similar things would apply. Or you could just index to a certain byte number on a disk but this gets tricky if disks are different sizes if you don't cap the number of bytes used per disk at a level that works for all disks - ie 1.75TB per 2TB disk. Or create metadevices that are a fixed size, etc.
Ann E. Mouse : Hi Sarah - not sure if you are still working on this, but if you need help I could prototype my idea for you on a 100TB machine and would also be willing to host (at a major US datacenter) and manage the full production cluster of 400-500 machines as required. BTW, did you ever work at CNET in SF?From Ann E. Mouse -
Call Percona. Do not pass "Go". Do not collect $200.
From JustinShoffstall
0 comments:
Post a Comment