May 19, 2011 at 3:28 am
Hi,
Few things I would like to know (assuming table is not partitioned):
1. Can I access extent or page information related to any particular table?
If so, can I access rows in a page depending on the row offset information given in a
page.
Please suggest a way to do so.
Use Case:
I want to access data from a particular table parallely using multiple threads and JDBC.
2. In case of loading data into a table, can I do so parallely?
3. Can I logically split a table into multiple temporary tables(containing huge data) and logically combine them again?
Any pointers would be highly appreciated.
Thanks,
Abhijeet G.
May 19, 2011 at 9:46 am
2. If you mean multiple threads, yes. It's the same as multiple people inserting at one time. If it's a large load, however, the parallelism may not help you. You might be better off with a large insert running through something like a BULK INSERT
3. This is done with views.
1. First, you don't ever go access a particular extent or page for data. You query the table and SQL Server manages to pull the data together. Knowing where things are stored in the file doesn't help you build a more efficient application. You can query to find data on pages, but this isn't the way to do it.
What parallel access are you doing? If you are thinking that you can more efficiently access data in JDBC from multiple connections/threads and combine it better, you might be right, but it's very unlikely.
May 19, 2011 at 9:55 am
2. Actually, I am running and retrieving data from a SQL server present on Windows m/c to a linux machine (runs JDBC program). So cannot use BULK_INSERT as data is in a file on linux machine.
Use Case: using multiple threads I will first load data into temporary tables. After doing this I will logically combine these tables (no physical copy), that would definitely increase my performance.
1. Actually i was thinking , if a table is split over multiple pages, I can access rows from multiple pages parallely... is there a way possible
May 19, 2011 at 10:00 am
How much data are you moving?
Loading data into temp tables is one thing, but combining them in a view isn't the way to go. A copy of the data inside SQL Server will be fast, and it is much more efficient for subsequent access, even if it doesn't feel that way.
From a programming perspective, you can access data on a Linux machine with something like SSIS or with a remote path from the server. The key is the pathing is relative to the SQL service account on the server, not from any client.
May 19, 2011 at 10:14 am
1. The data I would be moving is huge .... take for example 1TB or maybe more.
2. Thanks for pointing out the remote paths concept ... will surely research this area.
3. I would still like to know more about pages and extents ... if I can access rows in different pages related to a table simultaneously. Can you also point out some links ... from where I can get information on accessing data from pages.
4. Also .... talking about loading data using multiple temp tables.... the use case is the table would be already present in SQL server .... I just need to know a way to logically split and combine it .... if that's possible (I know about partitions .... if you can suggest any other way..)
May 19, 2011 at 12:19 pm
Loading a TB really depends on your I/O and network. The best way to load it would depend on whether it's one time or 1TB regularly. However you loading it from multiple threads does not necessarily improve things.
SQL Server is optimized to deal with sets of data. You never, ever, access a page looking for data. You have no idea where that page is, whether it's in memory or not, what might be on that page (which can change depending on inserts/updates/deletes). It is in no way like a key-value store or a record based file.
You have no need to access different pages simultaneously. That's what the server engine is for. It's got way more time, testing, and effort put into finding ways to efficiently pull data from a table that you will have by yourself. Don't try to figure out how to access pages.
Why don't you describe what you are trying to do here. I think you might not have a strong understanding of what the database does and how SQL Server does it. If you can state more from a problem standpoint, what business or data you are working with, how and why, we can suggest ideas.
In terms of logically combining things, you're attacking this with an idea in mind instead of an end goal. If you are trying to improve performance, partitions help, but only if you have data that can be separated by some logic into different locations, with different access patterns/frequency, and you have separate physical IO paths for that data.
Views (normal, distributed or partitioned), provide methods for combining data in queries.
May 19, 2011 at 12:46 pm
abygaikwad17 (5/19/2011)
3. I would still like to know more about pages and extents ... if I can access rows in different pages related to a table simultaneously.
Let SQL take care of this. It's not a flat-file system where you have to deal with the physical considerations of where the data is stored. With any relational database system, you access the data independent of how or where it is stored, that's one of the key features of a relational database.
While it is possible to access a page, what you will get back is the raw data on the page, in binary format. It'll be up to you to turn something like the following back into a row
0000000000000000: 30001800 51d70e00 93d60000 9069f600 †0...Q×..?Ö...iö.
0000000000000010: 9c9b0000 46000000 05000001 00fc0473 †??..F........ü.s
0000000000000020: 65652074 68652073 6f756c73 0d0a546f †ee the souls..To
0000000000000030: 206d6973 65727920 646f6f6d 27642c20 † misery doom'd,
0000000000000040: 77686f20 696e7465 6c6c6563 7475616c †who intellectual
0000000000000050: 20676f6f 640d0a48 61766520 6c6f7374 † good..Have lost
0000000000000060: 2e222020 416e6420 7768656e 20686973 †." And when his
Seriously, lose this idea, it's silly.
If you want to run multiple queries from your frontend, fine, just chose a column and filter on that. SQL will automatically parallel queries if it feels they are expensive enough to benefit.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
May 19, 2011 at 12:49 pm
abygaikwad17 (5/19/2011)
2. Actually, I am running and retrieving data from a SQL server present on Windows m/c to a linux machine (runs JDBC program). So cannot use BULK_INSERT as data is in a file on linux machine.Use Case: using multiple threads I will first load data into temporary tables. After doing this I will logically combine these tables (no physical copy), that would definitely increase my performance.
1. Actually i was thinking , if a table is split over multiple pages, I can access rows from multiple pages parallely... is there a way possible
What kind of a load are you doing on the linux machine? Is there a linux database there or are you creating text files of some kind?
If it is a database on the other machine and you can create a connection to that database from a Windows machine (and I think you should be able to do that), consider using an SSIS package to transfer the data.
--------------------------------------
When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
--------------------------------------
It’s unpleasantly like being drunk.
What’s so unpleasant about being drunk?
You ask a glass of water. -- Douglas Adams
May 19, 2011 at 1:31 pm
abygaikwad17 (5/19/2011)
Hi,Few things I would like to know (assuming table is not partitioned):
1. Can I access extent or page information related to any particular table?
If so, can I access rows in a page depending on the row offset information given in a
page.
Please suggest a way to do so.
Use Case:
I want to access data from a particular table parallely using multiple threads and JDBC.
Sometimes I think you need to let someone figure out on their own that the idea is just not in their favor. The command you're looking for is DBCC PAGE. As Gail pointed out, this is a binary storage of the data. You cannot directly write to it, but you could in theory read from it, if you understand how a page works. This includes: Reading the page in 8k structures into a string reader, breaking apart the header information (which is slightly different in each version and is internal, don't expect that to change, it's not for our use), and then learning how to decipher the NULL bitmap, the varchar string length structures, and writing code that will strip down said page into the rows in your application.
Please note, as others have mentioned, SQL isn't built for this style of access. You're going to attempt to reverse engineer a product with 15 years of innovations built into it to use this data quickly and intelligently. It's much easier to write: SELECT * FROM tbl WHERE row = 'x'.
I understand that you believe you can do it faster. You might even be able to under incredibly specific circumstances (I can't see how, but who knows). But the server itself will *still* be your bottleneck, because you'll have to interface with it first before it returns to you the pages you're looking for.
And you will never, ever, be able to directly write back to a page. Also, the data may have moved since you last saw it. Look up 'page splits'. The engine takes care of the consistency, you're not supposed to do it outside of it.
If you're really looking for that kind of access, you want to avoid just about any database engine and look towards flatfile structures, or your own type of storage format. What you're looking to do is pretty much what you just bought SQL Server for.
2. In case of loading data into a table, can I do so parallely?
Errr, depends. Need more information. Multiple connections can connect to the same table and write, but it will depend on existing locking and concurrency. A single bulk loader if memory serves doesn't write parallely, but you don't necssarily want it to. For parallel writing, you need more then one head on the disk drive. It always comes back to disk drives. When you're pushing a lot of data, you're going to flush to drive a lot. Parallel doesn't gain you anything there if you're forcing the head to move more then it needed to if you straight wrote it. It'll actually hurt you.
3. Can I logically split a table into multiple temporary tables(containing huge data) and logically combine them again?
Why? For the parallelism bit again? You're still trying to fight the engine instead of using it the way it was designed. Having 20 blades each grab a piece of the pie and process is a great idea. Just use ranges of IDs, or some other method of range identification. Then, when it finishes its compilation, have it write back to the same table as it finishes. If they come in simultaneously they'll just wait for each other, the same way the parallelled version would have to, but then it would be just shuffling together.
In the end, the data has to return to the drive, and the drive has 1 head. You're going to end up with pounding your head on the desk if you forget that minor point after you've worked this hard on everything else.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
May 20, 2011 at 9:59 am
Actually, if you want to write directly to pages, you could. You'd have to get a HEX editor, and write directly to the disk sectors the pages are stored on, or use a low-level language like Assembler or maybe C to write directly to RAM addresses. The moment you do so, you are no longer using SQL Server, as you'll end up with the data and log files out of sync and SQL Server will start labeling the database as corrupt or at best suspect.
However, if you're looking at reading/writing pages of data directly, you should move away from a commercial RDBMS altogether. Get the source code of one or more open source database engines, customize it to your own needs, and use that. If you need to read/write data pages directly, you probably also need to bypass the transaction log, and so one of the NoSQL engines might be better than MySQL or PostreSQL.
Either way, you're bypassing SQL Server.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
May 20, 2011 at 3:07 pm
abygaikwad17 (5/19/2011)
2. Actually, I am running and retrieving data from a SQL server present on Windows m/c to a linux machine (runs JDBC program). So cannot use BULK_INSERT as data is in a file on linux machine.
An alternative to loading the file by reaching from Windows onto Linux file system using BULK INSERT or SSIS is to load directly from Linux file system into SQL Server using FreeBCP.
FreeBCP is part of the FreeTDS Project (http://www.freetds.org) and can bulk load your data file from the Linux file system directly into a SQL Server table. FreeBCP uses the SQL Server Bulk Load API so will offer performance comparable to other techniques mentioned although may be a little lacking in terms of error handling and support for advanced data types.
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
Viewing 11 posts - 1 through 11 (of 11 total)
You must be logged in to reply to this topic. Login to reply