SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Searching From Varbinary Data Type


Searching From Varbinary Data Type

Author
Message
Janarthanan-319869
Janarthanan-319869
Valued Member
Valued Member (71 reputation)Valued Member (71 reputation)Valued Member (71 reputation)Valued Member (71 reputation)Valued Member (71 reputation)Valued Member (71 reputation)Valued Member (71 reputation)Valued Member (71 reputation)

Group: General Forum Members
Points: 71 Visits: 19
Hi All,

I have a varbinary datatype in which I have stored the files into that. How to search the words from those files. Please help us.

Thanks In Advance
Jana
Adi Cohn
Adi Cohn
SSChampion
SSChampion (14K reputation)SSChampion (14K reputation)SSChampion (14K reputation)SSChampion (14K reputation)SSChampion (14K reputation)SSChampion (14K reputation)SSChampion (14K reputation)SSChampion (14K reputation)

Group: General Forum Members
Points: 14734 Visits: 6618
Can you explain why you are using varbinary column to store plain text?
In any case here is a small demo that shows you what to do in case that the text is stored as binary data. I recommend that you'll consider using one of the string data types (char, varchar, nchar or nvarchar) instead of varbinary data type.


create table Demo (vb varbinary(100))
go

--Inserting 3 records into the table
insert into Demo (vb)
select convert(varbinary(100),'Why whould you use a varbinary column?')
union
select convert(varbinary(100),'When you store plain text,')
union
select convert(varbinary(100),'you should use one of the strings data type')
go

--Getting all the values as binary values
select vb from Demo
--Getting all the values as string by using convert function
select convert(varchar(100), vb) from Demo
--Getting only one of the records by using convert and like operator
--Notice that this will prevent the use of index, and if the table
--is a big table, you'll might have performance problems.
select convert(varchar(100), vb) from Demo where convert(varchar(100),vb) like '%store%'
go

drop table Demo



Adi

--------------------------------------------------------------
To know how to ask questions and increase the chances of getting asnwers:
http://www.sqlservercentral.com/articles/Best+Practices/61537/

For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
Janarthanan-319869
Janarthanan-319869
Valued Member
Valued Member (71 reputation)Valued Member (71 reputation)Valued Member (71 reputation)Valued Member (71 reputation)Valued Member (71 reputation)Valued Member (71 reputation)Valued Member (71 reputation)Valued Member (71 reputation)

Group: General Forum Members
Points: 71 Visits: 19
We are not storing the plain text. we are storing the file into binary format in varbinary column. from that only we need to search the text.

Thanks In Advance
Mauve
Mauve
SSCertifiable
SSCertifiable (5.4K reputation)SSCertifiable (5.4K reputation)SSCertifiable (5.4K reputation)SSCertifiable (5.4K reputation)SSCertifiable (5.4K reputation)SSCertifiable (5.4K reputation)SSCertifiable (5.4K reputation)SSCertifiable (5.4K reputation)

Group: General Forum Members
Points: 5364 Visits: 2071
Create a full-text index and use CONTAINS to search the contents of a binary file. However, you'll need to have a separate column that contains the file type (e.g, doc, ppt, pdf, txt, etc.) of the binary document.

This column is specified when creating the full-text index. It is used by the full-text indexer to determine which iFilter to load in order to "read" through the binary and tokenize the terms (words) found in it.

See post 768830 for a set of links that will get you started with full-text.
http://www.sqlservercentral.com/Forums/FindPost768830.aspx


(PHB) I think we should build an SQL database. (Dilbert) What color do you want that database? (PHB) I think mauve has the most RAM.
Michael McEvoy-477519
Michael McEvoy-477519
SSC-Enthusiastic
SSC-Enthusiastic (125 reputation)SSC-Enthusiastic (125 reputation)SSC-Enthusiastic (125 reputation)SSC-Enthusiastic (125 reputation)SSC-Enthusiastic (125 reputation)SSC-Enthusiastic (125 reputation)SSC-Enthusiastic (125 reputation)SSC-Enthusiastic (125 reputation)

Group: General Forum Members
Points: 125 Visits: 19
I have a reason why I might want to do this.

My company specializes in storing and full-text indexing PDF files. I am, however, a little bit suspect of Microsoft's use of Adobe's iFilter. (It is so encapsulated that there is little to do when something goes wrong). If, however, my program were to use PDFBox to extract the text, that would be great. In that case, I could simply add the file to the database in a varbinary(max) FILESTREAM column and add the text in a varchar(max) field or something.

Another issue raises itself, however, and that is database size. I currently have a customer who has a single database of scanned PDF files that exceeds 100 gigabytes. I know that this size doesn't count against the 4 GB "Express Edition" database size limitation, but the varchar(max) field does, and for this database to be available under SS2008EE, I would somehow have to circumvent the 4 GB of space taken up by the textual contents of the PDF file. IF SQL were able to take text converted to binary and index it, then I could create a full-text index on that varbinary field.

So, from BLOB --> varbinary(max) (actual file) --> text contents --> varbinar(max) (text contents)

Is there a way to make this work?

Thanks!
Mauve
Mauve
SSCertifiable
SSCertifiable (5.4K reputation)SSCertifiable (5.4K reputation)SSCertifiable (5.4K reputation)SSCertifiable (5.4K reputation)SSCertifiable (5.4K reputation)SSCertifiable (5.4K reputation)SSCertifiable (5.4K reputation)SSCertifiable (5.4K reputation)

Group: General Forum Members
Points: 5364 Visits: 2071
Database size issues aside, as that is a business / environmental issue, I would still give the Adobe iFilter a try.

We are a SaaS-based company, hosting databases for our customers. They have the ability to upload and store PDF documents, along with Microsoft Word and PowerPoint files. Our application also provides the ability to search over the content contained in these documents.

We're currently using 32-bit SQL Server 2005 with the Adobe iFilter that is provided with the Adobe Reader 9.1.

I am, however, a little bit suspect of Microsoft's use of Adobe's iFilter. (It is so encapsulated that there is little to do when something goes wrong).

The issue is generic, regardless of the iFilter provider. I.e., Microsoft's Office iFilter has the same limitations.


(PHB) I think we should build an SQL database. (Dilbert) What color do you want that database? (PHB) I think mauve has the most RAM.
Michael McEvoy-477519
Michael McEvoy-477519
SSC-Enthusiastic
SSC-Enthusiastic (125 reputation)SSC-Enthusiastic (125 reputation)SSC-Enthusiastic (125 reputation)SSC-Enthusiastic (125 reputation)SSC-Enthusiastic (125 reputation)SSC-Enthusiastic (125 reputation)SSC-Enthusiastic (125 reputation)SSC-Enthusiastic (125 reputation)

Group: General Forum Members
Points: 125 Visits: 19
I understand this to be a limitation of the iFilter setup in general. My assertion is, though, that if I move AWAY from the iFilter setup (or at least use the plain text iFilter, which has been much faster and more accurate in my testing), and use, say, PDFBox for text extraction from a PDF, then that process becomes much more granular and controlled by me. The problem with that solution is that it would require the insertion of that text output into the database for indexing. If, however, I can get that text into a binary format, yet still be able to run the text-only iFilter against it, then I could yet again circumvent the SS2008EE database size limitations.

Am I being clear with my question?

Current setup:
PDF Files (on filesystem) --> Database BLOBs (as varbinary(max) FILESTREAM) --> Adobe iFilter --> gives my a database size of roughly (Total Size of PDF files / 20)

Theorhetical setup:
PDF Files (on filesystem) --> extract text using PDFBox --> Now have related PDF file and text contents --> Database BLOB (1 for each) as varbinary(max) FILESTREAM --> Text iFilter (on TEXT varbinary(max), NOT PDF varbinary(max)) --> results in unknown database size

Basically, I'd like to get around using Adobe's iFilter (I don't mind Microsoft's text iFilter) and still maximize the number of records I can store in a database while still coming in under the 4 GB size limit.

Thanks!
Mauve
Mauve
SSCertifiable
SSCertifiable (5.4K reputation)SSCertifiable (5.4K reputation)SSCertifiable (5.4K reputation)SSCertifiable (5.4K reputation)SSCertifiable (5.4K reputation)SSCertifiable (5.4K reputation)SSCertifiable (5.4K reputation)SSCertifiable (5.4K reputation)

Group: General Forum Members
Points: 5364 Visits: 2071
You should be able to full-text index varbinary(max) data that is pure "text". Just specify a value of "txt" for the companion file type column.

I don't know if the documents you are trying to index are all of a single language/culture (e.g., "en-US") or multi-language. When you extract out the "text" from a document and then index it as plain text, you will lose any language-specific tokenization as the language "meta" tag will be lost. In other words, you cannot have a table containing rows of plain text in mixed languages indexed under the language-specific tokenization rules for the language of the row. This is a very bad SQL Server limitation. BTW, in Oracle, you can as you specify a column that contains the 3-character ISO language code. Granted, you can have separate columns for each language, but that is an ugly mess.

As for space savings, why not also store the extracted text column of varbinary(max) as FILESTREAM?


(PHB) I think we should build an SQL database. (Dilbert) What color do you want that database? (PHB) I think mauve has the most RAM.
Michael McEvoy-477519
Michael McEvoy-477519
SSC-Enthusiastic
SSC-Enthusiastic (125 reputation)SSC-Enthusiastic (125 reputation)SSC-Enthusiastic (125 reputation)SSC-Enthusiastic (125 reputation)SSC-Enthusiastic (125 reputation)SSC-Enthusiastic (125 reputation)SSC-Enthusiastic (125 reputation)SSC-Enthusiastic (125 reputation)

Group: General Forum Members
Points: 125 Visits: 19
@JohnG,

That's EXACTLY what I was talking about doing, but wasn't sure what the results would be. The single language limitation isn't really a problem for us. All we do is English (at least for now).

I'll try it out, and THANK YOU!
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum







































































































































































SQLServerCentral


Search