Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase

Searching From Varbinary Data Type Expand / Collapse
Author
Message
Posted Wednesday, August 19, 2009 1:31 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Monday, August 24, 2009 11:00 PM
Points: 3, Visits: 16
Hi All,

I have a varbinary datatype in which I have stored the files into that. How to search the words from those files. Please help us.

Thanks In Advance
Jana
Post #773239
Posted Wednesday, August 19, 2009 6:53 AM
SSCrazy

SSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazy

Group: General Forum Members
Last Login: Wednesday, April 16, 2014 5:39 AM
Points: 2,104, Visits: 5,378
Can you explain why you are using varbinary column to store plain text?
In any case here is a small demo that shows you what to do in case that the text is stored as binary data. I recommend that you'll consider using one of the string data types (char, varchar, nchar or nvarchar) instead of varbinary data type.

create table Demo (vb varbinary(100))
go

--Inserting 3 records into the table
insert into Demo (vb)
select convert(varbinary(100),'Why whould you use a varbinary column?')
union
select convert(varbinary(100),'When you store plain text,')
union
select convert(varbinary(100),'you should use one of the strings data type')
go

--Getting all the values as binary values
select vb from Demo
--Getting all the values as string by using convert function
select convert(varchar(100), vb) from Demo
--Getting only one of the records by using convert and like operator
--Notice that this will prevent the use of index, and if the table
--is a big table, you'll might have performance problems.
select convert(varchar(100), vb) from Demo where convert(varchar(100),vb) like '%store%'
go

drop table Demo

Adi


--------------------------------------------------------------
To know how to ask questions and increase the chances of getting asnwers:
http://www.sqlservercentral.com/articles/Best+Practices/61537/

For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
Post #773447
Posted Thursday, August 20, 2009 3:15 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Monday, August 24, 2009 11:00 PM
Points: 3, Visits: 16
We are not storing the plain text. we are storing the file into binary format in varbinary column. from that only we need to search the text.

Thanks In Advance
Post #774072
Posted Thursday, August 20, 2009 9:43 AM


Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: 2 days ago @ 6:43 AM
Points: 1,182, Visits: 1,950
Create a full-text index and use CONTAINS to search the contents of a binary file. However, you'll need to have a separate column that contains the file type (e.g, doc, ppt, pdf, txt, etc.) of the binary document.

This column is specified when creating the full-text index. It is used by the full-text indexer to determine which iFilter to load in order to "read" through the binary and tokenize the terms (words) found in it.

See post 768830 for a set of links that will get you started with full-text.
http://www.sqlservercentral.com/Forums/FindPost768830.aspx



(PHB) I think we should build an SQL database. (Dilbert) What color do you want that database? (PHB) I think mauve has the most RAM.
Post #774340
Posted Thursday, August 27, 2009 3:46 PM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Thursday, February 14, 2013 9:11 AM
Points: 19, Visits: 19
I have a reason why I might want to do this.

My company specializes in storing and full-text indexing PDF files. I am, however, a little bit suspect of Microsoft's use of Adobe's iFilter. (It is so encapsulated that there is little to do when something goes wrong). If, however, my program were to use PDFBox to extract the text, that would be great. In that case, I could simply add the file to the database in a varbinary(max) FILESTREAM column and add the text in a varchar(max) field or something.

Another issue raises itself, however, and that is database size. I currently have a customer who has a single database of scanned PDF files that exceeds 100 gigabytes. I know that this size doesn't count against the 4 GB "Express Edition" database size limitation, but the varchar(max) field does, and for this database to be available under SS2008EE, I would somehow have to circumvent the 4 GB of space taken up by the textual contents of the PDF file. IF SQL were able to take text converted to binary and index it, then I could create a full-text index on that varbinary field.

So, from BLOB --> varbinary(max) (actual file) --> text contents --> varbinar(max) (text contents)

Is there a way to make this work?

Thanks!
Post #778784
Posted Friday, August 28, 2009 7:37 AM


Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: 2 days ago @ 6:43 AM
Points: 1,182, Visits: 1,950
Database size issues aside, as that is a business / environmental issue, I would still give the Adobe iFilter a try.

We are a SaaS-based company, hosting databases for our customers. They have the ability to upload and store PDF documents, along with Microsoft Word and PowerPoint files. Our application also provides the ability to search over the content contained in these documents.

We're currently using 32-bit SQL Server 2005 with the Adobe iFilter that is provided with the Adobe Reader 9.1.

I am, however, a little bit suspect of Microsoft's use of Adobe's iFilter. (It is so encapsulated that there is little to do when something goes wrong).

The issue is generic, regardless of the iFilter provider. I.e., Microsoft's Office iFilter has the same limitations.




(PHB) I think we should build an SQL database. (Dilbert) What color do you want that database? (PHB) I think mauve has the most RAM.
Post #779087
Posted Friday, August 28, 2009 8:10 AM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Thursday, February 14, 2013 9:11 AM
Points: 19, Visits: 19
I understand this to be a limitation of the iFilter setup in general. My assertion is, though, that if I move AWAY from the iFilter setup (or at least use the plain text iFilter, which has been much faster and more accurate in my testing), and use, say, PDFBox for text extraction from a PDF, then that process becomes much more granular and controlled by me. The problem with that solution is that it would require the insertion of that text output into the database for indexing. If, however, I can get that text into a binary format, yet still be able to run the text-only iFilter against it, then I could yet again circumvent the SS2008EE database size limitations.

Am I being clear with my question?

Current setup:
PDF Files (on filesystem) --> Database BLOBs (as varbinary(max) FILESTREAM) --> Adobe iFilter --> gives my a database size of roughly (Total Size of PDF files / 20)

Theorhetical setup:
PDF Files (on filesystem) --> extract text using PDFBox --> Now have related PDF file and text contents --> Database BLOB (1 for each) as varbinary(max) FILESTREAM --> Text iFilter (on TEXT varbinary(max), NOT PDF varbinary(max)) --> results in unknown database size

Basically, I'd like to get around using Adobe's iFilter (I don't mind Microsoft's text iFilter) and still maximize the number of records I can store in a database while still coming in under the 4 GB size limit.

Thanks!
Post #779125
Posted Friday, August 28, 2009 8:37 AM


Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: 2 days ago @ 6:43 AM
Points: 1,182, Visits: 1,950
You should be able to full-text index varbinary(max) data that is pure "text". Just specify a value of "txt" for the companion file type column.

I don't know if the documents you are trying to index are all of a single language/culture (e.g., "en-US") or multi-language. When you extract out the "text" from a document and then index it as plain text, you will lose any language-specific tokenization as the language "meta" tag will be lost. In other words, you cannot have a table containing rows of plain text in mixed languages indexed under the language-specific tokenization rules for the language of the row. This is a very bad SQL Server limitation. BTW, in Oracle, you can as you specify a column that contains the 3-character ISO language code. Granted, you can have separate columns for each language, but that is an ugly mess.

As for space savings, why not also store the extracted text column of varbinary(max) as FILESTREAM?




(PHB) I think we should build an SQL database. (Dilbert) What color do you want that database? (PHB) I think mauve has the most RAM.
Post #779153
Posted Friday, August 28, 2009 8:42 AM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Thursday, February 14, 2013 9:11 AM
Points: 19, Visits: 19
@JohnG,

That's EXACTLY what I was talking about doing, but wasn't sure what the results would be. The single language limitation isn't really a problem for us. All we do is English (at least for now).

I'll try it out, and THANK YOU!
Post #779160
« Prev Topic | Next Topic »

Add to briefcase

Permissions Expand / Collapse