SQL advice needed.

  • I'm unsure where to post, so sorry if in the incorrect section.

    I need a tool that will look up all lines from file1 in file2 (something like grep -f file1 file2), but I need it to be really fast for big files. I expect file1 to change for every lookup and to have even few million lines, file2 will stay the same, it might be updated from time to time and may contain over few billion lines (something around 150GB file size).

    file1 format: item

    file2 format: item:value

    I heard program with hashmap would be the fastest option, but apparently I'd need a lot of ram for it. So I'd like it to be done with NoSQL database like Cassandra Apache. I'd need a table with 2 rows: item and value and some program using API that would get all lines (file1) and look them all up in the table and if item match I'd like it to return item:value.

  • Thanks for posting your issue and hopefully someone will answer soon.

    This is an automated bump to increase visibility of your question.

  • Do you need this to be "real time" data or can you load the data into some system (Hadoop, SQL Server, Oracle, etc) on a nightly basis?

    I am not the expert on this as my experience has been all around SQL Server, but if you don't need real time data and can live with a scheduled data load, you could use one of the database tools to pull the data into the database in a table and then search or spit out results as you see fit.

    Using SQL Server, the tool I would recommend would be SSIS.  Use that to pull all of the data from the flat file into the database on a schedule and have it spit out the matches as it completes.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply