Getting year part out of a string that represents a date

  • Hi,

    I have a requirement to fetch the year from an imported .csv  file that can hold any kind of date , in any format as delivered to use by municipalities that manage to get their own ideas of the template they need to use.

    Basically the date can have many formats , like  YYYY-MM-DD , DD-MM-YYYY ,  DD/MM/YYYY  etc.

    Chances are minimal we can convert all the different permutations using  a cast / convert, so I was looking for a UDF to do the job.

    Would appreciate any thoughts on this challenge..

     

     

     

  • My advice - sanitize the data before import. Use Excel to get the dates into one standard format (manual approach) if it isn't too time consuming. If it is too time consuming, you'll likely need to do it during your import process (SSIS) in a script step. Possibly better approaches, but my approach would be to first find and replace all / with -. then repeat with \ to -. At this point you should have 3 possible date formats. If you have more, repeat the above for the other characters. Next, check the location of the first -. If it is at character 5, then grab the first 4 characters as that is a year. If it is not, then check the total number of characters. If it is 8 or less, you are SOL as the format could be MM-DD-YY or DD-MM-YY or YY-MM-DD or possibly single digit for month/year or some other format and doing a best guess is going to be challenging to code out without more logic you can do. If the length is greater than 8, you can now strip off the first 3 characters and check if character 5 is a -. If it is, you've got a weird format of either MM-YYYY-DD or DD-YYYY-MM, but hey, you got the year. If it isn't, then you can strip off the next 3 characters. Now if you have 4 characters left, you got the year. If you have more, then you have a broken date or someone is planning for things after the year 10000.

    SOME of the above may be overkill mind you. You could probably simplify and only handle strings that are 8 to 10 characters long and discard everything else. And most, if not all, date formats have the year at the start or end; I doubt anyone is putting it in the middle. IF both of these assumptions are safe, then you could grab the first 4 characters and TRY to cast them to an INT. if it succeeds, it's a year. if it fails, repeat but with the last 4 characters. If it fails again, then the data is garbage and ignore/flag that row and move onto the next one. I say 8 to 10 characters long as you could have 1-1-2025 for January 1st OR it could be 01-01-2025 or some clown could do 1-01-2025.

    Now, if this is already in the database and you need to deal with it, that's a whole different beast, but I know some smart people on this forum will be able to help. The logic would be the same, but the code would be different.

    My recommendation though is to fix it at the source (CSV) first and if that is not possible, then during import (SSIS or whatever tool you are using), and only fix it at the destination if it is the ONLY option. SQL should not contain garbage data. If you only care about the year, then make the column an INT and ONLY insert the INT value. Only store useful data in the database. Storing dates in random formats, ESPECIALLY in a VARCHAR column, is going to bite you at some point in the future when someone makes a typo in the CSV and a comma gets missed and suddenly you have usernames (for example) in your VARCHAR date column...

    Sanitize then import is what I encourage. IF you ever need to go back to source data, you have the original (CSV). Just store the CSV name in the table so if someone questions some data in there, you can point them back to the source.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • I wrote the following (as I realized the data is strictly about pupils up till a certain age):

    [note: this will fail when someone will send in stuff like  '20-08-25', but this is a high-pressure issue and I hope for the best;

    data itself is discarded after a whole lot of subsequent processing]

    create   function [dbo].[year_from_datum] (@datum_string varchar(50))RETURNS VARCHAR(4)

    AS

    begin

    declare @temp_str varchar(50)

    set @temp_str =

    case

    when

    charindex('190',@datum_string,1) > 0 then  substring(@datum_string,charindex('190',@datum_string),4)

    when

    charindex('199',@datum_string,1) > 0 then  substring(@datum_string,charindex('199',@datum_string),4)

    when

    charindex('200',@datum_string,1) > 0 then  substring(@datum_string,charindex('200',@datum_string),4)

    when

    charindex('201',@datum_string,1) > 0 then  substring(@datum_string,charindex('201',@datum_string),4)

    when

    charindex('202',@datum_string,1) > 0 then  substring(@datum_string,charindex('202',@datum_string),4)

    else null end

    return @temp_str

    end

    • This reply was modified 3 weeks, 5 days ago by blom0344.
    • This reply was modified 3 weeks, 4 days ago by blom0344.
  • Finding the year in a date is something relatively easy for a human and often difficult for a machine.  However many date formats you think you may have, the data will present you with another.

    I would aim to normalise dates before loading into the database. This is because to do the job well is best done in a program language domain rather than a dml domain. You may find a class library that can hunt for dates in a field and normalise them, or it may be worth building an AI transformation.

    For the dates already in the DB, I would go for something that can exploit whatever you use for the initial transformation, possibly as a mini ETL process.

    Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017 2016, 2014, 2012, 2008 R2, 2008 and 2005.

    When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara

  • If you want to convert from a csv, python can help:

    def detect_date_format(date_str):
    if re.match(r'^\d{4}-\d{2}-\d{2}$', date_str):
    return 'YYYY-MM-DD'
    elif re.match(r'^\d{2}-\d{2}-\d{4}$', date_str):
    return 'DD-MM-YYYY'
    else:
    return 'Unknown'

    # Check format distribution
    df['format_detected'] = df['date_column'].apply(detect_date_format)
    print(df['format_detected'].value_counts())

    However, what is this date?

    2-3-2025

    Is it March 2 (EU/UK) or Feb 3 (US)?

    Best thing is to press people to use one format, though yyyy-mm-dd is the best (with or without hyphens/slashes) for avoiding ambiguity. That being said, you might guess for those dates that have a > 12 value in one spot to guess mm-dd or dd-mm. Other than that, I might error out other rows  that are ambiguous and try to have a human sort them

Viewing 5 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply