Common Data Challenges

Steve Jones - SSC Editor

SSC Guru

Points: 734418
March 6, 2019 at 8:08 pm

#416748

Comments posted to this topic are about the item Common Data Challenges

DinoRS SSCrazy Points: 2683 · Answer 1

I haven't been in the business as long as you have, Steve (roughly a third of your time), and I only very recently started to care about data outside performance and DR. But from supporting developers previously, I think some of the challenges remain the same: back then, today and potentially tomorrow.

That would be first and foremost data management. Remember all those functions, procedures and expressions we had to write back then to get Excel-exported dates back into something useful? We still do, and we will probably still be doing it in the future. Another problem I would call out is data sourcing: do I really need 20 exported CSV files to get all the data I want to process, or have requirements changed so much that I could just get 2 or 3 large CSV files with all the necessary data? After all, we are processing more data today than 10 years ago.
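To make the Excel date clean-up mentioned above concrete, here is a minimal Python sketch. It is not taken from any poster's code. It assumes the export uses Excel's 1900 date system, where serial numbers from 61 upward count days from 1899-12-30, and that text dates arrive as either ISO or day-first strings. The helper names `excel_serial_to_datetime` and `parse_exported_date` are hypothetical.

```python
import math
from datetime import date, datetime, timedelta

# Excel's 1900 date system treats 1900 as a leap year (it was not), so for
# serial numbers >= 61 the effective epoch is 1899-12-30.
EXCEL_EPOCH = datetime(1899, 12, 30)


def excel_serial_to_datetime(serial: float) -> datetime:
    """Convert an Excel 1900-system serial number to a datetime."""
    if not math.isfinite(serial) or serial < 61:
        # Serials below 61 fall before the phantom 1900-02-29 and need
        # special handling that this sketch does not attempt.
        raise ValueError(f"unsupported Excel serial: {serial!r}")
    return EXCEL_EPOCH + timedelta(days=serial)


def parse_exported_date(value: str) -> date:
    """Accept a serial number or one of a few assumed text formats."""
    text = value.strip()
    try:
        return excel_serial_to_datetime(float(text)).date()
    except ValueError:
        pass  # not a usable serial number, fall through to text formats
    # Assumption: text dates are ISO or day-first; adjust for the real source.
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date value: {value!r}")
```

With these assumptions, `parse_exported_date("43530")` and `parse_exported_date("06/03/2019")` both give 6 March 2019. Whether `06/03` means March or June depends entirely on who produced the export, which is much of why this problem never goes away.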

And with processing more data, I think we will see more shifts towards different approaches to data and data processing. Is the data a snapshot? -> Most likely your plain old ETL process, for decades to come. Is the data a continuous stream? -> We certainly want to remember the interesting things from our data streams, so we'll keep those, but we still want to process the stream continuously. For things like that we will see much more use of things like Hadoop and machine learning, so yeah, we will and do see a lot of new challenges waiting for us. They might be a little different, as we might not be looking that much at index optimizations anymore but rather at whether our ML algorithms enable our business to reliably make decisions to our advantage.

David.Poole SSC Guru Points: 75896 · Answer 2

I started my career in 1988. I've found that a lot of the problems with data are as a result of training gaps. Each generation suffers the same education gaps so are doomed to keep making the same mistakes.

The problems I see are not isolated to IT. I do not believe that mankind has adapted to the internet, the cultural implications, behaviours, ways of working etc and that exacerbates the recurrence of old mistakes. We have the capability to drive at a million miles an hour but the chassis is older than you'd believe and no one has upgraded the brakes!


Michael Lysons SSCertifiable Points: 6490 · Answer 3

Application and database design that allows users to enter bad data. The users are (generally) not to blame, they will take the path of least resistance, and we end up with bad data that needs addressing.

I work in the NHS, and often some requirement will arise that the hospital's Patient Administration System (PAS) can't properly handle, so a workaround is required. For example, I work at hospital A, and hospital B decides to use our spare capacity to do some of their clinical work - our PAS has to record this activity, usually by storing some identifying data in a data item not designed for that purpose. Which ultimately means the data warehouse receives data for a different hospital, which then needs to be stripped out of all operational datasets etc. But, we have to ensure that hospital B can see that data (this can mean various things from direct to indirect access) so they can receive payment for it.

These are regular challenges and at a high level they haven't changed much over the years, although technology changes have occurred, e.g. HL7 interfacing (and interfacing in general) is a much bigger part of the work than it was 10 years ago.

Steve Jones - SSC Editor SSC Guru Points: 734418 · Answer 4

David.Poole - Thursday, March 7, 2019 4:11 AM
I started my career in 1988. I've found that a lot of the problems with data are as a result of training gaps. Each generation suffers the same education gaps so are doomed to keep making the same mistakes.
The problems I see are not isolated to IT. I do not believe that mankind has adapted to the internet, the cultural implications, behaviours, ways of working etc and that exacerbates the recurrence of old mistakes. We have the capability to drive at a million miles an hour but the chassis is older than you'd believe and no one has upgraded the brakes!

Agree. Lots of cultural issues, and lots of us trying to adapt to the newer way of being connected to data.

Steve Jones - SSC Editor SSC Guru Points: 734418 · Answer 5

Michael Lysons - Thursday, March 7, 2019 4:37 AM
Application and database design that allows users to enter bad data. The users are (generally) not to blame, they will take the path of least resistance, and we end up with bad data that needs addressing.

While I agree, I also know that so many business processes aren't as tightly defined as we would like. Usually because we didn't account for the chaos of the world when we built the system. As a result, I've gone more towards having optional fields and update capabilities that allow users to clean up data and move it around later. The "every field is x" or we need all this data was a trend in the 80s/90s and it didn't work out well. Too many problems from systems trying to force users to change their work rather than systems adapting to users.

I'd argue that we need better app design that creates flexibility to meet the problems of the world.

Of course, this still means constant data challenges for us to deal with.

ZZartin SSC-Dedicated Points: 30894 · Answer 6

One super annoying trend I'm seeing more of is people wanting to use their real-time and/or EAV interfaces for bulk data transfer. I mean, I'm sorry, but when 90% of your fields are defined in EAV, or your real-time interface is some bloated JSON/XML, and you want to transfer hundreds of thousands or more records every day through it in a narrow window, you're not going to have a good time.
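For anyone who hasn't met EAV, a rough Python sketch of what the receiving side has to do may help. It is purely illustrative: the `(entity_id, attribute, value)` triple layout and the `pivot_eav` helper are assumptions, not any particular vendor's interface.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def pivot_eav(rows: Iterable[Tuple[int, str, str]]) -> List[Dict[str, object]]:
    """Fold (entity_id, attribute, value) triples into one dict per entity."""
    entities: Dict[int, Dict[str, object]] = defaultdict(dict)
    for entity_id, attribute, value in rows:
        # Every attribute of every entity arrives as its own row.
        entities[entity_id][attribute] = value
    # Emit one wide record per entity, ordered by entity id.
    return [{"entity_id": eid, **attrs} for eid, attrs in sorted(entities.items())]
```

A record with 50 attributes arrives as 50 rows, so a few hundred thousand records a day quickly becomes tens of millions of rows to move and re-pivot inside that narrow window.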

Jeff Moden SSC Guru Points: 1003851 · Answer 7

Nothing has changed with the data itself when it comes to problems.

What has changed is the frequency of those problems and the desperate hacks people try to hammer together, either themselves or by implementing other people's hacks in the form of 3rd party software, shrink-wrapped or not.

The frequency has increased because data is more prevalent than it ever was, simply due to the growth in the use of computers and the notions people have about what data is important.

The hacks have increased because of the waves of people that never used data before that have entered the field because it's both a prevalent field and a lucrative field. Ironically, it's like a bad drug habit. The more people do it, the more they need to do it because a lot of the people that have problems importing, analyzing, using, and storing the data are also the same ones generating the data for others.

If you don't think so, just look at the questions/problems posed on these and other forums, database related or not.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.


Rod at work SSC-Dedicated Points: 33897 · Answer 8

I'm involved in a rewrite of an old MS Access application into a WPF app (at least one; there may be more). I've discussed this in these forums before. The Access app is actually a front-end to a SQL Server database. (Long before I got here, it used to all be within Access. Someone in the past took the time to migrate the data to a SQL Server database, but they left the Access front-end.) At the moment, though, I'm on the sidelines as two BAs are busy analyzing the data, with a view towards replacing it with another database. Steve, the article you referenced points to another article titled How an Agile Approach Can Help Solve Your Data Problems. That article contrasts a waterfall approach to database/data design with an agile approach. If that article is correct, we are very much following a waterfall approach. Then again, they've been doing waterfall for longer than I've been working.

Anyway, I don't want to defend the old database. I don't feel anything for or against it. I just wonder why they're even bothering to rearchitect the database?

Kindest Regards, Rod

David.Poole SSC Guru Points: 75896 · Answer 9

ZZartin - Thursday, March 7, 2019 7:42 AM
One super annoying trend I'm seeing more of is people wanting to use their real-time and/or EAV interfaces for bulk data transfer. I mean, I'm sorry, but when 90% of your fields are defined in EAV, or your real-time interface is some bloated JSON/XML, and you want to transfer hundreds of thousands or more records every day through it in a narrow window, you're not going to have a good time.

I feel your pain. I haven't done anything with JSON in SQL Server, but I know that JSON Path, unlike XPath, does not have a getParent() equivalent function, which means that bulk ingestion of JSON containing arrays produces two unconnectable recordsets.
The approach we've had to take is to import a file containing many JSON documents of the same document type and loop over each document, triggering multiple extractions. In effect, reinventing RBAR.
I am currently experimenting with ways of bulk ingesting the non-array part of the document, which is very fast, and then submitting those documents containing arrays to the RBAR process.
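
To illustrate the parent/child problem outside SQL Server, here is a minimal Python sketch. It is not the approach described in this post. It assumes one JSON document per line, a hypothetical `id` key on each document, and a hypothetical `items` array whose elements are JSON objects. The point is only that the parent key has to be copied onto each array element at extraction time, because there is no way to navigate back up afterwards.

```python
import json
from typing import Any, Dict, Iterable, List, Tuple


def split_documents(
    lines: Iterable[str],
    key_field: str = "id",       # assumed name of the document key
    array_field: str = "items",  # assumed name of the nested array
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split JSON documents into a parent recordset and a child recordset."""
    parents: List[Dict[str, Any]] = []
    children: List[Dict[str, Any]] = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines in the file
        doc = json.loads(line)
        parent_key = doc[key_field]
        # Non-array part of the document becomes the parent row.
        parents.append({k: v for k, v in doc.items() if k != array_field})
        # Each array element becomes a child row stamped with its parent key.
        for position, element in enumerate(doc.get(array_field, [])):
            children.append({"parent_id": parent_key, "position": position, **element})
    return parents, children
```

The two lists can then be bulk loaded separately and joined on `parent_id`. This still walks the file one document at a time, so it only shows how the recordsets stay connected, not how to avoid the loop.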

I've had some luck with a library called YAJL but no son of mine will ever be called JASON.
