October 19, 2016 at 9:04 pm
Comments posted to this topic are about the item Where Do You Run Your R Scripts?
October 19, 2016 at 11:38 pm
Here are my 2 cents from working within an analytics team that uses R and Python.
I view most of what's done with R and Python from a data science standpoint as ad-hoc analysis. This is when we are answering very specific questions of the data that, in most cases, require very specific datasets that the enterprise does not support at any standard, consistent level. In these cases, why does it need to live on the backend? Why not closer to the client on the front-end?
However, what happens when that ad-hoc analysis needs to become part of the enterprise? What happens when it needs to become part of that standard where it has to be automated and supported by someone? That's when questions like "where does this sit?" come into play. In most cases, regardless of the language, it's going to sit as close to the data as possible because that's where it belongs. It's just a matter of what approach makes the most sense for the business. That's where the justification or use case comes into play: is it worth burning the resources to support it or not?
I think the most common approach most database guys try to take when it comes to turning ad-hoc into something that's part of the enterprise/standardized model is, "How can we convert this into something we can support?" That normally translates into, "How can we convert R into TSQL, MDX, C++, whatever?" Unfortunately, that adds overhead to the problem and potentially huge bottlenecks, much like it sometimes does for the application developer who has to have all his logic converted to a stored procedure by some SQL developer. In the future, I bet you will see the same problems you may have with the developer, but with the data scientist. :hehe:
For me, I think the best approach is to keep it as native to the analyst as possible when something goes from ad-hoc to enterprise/standard. If there is a way for the existing systems to integrate R or Python and do it well, then that makes the most sense. However, that again greatly depends on whether this is ad-hoc or needs to be part of the enterprise/standard model that has to be consistent, automated and supported. Other than that, why worry about it that much? Deliver data, let them go to town until they create something awesome that you can reuse. Then integrate.
(Side Note: It would be awesome to have an editorial that covers where the roles and responsibilities lie for supporting data analysts who do ad-hoc analysis that is not supported by the existing model. For example, does it make sense to have multiple data warehouses that allow analysts to query directly? If so, where do the roles and responsibilities lie with more BI-focused teams?)
October 20, 2016 at 7:32 am
I view most of what's done with R and Python from a data science standpoint as ad-hoc analysis.
Mostly true. There are folks who use these tools as part of their normal production workflow.
I think the most common approach most database guys try to take when it comes to turning ad-hoc into something that's part of the enterprise/standardized model is, "How can we convert this into something we can support?"
Also, how can you speed up the processing or limit the resources? Sometimes this is as easy as which packages you use with R or Python. There are also some wicked optimization tools like Numba, but optimized C++ is 2 orders of magnitude faster than either vanilla R or Python.
If you can afford the software licenses, having additional servers for testing ad-hoc and preliminary production code might be worth the effort.
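As a rough illustration of the "which packages you use" point, here is a minimal Python sketch that times a hand-written loop against a standard-library helper doing the same calculation. The function names (`mean_loop`, `mean_library`, `compare`) and the test data are made up for this example, and it is not a benchmark of Numba or C++; actual timings will vary by machine and Python version.

```python
import statistics
import timeit


def mean_loop(values):
    """Hand-written mean: one Python-level addition per element."""
    total = 0.0
    for v in values:
        total += v
    return total / len(values)


def mean_library(values):
    """Same calculation delegated to statistics.fmean (built on math.fsum)."""
    return statistics.fmean(values)


def compare(n=100_000, repeats=5):
    """Return (loop_seconds, library_seconds) for a list of n floats."""
    # Illustrative data only; not taken from any real workload.
    data = [float(i % 97) for i in range(n)]
    loop_t = min(timeit.repeat(lambda: mean_loop(data), number=1, repeat=repeats))
    lib_t = min(timeit.repeat(lambda: mean_library(data), number=1, repeat=repeats))
    return loop_t, lib_t


if __name__ == "__main__":
    loop_t, lib_t = compare()
    print(f"loop: {loop_t:.4f}s  library: {lib_t:.4f}s")
```

Both functions return the same answer; the only thing that changes is where the work is done, which is usually the first thing worth checking before rewriting anything in another language.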
October 20, 2016 at 7:50 am
Steve Jones - SSC Editor (10/19/2016)
...I would suggest that the R client is the way to experiment, preferably on a copy of data that allows someone to build scripts and determine if there is insight to be gained from a particular set of data. Build a Proof of Concept (POC), and only deploy it to a SQL Server if you find it provides value....
Regardless of technology employed, production is never the place to experiment.
Gaz
-- Stop your grinnin' and drop your linen...they're everywhere!!!
October 20, 2016 at 8:01 am
Reading the replies above, and by way of analogy, I was reminded of a deposition I gave about 40 years ago. A lady had fallen in a health spa I worked on and hurt herself terribly, and (after far too many questions) the deposing lawyer asked if there were any other trades on the job.
"Yes", I replied. "There were electricians, glaziers, plumbers and so forth."
"What sort of work did they do?" he replied.
"Well", I replied (thoroughly immersed in the schadenfreude 🙂 ), "The glaziers did glass work, the plumbers, well, they did the plumbing and..." You can imagine the other lawyers sitting there snickering under their collective breaths.
What I take away from the prior comments is that the plumbers should do plumbing, and so forth. That is, if the analyst needs raw data, give it to them; if many folks need pre-digested data, put it "out there". Make it as fast and approachable as possible and never forget it is about the client, not how smart we may be.
We should serve and be grateful others are willing to pay us to do the sort of work we so enjoy.
October 20, 2016 at 10:03 am
I put together a list of some R links here:
https://sqlserver.miraheze.org/wiki/R
October 20, 2016 at 10:50 am
Regardless of technology employed, production is never the place to experiment.
I agree totally.
October 20, 2016 at 12:48 pm
"Production" is a bit more nuanced than it used to be.
Would I endorse data scientists conducting experiments on a DB server providing critical services? NO.
Would I provide them with near real-time data and historical data from production systems? A qualified yes.
The qualification being sufficient security clearance and processes to make sure that no laws are broken and no-one steals anything.
A lot of what data scientists do fails, and that is good because they are testing hypotheses and scenarios. A success is great, but a failure is almost as good. The worst result is a 50:50 because you learn nothing.
One of the really exciting innovations I've seen is one that generates and tests thousands of scenarios and can run many in parallel. This finds strange correlations where no human would think to look.
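To make the idea concrete, here is a minimal Python sketch of scanning many column pairs for correlations with several workers running at once. It is not the tool described above: the data is synthetic (with one relationship deliberately planted), and the names `make_synthetic_columns` and `scan_pairs` and the 0.8 threshold are assumptions for illustration. Threads are used to keep the example portable; for CPU-heavy scenarios a process pool would normally be the better fit.

```python
import itertools
import math
import random
from concurrent.futures import ThreadPoolExecutor


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / math.sqrt(vx * vy)


def make_synthetic_columns(n_cols=8, n_rows=200, seed=42):
    """Build hypothetical metrics: random noise plus one planted relationship."""
    rng = random.Random(seed)
    cols = {
        f"metric_{i}": [rng.gauss(0, 1) for _ in range(n_rows)]
        for i in range(n_cols)
    }
    # Make the last column depend on the first so there is something to find.
    cols[f"metric_{n_cols - 1}"] = [
        2 * x + rng.gauss(0, 0.1) for x in cols["metric_0"]
    ]
    return cols


def scan_pairs(columns, threshold=0.8, workers=4):
    """Score every pair of columns concurrently; return the strong ones."""
    pairs = list(itertools.combinations(sorted(columns), 2))

    def score(pair):
        a, b = pair
        return a, b, pearson(columns[a], columns[b])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(score, pairs))
    return [r for r in results if abs(r[2]) >= threshold]


if __name__ == "__main__":
    for a, b, r in scan_pairs(make_synthetic_columns()):
        print(f"{a} ~ {b}: r = {r:.3f}")
```

One caveat with this kind of scan: the more pairs you test, the more likely some will look correlated by chance, so anything it flags is a hypothesis to check, not a finding.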
On big retail websites, a test using data science datasets only needs to run for a few minutes to determine its impact. At the extreme end, Amazon only need a feature switched on for a few seconds.
The challenge is in architecting a system that allows such experimentation in safety. This experimentation takes place in production because it needs real customer interaction to determine success/failure.
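For anyone wondering how a short production test can "determine success/failure", one common way to judge it is a two-proportion z-test on conversion counts. The sketch below is an assumption about how such a check might look, not a description of what any particular retailer does; the function name and the figures in the usage example are hypothetical.

```python
import math


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of control (a) and variant (b).

    Returns (z, two_sided_p_value) using the pooled standard error.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in the data; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: 10,000 visitors per arm.
    z, p = two_proportion_z(conv_a=500, n_a=10_000, conv_b=600, n_b=10_000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

The higher the traffic, the faster the counts become large enough for a difference to stand out, which is why a very busy site can get away with very short tests.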