Help with Counts

  • SQLKnowItAll (1/17/2012)


    L' Eomot Inversé (1/17/2012)a set of datakey-landline pairs such that each landline and each datakey occurs only once

    As far as I can tell, this is an impossible statement without saying WHICH comes first.. Eliminate duplicate landlines when one is found and then duplicate datakeys, or the other way around. It seems that the request is not set based, but order based. However, I still don't know what the order is...

    Tom nailed it some posts ago when he mentioned 'constrained maximisations'. Thanks Tom, I didn't know the problem had a name. The query I posted last, dealing with landlines only, does work so far (i.e. with a large dataset and the "current spec") but may require a third iteration (or even fourth) to eliminate residuals on really large data sets.

    Since data will be eliminated by design, and there's a choice of data to eliminate, then of course a part of the spec is missing. Priority must be given to either datakey or landline, and also to sort order of each.

    My query doesn't appear to give a priority to either datakey or landline, instead simply giving priority to lowest over highest for both columns together. It may not be a valid solution. If it isn't, then I hope it helps others to understand the problem and arrive at a valid one.


    [font="Arial"]Low-hanging fruit picker and defender of the moggies[/font]

    For better assistance in answering your questions, please read this[/url].


    Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

    Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

  • bicky1980 (1/17/2012)


    SQLKnowItAll (1/17/2012)


    L' Eomot Inversé (1/17/2012)a set of datakey-landline pairs such that each landline and each datakey occurs only once

    As far as I can tell, this is an impossible statement without saying WHICH comes first.. Eliminate duplicate landlines when one is found and then duplicate datakeys, or the other way around. It seems that the request is not set based, but order based. However, I still don't know what the order is...

    The required order is dedupe datakey first, followed by landline.

    Bicky, can you demonstrate with some sample data so everyone clearly understands? Cheers.


    [font="Arial"]Low-hanging fruit picker and defender of the moggies[/font]

    For better assistance in answering your questions, please read this[/url].


    Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

    Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

  • ChrisM@home (1/17/2012)


    bicky1980 (1/17/2012)


    SQLKnowItAll (1/17/2012)


    L' Eomot Inversé (1/17/2012)a set of datakey-landline pairs such that each landline and each datakey occurs only once

    As far as I can tell, this is an impossible statement without saying WHICH comes first.. Eliminate duplicate landlines when one is found and then duplicate datakeys, or the other way around. It seems that the request is not set based, but order based. However, I still don't know what the order is...

    The required order is dedupe datakey first, followed by landline.

    Bicky, can you demonstrate with some sample data so everyone clearly understands? Cheers.

    What do you need me to demonstrate? Getting confused now with all the different answers. 🙂 do you just need a dataset to work with?

  • bicky1980 (1/17/2012)


    ChrisM@home (1/17/2012)


    bicky1980 (1/17/2012)


    SQLKnowItAll (1/17/2012)


    L' Eomot Inversé (1/17/2012)a set of datakey-landline pairs such that each landline and each datakey occurs only once

    As far as I can tell, this is an impossible statement without saying WHICH comes first.. Eliminate duplicate landlines when one is found and then duplicate datakeys, or the other way around. It seems that the request is not set based, but order based. However, I still don't know what the order is...

    The required order is dedupe datakey first, followed by landline.

    Bicky, can you demonstrate with some sample data so everyone clearly understands? Cheers.

    What do you need me to demonstrate? Getting confused now with all the different answers. 🙂 do you just need a dataset to work with?

    I think we really need the rules. They have not been clearly defined. Let me see if I can give what I understand (for landlines) and if it is correct, great. If not, please try to adjust using my framework, this is not necessarily the best approach in SQL, but the easiest way to understand it:

    1. Insert ALL data into a staging table

    2. Find all datakeys that only have 1 unique landline in the staging table and insert the datakey and landline into a temp table

    3. delete any duplicate landlines from the temp table

    4. join the staging table on the temp table and delete any landlines from it that exist in the temp table including the data that moved to the temp table

    5. Insert the data from the temp table BACK into the staging table.

    6. Repeat the process until no duplicate landlines

    Although this may not be the best procedural solution in SQL... Does it fundamentally describe the business rules? I couldn't come up with a better stepwise way to illustrate my understanding of the rules 🙂

    EDIT: So we are systematically eliminating duplicate landlines ensuring that datakeys that only have 1 unique landline get precedence for the count over datakeys that have more than 1 unique landline.

    Jared
    CE - Microsoft

  • SQLKnowItAll (1/17/2012)


    bicky1980 (1/17/2012)


    ChrisM@home (1/17/2012)


    bicky1980 (1/17/2012)


    SQLKnowItAll (1/17/2012)


    L' Eomot Inversé (1/17/2012)a set of datakey-landline pairs such that each landline and each datakey occurs only once

    As far as I can tell, this is an impossible statement without saying WHICH comes first.. Eliminate duplicate landlines when one is found and then duplicate datakeys, or the other way around. It seems that the request is not set based, but order based. However, I still don't know what the order is...

    The required order is dedupe datakey first, followed by landline.

    Bicky, can you demonstrate with some sample data so everyone clearly understands? Cheers.

    What do you need me to demonstrate? Getting confused now with all the different answers. 🙂 do you just need a dataset to work with?

    I think we really need the rules. They have not been clearly defined. Let me see if I can give what I understand (for landlines) and if it is correct, great. If not, please try to adjust using my framework, this is not necessarily the best approach in SQL, but the easiest way to understand it:

    1. Insert ALL data into a staging table

    2. Find all datakeys that only have 1 unique landline in the staging table and insert the datakey and landline into a temp table

    3. delete any duplicate landlines from the temp table

    4. join the staging table on the temp table and delete any landlines from it that exist in the temp table including the data that moved to the temp table

    5. Insert the data from the temp table BACK into the staging table.

    6. Repeat the process until no duplicate landlines

    Although this may not be the best procedural solution in SQL... Does it fundamentally describe the business rules? I couldn't come up with a better stepwise way to illustrate my understanding of the rules 🙂

    If the data could be processed in the following manner that would be great

    1. Extract all the data with a landline

    2. Dedupe the data on datakey - keep only one record (e.g. 3 duplicate datakeys would become just one datakey)

    3. Dedupe the data on landline - same rule as above

  • Here's another attempt at making some progress towards a precise statement or the requirement.

    I'm going to talk about the landlinecount only, because the datakey count is a trivial one (just count distinct) and the mobile and email counts follow the same principles as the landline count.

    The number delivered as the landline count is the cardinality of some set A. We have the following constraints on A:

    C1) A is a subset of the projection of the original table onto its datakey and landline attributes.

    C2) None of the pairs in A has the zero length string as it's landline element.

    C3) No two pairs in A have the same value in their datakey element.

    C4) No two pairs in A have the same value in their landline element.

    Obviously A has to be further constrained, because in some sense it needs to "cover" the original data (otherwise we could just forget about the data and set the landline count to zero). There are only two possible constraints on A that I can imagine being useful and/or acceptable (because other possible constraints imply constraints on the original data and we don't know what such constraints, if any, are permissable).

    P1) A has the highest cardinality of any set satisfying C1,C2,C3 and C4.

    P2) A is maximal in the sense that it has no proper superset which satisfies C1,C2,C3,C4.

    P3) A is minimal in the sense that it has the smallest cardinality of any set satisfying C1,C2,C3,C4 and P2.

    P1 gives us a costly maximisation problem; it entails getting the cardinality of every set that satisfies C1,C2,C3 and C4 (or at least of every one of them that satisfies P2).

    P3 gives us a costly minimisation problem; it entails finding the cardinalities of all the sets satisfying the C1-C4 and P2 and taking the minimum.

    P2 seems less difficult, it only needs to find one maximal set; but depending on the algorithms used a change to metadata might result in a different maximal set being found for the same data, and Bicky has stated that that would be unacceptable; so P2 has to come with a constraint that the algorithms employed must not change results if metadata changes.

    Is there some other constraint than P1, P2, or P3 that would lead to the "right" value for the landline count?

    Are the conditions C1,C2,C3 and C4 the correct ones to start with? They are what I've understood Bicky to say, but I could be wrong.

    Does anyone have an idea for a decent algorithm to work with P2? Obviously there is a recursive (or iterative instead, of course) construction for such a set, using subcounts plus either the native ordering of strings or of the attribute value to make choices where needed, and it's fairly easy to make it reasonably close to P1 by using subcounts to drive it wherever possible (Bicky's reply to an earlier post suggests that he'd prefer to be closer to P1, or even actually there, if it can be done reasonably) without of course any guarantee that P1 is actually reached, but I'm worried about performance using recursive CTEs or an iterative loop. If we knew the maximum number of landlines that could be associated with a single datakey we would know a bound on the depth of recursion or number of times round the iterative loop and could potentially unroll it if the number is low enough. Does anyone know a more direct al;gorithm?

    Does anyone know how to go for P1 with certainty of getting there? If there's a good efficient algorithm that can be expressed tidily in a relational manner that would probably be the best way to go?

    Tom

  • I think this whole thing may be getting massively overcomplicated by the math.

    Is the actual business requirement simply a count of how many distinct datakeys there are, and a count of how many distinct landline-datakey pairs there are, or did some manager/executive somewhere actually request something this mathematically complicated?

    What is the actual statement of the exact business need? As written/stated by the person who will actually be using this data, not summarized or anything like that.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • If the data could be processed in the following manner that would be great

    1. Extract all the data with a landline

    2. Dedupe the data on datakey - keep only one record (e.g. 3 duplicate datakeys would become just one datakey)

    3. Dedupe the data on landline - same rule as above

    This misses which record to keep in step 2. That is the big issue here... Which record (if any) get retained for comparison on landline from datakey? Clearly, that matters. I believe it is based on maximizing the count; i.e. if a data key only has 1 unique landline, do not count that landline in the determination of other datakey counts. Right?

    EDIT: If the above is true, then a count of distinct landlines where landline meets the acceptable criteria for length is the same thing.

    Jared
    CE - Microsoft

  • SQLKnowItAll (1/17/2012)


    If the data could be processed in the following manner that would be great

    1. Extract all the data with a landline

    2. Dedupe the data on datakey - keep only one record (e.g. 3 duplicate datakeys would become just one datakey)

    3. Dedupe the data on landline - same rule as above

    This misses which record to keep in step 2. That is the big issue here... Which record (if any) get retained for comparison on landline from datakey? Clearly, that matters. I believe it is based on maximizing the count; i.e. if a data key only has 1 unique landline, do not count that landline in the determination of other datakey counts. Right?

    EDIT: If the above is true, then a count of distinct landlines where landline meets the acceptable criteria for length is the same thing.

    Ok, my edit was wrong because in the case where a datakey has 2 different landlines, but those 2 landlines do not appear on any other datakey... They still only count as 1. I think that is correct?

    Jared
    CE - Microsoft

  • SQLKnowItAll (1/17/2012)


    SQLKnowItAll (1/17/2012)


    If the data could be processed in the following manner that would be great

    1. Extract all the data with a landline

    2. Dedupe the data on datakey - keep only one record (e.g. 3 duplicate datakeys would become just one datakey)

    3. Dedupe the data on landline - same rule as above

    This misses which record to keep in step 2. That is the big issue here... Which record (if any) get retained for comparison on landline from datakey? Clearly, that matters. I believe it is based on maximizing the count; i.e. if a data key only has 1 unique landline, do not count that landline in the determination of other datakey counts. Right?

    EDIT: If the above is true, then a count of distinct landlines where landline meets the acceptable criteria for length is the same thing.

    Ok, my edit was wrong because in the case where a datakey has 2 different landlines, but those 2 landlines do not appear on any other datakey... They still only count as 1. I think that is correct?

    Your question, "in the case where a datakey has 2 different landlines, but those 2 landlines do not appear on any other datakey... They still only count as 1. I think that is correct?" - The answer is yes.

    Just to try and simplify my request, this is how I have been told to preform these counts in the past: (Indkey is a unique key)

    select *

    into [tablename1_landlines]

    from [tablename1]

    where landline is not null and landline!=''

    Select max(indkey) as indkey,

    max(datakey) as datakey

    into [tablename1_landlines_temp]

    from [tablename1_landlines]

    group by datakey

    delete from [tablename1_landlines]

    where indkey not in (select indkey from [tablename1_landlines_temp])

    Select max(indkey) as indkey,

    max(landline) as landline

    into [tablename1_landlines_temp2]

    from [tablename1_landlines]

    group by datakey

    delete from [tablename1_landlines]

    where indkey not in (select indkey from [tablename1_landlines_temp2])

    I know this is certainly not the best way to do these and is very slow hence why I am trying to get a faster, better solution.

  • bicky1980 (1/17/2012)


    SQLKnowItAll (1/17/2012)


    SQLKnowItAll (1/17/2012)


    If the data could be processed in the following manner that would be great

    1. Extract all the data with a landline

    2. Dedupe the data on datakey - keep only one record (e.g. 3 duplicate datakeys would become just one datakey)

    3. Dedupe the data on landline - same rule as above

    This misses which record to keep in step 2. That is the big issue here... Which record (if any) get retained for comparison on landline from datakey? Clearly, that matters. I believe it is based on maximizing the count; i.e. if a data key only has 1 unique landline, do not count that landline in the determination of other datakey counts. Right?

    EDIT: If the above is true, then a count of distinct landlines where landline meets the acceptable criteria for length is the same thing.

    Ok, my edit was wrong because in the case where a datakey has 2 different landlines, but those 2 landlines do not appear on any other datakey... They still only count as 1. I think that is correct?

    Your question, "in the case where a datakey has 2 different landlines, but those 2 landlines do not appear on any other datakey... They still only count as 1. I think that is correct?" - The answer is yes.

    Just to try and simplify my request, this is how I have been told to preform these counts in the past: (Indkey is a unique key)

    select *

    into [tablename1_landlines]

    from [tablename1]

    where landline is not null and landline!=''

    Select max(indkey) as indkey,

    max(datakey) as datakey

    into [tablename1_landlines_temp]

    from [tablename1_landlines]

    group by datakey

    delete from [tablename1_landlines]

    where indkey not in (select indkey from [tablename1_landlines_temp])

    Select max(indkey) as indkey,

    max(landline) as landline

    into [tablename1_landlines_temp2]

    from [tablename1_landlines]

    group by datakey

    delete from [tablename1_landlines]

    where indkey not in (select indkey from [tablename1_landlines_temp2])

    I know this is certainly not the best way to do these and is very slow hence why I am trying to get a faster, better solution.

    This will also give you inconsistent results that do not follow any concrete rules. This is arbitrary. Kind of scary actually.

    Jared
    CE - Microsoft

  • SQLKnowItAll (1/17/2012)


    I think we really need the rules. They have not been clearly defined. Let me see if I can give what I understand (for landlines) and if it is correct, great. If not, please try to adjust using my framework, this is not necessarily the best approach in SQL, but the easiest way to understand it:

    1. Insert ALL data into a staging table

    2. Find all datakeys that only have 1 unique landline in the staging table and insert the datakey and landline into a temp table

    3. delete any duplicate landlines from the temp table

    4. join the staging table on the temp table and delete any landlines from it that exist in the temp table including the data that moved to the temp table

    5. Insert the data from the temp table BACK into the staging table.

    6. Repeat the process until no duplicate landlines

    Although this may not be the best procedural solution in SQL... Does it fundamentally describe the business rules? I couldn't come up with a better stepwise way to illustrate my understanding of the rules 🙂

    EDIT: So we are systematically eliminating duplicate landlines ensuring that datakeys that only have 1 unique landline get precedence for the count over datakeys that have more than 1 unique landline.

    If that would work it would be a great solution. Unfortunately it's very easy to find data on which it won't work - for example if your initial data has only datakeys that have two or more landlines each, you never eliminate any rows; and even if it doesn't break before it starts each time round the loop risks delivering, at the end of step 4 before the temp table rows have been added back, a staging table which at that stage has no datakeys with one landline, so after than you never elimate another row. But although it doesn't work, it's an excellent step on the way to a solution.

    You can improve the algorithm's chances by having two different kinds of step: the one that looks for datakeys with only one landline, and another that looks for landlines with only one datakey and deals with them, and alternating those steps. This actually a good way to start fixing the non-termination problem, because although it doesn't fix it it does ensure that in general you hit the "there are no singletons so I must do something else" situation less often than if you don't do this, and the code for handling that situation is going to be messy.

    If you adapt it so that it operates (when there are neither datakeys with only one landline nor landlines with only one datakey) not specifically on datakeys with a single landline but on datakeys with whatever number of landlines the datakey with least remaining landlines has it will work. You then have to choose a good algrithm to deal with the case where that number isn't 1 (perhaps start by looking for landlines that occur only once in the temp table - but once again you have to be aware that there may not be any). Perhaps you might have a similar thing for landlines with the minimum number of datakeys, and pick one of these two things according to which was the type of main step which detected the first of a consecutive pair of "no singleton" conditions. I don't know which is better.

    Also, you have to decide you algorithm for step 3 (choose to keep the record with the lowest datakey for this landline, perhaps? I imagine it doesn't much matter what it is, as long as it's easy to specify and doesn't cost an arm and a leg to compute).

    Tom

  • L' Eomot Inversé (1/17/2012)


    Here's another attempt at making some progress towards a precise statement or the requirement....

    Tom, how does the query I posted here fit in with all this? Be nice, I majored in biochemistry/microbiology, not maths.

    BTW Bicky's not necessarily a "he" - could be Rebecca from Australia 😉


    [font="Arial"]Low-hanging fruit picker and defender of the moggies[/font]

    For better assistance in answering your questions, please read this[/url].


    Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

    Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

  • GSquared (1/17/2012)


    I think this whole thing may be getting massively overcomplicated by the math.

    Is the actual business requirement simply a count of how many distinct datakeys there are, and a count of how many distinct landline-datakey pairs there are, or did some manager/executive somewhere actually request something this mathematically complicated?

    What is the actual statement of the exact business need? As written/stated by the person who will actually be using this data, not summarized or anything like that.

    I'm sure you are right, that the way it's been described to us is vastly overcomplicated. Now that Bicky has posted the code he's trying to replace we can maybe work out what it actually does and take that as a definite requirement - because the motivation for replacing it appears to be to get it to go faster, not to change the functionality.

    On "by the math" I disagree with you: math is a way of providing clear and unequivocal specifications of requirements; one for of math is to present an algorithm (which Bicky has now done, and which Jared made an interesting attempt at, no doubt provoking that response from Bicky). What has complicated this has been the utter lack of clarity in the requirement (which Bicky's most recent post has probably fixed).

    Tom

  • L' Eomot Inversé (1/17/2012)


    GSquared (1/17/2012)


    I think this whole thing may be getting massively overcomplicated by the math.

    Is the actual business requirement simply a count of how many distinct datakeys there are, and a count of how many distinct landline-datakey pairs there are, or did some manager/executive somewhere actually request something this mathematically complicated?

    What is the actual statement of the exact business need? As written/stated by the person who will actually be using this data, not summarized or anything like that.

    I'm sure you are right, that the way it's been described to us is vastly overcomplicated. Now that Bicky has posted the code he's trying to replace we can maybe work out what it actually does and take that as a definite requirement - because the motivation for replacing it appears to be to get it to go faster, not to change the functionality.

    On "by the math" I disagree with you: math is a way of providing clear and unequivocal specifications of requirements; one for of math is to present an algorithm (which Bicky has now done, and which Jared made an interesting attempt at, no doubt provoking that response from Bicky). What has complicated this has been the utter lack of clarity in the requirement (which Bicky's most recent post has probably fixed).

    I did try explaining, to me, my initial request still makes sense 😉 btw I am male

Viewing 15 posts - 91 through 105 (of 120 total)

You must be logged in to reply to this topic. Login to reply