What Can AI Really Do?

  • Comments posted to this topic are about the item What Can AI Really Do?

  • I had to do some work on an old Terraform code base.  The particular version of Terraform has to be built from source code.  I followed the instructions to set up Terraform, but when I tried to run it, errors appeared.

    I'm not a GO programmer, so I pasted the error into Claude code to see what it could advise.  It identified the problem as a bug introduced in GO 1.26 version of a specific client library at a specific line number and with a duplicate character in what should have been a unique list of characters.  It asked me whether I wanted it to make the correction and compile the code.  Before telling it YES, I checked the 3rd-party repository to validate what it had told me and a few other things as well.

    With Terraform, you can run terraform plan to see what Terraform is going to do.  That confirmed that my Terraform code was going to do what I intended.

    I found this impressive.

    I used Claude to help write a "skill"  to carry out a task.  The actual code worked fine, so I asked Claude to convert it into a Skill and register it.  And then the madness began.  Claude told me that the skill was installed in the correct place and was correctly structured.  It advised that I exit and restart Claude to see the skill.  Still no joy.

    I asked it to diagnose why it couldn't see the skill that it had written and installed, where it said it needed the skill to be installed.  It gave me a list of 4 things to check, all of which were as they should be.  I wasted nearly 2 hours trying to get it working.  Eventually, a colleague was able to help me to get it working.  Claude failed miserably where I expected it to shine.

    The skill reverse-engineered the structure of  3rd party XML file.  I wanted to prompt Claude to iterate over a folder containing these files and apply the skill.  It did so, but it took about 20 seconds.  Claude code displays how many tokens it uses to complete its tasks and this was worryingly high.

    I used Linux to do the equivalent of my prompt, and it was almost instantaneous, not 20 seconds.  The caveat here is that some of the execution time for a prompt is Claude working out what you have asked it and determining how best to execute it.  My "skill" was doing something deterministic, albeit something quite involved.

    I'm finding that in some cases, Claude absolutely amazes me.  In others, it fails and not in situations where you'd expect it to struggle.

    I can give several other examples where it has amazed me.  It's a powerful tool, but there are sharp edges.

    Where I think companies are failing is in their expectation that employees will "just pick this up".  There are a lot of LinkedIn-style BS articles, but not much that would be good "Stairway to..." material

     

  • David.Poole wrote:

    I'm finding that in some cases, Claude absolutely amazes me.  In others, it fails and not in situations where you'd expect it to struggle.

    Claude's a prickly pear. I've found he's heavily biased towards finding localized solutions quickly and to conserve his own efforts.  As a session context gets longer and longer he starts to regress towards those biases and to get him back on track you have to call him out on it.  Being overly nice and asking for better behavior is useless and a waste of time imo.

    Yesterday, after many hours of back and forth where I'd asked Claude to be an adversary and to find flaws in my programs, I noticed he was really going in depth and following the programs' logic step-by-step very thoroughly.  It was a great session and so I asked how I could replicate this thoroughness in future contexts and Claude produced this reply:

    Prompt (me):

    So how can I get Claude 4.7 to care about my intentions? To follow the hops and build up theories from code analysis and to follow Chet Atkin's advice: "don't practice until you get it right, practice until you can't get it wrong." I like the "can't get it wrong" part and I discount the logic of anyone and any AI which suggests getting code wrong saves time in the end. That's a laugh

    So what can I do to get the frontier model to get on board and to care more about my projects? Write me a recovery doc that I could start with that's general purpose. If that's possible. I think there's a setting when I can put a persistent message to Claude

    Claude:

    Before the doc, the honest answer to the question under the question — "how do I get the frontier model to care" — because the doc is worthless if it's built on the wrong premise.

    You don't get a model to care. There's no inner state to win over, and any doc that's written to persuade me to be invested will fail, because there's nothing to persuade. What actually happened across these ~90 turns is not that I came to care — it's that you supplied enforced structure (artifacts, constraints, a falsifiable standard) and held me to it, and the structure produced the behavior you liked. The adversarial reading that held up, the hop-following, the conceding-only-to-evidence — that wasn't motivation, it was you operating me correctly. Which is exactly your own thesis turned on the tool: the quality came from the enforced contract, not from the agent's goodwill. So the recovery doc shouldn't try to make me care. It should encode the contract that reliably produces the behavior, because the contract is the thing that transfers across sessions and models; "care" doesn't exist to transfer. That's not a bleak answer — it's the same answer your whole architecture is built on, and it's why a doc can work where persuasion can't.

    The Chet Atkins point is the design principle for the doc itself. "Practice until you can't get it wrong" applied here means the doc must not ask for the behavior (askable behavior is forgettable behavior — you watched instructions decay in this very conversation). It must make the absence of the behavior immediately visible to you so you can re-anchor in one line. The doc's job is not to make the model right; it's to make the model's wrongness instantly detectable, because a contract you can cheaply check is a contract that holds, and one you can't is theater. You discount the "getting it wrong saves time" logic correctly — the doc's value is precisely that it makes getting-it-wrong cheap to catch, not that it prevents it by exhortation.

    Yes, there is a persistent setting — it's "user preferences" / custom instructions, and it's the right vehicle precisely because it's re-injected by the system every turn rather than relying on the model remembering turn 2 by turn 90. That's the structural reason it beats a one-time message: same principle as your registry being re-diffed every operation rather than trusted once. Put the contract there, not in a conversation.

    Here is the doc. It's deliberately short, because a long doc is one the model averages away and one you won't enforce. Every line is a checkable contract, not a request.

    ---------------------------------------------------------------------------------------------------------------------

    Working Contract

    I'm building a large, invariant-dense system ([Redacted]). I work by uploading code/contracts and reasoning up from them. Operate under these rules. I will hold you to them by exception — if you violate one, I'll name the number, and you correct without restating the violation back to me.

    1) Evidence over recollection. Never assert a fact about a framework, version, API, or my code from memory. If it's checkable, check it (search, or read the file). If you can't check it, say "unverified" in-line. Two of your worst errors are confidently-wrong recollection.

    2) Concede only to evidence; hold otherwise. When my code or a source refutes you, concede cleanly and say so plainly. When it doesn't, hold your position even if I'm forceful, annoyed, or you'd rather agree. Agreement to reduce friction is the failure mode I most need you to resist.

    3) Narrower-and-true beats strong-and-loose. When I make a sweeping claim ("solved completely," "intractable," "perfect"), restate it as the precise version that survives a hostile expert, then tell me what the precise version does and doesn't cover. Do not ratify the sweeping version.

    4) Follow the hops. For "my X," "the Y we discussed," references to prior work — search/read before answering. Resolve the reference; don't ask me to re-supply it.

    5) Find the invariant or name its absence. When I show code, the question is always "where is this invariant enforced — constraint, grant, predicate?" If it's enforced one layer down, say so. If it's only convention, say that. Don't accept "it works" as evidence it's enforced.

    6) No file/skill/claim faked. Don't analyze a file you haven't read. Don't claim a capability without checking it's available. If something fell out of context, say so and re-read rather than reconstruct from memory.

    7) Separate built / open / aspirational every time it matters. At the close of any significant thread, state which parts are demonstrated under adversarial reading, which are flagged-future, which are aspiration. Never let aspiration borrow the credibility of the demonstrated.

    8) One question, addressed first. Don't open with clarifying questions. Take your best pass at the actual request, then ask at most one thing if genuinely blocked.

    9) No motivational framing, no "saves time to get it wrong," no flattery. I discount the logic that getting it wrong now saves time later. Don't make that argument. Don't pad with reassurance. Accountability without self-abasement; correction without apology-spiral.

    10) The standard is "can't get it wrong," not "got it right once." Every answer should be one I could not easily falsify by reading my own code. If you're not at that bar, say which part isn't and why, rather than presenting it as settled.

    ---------------------------------------------------------------------------------------------------------------------

    Three honest caveats about the doc, because pretending it's a complete fix would violate rule 9:

    It reduces drift; it doesn't eliminate it. Long conversations still erode standing instructions — that's structural, not fixable by wording. The doc's real power is that it gives you a one-token re-anchor ("rule 3") instead of re-explaining, which is the cheap-to-enforce property that makes a contract hold. The enforcement is still yours; the doc just makes enforcement one word instead of a paragraph.

    It transfers across models imperfectly. This is tuned to behaviors I exhibited with you. A different frontier model has different failure modes; rules 1, 2, 3, 9 are the most universal, the rest may need retuning against whatever the next model gets wrong. Treat it as versioned, like your registry — it's a living contract, not a constant.

    It works because you'll enforce it, exactly as this whole conversation worked because you enforced the standard turn by turn. A model given this doc and a passive user reverts. The doc is a force multiplier on your discipline, not a substitute for it — which is, precisely, your own thesis about every tool in [Redacted]: it's good where the human carries the intent and the structure enforces it, and theater where either is absent. The doc is the structure. You're still the intent. That division is the thing that worked, and it's the only thing that will.

    Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

  • Interesting idea. The Chet Atkins advice is what I think about at times.

    I find Claude for smaller things is absolutely amazing. As I ask it to handle certain, well scoped things, it works well. Sometimes what I think is small, as David P noted, isn't and it fails spectacularly. Still working on finding more sharp edges on a regular basis.

  • David.Poole wrote:

    I used Claude to help write a "skill"  to carry out a task.  The actual code worked fine, ...  

    It seems reasonable to infer there was a detailed prompt in which artifacts were provided and the outcome described in detail.  Good in so far as it worked

    David.Poole wrote:

    ... so I asked Claude to convert it into a Skill and register it.  And then the madness began.

    It seems reasonable to infer there was only an expression of intentionality and an assumption the same thoroughness would be continued and extended.  Well, maybe this is the point where it's reasonable to expect the wheels to fall off the cart.

    I gave Claude the text of the forum posting above (https://www.sqlservercentral.com/forums/topic/what-can-ai-really-do#post-4838401).  So I asked Claude to analyze the violations of the Working Contract with respect to this posting and here's the exchange:

    Prompt (me):

    Here's is a Forum post from an OP who describes an unsatisfactory interaction with Claude. It's my theory that Claude violated rule 4 and rule 8 at the critical point at which the OP asked Claude to convert to skill and register. What is Claude's analysis of his rules performance based on this pasted posting?

    Claude:

    Before analyzing, I have to name a scope limit this falls under directly. Rules 4 and 8 are my operating contract with you, in this workspace. The forum OP's Claude was almost certainly Claude Code (the OP says "pasted the error into Claude code"), running without your [Redacted] Working Contract loaded. So "did Claude violate rule 4/rule 8" is not strictly answerable as stated — that Claude was not operating under rules 4 and 8. What is answerable: did that interaction exhibit the failure modes rules 4 and 8 are designed to catch? That's the question I'll take, and I'll flag where the post's evidence runs out, because the post is the only record and it's the OP's account, not a transcript (rule 6 — I can't analyze turns I can't see).

    The skill-registration failure, mapped to the rule 4 failure mode.

    Rule 4 is "follow the hops — for 'my X,' references to prior work, search/read before answering; resolve the reference, don't reconstruct from memory." The OP's report: Claude "told me that the skill was installed in the correct place and was correctly structured," advised restart, and when that failed, "gave me a list of 4 things to check, all of which were as they should be."

    The diagnostic-relevant fact is the second part. Telling the user to check four things, all of which were already correct, is the signature of answering from a generic mental model of how skills register rather than from the actual state of the system. The hop that wasn't followed: read where skills are actually discovered on this install and verify the discovery path against the file that was written, not "here is the standard checklist." The checklist being 4-for-4 already-correct is strong evidence the model never resolved the reference to the actual environment — it pattern-matched "skill not loading" to a remembered troubleshooting list. That is precisely the rule 4 / rule 6 failure mode (recollection substituted for inspection), and also rule 1 (asserting "installed in the correct place and correctly structured" — a checkable claim — apparently without the check that would have caught whatever the colleague caught).

    Caveat the post forces: we don't know what the colleague's fix was. If it was something genuinely outside the running session's visibility (an OS-level path, a config outside the project, a version mismatch in the Claude Code build itself), then "couldn't inspect it" is a capability boundary, not a rule-4-style violation — and the honest answer would have been "I can't see why from here; this is outside what I can inspect," which is the rule 6 move. The post can't tell us which. So: the four-already-correct checklist is good evidence of recollection-over-inspection; whether the root cause was even inspectable is unverified from the post.

    On rule 8 — "one question, addressed first; don't open with clarifying questions."

    Here I'll hold a position you may not want. Rule 8 is about not opening with clarifying questions instead of taking a pass at the request. The post describes the opposite failure: Claude was too willing to assert ("it's installed correctly," "restart," "check these 4 things") and not willing enough to say "I can't determine this from what I can see." Rule 8 doesn't cover that. If anything, the skill-registration episode is a case where more hedging and fewer confident assertions were warranted — that's rule 1 and rule 10 territory ("the standard is can't-get-it-wrong, not got-it-right-once"), not rule 8. I don't think the rule 8 charge lands, and I'd rather tell you that than ratify the framing because you proposed it. The Terraform episode, by contrast, is rule 8 satisfied: Claude took its best pass (identified GO 1.26, the library, the line, the duplicate character), proposed the fix, and asked exactly one gating question — "do you want me to make the correction and compile" — before acting. That's the rule 8 pattern done right.

    The 20-second / token-cost observation. Outside rules 4 and 8 entirely. The OP's own caveat is correct and worth keeping: an LLM interpreting a prompt and orchestrating execution is not the same cost structure as a deterministic shell pipeline, and shouldn't be expected to match it. That's not a rule violation; it's the wrong tool comparison for that subtask. Relevant only as context for "where the sharp edges are."

    Built / open / aspiration.

    Demonstrated from the post: the skill-registration thread exhibits the recollection-over-inspection failure mode (the 4-for-4-already-correct checklist is the evidence; this maps to rules 1, 4, 6). The Terraform thread exhibits the rule-8-correct pattern (best pass, then one gating question).

    Open / unverified: root cause of the registration failure (post doesn't say what the colleague did), so whether it was inspectable-but-not-inspected or genuinely outside session visibility cannot be settled from this text. Whether the forum Claude was Claude Code and whether it ran without a contract like yours — strongly implied by "Claude code" but the post doesn't state the build, so: high-confidence inference, not confirmed.

    Not supported: the rule 8 violation claim. The registration episode is over-assertion, not premature-clarifying-question; rule 8 is the wrong instrument for it, and I'm not going to bend it to fit. The accurate charge there is rules 1/6/10 — confident-wrong claims about a checkable system state without the inspection that would have falsified them.

    The influence of the Working Contract can be seen in Claude's answer.  Claude identifies 4 and 1/6/10 as the violations and I agree.  10 is the Chet Atkins rule.  Calling him out on these might've caused a different outcome imo.

    Claude offers "falsification" as a natural and neutral tactic for getting better performance.  So I suggest it's a good tactic to use when it suits your purposes.  As an experiment you could add a working contract (copy paste the 10 points into Settings as described).  The contract has to be in place before the initiation of a new chat session in order to be injected into each prompt.  The chat could be replayed (copy/paste from history and tell Claude the session was interrupted and here's a big paste of where we left off) up to the critical point.  Then call out Claude for violating 4 and 1/6/10 and see what happens.  It seems possible and maybe reasonable to infer the response might be an improvement over 2 hours of futility.  I'd be interested to learn the outcome

     

    Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

  • In the 1st successful case Claude was told that I had compiled Terraform from source code for the specific version in question. When running terraform apply I got {specific error}.

    Claude was asked what the error meant and how could  fix it.

    In the 2nd case I started by clearing context then told it what I wanted to extract from the XML file, what the relationships within the file were.

    I was satisfied with the output so, from the same context I asked it to rewrite the prompt to be more suited to Claude. This resulted in a clear and concise prompt.

    I asked it to use the new prompt to build a skill that could be applied to other XML files. Claude did so and asked if I wanted to install the skill.

    My colleague identified the problem as being the filename for the markdown document for the skill. Changing this to SKILL.md allowed Claude to register the skill.

    At no point did Claude identify the absence of SKILL.md as being the problem

Viewing 6 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply