ML Engineer Walkthrough¶

This walkthrough is for engineers who build and ship ML skills, manage experimental iterations, and need reproducibility across team members. You'll work primarily with the CLI: import research artifacts from GitHub, author a custom skill, capture experiment decisions in the memory system, and pin versions with snapshots and lock files to establish reproducible baselines.

Prerequisites¶

What you need

Python 3.9+ — check with python --version
SkillMeat CLI installed — pip install skillmeat or uv tool install skillmeat
GitHub token — Required for private repos; optional for public. Get one at github.com/settings/tokens
Claude Code active — You have a project with a .claude/ directory
Familiarity with CLI — You're comfortable with terminal commands and flags
Git — You're using Git to version your project

Verify CLI installation:

skillmeat --version

Configure your GitHub token so imports from private repos can authenticate:

skillmeat config set github-token ghp_your_token_here

Step 1: Import Artifacts from Research & GitHub¶

You have research repos or shared team artifacts on GitHub. Import them directly without copying files manually.

Import a Skill from a Research Repo¶

Research repos often live in private GitHub organizations or team accounts. Use the full source path followed by an optional @ref, which can be a commit SHA, a branch, or a tag. A commit SHA is the most reproducible choice because it never moves:

# Import from a specific commit (most reproducible)
skillmeat add skill owner/research-repo/skills/neural-ranker@abc1234def567

# Import from a branch tip (less stable, will track updates)
skillmeat add skill owner/research-repo/skills/neural-ranker@main

# Import a specific release tag
skillmeat add skill owner/research-repo/skills/neural-ranker@v2.1.0

SkillMeat validates the artifact, checks permissions, then adds it to your collection:

Fetching artifact from GitHub...
Added skill: neural-ranker
  Location: ~/.skillmeat/collection/skills/neural-ranker/
  Source: owner/research-repo/skills/neural-ranker
  Pinned to: abc1234
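
The source strings above follow an owner/repo/path@ref shape. The sketch below is an illustration of that shape only, not SkillMeat's own parser: parse_source, SourceSpec, and looks_like_sha are hypothetical names, and the 7 to 40 hex character rule is an assumption about what counts as a commit SHA. It can help you check, before importing, whether a ref is pinned to a commit or tracks a moving branch or tag.

```python
import re
from typing import NamedTuple, Optional


class SourceSpec(NamedTuple):
    owner: str
    repo: str
    path: str            # path inside the repo, may be empty
    ref: Optional[str]   # text after "@", or None when omitted


def parse_source(spec: str) -> SourceSpec:
    """Split 'owner/repo/path/to/artifact@ref' into its parts (illustrative)."""
    location, _, ref = spec.partition("@")
    parts = location.split("/")
    if len(parts) < 2 or not all(parts):
        raise ValueError(f"expected owner/repo[/path][@ref], got {spec!r}")
    return SourceSpec(parts[0], parts[1], "/".join(parts[2:]), ref or None)


def looks_like_sha(ref: Optional[str]) -> bool:
    """True if ref is 7-40 lowercase hex characters (assumed SHA shape)."""
    return bool(ref) and re.fullmatch(r"[0-9a-f]{7,40}", ref) is not None


pinned = parse_source("owner/research-repo/skills/neural-ranker@abc1234def567")
moving = parse_source("owner/research-repo/skills/neural-ranker@main")

print(pinned.path, pinned.ref, looks_like_sha(pinned.ref))
print(moving.ref, looks_like_sha(moving.ref))
```

A branch whose name happens to be all hex characters would pass this check, so treat it as a quick lint rather than a guarantee.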

Import Multiple Artifacts from the Same Repo¶

If your research repo has multiple skills in different subdirectories:

# Add each separately with SHAs for reproducibility
skillmeat add skill owner/research-repo/skills/embedder@abc1234
skillmeat add skill owner/research-repo/skills/ranker@abc1234
skillmeat add command owner/research-repo/commands/eval@abc1234

Verify Imports¶

Check what you've imported:

skillmeat list --type skill

You'll see all skills in your collection with their sources:

Artifacts (3)
┌───────────────┬───────┬──────────────────────────────────────────┐
│ Name          │ Type  │ Origin                                   │
├───────────────┼───────┼──────────────────────────────────────────┤
│ neural-ranker │ skill │ owner/research-repo/skills/neural-ranker │
│ embedder      │ skill │ owner/research-repo/skills/embedder      │
│ ranker        │ skill │ owner/research-repo/skills/ranker        │
└───────────────┴───────┴──────────────────────────────────────────┘

Step 2: Create a Custom Skill for Your ML Workflow¶

Now you'll author your own skill — a reproducible unit of ML logic (preprocessing, model evaluation, experiment tracking) that your team can reuse and version.

Skill Structure¶

A skill is a directory with metadata and implementation. Create it locally in your project:

mkdir -p .claude/skills/ml-eval-harness
cd .claude/skills/ml-eval-harness

Minimal Skill File¶

Create skill.yaml (SkillMeat metadata):

---
name: ml-eval-harness
version: 0.1.0
description: >
  Evaluation harness for ML ranking models:
  - Loads datasets from HF
  - Computes NDCG, MRR, MAP
  - Outputs JSON results
type: skill
author: you@company.com
scope: local  # Only available in this project
tags:
  - ml
  - evaluation
  - ranking
dependencies:
  - datasets  # Python package
  - scipy
---

Skill Implementation¶

Create harness.py — your actual ML code:

"""ML evaluation harness.

Supports ranking metrics: NDCG, MRR, MAP over HuggingFace datasets.
"""

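import json
import math

from datasets import load_dataset
from scipy.stats import rankdata


def load_eval_data(dataset_name: str, split: str = "test"):
    """Load evaluation dataset.

    Args:
        dataset_name: HF dataset identifier (e.g., "BeIR/nfcorpus")
        split: Dataset split ("test", "validation", etc.)

    Returns:
        Dataset whose rows are assumed to expose qid, doc_ids, and relevant
        fields; adapt the field names to your dataset's actual schema.
    """
    return load_dataset(dataset_name, split=split)


def compute_ndcg(scores: list[float], relevances: list[int], k: int = 10) -> float:
    """Normalized Discounted Cumulative Gain @ k.

    scores[i] is the model score for the document labeled relevances[i].
    """
    # Rank 1 is the highest score; negate because rankdata ranks ascending.
    ranks = rankdata([-s for s in scores], method="ordinal")
    # DCG: gain of each top-k document, discounted by log2(rank + 1).
    dcg = sum(rel / math.log2(rank + 1) for rank, rel in zip(ranks, relevances) if rank <= k)
    # IDCG: the same sum for the ideal ordering (most relevant first).
    ideal = sorted(relevances, reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0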


def evaluate_on_dataset(model_scores_file: str, dataset_name: str) -> dict:
    """Run the evaluation pipeline.

    Args:
        model_scores_file: Path to JSON with {qid: [scores...]}
        dataset_name: HF dataset for relevance labels

    Returns:
        Dict with the mean NDCG@10 and the number of queries scored
    """
    with open(model_scores_file) as f:
        model_scores = json.load(f)

    dataset = load_eval_data(dataset_name)
    ndcg_scores = []

    for sample in dataset:
        # JSON object keys are always strings, so compare on str(qid).
        qid = str(sample["qid"])
        if qid not in model_scores:
            continue

        scores = model_scores[qid]
        relevances = sample["relevant"]

        ndcg_scores.append(compute_ndcg(scores, relevances, k=10))

    # Guard against division by zero when no query ids matched.
    mean_ndcg = sum(ndcg_scores) / len(ndcg_scores) if ndcg_scores else 0.0
    return {
        "ndcg@10_avg": mean_ndcg,
        "samples": len(ndcg_scores),
    }


if __name__ == "__main__":
    result = evaluate_on_dataset("scores.json", "BeIR/nfcorpus")
    print(json.dumps(result, indent=2))
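
The harness describes MRR and MAP alongside NDCG but only computes NDCG. The sketch below shows one way the two missing per-query metrics could be written in plain Python, using the same conventions as the harness: one score per document, with a parallel list of relevance labels. The function names rank_by_score, reciprocal_rank, and average_precision are hypothetical, and any label greater than zero is assumed to mean "relevant".

```python
def rank_by_score(scores, relevances):
    """Return relevance labels reordered from highest to lowest model score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [relevances[i] for i in order]


def reciprocal_rank(scores, relevances):
    """1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, rel in enumerate(rank_by_score(scores, relevances), start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def average_precision(scores, relevances):
    """Mean of precision@rank over the ranks that hold a relevant item."""
    hits, precisions = 0, []
    for rank, rel in enumerate(rank_by_score(scores, relevances), start=1):
        if rel > 0:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Toy query: the top-scored document is irrelevant, ranks 2 and 3 are relevant.
scores = [0.2, 0.9, 0.5, 0.1]
relevances = [1, 0, 1, 0]

rr = reciprocal_rank(scores, relevances)    # first hit at rank 2
ap = average_precision(scores, relevances)  # mean of 1/2 and 2/3
print(rr, ap)
```

MRR and MAP are then the means of these per-query values across all evaluated queries, in the same way the harness averages NDCG@10.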

Deploy the Skill to Your Project¶

cd /path/to/your/project
skillmeat deploy ml-eval-harness

Verify deployment:

ls -la .claude/skills/ml-eval-harness/

Your skill is now available in your project. Import it into notebooks or CLI scripts:

import sys
sys.path.insert(0, ".claude/skills/ml-eval-harness")
from harness import evaluate_on_dataset

result = evaluate_on_dataset("scores.json", "BeIR/nfcorpus")
print(result)
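
evaluate_on_dataset expects scores.json to map each query id to a list of scores. The sketch below writes such a file and reads it back; the query ids and score values are made up for illustration. It also shows a detail worth knowing when matching ids against a dataset: JSON object keys are always strings, so an integer query id comes back as a string after a round trip.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical model output: one list of scores per query id, in the same
# order as that query's relevance labels in the evaluation dataset.
model_scores = {
    "q1": [0.91, 0.15, 0.42],
    "q2": [0.08, 0.77],
}

with tempfile.TemporaryDirectory() as tmp:
    scores_path = Path(tmp) / "scores.json"
    scores_path.write_text(json.dumps(model_scores, indent=2))
    loaded = json.loads(scores_path.read_text())

# JSON object keys are always strings: the integer id 7 is stored as "7".
round_tripped = json.loads(json.dumps({7: [0.5]}))

print(loaded["q1"])
print(list(round_tripped))
```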

Step 3: Capture Experiment Context with Memory¶

The memory system lets you record experiment decisions, hyperparameter choices, and findings so you (and your team) can recall the rationale behind decisions months later.

Capture an Experiment Decision¶

When you try a new approach and want to remember why:

skillmeat memory item create \
  --project skillmeat \
  --type decision \
  --content "Switched to BM25+neural ensemble after single-neural underperformed on precision. BM25 provides recall; neural reranks top-100. Trade-off: +0.05 NDCG@10, +15ms latency." \
  --confidence 0.9 \
  --anchor "harness.py:code:45-65" \
  --anchor ".claude/experiments/exp-001-ensemble.md:doc"

Capture a Constraint or Gotcha¶

When you discover a limitation:

skillmeat memory item create \
  --project skillmeat \
  --type gotcha \
  --content "HF BeIR/nfcorpus is sparse: ~100 test queries but a large corpus. The model must handle variable list lengths or dataset preprocessing will time out. Pre-batch queries in groups of 10." \
  --confidence 0.85 \
  --anchor "harness.py:code:25-35"

Capture Experiment Learnings¶

Record what you learned for reproducibility:

skillmeat memory item create \
  --project skillmeat \
  --type learning \
  --content "Warm-starting from published ranker checkpoint (model-v1.0) cut training time from 8h to 2h. Verify checkpoint license before using in production." \
  --confidence 0.92 \
  --anchor ".claude/experiments/exp-003-finetune.md:doc:1-50"
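
If you record memory items from a training script rather than by hand, it may help to build the command as an argument list so that quoting is handled for you. The sketch below uses only the flags shown in the commands above; memory_create_args is a hypothetical helper, and the 0 to 1 range check on confidence is an assumption based on the example values. It prints the command instead of running it.

```python
import shlex
from typing import List, Sequence


def memory_create_args(
    project: str,
    item_type: str,
    content: str,
    confidence: float,
    anchors: Sequence[str] = (),
) -> List[str]:
    """Build argv for `skillmeat memory item create` (illustrative helper)."""
    # Assumption: confidence is a fraction between 0 and 1, as in the examples.
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    args = [
        "skillmeat", "memory", "item", "create",
        "--project", project,
        "--type", item_type,
        "--content", content,
        "--confidence", str(confidence),
    ]
    # --anchor may be repeated, once per anchored file or line range.
    for anchor in anchors:
        args += ["--anchor", anchor]
    return args


args = memory_create_args(
    project="skillmeat",
    item_type="decision",
    content="Switched to BM25+neural ensemble; +0.05 NDCG@10, +15ms latency.",
    confidence=0.9,
    anchors=["harness.py:code:45-65", ".claude/experiments/exp-001-ensemble.md:doc"],
)
print(shlex.join(args))  # shell-quoted form, safe to paste into a terminal
```

To execute the command from Python, pass the same list to subprocess.run(args, check=True).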

Search Your Experiment Memory¶

Later, when you want to recall what you decided:

skillmeat memory search "ensemble" --project skillmeat

Returns matches across all memory items. Use this before running a new experiment to check if you've already tried something.

For a full overview, see Memory System Guide.

Step 4: Pin Versions with Snapshots & Lock Files¶

Reproducibility means others can re-run your experiment with identical artifact versions. Use snapshots and lock files to capture exact baselines.

Create a Snapshot Before an Experiment¶

Before running a significant experiment, snapshot your collection:

skillmeat snapshot create --name exp-001-baseline

This records every artifact version in your collection at this moment:

Snapshot created: exp-001-baseline
  Timestamp: 2026-04-20T14:30:00Z
  Artifacts: 5
  Location: ~/.skillmeat/collection/snapshots/exp-001-baseline.lock

View Your Lock File¶

The lock file is TOML and records exact versions:

cat ~/.skillmeat/collection/snapshots/exp-001-baseline.lock

Output:

[lock]
version = "1.0.0"
snapshot_name = "exp-001-baseline"
created_at = "2026-04-20T14:30:00Z"

[lock.entries.neural-ranker]
source = "owner/research-repo/skills/neural-ranker"
resolved_sha = "abc1234def5678..."
resolved_version = "v2.1.0"

[lock.entries.embedder]
source = "owner/research-repo/skills/embedder"
resolved_sha = "def5678ghi9012..."
resolved_version = "v1.8.0"

[lock.entries.ml-eval-harness]
source = ".claude/skills/ml-eval-harness"
resolved_sha = "local-abc1234"
resolved_version = "0.1.0"
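
Because the lock file is TOML, you can read it from Python to log the pinned versions next to your experiment results. The sketch below parses an abbreviated copy of the example above; pinned_entries is a hypothetical helper, and the table layout is assumed to match the sample output. tomllib ships with Python 3.11 and later; on 3.9 or 3.10 the third-party tomli package provides the same interface.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:  # Python 3.9 / 3.10
    import tomli as tomllib  # third-party backport: pip install tomli

# Abbreviated copy of the sample lock file shown above.
LOCK_TEXT = """
[lock]
version = "1.0.0"
snapshot_name = "exp-001-baseline"

[lock.entries.neural-ranker]
source = "owner/research-repo/skills/neural-ranker"
resolved_sha = "abc1234def5678..."
resolved_version = "v2.1.0"

[lock.entries.ml-eval-harness]
source = ".claude/skills/ml-eval-harness"
resolved_sha = "local-abc1234"
resolved_version = "0.1.0"
"""


def pinned_entries(lock_text: str) -> dict:
    """Map artifact name -> (resolved_version, resolved_sha)."""
    data = tomllib.loads(lock_text)
    entries = data["lock"].get("entries", {})
    return {
        name: (entry["resolved_version"], entry["resolved_sha"])
        for name, entry in entries.items()
    }


pins = pinned_entries(LOCK_TEXT)
for name, (version, sha) in sorted(pins.items()):
    print(f"{name}: {version} ({sha})")
```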

Commit the Lock File¶

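Add it to Git so teammates can reproduce. git add only accepts paths inside the current repository, so run it from the repository that contains the lock file. The commands below assume your collection directory (~/.skillmeat/collection) is itself a Git repository; if it is not, copy the lock file into your project repository first and add it there:

cd ~/.skillmeat/collection
git add snapshots/exp-001-baseline.lock
git commit -m "snapshot: exp-001-baseline before neural ensemble trial"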

Restore a Snapshot Later¶

If you want to reproduce an old experiment exactly:

skillmeat snapshot restore exp-001-baseline

All artifacts revert to the pinned versions. Your collection is now in the state it was in when you created the snapshot.
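
One way to confirm that a restore brought back the expected baseline is to compare the pinned SHAs from the baseline lock file with those of the current collection, for example by taking a fresh snapshot and reading both lock files. The sketch below shows only the comparison step on plain name-to-SHA mappings; lock_drift is a hypothetical helper and the SHA values are placeholders.

```python
from typing import Dict, Optional, Tuple


def lock_drift(
    expected: Dict[str, str], actual: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {name: (expected_sha, actual_sha)} for every mismatch.

    A missing artifact is reported with None on the side where it is absent.
    """
    drift = {}
    for name in sorted(set(expected) | set(actual)):
        want, have = expected.get(name), actual.get(name)
        if want != have:
            drift[name] = (want, have)
    return drift


# Placeholder SHAs for illustration only.
baseline = {"neural-ranker": "abc1234", "embedder": "def5678"}
current = {"neural-ranker": "abc1234", "embedder": "0123abc", "eval": "abc1234"}

for name, (want, have) in lock_drift(baseline, current).items():
    print(f"{name}: expected {want}, found {have}")
```

An empty result means every artifact matches the baseline.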

For detailed versioning and rollback workflows, see Sync, Versioning & Rollback.

Step 5: Jupyter Notebook Integration¶

Use your SkillMeat skills in Jupyter notebooks without copying code around.

Load a Skill in a Notebook¶

In a Jupyter cell, make your skills importable:

import sys

# The directory name contains hyphens, so it cannot be imported as a package.
# Put the skill directory itself on sys.path and import the module directly.
sys.path.insert(0, "./.claude/skills/ml-eval-harness")

from harness import evaluate_on_dataset, compute_ndcg

# Now use it
result = evaluate_on_dataset("scores.json", "BeIR/nfcorpus")
print(f"NDCG@10: {result['ndcg@10_avg']:.4f}")

Keep Skill Code Synced¶

Your .claude/skills/ directory is checked into Git. When you update a skill:

Edit files in .claude/skills/ml-eval-harness/
Reload the module in Jupyter:

import importlib
import harness  # needs .claude/skills/ml-eval-harness on sys.path

importlib.reload(harness)
# Call harness.evaluate_on_dataset(...) to re-run with the updated logic
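
One detail of importlib.reload is easy to miss in notebooks: names bound earlier with "from module import name" keep pointing at the old objects, while attribute access through the module sees the new code. The sketch below demonstrates this with a throwaway module written to a temporary directory; demo_skill and metric are hypothetical stand-ins for your skill module and its functions.

```python
import importlib
import sys
import tempfile
from pathlib import Path

sys.dont_write_bytecode = True  # keep the demo free of cached .pyc files

with tempfile.TemporaryDirectory() as tmp:
    module_path = Path(tmp) / "demo_skill.py"
    module_path.write_text("def metric():\n    return 'v1'\n")
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()  # make sure the new file is discovered

    import demo_skill
    from demo_skill import metric as bound_metric  # bound to the v1 function

    # Simulate editing the skill on disk, then reload it.
    module_path.write_text("def metric():\n    return 'version-2'\n")
    importlib.reload(demo_skill)

    fresh = demo_skill.metric()  # looked up on the reloaded module
    stale = bound_metric()       # still the function object from before
    sys.path.remove(tmp)

print(fresh, stale)
```

After a reload, either call functions through the module object or re-run the "from ... import ..." cell so the notebook picks up the new definitions.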

To share the skill with your team, export it from your local project to the shared collection:

# Export the skill to collection (not just project)
skillmeat export ml-eval-harness --to-collection

# Push to shared repo (if you're maintaining a shared collection)
cd ~/.skillmeat/collection/skills/ml-eval-harness
git push origin main

Teammates can then import it:

skillmeat add skill yourorg/shared-collection/skills/ml-eval-harness@latest

Jupyter Best Practices¶

Keep skill logic in .claude/skills/ — notebooks are for experiments, skills are for reproducibility
Version skills via SkillMeat snapshots — not manual notebook saves
Use memory to capture notebook discoveries — record findings that inform skill design
Export analysis notebooks separately — store notebooks that use skills in a notebooks/ directory, not in skills

Verification Checklist¶

[ ] Imported at least one artifact from GitHub with a pinned SHA
[ ] Created a custom skill with skill.yaml and implementation code
[ ] Deployed the skill to your project (visible in .claude/skills/)
[ ] Captured ≥2 memory items (decision, gotcha, or learning) using skillmeat memory item create
[ ] Created a snapshot of your collection before an experiment
[ ] Located the lock file and verified it contains pinned versions
[ ] Loaded a skill in a Jupyter notebook without copying code
[ ] Verified teammates can import your shared skill from collection

What You've Accomplished¶

✅ Imported research artifacts from GitHub with reproducible SHAs
✅ Authored a custom skill with versioning and dependencies
✅ Captured experiment decisions in the memory system
✅ Pinned reproducible baselines via snapshots and lock files
✅ Integrated skills into Jupyter for interactive analysis

Your ML workflow is now:

Reproducible: Lock files ensure exact versions for every experiment
Documented: Memory system captures decision rationale
Shareable: Skills can be published to team collection
Auditable: Git tracks all artifacts and decisions

Next Steps¶

Deepen memory usage — See Memory System Guide for advanced workflows (context packs for experiment bundles, search across team memory)
Automate with programmatic API — Use SkillMeat's API to trigger experiments and skill deployments from your CI/CD — see Programmatic API Usage Guide
Team synchronization — If you're building a shared skill library, see Adding Artifacts for publication workflows and Sync, Versioning & Rollback for keeping team baselines consistent
Fine-tune for your domain — Extend the ML evaluation skill with your custom metrics or model types

Getting Help¶

skillmeat memory --help — Memory system commands
skillmeat snapshot --help — Snapshot operations
skillmeat add --help — Importing from GitHub
skillmeat deploy --help — Deploying skills to projects
Check the CLI Reference for detailed command documentation

Time to complete: ~30 minutes
Complexity: Advanced
Prerequisites met: ✅
Target audience: ML engineers, research teams, skill authors, power users