ML Engineer Walkthrough
This walkthrough is for engineers who build and ship ML skills, manage experimental iterations, and need reproducibility across team members. You'll work primarily with the CLI: import research artifacts from GitHub, author a custom skill, use the memory system to capture experiment decisions, and pin versions with lock files to establish reproducible baselines.
Prerequisites
What you need
- Python 3.9+: check with `python --version`
- SkillMeat CLI installed: `pip install skillmeat` or `uv tool install skillmeat`
- GitHub token: required for private repos; optional for public. Get one at github.com/settings/tokens
- Claude Code active: you have a project with a `.claude/` directory
- Familiarity with CLI: you're comfortable with terminal commands and flags
- Git: you're using Git to version your project
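The prerequisites above can also be checked from Python before you start. This is a minimal sketch: `check_prerequisites` is a hypothetical helper, not part of SkillMeat, and it only tests that executables named `skillmeat` and `git` are on your PATH and that a `.claude/` directory exists under the given project root.

```python
import shutil
import sys
from pathlib import Path


def check_prerequisites(project_root: str = ".") -> dict[str, bool]:
    """Return a pass/fail flag for each prerequisite (hypothetical helper)."""
    return {
        "python>=3.9": sys.version_info >= (3, 9),
        "skillmeat on PATH": shutil.which("skillmeat") is not None,
        "git on PATH": shutil.which("git") is not None,
        ".claude/ directory": (Path(project_root) / ".claude").is_dir(),
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'ok     ' if ok else 'MISSING'}  {name}")
```

It does not check the GitHub token, since how the token is stored depends on your SkillMeat configuration.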
Verify CLI installation:
Configure your GitHub token for reproducible imports:
Step 1: Import Artifacts from Research & GitHub
You have research repos or shared team artifacts on GitHub. Import them directly without copying files manually.
Import a Skill from a Research Repo
Research repos often live in private GitHub organizations or team accounts. Use the full source path with an optional SHA to lock to a specific commit:
# Import from a specific commit (most reproducible)
skillmeat add skill owner/research-repo/skills/neural-ranker@abc1234def567
# Import from a branch tip (less stable, will track updates)
skillmeat add skill owner/research-repo/skills/neural-ranker@main
# Import a specific release tag
skillmeat add skill owner/research-repo/skills/neural-ranker@v2.1.0
SkillMeat validates the artifact, checks permissions, then adds it to your collection:
Fetching artifact from GitHub...
Added skill: neural-ranker
Location: ~/.skillmeat/collection/skills/neural-ranker/
Source: owner/research-repo/skills/neural-ranker
Pinned to: abc1234
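The source strings used above follow the shape `owner/repo/path@ref`. The sketch below splits such a string into its parts and guesses whether the ref is a commit SHA. Both `parse_source_spec` and `looks_like_commit_sha` are hypothetical helpers written for illustration; they are not SkillMeat APIs, and the SHA check is only a heuristic (a branch named with hex characters would also pass).

```python
import re
from typing import NamedTuple, Optional


class SourceSpec(NamedTuple):
    owner: str
    repo: str
    path: str            # path inside the repo, may be empty
    ref: Optional[str]   # commit SHA, branch, or tag; None if omitted


def parse_source_spec(spec: str) -> SourceSpec:
    """Split 'owner/repo/path/to/artifact@ref' into its parts."""
    location, _, ref = spec.partition("@")
    parts = location.split("/")
    if len(parts) < 2 or not all(parts):
        raise ValueError(f"expected owner/repo[/path][@ref], got {spec!r}")
    owner, repo, *path = parts
    return SourceSpec(owner, repo, "/".join(path), ref or None)


def looks_like_commit_sha(ref: Optional[str]) -> bool:
    """Heuristic: 7 to 40 lowercase hex characters."""
    return bool(ref) and re.fullmatch(r"[0-9a-f]{7,40}", ref) is not None


if __name__ == "__main__":
    spec = parse_source_spec("owner/research-repo/skills/neural-ranker@abc1234def567")
    print(spec)
    print("pinned to a commit:", looks_like_commit_sha(spec.ref))
```

A check like this could flag imports that use a branch name where a pinned commit was intended.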
Import Multiple Artifacts from the Same Repo
If your research repo has multiple skills in different subdirectories:
# Add each separately with SHAs for reproducibility
skillmeat add skill owner/research-repo/skills/embedder@abc1234
skillmeat add skill owner/research-repo/skills/ranker@abc1234
skillmeat add command owner/research-repo/commands/eval@abc1234
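When several artifacts must share one commit, generating the commands from a single SHA avoids typos. The following sketch only builds and prints the same `skillmeat add` invocations shown above; `build_add_commands` is a hypothetical helper, and nothing is executed.

```python
import shlex

PINNED_SHA = "abc1234"  # the commit used in the commands above

# (artifact type, path inside the repo) pairs taken from the example above
ARTIFACTS = [
    ("skill", "skills/embedder"),
    ("skill", "skills/ranker"),
    ("command", "commands/eval"),
]


def build_add_commands(
    repo: str, sha: str, artifacts: list[tuple[str, str]]
) -> list[list[str]]:
    """Build one `skillmeat add` argv list per artifact, all pinned to one SHA."""
    return [["skillmeat", "add", kind, f"{repo}/{path}@{sha}"] for kind, path in artifacts]


if __name__ == "__main__":
    for argv in build_add_commands("owner/research-repo", PINNED_SHA, ARTIFACTS):
        # Print only; pass argv to subprocess.run(...) if you want to execute it.
        print(shlex.join(argv))
```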
Verify Imports
Check what you've imported:
You'll see all artifacts in your collection with their sources:
Artifacts (3)
┌───────────────┬───────┬──────────────────────────────────────────┐
│ Name          │ Type  │ Origin                                   │
├───────────────┼───────┼──────────────────────────────────────────┤
│ neural-ranker │ skill │ owner/research-repo/skills/neural-ranker │
│ embedder      │ skill │ owner/research-repo/skills/embedder      │
│ eval          │ cmd   │ owner/research-repo/commands/eval        │
└───────────────┴───────┴──────────────────────────────────────────┘
Step 2: Create a Custom Skill for Your ML Workflow
Now you'll author your own skill — a reproducible unit of ML logic (preprocessing, model evaluation, experiment tracking) that your team can reuse and version.
Skill Structure
A skill is a directory with metadata and implementation. Create it locally in your project:
Minimal Skill File
Create skill.yaml (SkillMeat metadata):
---
name: ml-eval-harness
version: 0.1.0
description: >
  Evaluation harness for ML ranking models:
  - Loads datasets from HF
  - Computes NDCG, MRR, MAP
  - Outputs JSON results
type: skill
author: you@company.com
scope: local  # Only available in this project
tags:
  - ml
  - evaluation
  - ranking
dependencies:
  - datasets  # Python package
  - scipy
---
Skill Implementation
Create harness.py, your actual ML code:
"""ML evaluation harness.

Ranking metrics over HuggingFace datasets. Only NDCG@k is implemented here;
MRR and MAP are left as extensions.
"""
import json
import math

from datasets import load_dataset
from scipy.stats import rankdata


def load_eval_data(dataset_name: str, split: str = "test"):
    """Load evaluation dataset.

    Args:
        dataset_name: HF dataset identifier (e.g., "BeIR/nfcorpus")
        split: Dataset split ("test", "val", etc.)

    Returns:
        Dataset with qid, doc_ids, relevant fields
    """
    return load_dataset(dataset_name, split=split)


def compute_ndcg(scores: list[float], relevances: list[int], k: int = 10) -> float:
    """Normalized Discounted Cumulative Gain @ k."""
    # Rank 1 is the highest-scoring document; ties are broken by input order.
    ranks = rankdata([-s for s in scores], method="ordinal")
    # DCG: gain of each document in the model's top k, discounted by log2(rank + 1).
    dcg = sum(rel / math.log2(rank + 1) for rank, rel in zip(ranks, relevances) if rank <= k)
    # IDCG: the same sum for the best possible ordering of the labels.
    ideal = sorted(relevances, reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def evaluate_on_dataset(model_scores_file: str, dataset_name: str) -> dict:
    """Run full evaluation pipeline.

    Args:
        model_scores_file: Path to JSON with {qid: [scores...]}
        dataset_name: HF dataset for relevance labels

    Returns:
        Metrics dict with the average NDCG@10 and the number of scored samples
    """
    with open(model_scores_file) as f:
        model_scores = json.load(f)
    dataset = load_eval_data(dataset_name)
    ndcg_scores = []
    for sample in dataset:
        qid = sample["qid"]
        if qid not in model_scores:
            continue  # skip queries the model did not score
        scores = model_scores[qid]
        relevances = sample["relevant"]
        ndcg_scores.append(compute_ndcg(scores, relevances, k=10))
    return {
        # Guard against an empty overlap between the scores file and the dataset.
        "ndcg@10_avg": sum(ndcg_scores) / len(ndcg_scores) if ndcg_scores else 0.0,
        "samples": len(ndcg_scores),
    }


if __name__ == "__main__":
    result = evaluate_on_dataset("scores.json", "BeIR/nfcorpus")
    print(json.dumps(result, indent=2))
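The skill metadata also lists MRR and MAP. One way to add them, sketched here under the assumption that relevance is binary (any label greater than 0 counts as relevant) and that `scores` and `relevances` are aligned position by position, is to compute a per-query reciprocal rank and average precision and then average each over queries. The function names are illustrative and not part of the harness above.

```python
def _relevance_in_ranked_order(scores: list[float], relevances: list[int]) -> list[int]:
    """Relevance labels reordered by descending model score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [relevances[i] for i in order]


def reciprocal_rank(scores: list[float], relevances: list[int]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is relevant."""
    for rank, rel in enumerate(_relevance_in_ranked_order(scores, relevances), start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def average_precision(scores: list[float], relevances: list[int]) -> float:
    """Mean of precision@rank taken at each relevant document's rank."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(_relevance_in_ranked_order(scores, relevances), start=1):
        if rel > 0:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


if __name__ == "__main__":
    # Ranked order is doc 0 (not relevant), doc 2 (relevant), doc 1 (relevant).
    scores, relevances = [0.9, 0.2, 0.5], [0, 1, 1]
    print(reciprocal_rank(scores, relevances))    # 0.5
    print(average_precision(scores, relevances))  # (1/2 + 2/3) / 2
```

MRR and MAP are then the means of these per-query values across all scored queries.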
Deploy the Skill to Your Project
Verify deployment:
Your skill is now available in your project. Import it into notebooks or CLI scripts:
import sys
sys.path.insert(0, ".claude/skills/ml-eval-harness")
from harness import evaluate_on_dataset
result = evaluate_on_dataset("scores.json", "BeIR/nfcorpus")
print(result)
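The harness reads `scores.json` as a mapping from query ID to a list of scores. A minimal sketch of producing such a file follows; the query IDs and values are made up for illustration, `write_scores` is a hypothetical helper, and each list is assumed to line up position by position with that query's relevance labels in the dataset.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical query IDs and scores, for illustration only.
example_scores = {
    "q1": [0.91, 0.12, 0.55],
    "q2": [0.30, 0.80],
}


def write_scores(scores: dict[str, list[float]], path: Path) -> None:
    """Write {qid: [scores...]} as JSON, checking the shape first."""
    for qid, values in scores.items():
        if not isinstance(qid, str) or not all(isinstance(v, (int, float)) for v in values):
            raise TypeError(f"bad entry for query {qid!r}")
    path.write_text(json.dumps(scores, indent=2))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "scores.json"
        write_scores(example_scores, target)
        # Round-trip check: the file loads back to the same mapping.
        print(json.loads(target.read_text()) == example_scores)
```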
Step 3: Capture Experiment Context with Memory
The memory system lets you record experiment decisions, hyperparameter choices, and findings so you (and your team) can recall the rationale behind decisions months later.
Capture an Experiment Decision
When you try a new approach and want to remember why:
skillmeat memory item create \
--project skillmeat \
--type decision \
--content "Switched to BM25+neural ensemble after single-neural underperformed on precision. BM25 provides recall; neural reranks top-100. Trade-off: +0.05 NDCG@10, +15ms latency." \
--confidence 0.9 \
--anchor "harness.py:code:45-65" \
--anchor ".claude/experiments/exp-001-ensemble.md:doc"
Capture a Constraint or Gotcha
When you discover a limitation:
skillmeat memory item create \
--project skillmeat \
--type gotcha \
--content "HF BeIR/nfcorpus is sparse: ~100 test queries but large corpus. Model must handle variable list lengths or dataset preprocessing will time out. Pre-batch queries in groups of 10." \
--confidence 0.85 \
--anchor "harness.py:code:25-35"
Capture Experiment Learnings
Record what you learned for reproducibility:
skillmeat memory item create \
--project skillmeat \
--type learning \
--content "Warm-starting from published ranker checkpoint (model-v1.0) cut training time from 8h to 2h. Verify checkpoint license before using in production." \
--confidence 0.92 \
--anchor ".claude/experiments/exp-003-finetune.md:doc:1-50"
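If you record memory items from scripts, assembling the arguments in code avoids shell-quoting problems with long `--content` strings. The sketch below uses only the flags shown in the commands above. `format_anchor` and `build_memory_command` are hypothetical helpers, the `path:kind[:start-end]` anchor shape is inferred from the examples, and the 0 to 1 range for `--confidence` is an assumption.

```python
import shlex
from typing import Optional


def format_anchor(path: str, kind: str, lines: Optional[tuple[int, int]] = None) -> str:
    """Format an anchor as 'path:kind' or 'path:kind:start-end'."""
    anchor = f"{path}:{kind}"
    if lines is not None:
        start, end = lines
        if start < 1 or end < start:
            raise ValueError(f"invalid line range: {lines}")
        anchor += f":{start}-{end}"
    return anchor


def build_memory_command(
    project: str, item_type: str, content: str, confidence: float, anchors: list[str]
) -> list[str]:
    """Build the argv list for `skillmeat memory item create` (not executed here)."""
    if not 0.0 <= confidence <= 1.0:  # assumed valid range
        raise ValueError("confidence is assumed to lie between 0 and 1")
    argv = [
        "skillmeat", "memory", "item", "create",
        "--project", project,
        "--type", item_type,
        "--content", content,
        "--confidence", str(confidence),
    ]
    for anchor in anchors:
        argv += ["--anchor", anchor]
    return argv


if __name__ == "__main__":
    argv = build_memory_command(
        project="skillmeat",
        item_type="gotcha",
        content="Pre-batch queries in groups of 10.",
        confidence=0.85,
        anchors=[format_anchor("harness.py", "code", (25, 35))],
    )
    print(shlex.join(argv))  # pass argv to subprocess.run(...) to execute
```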
Search Your Experiment Memory
Later, when you want to recall what you decided:
Returns matches across all memory items. Use this before running a new experiment to check if you've already tried something.
For a full overview, see Memory System Guide.
Step 4: Pin Versions with Snapshots & Lock Files
Reproducibility means others can re-run your experiment with identical artifact versions. Use snapshots and lock files to capture exact baselines.
Create a Snapshot Before an Experiment
Before running a significant experiment, snapshot your collection:
This records every artifact version in your collection at this moment:
Snapshot created: exp-001-baseline
Timestamp: 2026-04-20T14:30:00Z
Artifacts: 5
Location: ~/.skillmeat/collection/snapshots/exp-001-baseline.lock
View Your Lock File
The lock file is TOML and records exact versions:
Output:
[lock]
version = "1.0.0"
snapshot_name = "exp-001-baseline"
created_at = "2026-04-20T14:30:00Z"
[lock.entries.neural-ranker]
source = "owner/research-repo/skills/neural-ranker"
resolved_sha = "abc1234def5678..."
resolved_version = "v2.1.0"
[lock.entries.embedder]
source = "owner/research-repo/skills/embedder"
resolved_sha = "def5678ghi9012..."
resolved_version = "v1.8.0"
[lock.entries.ml-eval-harness]
source = ".claude/skills/ml-eval-harness"
resolved_sha = "local-abc1234"
resolved_version = "0.1.0"
Commit the Lock File
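Add it to Git so teammates can reproduce. Git only stages files inside the current repository, so run this from the repository that contains the snapshot, or copy the lock file into your project first: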
git add ~/.skillmeat/collection/snapshots/exp-001-baseline.lock
git commit -m "snapshot: exp-001-baseline before neural ensemble trial"
Restore a Snapshot Later
If you want to reproduce an old experiment exactly:
All artifacts revert to the pinned versions. Your collection is now in the state it was when you created the snapshot.
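After a restore, it can be worth confirming that nothing drifted from the baseline. The sketch below compares two `{artifact name: resolved SHA}` mappings, which you would extract from the baseline lock file and from a fresh snapshot. `diff_pins` is a hypothetical helper and the SHAs are made up for illustration.

```python
def diff_pins(baseline: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {artifact name: resolved_sha} mappings."""
    shared = set(baseline) & set(current)
    return {
        "added": sorted(set(current) - set(baseline)),
        "removed": sorted(set(baseline) - set(current)),
        "changed": sorted(name for name in shared if baseline[name] != current[name]),
    }


if __name__ == "__main__":
    # Hypothetical SHAs for illustration.
    baseline = {"neural-ranker": "abc1234", "embedder": "def5678"}
    current = {"neural-ranker": "abc1234", "embedder": "0a1b2c3", "eval": "abc1234"}
    print(diff_pins(baseline, current))
    # {'added': ['eval'], 'removed': [], 'changed': ['embedder']}
```

Three empty lists mean the collection matches the baseline.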
For detailed versioning and rollback workflows, see Sync, Versioning & Rollback.
Step 5: Jupyter Notebook Integration
Use your SkillMeat skills in Jupyter notebooks without copying code around.
Load a Skill in a Notebook
In a Jupyter cell, make your skills importable:
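import sys
# The skill directory name contains hyphens, so it cannot be imported as a
# package; put the directory itself on sys.path and import the module inside it.
sys.path.insert(0, "./.claude/skills/ml-eval-harness")
# Import your skill module
from harness import evaluate_on_dataset, compute_ndcg
# Now use it
result = evaluate_on_dataset("scores.json", "BeIR/nfcorpus")
print(f"NDCG@10: {result['ndcg@10_avg']:.4f}")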
Keep Skill Code Synced
Your .claude/skills/ directory is checked into Git. When you update a skill:
- Edit files in `.claude/skills/ml-eval-harness/`
- Reload the module in Jupyter:
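import importlib
import harness  # importable once .claude/skills/ml-eval-harness is on sys.path
importlib.reload(harness)
# Names imported earlier with "from harness import ..." still point at the old
# code; call harness.evaluate_on_dataset(...) or re-import them after reloading
# Re-run your evaluation with updated logic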
Share Skills with Team via Collection
Export your custom skill from your local project to your team's shared collection:
# Export the skill to collection (not just project)
skillmeat export ml-eval-harness --to-collection
# Push to shared repo (if you're maintaining a shared collection)
cd ~/.skillmeat/collection/skills/ml-eval-harness
git push origin main
Teammates can then import it:
Jupyter Best Practices
- Keep skill logic in `.claude/skills/`: notebooks are for experiments, skills are for reproducibility
- Version skills via SkillMeat snapshots, not manual notebook saves
- Use memory to capture notebook discoveries: record findings that inform skill design
- Export analysis notebooks separately: store notebooks that use skills in a `notebooks/` directory, not in skills
Verification Checklist
- [ ] Imported at least one artifact from GitHub with a pinned SHA
- [ ] Created a custom skill with `skill.yaml` and implementation code
- [ ] Deployed the skill to your project (visible in `.claude/skills/`)
- [ ] Captured ≥2 memory items (decision, gotcha, or learning) using `skillmeat memory item create`
- [ ] Created a snapshot of your collection before an experiment
- [ ] Located the lock file and verified it contains pinned versions
- [ ] Loaded a skill in a Jupyter notebook without copying code
- [ ] Verified teammates can import your shared skill from collection
What You've Accomplished
✅ Imported research artifacts from GitHub with reproducible SHAs
✅ Authored a custom skill with versioning and dependencies
✅ Captured experiment decisions in the memory system
✅ Pinned reproducible baselines via snapshots and lock files
✅ Integrated skills into Jupyter for interactive analysis
Your ML workflow is now:
- Reproducible: lock files ensure exact versions for every experiment
- Documented: the memory system captures decision rationale
- Shareable: skills can be published to the team collection
- Auditable: Git tracks all artifacts and decisions
Next Steps
- Deepen memory usage — See Memory System Guide for advanced workflows (context packs for experiment bundles, search across team memory)
- Automate with programmatic API — Use SkillMeat's API to trigger experiments and skill deployments from your CI/CD — see Programmatic API Usage Guide
- Team synchronization — If you're building a shared skill library, see Adding Artifacts for publication workflows and Sync, Versioning & Rollback for keeping team baselines consistent
- Fine-tune for your domain — Extend the ML evaluation skill with your custom metrics or model types
Getting Help
- `skillmeat memory --help`: Memory system commands
- `skillmeat snapshot --help`: Snapshot operations
- `skillmeat add --help`: Importing from GitHub
- `skillmeat deploy --help`: Deploying skills to projects
- Check the CLI Reference for detailed command documentation
Time to complete: ~30 minutes
Complexity: Advanced
Prerequisites met: ✅
Target audience: ML engineers, research teams, skill authors, power users