Reading the docs isn't enough. The most valuable developer feedback lives inside GitHub issues, bug reports, and feature discussions. In this article, I share the checklists and coding framework I use to mine repositories, uncover real developer pain points, and turn engineering conversations into meaningful UX research findings.
Reading GitHub issues without a coding framework produces impressions. Coding them produces findings.
When you code an issue, you are making an explicit analytical decision about what type of signal it contains. You are saying: "This issue is evidence of a feedback gap at Stage 4 (model loading), filed by an ML engineer at the new-user experience level, in KServe v0.11, and it directly answers my research question about what information engineers need during model loading that the product currently does not provide."
That sentence, assembled from your coding decisions, is a finding. Multiply it across 100 issues and you have a research study.
Without coding, you have 100 anecdotes. With coding, you have a dataset.
It makes patterns visible. When you code 20 issues and notice that 14 of them map to the same stage and the same challenge type, you have found your highest-priority UX problem. You would never see that pattern by reading casually.
It makes your findings defensible. "Engineers struggle with KServe" is an opinion. "18 issues across v0.10–v0.13 filed by engineers in their first deployment show identical feedback gap patterns at Stage 4, with an average of 14 comments per issue" is a finding.
It separates your role from the engineer's role. Engineers read GitHub issues as bug reports. You read them as evidence of design decisions. The coding framework is the analytical lens that makes your UX reading possible.
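To make the "patterns become visible" point concrete, here is a minimal sketch of how coded issues can be tallied. The issue numbers and codes below are hypothetical and made up for illustration; they are not real KServe data.

```python
from collections import Counter

# Hypothetical coded issues: (issue number, stage, usability challenge).
# These rows are illustrative only, not real KServe issues.
coded_issues = [
    (101, "Stage 4", "U3 Feedback & visibility gap"),
    (102, "Stage 4", "U3 Feedback & visibility gap"),
    (103, "Stage 5", "U2 Error recovery failure"),
    (104, "Stage 4", "U3 Feedback & visibility gap"),
    (105, "Stage 1", "U1 Learnability breakdown"),
]

# Count how often each (stage, challenge) pair occurs across the sample.
pattern_counts = Counter(
    (stage, challenge) for _, stage, challenge in coded_issues
)

# The most frequent pair is the candidate highest-priority problem.
top_pattern, top_count = pattern_counts.most_common(1)[0]
print(f"{top_count} of {len(coded_issues)} issues: {top_pattern}")
```

The same tally works whether the codes live in a spreadsheet export or a list like this one; the point is that the count only exists once every issue carries explicit codes.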
Before any analysis, capture the foundational data for every issue. This establishes the quantitative baseline for your research report.
Issue Metadata Checklist
─────────────────────────────────────────────
[ ] Issue URL / Number       e.g. #1234
[ ] Issue creation date
[ ] Issue category           Bug | Question | Feature Request | Docs
[ ] Labels                   e.g. InferenceService, Control Plane, Kubernetes
[ ] Current state            Open | Closed | Stale | Merged
[ ] Resolution type          Code Fix | Docs Update | Workaround | No resolution
[ ] Total comment count
[ ] Unique commenter count
[ ] Ping-pong count          Back-and-forth before root cause found
[ ] Brief issue summary
The "ping-pong count" (the number of back-and-forth diagnostic comments between the user and maintainers before the root cause was identified) is a particularly powerful metric. High ping-pong means the product gave the user no diagnostic signal. That is a UX failure in the product itself.
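One way to make the ping-pong count repeatable is to count how many times the conversation switches sides before the root cause appears. The sketch below is one possible operationalisation, not a standard metric: the `Comment` class, the `is_maintainer` flag, and the `root_cause` marker are all hypothetical fields that the researcher fills in by hand while reading the thread.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    author: str                # commenter's login
    is_maintainer: bool        # coded by the researcher (assumption)
    root_cause: bool = False   # researcher marks the root-cause comment


def ping_pong_count(comments: list[Comment]) -> int:
    """Count user <-> maintainer switches up to the root-cause comment."""
    switches = 0
    previous_side = None
    for comment in comments:
        side = comment.is_maintainer
        if previous_side is not None and side != previous_side:
            switches += 1
        previous_side = side
        if comment.root_cause:
            break  # anything after the root cause is not diagnostic ping-pong
    return switches


# Hypothetical thread: the issue body counts as the first entry.
thread = [
    Comment("reporter", False),
    Comment("maintainer-a", True),
    Comment("reporter", False),
    Comment("maintainer-a", True, root_cause=True),
    Comment("reporter", False),
]
print(ping_pong_count(thread))
```

Counting switches rather than raw comments keeps a single long maintainer monologue from inflating the number.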
Understanding user demographics is essential in any research project because it tells you whose problems to prioritise. In interviews or surveys, you simply ask. In GitHub mining, you have to infer from signals embedded in the issue itself.
In the context of KServe, we are primarily trying to distinguish between three main personas.
The language engineers use tells you their world immediately.
| Persona | Typical keywords | What they focus on |
|---|---|---|
| Data Scientist / ML Engineer | PyTorch, TensorFlow, HuggingFace, weights, predictor, artifact, S3, inputs/outputs | The model itself: getting a Python script to serve predictions |
| Platform / DevOps Engineer | CRDs, Istio, Knative, ingress, RBAC, service account, Helm, multi-cluster, HPA | Infrastructure, networking, security, cluster stability |
| Application Developer | REST API, gRPC, JSON payload, curl, SDK, timeout, 503 error, endpoint | Consuming the model: integrating the endpoint into a larger application |
KServe issues typically fall into three abstraction layers:
kubectl describe outputs, cluster events, Helm values files, or Istio configurations: they speak in Kubernetes YAML
Quick triage question: Is this person treating KServe as an infrastructure component or as a model-delivery tool? Infrastructure → Platform/DevOps. Model-delivery → Data Scientist/ML Engineer.
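The keyword table above can be turned into a rough first-pass classifier. This is a sketch under simple assumptions: it only counts keyword hits, the `infer_persona` function is hypothetical, and any tie or zero-hit case falls back to "Unclear" so that a human makes the final call.

```python
import re

# Keywords taken from the persona table; lowercased for matching.
PERSONA_KEYWORDS = {
    "Data Scientist / ML Engineer": [
        "pytorch", "tensorflow", "huggingface", "weights",
        "predictor", "artifact", "s3",
    ],
    "Platform / DevOps Engineer": [
        "crds", "istio", "knative", "ingress", "rbac",
        "service account", "helm", "multi-cluster", "hpa",
    ],
    "Application Developer": [
        "rest api", "grpc", "json payload", "curl", "sdk",
        "timeout", "503", "endpoint",
    ],
}


def infer_persona(issue_text: str) -> str:
    """Return the persona with the most keyword hits, or 'Unclear'."""
    text = issue_text.lower()
    scores = {
        persona: sum(
            1 for kw in keywords
            if re.search(r"\b" + re.escape(kw) + r"\b", text)
        )
        for persona, keywords in PERSONA_KEYWORDS.items()
    }
    best = max(scores.values())
    winners = [p for p, s in scores.items() if s == best]
    if best == 0 or len(winners) > 1:
        return "Unclear"  # no signal or a tie: leave it to the researcher
    return winners[0]


print(infer_persona("My PyTorch model weights in S3 are not loaded by the predictor"))
```

Treat the output as a suggestion to confirm while reading, especially for the ML-expert / K8s-novice case, where vocabulary from two personas is mixed in one issue.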
Demographics Checklist
─────────────────────────────────────────────
[ ] Inferred user persona
      Data Scientist / ML Engineer
      Platform / DevOps Engineer
      Application Developer
      ML-expert / K8s-novice (the hardest edge case)
      Unclear
[ ] Experience level
      New user (first issues, plain English, no version info)
      Experienced (provides full env, logs, rules out causes)
      Unclear
[ ] Deployment environment
      Local (Kind / Minikube)
      Cloud managed (EKS / GKE / AKS)
      On-premises / bare metal
[ ] Deployment scale
      Single-cluster
      Multi-cluster
      Multi-tenant
[ ] Deployment method
      Helm | Kustomize | ArgoCD / GitOps | Direct kubectl
The hardest edge case: ML-experienced / Kubernetes-novice engineers. They write technically confident issues about model formats or serving runtimes, but are completely confused about Istio or Knative. Always code these as a separate category, because they reveal a completely different class of UX failure.
The key insight for non-technical UX researchers: you do not need to understand the technical content of an issue to identify its UX signal. The signals are in the language, not the configuration.
Here is how to read any issue in under two minutes.
Scan for time words before you read anything else.
| Time word | Severity | What it means |
|---|---|---|
| "minutes" | Low | Minor gap, quickly resolved |
| "hours" | Medium | Significant friction, real work lost |
| "days" | High | Severe friction, deadline impact |
| "weeks" / "months" | Critical | Product-level failure |
| "gave up" / "switching to X" | Abandonment | User is leaving |
Real example: "I've spent the last three days trying to figure out why my model stays in Unknown status."
You do not need to know what Unknown status means. "Three days" tells you this is a high-severity finding.
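The time-word scan can be sketched as a small lookup. The patterns below mirror the severity table; the function name and the "strongest signal wins" ordering are my assumptions, and the result is only a first-pass flag, since a word like "days" can also appear in contexts that have nothing to do with friction.

```python
import re
from typing import Optional

# Ordered from most to least severe so the strongest signal wins.
SEVERITY_PATTERNS = [
    ("Abandonment", r"\bgave up\b|\bswitching to\b"),
    ("Critical", r"\b(weeks?|months?)\b"),
    ("High", r"\bdays?\b"),
    ("Medium", r"\bhours?\b"),
    ("Low", r"\bminutes?\b"),
]


def time_word_severity(issue_text: str) -> Optional[str]:
    """Return the highest severity whose time word appears, else None."""
    text = issue_text.lower()
    for severity, pattern in SEVERITY_PATTERNS:
        if re.search(pattern, text):
            return severity
    return None


quote = ("I've spent the last three days trying to figure out "
         "why my model stays in Unknown status.")
print(time_word_severity(quote))
```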
Every friction-revealing issue has the word "but" at a specific moment. Everything before "but" is what the engineer did correctly. Everything after is where the product failed them.
Real example: "I followed the quickstart exactly but the webhook never became Ready."
Before "but" = user followed instructions. After "but" = product gave instructions that led to failure. That is a documentation UX finding, not a technical bug.
Real example: "The status shows Ready but every curl request returns 503."
This is the "misleading success" friction type: the product declared success when the user's actual goal was completely unmet. It is one of the most trust-destroying UX failures possible.
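The "but" reading can be mechanised as a simple split, which is handy when you want both halves in separate spreadsheet columns. This is a minimal sketch: the `split_on_but` helper is hypothetical, and it only splits at the first standalone "but" in a sentence.

```python
import re


def split_on_but(sentence: str):
    """Split a sentence at its first standalone 'but'.

    Returns (what the engineer did, where the product failed them),
    or None when the sentence contains no 'but'.
    """
    match = re.search(r"\bbut\b", sentence, flags=re.IGNORECASE)
    if match is None:
        return None
    before = sentence[:match.start()].strip(" ,")
    after = sentence[match.end():].strip(" ,.")
    return before, after


print(split_on_but(
    "I followed the quickstart exactly but the webhook never became Ready."
))
```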
Phrases such as "I thought", "I assumed", and "I expected" directly reveal a mental model gap: the engineer built an incorrect picture of how the system works, and reality contradicted it.
Real example: "I thought Transformer meant a language model component like BERT. Turns out it's just data preprocessing."
This is not a bug. It is a naming failure. "Transformer" means attention-based neural architecture in the ML world. KServe uses it to mean "a component that preprocesses data before the model." Every ML engineer who encounters this name builds the wrong mental model from it.
Real example: "I assumed KServe would track model versions automatically, like a proper ML serving platform should."
This is scope confusion: the engineer's expectation of what KServe is does not match what it actually is. That mismatch is a design communication failure, not a user error.
Each "I had to" signals a missing workflow step: something the engineer needed that the product should have provided.
Real example: "I had to write a polling script to check when the InferenceService became Ready, because there's no built-in wait command."
Count the "I had to" chains in an issue. An issue with four of them puts four separate manual burdens on an engineer for a task that should have been automated.
Scan for capitalized tool names: Knative, Istio, Prometheus, MLflow, Argo, Triton, HuggingFace. Count them.
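Both counting steps above are easy to sketch. The tool list is the one from the text; the function names are hypothetical, and the matching is deliberately strict (whole words, original capitalisation for tool names), so a mention like "ArgoCD" would not count as "Argo" unless you add it.

```python
import re

# Tool names taken from the list above; extend for your own corpus.
TOOL_NAMES = ["Knative", "Istio", "Prometheus", "MLflow",
              "Argo", "Triton", "HuggingFace"]


def count_had_to(issue_text: str) -> int:
    """Count occurrences of the phrase 'I had to'."""
    return len(re.findall(r"\bI had to\b", issue_text, flags=re.IGNORECASE))


def tools_mentioned(issue_text: str) -> list[str]:
    """Return the tools from TOOL_NAMES that appear as whole words."""
    return [
        tool for tool in TOOL_NAMES
        if re.search(r"\b" + re.escape(tool) + r"\b", issue_text)
    ]


text = ("I had to write a polling script. Then I had to patch Istio "
        "and read the Knative docs.")
print(count_had_to(text), tools_mentioned(text))
```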
When an engineer writes "sorry if this is a basic question", that is not politeness. That is evidence the product made a competent person feel responsible for the product's communication failure.
Once you have your basic metadata and demographics, here is the full coding structure to apply to every issue.
Map every issue to one of 8 stages. This tells you where in the journey the product is losing engineers.
Deployment Stage Checklist
─────────────────────────────────────────────
[ ] Stage 1 · Setup            Installing KServe and dependencies
      Signals: "webhook not ready" · "quickstart fails"
[ ] Stage 2 · Storage          Getting the trained model accessible
      Signals: "access denied" · "model not found" · "storageUri"
[ ] Stage 3 · Configuration    Writing the InferenceService YAML spec
      Signals: "minimum config?" · "deprecated field" · "required fields?"
[ ] Stage 4 · Loading          Applying config and waiting for model to load
      Signals: "Unknown status" · "how long?" · "no logs" · "OOMKilled"
[ ] Stage 5 · Network          Reaching the deployed endpoint
      Signals: "Ready but 503" · "connection refused" · "EXTERNAL-IP pending"
[ ] Stage 6 · Inference        Sending requests and getting predictions back
      Signals: "400 error" · "what format?" · "V1 vs V2 protocol"
[ ] Stage 7 · Hardening        Making the deployment production-reliable
      Signals: "zero downtime update" · "autoscaling conflict" · "SLA"
[ ] Stage 8 · Day-2 ops        Updating, monitoring, governing over time
      Signals: "rollback" · "update my model" · "60 models across teams"
[ ] Cross-stage?               Root cause in Stage X, discovered at Stage Y
      (delayed discovery = highest severity finding)
The most important finding: Stages 4 and 5 consistently produce the highest issue volume in KServe. The product is completely silent at the two moments when engineers are most anxious and most blind.
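The signal phrases in the stage checklist can serve as a rough pre-sort before manual coding. The sketch below is an assumption-laden helper, not part of the framework itself: `suggest_stages` is a hypothetical name, the phrases are copied from the checklist, and it returns every matching stage so that cross-stage issues surface as more than one number.

```python
import re

# Signal phrases copied from the stage checklist, keyed by stage number.
STAGE_SIGNALS = {
    1: ["webhook not ready", "quickstart fails"],
    2: ["access denied", "model not found", "storageuri"],
    3: ["minimum config", "deprecated field", "required fields"],
    4: ["unknown status", "how long", "no logs", "oomkilled"],
    5: ["ready but 503", "connection refused", "external-ip pending"],
    6: ["400 error", "what format", "v1 vs v2 protocol"],
    7: ["zero downtime update", "autoscaling conflict", "sla"],
    8: ["rollback", "update my model", "60 models across teams"],
}


def suggest_stages(issue_text: str) -> list[int]:
    """Return the stage numbers whose signal phrases appear in the text."""
    text = issue_text.lower()
    return sorted(
        stage for stage, signals in STAGE_SIGNALS.items()
        if any(re.search(r"\b" + re.escape(s) + r"\b", text) for s in signals)
    )


print(suggest_stages(
    "Model stuck in Unknown status with no logs, then connection refused"
))
```

A result with two stages is a prompt to check the cross-stage box by hand; the helper cannot tell which stage holds the root cause.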
Usability Challenge Checklist
─────────────────────────────────────────────
[ ] U1 · Learnability breakdown
      Cannot figure out how to do the task the first time
      Signal: "how do I" · already answered in docs · confused by concepts
[ ] U2 · Error recovery failure
      Hits error, can't understand it, doesn't know which log to check
      Signal: pastes cryptic error, "stuck for days", tries random things
[ ] U3 · Feedback & visibility gap
      System gives no signal: Unknown, Pending, complete silence
      Signal: "nothing happens" · "how long should this take?" · "no logs"
[ ] U4 · Configuration complexity
      Too many fields, unclear defaults, no minimum viable spec
      Signal: "is all of this needed?" · "which fields are required?"
[ ] U5 · Mental model mismatch
      Expectation contradicts how system actually works
      Signal: "I expected" · "I thought" · "this makes no sense"
[ ] U6 · Workaround proliferation
      User invented their own solution to fill a product gap
      Signal: "I wrote a script" · "I had to" · shared snippets in comments
Developer Friction Checklist
─────────────────────────────────────────────
[ ] F1 · Invisible wall           System silent, nothing to debug
[ ] F2 · Misleading success       "Ready" but goal completely unmet
[ ] F3 · Hidden prerequisite      Required knowledge never communicated until failure
[ ] F4 · Terminology confusion    Word means something different in this context
[ ] F5 · Broken feedback loop     Can't tell if a change had any effect
[ ] F6 · Forced context switch    Must configure Istio/Knative to complete one KServe task
[ ] F7 · Documentation gap        Knows what they want, can't find how to do it
[ ] F8 · Accumulated friction     5–6 small frictions in sequence → abandonment signal
System Challenge Checklist
─────────────────────────────────────────────
[ ] Ownership ambiguity        KServe says "that's Istio", Istio says "that's KServe"
[ ] Abstraction leakage        InferenceService was meant to hide Knative/Istio; it doesn't
[ ] Observability gap          Logs scattered across 4+ components; no unified view
[ ] Role boundary collision    ML engineer task structurally requires platform engineer action
[ ] Upgrade path fragility     Every version upgrade risks production breakage
Environmental Challenge Checklist
─────────────────────────────────────────────
[ ] Managed K8s divergence       EKS, GKE Autopilot, OpenShift behave differently
[ ] Corporate proxy / air-gap    No public internet; private registry; air-gapped
[ ] GPU & hardware               OOMKilled, VRAM insufficient, driver mismatch
[ ] Org security policy          OPA, Gatekeeper, PodSecurityAdmission blocking KServe
[ ] On-premises / hybrid         No managed LoadBalancer, NFS storage, bare metal
[ ] Regulated / compliance       HIPAA, SOC2, GDPR, data residency requirements
Version tracking is what transforms your research from a snapshot into a longitudinal UX health report.
Version Tracking Checklist
─────────────────────────────────────────────
[ ] KServe version (exact)              e.g. 0.11.2
[ ] Previous version (if upgrade)       e.g. 0.10 → 0.11
[ ] Kubernetes version                  e.g. 1.27
[ ] Cloud provider                      EKS / GKE / AKS / On-prem / Local
[ ] Version stated by:                  User upfront | Maintainer had to ask | Never provided
[ ] Upgrade experience:                 Better | Same | Worse | New regression introduced
[ ] Chronic pain signal:                Same issue present in prior version? Yes / No
The chronic problem list (friction points that appear in the top 3 across three or more versions) is your most powerful finding. A problem that survived three release cycles is not a bug. It is an architectural decision.
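Building the chronic problem list is a mechanical step once each version band has friction tallies. The sketch below applies the rule stated above (top 3 in three or more versions); the tallies are hypothetical numbers for illustration, and `chronic_problems` is a name I made up.

```python
from collections import Counter

# Hypothetical friction-code tallies per version band (illustrative only).
friction_by_version = {
    "v0.10": Counter({"F1": 9, "F2": 6, "F7": 5, "F4": 2}),
    "v0.11": Counter({"F1": 11, "F7": 7, "F6": 4, "F2": 3}),
    "v0.12": Counter({"F1": 8, "F2": 6, "F7": 5, "F6": 1}),
}


def chronic_problems(tallies, top_n=3, min_versions=3):
    """Return codes that rank in the top_n in at least min_versions bands."""
    appearances = Counter()
    for counts in tallies.values():
        for code, _ in counts.most_common(top_n):
            appearances[code] += 1
    return sorted(
        code for code, n in appearances.items() if n >= min_versions
    )


print(chronic_problems(friction_by_version))
```

In this made-up sample, F2 ranks in the top 3 in only two of the three bands, so it stays off the chronic list even though it is frequent.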
LLM Inference Checklist
─────────────────────────────────────────────
[ ] Is this an LLM issue?      Yes | No | Hybrid
[ ] LLM model family           Llama | Mistral | Qwen | Gemma | Custom
[ ] LLM runtime                vLLM | TGI | OpenAI-compatible | Custom
[ ] Capability attempted:
      Basic inference
      Streaming tokens (SSE)
      Multi-GPU / tensor parallelism
      LoRA / adapter serving
      HuggingFace Hub authentication
      OpenAI API compatibility
      Quantisation (GPTQ, AWQ)
[ ] LLM-specific challenge:
      GPU OOM / VRAM insufficient
      Model loading with no progress signal
      Streaming failure through gateway
      HuggingFace auth failing in cluster
      Runtime version lag behind vLLM/TGI ecosystem
      No LLM-specific metrics (token throughput, TTFT)
[ ] Innovation lag signal:
      Date of capability request: ___________
      Date KServe released support: ___________
      Gap (days): ___________
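The innovation lag gap is a plain date subtraction. A minimal sketch, with hypothetical dates that do not refer to any real KServe request or release:

```python
from datetime import date


def innovation_lag_days(requested: date, supported: date) -> int:
    """Days between a capability request and the release that supports it."""
    return (supported - requested).days


# Hypothetical dates for illustration only.
request_date = date(2023, 1, 10)
release_date = date(2023, 7, 9)
print(innovation_lag_days(request_date, release_date))
```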
For each issue, record which research question it provides evidence for. This anchors your mining to your study goals.
Research Question Mapping
─────────────────────────────────────────────
Current-state questions (what is broken today)
[ ] RQ1 · First deployment challenges across roles and experience levels
[ ] RQ2 · Workflow gaps between deployed model and reliable production
[ ] RQ3 · Observability and debugging challenges by stage
[ ] RQ4 · LLM deployment challenges vs classical ML serving
[ ] RQ5 · Environmental factors shaping deployment experience
[ ] RQ6 · How challenges evolved across versions (your unique longitudinal contribution)
[ ] RQ7 · Design changes that would most reduce friction
UX improvement questions (what should be designed differently)
[ ] UX1 · Time-to-first-inference reduction
[ ] UX3 · Model loading progress visibility (highest volume finding)
[ ] UX4 · Self-service diagnostic experience
[ ] UX9 · LLM mental model bridge (vLLM/HuggingFace → KServe)
[ ] UX11 · Environment validator / dependency pre-flight checker
Once you spot a pattern across multiple issues, record it here. Use one template per pattern, not per issue.
UX Finding Template
─────────────────────────────────────────────────────────────
Finding type:      ___________________________________________
Affected users:    Role · Experience level · Version band
Deployment stage:  ___________________________________________
Evidence:          N issues · Date range · e.g. "14 issues, 2022–2024"
Best quote:        Under 25 words, your strongest evidence
─────────────────────────────────────────────────────────────
UX finding statement:
"Engineers [doing X] cannot [accomplish Y] because [design gap Z],
which means [impact on time / confidence / adoption]."
─────────────────────────────────────────────────────────────
Severity:          Low | Medium | High | Critical
Chronic?           Present across ___ versions
Design recommendation: ________________________________________
Research question answered: ___________________________________
Mining Session Completion Checklist
─────────────────────────────────────────────
Issue coverage
[ ] 50+ general deployment issues coded across all 8 stages
[ ] 30+ LLM inference issues coded and version-tracked
[ ] 10+ issues per major version band (v0.10, v0.11, v0.12, v0.13+)
[ ] 15+ upgrade issues with before/after UX delta recorded
[ ] Top 30 most-commented issues reviewed (sort:comments-desc)
[ ] 10+ abandoned issues (open 6+ months, last message unanswered)
[ ] 15+ success cases (quickly resolved; positive signal baseline)
[ ] All enhancement/feature-request labels reviewed
[ ] All competitive tool mentions captured (BentoML, Seldon, Ray Serve)
Baseline measurement
[ ] Metrics table filled per version band:
      Issue count | Avg comments | 7-day resolution rate | Emotional language %
[ ] Top-3 friction points per version → chronic problem list built
[ ] LLM innovation lag calculated for 5+ capabilities
[ ] Version reporting rate: % of issues that include version upfront
Quality
[ ] 15% of issues coded by a second researcher (inter-rater reliability)
[ ] Every research question has at least 3 issues as evidence
[ ] One finding statement written per significant pattern found
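The checklist asks for a second coder on 15% of issues but does not name an agreement statistic. Cohen's kappa is one common choice for two coders, so I use it here as an assumption rather than as the framework's prescribed measure. The codes in the example are hypothetical.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labelling the same issues in order."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must label the same non-empty sample")
    n = len(coder_a)
    # Observed agreement: share of issues given the same code by both.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement by chance, from each coder's label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical usability codes from two researchers on six issues.
researcher_1 = ["U3", "U3", "U2", "U5", "U3", "U1"]
researcher_2 = ["U3", "U3", "U2", "U3", "U3", "U1"]
print(round(cohens_kappa(researcher_1, researcher_2), 3))
```

Kappa corrects raw agreement for the agreement you would expect by chance, which matters when one code (such as U3) dominates the sample.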
The biggest barrier to studying developer tools as a UX researcher is the assumption that you need to understand the code to understand the problem. This framework removes that barrier entirely.
When you code an issue, you are not evaluating the correctness of someone's Kubernetes configuration. You are recording what the issue reveals about the human experience of using the product. "Status shows Unknown for 20 minutes" tells you everything you need regardless of whether you understand what Unknown means technically. The product left a user without feedback during its most critical operation. That is a UX finding independent of any technical knowledge.
The coding sheet converts anecdote into pattern. Instead of "users seem to struggle with deployment," you can say "14 of 20 issues sampled from v0.11 show feedback gap failures at Stage 4, with an average of 18 comments per issue, suggesting state communication is the highest-priority improvement target for this version band."
That is a product roadmap argument. The coding sheet built it.
The research questions embedded in the coding sheet map directly to design recommendations with evidence behind them. A contributor who wants to make a meaningful impact on user experience now has specific, evidence-backed targets: not "improve docs" but "add granular status conditions per phase that distinguish between 10 failure modes currently all reporting as Unknown."
When researchers share findings publicly in CNCF blog posts, KubeCon talks, or articles like this one, the coding framework makes the research reproducible. Other researchers can apply the same checklist to a different version or a different tool and compare results. That cumulative body of evidence is what eventually changes product direction.
GitHub issues are not a bug tracker. They are a longitudinal, naturalistic record of where real engineers encounter the gap between what a product promises and what it delivers.
A coding sheet is the analytical framework that transforms that record into research. Without it, you are reading. With it, you are studying.
The framework I built for KServe (covering demographics, deployment stages, usability challenges, friction types, mental model gaps, system challenges, environmental barriers, version tracking, and LLM inference) did not emerge from theory. It emerged from reading hundreds of issues and asking the same question every time: what is the UX researcher's reading of this, beyond what the engineer sees?
The answer is always the same: engineers see symptoms. The coding sheet helps you see the design decisions that caused them.
Start with the "but" sentence. Work backwards to the design failure. Code it. Repeat 100 times. Then write the research report.
This coding framework was developed as part of a UX research study on ML model deployment in KServe. If you are working on similar research in the cloud-native or MLOps space, I would love to hear your thoughts.