Sunday, April 5, 2026

Q. As an Azure cloud solution architect, how would you go about ensuring business continuity for your clients and their workloads?

A: As an Azure cloud solution architect, my first move is to turn “disaster recovery” into a workload-specific operating model, not a generic secondary-region checkbox. My clients’ workloads fall into one of four categories:

1. Web apps for APIs, a storage-account-based static website for the UI, and an Application Gateway providing web application firewall (WAF) and bot protection.

2. Automation workloads: Azure Databricks jobs and notebooks, plus Azure Data Factory data-transfer pipelines that run on a schedule.

3. Significant, heavy Kubernetes applications and jobs backed by MySQL, Cosmos DB, or PostgreSQL databases, with an Airflow scheduler.

4. GenAI-heavy Databricks applications with Langfuse monitoring and remote model and deployment API calls using the OpenAI chat specification.

Because all of these stamps live in Central US, I anchor the DR design on Azure region pairing, service-native replication, predetermined RTO/RPO targets, and rehearsed failover/failback runbooks; Azure documents that paired regions are in the same geography, are updated sequentially, and are prioritized for recovery during a broad outage. For a Central US footprint, the practical implication is that I prefer a paired-region strategy for the dependent services and the platform control plane, then decide case by case whether the secondary landing zone should be active-passive or active-active based on business criticality, latency tolerance, and the cost of duplicate infrastructure.

For the first stamp, where web apps plus a storage-account-backed static site for the UI and APIs sit behind an Application Gateway with WAF, the continuity design separates traffic steering, application state, and content distribution. I use a secondary region with identical infrastructure deployed from code, put the web tier behind a failover-capable global entry point if the business requires regional survivability, and make the Application Gateway/WAF configuration itself reproducible so that a new gateway can be stood up quickly in the secondary region. For the static UI, I make sure the storage account uses a geo-redundant replication strategy appropriate for the RPO my clients are willing to accept, because storage failover is distinct from application failover and the app must be able to point to the recovered endpoint after a region event. My runbook includes DNS or traffic-manager cutover, WAF policy validation, secret and certificate rehydration, and health-probe checks that confirm both the APIs and the static website are serving correctly before declaring the failover complete.
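The final health-probe gate in that runbook can be sketched as a small check that refuses to declare failover complete until every endpoint answers. This is a minimal sketch, not a production probe; the endpoint URLs are hypothetical placeholders, and the injectable `fetch` parameter exists so the logic can be exercised without live hosts.

```python
import urllib.request

# Hypothetical secondary-region endpoints; replace with the real hosts.
CHECKS = {
    "api": "https://api-secondary.example.com/healthz",
    "static_ui": "https://ui-secondary.example.com/index.html",
}

def probe(url, fetch=None, timeout=5):
    """Return True if the endpoint answers with HTTP 200."""
    fetch = fetch or (lambda u: urllib.request.urlopen(u, timeout=timeout).status)
    try:
        return fetch(url) == 200
    except Exception:
        return False

def failover_complete(checks=CHECKS, fetch=None):
    """Declare failover complete only when every probe passes."""
    results = {name: probe(url, fetch=fetch) for name, url in checks.items()}
    return all(results.values()), results
```

In a real cutover the same gate would also verify WAF policy behavior and certificate validity, not just HTTP status.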

For the second stamp, where Databricks jobs, notebooks, and Azure Data Factory pipelines dominate, the real continuity challenge is orchestration and data synchronization rather than just compute redeployment. Azure Databricks guidance for DR emphasizes having a secondary workspace in a secondary region, stopping workloads in the primary, starting recovery in the secondary, updating routing and workspace URLs, and then retriggering jobs once the secondary environment is operational. In practice, that means my clients’ notebooks, job definitions, cluster policies, libraries, secrets integration, and workspace dependencies must be stored in source control and redeployed automatically, while the actual data layer uses a replication or reprocessing plan that matches the pipeline’s tolerance for replay. For ADF, I treat metadata, triggers, linked services, and integration runtimes as recoverable control-plane assets and separately design for self-hosted integration runtime (SHIR) redundancy if those pipelines depend on SHIR, since the integration runtime can become the hidden single point of failure. The failover sequence should be rehearsed end to end: I stop or freeze primary runs, validate data consistency, fail over the data platform, rebind the orchestration layer, and then resume scheduled jobs only after confirming downstream dependencies and checkpoint state.
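That ordered sequence can be captured as a tiny runbook driver that refuses to run a later step if an earlier one failed. The step names below are hypothetical stand-ins for the real operations (pausing Databricks jobs, validating replicated data, rebinding ADF triggers, and so on), and the `executor` callable is where real automation would plug in.

```python
# Ordered failover runbook; comments describe the intent of each step.
FAILOVER_STEPS = [
    "freeze_primary_runs",        # stop/pause scheduled jobs in the primary
    "validate_data_consistency",  # confirm the replicated data is usable
    "failover_data_platform",     # promote the secondary data layer
    "rebind_orchestration",       # point ADF/Databricks at secondary endpoints
    "resume_scheduled_jobs",      # restart schedules once dependencies are green
]

def run_failover(executor):
    """Execute steps strictly in order; stop at the first failure."""
    completed = []
    for step in FAILOVER_STEPS:
        if not executor(step):
            break
        completed.append(step)
    return completed
```

The value of encoding the sequence, even this crudely, is that a rehearsal and a real event exercise the same ordering logic.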

For the third stamp, where heavy Kubernetes workloads depend on MySQL, Cosmos DB, or PostgreSQL plus Airflow, I think in layers: cluster recovery, workload redeployment, workflow state, and database continuity. Azure recommends an active-passive pattern for AKS disaster recovery: deploy two identical clusters in two regions and protect node pools with availability zones within each region, because cluster-local HA does not substitute for regional DR. I also need backup-and-restore discipline for cluster state and namespaces, with Azure Backup for AKS or equivalent tooling providing recoverable manifests, persistent volume data, and application hooks where needed; cross-region restore is operationally more complex than same-region restore, so my clients’ recovery objectives should reflect the restore time, not just the existence of backups. For the backend database, Cosmos DB is strongest if I configure multi-region distribution and automatic failover, because Microsoft’s documentation describes high availability and turnkey DR for multi-region accounts. PostgreSQL flexible server can use geo-restore or cross-region read replicas, with failover behavior and RPO depending on the selected configuration, while MySQL should be handled with its own BCDR pattern and automated backups or a replication design appropriate to the service tier. Airflow itself should not be treated as an afterthought: the scheduler, metadata database, DAG definitions, and any XCom or queue dependencies must be recoverable as code and data, and I rehearse how the scheduler is restarted only after the database and storage backends are consistent and reachable.
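That last rehearsal point, restarting the Airflow scheduler only once its backends are healthy, can be expressed as a readiness gate. This is a sketch under assumptions: the dependency names are illustrative, and real checks would query the metadata DB, the DAG store, and the result backend in the secondary region.

```python
def can_restart_scheduler(checks):
    """Gate the Airflow scheduler restart on backend readiness.

    `checks` maps a dependency name to a zero-argument callable that
    returns True when that backend is consistent and reachable.
    Returns (ready, blocked) where `blocked` lists failing dependencies.
    """
    blocked = [name for name, check in checks.items() if not check()]
    return (len(blocked) == 0, blocked)
```

Surfacing the blocked list matters operationally: during a failover drill it tells the operator exactly which layer is holding up the restart.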

For the fourth stamp, where the environment is GenAI-heavy with Databricks, Langfuse monitoring, and remote model calls using the OpenAI chat-style API, continuity depends on both platform resilience and external dependency management. Databricks DR guidance still applies here, but I also need to account for the fact that model calls may be routed to a remote service that is outside my Azure region strategy, so the application must be resilient to transient model endpoint failures, rate limits, and regional unavailability through retries, fallback models, circuit breakers, and queue-based buffering. Langfuse telemetry, prompt logs, and trace data should be shipped to resilient storage or a secondary observability plane so that I do not lose auditability during failover, because post-incident reconstruction is especially important in GenAI systems where prompt versions, tools, and output traces materially affect behavior. In a high-security design, I keep secrets in managed key stores, isolate outbound access, restrict model endpoints to approved egress paths, and ensure the secondary region can re-establish the same network posture, identity bindings, and policy controls before any production workload is re-enabled. If the model provider is unavailable, the application should degrade gracefully rather than fail catastrophically, for example by switching to cached responses, a smaller fallback model, or a read-only mode for non-critical workflows, and my client’s DR test plan should specifically validate those behavioral fallbacks rather than only infrastructure recovery.
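The retry-then-fall-back behavior described above can be sketched as a provider chain. This is a minimal illustration, not a production client: the provider names are hypothetical, and a real implementation would add circuit-breaker state and distinguish retryable errors (timeouts, 429s) from permanent ones.

```python
import time

def call_with_fallback(prompt, providers, retries=2, backoff=0.0):
    """Try each provider in order, retrying transient failures before
    falling back to the next one.

    `providers` is an ordered list of (name, callable) pairs, e.g. the
    primary chat-completion endpoint first, then a smaller fallback model,
    then a cached-response store.
    """
    errors = {}
    for name, call in providers:
        for attempt in range(retries + 1):
            try:
                return name, call(prompt)
            except Exception as exc:
                errors[name] = str(exc)
                if backoff:
                    time.sleep(backoff * (2 ** attempt))  # exponential backoff
    raise RuntimeError(f"all providers failed: {errors}")
```

The returned provider name is worth logging to Langfuse-style traces, so post-incident analysis can see which fallback actually served each request.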

 

Saturday, April 4, 2026

 In drone-based video sensing, the captured image stream can be understood as a temporally ordered sequence of highly correlated visual frames, where consecutive frames differ only incrementally due to the drone’s smooth motion and relatively stable environment. This continuity induces substantial redundancy, making it computationally advantageous to model frame progression in a formal, automata-theoretic framework. By conceptualizing frames as symbols in a string, the video stream can be treated analogously to a sequence of characters subjected to pattern recognition techniques such as the Knuth–Morris–Pratt (KMP) algorithm. In KMP, the presence of repeating substrings enables efficient pattern matching through the construction of partial match tables that avoid redundant computations. Similarly, in video data, repeated or near-identical frames may be interpreted as recurring “symbols” within an input sequence, suggesting a structural parallel between image repetition and substring recurrence.
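The partial-match table that KMP builds over repeating substrings can be shown concretely; applied to the frame analogy, the "pattern" would be a string of frame labels rather than characters. This is the standard failure-function construction, included here only to make the structural parallel tangible.

```python
def kmp_failure(pattern):
    """Partial-match (failure) table for KMP: fail[i] is the length of the
    longest proper prefix of pattern[:i+1] that is also its suffix."""
    fail = [0] * len(pattern)
    k = 0  # length of the current matched prefix
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]  # fall back to the next shorter border
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail
```

For a label sequence such as "aabaa" (two repeated visual contexts), the table [0, 1, 0, 1, 2] records exactly the recurrence structure that lets matching skip redundant comparisons.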

An automaton defined over this sequence of frames can function as a state machine capturing the evolution of visual contexts during the drone’s flight. Each state in the automaton corresponds to a distinct visual configuration or stationary context, while transitions between states are triggered by detectable deviations in the input data, such as changes in color distribution, object presence, or spatial structure. Thus, the automaton abstracts the continuous video feed into a discrete set of states and transitions, effectively summarizing the perceptual variation encountered during the observation period.

The utility of this model lies in its ability to produce a compact representation of the entire flight. Rather than retaining every frame, which largely encodes redundant information, the automaton emphasizes transition points—moments when the state sequence changes—thereby isolating salient frames corresponding to significant environmental or positional changes. This process induces a “signature” of the flight, a compressed temporal trace that preserves the structural pattern of observed changes while discarding repetitive content.
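The signature-extraction idea above can be sketched as a one-pass automaton: a frame triggers a state transition (and is kept) only when it deviates sufficiently from the current state's representative frame. This is a toy model; the default distance treats frames as numbers, whereas a real system would use a perceptual distance over color histograms or features.

```python
def flight_signature(frames, threshold=10.0, dist=None):
    """Compress a frame stream into a signature of transition points.

    Keeps only frames where the automaton changes state, i.e. where the
    distance to the current state's representative frame exceeds the
    threshold. Returns a list of (index, frame) pairs.
    """
    dist = dist or (lambda a, b: abs(a - b))
    signature = []
    current = None  # representative frame of the current state
    for idx, frame in enumerate(frames):
        if current is None or dist(frame, current) > threshold:
            current = frame  # transition: adopt a new state representative
            signature.append((idx, frame))
    return signature
```

On a stream of near-identical frames punctuated by scene changes, the output length is proportional to the number of distinct visual contexts, not the number of frames, which is precisely the compression the automaton view promises.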

From a computational perspective, the method provides both efficiency and interpretability. It reduces temporal redundancy by formalizing similarity relations among frames and yields a mathematically grounded representation suitable for downstream tasks such as indexing, retrieval, or anomaly detection. The resulting automaton-based abstraction thus serves as a formal mechanism for encoding, analyzing, and interpreting dynamic visual data, capturing the essential structure of the drone’s perceptual experience through the lens of automata theory and pattern matching.


Friday, April 3, 2026

This is a summary of the book “Bulletproof Your Marketplace: Strategies for Protecting Your Digital Platform” by Jeremy H. Gottschalk, published by Forbes Books in 2025. The book is written for the new generation of marketplace builders—founders who can spin up a platform quickly but may not realize how many legal, operational, and reputational risks are baked into “just connecting buyers and sellers.” Gottschalk, an attorney and longtime advisor to digital platforms, argues that a marketplace’s true durability depends less on its interface and growth metrics than on how early it treats governance, security, and accountability as core product decisions rather than after-the-fact fixes.

Gottschalk opens with a simple warning: in a public marketplace—physical or digital—conflict is not a remote possibility but an eventual certainty. Online platforms now function as gathering places as surely as the town markets of earlier centuries, except their scale is global and the pace is instantaneous. With hundreds of millions of Americans shopping online and billions of people worldwide participating in digital commerce, even a small platform can find itself hosting disputes between users, facing coordinated fraud, or responding to a data breach. As he puts it, “It’s just a matter of time before something avoidably bad happens, whether that’s an incident between users, nefarious actors infiltrating your community, a data breach, or something worse.” For founders dazzled by speed-to-market tools and low overhead, the message is clear: your risk posture must mature as fast as your user base does. 

The book explains how US law both protects and constrains digital platform operators. Gottschalk highlights Section 230 of the Communications Decency Act of 1996 as the foundational shield that allowed internet businesses to flourish. Before Section 230, courts wavered on whether an online service should be treated like a bookstore (generally not liable for what others say) or like a publisher (potentially liable for every statement it distributes). Section 230 resolved much of that uncertainty by broadly limiting a platform’s liability for user-generated content. Gottschalk illustrates how this protection has repeatedly kept marketplaces out of the blast radius of their users’ speech and conduct—whether the dispute involves defamatory posts, negative reviews, or allegations that a platform facilitated unlawful behavior. Yet he also emphasizes that the existence of a legal shield is not the same as having a “free pass.” Litigation is expensive even when you win, and the reputational costs of being associated with harmful conduct can be more damaging than the court’s final ruling. 

Where founders get into trouble, Gottschalk notes, is when they forget that Section 230 does not excuse what the business itself creates or materially shapes. Courts have been willing to treat a platform as a content “developer” when it fabricates profiles, makes specific promises, or forces users into structured disclosures that cross legal lines. He points to cases where platforms still ended up in court because an employee’s assurance became an enforceable contract, or because the platform allegedly knew about illegal activity and failed to act. Over time, lawmakers have also carved out exceptions—most notably in areas such as sex trafficking—shrinking the space where a platform can assume immunity. The practical lesson is sober: “Your case can be legally solid as a rock, but that doesn’t mean you’ll walk away unscathed.” 

From there, the book turns to one of the most underused tools in a marketplace founder’s toolkit: the terms of use. Users rarely read them, and many operators treat them as generic boilerplate, but Gottschalk frames them as a form of operational insurance—an enforceable contract that can reduce exposure where statutory protections end. He cautions against copying and pasting terms from unrelated companies, since irrelevant provisions can create confusion and conflict with how the product actually works. He also warns founders not to let marketing claims outrun the contract: hype can be persuasive, but overpromising becomes dangerous when it collides with what the terms actually guarantee. 

In Gottschalk’s view, strong terms of use do three things well. First, they set boundaries—limitations of liability that define what the company is (and is not) responsible for when transactions go wrong. Second, they establish process through dispute-resolution language: where claims must be brought, what law governs, and whether disputes go to court or arbitration. He lays out the tradeoffs plainly. Courts provide predictability because precedent constrains outcomes, while arbitration can be faster and private, but also binding, difficult to appeal, and sometimes surprisingly expensive as fees accumulate. Third, terms can discourage “litigation by volume” with provisions such as class action waivers. Even if such clauses may be challenged, he argues that including them is often a sensible layer of protection. 

Just as important, Gottschalk urges founders to plan for change. Marketplaces evolve quickly—new features, new policies, new jurisdictions—and the contract needs to keep up. That means reserving the right to amend terms, but also giving users clear notice when changes occur and capturing affirmative assent in a way a court will respect. He explains why “browsewrap” terms that merely sit behind a link tend to be least enforceable, while sign-in or click-through approaches create a clearer record that the user knowingly agreed. His warning is blunt: “Your terms of use may not be enforceable if a court deems that your users did not have sufficient notice of them or take affirmative actions to manifest their assent to them.” 

From contracts, the book moves into privacy and data practices—another area where many marketplaces stumble by treating compliance as a checkbox instead of a trust-building promise. Platforms often collect sensitive information such as names, ages, addresses, or payment details to enable transactions and personalize experiences. But Gottschalk stresses that the era of invisible collection is over. High-profile scandals, including the Cambridge Analytica episode involving tens of millions of Facebook users, changed consumer expectations and triggered regulatory action. He notes that while the United States still lacks a single comprehensive federal privacy law, states (including California) have enacted significant requirements, and a growing number of jurisdictions now impose obligations on how data is collected, used, and disclosed. Founders, he argues, should aim to meet the strictest standards they are likely to face rather than racing to the minimum, because regulation tends to expand, not shrink. 

One nuance he calls out can surprise founders: the moment a privacy policy is turned into something users must “agree” to, it may start functioning like a contract rather than a simple disclosure. As he writes, “The minute you fold your data privacy policy into your terms of use, or you require your users to agree to your privacy policy, you’ve morphed them into a binding contract.” For that reason, clarity matters. A strong privacy policy should plainly state what information is collected, why it is needed, how long it is retained, and what safeguards protect it. It should also tell users how to contact the business, how complaints are handled, and what enforcement mechanisms back the company’s stated commitments. 

All of that feeds into the theme Gottschalk returns to repeatedly: trust and safety is not a “later” problem. Data breaches at household-name companies—Yahoo’s multi-billion-account breach and the Equifax incident affecting over a hundred million consumers—demonstrate that the fallout can include lawsuits, regulatory fines, and long-term reputational damage. His prevention advice starts with restraint: collect and store the minimum information required to operate the marketplace. In his words, “If you don’t keep [data], you can’t lose it. If you don’t have it, bad actors can’t access it if (and when) they hack into your system.” From there, he advocates for practical baselines: know who your users are, authenticate identities to reduce bots and impersonation, implement content moderation appropriate to the community, and invest in fraud detection that balances effective screening with a smooth user experience. 

Finally, Gottschalk emphasizes preparedness for the day prevention fails. When something goes wrong—a user harmed by another user, a fraud ring exploiting onboarding gaps, a breach exposing personal information—the first signals may be a customer-service ticket, a public review, or a social media post. Sometimes the first contact comes from law enforcement, a journalist, or a lawyer’s demand letter. He advises companies to respond quickly, communicate with humility, and avoid reflexive defensiveness; where service failures occur, an appropriate expression of contrition can reduce escalation. He notes that most people with grievances will complain directly to support channels or publicly online rather than contacting the media, which gives a platform an opportunity to address issues before they spiral. He also recommends early engagement with insurers: notify carriers promptly when incidents occur and ensure coverage matches the marketplace’s actual risk profile, since underwriters can tailor policies only if founders clearly explain how the platform operates. 


Today’s software helps entrepreneurs launch their own new marketplaces without investing in expensive offices or other facilities. Online marketplaces can facilitate introductions and transactions among users, with the entrepreneur collecting a subscription fee, a sales commission, or both. New specialists keep entering the market while traditional vendors continue to enhance their digital and online capabilities. 

The primary legislation that shields marketplaces from liability in the United States is Section 230 of the Communications Decency Act of 1996. Prior to this legislation, companies had serious concerns about their legal liability for online content. For example, the platform CompuServe once hosted forums where people could express their opinions. In the early 1990s, a publication posted comments there about a rival who subsequently sued for defamation. A district court ruled against the plaintiff, comparing CompuServe to a bookstore that isn’t responsible for the content of the books on display. 

Taken together, Bulletproof Your Marketplace reads less like abstract legal theory and more like a founder’s field guide to building platforms that can survive success. Gottschalk’s central narrative is that marketplaces don’t fail only because of weak demand or poor product design; they can fail because the operator underestimated liability, treated policies as boilerplate, collected too much data without a clear rationale, or waited too long to invest in trust and safety. His background as the founder and CEO of Marketplace Risk—and as former general counsel for the caregiving marketplace Sittercity—shows in the book’s consistent focus on practical risk tradeoffs: what you must do, what you should do, and what you can’t afford to ignore if you want users, investors, and regulators to trust the platform you’re building. 

Thursday, April 2, 2026

The following is sample code for getting custom insights into GitHub issues opened against a repository on a periodic basis:

#!/usr/bin/env python3

import os, requests, json, datetime, re 

REPO = os.environ["REPO"] 

TOKEN = os.environ["GH_TOKEN"] 

WINDOW_DAYS = int(os.environ.get("WINDOW_DAYS","7")) 

HEADERS = {"Authorization": f"Bearer {TOKEN}", "Accept": "application/vnd.github+json, application/vnd.github.mockingbird-preview+json", "X-GitHub-Api-Version": "2026-03-10"} 

since = (datetime.datetime.utcnow() - datetime.timedelta(days=WINDOW_DAYS)).isoformat() + "Z" 

 

# ---- Helpers ---- 

def gh_get(url, params=None, ignore_status_codes=None):

  r = requests.get(url, headers=HEADERS, params=params)

  if ignore_status_codes is not None:

    if isinstance(ignore_status_codes, int):

      ignore_status_codes = {ignore_status_codes}

    else:

      ignore_status_codes = set(ignore_status_codes)

    if r.status_code in ignore_status_codes:

      return None

  r.raise_for_status()

  return r.json()

 

def gh_get_text(url): 

  r = requests.get(url, headers=HEADERS) 

  r.raise_for_status() 

  return r.text 

 

issues_url = f"https://api.github.com/repos/{REPO}/issues" 

params = {"state":"closed","since":since,"per_page":100} 

items = gh_get(issues_url, params=params) 

 

issues = [] 

for i in items: 

  if "pull_request" in i: 

    continue 

  comments = gh_get(i["comments_url"], params={"per_page":100}) 

  pr_urls = set() 

  for c in comments: 

    body = c.get("body","") or "" 

    for m in re.findall(r"https://github\.com/[^/\s]+/[^/\s]+/pull/\d+", body): 

      pr_urls.add(m) 

    for m in re.findall(r"(?:^|\s)#(\d+)\b", body): 

      pr_urls.add(f"https://github.com/{REPO}/pull/{m}") 

  issues.append({ 

    "number": i["number"], 

    "title": i.get("title",""), 

    "user": i.get("user",{}).get("login",""), 

    "created_at": i.get("created_at"), 

    "closed_at": i.get("closed_at"), 

    "html_url": i.get("html_url"), 

    "comments": [{"id":c.get("id"), "body":c.get("body",""), "created_at":c.get("created_at")} for c in comments], 

    "pr_urls": sorted(pr_urls) 

  }) 

 

with open("issues.json","w") as f: 

  json.dump(issues, f, indent=2) 

print(f"WROTE_ISSUES={len(issues)}") 

 

import os, requests, datetime, pandas as pd 

 

REPO = os.environ["REPO"] 

TOKEN = os.environ["GH_TOKEN"] 

WINDOW_DAYS = int(os.environ.get("WINDOW_DAYS", "7")) 

 

headers = { 

  "Authorization": f"Bearer {TOKEN}", 

  "Accept": "application/vnd.github+json", 

} 

 

since = (datetime.datetime.utcnow() - datetime.timedelta(days=WINDOW_DAYS)).isoformat() + "Z" 

url = f"https://api.github.com/repos/{REPO}/issues" 

 

def fetch(state): 

  items = [] 

  page = 1 

  while True: 

    r = requests.get( 

      url, 

      headers=headers, 

      params={"state": state, "since": since, "per_page": 100, "page": page}, 

    ) 

    r.raise_for_status() 

    batch = [i for i in r.json() if "pull_request" not in i] 

    if not batch: 

      break 

    items.extend(batch) 

    if len(batch) < 100: 

      break 

    page += 1 

  return items 

 

opened = fetch("open") 

closed = fetch("closed") 

 

df = pd.DataFrame( 

  [ 

    {"metric": "opened", "count": len(opened)}, 

    {"metric": "closed", "count": len(closed)}, 

  ] 

) 

 

df.to_csv("issue_activity.csv", index=False) 

print(df) 

 

import os, re, json, datetime, requests 

import hcl2 

import pandas as pd 

 

REPO = os.environ["GITHUB_REPOSITORY"] 

GH_TOKEN = os.environ["GH_TOKEN"] 

HEADERS = {"Authorization": f"Bearer {GH_TOKEN}", "Accept": "application/vnd.github+json, application/vnd.github.mockingbird-preview+json", "X-GitHub-Api-Version": "2026-03-10"} 

 

# ---- Time window (last 7 days) ---- 

since = (datetime.datetime.utcnow() - datetime.timedelta(days=7)).isoformat() + "Z" 

 

# ---- Helpers ---- 

def list_closed_issues(): 

  # Issues API returns both issues and PRs; filter out PRs. 

  url = f"https://api.github.com/repos/{REPO}/issues" 

  items = gh_get(url, params={"state":"closed","since":since,"per_page":100}) 

  return [i for i in items if "pull_request" not in i] 

 

PR_HTML_URL_RE = re.compile( 

    r"https?://github\.com/(?P<owner>[^/\s]+)/(?P<repo>[^/\s]+)/pull/(?P<num>\d+)", 

    re.IGNORECASE, 

) 

PR_API_URL_RE = re.compile( 

    r"https?://api\.github\.com/repos/(?P<owner>[^/\s]+)/(?P<repo>[^/\s]+)/pulls/(?P<num>\d+)", 

    re.IGNORECASE, 

) 

 

# Shorthand references that might appear in text: 

#   - #123  (assumed to be same repo) 

#   - owner/repo#123 (explicit cross-repo) 

SHORTHAND_SAME_REPO_RE = re.compile(r"(?<!\w)#(?P<num>\d+)\b") 

SHORTHAND_CROSS_REPO_RE = re.compile( 

    r"(?P<owner>[A-Za-z0-9_.-]+)/(?P<repo>[A-Za-z0-9_.-]+)#(?P<num>\d+)\b" 

) 

 

def _normalize_html_pr_url(owner: str, repo: str, num: int) -> str: 

    return f"https://github.com/{owner}/{repo}/pull/{int(num)}" 

 

def _collect_from_text(text: str, default_owner: str, default_repo: str) -> set: 

    """Extract candidate PR URLs from free text (body/comments/events text).""" 

    found = set() 

    if not text: 

        return found 

  

    # 1) Direct HTML PR URLs 

    for m in PR_HTML_URL_RE.finditer(text): 

        found.add(_normalize_html_pr_url(m.group("owner"), m.group("repo"), m.group("num"))) 

 

    # 2) API PR URLs -> convert to HTML 

    for m in PR_API_URL_RE.finditer(text): 

        found.add(_normalize_html_pr_url(m.group("owner"), m.group("repo"), m.group("num"))) 

 

    # 3) Cross-repo shorthand: owner/repo#123 (we will treat it as PR URL candidate) 

    for m in SHORTHAND_CROSS_REPO_RE.finditer(text): 

        found.add(_normalize_html_pr_url(m.group("owner"), m.group("repo"), m.group("num"))) 

 

    # 4) Same-repo shorthand: #123 

    for m in SHORTHAND_SAME_REPO_RE.finditer(text): 

        found.add(_normalize_html_pr_url(default_owner, default_repo, m.group("num")))

 

    return found 

 

def _paginate_gh_get(url, headers=None, per_page=100): 

    """Generator: fetch all pages until fewer than per_page are returned.""" 

    page = 1 

    while True: 

        data = gh_get(url, params={"per_page": per_page, "page": page}) 

        if not isinstance(data, list) or len(data) == 0: 

            break 

        for item in data: 

            yield item 

        if len(data) < per_page: 

            break 

        page += 1 

 

def extract_pr_urls_from_issue(issue_number: int): 

    """ 

    Extract PR URLs associated with an issue by scanning: 

      - Issue body 

      - Issue comments 

      - Issue events (including 'mentioned', 'cross-referenced', etc.) 

      - Issue timeline (most reliable for cross references) 

 

    Returns a sorted list of unique, normalized HTML PR URLs. 

    Requires: 

      - REPO = "owner/repo" 

      - gh_get(url, params=None, headers=None) is available 

    """ 

    owner, repo = REPO.split("/", 1) 

    pr_urls = set() 

 

    # Baseline Accept header for REST v3 + timeline support. 

    # The timeline historically required a preview header. Keep both for compatibility. 

    base_headers = { 

        "Accept": "application/vnd.github+json, application/vnd.github.mockingbird-preview+json" 

    } 

 

    # 1) Issue body 

    issue_url = f"https://api.github.com/repos/{REPO}/issues/{issue_number}" 

    issue = gh_get(issue_url) 

    if isinstance(issue, dict): 

        body = issue.get("body") or "" 

        pr_urls |= _collect_from_text(body, owner, repo) 

 

        # If this issue IS itself a PR (when called with a PR number), make sure we don't add itself erroneously 

        # We won't add unless text contains it anyway; still fine. 

 

    # 2) All comments 

    comments_url = f"https://api.github.com/repos/{REPO}/issues/{issue_number}/comments" 

    for c in _paginate_gh_get(comments_url): 

        body = c.get("body") or "" 

        pr_urls |= _collect_from_text(body, owner, repo) 

 

    # 3) Issue events (event stream can have 'mentioned', 'cross-referenced', etc.) 

    events_url = f"https://api.github.com/repos/{REPO}/issues/{issue_number}/events" 

    for ev in _paginate_gh_get(events_url): 

        # (a) Free-text fields: some events carry body/commit messages, etc. 

        if isinstance(ev, dict):

            body = ev.get("body") or "" 

            pr_urls |= _collect_from_text(body, owner, repo) 

 

            # (b) Structured cross-reference (best: 'cross-referenced' events) 

            #     If the source.issue has 'pull_request' key, it's a PR; use its html_url. 

            if ev.get("event") == "cross-referenced": 

                src = ev.get("source") or {} 

                issue_obj = src.get("issue") or {} 

                pr_obj = issue_obj.get("pull_request") or {} 

                html_url = issue_obj.get("html_url") 

                if pr_obj and html_url and "/pull/" in html_url: 

                    pr_urls.add(html_url) 

                # Fallback: If not marked but looks like a PR in URL 

                elif html_url and "/pull/" in html_url: 

                    pr_urls.add(html_url) 

 

        # (c) Also include 'mentioned' events (broadened): inspect whatever text fields exist 

        # Already covered via 'body' text extraction 

 

    # 4) Timeline API (the most complete for references) 

    timeline_url = f"https://api.github.com/repos/{REPO}/issues/{issue_number}/timeline" 

    for item in _paginate_gh_get(timeline_url): 

        if not isinstance(item, dict): 

            continue 

 

        # Free-text scan on any plausible string field 

        for key in ("body", "message", "title", "commit_message", "subject"): 

            val = item.get(key) 

            if isinstance(val, str): 

                pr_urls |= _collect_from_text(val, owner, repo) 

 

        # Structured cross-reference payloads 

        if item.get("event") == "cross-referenced": 

            src = item.get("source") or {} 

            issue_obj = src.get("issue") or {} 

            pr_obj = issue_obj.get("pull_request") or {} 

            html_url = issue_obj.get("html_url") 

            if pr_obj and html_url and "/pull/" in html_url: 

                pr_urls.add(html_url) 

            elif html_url and "/pull/" in html_url: 

                pr_urls.add(html_url) 

 

        # Some timeline items are themselves issues/PRs with html_url 

        html_url = item.get("html_url") 

        if isinstance(html_url, str) and "/pull/" in html_url: 

            pr_urls.add(html_url) 

 

        # Occasionally the timeline includes API-style URLs 

        api_url = item.get("url") 

        if isinstance(api_url, str): 

            m = PR_API_URL_RE.search(api_url) 

            if m: 

                pr_urls.add(_normalize_html_pr_url(m.group("owner"), m.group("repo"), m.group("num"))) 

 

    # Final normalization: keep only HTML PR URLs and sort 

    pr_urls = {m.group(0) for url in pr_urls for m in [PR_HTML_URL_RE.search(url)] if m} 

    return sorted(pr_urls) 

 

def pr_number_from_url(u): 

  m = re.search(r"/pull/(\d+)", u) 

  return int(m.group(1)) if m else None 

 

def list_pr_files(pr_number): 

  url = f"https://api.github.com/repos/{REPO}/pulls/{pr_number}/files" 

  files = [] 

  page = 1 

  while True: 

    batch = gh_get(url, params={"per_page":100,"page":page}, ignore_status_codes=404) 

    if not batch: 

      break 

    files.extend(batch) 

    page += 1 

  return files 

 

def get_pr_head_sha(pr_number): 

  url = f"https://api.github.com/repos/{REPO}/pulls/{pr_number}" 

  pr = gh_get(url, ignore_status_codes=404) 

  return pr["head"]["sha"] if pr else None 

 

def get_file_at_sha(path, sha): 

  # Use contents API to fetch file at a specific ref (sha). 

  url = f"https://api.github.com/repos/{REPO}/contents/{path}" 

  r = requests.get(url, headers=HEADERS, params={"ref": sha}) 

  if r.status_code == 404: 

    return None 

  r.raise_for_status() 

  data = r.json() 

  if isinstance(data, dict) and data.get("type") == "file" and data.get("download_url"): 

    return gh_get_text(data["download_url"]) 

  return None 

 

def extract_module_term_from_source(src: str) -> str | None: 

    """ 

    Given a module 'source' string, return the last path segment between the 

    final '/' and the '?' (or end of string if '?' is absent). 

    Examples: 

      git::https://...//modules/container/kubernetes-service?ref=v4.0.15 -> 'kubernetes-service' 

      ../modules/network/vnet -> 'vnet' 

      registry- or other sources with no '/' -> returns None 

    """ 

    if not isinstance(src, str) or not src: 

        return None 

    # Strip query string 

    path = src.split('?', 1)[0] 

    # For git:: URLs that include a double-slash path component ("//modules/..."), 

    # keep the right-most path component regardless of scheme. 

    # Normalize backslashes just in case. 

    path = path.replace('\\', '/') 

    # Remove trailing slashes 

    path = path.rstrip('/') 

    # Split and take last non-empty part 

    parts = [p for p in path.split('/') if p] 

    if not parts: 

        return None 

    return parts[-1] 
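
# Standalone sketch of the extraction rule above (last path segment before any
# '?' query string). `last_segment` is an illustrative local mirror of
# extract_module_term_from_source so the rule can be checked in isolation;
# the git:: URL below is a made-up example following the docstring's pattern.
def last_segment(src):
    path = src.split('?', 1)[0].replace('\\', '/').rstrip('/')
    parts = [p for p in path.split('/') if p]
    return parts[-1] if parts else None

assert last_segment(
    "git::https://host/org/repo//modules/container/kubernetes-service?ref=v4.0.15"
) == "kubernetes-service"
assert last_segment("../modules/network/vnet") == "vnet"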

 

def parse_module_terms_from_tf(tf_text): 

    """ 

    Parse HCL to find module blocks and return the set of module 'terms' 

    extracted from their 'source' attribute (last segment before '?'). 

    """ 

    terms = set() 

    try: 

        obj = hcl2.loads(tf_text) 

    except Exception: 

        return terms 

 

    mods = obj.get("module", []) 

    # module is usually list of dicts[{ "name": { "source": "...", ... }}, ...] 

    def add_src_term(src_str: str): 

        term = extract_module_term_from_source(src_str) 

        if term: 

            terms.add(term) 

 

    if isinstance(mods, list): 

        for item in mods: 

            if isinstance(item, dict): 

                for _, body in item.items(): 

                    if isinstance(body, dict): 

                        src = body.get("source") 

                        if isinstance(src, str): 

                            add_src_term(src) 

    elif isinstance(mods, dict): 

        for _, body in mods.items(): 

            if isinstance(body, dict): 

                src = body.get("source") 

                if isinstance(src, str): 

                    add_src_term(src) 

    return terms 

 

def parse_module_sources_from_tf(tf_text): 

  # Extract module "x" { source = "..." } blocks. 

  sources = set() 

  try: 

    obj = hcl2.loads(tf_text) 

  except Exception: 

    return sources 

 

  mods = obj.get("module", []) 

  # module is usually list of dicts[{ "name": { "source": "...", ... }}, ...] 

  if isinstance(mods, list): 

    for item in mods: 

      if isinstance(item, dict): 

        for _, body in item.items(): 

          if isinstance(body, dict): 

            src = body.get("source") 

            if isinstance(src, str): 

              sources.add(src) 

  elif isinstance(mods, dict): 

    for _, body in mods.items(): 

      if isinstance(body, dict): 

        src = body.get("source") 

        if isinstance(src, str): 

          sources.add(src) 

  return sources 

 

def normalize_local_module_path(source, app_dir): 

  # Only resolve local paths within repo; ignore registry/git/http sources. 

  if source.startswith("./") or source.startswith("../"): 

    # app_dir is like "workload/appA" 

    import posixpath 

    return posixpath.normpath(posixpath.join(app_dir, source)) 

  return None 
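
# Illustrative check of the resolution above, using hypothetical paths:
# a relative module source is normalized against the app directory.
import posixpath
assert posixpath.normpath(
    posixpath.join("workload/appA", "../modules/network/vnet")
) == "workload/modules/network/vnet"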

 

def list_repo_tf_files_under(dir_path, sha): 

  # Best-effort: use git (checked out main) for listing; then fetch content at sha. 

  # We only need paths; use `git ls-tree` against sha for accuracy. 

  import subprocess 

  try: 

    out = subprocess.check_output(["git","ls-tree","-r","--name-only",sha,dir_path], text=True) 

    paths = [p.strip() for p in out.splitlines() if p.strip().endswith(".tf")] 

    return paths 

  except Exception: 

    return [] 

 

def collect_module_terms_for_app(app_dir, sha): 

    """ 

    Scan all .tf in the app dir at PR head sha; extract: 

      1) module terms directly used by the app 

      2) for any local module sources, recurse one level and extract module terms defined there 

    """ 

    terms = set() 

    module_dirs = set() 

 

    tf_paths = list_repo_tf_files_under(app_dir, sha) 

    for p in tf_paths: 

        txt = get_file_at_sha(p, sha) 

        if not txt: 

            continue 

        # Collect module terms directly in the app 

        terms |= parse_module_terms_from_tf(txt) 

        # Track local modules so we can scan their contents 

        for src in parse_module_sources_from_tf(txt): 

            local = normalize_local_module_path(src, app_dir) 

            if local: 

                module_dirs.add(local) 

 

    # Scan local module dirs for additional module terms (one level deep) 

    for mdir in sorted(module_dirs): 

        m_tf_paths = list_repo_tf_files_under(mdir, sha) 

        for p in m_tf_paths: 

            txt = get_file_at_sha(p, sha) 

            if not txt: 

                continue 

            terms |= parse_module_terms_from_tf(txt) 

 

    return terms 

 

# ---- Main: issues -> PRs -> touched apps -> module terms ---- 

issues = list_closed_issues() 

 

issue_to_terms = {}  # issue_number -> set(module_terms) 

for issue in issues: 

  inum = issue["number"] 

  pr_urls = extract_pr_urls_from_issue(inum) 

  pr_numbers = sorted({pr_number_from_url(u) for u in pr_urls if pr_number_from_url(u)}) 

 

  if not pr_numbers: 

    continue 

 

  terms_for_issue = set() 

 

  for prn in pr_numbers: 

    sha = get_pr_head_sha(prn) 

    files = list_pr_files(prn) 

    if not sha or not files: 

        continue 

    # Identify which workload apps are touched by this PR. 

    # Requirement: multiple application folders within "workload/". 

    touched_apps = set() 

    for f in files: 

      path = f.get("filename","") 

      if not path.startswith("workload/"): 

        continue 

      parts = path.split("/") 

      if len(parts) >= 2: 

        touched_apps.add("/".join(parts[:2]))  # workload/<app> 

 

    # For each touched app, compute module terms by scanning app + local modules. 

    for app_dir in sorted(touched_apps): 

      terms_for_issue |= collect_module_terms_for_app(app_dir, sha) 

 

  if terms_for_issue: 

    issue_to_terms[inum] = sorted(terms_for_issue) 

 

# Build severity distribution: "severity" = number of issues touching each module term. 

rows = [] 

for inum, terms in issue_to_terms.items(): 

  for t in set(terms): 

    rows.append({"issue": inum, "module_term": t}) 

print(f"rows={len(rows)}") 

 

df = pd.DataFrame(rows) 

df.to_csv("severity_data.csv", index=False) 

 

# Also write a compact JSON for debugging/audit. 

with open("issue_to_module_terms.json","w") as f: 

  json.dump(issue_to_terms, f, indent=2, sort_keys=True) 

 

print(f"Closed issues considered: {len(issues)}") 

print(f"Issues with PR-linked module impact: {len(issue_to_terms)}") 

 

import os, json, re, requests, subprocess 

import hcl2 

REPO = os.environ["REPO"] 

TOKEN = os.environ["GH_TOKEN"] 

HEADERS = {"Authorization": f"Bearer {TOKEN}", "Accept": "application/vnd.github+json, application/vnd.github.mockingbird-preview+json", "X-GitHub-Api-Version": "2026-03-10"} 

 

with open("issues.json") as f: 

  issues = json.load(f) 

 

issue_to_terms = {} 

issue_turnaround = {} 

module_deps = {}  # app_dir -> set(module paths it references) 

 

for issue in issues: 

  inum = issue["number"] 

  created = issue.get("created_at") 

  closed = issue.get("closed_at") 

  if created and closed: 

    from datetime import datetime 

    fmt = "%Y-%m-%dT%H:%M:%SZ" 

    try: 

      dt_created = datetime.strptime(created, fmt) 

      dt_closed = datetime.strptime(closed, fmt) 

      delta_days = (dt_closed - dt_created).total_seconds() / 86400.0 

    except Exception: 

      delta_days = None 

  else: 

    delta_days = None 

  issue_turnaround[inum] = delta_days 

 

  pr_urls = issue.get("pr_urls",[]) 

  pr_numbers = sorted({pr_number_from_url(u) for u in pr_urls if pr_number_from_url(u)}) 

  terms_for_issue = set() 

  for prn in pr_numbers: 

    sha = get_pr_head_sha(prn) 

    files = list_pr_files(prn) 

    touched_apps = set() 

    for f in files: 

      path = f.get("filename","") 

      if path.startswith("workload/"): 

        parts = path.split("/") 

        if len(parts) >= 2: 

          touched_apps.add("/".join(parts[:2])) 

    for app_dir in sorted(touched_apps): 

      terms_for_issue |= collect_module_terms_for_app(app_dir, sha) 

      # collect module sources for dependency graph 

      # scan app tf files for module sources at PR head 

      tf_paths = list_repo_tf_files_under(app_dir, sha) 

      for p in tf_paths: 

        txt = get_file_at_sha(p, sha) 

        if not txt: 

          continue 

        for src in parse_module_sources_from_tf(txt): 

          local = normalize_local_module_path(src, app_dir) 

          if local: 

            module_deps.setdefault(app_dir, set()).add(local) 

  if terms_for_issue: 

    issue_to_terms[inum] = sorted(terms_for_issue) 

 

rows = [] 

for inum, terms in issue_to_terms.items(): 

  for t in set(terms): 

    rows.append({"issue": inum, "module_term": t}) 

import pandas as pd 

df = pd.DataFrame(rows) 

df.to_csv("severity_data.csv", index=False) 

 

ta_rows = [] 

for inum, days in issue_turnaround.items(): 

  ta_rows.append({"issue": inum, "turnaround_days": days}) 

pd.DataFrame(ta_rows).to_csv("turnaround.csv", index=False) 

 

with open("issue_to_module_terms.json","w") as f: 

  json.dump(issue_to_terms, f, indent=2) 

with open("issue_turnaround.json","w") as f: 

  json.dump(issue_turnaround, f, indent=2) 

with open("module_deps.json","w") as f: 

  json.dump({k: sorted(list(v)) for k,v in module_deps.items()}, f, indent=2) 

 

print(f"ISSUES_WITH_TYPES={len(issue_to_terms)}") 

 

import os, json, datetime, glob 

import pandas as pd 

import matplotlib.pyplot as plt 

import seaborn as sns 

import networkx as nx 

 

ts = datetime.datetime.utcnow().strftime("%Y%m%d-%H%M%S") 

os.makedirs("history", exist_ok=True) 

 

# --- Severity bar (existing) --- 

if os.path.exists("severity_data.csv"): 

  df = pd.DataFrame(columns=["issue", "module_term"]) 

  try: 

      df = pd.read_csv("severity_data.csv") 

  except Exception: 

     pass 

  counts = df.groupby("module_term")["issue"].nunique().sort_values(ascending=False) 

else: 

  counts = pd.Series(dtype=int) 

 

png_sev = f"history/severity-by-module-{ts}.png" 

plt.figure(figsize=(12,6)) 

if not counts.empty: 

  counts.plot(kind="bar") 

  plt.title("Issue frequency by module term") 

  plt.xlabel("module_term") 

  plt.ylabel("number of closed issues touching module term") 

else: 

  plt.text(0.5, 0.5, "No module-impacting issues in window", ha="center", va="center") 

  plt.axis("off") 

plt.tight_layout() 

plt.savefig(png_sev) 

plt.clf() 

 

# --- Heatmap: module_term x issue (binary or counts) --- 

heat_png = f"history/heatmap-module-issues-{ts}.png" 

 

if os.path.exists("severity_data.csv"): 

  mat = pd.DataFrame(columns=["issue", "module_term"]) 

  try: 

      mat = pd.read_csv("severity_data.csv") 

  except Exception: 

     pass   

  if not mat.empty: 

    pivot = mat.pivot_table(index="module_term", columns="issue", aggfunc='size', fill_value=0) 

    # Optionally cluster or sort by total counts 

    pivot['total'] = pivot.sum(axis=1) 

    pivot = pivot.sort_values('total', ascending=False).drop(columns=['total']) 

    # limit columns for readability (most recent/top issues) 

    if pivot.shape[1] > 100: 

      pivot = pivot.iloc[:, :100] 

    plt.figure(figsize=(14, max(6, 0.2 * pivot.shape[0]))) 

    sns.heatmap(pivot, cmap="YlOrRd", cbar=True) 

    plt.title("Heatmap: module terms (rows) vs issues (columns)") 

    plt.xlabel("Issue number (truncated)") 

    plt.ylabel("module terms") 

    plt.tight_layout() 

    plt.savefig(heat_png) 

    plt.clf() 

  else: 

    plt.figure(figsize=(6,2)) 

    plt.text(0.5,0.5,"No data for heatmap",ha="center",va="center") 

    plt.axis("off") 

    plt.savefig(heat_png) 

    plt.clf() 

else: 

  plt.figure(figsize=(6,2)) 

  plt.text(0.5,0.5,"No data for heatmap",ha="center",va="center") 

  plt.axis("off") 

  plt.savefig(heat_png) 

  plt.clf() 

 

# --- Trend lines: aggregate historical severity_data.csv files in history/ --- 

trend_png = f"history/trendlines-module-{ts}.png" 

# Collect historical CSVs that match the severity_data pattern 

hist_files = sorted(glob.glob("history/*severity-data-*.csv") + glob.glob("history/*severity_data.csv") + glob.glob("history/*severity-by-module-*.csv")) 

# Also include the current run's severity_data.csv 

if os.path.exists("severity_data.csv"): 

  hist_files.append("severity_data.csv") 

# Build weekly counts per module terms by deriving timestamp from filenames where possible 

trend_df = pd.DataFrame() 

for f in hist_files: 

  try: 

    # attempt to extract timestamp from filename 

    import re 

    m = re.search(r"(\d{8}-\d{6})", f) 

    if m: 

      ts_label = m.group(1) 

    else: 

      # Fall back to the file's mtime, formatted like the filename timestamps. 

      from datetime import datetime 

      ts_label = datetime.utcfromtimestamp(os.path.getmtime(f)).strftime("%Y%m%d-%H%M%S") 

    tmp = pd.DataFrame(columns=["issue", "module_term"]) 

    try: 

        tmp = pd.read_csv(f) 

    except Exception: 

       pass 

    if tmp.empty: 

        continue   

    counts_tmp = tmp.groupby("module_term")["issue"].nunique().rename(ts_label) 

    trend_df = pd.concat([trend_df, counts_tmp], axis=1) 

  except Exception: 

    continue 

if not trend_df.empty: 

  trend_df = trend_df.fillna(0).T 

  # convert index to datetime where possible 

  plt.figure(figsize=(14,6)) 

  # plot top N module_terms by latest total 

  latest = trend_df.iloc[-1].sort_values(ascending=False).head(8).index.tolist() 

  for col in latest: 

    plt.plot(trend_df.index, trend_df[col], marker='o', label=col) 

  plt.legend(loc='best', fontsize='small') 

  plt.title("Trend lines: issue frequency over time for top module_terms") 

  plt.xlabel("time") 

  plt.ylabel("issue count") 

  plt.xticks(rotation=45) 

  plt.tight_layout() 

  plt.savefig(trend_png) 

  plt.clf() 

else: 

  plt.figure(figsize=(8,2)) 

  plt.text(0.5,0.5,"No historical data for trend lines",ha="center",va="center") 

  plt.axis("off") 

  plt.savefig(trend_png) 

  plt.clf() 

 

# --- Dependency graph: build directed graph from module_deps.json --- 

dep_png = f"history/dependency-graph-{ts}.png" 

if os.path.exists("module_deps.json"): 

  with open("module_deps.json") as f: 

    deps = json.load(f) 

  G = nx.DiGraph() 

  # add edges app -> module 

  for app, mods in deps.items(): 

    G.add_node(app, type='app') 

    for m in mods: 

      G.add_node(m, type='module') 

      G.add_edge(app, m) 

  if len(G.nodes) == 0: 

    plt.figure(figsize=(6,2)) 

    plt.text(0.5,0.5,"No dependency data",ha="center",va="center") 

    plt.axis("off") 

    plt.savefig(dep_png) 

    plt.clf() 

  else: 

    plt.figure(figsize=(12,8)) 

    pos = nx.spring_layout(G, k=0.5, iterations=50) 

    node_colors = ['#1f78b4' if G.nodes[n].get('type')=='app' else '#33a02c' for n in G.nodes()] 

    nx.draw_networkx_nodes(G, pos, node_size=600, node_color=node_colors) 

    nx.draw_networkx_edges(G, pos, arrows=True, arrowstyle='->', arrowsize=12, edge_color='#888888') 

    nx.draw_networkx_labels(G, pos, font_size=8) 

    plt.title("Module dependency graph (apps -> local modules)") 

    plt.axis('off') 

    plt.tight_layout() 

    plt.savefig(dep_png) 

    plt.clf() 

else: 

  plt.figure(figsize=(6,2)) 

  plt.text(0.5,0.5,"No dependency data",ha="center",va="center") 

  plt.axis("off") 

  plt.savefig(dep_png) 

  plt.clf() 

 

# --- Turnaround chart (existing) --- 

ta_png = f"history/turnaround-by-issue-{ts}.png" 

if os.path.exists("turnaround.csv"): 

  ta = pd.DataFrame(columns=["issue", "turnaround_days"]) 

  try: 

      ta = pd.read_csv("turnaround.csv") 

  except Exception: 

      pass 

  ta = ta.dropna(subset=["turnaround_days"]) 

  if not ta.empty: 

    ta_sorted = ta.sort_values("turnaround_days", ascending=False).head(50) 

    plt.figure(figsize=(12,6)) 

    plt.bar(ta_sorted["issue"].astype(str), ta_sorted["turnaround_days"]) 

    plt.xticks(rotation=90) 

    plt.title("Turnaround time (days) for closed issues in window") 

    plt.xlabel("Issue number") 

    plt.ylabel("Turnaround (days)") 

    plt.tight_layout() 

    plt.savefig(ta_png) 

    plt.clf() 

  else: 

    plt.figure(figsize=(8,2)) 

    plt.text(0.5,0.5,"No turnaround data available",ha="center",va="center") 

    plt.axis("off") 

    plt.savefig(ta_png) 

    plt.clf() 

else: 

  plt.figure(figsize=(8,2)) 

  plt.text(0.5,0.5,"No turnaround data available",ha="center",va="center") 

  plt.axis("off") 

  plt.savefig(ta_png) 

  plt.clf() 

 

# --- Issue activity charts (opened vs closed) --- 

activity_png = f"history/issue-activity-{ts}.png" 

 

if os.path.exists("issue_activity.csv"): 

    act = pd.read_csv("issue_activity.csv") 

 

    plt.figure(figsize=(6,4)) 

    plt.bar(act["metric"], act["count"], color=["#1f78b4", "#33a02c"]) 

    plt.title("GitHub issue activity in last window") 

    plt.xlabel("Issue state") 

    plt.ylabel("Count") 

    plt.tight_layout() 

    plt.savefig(activity_png) 

    plt.clf() 

else: 

    plt.figure(figsize=(6,2)) 

    plt.text(0.5, 0.5, "No issue activity data", ha="center", va="center") 

    plt.axis("off") 

    plt.savefig(activity_png) 

    plt.clf() 

 

# --- AI summary (who wants what) --- 

if os.path.exists("issues.json"): 

  with open("issues.json") as f: 

    issues = json.load(f) 

else: 

  issues = [] 

condensed = [] 

for i in issues: 

  condensed.append({ 

    "number": i.get("number"), 

    "user": i.get("user"), 

    "title": i.get("title"), 

    "html_url": i.get("html_url") 

  }) 

with open("issues_for_ai.json","w") as f: 

  json.dump(condensed, f, indent=2) 

 

# call OpenAI if key present (same approach as before) 

import subprocess, os 

OPENAI_KEY = os.environ.get("OPENAI_API_KEY") 

ai_text = "AI summary skipped (no OPENAI_API_KEY)." 

if OPENAI_KEY: 

  prompt = ("You are given a JSON array of GitHub issues with fields: number, user, title, html_url. " 

            "Produce a concise list of one-line 'who wants what' statements, one per issue, in plain text. " 

            "Format: '#<number> — <user> wants <succinct request derived from title>'. " 

            "Do not add commentary.") 

  payload = { 

    "model": "gpt-4o-mini", 

    "messages": [{"role":"system","content":"You are a concise summarizer."}, 

                 {"role":"user","content": prompt + "\\n\\nJSON:\\n" + json.dumps(condensed)[:15000]}], 

    "temperature":0.2, 

    "max_tokens":400 

  } 

  proc = subprocess.run([ 

    "curl","-sS","https://api.openai.com/v1/chat/completions", 

    "-H", "Content-Type: application/json", 

    "-H", f"Authorization: Bearer {OPENAI_KEY}", 

    "-d", json.dumps(payload) 

  ], capture_output=True, text=True) 

  if proc.returncode == 0 and proc.stdout: 

    try: 

      resp = json.loads(proc.stdout) 

      ai_text = resp["choices"][0]["message"]["content"].strip() 

    except Exception: 

      ai_text = "AI summary unavailable (parsing error)." 

 

# --- Write markdown report combining all visuals --- 

md_path = f"history/severity-report-{ts}.md" 

with open(md_path, "w") as f: 

  f.write("# Weekly Terraform module hotspot report\n\n") 

  f.write(f"**Window (days):** {os.environ.get('WINDOW_DAYS','7')}\n\n") 

  f.write("## AI Summary (who wants what)\n\n") 

  f.write("```\n") 

  f.write(ai_text + "\n") 

  f.write("```\n\n") 

  f.write("## GitHub issue activity (last window)\n\n") 

  f.write(f"![{os.path.basename(activity_png)}]" 

          f"({os.path.basename(activity_png)})\n\n") 

 

  if os.path.exists("issue_activity.csv"): 

      act = pd.read_csv("issue_activity.csv") 

      f.write(act.to_markdown(index=False) + "\n\n") 

  f.write("## Top module terms by issue frequency\n\n") 

  if not counts.empty: 

    f.write("![" + os.path.basename(png_sev) + "](" + os.path.basename(png_sev) + ")\n\n") 

    f.write(counts.head(30).to_frame("issues").to_markdown() + "\n\n") 

  else: 

    f.write("No module-impacting issues found in the selected window.\n\n") 

  f.write("## Heatmap: module terms vs issues\n\n") 

  f.write("![" + os.path.basename(heat_png) + "](" + os.path.basename(heat_png) + ")\n\n") 

  f.write("## Trend lines: historical issue frequency for top module terms\n\n") 

  f.write("![" + os.path.basename(trend_png) + "](" + os.path.basename(trend_png) + ")\n\n") 

  f.write("## Dependency graph: apps -> local modules\n\n") 

  f.write("![" + os.path.basename(dep_png) + "](" + os.path.basename(dep_png) + ")\n\n") 

  f.write("## Turnaround time for closed issues (days)\n\n") 

  f.write("![" + os.path.basename(ta_png) + "](" + os.path.basename(ta_png) + ")\n\n") 

  f.write("## Data artifacts\n\n") 

  f.write("- `severity_data.csv` — per-issue module term mapping\n") 

  f.write("- `turnaround.csv` — per-issue turnaround in days\n") 

  f.write("- `issue_to_module_terms.json` — mapping used to build charts\n") 

  f.write("- `module_deps.json` — module dependency data used for graph\n") 

 

# Save current CSVs into history with timestamp for future trend aggregation 

try: 

  import shutil 

  if os.path.exists("severity_data.csv"): 

    shutil.copy("severity_data.csv", f"history/severity-data-{ts}.csv") 

  if os.path.exists("turnaround.csv"): 

    shutil.copy("turnaround.csv", f"history/turnaround-{ts}.csv") 

except Exception: 

  pass 

 

print(f"REPORT_MD={md_path}") 

print(f"REPORT_PNG={png_sev}") 

print(f"REPORT_HEAT={heat_png}") 

print(f"REPORT_TREND={trend_png}") 

print(f"REPORT_DEP={dep_png}") 

print(f"REPORT_TA={ta_png}") 

 

import os, re 

from pathlib import Path 

 

hist = Path("history") 

hist.mkdir(exist_ok=True) 

 

# Pair md+png by timestamp in filename: severity-by-module-YYYYMMDD-HHMMSS.(md|png) 

pat = re.compile(r"^severity-by-module-(\d{8}-\d{6})\.(md|png)$") 

 

groups = {} 

for p in hist.iterdir(): 

  m = pat.match(p.name) 

  if not m: 

    continue 

  ts = m.group(1) 

  groups.setdefault(ts, []).append(p) 

 

# Keep newest 10 timestamps 

timestamps = sorted(groups.keys(), reverse=True) 

keep = set(timestamps[:10]) 

drop = [p for ts, files in groups.items() if ts not in keep for p in files] 

 

for p in drop: 

  p.unlink() 

 

print(f"Kept {len(keep)} report sets; pruned {len(drop)} files.") 

 

--- 

This produces sample output, including the various JSON and CSV files mentioned above; we list just one of them: 

    metric  count 

0  #opened      8 

1  #closed      8 
Care must be taken not to run into GitHub API rate limits. A throttled request returns an error such as: 

{"message": "API rate limit exceeded for <client-ip-address>", "documentation_url": "https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting"}
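
One way to handle this is to back off until the limit resets. Below is a minimal, illustrative sketch (not part of the script above): it reads the standard `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers that GitHub returns and computes how long to sleep before retrying. The helper name and the one-second buffer are our own choices, not part of the GitHub API.

```python
import time

def rate_limit_wait_seconds(headers, now=None):
    """Seconds to sleep before retrying, based on GitHub's
    X-RateLimit-Remaining / X-RateLimit-Reset response headers.
    Returns 0 when requests remain or the headers are absent."""
    now = time.time() if now is None else now
    try:
        remaining = int(headers.get("X-RateLimit-Remaining", "1"))
        reset_epoch = int(headers.get("X-RateLimit-Reset", "0"))
    except (TypeError, ValueError):
        return 0
    if remaining > 0:
        return 0
    # Small buffer so we do not retry exactly on the reset boundary.
    return max(0, reset_epoch - now) + 1

# Example: no requests left, reset 30 seconds in the future -> wait 31 seconds.
print(rate_limit_wait_seconds(
    {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1030"}, now=1000))
```

A caller would check the headers of each response (e.g. `r.headers` from `requests`) and sleep for the returned duration before retrying; authenticated requests also get a substantially higher quota than anonymous ones.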