{"slug": "the-unreasonable-redundancy-of-nature-s-protein-folds", "title": "The Unreasonable Redundancy of Nature's Protein Folds", "summary": "Deep neural networks have enabled generative modeling of biomolecules, with models like AlphaFold3 and Chai-2 now capable of predicting and designing drug-like molecules and antibodies. However, researchers at Ligo found that scaling structural training data by folding more natural protein sequences yields diminishing returns, as protein folds are far more redundant than the vast number of sequences suggests. This redundancy challenges the assumption that simply converting sequence databases into predicted 3D structures will provide the diverse training data needed to improve generative models for enzyme design.", "body_md": "# The Unreasonable Redundancy of Nature's Protein Folds\n\nOver the last few years, deep neural networks have made generative language modeling dramatically\nmore powerful, giving us large language models. A similar leap happened for continuous\nmodalities like images and videos. Recently, similar techniques have been applied to the generative\nmodeling of\nbiomolecules with great success. Models such as DeepMind's AlphaFold3 made it much easier to predict\nbiomolecular interactions, including drug-protein and antibody-protein complexes, and soon after people\nfigured out how to re-purpose\nthese capabilities to design drug-like molecules.\n[Chai-2](https://www.chaidiscovery.com/news/chai-2-mab),\n[Latent-X2](https://www.latentlabs.com/latent-x2/), and\n[Nabla](https://www.nabla.bio/platform) all report developable antibody\nor biologics designs.\nIn the near future, we might see most\nantibodies entering the clinic designed in large part with deep-learning-based generative models,\npotentially\nwith superior pharmaceutical properties and targeting receptors that have resisted wet-lab based approaches.\n\nHow would you improve on these systems? 
We definitely want to have better biomolecular modeling so we can put better drugs into the clinic. The recipe for improving a deep learning system has been surprisingly simple at a high level: you scale the model, scale the compute, and scale the data. LLMs are obviously improving by being scaled aggressively. AlphaFold3 was also a major effort to scale the model and data; it is trained on a broad collection of known biomolecular complexes, from experimental structures and protein-ligand complexes to the enormous sequence databases produced by genomics and metagenomics such as MGnify. Internally, DeepMind called the project \"all-PDB\" for a while, referring to all the interactions represented in the Protein Data Bank.\n\nThe key move in AlphaFold3's scaling recipe was to turn sequence scale into structure scale: use structure prediction to convert large protein sequence databases into predicted 3D structures. Genomics and metagenomics have given us billions of protein sequences, many inferred from environmental DNA collected from organisms that have never been cultured in the lab. For training structure-based design models, though, the useful object is often the 3D structure. Structure prediction models let us convert some of that sequence scale into structural data: take millions of natural sequences, predict the folds they adopt, and use those predicted structures as training examples for the next generation of biomolecular models.\n\nAt Ligo, we care about this recipe because we train generative models for designing enzymes. When we tried to scale our structural training data by folding more natural sequences, we ran into a problem: natural protein sequences are vast, but their folds are much more redundant than the sequence counts suggest. This post is about that mismatch, and about why simply folding more natural sequences may not buy as much new structural diversity as we hoped. 
We will describe data engineering tricks for clustering the known protein universe, and what our results imply about how to think about the enzyme design problem.\n\n## Modern biomolecular models rely on sequence scale\n\nModern structure prediction models rely heavily on multiple sequence alignments. A multiple sequence alignment, or MSA, lines up related versions of a protein from different organisms. When two positions in that alignment tend to change together, it can be a clue that the corresponding residues are close in 3D space or tied together by function. This is coevolution: two positions change in a coordinated way across related proteins. For example, if one position is usually negatively charged and touches a positively charged position, evolution may flip both together while avoiding pairs that would repel each other. My mental model of AlphaFold2 is that it used this kind of coevolutionary signal to constrain the rough geometry of a protein, then learned how to fill in the rest of the structure.\n\nAlphaFold3 seems to be doing something broader. Its antibody-antigen performance is especially interesting because there is no coevolutionary signal across the interface to extract clues from. Antibodies and their targets do not share an evolutionary history. To do well there, the model has to learn something about protein surfaces themselves: which shapes, chemistries, and local geometries are likely to be compatible with each other. That is a different kind of signal than residue coevolution within one protein family.\n\nThis is where MGnify-scale data may matter. Metagenomic sequence resources expose models\nto enormous numbers of natural variants, many from organisms we have never cultured. 
The\nempirical clue is that models trained with MGnify-scale protein distillation seem to separate\nmost clearly on antibody-antigen prediction, where direct coevolution cannot explain the\ninteraction signal ([Supplementary info](#supplementary-interface-benchmarks)).\nThat increased coverage of sequence space looks valuable. The question is whether it also\ncomes with comparable diversity in protein folds.\n\n## Sequence diversity is not fold diversity\n\nThe theoretical protein sequence space is absurdly large: a protein of length \\(N\\) has\n\\(20^N\\) possible amino-acid sequences. Natural proteins occupy only a tiny,\nhighly structured part of that space. Evolution tends to reuse folds that are stable,\nexpressible, and adaptable, rather than scattering proteins uniformly across every possible\nsequence and shape.\n\nThat matters for training data. When we scale predicted structures, we are not necessarily adding independent examples. We may also be adding many sequence variants of the same fold families, domain combinations, and evolutionary compromises. The example below shows the basic problem: proteins can look far apart when measured by sequence similarity, while still being very close in fold space.\n\nOne concrete example from our AFDB fragment clusters: in structural cluster\n`A0A242HMU2_f1`, three proteins are only 23.9–28.3% identical in sequence\nwhile still sharing the same fold (TM-score > 0.75).\nPairwise global identities after clipping: 28.2%, 28.3%, and 23.9%.\nLocal TM-align scores on the predicted structures are 0.768–0.813\nusing average-length normalization.\n\n| Fragment | UniProt annotation | Length | Seq. id. to rep. | TM to rep. 
|\n|---|---|---|---|---|\n| `A0A518BRX6_f1` | 3-oxoacyl-[acyl-carrier-protein] reductase FabG, Bacteria | 249 aa | 100% | 1.000 |\n| `A0A1Q3EPK1_f1` | NAD-binding protein, Lentinula edodes | 283 aa | 28.2% | 0.768 |\n| `A0A6I8MDZ6_f1` | Short-chain dehydrogenase/reductase SDR, Oceanivirga miroungae | 261 aa | 23.9% | 0.793 |\n\nAs we scale up our sequence datasets, how many genuinely new folds should we expect to see? If MGnify grew 10x, how many of those new sequences would actually be structurally novel?\n\nTo answer this systematically across the whole space, we need a scalable clustering\nalgorithm. Foldseek is a brilliant tool for this, and its authors have already clustered\nthe AlphaFold Database with it,\n[reporting 2.3 million\nnon-singleton structural clusters](https://www.nature.com/articles/s41586-023-06510-w). But there are real issues with clustering\npredicted structures, and the clustering problem itself is ill-posed. We think the\ntrue number of reusable structural neighborhoods **is much closer to tens of thousands\nthan to the 2.3 million non-singleton clusters reported by that fast Foldseek pass**\n(about 25,000 in our current analysis). Here's the reasoning.\n\n## The predicted-structure problem for clustering\n\nPredicted structures are different from crystals. The sequences and MSAs are real, but the structures are missing context, and AlphaFold will predict the whole chain: ordered domains, floppy tails, long linkers, signal peptides, and multi-domain proteins whose relative placement may not be meaningful.\n\nThis makes the clustering problem ill-posed. Are two proteins the same fold because one domain matches? Are they different because one has a disordered extension?\n\nThe shape of predicted structures is also a problem for training generative models on this data. You don't want to waste model capacity fitting disordered regions, and you don't want to learn to generate bizarre, elongated chains. 
You could filter on global pLDDT, radius of gyration, and similar whole-chain metrics, but those filters are too crude for data shaped like this — they throw out good domains attached to bad tails. We need a more surgical way to keep the signal and drop the noise.\n\n## First pass: remove the obvious noise\n\nOur first attempt was simple. Remove residues below a pLDDT threshold, split what remains into contiguous sequence fragments, and then spatially rejoin fragments that are clearly touching. The rejoin step is a union-find problem: if fragment A touches B, and B touches C, then A, B, and C become one connected fragment.\n\n- Residues below pLDDT 65 are marked as unusable.\n- Remaining residues become contiguous sequence fragments.\n- Fragments with enough close contacts are merged spatially.\n- The resulting candidates can then be filtered before clustering.\n\nThis gets rid of a lot of obvious disorder. It also keeps ordered domains that global filters would throw away because the full protein looked too long, too extended, or too messy.\n\n**Limitations of naive fragmentation.** The obvious failure\nmode is a high-confidence linker. If the linker survives the pLDDT\nfilter and makes enough contacts, the spatial merge can connect two\ndomains that we would rather treat separately. Union-find then does\nexactly what it was asked to do: it turns the connected chain into one\nfragment.\n\nThe problem is that this is not really a local-confidence question. The residues can be predicted confidently and still be the wrong unit of training. What we need to detect is the bottleneck in the spatial graph: the narrow path that connects otherwise independent pieces.\n\n### A0A0E0RCK4 first pass\n\nThe full chain is kept intact; fragments shorter than 20 residues are left unnumbered.\n\n## The graph-theoretic split\n\nWe need a way to split proteins based on how the residues are connected to each other. 
For that, a protein is naturally a graph: each residue is a node, and edges connect residues that are close in space. We use C-alpha atoms from the confident part of the prediction, connect each residue to its spatial nearest neighbors, and give close neighbors stronger weights than distant ones. In the current version, each residue sees its 15 nearest spatial neighbors.\n\nThis turns the fragmentation problem into a connectivity problem. A compact domain becomes a dense local graph. A high-confidence linker becomes a narrow bridge between two dense regions. Graph theory gives us tools for asking whether that bridge is really part of one unit, or whether it is holding two independent pieces together.\n\n### Graph-theoretic split of a nearest-neighbor protein graph\n\nSpectral bisection asks for the weakest global connection in this graph (why the quantity\nwe compute finds this connection is a little bit black magic to me, ask the graph theorists).\nWe found that the spectral bisection points of a protein correlate very well with the points\nwe'd cut at if we were manually identifying different protein regions. A standalone version\nof the splitter is included in [Supplementary info](#supplementary-spectral-split).\n\n### For the curious: the Fiedler vector\n\nGiven a weighted adjacency matrix \\(W\\) of a graph, the normalized graph Laplacian is:\n\n\\[L_{\\mathrm{sym}} = I - D^{-1/2} W D^{-1/2}\\]\n\nHere \\(D\\) is the diagonal degree matrix. The eigenvectors of \\(L_{\\mathrm{sym}}\\) reveal graph structure:\n\n- The smallest eigenvalue is always 0.\n- The second smallest eigenvalue, \\(\\lambda_2\\), measures algebraic connectivity.\n- The corresponding Fiedler vector gives a useful two-way partition: residues with opposite signs sit on opposite sides of the cut.\n\nThis is the same quantity shown on the graph above. The protein is just green context, while the graph edges are colored by the average Fiedler value of their two endpoint residues. 
Strongly negative edges are blue, strongly positive edges are red, and edges near zero are pale. Those near-zero residues are the bottleneck between the two sides, so those are the residues we remove before assigning fragments.\n\n### Recursive bisection\n\nA single bisection only finds one cut. Multi-domain proteins need repeated\ncuts, so after each split we check each partition separately. If a\npartition has \\(\\lambda_2 > \\text{threshold}\\), we treat it\nas internally well connected and stop. Otherwise, we split again.\n\n### Naive merge versus spectral pipeline\n\nBoth panels load one full-chain CIF; fragments shorter than 20 residues are left unnumbered.\n\n## Clustering the fragments\n\nOnce we split proteins into their \"interacting units\", the unit of clustering is no longer one predicted protein. It is one compact fragment. We can cluster those fragments by structural similarity and ask how much independent fold signal the distillation sets actually contain.\n\nWe cluster MGnify at roughly 30% sequence identity with MMseqs2, which gives about 40\nmillion sequence clusters. From there, we discard sequence singletons, then use the\n[OpenFold3-predicted structures\nreleased through the OpenFold datasets portal](https://portal.openfold.omsf.io/datasets) for the remaining MGnify sequences.\nThis is the handoff point from sequence-space cleaning to structure-space cleaning.\nWe fragment those predicted structures and filter the fragments with quality metrics\nmeant to keep examples amenable to training a generative model (we will write more\nabout this in a later post). The structural clustering below starts from the resulting\nset: about 2 million MGnify fragments.\n\n[We use Foldseek for the\nfirst pass of clustering](#supplementary-foldseek-command). 
Foldseek's fast mode uses both\nstructural and sequence-derived signals, which is what makes\nit practical, but also means it can split fragments that are\nstructurally very similar but sequence-divergent.\n\n### Common pitfall: Foldseek singletons are not always singletons\n\nA Foldseek singleton is not necessarily a new fold. It only means no other fragment crossed the thresholds in that particular Foldseek run. To check this failure mode, we took 1,000 fragments that Foldseek had labeled as singletons and compared them against each other with pairwise TM-score. At TM ≥ 0.8, 373 of those fragments fell into 69 connected components. The largest hidden cluster had 35 members. These were supposed to be singletons. Foldseek's fast pass mixes 3Di and sequence-derived signals. The TM-align audit is slower, but it asks the question we actually care about here: whether the backbones superpose.\n\n### Hidden clusters among Foldseek singletons\n\nEach panel overlays four fragments that Foldseek had separated into singleton clusters.\n\nThe practical lesson is to be skeptical of the singletons. If we treat every Foldseek singleton as an independent structural mode, we overestimate novelty and give the sampler a distorted picture of fold space. TM-score is much slower, but it is the right ground-truth audit pass when the question is whether two fragments really share a fold.\n\nSo we added a second pass over cluster representatives. Instead of comparing every fragment to every other fragment, we compared representatives against representatives with a more structure-centered alignment, then confirmed candidate merges with our TM-align implementation. Merge criterion: min(tm_norm_a, tm_norm_b) ≥ 0.7. Both directions had to independently confirm fold-level similarity. 
If two representatives passed that test, we merged their clusters with union-find.\n\nMain result\n\n| Dataset | Fragments in multi-member clusters | Multi-member clusters | Largest cluster (fragments) | Share in top 100 clusters | Share in top 1,000 clusters |\n|---|---|---|---|---|---|\n| AlphaFold Database | 1,592,372 | 30,622 | 20,836 | 26.0% | 64.3% |\n| MGnify | 1,961,750 | 25,302 | 41,801 | 29.1% | 71.5% |\n\nThese are the results that surprised us. After sequence clustering, fragmentation, quality filtering, and dropping singleton clusters from this summary, MGnify is not two million independent structural examples. The repeated part of the dataset is closer to twenty-five thousand structural neighborhoods, with most of the mass concentrated in a small head of the distribution. The top 1,000 MGnify clusters are only about 4.0% of the multi-member cluster list, but they contain 71.5% of the fragments in those clusters.\n\n### MGnify is dominated by a small head of clusters\n\nSingleton clusters are omitted here to focus on the multi-member distribution.\n\nFigure: cumulative fragment mass across MGnify multi-member clusters (**1.96M** fragments plotted, **25.3K** multi-member clusters, **71.5%** in the top 1,000).\n\nThat changes what \"sampling from MGnify\" means. Sampling fragments uniformly mostly revisits the same common neighborhoods. Sampling clusters uniformly goes too far the other way and gives the smallest repeated neighborhoods too much influence. The sampler needs to live somewhere between those two extremes.\n\n## Sampling from a redundant world\n\nIf we are training a generative model on MGnify, how should we sample from these clusters?\nThe standard recipe is: pick a cluster uniformly, then pick a member uniformly from inside\nit. 
For data this skewed, that overshoots in the opposite direction.\nWith uniform cluster sampling, the top 1,000 MGnify clusters,\nwhich hold 71.5% of multi-member fragments, would be sampled only about 4%\nof the time.\nWe instead use a balancing exponent γ (implemented as `cluster_size_exponent`),\nwhere the aggregate sampling probability of a cluster scales like \\(N^{\\gamma}\\) with \\(N\\)\nthe cluster size. γ = 1 recovers the dataset distribution; γ = 0 weights every\ncluster equally; values in between trade off natural abundance against fold diversity.\n\n### Sampling mass under γ-reweighting\n\nPer-cluster aggregate weight scales as \\(N^{\\gamma}\\).\n\n**10,462** effective clusters\n\n**22.25%** sampling mass in top 1,000\n\nEffective clusters are computed as \\(1 / \\sum_i p_i^2\\), where \\(p_i\\) is the sampling mass assigned to cluster \\(i\\).\n\nThe right γ depends on what the downstream model is supposed to learn. A generative\nmodel trying to *cover* fold space wants lower γ, since it benefits from seeing\nrare folds more than once per epoch. A folding model trying to *match* natural\nconditional distributions wants higher γ. We don't have a settled answer, but\nγ around 0.5 has been a reasonable starting point in our setups: it preserves the\nhead's dominance while flattening the long tail. The effective-cluster count in the figure\nis a useful sanity check: it's the number of clusters you would need if they were all\nequally weighted to give the same diversity as your reweighted distribution.\nAt γ = 0.5, aggregate cluster weight grows like the square root of cluster size:\na cluster with 100 times more fragments gets 10 times the sampling mass.\n\n## Conclusion\n\nWhat surprised us is how redundant natural fold space looks once you pick the right\nunit of clustering. After cleaning up predicted structures, cutting away obvious\nnoise, splitting multi-domain chains, auditing Foldseek singletons, and clustering\nthe resulting fragments, most of the mass sits in a small number of structural\nneighborhoods. 
**Natural proteins do not appear to be exploring backbone space\nuniformly.** They seem to reuse a relatively small set of fold solutions over\nand over.\n\nNatural enzymes often evolve by modifying existing proteins: duplication, divergence, loop changes, active-site mutations, cofactors, metals, and changes in the local environment around a substrate. What surprised us was not that nature reuses folds, but how strongly that reuse shows up once we process predicted structures into training units and cluster them at scale.\n\nFor enzyme design, this leaves two different possibilities. One is the nature-like route: choose a familiar scaffold and learn how to engineer the active-site neighborhood with much higher precision. In that view, the loops, first-shell residues, ligand pose, cofactors, and interaction geometry matter more than making the backbone globally novel. If that is the right regime, then simply adding more natural sequence-derived structures may not help much by itself; it may mostly give us more examples of the same scaffold families.\n\nThe other possibility is more speculative. Evolution is historically constrained; new enzymes do not appear from nowhere, and natural fold space may be shaped by what was easy to reach through duplication and divergence. If design models become good enough, there may be useful backbone space that nature never explored. But that raises a harder question: can models trained mostly on natural folds learn to explore outside the natural fold manifold, or do they inherit the same redundancy we are measuring here?\n\nWe will find out in the lab as we try to design enzymes, seeing which designs actually express, fold, and catalyze. More on this later.\n\n## Supplementary info\n\n### Where the MGnify distillation advantage actually shows up\n\nOne reason I am suspicious of treating MGnify as just more sequence data: the performance advantage does not appear uniformly across every interface benchmark. 
It shows up strongly in antibody-antigen prediction, while most of the broader cofolding benchmarks remain closer together. This is not a clean causal experiment (architecture, compute, and training details all move at once), but it fits the intuition that protein-protein interaction accuracy improves when the model has seen much more of natural protein space.\n\n### Spectral linker split\n\nThis is a standalone version of the graph split used above. It assumes\nyou have already filtered out low-confidence residues, and that\n`ca_coords` contains only the C-alpha coordinates you still\nwant to consider. The function returns the C-alpha indices that sit\nclosest to Fiedler sign-change boundaries; in our pipeline those\nresidues become linker/cut residues before fragment IDs are assigned.\n\n```python\nimport numpy as np\nfrom scipy.sparse import csr_matrix\nfrom scipy.spatial import KDTree\n\ndef spectral_linker_indices(\n    ca_coords: np.ndarray,\n    *,\n    k: int = 15,\n    sigma: float = 8.0,\n    connectivity_threshold: float = 0.05,\n    boundary_size: int = 4,\n    min_partition_size: int = 50,\n) -> np.ndarray:\n    \"\"\"\n    Find linker-like C-alpha positions with recursive spectral bisection.\n\n    Parameters\n    ----------\n    ca_coords:\n        Array with shape (N, 3), containing only the C-alpha coordinates that\n        survived whatever filtering you want to apply first, usually pLDDT.\n    k:\n        Number of spatial nearest neighbors used to build the residue graph.\n    sigma:\n        Gaussian bandwidth for edge weights: exp(-d^2 / (2 sigma^2)).\n    connectivity_threshold:\n        Stop splitting when the Fiedler value is above this threshold.\n    boundary_size:\n        Number of residues with the smallest absolute Fiedler-vector entries\n        to mark as the boundary at each split.\n    min_partition_size:\n        Do not try to split partitions smaller than this.\n\n    Returns\n    -------\n    np.ndarray\n        Sorted indices into ca_coords. 
These are the residues to treat as\n        linkers/cuts before assigning final fragments.\n    \"\"\"\n    coords = np.asarray(ca_coords, dtype=float)\n    if coords.ndim != 2 or coords.shape[1] != 3:\n        raise ValueError(\"ca_coords must have shape (N, 3)\")\n\n    n = len(coords)\n    if n < min_partition_size:\n        # Return an integer array so the result is always usable as an index.\n        return np.array([], dtype=int)\n\n    k_eff = min(k, n - 1)\n    if k_eff < 1:\n        return np.array([], dtype=int)\n\n    tree = KDTree(coords)\n    distances, neighbors = tree.query(coords, k=k_eff + 1)\n\n    # Skip the self-neighbor in column 0.\n    row = np.repeat(np.arange(n), k_eff)\n    col = neighbors[:, 1:].ravel()\n    weights = np.exp(-(distances[:, 1:] ** 2).ravel() / (2 * sigma**2))\n\n    W = csr_matrix((weights, (row, col)), shape=(n, n))\n    W = W.maximum(W.T)\n\n    linker_indices: set[int] = set()\n\n    def split(node_indices: np.ndarray) -> None:\n        if len(node_indices) < min_partition_size:\n            return\n\n        idx = np.sort(node_indices)\n        sub_W = W[idx][:, idx].toarray()\n\n        degree = sub_W.sum(axis=1)\n        if np.count_nonzero(degree) < 2:\n            return\n\n        d_inv_sqrt = np.zeros_like(degree)\n        nonzero = degree > 0\n        d_inv_sqrt[nonzero] = 1.0 / np.sqrt(degree[nonzero])\n\n        # Normalized graph Laplacian: L_sym = I - D^-1/2 W D^-1/2.\n        L_sym = np.eye(len(idx)) - d_inv_sqrt[:, None] * sub_W * d_inv_sqrt[None, :]\n\n        evals, evecs = np.linalg.eigh(L_sym)\n        if len(evals) < 2:\n            return\n\n        fiedler_value = evals[1]\n        fiedler_vector = evecs[:, 1]\n        if fiedler_value > connectivity_threshold:\n            return\n\n        boundary_order = np.argsort(np.abs(fiedler_vector))[:boundary_size]\n        linker_indices.update(idx[boundary_order].tolist())\n\n        keep = np.ones(len(idx), dtype=bool)\n        keep[boundary_order] = False\n\n        partition_a = idx[keep & (fiedler_vector < 0)]\n        partition_b = idx[keep & 
(fiedler_vector >= 0)]\n\n        split(partition_a)\n        split(partition_b)\n\n    split(np.arange(n))\n    # dtype=int keeps the result an index array even when no linkers are found.\n    return np.array(sorted(linker_indices), dtype=int)\n```\n\n### Foldseek command\n\nFor the fragment clustering pass, this is the Foldseek command we used before the representative-level structural audit.\n\n```\n# Foldseek command for reproducibility\n$FOLDSEEK cluster \"$DB_DIR/fragDB\" \"$DB_DIR/fragCluDB\" \"$TMP_DIR\" \\\n  --threads \"$THREADS\" \\\n  -c 0.8 --cov-mode 0 \\\n  --tmscore-threshold 0.7 \\\n  --tmscore-threshold-mode 1 \\\n  --lddt-threshold 0.6 \\\n  --max-seqs 2000 \\\n  -e 0.001 \\\n  --alignment-type 2 \\\n  --cluster-reassign 1 \\\n  -v 2\n```\n\n## Citation\n\nPlease cite this work as:\n\n```\nArda Goreci, \"The Unreasonable Redundancy of Nature's Protein Folds\",\nLigo Biosciences Blog, May 20, 2026.\n```\n\nOr use the BibTeX citation:\n\n```\n@article{goreci2026naturesproteinfolds,\n  author = {Arda Goreci},\n  title = {The Unreasonable Redundancy of Nature's Protein Folds},\n  journal = {Ligo Biosciences Blog},\n  year = {2026},\n  month = may,\n  day = {20},\n  publisher = {Ligo Biosciences}\n}\n```\n\n", "url": "https://wpnews.pro/news/the-unreasonable-redundancy-of-nature-s-protein-folds", "canonical_source": "https://research.ligo.bio/posts/unreasonable-redundancy-of-natural-protein-folds/", "published_at": "2026-06-03 03:47:56+00:00", "updated_at": "2026-06-03 04:12:52.659842+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "neural-networks", "generative-ai", "ai-research"], "entities": ["DeepMind", "AlphaFold3", "Chai-2", "Latent-X2", "Nabla"], "alternates": {"html": "https://wpnews.pro/news/the-unreasonable-redundancy-of-nature-s-protein-folds", "markdown": "https://wpnews.pro/news/the-unreasonable-redundancy-of-nature-s-protein-folds.md", "text": "https://wpnews.pro/news/the-unreasonable-redundancy-of-nature-s-protein-folds.txt", "jsonld": 
"https://wpnews.pro/news/the-unreasonable-redundancy-of-nature-s-protein-folds.jsonld"}}