{"slug": "building-a-semantic-search-engine-and-open-status-classifier-over-the-14k", "title": "Building a Semantic Search Engine and Open-Status Classifier over the ResearchMath-14k Dataset", "summary": "Researchers at Amphora have built a semantic search engine and open-status classifier over the ResearchMath-14k dataset, a collection of research-level mathematics problems mined from arXiv. The system generates semantic embeddings, visualizes the problem landscape, clusters related problems, and trains a classifier to predict problem status from embeddings. This work enables efficient retrieval and classification of mathematical problems across different fields and open-status categories.", "body_md": "In this tutorial, we work with the [amphora/ResearchMath-14k](https://huggingface.co/datasets/amphora/ResearchMath-14k) dataset, a collection of research-level mathematics problems mined from arXiv. We load the dataset, inspect its structure, and explore how the problems are distributed across mathematical fields and open-status categories. We then move beyond basic analysis by extracting field-specific keywords, generating semantic embeddings, visualizing the problem landscape, clustering related problems, and building a simple search engine over the dataset. Also, we train a classifier to predict problem status from embeddings and detect closely related or near-duplicate problems.\n\n```\n!pip -q install -U datasets sentence-transformers scikit-learn umap-learn \\\n   pandas matplotlib seaborn wordcloud 2>/dev/null\nimport warnings, numpy as np, pandas as pd\nwarnings.filterwarnings(\"ignore\")\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nsns.set_theme(style=\"whitegrid\", palette=\"deep\")\nSAMPLE_SIZE = 4000\nRANDOM_STATE = 42\nEMB_MODEL   = \"sentence-transformers/all-MiniLM-L6-v2\"\n```\n\nWe begin by installing the required libraries and importing the tools needed for analysis, visualization, embeddings, and data handling. 
We also set the main configuration values, including sample size, random seed, and embedding model. This gives us a clean setup before we start working with the ResearchMath dataset.\n\n``` python\nfrom datasets import load_dataset\nds = load_dataset(\"amphora/ResearchMath-14k\", split=\"test\")\ndf = ds.to_pandas()\nprint(\"Rows:\", len(df))\nprint(\"Columns:\", list(df.columns))\nprint(df.head(3))\nTEXT_COL = \"self_contained_problem\"\ndf = df[df[TEXT_COL].astype(str).str.len() > 20].reset_index(drop=True)\n```\n\nWe load the amphora/ResearchMath-14k dataset from Hugging Face and convert it into a pandas DataFrame. We inspect the number of rows, available columns, and a few sample records to understand the dataset structure. We then keep only problem statements longer than 20 characters so that subsequent analysis works on useful text.\n\n``` python\nprint(\"\\n--- open_status distribution ---\")\nprint(df[\"open_status\"].value_counts(dropna=False))\nprint(\"\\n--- taxonomy_level_1 (math fields) ---\")\nprint(df[\"taxonomy_level_1\"].value_counts())\nfig, axes = plt.subplots(1, 3, figsize=(20, 6))\ndf[\"open_status\"].value_counts().plot(\n   kind=\"bar\", ax=axes[0], color=\"steelblue\")\naxes[0].set_title(\"Problem status\"); axes[0].tick_params(axis=\"x\", rotation=30)\ndf[\"taxonomy_level_1\"].value_counts().plot(\n   kind=\"barh\", ax=axes[1], color=\"seagreen\")\naxes[1].set_title(\"Top-level math field\"); axes[1].invert_yaxis()\ndf[\"doc_len\"] = df[TEXT_COL].str.split().apply(len)\naxes[2].hist(df[\"doc_len\"].clip(upper=400), bins=40, color=\"indianred\")\naxes[2].set_title(\"Problem length (words, clipped @400)\")\nplt.tight_layout(); plt.show()\nct = pd.crosstab(df[\"taxonomy_level_1\"], df[\"open_status\"], normalize=\"index\")\nplt.figure(figsize=(10, 6))\nsns.heatmap(ct, annot=True, fmt=\".2f\", cmap=\"rocket_r\")\nplt.title(\"Fraction of each status within each field\")\nplt.tight_layout(); plt.show()\n```\n\nWe explore the dataset by checking how problems are 
distributed across open-status labels and mathematical fields. We visualize the status counts, field counts, and problem lengths to quickly get an overview of the corpus. We also create a heatmap to see how open-status categories vary across different math fields.\n\n``` python\nfrom sklearn.feature_extraction.text import TfidfVectorizer\ndef top_terms_per_group(frame, group_col, text_col, k=8):\n   out = {}\n   for g, sub in frame.groupby(group_col):\n       if len(sub) < 20:\n           continue\n       vec = TfidfVectorizer(max_features=3000, stop_words=\"english\",\n                             ngram_range=(1, 2), min_df=3)\n       X = vec.fit_transform(sub[text_col])\n       scores = np.asarray(X.mean(axis=0)).ravel()\n       terms = np.array(vec.get_feature_names_out())\n       out[g] = terms[scores.argsort()[::-1][:k]].tolist()\n   return out\nfor field, terms in top_terms_per_group(df, \"taxonomy_level_1\", TEXT_COL).items():\n   print(f\"\\n{field:35s} -> {', '.join(terms)}\")\n```\n\nWe use TF-IDF to find the most important terms within each top-level mathematical field. We group the dataset by field and extract the strongest keywords or phrases that represent each group. 
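This helps us understand what topics and terminology dominate different areas of research in mathematics.\n\n``` python\nfrom sentence_transformers import SentenceTransformer\nfrom sklearn.cluster import KMeans\nfrom sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score\n# Reconstructed sketch: work, model, and emb match the names used in later cells;\n# batch size, plot styling, and the cluster count are assumptions.\nwork = df if not SAMPLE_SIZE or len(df) <= SAMPLE_SIZE else df.sample(\n   SAMPLE_SIZE, random_state=RANDOM_STATE)\nwork = work.reset_index(drop=True)\nmodel = SentenceTransformer(EMB_MODEL)\nemb = model.encode(work[TEXT_COL].tolist(), batch_size=64,\n                   show_progress_bar=True, normalize_embeddings=True)\ntry:\n   import umap\n   xy = umap.UMAP(n_components=2, random_state=RANDOM_STATE).fit_transform(emb)\nexcept ImportError:\n   from sklearn.decomposition import PCA\n   xy = PCA(n_components=2, random_state=RANDOM_STATE).fit_transform(emb)\nfields = work[\"taxonomy_level_1\"].astype(str)\nplt.figure(figsize=(11, 8))\nsns.scatterplot(x=xy[:, 0], y=xy[:, 1], hue=fields.values, s=10, linewidth=0)\nplt.title(\"Problem landscape by field\")\nplt.legend(bbox_to_anchor=(1.02, 1), loc=\"upper left\", fontsize=8)\nplt.tight_layout(); plt.show()\nkm = KMeans(n_clusters=fields.nunique(), n_init=10, random_state=RANDOM_STATE)\nlabels = km.fit_predict(emb)\nprint(f\"ARI={adjusted_rand_score(fields, labels):.3f} \"\n      f\"NMI={normalized_mutual_info_score(fields, labels):.3f}\")\n```\n\nWe sample the dataset and convert each mathematical problem into a semantic embedding using a SentenceTransformer model. We reduce the embeddings into two dimensions using UMAP, or PCA if UMAP is unavailable, and visualize the problem landscape by field. 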
We then apply K-Means clustering and compare the resulting clusters with the human-labeled taxonomy using ARI and NMI.\n\n``` python\nfrom sentence_transformers import util\ndef search(query, k=5):\n   q = model.encode([query], normalize_embeddings=True)\n   sims = util.cos_sim(q, emb)[0].cpu().numpy()\n   idx = sims.argsort()[::-1][:k]\n   print(f'\\n=== Query: \"{query}\" ===')\n   for rank, i in enumerate(idx, 1):\n       row = work.iloc[i]\n       print(f\"\\n[{rank}] sim={sims[i]:.3f} | {row['taxonomy_level_1']} \"\n             f\"| status={row['open_status']}\")\n       print(\"   \", row[TEXT_COL][:260].replace(\"\\n\", \" \"), \"...\")\nsearch(\"rational points on hyperelliptic curves\")\nsearch(\"multiplicativity of maximal output p-norm of a quantum channel\")\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.metrics import classification_report, ConfusionMatrixDisplay\ny = work[\"open_status\"].values\nXtr, Xte, ytr, yte = train_test_split(\n   emb, y, test_size=0.25, random_state=RANDOM_STATE, stratify=y)\nclf = LogisticRegression(max_iter=2000, class_weight=\"balanced\", C=2.0)\nclf.fit(Xtr, ytr)\npred = clf.predict(Xte)\nprint(\"\\n=== open_status classifier (embeddings + logistic regression) ===\")\nprint(classification_report(yte, pred))\nfig, ax = plt.subplots(figsize=(7, 6))\nConfusionMatrixDisplay.from_predictions(\n   yte, pred, ax=ax, cmap=\"Blues\", xticks_rotation=45,\n   normalize=\"true\", values_format=\".2f\")\nax.set_title(\"open_status confusion matrix (row-normalized)\")\nplt.tight_layout(); plt.show()\nsims = util.cos_sim(emb, emb).cpu().numpy()\nnp.fill_diagonal(sims, 0)\ni, j = np.unravel_index(sims.argmax(), sims.shape)\nprint(f\"\\nMost similar pair (cos={sims[i, j]:.3f}):\")\nfor n in (i, j):\n   print(f\"\\n  paper_id={work.iloc[n]['paper_id']} | \"\n         f\"{work.iloc[n]['taxonomy_level_1']}\")\n   print(\"   \", 
work.iloc[n][TEXT_COL][:240].replace(\"\\n\", \" \"), \"...\")\nprint(\"\\nDone. Set SAMPLE_SIZE=None at the top to run on the full 14.1k rows.\")\n```\n\nWe build a semantic search function that retrieves the most similar research problems for a given query. We then train a classifier on the embeddings to predict each problem’s open-status label. Finally, we compute similarity across all embedded problems to detect the closest pair and identify near-duplicate or strongly related problem statements.\n\nIn conclusion, we have a complete workflow for analyzing research-level mathematical problems using modern NLP and machine learning tools. We started with dataset exploration, then used TF-IDF, sentence embeddings, dimensionality reduction, clustering, semantic search, and classification to understand the corpus’s structure from multiple angles. It gives us a practical way to study how mathematical problems are grouped, how similar problems can be retrieved, and how embeddings can support both exploratory analysis and supervised prediction tasks.\n\nSana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. 
With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.", "url": "https://wpnews.pro/news/building-a-semantic-search-engine-and-open-status-classifier-over-the-14k", "canonical_source": "https://www.marktechpost.com/2026/06/04/building-a-semantic-search-engine-and-open-status-classifier-over-the-researchmath-14k-dataset/", "published_at": "2026-06-04 22:24:10+00:00", "updated_at": "2026-06-04 23:04:45.027934+00:00", "lang": "en", "topics": ["machine-learning", "natural-language-processing", "ai-research"], "entities": ["arXiv", "amphora/ResearchMath-14k", "sentence-transformers", "all-MiniLM-L6-v2", "Hugging Face", "UMAP", "scikit-learn"], "alternates": {"html": "https://wpnews.pro/news/building-a-semantic-search-engine-and-open-status-classifier-over-the-14k", "markdown": "https://wpnews.pro/news/building-a-semantic-search-engine-and-open-status-classifier-over-the-14k.md", "text": "https://wpnews.pro/news/building-a-semantic-search-engine-and-open-status-classifier-over-the-14k.txt", "jsonld": "https://wpnews.pro/news/building-a-semantic-search-engine-and-open-status-classifier-over-the-14k.jsonld"}}