Building a Semantic Search Engine and Open-Status Classifier over the ResearchMath-14k Dataset

Researchers at Amphora have built a semantic search engine and open-status classifier over the ResearchMath-14k dataset, a collection of research-level mathematics problems mined from arXiv. The system generates semantic embeddings, visualizes the problem landscape, clusters related problems, and trains a classifier to predict problem status from embeddings. This work enables efficient retrieval and classification of mathematical problems across different fields and open-status categories.

In this tutorial, we work with the amphora/ResearchMath-14k dataset (https://huggingface.co/datasets/amphora/ResearchMath-14k), a collection of research-level mathematics problems mined from arXiv. We load the dataset, inspect its structure, and explore how the problems are distributed across mathematical fields and open-status categories. We then move beyond basic analysis by extracting field-specific keywords, generating semantic embeddings, visualizing the problem landscape, clustering related problems, and building a simple search engine over the dataset. We also train a classifier to predict problem status from embeddings and detect closely related or near-duplicate problems.

```python
# Notebook cell: the leading "!" runs pip in the shell (Colab/Jupyter).
!pip -q install -U datasets sentence-transformers scikit-learn umap-learn \
    pandas matplotlib seaborn wordcloud 2>/dev/null

import warnings
import numpy as np
import pandas as pd
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="whitegrid", palette="deep")

SAMPLE_SIZE = 4000    # number of problems to embed; None uses the full dataset
RANDOM_STATE = 42     # seed for sampling and train/test splitting
EMB_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
```

We begin by installing the required libraries and importing the tools needed for analysis, visualization, embeddings, and data handling. We also set the main configuration values: sample size, random seed, and embedding model. This gives us a clean setup before we start working with the ResearchMath dataset.

```python
from datasets import load_dataset

ds = load_dataset("amphora/ResearchMath-14k", split="test")
df = ds.to_pandas()
print("Rows:", len(df))
print("Columns:", list(df.columns))
df.head(3)

TEXT_COL = "self_contained_problem"
# Keep only statements longer than 20 characters.
df = df[df[TEXT_COL].astype(str).str.len() > 20].reset_index(drop=True)
```

We load the amphora/ResearchMath-14k dataset from Hugging Face and convert it into a pandas DataFrame. We inspect the number of rows, available columns, and a few sample records to understand the dataset structure.
We then keep only problem statements of meaningful length so that subsequent analysis works on useful text.

```python
print("\n--- open_status distribution ---")
print(df["open_status"].value_counts(dropna=False))
print("\n--- taxonomy_level_1 (math fields) ---")
print(df["taxonomy_level_1"].value_counts())

fig, axes = plt.subplots(1, 3, figsize=(20, 6))
df["open_status"].value_counts().plot(kind="bar", ax=axes[0], color="steelblue")
axes[0].set_title("Problem status")
axes[0].tick_params(axis="x", rotation=30)
df["taxonomy_level_1"].value_counts().plot(kind="barh", ax=axes[1], color="seagreen")
axes[1].set_title("Top-level math field")
axes[1].invert_yaxis()
df["doc_len"] = df[TEXT_COL].str.split().apply(len)
axes[2].hist(df["doc_len"].clip(upper=400), bins=40, color="indianred")
axes[2].set_title("Problem length (words, clipped @400)")
plt.tight_layout()
plt.show()

# Row-normalized cross-tab: each field's row sums to 1.
ct = pd.crosstab(df["taxonomy_level_1"], df["open_status"], normalize="index")
plt.figure(figsize=(10, 6))
sns.heatmap(ct, annot=True, fmt=".2f", cmap="rocket_r")
plt.title("Fraction of each status within each field")
plt.tight_layout()
plt.show()
```

We explore the dataset by checking how problems are distributed across open-status labels and mathematical fields. We visualize the status counts, field counts, and problem lengths to quickly get an overview of the corpus. We also create a heatmap to see how open-status categories vary across different math fields.
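To make the row normalization behind the heatmap concrete, here is a minimal sketch on a toy table. The field names and status labels are made up for illustration and are not taken from the dataset; only the `pd.crosstab(..., normalize="index")` call mirrors the tutorial code.

```python
import pandas as pd

# Toy data: field names and status labels are illustrative assumptions.
toy = pd.DataFrame({
    "taxonomy_level_1": ["Algebra", "Algebra", "Algebra", "Analysis", "Analysis"],
    "open_status": ["open", "open", "resolved", "open", "resolved"],
})

# normalize="index" divides each row by its row total, so every field's row
# sums to 1 and each cell is the within-field fraction of that status.
ct = pd.crosstab(toy["taxonomy_level_1"], toy["open_status"], normalize="index")
print(ct.round(2))
```

Reading the heatmap row by row therefore compares status mixes between fields without the larger fields dominating the picture.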
```python
from sklearn.feature_extraction.text import TfidfVectorizer

def top_terms_per_group(frame, group_col, text_col, k=8):
    """Return the k highest mean-TF-IDF terms for each group."""
    out = {}
    for g, sub in frame.groupby(group_col):
        if len(sub) < 20:  # skip groups too small for stable statistics
            continue
        vec = TfidfVectorizer(max_features=3000, stop_words="english",
                              ngram_range=(1, 2), min_df=3)
        X = vec.fit_transform(sub[text_col])
        scores = np.asarray(X.mean(axis=0)).ravel()
        terms = np.array(vec.get_feature_names_out())
        out[g] = terms[scores.argsort()[::-1][:k]].tolist()
    return out

for field, terms in top_terms_per_group(df, "taxonomy_level_1", TEXT_COL).items():
    print(f"\n{field:35s} - {', '.join(terms)}")
```

We use TF-IDF to find the most important terms within each top-level mathematical field. We group the dataset by field, fit a separate vectorizer on each group, and extract the unigrams and bigrams with the highest mean TF-IDF score. This helps us understand what topics and terminology dominate different areas of research in mathematics.

We sample the dataset and convert each mathematical problem into a semantic embedding using a SentenceTransformer model. We reduce the embeddings to two dimensions using UMAP, or PCA if UMAP is unavailable, and visualize the problem landscape by field. We then apply K-Means clustering and compare the resulting clusters with the human-labeled taxonomy using the adjusted Rand index (ARI) and normalized mutual information (NMI). The cells that follow refer to the sampled DataFrame as `work`, the encoder as `model`, and the embedding matrix as `emb`.
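The embedding, projection, and clustering steps described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the tutorial's own code: the helper names `embed_texts`, `project_2d`, and `cluster_and_score` are hypothetical, the batch size and UMAP metric are guesses, and using one K-Means cluster per distinct field label is an assumption. The demo at the bottom runs on synthetic vectors so it needs no model download.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score


def embed_texts(texts, model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """Encode texts into L2-normalized sentence embeddings."""
    from sentence_transformers import SentenceTransformer  # heavy import, kept local
    model = SentenceTransformer(model_name)
    emb = model.encode(list(texts), batch_size=64, show_progress_bar=True,
                       normalize_embeddings=True)
    return model, np.asarray(emb)


def project_2d(emb, random_state=42):
    """Project to 2-D with UMAP when installed, otherwise fall back to PCA."""
    try:
        import umap
    except ImportError:
        xy = PCA(n_components=2, random_state=random_state).fit_transform(emb)
        return xy, "PCA"
    reducer = umap.UMAP(n_components=2, metric="cosine", random_state=random_state)
    return reducer.fit_transform(emb), "UMAP"


def cluster_and_score(emb, labels, random_state=42):
    """Run K-Means (one cluster per distinct label) and score it against the labels."""
    k = len(set(labels))  # assumption: as many clusters as labeled fields
    cluster_ids = KMeans(n_clusters=k, n_init=10,
                         random_state=random_state).fit_predict(emb)
    ari = adjusted_rand_score(labels, cluster_ids)
    nmi = normalized_mutual_info_score(labels, cluster_ids)
    return cluster_ids, ari, nmi


# Synthetic demo: three well-separated blobs standing in for three fields.
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 8, [10.0] * 8, [-10.0] * 8])
toy_labels = np.repeat(["A", "B", "C"], 30)
toy_emb = np.vstack([c + 0.5 * rng.standard_normal((30, 8)) for c in centers])
_, ari, nmi = cluster_and_score(toy_emb, toy_labels)
print(f"ARI={ari:.2f}  NMI={nmi:.2f}")

# Wiring to the tutorial's names (assumed; requires the model download):
# work = df if SAMPLE_SIZE is None else df.sample(
#     n=min(SAMPLE_SIZE, len(df)), random_state=RANDOM_STATE).reset_index(drop=True)
# model, emb = embed_texts(work[TEXT_COL], EMB_MODEL)
# xy, method = project_2d(emb, RANDOM_STATE)
# cluster_ids, ari, nmi = cluster_and_score(emb, work["taxonomy_level_1"].values)
```

ARI and NMI both approach 1 when the clusters reproduce the labels and stay near 0 for unrelated partitions, which makes them a convenient check on how well embedding geometry tracks the human taxonomy.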
```python
from sentence_transformers import util

def search(query, k=5):
    """Print the k problems in `work` most similar to the query."""
    q = model.encode([query], normalize_embeddings=True)
    sims = util.cos_sim(q, emb)[0].cpu().numpy()
    idx = sims.argsort()[::-1][:k]
    print(f'\n=== Query: "{query}" ===')
    for rank, i in enumerate(idx, 1):
        row = work.iloc[i]
        print(f"\n[{rank}] sim={sims[i]:.3f} | {row['taxonomy_level_1']} "
              f"| status={row['open_status']}")
        print("   ", row[TEXT_COL][:260].replace("\n", " "), "...")

search("rational points on hyperelliptic curves")
search("multiplicativity of maximal output p-norm of a quantum channel")

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

y = work["open_status"].values
Xtr, Xte, ytr, yte = train_test_split(
    emb, y, test_size=0.25, random_state=RANDOM_STATE, stratify=y)
clf = LogisticRegression(max_iter=2000, class_weight="balanced", C=2.0)
clf.fit(Xtr, ytr)
pred = clf.predict(Xte)
print("\n=== open_status classifier (embeddings + logistic regression) ===")
print(classification_report(yte, pred))

fig, ax = plt.subplots(figsize=(7, 6))
ConfusionMatrixDisplay.from_predictions(
    yte, pred, ax=ax, cmap="Blues", xticks_rotation=45,
    normalize="true", values_format=".2f")
ax.set_title("open_status confusion matrix (row-normalized)")
plt.tight_layout()
plt.show()

# All-pairs similarity; zero the diagonal so a problem is not matched with itself.
sims = util.cos_sim(emb, emb).cpu().numpy()
np.fill_diagonal(sims, 0)
i, j = np.unravel_index(sims.argmax(), sims.shape)
print(f"\nMost similar pair (cos={sims[i, j]:.3f}):")
for n in (i, j):
    print(f"\n  paper_id={work.iloc[n]['paper_id']} | "
          f"{work.iloc[n]['taxonomy_level_1']}")
    print("   ", work.iloc[n][TEXT_COL][:240].replace("\n", " "), "...")

print("\nDone. Set SAMPLE_SIZE=None at the top to run on the full 14.1k rows.")
```

We build a semantic search function that retrieves the most similar research problems for a given query. We then train a classifier on the embeddings to predict each problem’s open-status label.
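The ranking inside `search` comes down to cosine similarity, which is the dot product of L2-normalized vectors. A minimal sketch with plain NumPy on made-up 3-dimensional vectors shows the idea; `top_k_cosine` is a hypothetical helper and is not part of the tutorial code.

```python
import numpy as np


def top_k_cosine(query_vec, corpus, k=5):
    """Return (indices, similarities) of the k corpus rows closest to query_vec."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = c @ q                      # cosine similarity of each row with the query
    idx = np.argsort(sims)[::-1][:k]  # highest similarity first
    return idx, sims[idx]


# Toy "embeddings": values are illustrative, not real model output.
corpus = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([0.8, 0.2, 0.0])
idx, sims = top_k_cosine(query, corpus, k=2)
print(idx, np.round(sims, 3))
```

When the corpus embeddings are already normalized, the per-row normalization can be skipped and the whole search is a single matrix-vector product.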
Finally, we compute similarity across all embedded problems to detect the closest pair and identify near-duplicate or strongly related problem statements.

In conclusion, we have a complete workflow for analyzing research-level mathematical problems using modern NLP and machine learning tools. We started with dataset exploration, then used TF-IDF, sentence embeddings, dimensionality reduction, clustering, semantic search, and classification to understand the corpus’s structure from multiple angles. This workflow gives us a practical way to study how mathematical problems are grouped, how similar problems can be retrieved, and how embeddings can support both exploratory analysis and supervised prediction tasks.

Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.