Document-to-JSON Pipeline for Academic CVs

Opened on July 13, 2025
Main contact
CtrlCV
Toronto, Ontario, Canada
Co-founder, Product
Project
60 hours per learner
Learners
Canada
Advanced level

Project scope

Categories
Artificial intelligence, Information technology, Software development
Skills
test suite, parsing, command-line interface, large language modeling, text extraction, JSON, prompt engineering, reliability, maintainability, conversational AI
Details

The goal of this project is to build a robust two-stage pipeline that extracts clean text from academic CVs (PDF and DOCX) and transforms it into structured JSON using AI and large language models (LLMs). This project supports CtrlCV’s core functionality by allowing users to upload existing CVs and automatically populate their structured academic profile.


The project combines two key objectives:


  1. Text Extraction – Accurately extract and clean raw text from uploaded CV documents, removing noise such as headers, footers, and formatting artifacts.
  2. AI-Based Structuring – Use prompt engineering and LLMs to classify and convert the extracted text into well-formed JSON objects that follow CtrlCV’s academic schema (e.g., sections like Education, Publications, Experience).
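As a rough illustration of the cleaning part of stage 1, the sketch below drops lines that repeat on most pages (likely headers or footers) and short page-number lines. It assumes the per-page text has already been produced by a PDF or DOCX extractor of your choice; the function name `clean_pages` and the `min_repeat_ratio` threshold are hypothetical, not part of CtrlCV.

```python
import re
from collections import Counter

# Matches "3", "Page 3", "3 of 12", "3/12". Limited to 1-3 digits so that a
# bare year such as "2020" on its own line is not mistaken for a page number.
PAGE_NUMBER = re.compile(r"^(page\s+)?\d{1,3}(\s*(of|/)\s*\d{1,3})?$", re.IGNORECASE)


def clean_pages(pages: list[str], min_repeat_ratio: float = 0.6) -> str:
    """Drop lines repeated on most pages and page-number lines; tidy whitespace."""
    page_lines = [[ln.strip() for ln in page.splitlines()] for page in pages]

    # Count each distinct non-empty line once per page.
    counts = Counter(ln for lines in page_lines for ln in set(lines) if ln)

    # A line must appear on at least two pages to count as a header/footer.
    threshold = max(2, round(min_repeat_ratio * len(pages)))
    repeated = {ln for ln, n in counts.items() if n >= threshold}

    kept = []
    for lines in page_lines:
        for ln in lines:
            if not ln or ln in repeated or PAGE_NUMBER.match(ln):
                continue
            kept.append(re.sub(r"\s+", " ", ln))  # collapse internal whitespace
    return "\n".join(kept)
```

A frequency heuristic like this is easy to reason about, but it can also drop legitimate content that repeats across pages, so the threshold is worth tuning against real CVs.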


The emphasis will be on reliability, maintainability, and future extensibility, including privacy-safe design and compatibility with downstream systems.

Deliverables

The final deliverables should include:


- A working end-to-end script or lightweight backend module that:

  • Accepts PDF and DOCX CVs as input
  • Extracts and cleans the raw text
  • Sends the text to an LLM for classification and structuring
  • Outputs clean JSON conforming to the CtrlCV schema
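The structuring steps above can be sketched as follows. This is a minimal outline, not the CtrlCV implementation: the section names in `SECTIONS` are taken from the examples in this posting and stand in for the real schema, the prompt wording is illustrative, and `llm` is any callable that takes a prompt string and returns the model's text reply.

```python
import json
from typing import Callable

# Hypothetical top-level sections; the real CtrlCV schema is not given here.
SECTIONS = ("education", "publications", "experience")

PROMPT_TEMPLATE = (
    "Convert the academic CV below into JSON with exactly these keys: {keys}. "
    "Each key maps to a list of objects. Return JSON only.\n\nCV:\n{cv}"
)


def build_prompt(cv_text: str) -> str:
    return PROMPT_TEMPLATE.format(keys=", ".join(SECTIONS), cv=cv_text)


def parse_llm_json(raw: str) -> dict:
    """Extract and validate the JSON object from a model reply."""
    # Models sometimes wrap JSON in prose; keep the outermost {...} span.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object in model output")
    data = json.loads(raw[start:end + 1])
    for key in SECTIONS:
        if not isinstance(data.get(key), list):
            raise ValueError(f"section {key!r} missing or not a list")
    # Keep only the known sections so unexpected keys never reach downstream code.
    return {key: data[key] for key in SECTIONS}


def structure_cv(cv_text: str, llm: Callable[[str], str]) -> dict:
    """Stage 2: cleaned text in, schema-shaped dict out."""
    return parse_llm_json(llm(build_prompt(cv_text)))
```

Passing the model in as a plain callable keeps the pipeline testable offline and independent of any one provider's SDK.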


- Sample prompts and schema documentation used in the AI parsing stage


- A test suite covering at least three real-world CV samples, ideally five, to demonstrate accuracy and robustness
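One possible way to put a number on accuracy in such a test suite is a per-section recall against hand-labelled expected JSON. The helper below is a hypothetical sketch; exact-match comparison is a deliberately strict assumption, and a real suite might compare normalized fields instead.

```python
def section_recall(expected: dict, actual: dict) -> dict:
    """Per section, the fraction of expected entries found verbatim in the output."""
    scores = {}
    for section, want in expected.items():
        got = actual.get(section, [])
        if not want:
            # Nothing was expected in this section, so nothing can be missed.
            scores[section] = 1.0
            continue
        hits = sum(1 for entry in want if entry in got)
        scores[section] = hits / len(want)
    return scores
```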


- Clear documentation including:

  • Setup and usage instructions
  • Explanation of tool/library choices
  • Instructions for adapting the system to different AI providers (e.g., Azure OpenAI)
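To make switching AI providers a small change, one option is a registry of client factories behind a single call signature. The sketch below is an assumption about how that could look; the names `register_provider` and `get_client` are hypothetical, and the only provider shown is an offline stand-in. A real adapter (for Azure OpenAI, for example) would register under its own name and wrap the vendor SDK inside its factory.

```python
from typing import Callable, Dict

# Every provider is reduced to the same shape: prompt text in, reply text out.
LLMClient = Callable[[str], str]

_PROVIDERS: Dict[str, Callable[[], LLMClient]] = {}


def register_provider(name: str):
    """Decorator that records a client factory under a provider name."""
    def decorator(factory: Callable[[], LLMClient]):
        _PROVIDERS[name] = factory
        return factory
    return decorator


@register_provider("echo")
def make_echo_client() -> LLMClient:
    # Offline stand-in for tests: always returns an empty profile.
    return lambda prompt: '{"education": [], "publications": [], "experience": []}'


def get_client(name: str) -> LLMClient:
    if name not in _PROVIDERS:
        raise ValueError(f"unknown provider {name!r}; known: {sorted(_PROVIDERS)}")
    return _PROVIDERS[name]()
```

With this layout the provider name can come from configuration, and the rest of the pipeline never imports a vendor library directly.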


- (Bonus) A simple UI or CLI tool for uploading a CV and previewing the structured output
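For the bonus CLI, a small `argparse` wrapper may be enough. In the sketch below the program name `cv2json`, the `--indent` flag, and `run_pipeline` are all hypothetical; `run_pipeline` is a placeholder where the real extraction and LLM stages would be called.

```python
import argparse
import json
import sys
from pathlib import Path

SUPPORTED = {".pdf", ".docx"}


def run_pipeline(path: Path) -> dict:
    # Placeholder: the real extraction and structuring stages would run here.
    return {"source": path.name, "education": [], "publications": [], "experience": []}


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(
        prog="cv2json", description="Preview structured JSON for a CV"
    )
    parser.add_argument("cv", type=Path, help="path to a .pdf or .docx CV")
    parser.add_argument("--indent", type=int, default=2, help="JSON indent for the preview")
    args = parser.parse_args(argv)

    # Reject unsupported inputs before doing any work.
    if args.cv.suffix.lower() not in SUPPORTED:
        print(f"unsupported file type: {args.cv.suffix or '(none)'}", file=sys.stderr)
        return 2

    print(json.dumps(run_pipeline(args.cv), indent=args.indent, ensure_ascii=False))
    return 0
```

In a script, `main()` would typically be called under an `if __name__ == "__main__":` guard with its return value passed to `sys.exit`.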

Mentorship
Tools and/or resources

Provide access to the tools, software, and resources needed to complete the project.

Regular meetings

Scheduled check-ins to discuss progress, address challenges, and provide feedback.

About the company

Company
Toronto, Ontario, Canada
2 - 10 employees
IT & computing, Technology

CtrlCV is an AI-powered academic CV generation tool designed to reduce the administrative burden for researchers applying for grants, jobs, and academic reviews. It offers intelligent parsing, clean formatting, and dynamic generation of CVs across multiple required formats.