Document-to-JSON Pipeline for Academic CVs
Main contact

Project scope
Categories
Artificial intelligence, Information technology, Software development
Skills
test suite, parsing, command-line interface, large language modeling, text extraction, JSON, prompt engineering, reliability, maintainability, conversational AI
The goal of this project is to build a robust two-stage pipeline that extracts clean text from academic CVs (PDF and DOCX) and transforms it into structured JSON using large language models (LLMs). This project supports CtrlCV’s core functionality by allowing users to upload existing CVs and automatically populate their structured academic profile.
The project combines two key objectives:
- Text Extraction – Accurately extract and clean raw text from uploaded CV documents, removing noise such as headers, footers, and formatting artifacts.
- AI-Based Structuring – Use prompt engineering and LLMs to classify and convert the extracted text into well-formed JSON objects that follow CtrlCV’s academic schema (e.g., sections like Education, Publications, Experience).
The emphasis will be on reliability, maintainability, and future extensibility, including privacy-safe design and compatibility with downstream systems.
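The header and footer removal named in the text-extraction objective could be approached with a simple frequency heuristic. The sketch below is illustrative only: it assumes the extractor already yields one string per page, and the function name `strip_repeated_lines` and the `min_fraction` threshold are hypothetical, not part of CtrlCV.

```python
from collections import Counter


def strip_repeated_lines(pages: list[str], min_fraction: float = 0.6) -> str:
    """Drop lines that recur on most pages (likely headers or footers)."""
    # Count each distinct non-empty line at most once per page.
    counts = Counter()
    for page in pages:
        counts.update({line.strip() for line in page.splitlines() if line.strip()})

    # Assumption: a line seen on at least this many pages is boilerplate.
    # The floor of 2 keeps single-page documents untouched.
    threshold = max(2, round(len(pages) * min_fraction))

    cleaned = []
    for page in pages:
        for line in page.splitlines():
            text = line.strip()
            if text and counts[text] < threshold:
                cleaned.append(text)
    return "\n".join(cleaned)
```

This only catches lines that repeat verbatim; footers that change per page, such as page numbers, would need an additional rule.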
The final deliverables should include:
- A working end-to-end script or lightweight backend module that:
  - Accepts PDF and DOCX CVs as input
  - Extracts and cleans the raw text
  - Sends the text to an LLM for classification and structuring
  - Outputs clean JSON conforming to the CtrlCV schema
- Sample prompts and schema documentation used in the AI parsing stage
- A test suite with at least 3–5 real-world CV samples to demonstrate accuracy and robustness
- Clear documentation including:
  - Setup and usage instructions
  - Explanation of tool/library choices
  - Instructions for adapting the system to different AI providers (e.g., Azure OpenAI)
- (Bonus) A simple UI or CLI tool for uploading a CV and previewing the structured output
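One possible shape for the structuring stage in the deliverables above is sketched here. It keeps the LLM call behind a plain callable so that a different provider can be swapped in without touching the parsing logic. Everything named in it is an assumption for illustration: the actual CtrlCV schema is not given in this description, so `REQUIRED_SECTIONS` is a placeholder based on the example sections mentioned earlier, and `complete` stands in for whatever provider client is chosen.

```python
import json
from typing import Callable

# Hypothetical top-level sections; the real CtrlCV schema is not specified here.
REQUIRED_SECTIONS = ("education", "publications", "experience")

PROMPT_TEMPLATE = (
    "Convert the CV text below into a JSON object with the keys "
    "{keys}. Each key maps to a list of entries. Return JSON only.\n\n{cv_text}"
)


def build_prompt(cv_text: str) -> str:
    """Fill the prompt template with the section names and the cleaned CV text."""
    return PROMPT_TEMPLATE.format(keys=", ".join(REQUIRED_SECTIONS), cv_text=cv_text)


def parse_llm_reply(reply: str) -> dict:
    """Parse the model reply and check it against the assumed section list."""
    # Replies may wrap the JSON in prose; keep the outermost {...} span.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply[start : end + 1])
    for key in REQUIRED_SECTIONS:
        if not isinstance(data.get(key), list):
            raise ValueError(f"section {key!r} is missing or not a list")
    return data


def structure_cv(cv_text: str, complete: Callable[[str], str], retries: int = 2) -> dict:
    """Send cleaned text to a provider-specific `complete` callable and validate."""
    last_error = None
    for _ in range(retries + 1):
        try:
            return parse_llm_reply(complete(build_prompt(cv_text)))
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            last_error = exc
    raise RuntimeError(f"no valid JSON after {retries + 1} attempts") from last_error
```

Because `complete` is just a function from prompt string to reply string, the test suite can pass a stub that returns canned replies, and adapting to another provider means writing one small wrapper around that provider's client.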
Provide access to the tools, software, and resources needed to complete the project.
Scheduled check-ins to discuss progress, address challenges, and provide feedback.
About the company
CtrlCV is an AI-powered academic CV generation tool designed to reduce the administrative burden for researchers applying for grants, jobs, and academic reviews. It offers intelligent parsing, clean formatting, and dynamic generation of CVs across multiple required formats.
