We are seeking a highly skilled Scraping and Search Engine Developer to join our growing team.
This role focuses on building robust data extraction pipelines and developing advanced search engine capabilities for legal applications. You will design and implement solutions to gather, clean, and organise data from diverse sources, including websites, documents (PDF, DOC), and email, while working closely with legal professionals to understand business requirements and ensure compliance.
A core responsibility will be leveraging vector databases and search technologies to build a sophisticated search engine tailored for legal use cases.
The ideal candidate possesses strong programming skills in Python and JavaScript, extensive experience with vector search technologies, and the ability to collaborate effectively with legal teams to translate business needs into technical solutions.
Search Engine Development: Build and maintain a comprehensive search engine optimised for legal use cases, incorporating both traditional search and semantic search capabilities. Implement advanced query processing, relevance scoring, and result ranking algorithms.
Data Extraction & Web Scraping: Design, develop, and maintain custom web scraping solutions to extract data from specific websites, ensuring adherence to defined formats and metadata requirements. This includes handling dynamic content, anti-scraping measures, and data validation. Ingest data from various sources (web scrapes, documents, email) into our data warehouse/database.
Vector Search Implementation: Implement and manage vector databases (e.g., Pinecone, Chroma, Weaviate, Milvus, OpenSearch) to enable semantic search and similarity matching. This includes embedding generation, index management, query optimisation, and ensuring high-performance vector search capabilities.
Legal Domain Collaboration: Work closely with lawyers and legal professionals to understand business requirements, legal terminology, and domain-specific needs. Translate legal use cases into technical specifications and ensure the search engine meets legal industry standards and workflows.
Data Structuring & Organisation: Structure and organise data based on legal professionals' input and requirements, ensuring data is properly categorised, tagged, and optimised for legal search scenarios.
Document Processing: Develop solutions for extracting structured data from unstructured legal documents (e.g., PDFs, DOCs, contracts, case files) using techniques such as OCR, parsing, and specialised legal document extraction libraries.
Email Data Handling: Develop processes for parsing and extracting relevant information from email data, including attachments and body content, particularly focusing on legal communications and documentation.
Database Management: Design and maintain efficient and scalable database schemas to store and manage legal data with proper indexing for search optimisation.
Data Quality & Monitoring: Implement data quality checks and monitoring systems to ensure data integrity, accuracy, and reliability, with particular attention to legal data compliance requirements.
Performance Optimisation: Identify and implement opportunities to optimise search engine performance, data pipeline efficiency, and query response times.
Documentation: Create and maintain comprehensive documentation for data pipelines, search algorithms, data schemas, and extraction processes.
Stay Current: Keep up to date with the latest trends and technologies in search engine development, vector databases, and legal technology solutions.
Experience: 2+ years of hands-on experience in search engine development, web scraping, and related fields.
Required: Proficiency in Python and JavaScript (Node.js).
Preferred: Experience with C, C++, or C#.
Vector Search Expertise (MUST HAVE): Proven experience with vector databases and semantic search implementation, including embedding models, similarity search, and vector index optimisation.
Search Engine Development (MUST HAVE): Strong understanding of search engine principles, indexing, query processing, and relevance algorithms.
OpenSearch: Experience with OpenSearch for search engine development and data indexing.
Web Scraping Expertise: Proven experience with web scraping frameworks (e.g., Puppeteer, Scrapy, Beautiful Soup) and handling anti-scraping measures.
Database Experience (MUST HAVE): Experience with relational databases (e.g., PostgreSQL, MySQL) and/or NoSQL databases (e.g., MongoDB).
Communication Skills: Strong ability to communicate with legal professionals and translate business requirements into technical solutions.