About the job: Scrapinghub is looking for a Data Engineer to work closely with our Data Science team and provide assistance with dataset collection, cleaning and post-processing. You’ll also help with writing tools for working with data as required for Data Science projects. You will work with one of the most advanced and comprehensive web crawling and scraping infrastructures in the world, leveraging massive data sets with cutting-edge technology.
Job Responsibilities:
Create data tools for Data Science team members that assist them in building and optimizing our product.
Assemble large datasets that meet requirements set by the Data Science team, including creating web crawlers.
Be proactive in bringing forth new ideas and solutions to problems
Be a strong team player and share knowledge freely and easily with your co-workers
Write software for post-processing and cleaning of the data, taking part in data analysis if required
Automate manual processes, optimize data delivery, improve architecture for greater scalability
Work on integration of Data Science components into our larger systems
Handle mid-size and large datasets (200 GB+)
Job Requirements:
Due to business requirements, the successful candidate must be based in Ireland.
Candidates based in the EU but willing to relocate to Ireland will also be considered.
Python experience (3+ years)
5+ years of software development experience
Good command of Linux
Front-end development experience required for creating and supporting internal tools
Back-end development experience required for creating and supporting internal tools: Python web frameworks (like Twisted, aiohttp, Django, Flask) and databases.
Understanding of web technologies: JavaScript, HTML, CSS, HTTP
Strong analytical skills related to working with unstructured datasets
Excellent written English
Bonus points for:
Strong web crawling and web scraping skills: Scrapy knowledge, browser automation experience. Splash experience is a plus.
Experience handling mid-size and large datasets, organizing their parallel processing
Good spoken English
Strong record of open source activity

