At Digital Science we are looking for a software developer to harvest science-related data from the internet and contribute to our Dimensions platform. You will join an experienced and highly skilled technical team with a clear vision of its technological and engineering goals, in the exciting setting of an international, agile company. The development language is Python. If you are not (yet) a professional Python developer but have an affinity for harvesting data and the enthusiasm to really get into Python, your fellow team mates will teach you on the job.
With Dimensions, Digital Science launched an innovative research data and tool infrastructure, broadening the view of the research landscape after decades of focus on the publication/citation complex. The guiding principle, delivering context, led us to take different data sets out of their silos and create a heavily interlinked, overarching dataset that describes the whole research lifecycle: from funding input (grants), through research outputs (publications) and the translation and application of research results (clinical trials, patents), to attention (altmetrics and citations) and finally to policy-level impact (mentions of research results in policy papers).
In total, Dimensions today contains more than 128 million documents with more than 4 billion connections between these records. For more information, please visit our website or try the free version of the Dimensions app. Dimensions has offices in Germany, Romania, the US, and the UK, serving clients globally.
Harvest websites using raw HTTP requests or automated browsers (with tools like Selenium or Scrapy), and consume all kinds of web-based APIs (REST, SOAP, ...)
Implement batch jobs to retrieve data via common web protocols like HTTP, FTP, …
Extract data from various source formats by implementing heuristics
Work with very different document formats, ranging from poorly formed HTML or PDF documents to standard file formats like XML, JSON, and CSV, using the standard tools to process them (such as XPath)
Store extracted data in SQL databases (mainly PostgreSQL) in a generic format
Integrate code into the data pipeline that drives our whole data-processing infrastructure.
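The extract-and-store loop described above might look roughly like the following minimal sketch. All names and the sample record are hypothetical, and sqlite3 stands in for PostgreSQL so the example is self-contained; real harvesting code would add HTTP retrieval, error handling, and format-specific heuristics:

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical harvested response, standing in for data fetched over HTTP/FTP.
SAMPLE_XML = """
<records>
  <record id="1"><title>Deep Learning</title><year>2019</year></record>
  <record id="2"><title>Graph Mining</title><year>2021</year></record>
</records>
"""

def extract_records(xml_text):
    """Pull (id, title, year) tuples out of an XML document via XPath queries."""
    root = ET.fromstring(xml_text)
    for rec in root.findall(".//record"):
        yield (rec.get("id"), rec.findtext("title"), int(rec.findtext("year")))

def store_records(conn, records):
    """Store the extracted data in a simple generic table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents (id TEXT PRIMARY KEY, title TEXT, year INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO documents VALUES (?, ?, ?)", records)
    conn.commit()

# Extract from the raw document, then persist; downstream pipeline steps would
# read from the database rather than from the source format.
conn = sqlite3.connect(":memory:")
store_records(conn, extract_records(SAMPLE_XML))
rows = conn.execute("SELECT title FROM documents ORDER BY year").fetchall()
```

Keeping extraction and storage as separate steps, as sketched here, makes it easier to swap in new source formats without touching the database layer.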
We are looking for:
Relevant software development experience (preferably in Python language)
Basic Linux working experience and willingness to deepen it
Ability to work on intricate details without losing the big picture.
Experience with Amazon Web Services, or eagerness to learn about it
Nice to have: experience with application containers (preferably Docker)
Experience in distributed version control systems (git)
Understanding of Agile methodologies
Must be a self-learner, possessing inherent inquisitiveness
Good problem solving and analytical skills
Strong interpersonal, communications, and organizational skills
Minimum of a Bachelor's degree in Computer Science or a related field, or equivalent experience
What We Offer
Be part of an international team distributed all over the globe
Relaxed work environment that values innovation, initiative, and energy
On a rainy day you can choose to work remotely; most communication happens via video calls using Google Hangouts
Competitive salary based on experience
Flexible working hours
Hand-pick your hardware