Editor’s note: This post is the second in a series of three capturing the results of recent interviews and discussions I had with Robert Fink of Palantir. The conversation was wide-ranging, touching on design, development environments and a bit of the philosophy of enterprise tech. Several common themes emerged across those topics, including the ways Palantir has been leveraging open approaches to data architectures, system design and even developer environments. The first post focused on open data architectures. This second post continues the discussion of open approaches, including open-source software. – bg
BG: What is the relationship between open architecture, open data formats, and open-source software?
RF: Open data formats and open-source libraries are the lingua franca of open platforms. Take Hadoop as an example: developed as an open-source alternative to Google’s proprietary MapReduce and GFS systems (thankfully Google published research papers describing them in great detail), the Hadoop ecosystem today covers effectively 100% of the “big data” market in terms of data storage systems like HDFS and S3, data formats like Parquet, and compute systems like Apache Spark. The relationship between HDFS and S3 makes for an interesting case study: both are distributed storage systems, one available at no cost for on-prem deployments and the other available as a paid service from Amazon. Critically, both are accessible through the same Hadoop FileSystem API and are thus interchangeable as far as downstream applications like Spark are concerned. Really a perfect example of the open platform idea! Foundry directly inherits this flexibility: we are happy to read data from and write data to HDFS and S3 interchangeably.
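Editor’s note: To make the interchangeability point concrete, here is a minimal Python sketch of the idea. It is an illustration only and does not use the real Hadoop FileSystem API: the `FileSystem` interface, the `InMemoryHdfs` and `InMemoryS3` classes, and the `word_count` function are all hypothetical names invented for this example, and both backends are toy in-memory stand-ins rather than real storage systems.

```python
from abc import ABC, abstractmethod


class FileSystem(ABC):
    """Hypothetical minimal storage interface; NOT the real Hadoop FileSystem API."""

    @abstractmethod
    def write(self, path: str, data: bytes) -> None:
        """Store data at the given path, replacing any existing content."""

    @abstractmethod
    def read(self, path: str) -> bytes:
        """Return the bytes previously stored at the given path."""


class InMemoryHdfs(FileSystem):
    """Toy stand-in for an on-prem HDFS cluster (assumption: a plain dict)."""

    def __init__(self) -> None:
        self._files: dict[str, bytes] = {}

    def write(self, path: str, data: bytes) -> None:
        self._files[path] = bytes(data)

    def read(self, path: str) -> bytes:
        return self._files[path]


class InMemoryS3(FileSystem):
    """Toy stand-in for an S3 bucket; objects are keyed by bucket name plus path."""

    def __init__(self, bucket: str) -> None:
        self._bucket = bucket
        self._objects: dict[str, bytes] = {}

    def _key(self, path: str) -> str:
        # Build a bucket-prefixed object key from a file-style path.
        return f"{self._bucket}/{path.lstrip('/')}"

    def write(self, path: str, data: bytes) -> None:
        self._objects[self._key(path)] = bytes(data)

    def read(self, path: str) -> bytes:
        return self._objects[self._key(path)]


def word_count(fs: FileSystem, path: str) -> int:
    """A 'downstream application' that depends only on the shared interface."""
    return len(fs.read(path).decode("utf-8").split())


# The same application code runs unchanged against either backend.
for fs in (InMemoryHdfs(), InMemoryS3("demo-bucket")):
    fs.write("/data/notes.txt", b"open formats are the lingua franca")
    print(type(fs).__name__, word_count(fs, "/data/notes.txt"))
```

The point of the sketch is that `word_count` never learns which backend it is talking to. Under these assumptions, swapping one storage system for another requires no change to the application, which is the property Robert describes for Spark running against HDFS or S3.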
BG: CTOvision has tracked the megatrend of open-source software since our founding almost a decade ago, and we have watched the rise of open-source solutions for operating systems, applications, development environments, databases and entire suites of enterprise IT tools. Even the big proprietary software houses like Microsoft are embracing open source. Since you are a contributor to the community, I’d appreciate your insights into the state of open source. Can you give us some context by first telling us how you define open-source software?
RF: At its most basic level, open-source software is code that anyone can inspect, modify, use, and distribute. To protect the open-source nature of the code, the community has agreed on several licensing models and other protocols, but in every case the basic point is the same: open-source software can be inspected and improved by anyone who follows the community processes.
As for the state of the open-source community, GitHub alone reported 24 million developers across its repositories in 2017. Some projects have thousands of individual contributors—more than any one company could ever field. Clearly the state of the community is strong, and it keeps getting stronger.
Open-source software has commoditized entire industries, from operating systems like BSD and Linux (components of which run on nearly all computers and smartphones), to database technologies like PostgreSQL (a drop-in replacement for much of the core database functionality of commercial giants like Oracle), to developer productivity tools like Eclipse (which, for instance, led Microsoft to offer its formerly very expensive Visual Studio toolset for free). Historically, most open-source software was created and maintained by individuals or academic institutions. Today, major products like Cassandra, TypeScript, and Kubernetes originate in commercial institutions and are then released to the public as open-source software. Entire development teams at companies like Facebook, Google, and Palantir are devoted exclusively to open-source projects. I am personally very excited about this development because it accelerates the availability of new technologies across the board.
BG: So if Palantir Foundry is an open data platform, you must use open-source software?
RF: Our data integration and analysis platform is built around open-source databases, including Cassandra, Elasticsearch and Postgres, data storage and processing infrastructure like Hadoop and Apache Spark, and open formats and protocols for API definitions, like JSON and HTTP. Our build and test infrastructure is implemented with standard tools like Gradle and Yarn. Of course, we also actively contribute to open-source projects. For example, Palantir and our customers are directly interested in fixing bugs and performance issues in the open-source software we rely on. We have learned over the years that the fairest and most effective model is to contribute fixes directly to the respective projects (rather than maintaining internal forks), and that such contributions can be made most efficiently if we have standing relationships and rapport with the relevant communities.
BG: And where did things go sideways? What lessons have you learned?
RF: Yeah, we definitely made some stupid mistakes along the way. Since the source code is available, it’s very tempting to take shortcuts by forking the software, by changing it behind closed doors, and by adding private features. This seems appealing in the short term because it allows you to make bespoke functionality available very quickly. The long-term cost of such forks is immense, though: in the long run, the open-source community is going to outpace you by making the open-source version faster and more secure, and you’ll want to find a way to merge your changes back into the upstream project. The further you have deviated, the longer and more painful this process will be. In one case, it has taken us years to unwind our well-intentioned internal changes and revert to the vanilla upstream distribution.
A second challenge is to find the right timing and engagement mode for open-source contributions. These projects are large human communities with often idiosyncratic, almost political processes and norms. As an individual or institutional contributor, you need to devote time and energy to community building, to finding consensus amongst stakeholders, and to aligning external and internal expectations, timelines, and commitments.
Our next post in this series will dive deeper into how Robert and his team produce code designed from the very beginning to be open, including a look at the open development tools in use at Palantir.