Ash: Hi Athan, this is Ashley.
Athan: Hey, how are you doing?
Athan: Okay, where would you like me to begin?
Ash: Well, maybe you could just start by telling me a bit about what you do for a living and your background.
How do you fund yourself if you’re working on open source full time - are you sponsored or something like that, how does it work?
Athan: I primarily live off savings. I receive some funding through Patreon, which is a form of crowdfunding. In terms of contributors, Phillip, who is the main other contributor, is a graduate student at Carnegie Mellon University. He’s funded through grants. I’ll also occasionally perform freelance work. Fortunately, I worked for a couple of successful startups in Silicon Valley, so I have enough financial cushion to be able to do this.
Ash: So, did you actually start the stdlib library? Are you one of the founding members?
Athan: That led to some open source work, which I worked on for about two years, and, through that work, Phillip, who is the main other collaborator on stdlib, began working with me. We wrote well over 1,000 npm modules that did small things, such as special functions or Matlab-like functionality. Over time, our approach proved not particularly sustainable. At the time, small modules were the “craze”, and the “everything is its own repository” approach just wasn’t scalable for us, as we were investing a lot more time writing tooling to help manage all these packages than actually working on core library functionality. Eventually, we consolidated everything into a mono-repo, which was the genesis of stdlib. We’ve been working on stdlib since March of 2016.
For us, it’s very important that you have one standardised implementation that you can run your algorithms against. Now, with regard to Node.js, the reason we don’t think stdlib should be included in Node.js is that we believe Node should keep its core small. That ensures flexibility and allows people to choose exactly which libraries they want and need and how they want to consume them. Bringing in all of stdlib, which may have BLAS bindings, GPU bindings, and additional things that, for your typical HTTP application, you don’t care about, isn’t necessary and isn’t important to many Node.js use cases.
The reality is that most people don’t need all these bindings for the basic networking applications they are going to deploy in the cloud. A good chunk of stdlib functionality simply wouldn’t concern them. So, for us, we think a separation here between stdlib and Node.js is perfectly fine.
In a similar vein, consider Python, which does have a larger standard library, but NumPy and SciPy are not included in Python, right? In Python, you have a similar separation of concerns, where people bring these things in as they need them. So, while the people at the standards level and within the Node.js community have been supportive, we have no plans for any kind of official inclusion or official support from either of those two.
So, for us, we think there’s a lot more low-hanging fruit that needs to be addressed in terms of basic data structures, basic algorithms, and basic floating-point implementations before even thinking about those other, more specialised use cases, such as BigInt and BigFloat.
Athan: Ah, well.
Ash: Other people have asked you this question?
Athan: Many times, when people are working on their local machines, a web application is not necessarily utilising all the local resources that it can. For example, nowadays people have more than one core, so, if you’re savvy about it, you can utilise that in browser-based applications to parallelize some of your compute.
Or, you can take advantage of the fact that IndexedDB in web browsers can access up to half of a person’s hard drive, so you can store data locally and do downsampling and upsampling of data as need be, within web browsers.
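As a rough illustration of both ideas, here is a hypothetical sketch in plain JavaScript: a `partition` helper of the kind you would use to split work across Web Workers (one chunk per core, via `navigator.hardwareConcurrency`), and a naive `downsample` that keeps every k-th sample of a series you might have cached locally in IndexedDB. Both helper names are made up for illustration; neither is a stdlib API.

```javascript
// Hypothetical helpers (not stdlib APIs), illustrating the ideas above.

// Split an array into `n` chunks, e.g. one per CPU core; in a browser,
// each chunk would then be posted to its own Web Worker via postMessage().
function partition(data, n) {
  var size = Math.ceil(data.length / n);
  var chunks = [];
  for (var i = 0; i < data.length; i += size) {
    chunks.push(data.slice(i, i + size));
  }
  return chunks;
}

// Naive downsampling: keep every `factor`-th sample of a series, e.g.
// before rendering data pulled out of local browser storage.
function downsample(data, factor) {
  var out = [];
  for (var i = 0; i < data.length; i += factor) {
    out.push(data[i]);
  }
  return out;
}

// In a browser one might then write (not runnable outside a browser;
// 'worker.js' is a hypothetical worker script):
// var cores = navigator.hardwareConcurrency || 4;
// partition(series, cores).forEach(function (chunk) {
//   new Worker('worker.js').postMessage(chunk);
// });
```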
Ash: That’d be a native plug-in, is that right?
Athan: Yes, using Node.js terminology, native bindings take the form of “native add-ons”. We include such bindings in stdlib, but these bindings are still very much a work in progress.
Once you have core data structures and bindings, you can start adding additional functionality such as Apache Arrow or Pandas DataFrames, which enable higher level abstractions and more advanced use cases. But we’re just not there, given the amount of work that goes into actually building these things.
The “dirty secret” behind Python, MATLAB, R, etc. is that they all have bindings to, e.g., BLAS and LAPACK, and then expose various higher-level APIs which invoke lower-level functionality. In essence, scientific computing environments re-use a lot of these old Fortran and C/C++ libraries. Nevertheless, creating such bindings and writing such wrappers involves considerable work, and it’s not just a matter of creating a file that calls lower-level libraries. If you’re going to do these things in an idiomatic way and need to take into account how, for example, the web works, then you need to do a little bit of reinventing the wheel and re-implementing functionality from scratch.
But it’s not enough to create a single binding to some large native library. If, for example, you want to bundle functionality exposed in the native library in your web application, you wouldn’t be able to do that right now. The reason being that these things are either written in Fortran, or they’re in C/C++, or they’re calling down to something that’s hardware optimised. Even with WebAssembly, you’re not going to be able to run any hardware-optimised code in a web browser. In the case of BLAS and LAPACK, you’re going to need vanilla reference implementations, and, if you want to create small bundles (or even binaries in the case of Node.js), you need to create individual packages and modules. In which case, you need to break down these large libraries into individualised components, so that they can be consumed in a very ad-hoc fashion. And it’s not going to be the case that you can just go, okay, well, what I’ll do is I’ll install NumPy and simply hook into their existing native implementations and bindings. This is not going to work, because Python does things in a Pythonic way, with special consideration to Python types, syntaxes, and APIs. And even if someone did manage to port NumPy, SciPy, and other libraries to, e.g., WebAssembly, doing so would introduce considerable bloat because of how those libraries are implemented (e.g., considerable glue code mapping native constructs to Python abstractions).
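To make the “vanilla reference implementation” point concrete, here is a deliberately simplified sketch of the BLAS level-1 routine `daxpy` (y := alpha*x + y) written in plain JavaScript. This is an illustration, not stdlib’s actual implementation: real BLAS APIs also take stride and offset parameters, which are omitted here by assuming unit stride.

```javascript
// Simplified, unit-stride port of the BLAS `daxpy` routine: y := alpha*x + y.
// A "vanilla" implementation like this runs anywhere JavaScript runs,
// unlike Fortran or hardware-optimised builds.
function daxpy(N, alpha, x, y) {
  for (var i = 0; i < N; i++) {
    y[i] += alpha * x[i];
  }
  return y;
}
```

Once a routine is ported like this, it can live in its own small package and be bundled ad hoc, which is exactly what a monolithic native library prevents.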
To summarize, you’re not going to be able to break these libraries down into individual components as they stand. You’re just going to end up shipping a huge binary, leading to increased network costs and users being annoyed because your application takes too long to load. In short, there are many considerations that you have to take into account to do this well: the nature of the web, and the nature of how people build modern applications. It’s a different kind of programming model and a different kind of design model than what authors of numerical and scientific computing libraries are typically used to.
Ash: What are we missing, though? Under Node.js we can already build cross-platform native plug-ins, right? Are we missing some sort of communication protocol?
Athan: Yeah, yeah, that’s definitely the case. It’s a bit of a chicken-and-egg type situation. In order for people to do amazing things, like cool demos to attract attention, there need to be libraries. But in order for there to be libraries, people need to be doing cool things and want to contribute.
So, there is a bit of a dearth right now of the kind of technical expertise needed to do these things well. You’ll see people on npm publish some kind of one-off module to do whatever, but it’s typically a terrible implementation: it’s not accurate, it’s not robust, or they’re not using stable algorithms. This goes all the way down the stack and all the way up.
Ash: Do you see that happening? Do you see more qualified data scientists coming into our arena?
Athan: Yeah, I think so, but it’s slow, right. There are people interested in doing machine learning and these kinds of things, and you see efforts like deeplearn.js, which is now TensorFlow.js, so you are seeing corporate buy-in into the power of the web.
You’ll get laughed at, right, because people have all these preconceptions. So, it’s a bit hard. I mean, there are people: obviously the data visualisation community is huge, and web technologies are used in companies across the board. You go to Uber, they have a huge data visualisation team, really smart people. They’re all using web technologies, whether it’s D3 or the stuff that Nicolas Belmonte is doing.
So, it’s slow. You get some academic buy-in, and you get people like, for example, Stencila, who are trying to build a web-based spreadsheet program which has a little more power than your traditional spreadsheet. You have people that recognise this stuff and recognise the power of the web. But you don’t have a critical mass of people at the present moment who recognise the power of the web platform or have a willingness to invest resources in it.
Ash: So, where do you see the stdlib library going in the future? Where does it fit? Is the audience data scientists or is it broader than that?
Athan: Yeah, I think it’s broader than that. We came at this from a scientific computing background. My PhD is in physics, and I did a lot of time series analysis, machine learning, etc. So, that’s our bent and our bias.
But we also recognise that it’s not just that; how we think about the project is a bit more encompassing. We want to create a particular set of libraries with consistent documentation, consistent benchmarks, examples, etc. And this goes for everything from low-level utilities, your lodash, ramda, etc., up to more hardcore NLP, machine learning, etc.
We don’t see those things as necessarily being disparate, as in “oh, I’m only going to use this library because it has ndarrays”. We think these things actually work together, in the sense that if you’re going to be doing any kind of data analysis, you’re going to need CSV parsers, or you’re probably going to want to use streaming libraries, or you’re going to need CLI utilities to pluck various things out of JSON data structures.
You’re going to need to be able to do some functional operations on these data structures, whatever it might be. So, we see pretty much everything as fair game, as long as it fits within the narrative of providing a standard library that people can use for whatever application that may be. So, yeah, our bent is toward the scientific computing community, but we don’t exclude ourselves from implementing basic utilities for data transformation or whatever.
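A hypothetical sketch of the kind of small data-munging utility being described: a `pluck` helper that extracts a (possibly nested) property from each record in an array of JSON-like objects. The function name and dotted-path convention are illustrative assumptions, not a specific stdlib API.

```javascript
// Hypothetical `pluck` utility (name and path syntax are illustrative):
// extract the value at a dotted path from each record in an array.
function pluck(records, path) {
  var keys = path.split('.');
  return records.map(function (record) {
    return keys.reduce(function (value, key) {
      // Bail out with `undefined` if an intermediate value is missing.
      return value == null ? void 0 : value[key];
    }, record);
  });
}
```

Small composable utilities like this are what let CSV parsing, streaming, and functional transforms work together in a single pipeline.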
Athan: I guess there are a few people that I follow on Twitter and things like that. I keep pretty close tabs on my GitHub activity feed, and I know key people within the community who were, at some point or another, doing scientific computing work.
So, I find resources through those kinds of weak links, and if there are repositories, I just follow them. We have a huge laundry list internally of various projects, whether started many, many years ago or new ones that people are working on. What I would say is that, yeah, certain things like this will probably fall through the cracks. I don’t know how much time really needs to be invested in this, because there’s a lot of noise; there are so many crappy implementations out there. From our standpoint, we hadn’t really done anything for about a year and a half to promote the project.
We’d go and talk at conferences and wherever else, and we’d plug it that way. But, at least from our standpoint, we were trying to create as much low-level functionality as possible, and we weren’t looking for broad adoption, because we recognise that we need a certain critical mass in order for the library to actually show its potential.
So, it’s only relatively recently that we’ve thought about actually pushing and promoting this project within the broader community; until then we’d kind of kept a low profile. But there are other people in the community doing interesting things. Obviously, all the TensorFlow guys. Ryan Dahl was doing Propel, but then that got shut down because Google scooped him. And then there are the canonical ones, like Math.js, but they’re not really doing anything innovative.
You know, to be quite honest, there are not that many interesting projects happening, there just aren’t. People will do one-off things, and it’s typically centred around deep learning at the moment. So, if you’re interested in that, there’s Keras.js, there are the Tokyo ML people, there’s Synaptic, which is now Neataptic, and deeplearn.js, which is now TensorFlow.js. That’s where a lot of the eye candy and activity is. Not a lot of activity is happening around doing actual, good, solid computer science.
Ash: Do you think there is anyway to coordinate the community to bring these things together and get stuff moving in the right direction?
Athan: Yeah, maybe. I think it’s quite difficult. It’s a little bit like finding a needle in a haystack, because you need the people. Obviously there are a lot of well-intentioned people, but they don’t necessarily have the knowledge or the wherewithal to distinguish a good implementation from a bad one.
They’ll go and find something on Stack Overflow, and that will be it. They won’t actually go and look at the academic literature and figure out, okay, this is how people really implement these things. Also, with open source work there is often ego involved. More often than not, if they’re going to devote their time, they’ll devote it to something of their own. Of course, there are a lot of people that contribute to open source, but they make one-off, drive-by kinds of contributions. They are not necessarily investing a lot of time.
The people that are investing time are more likely to want to do their own projects, in their own way, and they’re not really willing to contribute to other projects. So, it’s a bit hard to recruit people, to get people to consolidate around a particular project. What I think will have to happen is that some project, maybe that’s TensorFlow, maybe that’s something else, will just have to win, and that’s the one we’ll all jump on.
I don’t know if there is anything that can be done at the present moment to say, hey, let’s create this committee that’s going to architect this thing. I think it’s way too soon to even start thinking about that. My guess is we won’t see any kind of consolidation for maybe another year or two, when people go, okay, this is legitimate.
There are these cool demos that people are doing, like machine learning demos. Some people have written books on it, they’ve shown the potential, and at some point people will say, okay, now is the time, this is the most promising one, and now we’re going to put all our eggs in that basket. At the present moment, there is no project out there like that. There just isn’t.
Ash: Cool, we might wrap it up now. Is there anything else you wanted to say about standard library?
Athan: No, not really. Our current road map is working on low-level numeric stuff: low-level implementations, BLAS bindings, LAPACK bindings, etc. These are the fundamental building blocks needed to create libraries for machine learning, deep learning, etc. That’s probably going to be the next few months of work. There are other things, like data visualisation, that we might work on and focus on, but it’s really the fundamental building blocks that people need in order to build these larger, more performant libraries.
But it’s difficult. I think one of the biggest things that needs to happen, in order for the community to start recognising its potential, is some kind of corporate buy-in. There needs to be some kind of funding in this space. Otherwise it’s not sustainable; it’s not sustainable for people like me to live off savings and create these things in their garage, so to speak. There needs to be some kind of model for supporting initiatives in this space in order for them to actually grow, because the scope of the problem is so big that it’s not something you just knock off in a weekend.
It’s something that takes months to years. We’ve been working on this stuff now for two years. To create something like a NumPy or a Boost, it takes a lot of effort, and there needs to be some way of supporting those efforts, whether that’s corporate sponsorship, grant funding, etc.
Ash: Maybe that’s where a committee or some sort of foundation could come in useful? Maybe for evaluating these projects, looking at the ones with the most potential, and potentially allocating funding to them.
Athan: Yeah, but those bodies are really few and far between. Look at the JS Foundation: they provide mentorship, but they don’t actually provide any kind of financial backing. You have things like the Moore Foundation or NumFOCUS, but those are typically for more established projects, and there is nothing currently super well established or in widespread usage. So, this is difficult. There is a gap right now: you don’t have anything like seed- or angel-stage funding for open source work, and there is not a lot of interest yet in academia in sponsoring this type of thing.
Athan: So, I think that wraps it up. I don’t know if you have any other questions or anything?