Introduction
“Should I use GitHub for my projects?” – I’m often asked this question by aspiring data scientists. There’s only one answer to this – “Absolutely!”.
GitHub is an invaluable platform for data scientists looking to stand out from the crowd. It’s an online resume for displaying your code to recruiters and fellow professionals. The fact that GitHub hosts open-source projects from tech behemoths like Google, Facebook, IBM and NVIDIA only adds to the gloss of an already shining offering.
If you’re a beginner in data science, or even an established professional, you should have a GitHub account. And to save you the time of looking for the most interesting repositories out there (and there are plenty), I am delighted to scour the platform and bring them straight to you in this monthly series.
This month’s collection comes from a variety of use cases – computer vision (object detection and segmentation), PyTorch implementation of Google AI’s record-breaking BERT framework for NLP, extracting the latest research papers with their summaries, among others. Scroll down to start learning!
Why do we include Reddit discussions in this series? I have personally found Reddit an incredibly rewarding platform for a number of reasons – rich content, top machine learning/deep learning experts taking the time to propound their thoughts, a stunning variety of topics, open-source resources, etc. I could go on all day, but suffice to say I highly recommend going through these threads I have shortlisted – they are unique and valuable in their own way.
You can check out the top GitHub repositories and Reddit discussions (from April onwards) that we have covered in each month’s edition of this series.
GitHub Repositories
Faster R-CNN and Mask R-CNN in PyTorch 1.0
Computer vision has become so incredibly popular these days that organizations are rushing to implement and integrate the latest algorithms in their products. Sounds like a pretty compelling reason to jump on the bandwagon, right?
Of course, object detection is easily the most sought-after skill to learn in this domain. So here’s a really cool project from Facebook that aims to provide the building blocks for creating segmentation and detection models using their popular PyTorch 1.0 framework. Facebook claims it is up to two times faster than its Detectron framework, and it comes with pre-trained models. Enough resources and details to get started!
I encourage you to check out a step-by-step introduction to the basic object detection algorithms if you need a quick refresher. And if you’re looking to get familiar with the basics of PyTorch, check out this awesome beginner-friendly tutorial.
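If you’d like a feel for what running a pre-trained detector looks like in practice, here’s a minimal inference sketch. Note that it uses torchvision’s model zoo purely for illustration, not the repository’s own config-driven API, and the input image path is a placeholder:

```python
# Illustrative sketch only: torchvision's pre-trained Mask R-CNN,
# not the maskrcnn-benchmark repository's config-driven API.
import torch
import torchvision
from torchvision import transforms
from PIL import Image

model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()  # switch to inference mode

img = Image.open("street.jpg").convert("RGB")  # placeholder image path
tensor = transforms.ToTensor()(img)

with torch.no_grad():
    # The model takes a list of 3xHxW tensors and returns, per image,
    # a dict with 'boxes', 'labels', 'scores' and 'masks'.
    prediction = model([tensor])[0]

keep = prediction["scores"] > 0.7  # keep only confident detections
print(prediction["boxes"][keep], prediction["labels"][keep])
```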
Tencent ML Images (Largest Open-Source Multi-Label Image Database)
This repository is a goldmine for all deep learning enthusiasts. Intrigued by the heading? Just wait till you check out some numbers about this dataset: 17,609,752 training and 88,739 validation image URLs, which are annotated with up to 11,166 categories. Incredible!
This project also includes a pre-trained ResNet-101 model, which has so far achieved an accuracy of 80.73% on ImageNet via transfer learning. The repository contains exhaustive details and code on how and where to get started. This is a significant step towards making high-quality data available to the community.
Oh, and did I mention that these images are annotated? What are you waiting for? Go ahead and download it NOW!
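To make the transfer-learning idea concrete, here’s a minimal sketch of adapting a pre-trained ResNet-101 for multi-label classification along the lines the repository describes. The batch is random stand-in data and the hyperparameters are illustrative, not the repository’s actual training recipe:

```python
# Hedged sketch: fine-tuning a pre-trained ResNet-101 for multi-label
# classification. Data and hyperparameters below are placeholders.
import torch
import torch.nn as nn
import torchvision

NUM_CATEGORIES = 11166  # ML-Images annotates up to 11,166 categories

model = torchvision.models.resnet101(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, NUM_CATEGORIES)

# Multi-label targets are 0/1 vectors, so we use a per-class sigmoid
# via BCEWithLogitsLoss rather than a softmax cross-entropy.
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

images = torch.randn(8, 3, 224, 224)                      # stand-in batch
targets = torch.randint(0, 2, (8, NUM_CATEGORIES)).float()  # stand-in labels

logits = model(images)
loss = criterion(logits, targets)
loss.backward()
optimizer.step()
```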
PyTorch Implementation of Google AI’s BERT (NLP)
Wait, another PyTorch entry? Just goes to show how popular this framework has become. For those who haven’t heard of BERT, it’s a language representation model that stands for Bidirectional Encoder Representations from Transformers. It sounds like a mouthful, but it has been making waves in the machine learning community.
BERT has set all sorts of new benchmarks in 11 natural language processing (NLP) tasks. A pre-trained language model being used on a wide range of NLP tasks might sound outlandish to some, but the BERT framework has transformed it into reality. It even emphatically outperformed humans on the popular SQuAD question answering test.
This repository contains the PyTorch code for implementing BERT on your own machine. As Google Brain’s Research Scientist Thang Luong tweeted, this could well be the beginning of a new era in NLP.
In case you’re interested in reading the research paper, that’s also available here. And in case you’re eager (like me) to see the official Google code, bookmark (or star) this repository.
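For a taste of what working with BERT in PyTorch feels like, here’s a small sketch that extracts contextual token embeddings. It uses the Hugging Face transformers package (one of several PyTorch ports of BERT) purely for illustration; the repository above exposes its own training and inference interface:

```python
# Illustrative sketch using the Hugging Face `transformers` port of
# BERT, not the linked repository's own API.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("GitHub is great for data science.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state holds one contextual vector per token:
# shape (batch, sequence_length, 768) for bert-base.
print(outputs.last_hidden_state.shape)
```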
Extracting Latest Arxiv Research Papers and their Abstracts
How can we stay on top of the latest research in machine learning? It seems we see breakthroughs on an almost weekly basis, and keeping up with them is a daunting, if not altogether impossible, challenge. Most top researchers post their full papers on arxiv.org, so is there any way of sorting through the latest ones?
Yes, there is! This repository uses Python (v3.x) to scrape the latest arxiv papers and summarize their abstracts. This is a really useful tool, as it helps us stay in touch with the latest research and pick out the paper(s) we want to read. As mentioned in the repository, you can run the below command to search for a keyword:
$ python3 sotawhat.py "[keyword]" [number of results]
The script returns five results by default if you don’t specify how many you want.
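For instance, a hypothetical query for the ten latest BERT-related papers would look like this:

$ python3 sotawhat.py "BERT" 10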
DeepMimic
I always try to include at least one reinforcement learning repository in these lists – primarily because I feel everyone in this field should be aware of the latest advancements in this space. And this month’s entry is a fascinating one – motion imitation with deep reinforcement learning.
This repository is an implementation of the “DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills” paper presented at SIGGRAPH 2018. Quoting from the repository, “The framework uses reinforcement learning to train a simulated humanoid to imitate a variety of motion skills”. Check out the above project link, which includes videos and code on how to implement this framework on your own.
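The core idea is easy to sketch: at each timestep the agent earns a reward for how closely the simulated character tracks the reference motion, with each term an exponentiated negative tracking error. The weights and scales below are illustrative placeholders, not the paper’s exact values:

```python
# Toy sketch of a DeepMimic-style imitation reward. Weights and
# scales are illustrative, not the paper's exact values.
import numpy as np

def imitation_reward(sim_pose, ref_pose, sim_vel, ref_vel):
    pose_err = np.sum((sim_pose - ref_pose) ** 2)
    vel_err = np.sum((sim_vel - ref_vel) ** 2)
    r_pose = np.exp(-2.0 * pose_err)   # rewards matching joint poses
    r_vel = np.exp(-0.1 * vel_err)     # rewards matching joint velocities
    return 0.7 * r_pose + 0.3 * r_vel  # weighted blend of the two terms

# Stand-in joint angles/velocities for a 10-joint character:
reward = imitation_reward(np.zeros(10), np.full(10, 0.05),
                          np.zeros(10), np.zeros(10))
print(reward)  # approaches 1.0 as tracking becomes near-perfect
```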
Bonus: AdaNet by Google AI
I just couldn’t leave out this incredibly useful repository. AdaNet is a lightweight and scalable TensorFlow-based framework for automatically learning high-quality models. The best part about it is that you don’t need to intervene too much – the framework is smart and flexible enough for building better models.
You can read more about AdaNet here. Google, as usual, does a great job of explaining complex concepts.
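For a flavour of how little intervention AdaNet asks of you, here’s a hedged sketch built around the AutoEnsembleEstimator shown in the project’s README. Exact signatures may differ across adanet and TensorFlow versions, and train_input_fn is a hypothetical input function you would define yourself:

```python
# Hedged sketch based on AdaNet's documented AutoEnsembleEstimator;
# exact signatures may vary across adanet/TensorFlow versions.
import adanet
import tensorflow as tf

feature_columns = [tf.feature_column.numeric_column("x", shape=[10])]
head = tf.estimator.RegressionHead()  # assumes a TF version with estimator heads

estimator = adanet.AutoEnsembleEstimator(
    head=head,
    # Candidate subnetworks that AdaNet adaptively ensembles:
    candidate_pool=[
        tf.estimator.LinearEstimator(
            head=head, feature_columns=feature_columns),
        tf.estimator.DNNEstimator(
            head=head, feature_columns=feature_columns,
            hidden_units=[64, 32]),
    ],
    max_iteration_steps=1000,  # training steps per AdaNet iteration
)
# estimator.train(input_fn=train_input_fn, max_steps=5000)
# (train_input_fn is a hypothetical tf.estimator input function)
```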
Reddit Discussions
What Developments can we Expect in Machine Learning in the Next 5 Years?
Ah, the question on everybody’s mind. Will AutoML be ruling the roost? How will the hardware have advanced? Will there finally be official rules and policies around ethics? Will machine learning have integrated itself into the very fabric of society? Will reinforcement learning finally have found a place in the industry?
These are just some of the many thoughts propounded in this discussion. Individuals have their own predictions about what they expect and what they want to see, and this discussion does an excellent job of combining the two. The conversation varies between technical and non-technical topics so you have the luxury of choosing which ones you prefer reading about.
Advice for a Non-ML Engineer who Manages Machine Learning Researchers
Interesting topic. We’ve seen this trend before – a non-ML person is assigned to lead a team of ML experts and it usually ends in frustration for both parties. Due to various reasons (time constraints being top of that list), it often feels like things are at an impasse.
I implore all project managers, leaders, CxOs, etc. to take the time and go through this discussion thread. There are some really useful ideas that you can implement in your own projects as soon as possible. Getting all the technical and non-technical folks on the same page is a crucial cog in the overall project’s success, so it’s important that the leader(s) set the right example.
Topic Ideas for Machine Learning Projects
Looking for a new project to experiment with? Or need ideas for your thesis? You’ve landed at the right place. This is a collection of ideas that graduate students are working on to hone and fine-tune their machine learning skills. Some of the ones that stood out for me are:
- Predicting the trajectory of pedestrians
- Estimating weather phenomena through acoustics (using signal processing and machine learning)
- Using deep learning to improve the hearing aid speech processing pipeline
This is where Reddit becomes so useful – you can pitch your idea in this discussion and you’ll receive feedback from the community on how you can approach the challenge.
Why do Machine Learning Papers have Such Terrible Math?
This one is a fully technical discussion, as you might have gathered from the heading. It’s an entirely subjective question, and answers vary depending on the reader’s level of experience and how well the researcher has put across his/her thoughts. I like this discussion because there are very specific examples of linked research papers, so you can explore them and form your own opinion.
It’s a well-known (and accepted) fact that quite a lot of papers have their math and findings cobbled together – not everyone has the patience, willingness or even the ability to present their study in a lucid manner. It’s always a good idea to work on your presentation skills while you can.
The Disadvantages of the Hype Around Machine Learning
How do established professionals feel when their field starts getting tons of attention from newbies? It’s an interesting question that potentially spans domains, but this thread focuses on machine learning.
This is not a technical discussion per se, but it’s interesting to note how top data scientists and applied machine learning professionals feel about the recent spike in interest in this field. The discussion, with over 120 comments, is rich in thought and suggestions. Things get especially interesting when the topic of how to deal with non-technical leaders and team members comes up. There are tons of ideas to steal from here!
End Notes
This year really has seen some amazing research being open-sourced. Regardless of what happens after Microsoft’s official takeover of GitHub, it remains the primary platform for collaboration among programmers, developers and data scientists. I encourage everyone reading this to start using GitHub more regularly, even if it’s just for browsing the latest repositories.
Which GitHub repository and/or Reddit discussion stood out for you? Are there any other libraries or frameworks you feel I should have included in this article? Let me know in the comments section below.