Ubuntu, multiple Git accounts, Docker, and more: My Workspace Setup as a Data Science Intern

date: Feb 27, 2023
tags: Data Science, Tech
updatedAt: Mar 13, 2023 11:42 PM
I have just finished my first week as a Data Science Intern at a startup in Paris. I think it’s a good time to document my workspace setup for both my future work and those who are in the same situation as I was.
As our data team is fairly small (3 people including me), there is no standard way of setting up a workspace that I need to follow. I have been using MacBooks for years, including during my previous internship. This time, I was given a Windows machine (Lenovo ThinkPad P52).
Although desktop setup varies with the dev environment your team uses, the steps below are basic and applicable most of the time. Treat this article as a quick guide for your first morning at the company.
My “workspace setup stack”

Ubuntu

Doing data science tasks on Windows is painful (at least for me). There are plenty of other reasons to prefer Linux over Windows for programming, and they have been covered at length elsewhere on the internet.
Among the huge number of Linux distros, Ubuntu is probably one of the most widely used. I chose to install Ubuntu alongside Windows instead of using WSL/VirtualBox or replacing the original OS because:
  • Dual boot runs Ubuntu directly on the hardware, so it performs better than WSL or VirtualBox, which both add an extra layer between Ubuntu and the hardware.
  • I will use Ubuntu most of the time, so there is no advantage in using WSL or VirtualBox (in contrast to someone who has to use Windows applications a lot).
  • Navigating file systems and accessing hardware resources with WSL/VirtualBox is sometimes not straightforward.
  • I don’t want to completely delete Windows, although doing so would free up a lot of disk space. I may need to run Windows software in the future, and the available disk space is enough for me (my laptop has a 512GB SSD, of which I allocated ~300GB to Ubuntu).
The easiest way to install Ubuntu is from a bootable USB flash drive. Just follow the official documentation; the process should not take more than an hour.

Multiple Git accounts

There is a high chance that you will use a company email for your work, and you will need that email to create a GitHub (or GitLab/Bitbucket) account to access the team’s codebase. At the same time, you may want to keep using your personal GitHub account for your own work. So, it’s a good idea to set up more than one Git account on your work computer.
This section is specifically written for GitHub and Ubuntu, but the steps for other platforms should be very similar. Here are the detailed steps:
Generate new key pairs:
Create key pairs for 2 git accounts by:
$ ssh-keygen -t rsa -b 4096 -f ~/.ssh/github-work
$ ssh-keygen -t rsa -b 4096 -f ~/.ssh/github-personal
Set up the configuration file
Create a config file in the ~/.ssh directory by:
$ (umask 077; touch ~/.ssh/config)
Open the config file with a text editor and key in this information:
Host github.com
  User git
  IdentityFile ~/.ssh/github-work

Host github.com-personal
  HostName github.com
  User git
  IdentityFile ~/.ssh/github-personal
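If you prefer to script this step, here is a minimal sketch that writes the same two host entries and checks the result. It assumes a scratch directory (the `ssh_dir` variable is only a stand-in for ~/.ssh, so nothing real is overwritten) and GNU `stat` as shipped with Ubuntu. The `IdentitiesOnly yes` lines are my own addition, not part of the config above: they ask ssh to offer only the listed key, which can help when several keys are loaded in an agent.

```shell
# Scratch directory standing in for ~/.ssh (assumption: keeps the sketch safe to run).
ssh_dir="$(mktemp -d)"

# umask 077 inside a subshell makes the new file accessible to its owner only,
# without changing the umask of the current shell.
(
  umask 077
  cat > "$ssh_dir/config" <<'EOF'
Host github.com
  User git
  IdentityFile ~/.ssh/github-work
  IdentitiesOnly yes

Host github.com-personal
  HostName github.com
  User git
  IdentityFile ~/.ssh/github-personal
  IdentitiesOnly yes
EOF
)

# Count the host aliases that were written.
host_count="$(grep -c '^Host ' "$ssh_dir/config")"
# Read the permission bits of the file (GNU stat syntax).
config_mode="$(stat -c '%a' "$ssh_dir/config")"
echo "$host_count hosts, mode $config_mode"
```

Once both public keys are registered on GitHub (next step), running `ssh -T git@github.com` and `ssh -T git@github.com-personal` should each greet you with the username of the matching account.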
Add your ssh keys to GitHub
Go to https://github.com/settings/keys with the corresponding GitHub accounts and add a new SSH key. You can get your public keys by:
$ cat ~/.ssh/github-work.pub
and
$ cat ~/.ssh/github-personal.pub
Workflow
When you have multiple Git accounts on your computer, most of the time you have to explicitly specify which one you are using.
With the same configuration file as above, when you want to clone a repo using your personal account, instead of running:
$ git clone git@github.com:yourusername/repo-name.git
You need to run:
$ git clone git@github.com-personal:yourusername/repo-name.git
The same thing applies when you add or update remote URLs.
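For a repo that was already cloned through the default host, the remote URL only needs its host part swapped for the alias. Here is a small sketch of that rewrite; `to_personal_url` is a hypothetical helper name of mine, and it assumes the `github.com-personal` alias from the config file above.

```shell
# Rewrite a github.com SSH URL so that it goes through the "github.com-personal" alias.
# URLs that already use the alias (or any other host) are printed unchanged.
to_personal_url() {
  printf '%s\n' "$1" | sed 's/^git@github\.com:/git@github.com-personal:/'
}

rewritten="$(to_personal_url "git@github.com:yourusername/repo-name.git")"
echo "$rewritten"
```

Inside an existing clone, you could then apply it with:
$ git remote set-url origin "$(to_personal_url "$(git remote get-url origin)")"
and confirm the result with:
$ git remote -v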
One more thing to take note of is that you need to set your correct identity (username and email) for each repo. You can do it by:
$ git config user.name "your-username"
$ git config user.email "your-email@domain"
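To see that these settings really stay inside one repo, here is a short sketch that sets the identity in a throwaway repository and reads it back. The temporary directory and the placeholder values are assumptions for illustration; in practice you would run the two `git config` commands from inside your actual repo.

```shell
# Work in a throwaway repository so no real project is touched.
repo="$(mktemp -d)"
git init --quiet "$repo"

# Without --global, git config writes to this repository's .git/config only.
git -C "$repo" config user.name "your-username"
git -C "$repo" config user.email "your-email@domain"

# Read the values back, restricted to the repository-level config.
configured_name="$(git -C "$repo" config --local user.name)"
configured_email="$(git -C "$repo" config --local user.email)"
echo "$configured_name <$configured_email>"
```

Because the values live in the repo’s own .git/config, a work repo and a personal repo on the same machine can carry different identities.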
More information can be found on this StackOverflow thread.

Docker

I don’t use Docker frequently for small-scale data science tasks unless I need to deploy ML endpoints somewhere. However, I make use of containers a lot for my side projects and sometimes I may need to collaborate with the Software Engineering team (more likely in a small startup), who use Docker. So, it’s beneficial to have Docker set up properly.
You need to uninstall old Docker versions by:
$ sudo apt-get remove docker docker-engine docker.io containerd runc
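The docker-ce packages used in the next step come from Docker’s own apt repository, which a fresh Ubuntu installation does not have enabled. Below is a sketch of the repository setup, based on my reading of Docker’s installation guide for Ubuntu; the keyring path and list file name are assumptions that may differ in newer versions of the guide, and `docker_repo_line` and `setup_docker_repo` are hypothetical helper names. The commands are wrapped in a function so that nothing is changed on your system until you call it.

```shell
# Print the apt source line for Docker's repository (string building only, no side effects).
# $1: CPU architecture, $2: Ubuntu release codename
docker_repo_line() {
  printf 'deb [arch=%s signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu %s stable\n' "$1" "$2"
}

# Register Docker's signing key and repository, then refresh the package list.
# Defined as a function: nothing below runs until setup_docker_repo is called.
setup_docker_repo() {
  sudo apt-get update
  sudo apt-get install -y ca-certificates curl gnupg
  sudo install -m 0755 -d /etc/apt/keyrings
  curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
    | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
  sudo chmod a+r /etc/apt/keyrings/docker.gpg
  docker_repo_line "$(dpkg --print-architecture)" "$(. /etc/os-release && echo "$VERSION_CODENAME")" \
    | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
  sudo apt-get update
}
```

Calling `setup_docker_repo` once should be enough; after that, apt can find docker-ce, docker-ce-cli, and containerd.io.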
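Then, once Docker’s official apt repository has been added to your package sources (the docker-ce packages are not in Ubuntu’s default repositories), you can install Docker and Docker Compose by running: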
$ sudo apt-get update
$ sudo apt-get install docker-ce docker-ce-cli containerd.io
$ sudo apt-get install docker-compose
Verify that both Docker and Docker Compose are installed properly by:
$ docker --version
$ docker-compose --version
Other installation methods are available here.

Google Chrome

Although Google Chrome is famous for consuming a lot of RAM, it is still my favorite web browser. Run the following commands to update the package list, download the latest version of Google Chrome, and install the package:
$ sudo apt-get update
$ wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
$ sudo apt install ./google-chrome-stable_current_amd64.deb
You can launch Google Chrome by searching for it in Application Center or simply entering this into the terminal:
$ google-chrome

Vim

I use a terminal-based text editor a lot for interacting with Git or writing simple Bash/Python scripts. nano is a simple text editor that comes with a standard Ubuntu installation, but I prefer Vim as it is more powerful.
You can install Vim with the following commands:
$ sudo apt update
$ sudo apt install vim
And check if Vim is installed by:
$ vim --version
More on how to configure Vim and install plugins can be found here.

VS Code

Although I mostly use Jupyter Notebook (on a web browser) and terminal-based text editors for data science tasks, VS Code is still my top choice for general programming.
You can install VS Code with the following commands:
$ sudo apt-get update
$ sudo apt install software-properties-common apt-transport-https wget
$ wget -q https://packages.microsoft.com/keys/microsoft.asc -O- | sudo apt-key add -
$ sudo add-apt-repository "deb [arch=amd64] https://packages.microsoft.com/repos/vscode stable main"
$ sudo apt install code
And verify the installation by:
$ code --version
I have seen a lot of people opening VS Code by typing $ code into the terminal and using the GUI to select the directory that they want to work with. It will be a lot faster if you navigate to the directory using a terminal first, then open VS Code by:
$ code .

Virtual environment

You are probably familiar with the concept of virtual environments if you work with Python a lot. If not, check out this article.
There are a lot of ways to manage environments. conda is a very popular package manager and environment management system, but I prefer handling things manually with the Python venv module.
You can start by installing the python3-venv package:
$ sudo apt install python3-venv
Then, you can create a virtual environment by:
$ python3 -m venv /path/to/new/virtual/environment
Personally, I like storing the environment inside the working directory of each project. However, you can also store all environments in one directory (one that is easy to reach), so it’s more convenient to activate the right environment when switching between projects.
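For the second option, a few shell helpers can make the central directory easier to work with. This is only a sketch: `VENV_HOME`, `venv_path`, `mkvenv`, and `workon` are names I made up for illustration (they are not part of the venv module), and ~/.venvs is an assumed default location.

```shell
# Hypothetical central location for all environments; override by exporting VENV_HOME.
VENV_HOME="${VENV_HOME:-$HOME/.venvs}"

# Print the directory where the environment named $1 lives.
venv_path() {
  printf '%s/%s\n' "$VENV_HOME" "$1"
}

# Create the environment named $1 under VENV_HOME.
mkvenv() {
  mkdir -p "$VENV_HOME" && python3 -m venv "$(venv_path "$1")"
}

# Activate the environment named $1 in the current shell.
workon() {
  . "$(venv_path "$1")/bin/activate"
}
```

With these in your ~/.bashrc, `mkvenv my-project` creates an environment and `workon my-project` activates it from any directory; `deactivate` leaves it again. For an environment stored inside the project itself, activation is simply `. /path/to/new/virtual/environment/bin/activate`.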
Above is the basic setup of my workspace as a Data Science Intern. I would love to know how you set up your desktop differently in the response section. I will update along the way if my preference changes.