
Top 32 Bash Commands for Data Scientists

The command line is a valuable productivity tool for day-to-day data science work. As data scientists, we are adept at using Jupyter Notebooks and RStudio to obtain, scrub, explore, model, and interpret data (the OSEMN process). With tools ranging from Pandas to the Tidyverse, messy data can be handled effectively and turned into input for machine learning algorithms. However, simple operations such as sorting a dataframe or filtering rows on a condition, and even building data pipelines for large datasets, can often be performed just as quickly from the Bash command-line interface. First released in 1989, Bash (the Bourne Again Shell) is an essential part of a data scientist’s toolkit, yet it is rarely taught in data science bootcamps, Master’s programs, or online courses.

This article introduces you to the world of Bash beyond the commonly used basic commands, such as printing the working directory with pwd, changing directories with cd, listing the items in a folder with ls, copying with cp, moving with mv, and deleting with rm. By the end of the article, you will be familiar with the data wrangling commands that are available at your disposal from the Bash command line.

Bonus: A Cheatsheet has also been provided for quick reference to the commands and how they work! Some miscellaneous commands are included so make sure to check it out!

Basics of Bash

A Command-Line Interface (CLI) allows users to type commands that instruct the computer to perform specific tasks. A “shell” is a CLI, so called because it is the outer layer that wraps the inner core of the operating system, i.e., the kernel. The shell reads commands, interprets them, and directs the operating system to perform the corresponding tasks.

In a terminal, each command is typed after a prompt, which conventionally ends with a dollar sign ($). Only the $ sign is shown in the examples in this article, because the rest of the prompt is irrelevant to the actual commands: it changes when you move to another directory, and it can also be customized. A command follows this structure: command -options arguments.

32 General Bash Commands

Let’s look at some simple commands in Bash:

  1. which – outputs the full path of the command specified in the argument
  2. rm – permanently removes files (not directories)
  3. rm -rf – recursively and forcibly removes files and directories, permanently
  4. ls – lists directories and files
  5. mv – moves files and directories from one folder to another
  6. cp – copies files and directories from one folder to another
  7. mkdir – creates new directories
  8. curl – transfers data to or from a server using protocols such as FTP, HTTP, HTTPS and SFTP
  9. top – monitors running processes and memory being used by them
  10. cd – changes the current directory
  11. pwd – prints the path of the working directory
  12. sudo – allows executing the command as a superuser
  13. history – prints the list of past commands executed in Bash
  14. clear – clears the screen
  15. find – finds the files that have specific characteristics mentioned as an argument to the command
  16. man – displays the user manual of any command
  17. type – identifies whether a command is a shell built-in, alias, keyword, function, or executable file
  18. pip – installs Python packages, by default from PyPI
  19. tar – archiving utility that bundles files into an archive and can also compress them
  20. whoami – displays the name of the current user (the account used to log in)
  21. hostname – displays the hostname (name of the machine); on Linux, hostname -i displays its IP address instead
  22. date – displays the date and time
  23. cal – displays the calendar with the current date highlighted as per system settings
  24. uname -r – displays the kernel release (version)
  25. uptime – displays how long the system has been running
  26. reboot – reboots the system
  27. free – shows the amount of free and used-up memory space
  28. df – shows the amount of disk space available
  29. exit – used to exit the terminal
  30. echo $0 – displays the name of the current shell (or of the script being run)
  31. lscpu – shows the CPU details
  32. cat /etc/shells – displays all available shells in the system, such as Bourne shell (sh), Korn shell (ksh), Bash, C shell, etc.
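
A few of the file-handling commands above can be tried safely inside a throwaway directory. This is only a minimal sketch: the directory names raw and processed and the file data.csv are made up for illustration.

```shell
# Work in a temporary directory so nothing important is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
pwd                               # print the working directory

mkdir raw processed               # create two directories (hypothetical names)
echo 'ID,Name' > raw/data.csv     # create a small file (hypothetical content)
cp raw/data.csv raw/backup.csv    # copy the file
mv raw/data.csv processed/        # move the file to another directory
ls raw processed                  # list both directories
find . -name '*.csv'              # find files by name pattern
type cd                           # report what kind of command cd is

rm raw/backup.csv                 # remove a single file
rm -rf raw                        # remove a directory and everything in it
```

Note that rm and rm -rf do not move anything to a trash folder, so it is worth double-checking the path before running them on real data.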

Data Processing Bash Commands

Let us review the suite of Bash commands that data scientists use for processing data. We will use two datasets. The first is Apple’s 40-year stock price history, from January 1, 1981, to December 31, 2020, which can be downloaded from Yahoo Finance. The second is the small custom dataset shown below.

Custom Dataset

1. wc command

The wc (word count) command returns the number of lines, words, and bytes in a file, in that order. For plain ASCII text, the byte count equals the character count.

Print number of lines, words, and characters of a file using wc command

If you do not want the file path to appear in the result, feed the file to wc through the input redirection operator “<”.

Print a file’s number of lines, words, and characters without showing the file path using the wc command.

Furthermore, use the option -l to return only the number of lines, -w to return the number of words, and -m to return the number of characters.

Print the number of lines, words, and characters using wc command with specified options
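
The wc variants described above can be sketched as follows. The file students.csv and its rows are a hypothetical stand-in for the custom dataset: only the ID values come from this article, while the names, ages, and majors are invented for illustration.

```shell
# Hypothetical stand-in for the custom dataset (invented names, ages, majors).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > students.csv <<'EOF'
ID,Name,Age,Major
1312,John Smith,24,Statistics
7891,Sara Lee,31,Physics
9112,john doe,29,Biology
2236,Maya Patel,45,Economics
4561,Amy Johnson,22,Mathematics
EOF

wc students.csv          # lines, words, bytes, followed by the file name
wc < students.csv        # same counts, without the file name
wc -l < students.csv     # number of lines only
wc -w < students.csv     # number of words only
wc -m < students.csv     # number of characters only

# Capture a count in a variable for later use.
line_count=$(wc -l < students.csv)
echo "students.csv has $line_count lines"
```

Here the file has 6 lines (one header plus five rows) and 11 whitespace-separated words, since commas do not separate words for wc.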

2. head command

The head command returns the first 10 lines of a file by default.

Print the first 10 lines of the file using head command

To see the first n lines, pass in the -n option followed by the number of lines.

Print the first 2 lines of the file using head command

Print the first 5 lines of the file using head command
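
As a minimal sketch of head, assuming the same hypothetical students.csv stand-in for the custom dataset (invented names, ages, and majors):

```shell
# Hypothetical stand-in for the custom dataset.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > students.csv <<'EOF'
ID,Name,Age,Major
1312,John Smith,24,Statistics
7891,Sara Lee,31,Physics
9112,john doe,29,Biology
2236,Maya Patel,45,Economics
4561,Amy Johnson,22,Mathematics
EOF

head students.csv         # up to the first 10 lines (here the whole file)
head -n 2 students.csv    # header plus the first data row
head -n 5 students.csv    # first 5 lines
```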

3. tail command

The tail command returns the last 10 lines of a file by default.

Print the last 10 lines of the file using tail command

As with the head command, to see the last n lines, pass in the -n option.

Print the last 3 lines of the file using tail command
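
A short sketch of tail on the same hypothetical students.csv stand-in (invented names, ages, and majors). The last command goes slightly beyond the text above: a plus sign before the number makes tail start from that line, which is a handy way to skip a header row.

```shell
# Hypothetical stand-in for the custom dataset.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > students.csv <<'EOF'
ID,Name,Age,Major
1312,John Smith,24,Statistics
7891,Sara Lee,31,Physics
9112,john doe,29,Biology
2236,Maya Patel,45,Economics
4561,Amy Johnson,22,Mathematics
EOF

tail students.csv         # up to the last 10 lines (here the whole file)
tail -n 3 students.csv    # last 3 lines
tail -n +2 students.csv   # everything from line 2 onward, i.e. without the header
```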

4. cat command

cat (short for concatenate) is a multi-purpose command used for creating files and viewing file contents, among other things. To view an existing file’s contents, simply pass the file path as the argument; cat returns every line in the file.

Displaying first 5 lines of entire file output returned by cat command

To keep track of the line numbers, use the -n option.

Displaying first 5 lines of entire file output returned by cat command, showing line numbers using -n option

To create a new file, use the output redirection operator “>” after cat and specify the file name. Type your content on the lines that follow, then press CTRL+D on an empty line to signal the end of input and return to the prompt.

Create a new file and add content using cat command

To append to an existing file, use the append operator “>>” after cat and specify the file name. As before, type the content to be appended, and press CTRL+D to finish.

Append to an existing file using cat command

Display the contents of NewFile.csv after creation and appending using cat command
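
The create, append, and display steps above can be sketched non-interactively. In this sketch a here-document (the <<'EOF' ... EOF block) stands in for typing the content and pressing CTRL+D, and the rows written to NewFile.csv are invented for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create a new file; the here-document replaces interactive typing plus CTRL+D.
cat > NewFile.csv <<'EOF'
ID,Name
1,Ann
EOF

# Append one more (hypothetical) row to the existing file.
cat >> NewFile.csv <<'EOF'
2,Bob
EOF

cat NewFile.csv       # display the whole file
cat -n NewFile.csv    # display the whole file with line numbers
```

Keep in mind that “>” overwrites an existing file without asking, whereas “>>” always adds to the end.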

5. sort command

The sort command sorts the lines of a file. In plain ASCII order (the C locale), blanks come first, then digits, then uppercase letters, followed by lowercase letters; other locales may order characters differently.

Let us sort the custom dataset to understand this command. By default, sort works in ascending order and compares lines lexicographically, character by character, starting from the first character of each line (I, 1, 7, 9, 2, 4). Lexicographic sorting means that, for example, “29” comes before “4” and “5”, even though 29 is the larger number. Since we have a comma-separated file, however, we want sort to act on columns, starting with the first column (ID, 1312, 7891, 9112, 2236, 4561). Thus, we pass in the delimiter option -t with the comma as the delimiter; the -k option, covered further down, selects which column to sort on.

Sorting the first column using default sort command with -t option

Notice that the header row is shifted to the end, since uppercase letters come after digits in the ASCII order. To sort numerically, use the -n option, which compares values as numbers rather than character by character.

Sorting the first column numerically using sort command with -n option

Notice that the header row now stays at the top, because sort -n treats a non-numeric value such as “ID” as 0. To reverse sort the dataset, pass in the option -r. The output is the reverse of the numerical sorting output.

Reverse sorting the first column using sort command with -r option

To sort a particular column, pass in the -k option with the column number. Here, let us sort on age in ascending order.

Sorting age column using sort command with -k option

Let us also see an example of sorting non-numeric columns, such as the last column of “major”.

Sorting non-numeric column using sort command

The sort command can also sort month columns with the -M option, check whether the input is already sorted with the -c option, and sort while removing duplicates with the -u option.
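
The sorting variants discussed in this section can be sketched as below. The exact commands from the original screenshots are not reproduced here; this sketch assumes the hypothetical students.csv stand-in (invented names, ages, and majors), names the sort column explicitly with -k in every call, and sets LC_ALL=C so that the ordering follows plain ASCII as described above.

```shell
# Hypothetical stand-in for the custom dataset.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > students.csv <<'EOF'
ID,Name,Age,Major
1312,John Smith,24,Statistics
7891,Sara Lee,31,Physics
9112,john doe,29,Biology
2236,Maya Patel,45,Economics
4561,Amy Johnson,22,Mathematics
EOF

export LC_ALL=C                      # plain ASCII ordering, independent of system locale

sort -t, -k1,1 students.csv          # column 1, lexicographic: header ends up last
sort -t, -k1,1 -n students.csv       # column 1, numeric: header ("ID" counts as 0) comes first
sort -t, -k1,1 -n -r students.csv    # column 1, numeric, reversed
sort -t, -k3,3 -n students.csv       # column 3 (age), numeric ascending
sort -t, -k4,4 students.csv          # column 4 (major), lexicographic
```

Writing the key as -k3,3 rather than -k3 limits the comparison to that single column instead of everything from column 3 to the end of the line.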

6. tr command

The tr command stands for “translate”, and is used for translating and deleting characters. It reads only from standard input and shows the output on standard output.

Here, we will introduce the pipe operator “|” that passes the standard output of one command as standard input into another command, like a pipeline. Let us again use the custom dataset for understanding this command.

To convert uppercase characters to lowercase, pass in the first argument as “[:upper:]” and second argument as “[:lower:]”, and vice-versa. Alternatively, first argument can be “[A-Z]” and second one will then be “[a-z]”.

Converting uppercase characters to lowercase using tr command

To translate the comma-separated file into a tab-separated file, use tr command with first argument as “,” and second argument as “\t”.

Converting csv file format to a tsv file format using tr command

To delete a character in a file, use the delete option -d with the tr command. The operation is case-sensitive.

Deleting the character “S” using tr command with -d option

Notice that the character “S” is deleted from the entire file. Similarly, to remove all uppercase letters, use the character class “[:upper:]”; to remove all digits, use “[:digit:]”; and so on.

Deleting all digit characters using tr command with -d option

To delete everything except a character, use the complement option -c and the delete option -d with the tr command.

Delete all characters except uppercase letters using tr command with -c and -d options

To replace multiple consecutive occurrences of a character with a single occurrence, use the squeeze-repeats option -s with the tr command, giving only one argument as input.

Keeping a single character instance of “2” using tr command with -s option

To replace all single and multiple consecutive occurrences of a character with another character, use the squeeze-repeats option -s with the tr command, giving two arguments as input. Note that each run of repeated occurrences is replaced with a single instance of the new character.

Replacing all single and multiple occurrences of “2” with “h” using tr command with -s option
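
The tr operations above can be sketched as follows, again on the hypothetical students.csv stand-in (invented names, ages, and majors). Since tr reads only standard input, the file is supplied either through a pipe or with the “<” operator. The short string aa22b2 is made up to make the squeeze behaviour easy to see.

```shell
# Hypothetical stand-in for the custom dataset.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > students.csv <<'EOF'
ID,Name,Age,Major
1312,John Smith,24,Statistics
7891,Sara Lee,31,Physics
9112,john doe,29,Biology
2236,Maya Patel,45,Economics
4561,Amy Johnson,22,Mathematics
EOF

cat students.csv | tr '[:upper:]' '[:lower:]'   # uppercase to lowercase
tr ',' '\t' < students.csv                      # commas to tabs (CSV to TSV)
tr -d 'S' < students.csv                        # delete every "S"
tr -d '[:digit:]' < students.csv                # delete every digit
tr -cd '[:upper:]' < students.csv; echo         # keep only uppercase letters (newlines are deleted too)

echo 'aa22b2' | tr -s '2'                       # squeeze runs of "2"     -> aa2b2
echo 'aa22b2' | tr -s '2' 'h'                   # replace, then squeeze   -> aahbh
```

Quoting the arguments keeps the shell from interpreting the brackets and the backslash before tr sees them.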

7. paste command

The paste command joins two files horizontally using a tab delimiter by default.

The -d option can be used to specify a custom delimiter. Let’s concatenate the two datasets with a comma delimiter and see the first 6 rows using the pipe operator “|”.

Concatenating two datasets horizontally using the paste command with -d option specified as a comma
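
To illustrate paste without the two original datasets, here is a minimal sketch with two tiny hypothetical files, left.csv and right.csv, whose contents are invented for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two small hypothetical files with the same number of lines.
printf '%s\n' 'ID,Name' '1312,John' '7891,Sara' > left.csv
printf '%s\n' 'Age,Major' '24,Statistics' '31,Physics' > right.csv

paste left.csv right.csv                    # joined line by line with a tab
paste -d, left.csv right.csv | head -n 6    # joined with a comma, first 6 rows only
```

paste matches lines purely by position, so both files need to be in the same row order for the result to make sense.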

8. uniq command

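The uniq command detects and filters out duplicate rows in a file. It only compares adjacent lines, so duplicates that are not next to each other are detected only if the input is sorted first.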

Let us use the custom dataset, and append two duplicate lines to the file first.

Appending duplicate rows to file using append operator “>>” with cat command

This is what the dataset looks like now.

View of data file after adding duplicate line items

Now, let’s count how many times each line occurs using the -c option.

Find the count of each line item using uniq command with -c option

Notice that the detection of duplicate entries is case-sensitive. To ignore case, use the -i option.

Find the count of each line item ignoring case using uniq command with -c and -i options

Other valuable options with the uniq command include the -u option, which returns only the lines that are not repeated, and the -d option, which returns only the duplicated lines.
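
The uniq options can be sketched on a small hypothetical file, dupes.csv, whose rows are invented so that it contains one exact duplicate and one duplicate that differs only in case.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical rows: lines 1 and 2 are identical, line 3 differs only in case.
cat > dupes.csv <<'EOF'
1312,John Smith,24,Statistics
1312,John Smith,24,Statistics
1312,john smith,24,statistics
7891,Sara Lee,31,Physics
EOF

uniq dupes.csv          # adjacent duplicates collapsed: 3 lines remain
uniq -c dupes.csv       # each line prefixed with its count
uniq -c -i dupes.csv    # counts when case is ignored
uniq -u dupes.csv       # only lines that are not repeated
uniq -d dupes.csv       # only lines that are repeated

sort dupes.csv | uniq   # sort first when duplicates may not be adjacent
```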

9. grep command

grep stands for “global regular expression print” and is a standard Unix utility, available from Bash, for searching for lines that match a regular expression.

Let us search for all rows which have “John” using grep.

Searching for line items matching regular expression using grep command

Since grep is case-sensitive by default, we can use the -i option to ignore case when matching.

Searching for line items matching regular expression (case-insensitive) using grep command with -i option

The number of lines containing “John” can be returned using the count option -c.

Counting number of lines matching regular expression using grep command with -c option

To match whole words instead of a substring using grep command, use the word option -w. To demonstrate, let’s first append a new row using cat command.

Appending a new row in data file using append operator “>>” with cat command

Let’s see the default output of grep command.

Default search for regular expression using grep command

To match only the whole word “ohn” rather than every substring that contains it, let’s now use the word option -w.

Searching for whole words of regular expression using grep command

To keep track of the line numbers of line items returned by grep command, use the -n option.

Print line numbers for lines matched by regular expression using grep command with -n option
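
The grep options above can be sketched on the hypothetical students.csv stand-in (invented names, ages, and majors). The row appended in the original screenshots is not reproduced here; instead, this sketch shows the same whole-word idea with “John” versus “Johnson”.

```shell
# Hypothetical stand-in for the custom dataset.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > students.csv <<'EOF'
ID,Name,Age,Major
1312,John Smith,24,Statistics
7891,Sara Lee,31,Physics
9112,john doe,29,Biology
2236,Maya Patel,45,Economics
4561,Amy Johnson,22,Mathematics
EOF

grep 'John' students.csv       # matches "John Smith" and "Amy Johnson"
grep -i 'john' students.csv    # also matches "john doe"
grep -c 'John' students.csv    # number of matching lines
grep -w 'John' students.csv    # whole word only: "Johnson" no longer matches
grep -n 'John' students.csv    # prefix each match with its line number
```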

10. cut command

The cut command is used to cut out and extract sections from each line of a file.

The field option -f is used to return a particular column. Field numbering starts from 1, not 0.

Return a field in a data file using cut command with the -f option

As we can see, cut isn’t able to identify the columns, because its default delimiter is a tab rather than a comma. The delimiter option -d must therefore be used in conjunction with -f.

Return a field in a data file using cut command with the -d and -f options

Let’s look at a more complex example of the cut command using the Apple stock prices dataset. Specifically, we want to see the columns Date, High, Low, and Volume for the first 10 data rows in the file. To do this, we first get the first 11 rows (including the header) using the head command, and pipe its standard output into the cut command. Note that the field option -f accepts multiple column numbers as input.

Returning a subset of rows and columns using head and cut commands with pipe operator
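
The cut examples can be sketched as follows. Both files are hypothetical: students.csv is the invented stand-in for the custom dataset, and prices.csv uses placeholder values, not real prices. Its header assumes the usual column layout of a Yahoo Finance historical-data download (Date, Open, High, Low, Close, Adj Close, Volume), so the field numbers would need adjusting if the real file differs.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for the custom dataset.
cat > students.csv <<'EOF'
ID,Name,Age,Major
1312,John Smith,24,Statistics
7891,Sara Lee,31,Physics
EOF

# Placeholder rows only; the column layout is an assumption.
cat > prices.csv <<'EOF'
Date,Open,High,Low,Close,Adj Close,Volume
2020-01-02,1.00,2.00,0.50,1.50,1.50,100
2020-01-03,1.50,2.50,1.00,2.00,2.00,200
EOF

cut -f 2 students.csv         # no tabs in the file, so whole lines are printed
cut -d, -f 2 students.csv     # second comma-separated column (Name)

# Header plus first 10 data rows, keeping Date, High, Low, and Volume.
head -n 11 prices.csv | cut -d, -f 1,3,4,7
```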

Other Useful Bash Commands

The special “!$” token in Bash designates the last argument of the preceding command. CTRL + R is used for reverse searching for commands through the Bash command history.

To understand it better, let’s see an example of “!$”.

Returning standard output of data file using cat command

Now, say we want to look at only the first 3 rows in the file. Instead of typing the entire file path again in the new command, we can type “head -3 !$”. The “!$” token is automatically expanded to the path.

Printing the first 3 rows using the special “!$” token with head command

CTRL + R (reverse-i-search) is useful for finding an old, long command that you want to bring up again. The search runs backward through the history, starting from the most recent matching command, and pressing CTRL + R again moves to earlier matches. The characters you type are matched incrementally against the previous commands.

Reverse searching for grep commands in the session history

Reverse searching for grep command with -i option having “John” value in the session history

The entire command history of the session can be listed with the history command if a manual search is needed.

Bash Commands CheatSheet

Essential of Bash Commands - Cheat Sheet

Conclusion

The Bash command line is a very useful tool for quick data analysis without launching an integrated development environment. All of these commands become more powerful when combined with Bash’s input/output redirection (“<”, “>”, “>>”) and pipe (“|”) operators. Experiment with these utilities, find efficient ways of wrangling your data, and get your hands dirty to make the most of Bash for your data needs!
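
As one possible illustration of combining pipes and redirection, the sketch below counts how often each major appears and saves the result to a file. The data in students.csv and the output file name major_counts.txt are hypothetical and chosen only for this example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical data with a repeated major.
cat > students.csv <<'EOF'
ID,Name,Age,Major
1312,John Smith,24,Statistics
7891,Sara Lee,31,Physics
9112,john doe,29,Statistics
2236,Maya Patel,45,Biology
4561,Amy Johnson,22,Statistics
EOF

# Skip the header, keep the Major column, sort so duplicates are adjacent,
# count each value, order by count (largest first), and save to a file.
tail -n +2 students.csv | cut -d, -f 4 | sort | uniq -c | sort -rn > major_counts.txt

cat major_counts.txt
```

Each stage does one small job, which makes it easy to build the pipeline up one command at a time and inspect the intermediate output along the way.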

Frequently Asked Questions

Q1. What are Bash commands?

A. Bash commands are executed in the Bash shell, a popular command-line interface and scripting language used in Unix-like operating systems. They allow users to interact with the operating system, execute programs, manipulate files, and perform various system-related tasks.

Q2. What is the list command for Bash?

A. The “ls” command is used in Bash to list the contents of a directory. It displays the files and directories in the current working directory or any specified directory; with the -l option, it also shows details such as permissions, size, and timestamps.

Q3. How do I start a Bash command?

A. To start a Bash command, you need to open a terminal or command prompt, depending on your operating system. Once the terminal is open, type the command and press the Enter key to execute it.

Q4. What is Git Bash used for?

A. Git Bash is a command-line interface for Windows that provides a Unix-like environment, including a Bash shell and a collection of Git commands. It allows Windows users to use the Git version control system and execute Bash commands, making it easier to work with Git repositories and perform command-line operations.

guest_blog

02 Jun 2023
