Learn SQL via the Command-Line Shell

Structured Query Language (SQL) databases are some of the most widely deployed pieces of software around the world. One of the most popular database engines is SQLite. Unlike others, SQLite does not implement a separate server process. Instead, it reads and writes directly to disk files. A complete SQL database with tables, indices etc. is therefore self-contained in a single file abbreviated with the .sqlite extension. The resulting file is cross-platform compatible, and can thus be deployed on different machines with varying architectures.

The SQLite project provides a command-line tool, that can be used to interact with a SQLite database. In this article, we will make use of it to practice some commonly used SQL commands.

Installation

Ubuntu

$ sudo apt install sqlite3
$ sqlite3 -version
## 3.37.2 2022-01-06 13:25:41 872ba256cbf61d9290b571c0e6d82a20c224ca3ad82971edc46b29818d5dalt1

Windows

Users can download the appropriate precompiled binaries from the official website. Double-clicking the .exe file should start the terminal window, and the rest of the commands will be similar as shown below.

Sample database

For our use, we will download the NIPS (Neural Information Processing System) papers database from Kaggle. The data and associated files can also be found on my GitHub repository.

Practicing SQL

Loading the database

sqlite> .open nips_papers.sqlite
sqlite> .databases
main: /home/vikas/Desktop/Julia_training/SQL_data_analysis/nips_papers.sqlite r/w

The above example shows the use of a dot-command (.open, .databases etc.), which is used by the sqlite3 program to perform special actions. The full list can be seen using the .help command.

Check schema

sqlite> .schema
CREATE TABLE papers (
    id INTEGER PRIMARY KEY,
    year INTEGER,
    title TEXT,
    event_type TEXT,
    pdf_name TEXT,
    abstract TEXT,
    paper_text TEXT);
CREATE TABLE authors (
    id INTEGER PRIMARY KEY,
    name TEXT);
CREATE TABLE paper_authors (
    id INTEGER PRIMARY KEY,
    paper_id INTEGER,
    author_id INTEGER);
CREATE INDEX paperauthors_paperid_idx ON paper_authors (paper_id);
CREATE INDEX paperauthors_authorid_idx ON paper_authors (author_id);

As we can see, our database contains three tables - papers, authors and paper_authors. Individual fields and their data types are also listed.

Change output format

We can view the current output format, and change it to a new one using the .mode command. I find the table mode to be the most useful for viewing query results in the terminal.

sqlite> .mode
current output mode: list
sqlite> .mode table
sqlite> .mode
current output mode: table

Select rows from a table

sqlite> SELECT year, title FROM papers LIMIT 5;
+------+---------------------------------------------------------------------+
| year |                                title                                |
+------+---------------------------------------------------------------------+
| 1987 | Self-Organization of Associative Database and Its Applications      |
| 1987 | The Capacity of the Kanerva Associative Memory is Exponential       |
| 1987 | Supervised Learning of Probability Distributions by Neural Networks |
| 1987 | Constrained Differential Optimization                               |
| 1987 | Towards an Organizing Principle for a Layered Perceptual Network    |
+------+---------------------------------------------------------------------+

Note that SQL commands are NOT case-sensitive. It's just convenient to use UPPER CASE so that it becomes easy to distinguish the commands from other text. The LIMIT command simply limits the number of lines that are displayed in the query result (5 in this case).

Select rows with a condition

Show year, title only for the year 2017:

sqlite> SELECT year, title FROM papers WHERE year = 2017 LIMIT 5;
+------+---------------------------------------------------------------------------------------------+
| year |                                            title                                            |
+------+---------------------------------------------------------------------------------------------+
| 2017 | Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning                |
| 2017 | Concentration of Multilinear Functions of the Ising Model with Applications to Network Data |
| 2017 | Deep Subspace Clustering Networks                                                           |
| 2017 | Attentional Pooling for Action Recognition                                                  |
| 2017 | On the Consistency of Quick Shift                                                           |
+------+---------------------------------------------------------------------------------------------+

Commands can also be spread across multiple lines for better readability. The query can be terminal using the semi-colon (;).

sqlite> SELECT year, title FROM papers
   ...> WHERE year = 2010
   ...> LIMIT 5;
+------+---------------------------------------------------------------------------------------+
| year |                                         title                                         |
+------+---------------------------------------------------------------------------------------+
| 2010 | CUR from a Sparse Optimization Viewpoint                                              |
| 2010 | Getting lost in space: Large sample analysis of the resistance distance               |
| 2010 | Auto-Regressive HMM Inference with Incomplete Data for Short-Horizon Wind Forecasting |
| 2010 | Learning via Gaussian Herding                                                         |
| 2010 | Smoothness, Low Noise and Fast Rates                                                  |
+------+---------------------------------------------------------------------------------------+

Select rows with a condition over a range

sqlite> SELECT year, title FROM papers
   ...> WHERE year BETWEEN 2010 AND 2011;
+------+---------------------------------------------------------------------------------------------------------------------------------------------------+
| year |                                                                       title                                                                       |
+------+---------------------------------------------------------------------------------------------------------------------------------------------------+
| 2010 | CUR from a Sparse Optimization Viewpoint                                                                                                          |
| 2010 | Getting lost in space: Large sample analysis of the resistance distance                                                                           |
| 2010 | Auto-Regressive HMM Inference with Incomplete Data for Short-Horizon Wind Forecasting 
...
| 2011 | Action-Gap Phenomenon in Reinforcement Learning                                                                                                   |
| 2011 | Divide-and-Conquer Matrix Factorization                                                                                                           |
| 2011 | Contextual Gaussian Process Bandit Optimization                                                                                                   |
+------+---------------------------------------------------------------------------------------------------------------------------------------------------+

Select rows with matching strings

sqlite> SELECT year, title FROM papers
   ...> WHERE title LIKE '%matrix%'
   ...> OR    title LIKE '%Kanerva%'
   ...> LIMIT 10;
+------+-----------------------------------------------------------------------------------------------------------------------------------+
| year |                                                               title                                                               |
+------+-----------------------------------------------------------------------------------------------------------------------------------+
| 1987 | The Capacity of the Kanerva Associative Memory is Exponential                                                                     |
| 1987 | Capacity for Patterns and Sequences in Kanerva's SDM as Compared to Other Associative Memory Models                               |
| 1987 | An Optimization Network for Matrix Inversion                                                                                      |
| 1988 | GEMINI: Gradient Estimation Through Matrix Inversion After Noise Injection                                                        |
| 1988 | Statistical Prediction with Kanerva's Sparse Distributed Memory                                                                   |
| 1989 | Neural Network Weight Matrix Synthesis Using Optimal Control Techniques                                                           |
| 1989 | Predicting Weather Using a Genetic Memory: A Combination of Kanerva's Sparse Distributed Memory with Holland's Genetic Algorithms |
| 1993 | Monte Carlo Matrix Inversion and Reinforcement Learning                                                                           |
| 1998 | A High Performance k-NN Classifier Using a Binary Correlation Matrix Memory                                                       |
| 2000 | Algorithms for Non-negative Matrix Factorization                                                                                  |
+------+-----------------------------------------------------------------------------------------------------------------------------------+

Reading queries from a file

Queries can be grouped together in a file, and read directly by the sqlite3 program. This makes it easy to execute a long sequence of commands without having to retype everything. For example, group_by_year.sql contains the following queries:

$ cat group_by_year.sql 
SELECT COUNT(title), year
FROM papers
WHERE year BETWEEN 2010 AND 2015
GROUP BY year;

Here, we make use of GROUPBY to count the number of paper titles per year between 2010 and 2015. Let's see what happens when we run the input file using the .read command:

sqlite> .read group_by_year.sql
+--------------+------+
| COUNT(title) | year |
+--------------+------+
| 292          | 2010 |
| 306          | 2011 |
| 368          | 2012 |
| 360          | 2013 |
| 411          | 2014 |
| 403          | 2015 |
+--------------+------+

Additional filtering using HAVING

Let's add HAVING COUNT(title) > 400 to the previous query.

sqlite> .read group_by_year.sql
+--------------+------+
| COUNT(title) | year |
+--------------+------+
| 411          | 2014 |
| 403          | 2015 |
+--------------+------+

Queries across multiple tables

For similar columns in different tables, various types of JOINS can be used to combine and filter data. This can be easily seen by once again checking the schema of the database.

sqlite> .schema
CREATE TABLE papers (
    id INTEGER PRIMARY KEY,
    year INTEGER,
    title TEXT,
    event_type TEXT,
    pdf_name TEXT,
    abstract TEXT,
    paper_text TEXT);
CREATE TABLE authors (
    id INTEGER PRIMARY KEY,
    name TEXT);
CREATE TABLE paper_authors (
    id INTEGER PRIMARY KEY,
    paper_id INTEGER,
    author_id INTEGER);
CREATE INDEX paperauthors_paperid_idx ON paper_authors (paper_id);
CREATE INDEX paperauthors_authorid_idx ON paper_authors (author_id);

papers table contains the paper IDs, but the corresponding author IDs are listed in another table paper_authors. Names of the authors are present in authors table. To match author names to paper titles and year = 2012, we can try the following join:

$ cat find_authors.sql 
SELECT papers.year, papers.title, paper_authors.author_id, authors.name
FROM papers
INNER JOIN paper_authors ON papers.id = paper_authors.paper_id
INNER JOIN authors ON authors.id = paper_authors.author_id
WHERE year = 2012
LIMIT 10;

sqlite> .read find_authors.sql
+------+----------------------------------------------------------------------------------+-----------+--------------------+
| year |                                      title                                       | author_id |        name        |
+------+----------------------------------------------------------------------------------+-----------+--------------------+
| 2012 | Locally Uniform Comparison Image Descriptor                                      | 5507      | Andrew Ziegler     |
| 2012 | Locally Uniform Comparison Image Descriptor                                      | 5508      | Eric Christiansen  |
| 2012 | Locally Uniform Comparison Image Descriptor                                      | 5509      | David Kriegman     |
| 2012 | Locally Uniform Comparison Image Descriptor                                      | 2723      | Serge J. Belongie  |
| 2012 | Learning from Distributions via Support Measure Machines                         | 5654      | Krikamol Muandet   |
| 2012 | Learning from Distributions via Support Measure Machines                         | 1361      | Kenji Fukumizu     |
| 2012 | Learning from Distributions via Support Measure Machines                         | 5655      | Francesco Dinuzzo  |
| 2012 | Learning from Distributions via Support Measure Machines                         | 1472      | Bernhard Sch?lkopf |
| 2012 | Finding Exemplars from Pairwise Dissimilarities via Simultaneous Sparse Recovery | 4951      | Ehsan Elhamifar    |
| 2012 | Finding Exemplars from Pairwise Dissimilarities via Simultaneous Sparse Recovery | 3341      | Guillermo Sapiro   |
+------+----------------------------------------------------------------------------------+-----------+--------------------+

Additional filtering can easily be used, for example adding WHERE year = 2012 AND title LIKE '%Distributions%' to find_authors.sql gives the following result:

sqlite> .read find_authors.sql
+------+----------------------------------------------------------------------------------------------------+-----------+--------------------+
| year |                                               title                                                | author_id |        name        |
+------+----------------------------------------------------------------------------------------------------+-----------+--------------------+
| 2012 | Learning from Distributions via Support Measure Machines                                           | 5654      | Krikamol Muandet   |
| 2012 | Learning from Distributions via Support Measure Machines                                           | 1361      | Kenji Fukumizu     |
| 2012 | Learning from Distributions via Support Measure Machines                                           | 5655      | Francesco Dinuzzo  |
| 2012 | Learning from Distributions via Support Measure Machines                                           | 1472      | Bernhard Sch?lkopf |
| 2012 | Optimal Neural Tuning Curves for Arbitrary Stimulus Distributions: Discrimax, Infomax and Minimum  | 6476      | Zhuo Wang          |
| 2012 | Optimal Neural Tuning Curves for Arbitrary Stimulus Distributions: Discrimax, Infomax and Minimum  | 1786      | Alan A. Stocker    |
| 2012 | Optimal Neural Tuning Curves for Arbitrary Stimulus Distributions: Discrimax, Infomax and Minimum  | 6477      | Daniel D. Lee      |
+------+----------------------------------------------------------------------------------------------------+-----------+--------------------+

Check for the existence of a record

We can use the EXIST operator as shown below:

$ cat check_if_exists.sql 
SELECT year, title, id
FROM papers
WHERE EXISTS (SELECT author_id FROM paper_authors WHERE papers.id = paper_authors.paper_id AND 
author_id BETWEEN 1001 AND 5001)
LIMIT 10;

sqlite> .read check_if_exists.sql
+------+------------------------------------------------------------------------------------------------------+-----+
| year |                                                title                                                 | id  |
+------+------------------------------------------------------------------------------------------------------+-----+
| 1987 | PATTERN CLASS DEGENERACY IN AN UNRESTRICTED STORAGE DENSITY MEMORY                                   | 21  |
| 1988 | Fixed Point Analysis for Recurrent Networks                                                          | 181 |
| 1988 | Does the Neuron "Learn" like the Synapse?                                                            | 184 |
| 1990 | Can neural networks do better than the Vapnik-Chervonenkis bounds?                                   | 401 |
| 1991 | A Contrast Sensitive Silicon Retina with Reciprocal Synapses                                         | 466 |
| 1991 | Estimating Average-Case Learning Curves Using Bayesian, Statistical Physics and VC Dimension Methods | 489 |
| 1992 | Automatic Learning Rate Maximization by On-Line Estimation of the Hessian's Eigenvectors             | 589 |
| 1992 | A Neural Network that Learns to Interpret Myocardial Planar Thallium Scintigrams                     | 600 |
| 1992 | Using Aperiodic Reinforcement for Directed Self-Organization During Development                      | 606 |
| 1992 | Biologically Plausible Local Learning Rules for the Adaptation of the Vestibulo-Ocular Reflex        | 704 |
+------+------------------------------------------------------------------------------------------------------+-----+

Union

The UNION operator can be used to combine columns from different tables. The following conditions must also be satisfied:

Every SELECT statement must have the same number of columns
Data types of the columns must be the same
Columns in every SELECT statement must be in the same order

sqlite> SELECT id FROM papers
   ...> UNION
   ...> SELECT paper_id FROM paper_authors
   ...> LIMIT 10;
+----+
| id |
+----+
| 1  |
| 2  |
| 3  |
| 4  |
| 5  |
| 6  |
| 7  |
| 8  |
| 9  |
| 10 |
+----+

Conclusion

There are a lot more SQL operations that can be used to construct more complex queries. I found this guide to be very useful. It's also possible to read SQLite data directly into a DataFrame object. I will explore that in another post.

I hope you learned something new today. Don't forget to bookmark this post and share it within your network. Thank you for your time!