If you have ever dreaded ‘breaking the build,’ a version control system (VCS) is the safety net you never knew you needed. Several exist, but this article is written with Git in mind.

While many Git tutorials exist, I try to do something a bit different, so I hope this article offers a fresh perspective. It may not be obvious why version control systems matter, so let’s start with that.

Motivation

Let’s attempt to solve the following problem:

Define a snapshot as a version of the codebase that users may want to revisit.

Given the assumption that the changes between snapshots are not too drastic, how would you ensure that:

1. All the snapshots are stored in a space-efficient manner.
2. You can revisit any previous snapshot, i.e. move backward from a later snapshot m to an earlier one n.
3. You can advance to a more recent snapshot, i.e. move forward from an earlier snapshot m to a later one n.

Naive Approach

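Store all the snapshots in full. Moving between snapshots is then trivial, but every snapshot costs as much space as the whole codebase, even when only a few lines changed. This approach does not scale well.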

A More Calculated Approach

Let’s focus on the first problem for now. A much better approach is to store the initial snapshot in full and then only the changes between consecutive snapshots. This is more efficient because of the assumption that changes between snapshots are not too drastic.

For convenience, let’s define a patch as a format that stores the differences between two consecutive snapshots, and patch application as applying the changes contained in a patch.

With these definitions in mind, this approach can be represented by the following diagram:

Diagram showing snapshots and patches in version control

Note that this also solves Problem 3: We can obtain Snapshot n from Snapshot m, where m < n, by applying patches m+1..n.

For example, we can obtain Snapshot 2 from Snapshot 0 by applying patches 1 and 2 as shown below:

Diagram showing patch application from Snapshot 0 to 2
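
To make the idea concrete, here is a minimal Python sketch of storing one full snapshot plus patches. The patch format (a list of "replace these lines with those lines" hunks built with the standard library’s difflib) and the names make_patch and apply_patch are my own illustration, not Git’s actual format.

```python
from difflib import SequenceMatcher


def make_patch(old, new):
    """Record only the regions where `new` differs from `old` (lists of lines)."""
    patch = []
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # "Replace old[i1:i2] with these lines" also covers pure inserts and deletes.
            patch.append((i1, i2, new[j1:j2]))
    return patch


def apply_patch(snapshot, patch):
    """Return the snapshot that results from applying `patch` to `snapshot`."""
    result = list(snapshot)
    # Apply hunks from the bottom up so earlier line positions stay valid.
    for i1, i2, lines in reversed(patch):
        result[i1:i2] = lines
    return result


# Hypothetical history: three snapshots of one small file.
snapshots = [
    ["def greet():", "    print('hi')"],
    ["def greet(name):", "    print('hi')"],
    ["def greet(name):", "    print('hi', name)", "", "greet('Ada')"],
]

# Store Snapshot 0 in full, then only patch 1 and patch 2.
stored_base = snapshots[0]
stored_patches = [make_patch(a, b) for a, b in zip(snapshots, snapshots[1:])]

# Move forward from Snapshot 0 to Snapshot 2 by applying patches 1 and 2 in order.
current = stored_base
for patch in stored_patches:
    current = apply_patch(current, patch)
```

Only stored_base and stored_patches need to be kept; every later snapshot can be rebuilt from them.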

Now, that leaves us with Problem 2. We can modify our current solution by making sure the patches store the changes in a reversible format, i.e., we can reverse the changes.

This approach is represented by:

Diagram showing reversible patches (rpatches)

If we call reversing the changes contained in a patch applying an rpatch (shorthand for reverse patch, not official terminology), this solves Problem 2 as follows: We can obtain Snapshot n from Snapshot m, where m > n, by applying rpatches m, m-1, ..., n+1, in that order.

For example, if we are at Snapshot 3 and want to go back to Snapshot 1, we apply rpatch 3 and then rpatch 2:

Diagram showing patch reversal from Snapshot 3 to 1
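
Here is a rough Python sketch of what a reversible patch could look like. The trick is that each hunk keeps both the lines it removes and the lines it adds, so it can be applied in either direction. The hunk layout and the names make_patch, apply_patch and apply_rpatch are assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def make_patch(old, new):
    """Build a reversible patch: each hunk keeps the removed AND the added lines."""
    patch = []
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            patch.append({"old_at": i1, "old": old[i1:i2],
                          "new_at": j1, "new": new[j1:j2]})
    return patch


def apply_patch(snapshot, patch):
    """Move forward: swap each hunk's old lines for its new lines."""
    result = list(snapshot)
    for hunk in reversed(patch):  # bottom-up keeps positions valid
        start = hunk["old_at"]
        result[start:start + len(hunk["old"])] = hunk["new"]
    return result


def apply_rpatch(snapshot, patch):
    """Move backward: swap each hunk's new lines for its old lines."""
    result = list(snapshot)
    for hunk in reversed(patch):
        start = hunk["new_at"]
        result[start:start + len(hunk["new"])] = hunk["old"]
    return result


# Hypothetical history: Snapshots 0 to 3 of a one-file project.
snapshots = [
    ["a", "b", "c"],
    ["a", "b", "c", "d"],
    ["a", "B", "c", "d"],
    ["B", "c", "d", "e"],
]
# patches[k] turns Snapshot k-1 into Snapshot k.
patches = {k: make_patch(snapshots[k - 1], snapshots[k]) for k in range(1, 4)}

# Go back from Snapshot 3 to Snapshot 1: apply rpatch 3, then rpatch 2.
current = snapshots[3]
for k in range(3, 1, -1):
    current = apply_rpatch(current, patches[k])
```

The price of reversibility is that a hunk now stores the removed lines as well, so patches get somewhat larger.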

This is the classic VCS solution. There are some significant downsides with this approach, such as patch chain application being slow once there are many snapshots. So Git takes a different route: instead of storing a patch per commit, it stores each version of a file as a complete, compressed object, and a file that did not change between commits is stored only once.

To keep this from wasting space for larger files or repositories with many versions, Git periodically packs its objects using delta compression, which is basically the patch mechanism, but with a few caveats.

The assumption that changes between versions are not too drastic is important for Git’s efficiency. This makes Git less ideal for situations where files change significantly between versions, such as with large binary files.
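
To make the "complete copies" part concrete, here is a small Python sketch of a content-addressed store. The blob_id function follows the way Git names file contents in its default SHA-1 mode (a hash over a short header plus the bytes). ToyObjectStore and the dictionaries standing in for commits are hypothetical simplifications of mine; they leave out trees, commit objects and packing entirely.

```python
import hashlib
import zlib


def blob_id(content: bytes) -> str:
    """ID Git assigns to file contents in its default SHA-1 mode."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class ToyObjectStore:
    """Toy stand-in for Git's object database (illustration only)."""

    def __init__(self):
        self.objects = {}  # object id -> compressed content

    def put(self, content: bytes) -> str:
        oid = blob_id(content)
        # Identical content maps to the same ID, so it is stored only once.
        self.objects.setdefault(oid, zlib.compress(content))
        return oid

    def get(self, oid: str) -> bytes:
        return zlib.decompress(self.objects[oid])


store = ToyObjectStore()

# Two hypothetical snapshots; only app.py changes between them.
snapshot_1 = {"README": store.put(b"hello\n"), "app.py": store.put(b"print(1)\n")}
snapshot_2 = {"README": store.put(b"hello\n"), "app.py": store.put(b"print(2)\n")}

print(len(store.objects))  # three objects stored for four files
```

Each snapshot is just a mapping from path to object ID, so jumping to any snapshot is a direct lookup rather than a chain of patches.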

Now that we understand how snapshots and patches work, let’s look at the structure of a Git repository.

Structure of a Git Repository

A Git repository is simply a codebase tracked by Git. In Git terminology, our snapshots are called commits, each described by a commit message. Git divides the files in a repository into the following areas:

Staging Area:
The staging area contains changes that have been staged to be included in your next commit. This is implemented via the Git index.
Working Area:
The working area contains all the files in your project as they currently exist on disk, including unmodified files, modified but unstaged files, and untracked files. For our convenience, we will treat untracked files as a separate category and exclude them from our definition of the working area moving forward.
Untracked Area:
Not an official Git term; it refers to all new files that Git is not tracking yet, i.e., files that have never been staged or committed. These files are officially called untracked files.

These are the commands used to move files between the various areas:

1. `git add`

This moves modified and untracked files to the staging area. It’s straightforward but powerful.

git add {FILEPATH}/{FILE}   # Stage a single file
git add {DIRPATH}           # Stage all files in a directory

2. `git restore`

This discards unstaged changes to tracked files by restoring them from the staging area, which matches the latest commit if nothing is staged for that file. It does nothing for untracked files.

With the --staged flag, it moves modified files from the staging area back to the working area, unstaging them without touching their contents on disk. For untracked files that were staged, it reverts them back to the untracked area, i.e., changes them back to be untracked files.

git restore [FLAGS] {FILEPATH}/{FILE}           # Discard unstaged changes to a file
git restore [FLAGS] {DIRPATH}                   # Discard unstaged changes to all files in a directory
git restore [FLAGS] --staged {FILEPATH}/{FILE}  # Unstage a file

3. `git clean`

Removes files in the untracked area. Use

-n for a dry run (see what would be removed),
-i for interactive prompts,
-d to also remove untracked directories,
-f to actually delete files.

git clean [FLAGS] [UNTRACKED_FILES_PATH]

4. `git commit`

Finalizes your staged changes by recording them in a new commit. By default, this opens your configured text editor for a commit message.

Use the -m flag to provide a message directly from the command line:

git commit # Opens your configured text editor for entering the commit message
git commit -m "Your commit message" # Uses "Your commit message" directly

Note: Ignoring Files

There are files you’ll never want to commit (e.g., build outputs, temporary files). To ensure they’re always ignored, add their patterns to a .gitignore file in your repository and commit this file. Git will then treat matching untracked files as ignored and keep them from being staged by accident, helping you keep your repository clean. Files that are already tracked are not affected by .gitignore.

The following diagram summarizes this section:

Diagram of Git repository structure: untracked, staging, and working areas
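
If it helps, the movement of files between areas can be sketched as a toy Python model. ToyRepo and its three dictionaries are hypothetical names of mine and deliberately simplistic (for instance, a file here has only one status at a time, whereas real Git can show the same file as both staged and modified).

```python
class ToyRepo:
    """Toy model of the areas described above (not how Git is implemented)."""

    def __init__(self):
        self.head = {}     # path -> content in the latest commit
        self.index = {}    # path -> content in the staging area
        self.working = {}  # path -> content on disk

    def add(self, path):
        # Like `git add`: copy the on-disk version into the staging area.
        self.index[path] = self.working[path]

    def restore(self, path):
        # Like `git restore`: discard unstaged changes; untracked files are left alone.
        if path in self.index:
            self.working[path] = self.index[path]

    def restore_staged(self, path):
        # Like `git restore --staged`: reset the staged version, keep the file on disk.
        if path in self.head:
            self.index[path] = self.head[path]
        else:
            self.index.pop(path, None)  # a staged new file becomes untracked again

    def clean(self):
        # Like `git clean -f`: delete untracked files from disk.
        for path in [p for p in self.working if p not in self.index]:
            del self.working[path]

    def commit(self):
        # Like `git commit`: the staged state becomes the new snapshot.
        self.head = dict(self.index)

    def status(self, path):
        if path not in self.index:
            return "untracked"
        if self.index[path] != self.head.get(path):
            return "staged"
        if self.working.get(path) != self.index[path]:
            return "modified"
        return "unmodified"


repo = ToyRepo()
repo.working["app.py"] = "v1"
print(repo.status("app.py"))   # untracked
repo.add("app.py")
print(repo.status("app.py"))   # staged
repo.commit()
print(repo.status("app.py"))   # unmodified
repo.working["app.py"] = "v2"
print(repo.status("app.py"))   # modified
```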

Git branches

Let’s say you want to work on multiple orthogonal features at the same time. You will quickly figure out this is not very convenient with just one sequence of commits.

For example, consider the situation where one of the commits for one feature is faulty. How would you remove just this commit, without drastically impacting other features that depend on it? The natural solution is to maintain a separate sequence of commits for each feature. This is what git branch implements.

Note that these sequences of commits share a common component: the commits made before the split. We can think of the common component as a “trunk” and the diverging sequences of commits as “branches”, hence the terminology. By convention, one branch (usually called main) is treated as authoritative, and other branches eventually get consolidated with it.

Credits to https://www.atlassian.com/git/tutorials/using-branches for the image.

Diagram showing branching from a trunk in Git
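
One way to picture this: every commit remembers its parent, and a branch is just a name pointing at the newest commit of one sequence. The Python sketch below models that with two dictionaries. The commit IDs "A" to "E" and the helper names are made up for illustration.

```python
parents = {}   # commit id -> parent commit id (None for the very first commit)
branches = {}  # branch name -> id of the newest commit on that branch


def commit(branch, commit_id):
    """Add a commit on top of `branch` and move the branch name forward."""
    parents[commit_id] = branches.get(branch)
    branches[branch] = commit_id


def create_branch(name, start_branch):
    """A new branch starts out pointing at the same commit as `start_branch`."""
    branches[name] = branches[start_branch]


def history(branch):
    """Walk parent links from the branch tip back to the first commit."""
    chain = []
    current = branches[branch]
    while current is not None:
        chain.append(current)
        current = parents[current]
    return chain[::-1]  # oldest first


commit("main", "A")
commit("main", "B")
create_branch("feature", "main")  # split off after B
commit("feature", "C")
commit("feature", "D")
commit("main", "E")

print(history("main"))     # ['A', 'B', 'E']
print(history("feature"))  # ['A', 'B', 'C', 'D']

# The "trunk" is the part both histories share.
trunk = [c for c in history("main") if c in history("feature")]
print(trunk)               # ['A', 'B']
```

Notice that dropping the feature work is as simple as forgetting the name "feature"; the history of main is unaffected.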

These are the commands related to branching:

git branch                  # Lists all local branches
git branch branch-name      # Creates branch-name
git branch -D branch-name   # Force-deletes branch-name (-d refuses if it is unmerged)
git checkout branch-name    # Switches to branch-name
git switch branch-name      # Newer command for switching
git switch -c branch-name   # Creates branch-name and switches to it

After switching to a new branch, the process of adding new commits to this branch is exactly the same. At some point, you would finish implementing your feature and want to consolidate your changes with main.

There are two ways of doing this: merge and rebase. The difference between the two is subtle; merge preserves the branches in Git history, while rebase rewrites history to seem like the branches never existed.

In practice, this means that after a branch gets rebased, its commits are replaced by new copies with different IDs, and anyone who built on the originals has to reconcile their work with the rewritten history. Therefore, you should never rebase shared branches. A merge leaves the original commits untouched, so you can still revisit them. Unless it is a fast-forward, a merge also creates an extra commit called a merge commit, which ties the two histories together.

In the process of consolidation, there may be changes that cannot be reconciled automatically, for example when both branches edited the same lines. These are called merge conflicts, and they must be resolved by hand, with much care.

Credits to https://www.atlassian.com/git/tutorials/merging-vs-rebasing for the images.

Merge

Diagram showing a merge in Git

Rebase

Diagram showing a rebase in Git
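
The difference can also be sketched in a few lines of Python. In this toy model a commit’s ID is a hash of its parents and its message; real Git hashes more than that (the file tree, author, timestamps), but the parent is part of it, which is why replaying a commit onto a new parent necessarily yields a new ID. All names below are hypothetical.

```python
import hashlib

commits = {}  # commit id -> (parent ids, message)


def make_commit(parent_ids, message):
    """Create a commit whose ID depends on its parents and its message."""
    payload = "|".join(parent_ids) + "\n" + message
    commit_id = hashlib.sha1(payload.encode()).hexdigest()[:8]
    commits[commit_id] = (tuple(parent_ids), message)
    return commit_id


def merge(target_tip, source_tip, message):
    """A merge commit has two parents and leaves existing commits untouched."""
    return make_commit([target_tip, source_tip], message)


def rebase(source_commits, new_base):
    """Replay commits (oldest first) on top of new_base; returns the new tip."""
    tip = new_base
    for commit_id in source_commits:
        _, message = commits[commit_id]
        tip = make_commit([tip], message)  # same message, new parent, new ID
    return tip


def first_parent_history(tip):
    """Messages from the first commit up to `tip`, following first parents."""
    messages = []
    while tip is not None:
        parent_ids, message = commits[tip]
        messages.append(message)
        tip = parent_ids[0] if parent_ids else None
    return messages[::-1]


base = make_commit([], "initial")
main_tip = make_commit([base], "main: fix typo")
f1 = make_commit([base], "feature: part 1")
f2 = make_commit([f1], "feature: part 2")

merged_tip = merge(main_tip, f2, "Merge feature into main")
rebased_tip = rebase([f1, f2], main_tip)

print(commits[merged_tip][0] == (main_tip, f2))  # True: both histories are kept
print(rebased_tip != f2)                         # True: the replayed commits are new
print(first_parent_history(rebased_tip))         # one straight line of commits
```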

Git remote

Up to this point, we have been thinking of Git on a local system. But Git was designed for collaboration across machines. This is where Git remotes come in.

You can add Git repositories on remote systems as Git remotes. You can then pull changes from branches in this remote repo or push your changes to the appropriate branch in the remote. These are the relevant commands:

1. `git remote add`

This adds a remote repository under the name given as the first argument.

git remote add origin <remote-url> # you can now use `origin` to refer to remote-url

2. `git clone`

This clones a remote repo onto your local machine.

git clone <remote-url> # Copies the remote repository to your local machine and sets it up as origin

3. `git pull`

This does two things. First, it fetches the changes from the appropriate remote branch, and then it integrates them into your current branch by merging or rebasing, depending on the flags and your configuration.

git pull origin <remote-branch>            # Fetches and merges changes from remote branch
git pull --rebase origin <remote-branch>   # Fetches and rebases changes from remote branch

4. `git push`

This pushes your changes to the remote. Note that the remote branch must not contain commits that your local branch lacks; otherwise, the push is rejected and you need to pull first. There’s a very dangerous flag, --force, that overwrites the remote branch’s history with yours. Use it with much caution.

git push origin <local-branch>:<remote-branch> # Pushes your changes from local branch to origin/remote-branch
git push origin branch                         # Pushes changes from branch to origin/branch
git push --force origin branch                 # Overwrites the history in origin/branch
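
The rejection rule can be sketched as an ancestry check: a normal push is accepted only if the remote’s newest commit is already part of your local history. The Python below is a toy model with made-up commit IDs and a hypothetical PushRejected error, not Git’s actual implementation.

```python
class PushRejected(Exception):
    """Raised when the remote has commits the local branch does not contain."""


def is_ancestor(parents, ancestor, descendant):
    """True if `ancestor` is reachable from `descendant` via parent links."""
    stack, seen = [descendant], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return False


def push(parents, remote_tip, local_tip, force=False):
    """Return the remote branch's new tip, or raise if the push is rejected."""
    if force or is_ancestor(parents, remote_tip, local_tip):
        return local_tip
    raise PushRejected("remote contains work that you do not have locally")


# Hypothetical history: C and X were both built on top of B.
parents = {"A": [], "B": ["A"], "C": ["B"], "X": ["B"]}

print(push(parents, remote_tip="B", local_tip="C"))  # C: local is simply ahead

try:
    push(parents, remote_tip="C", local_tip="X")     # remote has C, local does not
except PushRejected as error:
    print("rejected:", error)

# With force, the remote now points at X and C is no longer part of the branch.
print(push(parents, remote_tip="C", local_tip="X", force=True))  # X
```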

We have covered all the things that I believe are essential to Git. For further reading, please consult the Atlassian Git Tutorials and the Git documentation.
