Background

Introduction

I learnt about the Linux Kernel Mentorship program from a good friend of mine, Shiv Karthik. At that point, I had just upstreamed my first patch, which took me 6 months.

I did not want this to be the case for every single kernel patch I submitted (especially the simple ones), so I decided to try my luck with the Linux Kernel Mentorship program.

If you just want to look at my patches, click here

The main challenges I faced when trying to contribute to the kernel were:

Not knowing what I could contribute
Not knowing where to start reading the kernel code
Lack of understanding of mailing list etiquette

I am aware these are very common challenges. And thankfully, the mentorship exists to resolve them.

I would highly recommend becoming familiar with C before starting the mentorship — you don’t want that to be your bottleneck.

Getting Started

The prerequisite tasks are very effective at what they aim to achieve. If you do them, you should have the basics of kernel development down.

From the prereq tasks, I learnt three very important things:

Creating a proper patch suitable for mailing to the mailing lists
Compiling the Linux kernel and understanding what the artifacts do
Using decode_stacktrace.sh

Apart from this, I learnt about syzkaller and decided to focus on syzkaller bugs during my mentorship. Two patches I submitted during this period were accepted.

Structure of the mentorship:

Our mentors were initially Shuah Khan and David Hunter. We later got a new mentor, Khalid Aziz, because the batch size was unusually large this fall at 74 people.

The main media of communication were Discord and e-mail. Apart from this, we had a Zoom meeting (Office Hours) every Wednesday at 8:00 AM MST, where we could discuss our queries with our mentors. These were very helpful.

Our mentors also reviewed and corrected mistakes in our patches on LKML, which was incredibly helpful.

Our goal was to get five patches in, not counting spelling and grammar fixes. Those didn’t count because they are easy to find and do not improve your understanding of the kernel.

My patches

We were asked to pick two subsystems and I decided to go with net and mm. I learnt a little too late that I was not ready for mm, although net turned out to be just the right level of complexity for me.

Here are my commits on mainline, i.e., patches that have been upstreamed.

git log --author="I Viswanath" --since="2019-01-01" --pretty --format=oneline

8d93ff40d49d70e05c82a74beae31f883fe0eaf8 net: usb: lan78xx: fix use of improperly initialized dev->chipid in lan78xx_reset
e9f35294e18da82162004a2f35976e7031aaf7f9 ptp: Add a upper bound on max_vclocks
958baf5eaee394e5fd976979b0791a875f14a179 net: usb: Remove disruptive netif_wake_queue in rtl8150_set_multicast
59ccb8176bd7e826d47962e891b460284f6978f0 i2c: mux: Simplify boolean assignment in i2c_mux_alloc
79dfed097680084f3d4716fa2c5bc945233bd2c0 selftests/mm: use calloc instead of malloc in pagemap_ioctl.c
c3ff7f06c7876bc292cac1c7d4df3d0bfd74f3b7 i2c: Clarify behavior of I2C_M_RD flag

You can check all my patches with all the relevant metadata here.

Net patches

I decided to focus on net because it suited my skill level. I found all these bugs by going through the net subsystem page in syzkaller.

1. net: usb: Remove disruptive netif_wake_queue in rtl8150_set_multicast

I started with an absolute monster of a bug, which I have described here in excruciating detail. In brief:

I added prints inside the suspicious functions (netif_start_queue, netif_stop_queue and netif_wake_queue) and obtained the following log:

Diagram

Based on the log statements, I was eventually able to come up (with the help of people from netdev) with the following root cause:

// syzbot reported WARNING in rtl8150_start_xmit/usb_submit_urb.

// This happened because of the following sequence of events:

    rtl8150_start_xmit() {
            netif_stop_queue();
            usb_submit_urb(dev->tx_urb);
    }
    
    rtl8150_set_multicast() {
            netif_stop_queue();
            netif_wake_queue();             // wakes up TX queue before URB is done
    }
    
    rtl8150_start_xmit() {
            netif_stop_queue();
            usb_submit_urb(dev->tx_urb);    // double submission
    }

// rtl8150_set_multicast() should not be concerned with TX queue synchronization, as it's a ndo_set_rx_mode callback.

The core issue is that waking the TX queue too early allows start_xmit() to run again before the previous URB completes, causing a double submission.

rtl8150_set_multicast() had no reason to call TX queue synchronization functions, i.e., netif_stop_queue() and netif_wake_queue(). The solution was therefore to remove them from rtl8150_set_multicast().
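
The sequence above can be replayed in a small userspace model. This is only a sketch: the struct and every model_* function are hypothetical stand-ins for the driver's queue state and URB ownership, not kernel APIs, and the model ignores real concurrency.

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_dev {
    bool queue_stopped;   /* models the state of the TX queue */
    bool urb_in_flight;   /* models the TX URB being owned by the USB core */
    int  double_submits;  /* counts submissions of an already active URB */
};

/* Models rtl8150_start_xmit(): stop the queue, then submit the TX URB. */
void model_start_xmit(struct model_dev *d)
{
    d->queue_stopped = true;
    if (d->urb_in_flight)
        d->double_submits++;    /* the situation the WARNING reports */
    d->urb_in_flight = true;
}

/* Models TX completion: the URB is done, so the queue may be woken. */
void model_tx_complete(struct model_dev *d)
{
    d->urb_in_flight = false;
    d->queue_stopped = false;
}

/* Models rtl8150_set_multicast() before (buggy) and after the fix. */
void model_set_multicast(struct model_dev *d, bool buggy)
{
    if (buggy) {
        d->queue_stopped = true;
        d->queue_stopped = false;   /* wakes TX while the URB is busy */
    }
    /* RX filter programming would go here; it does not touch TX state. */
}

/* The stack only calls start_xmit while the queue is awake. */
void model_stack_try_xmit(struct model_dev *d)
{
    if (!d->queue_stopped)
        model_start_xmit(d);
}

/* Replays the reported sequence and returns the double-submit count. */
int replay_sequence(bool buggy)
{
    struct model_dev d = { false, false, 0 };

    model_stack_try_xmit(&d);         /* first packet: URB now in flight */
    model_set_multicast(&d, buggy);   /* RX mode change arrives */
    model_stack_try_xmit(&d);         /* second packet */
    model_tx_complete(&d);            /* URB finally completes */
    return d.double_submits;
}
```

In this model, replay_sequence(true) counts one double submission, while replay_sequence(false), which mirrors the fix, counts none because the queue stays stopped until completion.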

2. ptp: Add a upper bound on max_vclocks

My second patch was thankfully much simpler. syzbot reported WARNING in max_vclocks_store. I looked at the code and came up with the hypothesis that the bug happened because max is too large for kcalloc here

diagram

I validated this by crafting the following command to trigger the bug:

echo x > /sys/devices/virtual/ptp/ptp0/max_vclocks # where x > KMALLOC_MAX_SIZE/(sizeof(int)) which computes to 1048576 on my system

My first attempt tried to fix this by adding validation in mm/slub.h. The mm reviewers pointed out that this was not the right place, so I moved the validation up into the caller, max_vclocks_store().
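
To make the arithmetic behind that threshold concrete, here is a userspace sketch. It assumes KMALLOC_MAX_SIZE is 4 MiB, which is what the 1048576 figure above implies for a 4-byte int; the real value depends on the kernel configuration. The MODEL_* names and model_max_vclocks_valid() are hypothetical and are not taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 4 MiB. The real KMALLOC_MAX_SIZE is configuration dependent. */
#define MODEL_KMALLOC_MAX_SIZE ((size_t)4 * 1024 * 1024)

/* Largest element count whose int array still fits in one allocation. */
#define MODEL_MAX_VCLOCKS_LIMIT (MODEL_KMALLOC_MAX_SIZE / sizeof(int))

/*
 * Hypothetical caller-side check: reject a requested count before it
 * ever reaches the allocator, instead of validating inside the allocator.
 */
bool model_max_vclocks_valid(size_t max)
{
    return max <= MODEL_MAX_VCLOCKS_LIMIT;
}
```

Checking in the caller keeps the policy (how many vclocks make sense) next to the code that owns it, which is the direction the review pushed the fix in.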

This is where I also learnt that RFC tags are not meant for normal bug-fix patches. In netdev, marking a patch RFC moves it out of the maintainers’ queue entirely. It is meant only for feature patches that are still being discussed. This patch was certainly not that.

Here’s the final accepted patch

3. net: usb: lan78xx: fix use of improperly initialized dev->chipid in lan78xx_reset

I discovered this bug while investigating a syzbot report.

Diagram

While reading through the code, I noticed that dev->chipid was being used in lan78xx_read_raw_eeprom() before it was initialized. At the time, I thought this was the bug syzkaller found, but the report had no reproducer, so I couldn’t test it.

Interestingly, this was a separate bug unrelated to the actual syzkaller report. The syzkaller bug itself was solved by a fellow mentee, Bhanu Seshu Kumar Valluri.

I validated the existence of this bug by using the reproducer meant for the syzkaller bug and identifying that dev->chipid was always 0 at the point of use.

Here’s the root cause:

// dev->chipid is used in lan78xx_init_mac_address before it's initialized:

lan78xx_reset() {
    lan78xx_init_mac_address()
        lan78xx_read_eeprom()
            lan78xx_read_raw_eeprom()   // <- dev->chipid is used here

    dev->chipid = ...                   // <- dev->chipid is initialized correctly here
}

This was a weird bug because KMSAN would never have detected it: alloc_etherdev() zero-initializes dev->chipid (using the __GFP_ZERO flag), so the read is of initialized memory that simply holds the wrong value.

The fix was just swapping the order of operations. Here’s the patch.
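
The ordering problem can be illustrated with a userspace sketch. Everything here is hypothetical: the struct, the model_* functions and the MODEL_CHIP_ID value are stand-ins, and calloc() plays the role of the zeroing allocation.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_CHIP_ID 0x7800u   /* hypothetical value, for illustration only */

struct fake_lan_dev {
    uint32_t chipid;                /* set during reset */
    uint32_t chipid_seen_by_eeprom; /* what the EEPROM path observed */
};

/* Models the EEPROM read path, which branches on the chip ID. */
void model_read_eeprom(struct fake_lan_dev *d)
{
    d->chipid_seen_by_eeprom = d->chipid;
}

/* Models reading the chip ID register into the device struct. */
void model_read_chipid(struct fake_lan_dev *d)
{
    d->chipid = MODEL_CHIP_ID;
}

/* Returns the chip ID the EEPROM path saw, in the old or the fixed order. */
uint32_t model_reset(bool fixed)
{
    /* calloc() zeroes the struct, so a too-early read yields 0, not garbage. */
    struct fake_lan_dev *d = calloc(1, sizeof(*d));
    uint32_t seen;

    if (!d)
        return UINT32_MAX;

    if (fixed) {
        model_read_chipid(d);
        model_read_eeprom(d);
    } else {
        model_read_eeprom(d);   /* uses chipid before it is set */
        model_read_chipid(d);
    }

    seen = d->chipid_seen_by_eeprom;
    free(d);
    return seen;
}
```

In the old order the EEPROM path sees 0, a well-defined but wrong value, which is why a tool that looks for uninitialized memory has nothing to flag.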

net: Split ndo_set_rx_mode into snapshot and deferred write (In Progress):

I spent a considerable amount of time trying to understand this particular reply.

Diagram

I have managed to come up with something that looks like it could work, but I need to smooth out the rough edges and fix all the possible regressions. When I do, I plan to document the process in detail. This is where decode_stacktrace really shines.

Misc Patches

Here are the other patches I submitted that don’t require as much detail:

4. i2c: Clarify behavior of I2C_M_RD flag

I have covered this here and this is the patch that motivated me to join this program.

I2C_M_RD documentation was incomplete and I decided to clarify how to specify a write transaction. Here is the final patch.

5. selftests/mm: use calloc instead of malloc in pagemap_ioctl.c

I found this while reading through deprecated.rst. I replaced malloc with calloc in the pagemap_ioctl.c selftest for good measure.
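
For context, deprecated.rst discourages open-coded arithmetic in allocator arguments because the multiplication can overflow. Here is a minimal userspace sketch of the difference, with hypothetical helper names; it assumes a C library whose calloc() rejects an overflowing product, as glibc does.

```c
#include <stdint.h>
#include <stdlib.h>

/* The size an open-coded multiplication would actually request. */
size_t wrapped_size(size_t count, size_t size)
{
    return count * size;    /* unsigned arithmetic wraps silently */
}

/* Open-coded form: the allocator only ever sees the wrapped product. */
void *alloc_array_open_coded(size_t count, size_t size)
{
    return malloc(count * size);
}

/*
 * calloc() receives count and size separately, so it can detect that
 * the product does not fit and fail instead of under-allocating.
 * It also returns zeroed memory.
 */
void *alloc_array_checked(size_t count, size_t size)
{
    return calloc(count, size);
}
```

With count = SIZE_MAX / 2 + 1 and size = 2, the open-coded product wraps to 0, whereas the calloc() form returns NULL.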

6. i2c: mux: Simplify boolean assignment in i2c_mux_alloc

I found this while reading through i2c code for patch opportunities. This patch simplifies boolean assignments of the form if (a) b = true; into the more compact expression b = !!a; (the two are equivalent when b starts out false).
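
A small sketch of the two forms side by side. The flag names and struct are hypothetical, not the ones used in i2c_mux_alloc; the point is only that both forms agree when the fields start out false, as they do in a zeroed allocation.

```c
#include <stdbool.h>

#define MODEL_FLAG_LOCKED     0x1u  /* hypothetical flag bits */
#define MODEL_FLAG_ARBITRATOR 0x2u

struct model_mux {
    bool locked;
    bool arbitrator;
};

/* Verbose form: only ever sets a field to true. */
void model_set_verbose(struct model_mux *m, unsigned int flags)
{
    if (flags & MODEL_FLAG_LOCKED)
        m->locked = true;
    if (flags & MODEL_FLAG_ARBITRATOR)
        m->arbitrator = true;
}

/* Compact form: !! collapses any non-zero value to 1 (true). */
void model_set_compact(struct model_mux *m, unsigned int flags)
{
    m->locked = !!(flags & MODEL_FLAG_LOCKED);
    m->arbitrator = !!(flags & MODEL_FLAG_ARBITRATOR);
}

/* Both forms agree as long as the struct starts zeroed. */
bool model_forms_agree(unsigned int flags)
{
    struct model_mux a = { false, false };
    struct model_mux b = { false, false };

    model_set_verbose(&a, flags);
    model_set_compact(&b, flags);
    return a.locked == b.locked && a.arbitrator == b.arbitrator;
}
```

Note that the compact form also writes false when the flag is absent, so it is only a drop-in replacement where the field is known to be false beforehand.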

Bugs I chose not to attempt:

1. WARNING in kcm_write_msgs

I thought I understood the issue, but the bug was not consistently reproducible. I would recommend not working on bugs like this, at least as a beginner.

Here’s the link.

2. KASAN: slab-use-after-free Read in handle_tx (2)

This is a bug where I understand the issue but I’m not sure there’s a good way to fix it. Considering the driver is Orphaned, I decided that it was not worth the effort.

Here’s the link.

Advice for Future Applicants

If you aren’t familiar with C, get comfortable with it. This should be relevant even when you are working with Rust in the kernel.
Get used to the Linux compilation process. That will allow you to iterate fast and get stuck on the actual bug instead of on the compilation.
Try to create a minimal reproducer (could be as simple as a bash command) for your bug. This will help you truly understand and eventually solve the bug.
If you have any idea for a bug, just try writing/coding it down. Even a stupid idea can help you find the correct one.
The kernel maintainers are very approachable people. If you do everything to the best of your abilities and are honest about it, they will guide you. Conversely, intellectual laziness is not tolerated.
Remember that kernel maintainers are still humans at the end of the day. Here’s a very relevant video.
When you are submitting a patch, try to think like a reviewer/maintainer.

Closing Remarks

This mentorship was an amazing experience and I am glad to have been a part of it. I will keep working on my net refactor patch and contributing bug fixes to net, and I am thinking of jumping into virtualization because it seems interesting to me.

I am very thankful to Shuah Khan, David Hunter and Khalid Aziz for hosting this program.
