DS@GT ARC

Spring CLEF 2026 Competition

Are you interested in doing research but need help knowing where to start? Are you interested in sinking your teeth into applicable, real-world datasets to solve challenging problems? Join the Data Science at Georgia Tech (DS@GT) club as we tackle information retrieval and machine learning competitions at CLEF 2026 in the Spring for our fifth year! The DS@GT competition team has won $10,000 worth of prizes across four working note competitions, and has had over 30 working note papers accepted into workshop proceedings. For CLEF 2025 alone, we had 44 published authors! By joining us, you’ll gain valuable experience, build a network of like-minded individuals, and have the opportunity to participate in competitive research. You can read more about our achievements on our Impact page.

The CLEF 2026 competition begins in March and ends in May, with working note papers submitted by the end of May. To participate with the DS@GT-ARC CLEF 2026 team, you should meet at least one of the following criteria:

For those who are looking to earn academic credit for their research, we will also be offering an optional CS 8903 Special Problems section for the first time. All participants in the course must submit a research proposal for the CLEF task they are interested in. Seats in the course will be limited, and completing the sign-up form does not guarantee enrollment. If you’re an alumnus and would like to join our research group in the spring, you will be required to register for CS 8903 (or any other course) for PACE access. CS 8903 may be taken as a 3-credit course or a 1-credit course.

We hope you find this exciting and complete the sign-up form to begin team formation! Once you have completed the sign-up form, a member of our team will reach out with next steps. Lab leads will be assigned first, and each lab lead will reach out to potential lab members.

Feel free to share this page and join us in #applied-research-competitions on the DS@GT Slack: https://linktr.ee/datasciencegt

DS@GT ARC - CLEF Competition Schedule

We hold monthly general meetings to share ideas and progress between labs. Each lab will also hold biweekly meetings (exact meeting frequency and meeting times will be coordinated by each lab lead). The rough general schedule for the spring semester is as follows:

Late November to December - Begin Lab and Task team formation

January - Finalize Lab and Task team formation, followed by Kickoff, Task Review, Research Planning, and Literature Reviews

February - Preliminary Experiments

March - Datasets Released, Research Begins

April - Research Continues

May - Final Submissions for the Competition

End of May - Working Note Papers 1st Submission

June - Working Notes Feedback and Final Paper Submission

Team Structure

Available Labs and Lab Leads

All labs currently have openings for new members. We will mark labs here when they reach full capacity:

Self-Assessment

These questions capture the big idea that many of our teams will be exploring:

  1. Dataset: Using the 20 newsgroups dataset, create a subset of 4 groups of 25 examples each.
  2. What is transfer learning? What is unsupervised learning? What is an embedding space? Demonstrate the usage of Hugging Face transformers to embed each post in the newsgroup dataset.
  3. When would you use the cosine distance over the Euclidean distance when measuring distance in an embedding space? Write an assertion demonstrating the triangle inequality with embedding vectors.
  4. What is a k-NN graph? Demonstrate the construction of a 3-NN graph with a ball tree, using an edge list representation. How many edges are in the graph? What is the maximum number of edges in the graph?
  5. What is Precision@k? What is NDCG? Why would you use one over the other? Find the five nearest neighbors of an item in the set as an ordered list. Compute scores between two random lists in the set. Compute the score between lists in the same group. Compute the score between lists in two different groups.
  6. What is learning to rank? Create a dataset composed of items and their neighborhood lists. Demonstrate learning to rank on the dataset with XGBoost.
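As a warm-up for questions 3 and 4, here is a minimal sketch in plain Python. It is not a reference solution: the toy two-dimensional vectors stand in for real embeddings, the helper names (`euclidean`, `cosine_distance`, `knn_edge_list`) are made up for illustration, and the neighbor search is brute force rather than the ball tree the question asks for.

```python
import math

def euclidean(a, b):
    """Straight-line (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity: compares direction and ignores vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def knn_edge_list(vectors, k, dist=euclidean):
    """Directed k-NN graph as (source, neighbor) index pairs, by brute force."""
    edges = []
    for i, v in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        # Sort by distance, breaking ties by index so the result is deterministic.
        others.sort(key=lambda j: (dist(v, vectors[j]), j))
        edges.extend((i, j) for j in others[:k])
    return edges

# Hypothetical toy "embeddings" (assumption: real ones come from a model).
vectors = [
    [1.0, 0.0],
    [2.0, 0.0],
    [0.0, 1.0],
    [0.0, 3.0],
    [1.0, 1.0],
]

# Same direction, different length: cosine distance is 0, Euclidean is not.
assert math.isclose(cosine_distance(vectors[0], vectors[1]), 0.0, abs_tol=1e-12)
assert math.isclose(euclidean(vectors[0], vectors[1]), 1.0)

# Euclidean distance satisfies the triangle inequality.
a, b, c = vectors[0], vectors[2], vectors[4]
assert euclidean(a, b) <= euclidean(a, c) + euclidean(c, b) + 1e-12

# Cosine distance (1 - similarity) is not a true metric: it can violate it.
assert cosine_distance(a, b) > cosine_distance(a, c) + cosine_distance(c, b)

# Each node contributes exactly k outgoing edges when there are more than k nodes.
edges = knn_edge_list(vectors, k=3)
assert len(edges) == len(vectors) * 3
```

In this directed form the graph has exactly n * k edges. If you merge reciprocal pairs into undirected edges, the count falls somewhere between n * k / 2 and n * k, which is one way to reason about the "maximum number of edges" part of question 4.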

Try thinking through each question and task, and self-assess your ability to solve this problem. Consider solving this problem in a notebook and timing yourself, given access to the library documentation. To save you the copy and paste, here’s how ChatGPT answered: https://chatgpt.com/share/b41a40f2-e831-4df0-b051-02b629e1bd9b
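For question 5, the sketch below contrasts Precision@k with NDCG on a tiny hand-made example. Treat it as one possible formulation rather than the expected answer: the item labels and relevance grades are invented, the function names are hypothetical, and the DCG here uses the linear-gain variant (gain divided by log2(rank + 1)); other variants, such as exponential gain, exist.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that appear in the relevant set."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def dcg_at_k(gains, k):
    """Discounted cumulative gain: the gain at rank r is divided by log2(r + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))

def ndcg_at_k(ranked, relevance, k):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    gains = [relevance.get(item, 0.0) for item in ranked]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Invented graded relevance: "a" is the best match, "d" and "e" are irrelevant.
relevance = {"a": 3.0, "b": 2.0, "c": 1.0}
ranked_good = ["a", "b", "c", "d", "e"]  # relevant items first, best on top
ranked_bad = ["d", "e", "c", "b", "a"]   # same items, worst possible order

# Precision@5 only counts hits, so both orderings score the same.
p_good = precision_at_k(ranked_good, set(relevance), 5)
p_bad = precision_at_k(ranked_bad, set(relevance), 5)
assert math.isclose(p_good, p_bad)

# NDCG rewards placing highly relevant items early, so the orderings differ.
n_good = ndcg_at_k(ranked_good, relevance, 5)
n_bad = ndcg_at_k(ranked_bad, relevance, 5)
assert math.isclose(n_good, 1.0)
assert n_bad < n_good
```

The two rankings contain the same relevant items, so Precision@5 cannot tell them apart, while NDCG can. That difference is the core of the "why would you use one over the other" part of the question.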

Experiences

It would be helpful if you have practical software experience, at a bare minimum:

You might find “The Missing Semester of Your CS Education” useful if you feel weak in the software aspect of building IR/ML systems.

Videos

Here is a video from one of our Fall meetings in which we shared our trip to CLEF 2025 in Madrid, Spain. Ten members were able to attend in person and present their approaches and results.

Points of Contact:

Murilo Gustineli murilogustineli@gatech.edu

Ritesh Mehta rmehta307@gatech.edu

Questions?

Please check out the FAQ page and feel free to contact us if you have more questions!