BRINGING PERSONALIZED SEARCH TO ETSY

Posted by Lucia Yu on October 29, 2020

The Etsy marketplace brings together shoppers and independent sellers from all over the world. Our unconventional inventory presents unique challenges for product search, given that many of our listings fall outside of standard e-commerce categories. With more than 80 million listings and 3.7 million sellers, Etsy relies on machine learning to help users browse creative, handmade goods in their search results. But what if we could make the search experience even better with results tailored to each user? Enter personalized search results.

_Search results for "tray": (above) default results, (below) a user who recently interacted with leather goods._
When a user logs into the marketplace and searches for items, they signal their preferences by interacting with listings that pique their interest. In personalization, our algorithms train on these signals and learn to predict, per user, the most relevant listings. The resulting personalized model lets individuals' taste shine through in their search results without compromising performance. Personalization enhances the underlying search model by customizing the ordering of relevant items according to user preference. Using a combination of historical and contextual features, the search ranking model learns to recognize which items have a greater alignment with an individual's taste.

In the following sections, we describe the Etsy search architecture and pipeline, the features we use to create personalized search results, and the performance of this new model. Finally, we reflect on the challenges of launching our first personalized search model and look ahead to future iterations of personalization. Please note that some sections of the post are more technical and assume knowledge of machine learning basics from the reader.

ETSY SEARCH ARCHITECTURE

The search pipeline is separated into two passes: candidate set retrieval and ranking. This ensures that we are returning low-latency results, a crucial component of a good search system. Because the latter ranking step is computationally expensive, we want to use it wisely. So from millions of total listings, the candidate set retrieval step selects the top one thousand items for a query by considering tags, titles, and other seller-provided attributes. This allows us to run the ranking algorithm over _less than one percent_ of all listings. In the end, the ranker places the most relevant items at the top of the search results page.

Our search ranking algorithm is an ensemble model that uses a gradient boosted decision tree with a pairwise formulation. For personalization, we introduce a new set of features that allow us to model user preferences. These features are included in addition to existing features, such as listing price and recency. And as much as we try to avoid impacting latency, the introduction of these personalized features creates new challenges in serving online results. Because the features are specific to each user, the cache utilization rate drops. We address this and other challenges in a later section.
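To make the two-pass shape concrete, here's a minimal sketch of candidate set retrieval followed by ranking. This is illustrative only, not our production system: the function names, listing fields, and lexical-overlap scoring are stand-ins for the real retrieval and ranking logic.

```python
# Minimal sketch of a two-pass search pipeline (illustrative, not production code).
# Pass 1 cheaply narrows millions of listings down to ~1,000 candidates;
# pass 2 runs the expensive ranking model only on that small set.

from typing import Callable, List

def retrieve_candidates(query: str, listings: List[dict], k: int = 1000) -> List[dict]:
    """Cheap first pass: lexical match on seller-provided titles and tags."""
    terms = set(query.lower().split())
    scored = []
    for listing in listings:
        text = set((listing["title"] + " " + " ".join(listing["tags"])).lower().split())
        overlap = len(terms & text)
        if overlap:
            scored.append((overlap, listing))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [listing for _, listing in scored[:k]]

def rank(candidates: List[dict], score_fn: Callable[[dict], float]) -> List[dict]:
    """Expensive second pass: score each candidate with the ranking model."""
    return sorted(candidates, key=score_fn, reverse=True)
```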
PERSONALIZED USER REPRESENTATIONS

The novelty of personalized search results lies in the new features we pass to the ranker. We categorize personalization features into two groups: historical and contextual features.

Historical features refer to singular data points about a user's profile that can succinctly describe shopping habits and behaviors. Are they modern consumers of digital goods, or are they hunting and gathering vintage pieces? Do they carefully deliberate on each individual purchase, or do they follow their lightning-strike gut instinct? We can gather these insights from the number of digital or vintage items purchased and the average number of site visits. Historical user features help us put the "person" in personalization.

_Search results for "lamp": (above) default results, (below) a user who recently interacted with epoxy resin items._

In contrast to these numerical features, data can also be represented as a vector, or a list of numbers. For personalization, we refer to these vectored features as contextual features because the listing vector represents a listing _with respect to the context of all other listings_. In fact, there are many ways to represent a listing as a vector, but we use term frequency–inverse document frequency (Tf-Idf), item-interaction embeddings, and interaction-based graph embeddings. If you're unfamiliar with any of these methods, don't worry! We'll be diving deeper into the specific vector generation algorithms.

So how do we capture a user's preferences from a bunch of listing vectors? One method is to average all the listings a user has clicked on to represent them. In other words, the user contextual vector is simply the average of all the interacted listings' contextual vectors.
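Here's a minimal sketch of that averaging step, assuming dense listing embeddings are already available (the arrays and helper names are made up for illustration). It also previews the cosine similarity feature described later, which compares a user's vector against candidate listings:

```python
# Minimal sketch: build a user contextual vector by averaging the embeddings
# of listings the user interacted with, then score a candidate listing by
# cosine similarity (illustrative; the real features are richer).

import numpy as np

def user_contextual_vector(interacted_listing_vectors: np.ndarray) -> np.ndarray:
    """Average the vectors of all listings a user interacted with."""
    return interacted_listing_vectors.mean(axis=0)

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Example: three clicked listings, each represented by a 4-dim embedding.
clicked = np.array([[0.9, 0.1, 0.0, 0.2],
                    [0.8, 0.2, 0.1, 0.1],
                    [0.7, 0.0, 0.2, 0.3]])
user_vec = user_contextual_vector(clicked)
candidate = np.array([0.85, 0.1, 0.05, 0.2])
print(cosine_similarity(user_vec, candidate))  # higher = closer to the user's taste
```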
We gather historical and contextual features from across our mobile web, desktop, and mobile application platforms. This allows us to maximize the amount of information our model can use to personalize search result rankings.

THE MANY WAYS USERS SHOW US LOVE

In addition to clicks on a listing from search results, a user has a few other ways to connect with sellers' items on the marketplace. After a user searches for an item, they can _favorite_ items in the search results page and save them to their own curated collections; they can add an item to their _cart_ while they continue to browse; and once they are satisfied with their selection, they can _purchase_ the item.
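As a rough illustration (a hypothetical schema, not our actual data model), each of these signals can be recorded as a typed event; the same (item, interaction-type) pairing reappears later as the token format for item-interaction embeddings:

```python
# Hypothetical schema for user interaction events (illustrative only).

from dataclasses import dataclass
from enum import Enum

class Interaction(Enum):
    CLICK = "click"          # most frequent, weakest relevance signal
    FAVORITE = "favorite"    # saved to a curated collection
    ADD_TO_CART = "cart"     # stronger intent while browsing
    PURCHASE = "purchase"    # rarest, strongest relevance signal

@dataclass
class InteractionEvent:
    user_id: int
    listing_id: int
    kind: Interaction
    timestamp: float

event = InteractionEvent(user_id=42, listing_id=824770513,
                         kind=Interaction.FAVORITE, timestamp=1603929600.0)
```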
Each of these interactions has distinct characteristics that help our model generalize and generate more accurate predictions. Clicks are by far the most popular way for buyers to engage with listings, and through sheer quantity provide a rich source of material for modeling user behaviors. On the other end, purchase interactions occur less frequently than clicks but carry stronger indications of an item's relevance to the user's search query.

THE HEART OF PERSONALIZATION

Now, let's get to the crux of personalization and dig deeper into user contextual features.

Tf-Idf vectors consider listings from a textual standpoint, where words in the seller-provided attributes are weighted according to their importance. These attributes include listing titles, tags, and others. Each word's importance is derived with respect to its frequency in the immediate listing text, but also in the larger corpus of listing texts. This allows us to distinguish a listing from others and capture its unique qualities. When we average the last few months' worth of listings a user has interacted with, we are averaging the weights of words in those listings to create a single Tf-Idf vector that represents the user. In other words, in Tf-Idf a listing is represented by its most important words, and a user is represented as an average of those most important words.

_Diagram of an interaction-based graph embedding: in this example, the queries "dollhouse" and "dolls" resulted in clicks on listing 824770513 on three and five occasions, respectively._

Unlike Tf-Idf, an interaction-based graph embedding can capture the larger interaction context of query and listing journeys. Recall that interactions can be clicks, favorites, add-to-carts, or purchases from a user. Let's say we have a query and some listings that are often clicked with that query. When we match words within the query to the words in the listing texts and weight the words common to both, we can represent listings and queries with the same vocabulary. A common vocabulary is an important quality in textual representations because it lets us derive degrees of relatedness between queries and listings despite differences in length and purpose. Therefore, if a few different listings are all clicked as a result of the same query, we expect the embeddings for these listings to be similar. And as with Tf-Idf, we can simply average the weights of words in the sparse vectors over some time frame. Whereas graph embeddings weave behavior from interaction logs into the vector representation, Tf-Idf only uses available listing text. Put more plainly, for graph embeddings users tell _us_ which queries and listings are related, and we model this information by finding overlaps between their words.
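As a toy sketch of the Tf-Idf representation, using scikit-learn as a stand-in for our actual pipeline (the listing texts are invented), a user's sparse vector is simply the average of the rows for the listings they interacted with:

```python
# Toy sketch of Tf-Idf listing and user vectors (illustrative corpus).
# Each listing's title/tags become a weighted bag-of-words; a user is the
# average of the vectors of the listings they clicked.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

listing_texts = [
    "handmade ceramic serving tray rustic",
    "leather valet tray personalized gift",
    "epoxy resin lamp wood night light",
]

vectorizer = TfidfVectorizer()
listing_vectors = vectorizer.fit_transform(listing_texts)  # sparse (3 x vocab)

clicked = [1, 2]  # indices of listings the user interacted with
user_vector = np.asarray(listing_vectors[clicked].mean(axis=0)).ravel()

# The user's highest-weighted words summarize their recent taste.
vocab = vectorizer.get_feature_names_out()
top = np.argsort(user_vector)[::-1][:5]
print([(vocab[i], round(float(user_vector[i]), 2)) for i in top])
```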
Diagram from _Learning Item-Interaction Embeddings for User Recommendations_

However, focusing on a single interaction type within an embedding can be limiting. In reality, users can have a combination of different interactions within a single shopping session. Item-interaction vectors can learn multiple interactions in the same space.
Created by our very own data scientists here at Etsy, item-interaction vectors build upon the methods of word2vec, where words occurring in the same context share a higher vector similarity.
The implementation of item-interaction vectors is simple but elegant: we replace words and sentences with item-interaction token pairs and sequences of interacted items. A token pair is formulated as _(item, interaction-type)_ to represent how a user interacted with a specific item or listing. An ordered list of these tokens represents what a user interacted with in a session, and how. As a result, item-interaction token pairs that appear in the same context will be considered similar. Because these listing embeddings are dense vectors, we can easily find similar listings via distance metrics. To summarize item-interaction vectors: as with interaction-based graph embeddings, we let users guide us in learning which listings are similar. But rather than deriving listing similarities from query and listing relationships, we infer similarity when listings appear in the same sequences of interactions.
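As a rough sketch of the idea (not Etsy's actual training pipeline), here is how item-interaction embeddings could be learned with an off-the-shelf word2vec implementation such as gensim 4.x; the sessions and token format below are hypothetical:

```python
# Each "word" is an (item, interaction-type) token and each "sentence"
# is one user session, ordered by time.
from gensim.models import Word2Vec

sessions = [
    ["101:click", "101:favorite", "103:click", "103:cart", "103:purchase"],
    ["102:click", "101:click", "101:cart"],
    ["103:click", "101:favorite"],
]

model = Word2Vec(
    sentences=sessions,
    vector_size=64,   # dense embedding dimension
    window=3,         # context window within a session
    min_count=1,      # tiny toy corpus; real data would prune rare tokens
    sg=1,             # skip-gram variant of word2vec
)

# Token pairs that appear in similar session contexts land nearby, so we
# can look up related listing interactions via vector distance.
print(model.wv.most_similar("101:click", topn=2))
```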
PUTTING IT ALL TOGETHER

Let's take stock of what we have to work with: recent or lifetime look-back windows, three types of contextual features (Tf-Idf, graph embedding, item-interaction), and four types of user behavior interactions (click, favorite, add-to-cart, purchase). Mixing and matching these together, we have a grand total of 24 contextual vectors to represent a single user in order to rank items for personalized search results. For example, we can combine a lifetime time window, the item-interaction method, and "favorite" interactions to generate an item-interaction vector that represents a user's all-time favorite listings.

Search results for "necklace charms blue"

_(above) default results, (below) a user who recently interacted with eye charms_

In personalized search ranking, when a user enters a query we still do a coarse sweep of the inventory and grab the items most related to the query in candidate set retrieval. But in the ranking of items, we now include our new features. Recall that decision trees take input features in the form of integers or decimals. To satisfy this requirement, we can pass user historical features straight through to the tree or create new features by combining them with other features beforehand. To include user contextual features in the ranking, we have to compute similarity metrics between users' contextual vectors and the candidate listing vectors from the candidate retrieval step. We derive Jaccard similarity and token overlap for sparse vectors and cosine similarity for dense vectors. From these metrics we understand which candidate listings are more similar to listings a user has previously interacted with. However, a similarity metric alone is not sufficient to determine a final ranking. Decision trees take these inputs and learn how each feature impacts whether an item will be purchased. We feed user historical features, similarity measures, and other non-personalized features into the tree so it can learn to rank listings from most relevant to least. The expectation is that the most relevant listings are the items a user is most likely to purchase.
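As an illustration of those similarity features, here is a minimal sketch; the inputs are hypothetical and this is not Etsy's production code:

```python
# Jaccard similarity and token overlap for sparse (word-weight) vectors,
# cosine similarity for dense embeddings.
import numpy as np

def jaccard(user_tokens: set, listing_tokens: set) -> float:
    """|intersection| / |union| of the non-zero tokens."""
    if not user_tokens and not listing_tokens:
        return 0.0
    return len(user_tokens & listing_tokens) / len(user_tokens | listing_tokens)

def token_overlap(user_tokens: set, listing_tokens: set) -> int:
    """Raw count of tokens the two sparse vectors share."""
    return len(user_tokens & listing_tokens)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

# One similarity per (user vector, candidate listing) pair becomes one
# numeric feature for the decision-tree ranker.
user_sparse = {"blue", "eye", "charm", "necklace"}
cand_sparse = {"blue", "evil", "eye", "charm"}
features = [
    jaccard(user_sparse, cand_sparse),
    token_overlap(user_sparse, cand_sparse),
    cosine(np.array([0.1, 0.7, 0.2]), np.array([0.2, 0.6, 0.1])),
]
```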
PERSONALIZED SEARCH PERFORMANCE

In online A/B experiments we compared this personalized model with a control and observed an improvement in ranking performance as measured by purchase normalized discounted cumulative gain (NDCG). NDCG captures the goodness of a ranking: if, on average, users purchase items ranked higher on the page, the ranking has a high purchase NDCG. In our experiments, we observed that the NDCG lift from personalization was especially high for users who have recently and/or often interacted with the marketplace.

Search results for "print maxi dress"

_(above) default results, (below) a user who recently interacted with African prints_

Users also click around less in personalized results, page through fewer results, and buy more items compared to the control model. This indicates that users are finding what they want faster with the personalized variant. Overall, personalization features play an important role relative to the existing features for our decision tree. Generally speaking, the importance gain of a feature in a decision tree indicates how much that feature contributes to better predictions. In personalization, contextual features prove to be a strong influence in determining a good final ranking. For users with a richer history of interactions, we provide _even better_ personalized results. This is confirmed by a greater number of purchases online and increased NDCG offline for users who have recently purchased items more than once. Vector representations from recent time windows had greater feature importance gain compared to lifetime vectors. This means that users' recent interactions with listings give a better indication of what the user wants to purchase. Out of the three user contextual feature types, the text-based Tf-Idf vectors tend to have higher feature importance gain. This might suggest that ranking items based on seller-provided attributes given a query is the best way to help users find what they are looking for. We also identify users' clicked and favorited items as more important signals compared to past cart adds or purchases. This might indicate that once a user has purchased an item, they have less utility for highly similar items later.

CHALLENGES & CONSIDERATIONS

* As mentioned at the beginning of the post, serving personalized results online for individual users introduces LATENCY CHALLENGES. Typically we rank items in the second pass in real time and rely on caching to save ordered rankings per query to reduce latency in online serving. But because personalized results are specific to individual users, we have to rank listings for every user and query pair. Cache utilization decreases due to the exploding amount of information we need to store in order to account for each personalized result, which can impact the user experience.
* Understanding further the appropriate situations to deploy personalized search results can improve a user's experience. For example, when shopping for gifts for your grandmother, would we really want to tailor the results to _your_ taste preferences? There are many ways to achieve this, such as learning the DEGREE OF PERSONALIZATION with respect to the query, but there might be tradeoffs in latency and training time.
* For NEW USERS who haven't interacted with many listings on our marketplace, we can have an alternative approach for populating default vector values. One possible method would be to set the default vector for new users as the average of all user vectors. However, it's been shown that personalized results are better when a user's individual preferences are very different from the group's average preference.
* Personalization should take into account users' PRIVACY CHOICES. This can take many forms, such as removing personalization information over time, providing user preferences as to personalization, anonymizing personalization information, or considering the sensitivity of information in how it is used for personalization.

CONCLUSION
In this post, we have covered how Etsy achieves personalized search ranking results. Our models learn which listings should rank higher for a user based on their own history as well as others' history. These signals are encapsulated in user historical features and contextual features. Since launching personalization, users have been able to find items they like more easily, and they often come back for more. At Etsy, we're focused on connecting our vibrant marketplace of 3.7 million sellers with shoppers around the world. With the introduction of personalized ranking, we hope to stay the course in our mission to keep commerce human.
IMPROVING OUR DESIGN SYSTEM THROUGH DARK MODE

Posted by Stephanie Sharp on October 21, 2020

Etsy recently launched Dark Mode in our iOS and Android buyer apps. Since Dark Mode was introduced system-wide last year in iOS 13 and Android 10, it has quickly become a highly requested feature by our users and an industry standard. Benefits of Dark Mode include reduced eye strain, accessibility improvements, and increased battery life. For the Design Systems team at Etsy, it was the perfect opportunity to test the limits of the new design system in our apps. In 2019, we brought Etsy's design system, Collage, to our iOS and Android apps. Around the same time, Apple announced Dark Mode in iOS 13. By implementing Dark Mode, the Design Systems team could not only give users the flexibility to customize the appearance of the Etsy app to match their preferences, but also test the new UI components app-wide and increase adoption of Collage in the process.

SEMANTIC COLORS
Without semantic colors, Dark Mode wouldn't have been possible. Semantic colors are colors that are named relative to their purpose in the UI (e.g. primary button color) instead of how they look (e.g. light orange color). Collage used a semantic naming convention for colors from the beginning. This made it relatively easy to support dynamic colors, which are named colors that can have different values when switching between light and dark modes. For example, a dynamic primary text color might be black in light mode and white in Dark Mode.
Dynamic semantic colors opened up the possibility for Dark Mode, but they also led to a more accessible app for everyone. On iOS, we also added support for the Increased Contrast accessibility feature, which increases the contrast between text and backgrounds to improve legibility. Any color in the Etsy app can now have up to four values for light/dark modes and regular/increased contrast.

COLOR GENERATION
To streamline the process for adding new colors, we created a script on iOS that generates all of our color assets and files. With the growing complexity of dynamic colors, having a single source of truth for color definitions is important. On iOS, our source of truth is a property list (a key-value store for app data) of color names and values. We created a script that automatically runs when the app is built and generates all the files we need to represent colors: an asset catalog, convenience variables for accessing colors in code, and a source file for the unit tests. Adding a new color is as simple as adding a line to the property list, and all the relevant files are updated for you. This approach has reduced the time it takes to add a new color and eliminated the risk of inconsistencies across the codebase.
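The post doesn't include the script itself, but the shape of such a generator is straightforward. Below is a hypothetical Python sketch of the idea; the file names, plist schema, and simplified asset-catalog JSON are illustrative assumptions, not Etsy's actual implementation:

```python
# Hypothetical build step: read color definitions from a property list and
# emit (1) asset-catalog entries and (2) Swift convenience accessors.
import json
import plistlib
from pathlib import Path

colors = plistlib.loads(Path("Colors.plist").read_bytes())
# e.g. {"textPrimary": {"light": "#222222", "dark": "#FFFFFF"}}

for name, variants in colors.items():
    colorset = Path(f"Colors.xcassets/{name}.colorset")
    colorset.mkdir(parents=True, exist_ok=True)
    # Simplified stand-in for Xcode's real Contents.json schema.
    contents = {
        "colors": [
            {"idiom": "universal", "color": variants["light"]},
            {
                "idiom": "universal",
                "appearances": [{"appearance": "luminosity", "value": "dark"}],
                "color": variants["dark"],
            },
        ]
    }
    (colorset / "Contents.json").write_text(json.dumps(contents, indent=2))

# One generated accessor per color keeps call sites typo-proof.
swift = "\n".join(
    f'static var {name}: UIColor {{ UIColor(named: "{name}")! }}'
    for name in colors
)
Path("Colors+Generated.swift").write_text(swift)
```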
_On iOS, a script reads from a property list to generate the color asset catalog and convenience variables._

RETHINKING ELEVATION

Another design change we made for Dark Mode was rethinking how we represent elevation in our UI components. In light mode, it's common to add a shadow around your view or dim the background to show that one view is layered above another. In Dark Mode, these approaches aren't as effective and the platform convention is to slightly lighten the background of your view instead. The Etsy app uses shadows and borders extensively to indicate levels of elevation. For Dark Mode, we removed shadows entirely and used borders much more sparingly. Instead, we followed iOS and Android platform conventions and introduced elevated background colors into our design system. Semantic colors came to the rescue again: we were easily able to use our regular background color in light mode while applying a lighter color in Dark Mode on views that needed it, such as listing cards.
_Examples of elevated cards in light and dark modes_

CHOOSE YOUR THEME
There is no system-level Dark Mode setting available on older versions of Android, but it can be enabled for specific apps that support Themes (an Android feature that allows for UI customization and provides the underlying structure for Dark Mode). This limitation turned into an opportunity for us to provide more customization options for all our users. In both our iOS and Android apps you can personalize the appearance of the Etsy app to your preferences. So if you want to keep your phone in light mode but the Etsy app needs to be easy on the eyes for late night shopping, we've got you covered.

DARK MODE IN WEB VIEWS

Another obstacle to overcome was our use of web views, webpages that are displayed within a native app. Web views are used in a handful of places in our iOS and Android apps, and we knew that for a great user experience they needed to work seamlessly in Dark Mode as well. Thankfully, the web engineers on the Design Systems team jumped in to help and devised a solution to this problem. Using the Sass !default syntax for variables, we were able to define default color values for light mode. Then we added Dark Mode variable overrides where we defined our Dark Mode colors. If the webpage is being viewed from within the iOS or Android app with Dark Mode enabled, we load the Dark Mode variables first, so the default (light mode) values aren't used because the variables have already been defined for Dark Mode. This approach is easy to maintain and performant, avoiding a long list of style overrides for Dark Mode.

A BETTER DESIGN SYSTEM

Implementing Dark Mode was no small task. It took months of design and engineering effort from the Design Systems team, in collaboration with apps teams across the company. A big thank you to Patrick Montalto, Kate Kelly, Stephanie Sharp, Dennis Kramer, Gabe Kelley, Sam Sherman, Matthew Spencer, Netty Devonshire, Han Cho and Janice Chang. In the end, our users not only got the Dark Mode they'd been asking for, but we also developed a more robust and accessible design system in the process.
MUTATION TESTING: A TALE OF TWO SUITES

Posted by Nathan Thompson on August 17, 2020

In January of 2020 Etsy engineering embarked upon a highly anticipated initiative. For years our frontend engineers had been using a JavaScript test framework that was developed in house. It utilized Jasmine for assertions and syntax, PhantomJS for the test environment, and a custom, built-from-scratch test runner written in PHP. This setup no longer served our needs for a multitude of reasons:

* It was slow and memory intensive to run
* It was expensive to maintain, let alone enhance
* PhantomJS had been archived nearly two years before

It was time to reach for an industry standard tool with a strong community and a feature list JavaScript developers have come to expect. We settled on Jest because it met all of those criteria and naturally complemented the areas of our site that are built with React. Within a few months we had all of the necessary groundwork in place to begin a large-scale effort of migrating our legacy tests. This raised an important question: would our test suite be as effective at catching regressions if it was run by Jest as opposed to our legacy test runner?

At first this seemed like a simple question. Our legacy test said:

expect(a).toEqual(b);

And the migrated Jest test said:

expect(a).toEqual(b);

So weren't they doing the same thing? Maybe. What if one checked shallow equality and the other checked deep equality? And what about assertions that had no corollaries in Jest?
Our legacy suite relied on jasmine.ajax and jasmine-jquery, and we would need to propose alternatives for both modules when migrating our tests. All of this opened the door for subtle variations to creep in and make the difference between catching and missing a bug. We could have spent our time poring through the source code of Jasmine and Jest to figure out if these differences really existed, but instead we decided to use Mutation Testing to find them for us.

WHAT IS MUTATION TESTING?

Mutation Testing allows developers to score their test suite based on how many potential bugs it can catch. Since we were testing our JavaScript suite we reached for Stryker, which works roughly the same as any other Mutation Testing framework. Stryker analyzes our source code, makes any number of copies of it, then mutates those copies by programmatically inserting bugs into them. It then runs our unit tests against each "mutant" and sees if the suite fails or passes. If all tests pass, then the mutant has survived. If one or more tests fail, then the mutant has been killed. The more mutants that are killed, the more confidence we have that the suite will catch regressions in our code. After testing all of these potential mutations, Stryker generates a score by dividing the number of mutants that were killed by the total number generated.

_Output from running Stryker on a single file_

Stryker's default reporter even displays how it generated the mutants that survived, so it's easy to identify the gaps in the suite. In this case, two Conditional Expression mutants and a Logical Operator mutant survived. All together, Stryker supports roughly thirty possible mutation types, but that list can be whittled down for faster test runs.
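To make the mechanics concrete, here is a toy illustration. Stryker mutates JavaScript; this sketch uses Python purely to show what a mutant is and what it means to kill one:

```python
# Toy illustration of mutation testing. A "mutant" is the source with one
# operator swapped; the suite's job is to make the mutant fail a test.

def free_shipping(subtotal):          # original
    return subtotal >= 35

def free_shipping_mutant(subtotal):   # conditional-boundary mutant: >= becomes >
    return subtotal > 35

def test_free_shipping(impl):
    assert impl(100) is True
    assert impl(10) is False
    assert impl(35) is True   # without this boundary case the mutant survives

test_free_shipping(free_shipping)     # original passes
try:
    test_free_shipping(free_shipping_mutant)
except AssertionError:
    print("mutant killed")            # a failing test kills the mutant

# Mutation Score = killed mutants / total mutants generated.
```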
THE EXPERIMENT
Since our hypothesis was that the implementation differences between Jasmine and Jest could affect the Mutation Score of our legacy and new test suites, we began by cataloging every bit of Jasmine-specific syntax in our legacy suite. We then compiled a list of roughly forty test files that we would target for Mutation Testing in order to cover the full syntax catalog. For each file we generated a Mutation Score for its legacy state, converted it to run in our new Jest setup, and generated a Mutation Score again. Our hope was that the new Jest framework would have a Mutation Score as good as or better than our legacy framework.
By limiting the scope of our test to just a few dozen files, we were able to run all mutations Stryker had to offer within a reasonable timeframe. However, the sheer size of our codebase and the sprawling dependency trees in any given feature presented other challenges to this work. As I mentioned before, Stryker copies the source code to be mutated into separate sandbox directories. By default, it copies the entire project into each sandbox, but that was too much for Node.js to handle in our repository:

_Error when opening too many files_

Stryker allows users to configure an array of files to copy over instead of the entire codebase, but doing so would require us to know the full dependency tree of each file that we hoped to test ahead of time. Instead of figuring that out by hand, we wrote a custom Jest file resolver specifically for our Stryker testing environment. It would attempt to resolve source files from the local directory structure, but it wouldn't fail immediately if they weren't found. Instead, our new resolver would reach outside of the Stryker sandbox to find the file in the original directory structure, copy it into the sandbox, and re-initiate the resolution process. This method saved us time for files that had very expansive dependency trees. With that in hand, we pressed forth with our experiment.

THE RESULT
Ultimately we found that our new Jest framework had a _worse_ Mutation Score than our legacy framework.

…WAIT, WHAT?
It's true. On average, tests run by our legacy framework received a 55.28% Mutation Score, whereas tests run by our new Jest framework received a 54.35%. In one of our worst cases, the legacy test earned a 35% while the Jest test picked up a measly 16%.

ANALYZING THE RESULT

Once we began seeing lower Mutation Scores on a multitude of files, we put a hold on the migration to investigate what sort of mutants were slipping past our new suite. It turned out that most of what our new Jest suite failed to catch were String Literal mutations in our Asynchronous Module Definitions:

_Mutant generated by replacing a dependency definition with an empty string_
We dug into these failures further and discovered that the real culprit was how the different test runners compiled our code. Our legacy test runner was custom built to handle Etsy's unique codebase and was tightly coupled to the rest of our infrastructure. When we kicked off tests it would locate all relevant source and test files, run them through our actual webpack build process, then load the resulting code into PhantomJS to execute. When webpack encountered empty strings in the dependency definition it would throw an error and halt the test, effectively catching the bug even if there were no tests that actually relied on that dependency. Jest, on the other hand, was able to bypass our build system using its file resolver and a handful of custom mappings and transformers. This was one of the big draws of the migration in the first place; decoupling the tests from the build process meant they could execute in a fraction of the time. However, the module we used in Jest to manage dependencies was much more lenient than our actual build system, and empty strings were simply ignored. This meant that unless a test actually relied on the dependency, our Jest setup had no way to alert the tester if it was accidentally left out. Ultimately we decided that this sort of bug was acceptable to let slide. While it would no longer be caught during the testing phase, the code would still be rejected by the build phase of our CI pipeline, thereby preventing the bug from reaching Production. As we proceeded with the migration we encountered a handful of other cases where the Mutation Scores were markedly different, one of which is particularly notable.
We happened upon an asynchronous test that used a done() callback to signify when the test should exit. The test was malformed in that there were _two_ done() callbacks with assertions between them. In Jest this was no big deal; it happily executed the additional assertions before ending the test. Jasmine was much more strict, though: it stopped the test immediately when it encountered the first callback. As a result, we saw a significant jump in Mutation Score because mutants were suddenly being caught by the dangling assertions. This validated our suspicion that implementation differences between Jasmine and Jest could affect which bugs were caught and which slipped through.

THE FUTURE OF MUTATION TESTING AT ETSY

Over the course of this experiment we learned a ton about our testing frameworks and Mutation Testing in general. Stryker generated more than 3,800 mutations for the forty or so files that were tested, which equates to roughly ninety-five test runs per file. In all transparency, that number is likely to be artificially low, as we ruled out some of the files we had initially identified for testing when we realized they generated many hundreds of mutations. If we assume our calculated average is indicative of all files and account for how long it takes to run our entire Jest suite, then we can estimate that a single-threaded, full Mutation Test of our entire JavaScript codebase would take about five and a half _years_ to complete. Granted, Stryker parallelizes test runs out of the box, and we could potentially see even more performance gains using Jest's findRelatedTests feature to narrow down which tests are run based on which file was mutated. Even so, it's difficult to imagine running a full Mutation Test on any regular cadence. While it may not be feasible for Etsy to test every possible mutant in the codebase, we can still gain insights about our testing practices by applying Mutation Testing at a more granular level. A manageable approach would be to generate a Mutation Score automatically any time a pull request is opened, and focus the testing on only the files that changed. Posting that information on the pull request could help us understand what conditions will cause our unit tests to fail. It's easy to write an overly-lenient test that will pass no matter what, and in some ways that's more dangerous than having no test at all. If we only look at Code Coverage, such a test boosts our numbers, giving us a false sense of security that bugs will be caught. Mutation Score forces us to confront the limitations of our suite and encourages us to test as effectively as possible.
HOW TO PICK A METRIC AS THE NORTH STAR FOR ALGORITHMS TO OPTIMIZE BUSINESS KPI? A CAUSAL INFERENCE APPROACH

Posted by Xuan Yin on August 3, 2020
> This article draws on our published paper in KDD 2020 (Oral Presentation, Selection Rate: 5.8%, 44 out of 756).

INTRODUCTION
It is common in the internet industry to develop algorithms that power online products using historical data. An algorithm that improves evaluation metrics from historical data will be tested against one that has been in production to assess the lift in key performance indicators (KPIs) of the business in online A/B tests. We refer to metrics calculated using new predictions from an algorithm and historical ground truth as offline evaluation metrics.

In many cases, offline evaluation metrics are different from business KPIs. For example, a ranking algorithm, which powers search pages on Etsy.com, typically optimizes for relevance by predicting purchase or click probabilities of items. It could be tested offline (offline A/B tests) for rank-aware evaluation metrics, for example, normalized discounted cumulative gain (NDCG), mean reciprocal rank (MRR) or mean average precision (MAP), which are calculated using predicted ranks of ranking algorithms on the test set of historical purchase or click-through feedback of users. Most e-commerce platforms, however, deem sitewide gross merchandise sales (GMS) their business KPI and test for it online. There could be various reasons not to directly optimize for business KPIs offline or use business KPIs as offline evaluation metrics, such as technical difficulty, business reputation, or user loyalty.

Nonetheless, the discrepancy between offline evaluation metrics and online business KPIs poses a challenge to product owners, because it is not clear which offline evaluation metric, among all available ones, is the north star to guide the development of algorithms in order to optimize business KPIs. The challenge essentially asks for the _causal effects_ of increasing offline evaluation metrics on business KPIs, for example how business KPIs would change for a 10% increase in an offline evaluation metric with all other conditions remaining the same (_ceteris paribus_). The north star should be the offline evaluation metric that has the greatest causal effects on business KPIs.

Note that business KPIs are impacted by numerous factors, internal and external, especially macroeconomic situations and sociopolitical environments. This means that, just from changes in offline evaluation metrics, we have no way to predict future business KPIs. Because we are only able to optimize for offline evaluation metrics to affect business KPIs, we try to infer the change in business KPIs, given all other factors, internal and external, unchanged, from changes in our offline evaluation metrics, based on historical data. Our task here is _causal inference_ rather than prediction.

Our approach is to introduce online evaluation metrics, the online counterparts of offline evaluation metrics, which measure the performance of online products (see Figure 1). This allows us to decompose the problem into two parts: the first part is the consistency between changes of offline and online evaluation metrics; the second part is the _causality_ between online products (assessed by online evaluation metrics) and the business (assessed by online business KPIs). The first part is solved by the offline A/B test literature through counterfactual estimators of offline evaluation metrics. Our work focuses on the second part. The north star should be the offline evaluation metric whose online counterpart has the greatest causal effects on business KPIs. Hence, the question becomes how business KPIs would change for a 10% increase in an online evaluation metric _ceteris paribus_.
Figure 1: The Causal Path from Algorithm Trained Offline to Online Business

Note: Offline algorithms power online products, and online products contribute to the business.

WHY CAUSALITY?
Why do we focus on causality? Before answering this question, let's think about another interesting question: the thirsty crow vs. the talking parrot; which one is more intelligent (see Figure 2)?

Figure 2: Thirsty Crow vs. Talking Parrot

Note: The left painting is from _The Aesop for Children_, by Aesop, illustrated by Milo Winter, http://www.gutenberg.org/etext/19994

In _Aesop's Fables_, a thirsty crow found a pitcher with water at the bottom. The water was beyond the reach of its beak. It intentionally dropped pebbles into the pitcher, which caused the water to rise to the top. A talking parrot cannot really talk. After being fed a simple phrase tons of times (big data and machine learning), it can only mimic the speech without understanding its meaning.
The crow is obviously more intelligent than the parrot. The crow understood the causality between dropped pebbles and rising water and thus leveraged the causality to get the water. Beyond big data and machine learning (the talking parrot), we want our artificial intelligence (AI) system to be as intelligent as the crow. After understanding the causality between evaluation metric lift and GMS lift, our system can leverage the causality, by lifting the evaluation metric offline, to achieve GMS lift online (see Figure 3). Understanding and leveraging causality are key topics in current AI research (see, e.g., Bergstein, 2020).

Figure 3: Understanding and Leveraging Causality in Artificial Intelligence
CAUSAL META-MEDIATION ANALYSIS

Online A/B tests are popular for measuring the causal effects of online product changes on business KPIs. Unfortunately, they cannot directly tell us the causal effects of increasing offline evaluation metrics on business KPIs. In online A/B tests, in order to compare the business KPIs caused by different values of an online evaluation metric, we would need to fix the metric at different values for the treatment and control groups. Take the ranking algorithm as an example. If we could fix online NDCG of the search page at 0.22 and 0.2 for treatment and control groups respectively, then we would know how sitewide GMS would change for a 10% increase in online NDCG at 0.2 _ceteris paribus_. However, this experimental design is impossible, because most online evaluation metrics depend on users' feedback and thus cannot be directly manipulated.

We address the question by developing a novel approach: causal meta-mediation analysis (CMMA). We model the causality between online evaluation metrics and business KPIs by a dose-response function (DRF) in the potential outcome framework. The DRF originates from medicine and describes the magnitude of the response of an organism given different doses of a stimulus. Here we use it to depict the value of a business KPI given different values of an online evaluation metric. Unlike doses of stimuli, values of online evaluation metrics cannot be directly manipulated. However, they could differ between treatment and control groups in experiments of treatments other than algorithms: user interface/user experience (UI/UX) design, marketing, and so on. This could be due to the "fat hand" nature of online A/B tests, where a single intervention can change many causal variables at once. A change of the tested feature, which is not an algorithm, could induce users to change their engagement with algorithm-powered online products, so that values of online evaluation metrics would change. For instance, in an experiment of UI design, users might change their search behaviors because of the new UI design, so that values of online NDCG, which depend on search interaction, would change even though the ranking algorithm does not change (see Figure 5). The evidence suggests that online evaluation metrics could be mediators that partially transmit causal effects of treatments on business KPIs in experiments where treatments are not necessarily algorithm-related. Hence, we formalize the problem as the identification, estimation, and testing of the mediator DRF.

Figure 5: Directed Acyclic Graph of Conceptual Framework

Our novel approach CMMA combines mediation analysis and meta-analysis to solve for the mediator DRF. It relaxes common assumptions in the causal mediation literature, sequential ignorability (in the linear structural equation model) or complete mediation (in the instrumental variable approach), and extends meta-analysis to solve causal mediation, while the meta-analysis literature only learns the distribution of average treatment effects. We did extensive simulations, which show CMMA's performance is superior to other methods in the literature in terms of unbiasedness and the coverage of confidence intervals. CMMA uses only experiment-level summary statistics (i.e., meta-data) of many existing experiments, which makes it easy to implement and to scale up. It can be applied to all experimental-unit-level evaluation metrics or any combination of them.
Because it solves the causality problem of a product by leveraging experiments of all products, CMMA could be particularly useful in real applications for a new product that has been shipped online but has few A/B tests.

APPLICATION
We apply CMMA to the three most popular rank-aware evaluation metrics, NDCG, MRR, and MAP, to show, for ranking algorithms that power search products, which one has the greatest causal effect on sitewide GMS.

USER-LEVEL RANK-AWARE METRICS

We redefine the three rank-aware metrics (NDCG, MAP, MRR) at the user level. The three metrics are originally defined at the query level in the test collection evaluation of the information retrieval (IR) literature. Because the search engine on Etsy.com is an online product for users, the computation could be adapted to the user level. We include search sessions of no interaction or no feedback in the metric calculation, in accordance with online business KPI calculation in online A/B tests, which always includes visits/users of no interaction or no feedback. Specifically, the three metrics are constructed as follows (see the sketch below):

* Query-level metrics are computed using rank positions on the search page and user conversion status (binary relevance). Queries of non-conversion have zero values.
* The user-level metric is the average of query-level metrics across all queries the user issues (including non-conversion associated queries). Users who do not search or convert have zero values.

All three metrics are defined at rank position 48, the lowest position on the first page of search results on Etsy.com.
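A minimal sketch of that user-level definition for NDCG, with binary conversion relevance and the rank-48 cutoff; the data is hypothetical and the log2 discounting is the standard NDCG convention, assumed rather than quoted from the paper:

```python
# User-level NDCG@48 with binary (conversion) relevance.
import math

K = 48

def query_ndcg(converted_ranks, k=K):
    """converted_ranks: 1-based rank positions of purchased items for one query."""
    ranks = [r for r in converted_ranks if r <= k]
    if not ranks:
        return 0.0  # queries with no conversion contribute zero
    dcg = sum(1.0 / math.log2(r + 1) for r in ranks)
    # Ideal DCG: the same conversions placed at the top of the page.
    idcg = sum(1.0 / math.log2(i + 2) for i in range(len(ranks)))
    return dcg / idcg

def user_ndcg(queries):
    """queries: one converted-rank list per query the user issued."""
    if not queries:
        return 0.0  # users who do not search or convert get zero
    return sum(query_ndcg(q) for q in queries) / len(queries)

# A user with two queries: one converted at rank 3, one with no conversion.
print(user_ndcg([[3], []]))  # -> 0.25
```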
DEMO

To demonstrate CMMA, we randomly selected 190 experiments from 2018 and implemented CMMA on summary results of each experiment (e.g., the average user-level NDCG per user in treatment and control groups). Figure 6 shows results from CMMA. The vertical axis indicates elasticity: the percentage change of average GMS per user for a 10% increase in an online rank-aware metric, with all other conditions that can affect GMS remaining the same. Lifts in all three rank-aware metrics have positive causal effects on the average GMS per user. They do not perform identically, though: different values of different rank-aware metrics have different elasticities. For example, suppose the current values of NDCG, MAP, and MRR of the search page are 0.00210, 0.00156, and 0.00153 respectively; then a 10% increase in MRR will cause higher lifts in GMS than 10% increases in the other two metrics _ceteris paribus_, as indicated by the red dashed lines, and thus MRR should be the north star to guide our development of algorithms. Because all three metrics have the same input data, they are highly correlated and thus the differences are small. As new IR evaluation metrics are continuously developed in the literature, we will implement CMMA for more comparisons in the future.

Figure 6: Elasticity from Average Mediator DRF Estimated by CMMA

Note: Elasticity means the percentage change of average GMS per user for a 10% increase in an online rank-aware metric with all other conditions that can affect GMS remaining the same.
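In symbols, the quantity being plotted is the standard elasticity of the KPI with respect to the mediator, read off the average mediator dose-response function; the notation below is ours, not necessarily the paper's exact formulation:

```latex
% Elasticity of the KPI Y (average GMS per user) with respect to an online
% evaluation metric M, given the average mediator dose-response function
% \mu(m) = E[\,Y \mid do(M = m)\,]:
\varepsilon(m) \;=\; \frac{\partial \mu(m)}{\partial m}\cdot\frac{m}{\mu(m)}
% A 1% increase in M at value m changes Y by about \varepsilon(m)%, so the
% 10% lift quoted in Figure 6 corresponds to roughly 10\,\varepsilon(m)%.
```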
CONCLUSION

We implemented CMMA to identify the north star among rank-aware metrics for search-ranking algorithms. It has helped product and engineering teams achieve efficiency and success in terms of business KPIs in online A/B tests. We published CMMA at KDD 2020; interested readers can refer to our paper for more details.
CHAINING IOS MACHINE LEARNING, COMPUTER VISION, AND AUGMENTED REALITY TO MAKE THE MAGICAL REAL

Posted by Jacob Van Order on June 23, 2020

Etsy recently released a feature in our buyer-facing iOS app that allows users to visualize wall art within their environments. Getting the context of a personal piece of art within your space can be a meaningful way to determine whether the artwork will look just as good in your room as it does on your screen. The new feature uses augmented reality to bridge that gap, meshing the virtual and real worlds. Read on to learn how we made this possible using aspects of machine learning and computer vision to present the best version of Etsy sellers' artwork in augmented reality. It didn't even require a PhD-level education or an expensive third-party vendor; we did it all with tools provided by iOS.

BUILDING A CHAIN
USING COMPUTERS TO SEE

Early in 2019, I put together a quick proof of concept that allowed for wall art to be displayed on a vertical plane, which required a standalone image of the artwork filling the entire image. Oftentimes, though, Etsy sellers upload images that show their item in context, like on a living room wall, to show scale. This complicates the process because these listing images can't be placed onto vertical planes in augmented reality as-is; they need to be reformatted and cropped.
Two engineers, Chris Morris and Jake Kirshner, developed a solution that used computer vision to find a rectangle within an image, perhaps a frame, and crop the image for use. Using the Vision framework in iOS, they were able to pull out the artwork we needed to place in 3D space. We found that trying to detect only one rectangle, as opposed to all, created performance wins and gave us the shape with the greatest confidence by the system. Afterwards, we used Core Image to crop the image, adjusting for any perspective skew that might be present. Apple has an example using a frame buffer, but the technique can be applied to any UIImage.
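The Vision and Core Image APIs are Swift-side; as a rough cross-language illustration of the same find-one-rectangle-then-unskew idea, here is an OpenCV sketch in Python. The thresholds, target size, and corner-ordering trick are illustrative assumptions, not the app's actual code:

```python
# Find the most prominent quadrilateral in a listing photo, then crop it
# out with a perspective transform.
import cv2
import numpy as np

image = cv2.imread("listing.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 50, 150)

contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)

best = None
for contour in sorted(contours, key=cv2.contourArea, reverse=True):
    approx = cv2.approxPolyDP(contour, 0.02 * cv2.arcLength(contour, True), True)
    if len(approx) == 4:          # keep the largest 4-sided contour (the frame)
        best = approx.reshape(4, 2).astype(np.float32)
        break

if best is not None:
    # Order corners (tl, tr, br, bl) so the warp comes out upright.
    s = best.sum(axis=1)
    d = np.diff(best, axis=1).ravel()   # y - x per corner
    src = np.float32([best[s.argmin()], best[d.argmin()],
                      best[s.argmax()], best[d.argmax()]])
    w, h = 600, 800  # target artwork size, arbitrary for this sketch
    dst = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    artwork = cv2.warpPerspective(
        image, cv2.getPerspectiveTransform(src, dst), (w, h)
    )
```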
TO CROP OR NOT TO CROP

As I mentioned before, some Etsy sellers upload their artwork as standalone images, while others depict their artwork in different environments. We wanted to present the former as-is, and we needed to crop the latter, but we had no way to automatically categorize the more than 5 million artwork listings available on our marketplace. To solve this, we used on-device machine learning provided by Core ML. The team sifted through more than 1,200 listings and sorted the images by those that should be cropped and those that should not be cropped. To create the machine learning model, we first used an iOS Playground and, later, a Mac application called Create ML. The process was as easy as dropping a directory with two subdirectories filled with correct images, "no_frames" and "frames", into the application, along with a corresponding smaller set of different images used to test the resulting model. Once this model was created and verified, we used VNCoreMLRequest to check a listing's image and determine whether we should crop it or present it as-is. This type of model is known as image classification. We also investigated a different type of model called object detection, which _finds_ the existence and coordinates of a frame within an image. This technique had two downsides: training the model required laborious manual object marking for each image provided, and the resulting model, which would be included in our app bundle, would be well over 60 MB vs. the 15 _KB_ model for image classification. That's right, kilobytes.
TRANSLATING TWO DIMENSIONS TO THREE

Once we had the process for determining whether the image needed to be reformatted, we used a combination of iOS' SceneKit and ARKit to place the artwork as a material on a rudimentary shape. With Apple focusing heavily on this space, we were able to find plenty of great examples and tutorials to get us started with augmented reality on iOS. We started with the easy-to-use RealityKit framework, but the iOS 13-only restriction was a blocker, as we supported back to iOS 11 at the time. The implementation in ARKit was relatively straightforward, technically, but working for the first time in 3D space vs. a flat screen, it was a challenge to develop a vocabulary and way of thinking about the physical space being altered by the virtual. It was difficult putting into words the difference between, for example, moving on a y-axis and how that differed from making the item scale in size. While this was eventually smoothed out with experience, we knew we had to keep this in mind for Etsy buyers, as augmented reality is not a common experience for most people. For example, how would we coach them through the fact that ARKit needs them to use the camera to scan the room to find the _edges_ of the wall in order to discern the vertical plane? What makes it apparent that they can tap on screen? In order to give our users an inclination of how to use this feature successfully, our designer, Kate Matsumoto, product manager, Han Cho, and copywriter, Jed Baker, designed an onboarding flow, based on user testing, that walks our buyers through this new experience.
WRAPPING IT ALL UP
Using machine learning to determine if we should crop an image or not, cropping it based on a strong rectangle, and presenting the artwork on a real wall was only part of the picture here. Assisted by Evan Wolf and Nook Harquail, we also dealt with complex problems including parsing item descriptions to gather dimensions, ray-traced hit testing, and color averaging to make this feature feel as seamless and lifelike as possible for Etsy buyers. From here, we have plenty of ideas for continuing to improve this experience, but in the meantime, I encourage you to consider the fantastic frameworks you have at your disposal, and how you can link them together to create an experience that seemed impossible just years ago.

REFERENCED FRAMEWORKS:

* Vision
* Core Image
* Core ML
* Create ML
* SceneKit
* ARKit
* RealityKit
KEEPING IT SUPPORT HUMAN DURING WFH

Posted by Seth Liber on May 6, 2020
_Image: Human Connection, KatieWillDesign on Etsy_
Hi! We're the Etsy Engineering team that supports core IT and AV capabilities for all Etsy employees. Working across geographies has always been part of our company's DNA; our globally distributed teams use collaboration tools like Google apps, Slack, and video conferencing. As we transitioned to a fully distributed model during COVID-19, we faced both unexpected challenges and opportunities to permanently improve our support infrastructure. In this post we will share some of the actions we took to support our staff as they spread out across the globe.

DIGGING DEEPER ON OUR CORE VALUES

KEEPING SUPPORT HUMAN

Our team's core objective is to empower Etsy employees to do their best work. We give them the tools they need and we teach, train, and support them to use those tools as best they can. We also document and share our work in the form of user guides and support run-books. With friendly interactions during support, we strive to embody Etsy's mission to _Keep Commerce Human®_. Despite being further physically distributed, we found ways to increase human connections.

* We launched DAILY TIPS THAT ARE DELIVERED IN SLACK, inspired by ideas from teams across Etsy, and we prioritized items that were most relevant to helping folks navigate their new work setups.
* In addition to our existing monitored _#helpdesk_ Slack channel, we hosted LIVE, VIRTUAL HELPDESK hours on video.
* We added DOWNLOADABLE SELF-SERVICE SOFTWARE in our internal IT tools that reduced the load on our front-line helpdesk team, and enabled employees to quickly resolve some common issues with easily digestible one-page guides.

STAYING CONNECTED
Before Etsy went fully remote, a common meeting setup included teams in multiple office locations connecting through video-conference-enabled conference rooms, with additional remote participants dialing in. To better support the volume of video calls we needed with all employees WFH, we accelerated our planned video conferencing migration to Google Meet. We also quickly engineered solutions to integrate Google Meet, including making video conference details the default option in calendar invites and enabling add-ons that improve the video call experience. Within a month we had a 1000% increase in Google Meet usage and a ~60% drop-off from the old platform.
We also adapted our large team events, such as department-wide all-hands meetings, to support a fully remote experience. We created new "ask me anything" formats and shortened some meetings' length. To make the meetings run smoothly, we added additional behind-the-scenes prep for speakers, gathered Q&A in advance, and created user guides so teams can self-manage complex events.

CONTINUING PROJECT PROGRESS

We reviewed our committed and potential project list and decided where we could prioritize focus, adapting to the new needs of our employees. Standing up additional monitoring tools allowed us to be even more proactive about the health of our end-point fleet. We also seized opportunities to take advantage of our empty offices to do work that would have otherwise been disruptive. We were able to complete firewall and AV equipment firmware upgrades (remotely, of course) in a fraction of the time it would have taken us with full offices.

IN SUMMARY, SOME LEARNINGS

COLLABORATION IS KEY

Our team is very fortunate to have strong partners, buoyed by supportive leadership, operating in an inclusive company culture that highly values innovation. Much of our success in this highly unique situation is a result of multiple departments coming together quickly, sharing information as they received it, and being flexible during rapid change. For example, we partnered with our recruiting, human resources, and people development teams to adjust how we would onboard new employees, contractors, and short-term temporary employees, ensuring we properly deployed equipment and smoothly introduced them to Etsy.
RESPECT THE DIVERSITY OF WFH SITUATIONS

We've dug deeper into ways to help all our employees work effectively at home. We're constantly learning, but we continue to build a robust "how to work from home" guide and encourage transparency around each employee's challenges so that we can help find solutions. Home networks can be a major point of friction, and we've built out guides to help our employees optimize their network and Wi-Fi setups.
EMPATHY FOR EACH OTHER

Perhaps most of all, through this experience we've gained an increased level of empathy for our peers. We've learned that there are big differences between working from home for one day, being a full-time remote employee, and working in isolation during a global crisis. We're using this awareness to rethink the future of our meeting behaviors, the technology in our conference rooms, and the way we engage with each other throughout the day, whether we're in or out of the office.
CLOUD JEWELS: ESTIMATING KWH IN THE CLOUD

Posted by Emily Sommer, Mike Adler, John Perkins, Joshua Thiel, Hilary Young, Chelsea Mozen, Dany Daya and Katie Sundstrom on April 23, 2020
_Image: Lightning Storm Earrings, GojoDesign on Etsy_
Etsy has been increasingly enjoying the perks of public cloud infrastructure for a few years, but has been missing a crucial feature: we've been unable to measure our progress against one of our key impact goals for 2025, to reduce our energy intensity by 25%.
Cloud providers generally do not disclose to customers how much energy their services consume. To make up for this lack of data, we created a set of conversion factors called _Cloud Jewels_ to help us roughly convert our cloud usage information (like Google Cloud usage data) into approximate energy used. We are publishing this research to begin a conversation and a collaboration that we hope you'll join, especially if you share our concerns about climate change. This isn't meant as a replacement for energy use data or guidance from Google, Amazon or another provider. Nor can we guarantee the accuracy of the rough estimates the tool provides. Instead, it's meant to give us a sense of energy usage and relative changes over time based on aggregated data on how we use the cloud, in light of publicly-available information.

A LITTLE BACKGROUND
In the face of a changing climate, we at Etsy are committed to reducing our ecological footprint. In 2017, we set a goal of reducing the _intensity_ of our energy use by 25% by 2025, meaning we should use _less energy in proportion to the size of our business_. In order to evaluate ourselves against our 25% energy intensity reduction goal, we have historically measured our energy usage across our footprint, including the energy consumption of servers in our data centers.
In early 2020, we finished our two-year migration from our own physical servers in a colocated data center to Google Cloud. In addition to the massive increase in the power and flexibility of our computing capabilities, the move was a win for our sustainability efforts because of the efficiency of Google's data centers. Our old data centers had an average PUE (Power Usage Effectiveness) of 1.39 (FY18 average across colocated data centers), whereas Google's data centers have a combined average PUE of 1.10. PUE is the ratio of the total amount of energy a data center uses to how much energy goes to powering computers; it captures how efficient factors like the building itself and air conditioning are in the data center.
While a lower PUE helps our energy footprint significantly, we need to be able to measure and optimize the amount of power that our servers draw. Knowing how much energy each of our workloads uses helps us make design and code decisions that optimize for sustainability. The Google Cloud team has been a terrific partner to us throughout our migration, but they are unable to provide us with data about our cloud energy consumption. This is a challenge across the industry: neither Amazon Web Services nor Microsoft Azure provides this information to customers. We have heard concerns that range from difficulties attributing energy use to individual customers to sensitivities around proprietary information that could reveal too much about cloud providers' operations and financial position.

We thought about how we might be able to estimate our energy consumption in Google Cloud using the data we do have: Google provides us with usage data that shows us how many virtual CPU (Central Processing Unit) seconds we used, how much memory we requested for our servers, how many terabytes of data we have stored for how long, and how much networking traffic we were responsible for. Our supposition was that if we could come up with general estimates for how many watt-hours (Wh) compute, storage and networking draw in a cloud environment, particularly based on public information, then we could apply those coefficients to our usage data to get at least a rough estimate of our cloud computing energy impact. We are calling this set of estimated conversion factors _Cloud Jewels_. Other cloud computing consumers can look at this and see how it might work with their own energy usage across providers and usage data. The goal is to help cloud users across the industry refine our estimates, and ultimately help us encourage cloud providers to empower their customers with more accurate cloud energy consumption data.
METHODOLOGY
The sources that most influenced our methodology were the U.S. Data Center Energy Usage Report, The Data Center as a Computer, and the SPEC power report. We also spoke with industry experts Arman Shehabi, Jon Koomey, and Jon Taylor, who suggested additional resources and reviewed our methodology. We roughly assumed that we could attribute power used to:

* running a virtual server (compute),
* memory (RAM),
* storage, and
* networking.
Using the resources we found online, we were able to determine what we think are reasonable, conservative estimates for the amount of energy that compute and storage tasks consume. We are aiming for a conservative over-estimate of energy consumed to make sure we are holding ourselves fully accountable for our computing footprint. We have yet to determine a reasonable way to estimate the impact of RAM or network usage, but we welcome contributions to this work! We are open-sourcing a script for others to apply these coefficients to their usage data, and the full methodology is detailed in our repository on GitHub.
CLOUD JEWELS COEFFICIENTS

The following coefficients are our estimates for how many watt-hours (Wh) it takes to run a virtual server and how many watt-hours (Wh) it takes to store a terabyte of data on HDD (hard disk drive) or SSD (solid-state drive) disks in a cloud computing environment:

* 2.10 Wh per vCPUh (compute)
* 0.89 Wh/TBh for HDD storage
* 1.52 Wh/TBh for SSD storage

ON CONFIDENCE
As you may note, we are using point estimates without confidence intervals. This is partly intentional and highlights the experimental nature of this work. Our sources also provide single, rough estimates without confidence intervals, so we decided against numerically estimating our confidence so as to not provide false precision. Our work has been reviewed by several industry experts, and our energy and carbon metrics for cloud computing have been assured by PricewaterhouseCoopers LLP. That said, we acknowledge that this estimation methodology is only a first step in giving us visibility into the ecological impacts of our cloud computing usage, and it may evolve as our understanding improves. Whenever there has been a choice, we have erred on the side of conservative estimates, taking responsibility for more energy consumption than we are likely using to avoid overestimating our savings. While we have limited data, we are using these estimates as a jumping-off point and carrying forth in order to push ourselves and the industry forward. We especially welcome contributions and opinions. Let the conversation begin!

SERVER WATTAGE ESTIMATE

At a high level, to estimate server wattage, we used a general formula for calculating server energy use over time:

W = MIN + UTIL * (MAX – MIN)

Wattage = Minimum wattage + Average CPU utilization * (Maximum wattage – Minimum wattage)

To determine minimum and maximum CPU wattage, we averaged the values reported by manufacturers of servers that are available in the SPEC power database (filtered to servers that we deemed likely to be similar to Google's servers), and we used an industry average server utilization estimate (45%) from the US Data Center Energy Usage Report.
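Plugging numbers into that formula is trivial, but it makes the estimate concrete. A sketch, where the min/max wattages are made-up stand-ins for the SPECpower averages:

```python
# Back-of-the-envelope server wattage, following the formula above.
def server_watts(min_watts, max_watts, utilization=0.45):
    # 45% average utilization per the US Data Center Energy Usage Report.
    return min_watts + utilization * (max_watts - min_watts)

# e.g. a hypothetical server idling at 50 W and peaking at 200 W:
print(server_watts(50, 200))  # -> 117.5 W at 45% average utilization
```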
STORAGE WATTAGE ESTIMATE

To estimate storage wattage, we used industry-wide estimates from the U.S. Data Center Usage Report. That report contains the estimated average capacity of disks as well as average disk wattage. We used both of those estimates to get an estimated wattage per terabyte.

NETWORKING NON-ESTIMATE

The resources we found related to networking energy estimates were for general internet data transfer, as opposed to intra-data-center traffic between connected servers. Networking also made up a significantly smaller portion of our overall usage cost, so we are assuming it requires less energy than compute and storage. Finally, as far as the research we found indicated, the energy attributable to networking is generally far smaller than that attributable to compute and storage.
Source: Arman Shehabi, Sarah J. Smith, Eric Masanet and Jonathan Koomey; _Data center growth in the United States: decoupling the demand for services from electricity use_; 2018
APPLICATION TO USAGE DATA

We aggregated and grouped our usage data by SKU, then categorized it by which type of service was applicable ("compute", "storage", "n/a"), converted the units to hours and terabyte-hours, then applied our coefficients. Since we do not yet have a coefficient for networking or RAM that we feel confident in, we are leaving that data out for now. The experts we have consulted are confident that our coefficients are conservative enough to account for our overall energy consumption without separate consideration for networking and RAM.
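A minimal sketch of that last step, using the coefficients above on hypothetical usage rows; the real work, mapping SKUs to categories and normalizing units, is elided:

```python
# Apply the Cloud Jewels coefficients to usage already converted to
# vCPU-hours and terabyte-hours.
WH_PER_VCPU_HOUR = 2.10
WH_PER_TB_HOUR_HDD = 0.89
WH_PER_TB_HOUR_SSD = 1.52

usage = [
    {"category": "compute", "vcpu_hours": 120_000},
    {"category": "storage_hdd", "tb_hours": 9_000},
    {"category": "storage_ssd", "tb_hours": 4_000},
    {"category": "n/a", "tb_hours": 500},  # networking/RAM: no coefficient yet
]

def estimated_wh(row):
    if row["category"] == "compute":
        return row["vcpu_hours"] * WH_PER_VCPU_HOUR
    if row["category"] == "storage_hdd":
        return row["tb_hours"] * WH_PER_TB_HOUR_HDD
    if row["category"] == "storage_ssd":
        return row["tb_hours"] * WH_PER_TB_HOUR_SSD
    return 0.0  # left out for now, per the methodology

total_kwh = sum(estimated_wh(r) for r in usage) / 1000
print(f"{total_kwh:.1f} kWh")  # -> 266.1 kWh for these made-up rows
```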
RESULTS

Applying our Cloud Jewels coefficients to our aggregated usage data, and comparing the estimates to the actual kWh totals of our former local data centers over the past two years, indicates that our energy footprint in Google Cloud is smaller than it was on premises. It's important to note that we are not taking into account networking or RAM, nor Google-maintained services like BigQuery, Bigtable, Stackdriver, or App Engine. Overall, though, assuming our estimates are even moderately close to accurate and verified to be conservative, we are on track to use less overall energy to do more computing than we were two years ago, even as our business has grown. That means we are making progress towards our energy intensity reduction goal.
We used historical data to estimate our energy savings since moving to Google Cloud. This assumes ~16% YoY growth in our former colocated data centers and actual/expected ~23% YoY growth in cloud usage between 2019-20 and beyond.
Our estimated savings over the five-year period are roughly equivalent to:

* ~20,000 sewing machines (running 24/7)
* ~147,000 light bulbs (running 24/7)
* ~1,200 dishwashers (running 24/7)

NEXT STEPS
We would next like to find ways to estimate the energy cost of network traffic and memory. There are also minor refinements we could make to our current estimates, though we want to ensure that further detail does not lead to false precision, that we do not overcomplicate the methodology, and that the work we publish is as generally applicable and useful to other companies as possible. Part of our reasoning for open-sourcing this work is selfish: WE WANT INPUT! We welcome contributions to our estimates and additional resources that we should be using to refine them. We hope that publishing these coefficients will help other companies who use cloud computing providers estimate their energy footprint. And finally, we hope that these efforts and estimates encourage more public information about cloud energy usage, and in particular help cloud providers find ways to determine and deliver data like this, either as broad coefficients for estimation or as actual energy usage metrics collected from their internal monitoring.
DEVELOPING IN A MONOREPO WHILE STILL USING WEBPACK

Posted by Salem Hilal on April 6, 2020 / 4 Comments

_When I talk to friends and relatives about what I do at Etsy, I have to come up with an analogy about what Frontend Infrastructure is. It's a bit tricky to describe because it's something that you don't see as an end user; the web pages that people interact with are several steps removed from the actual work that a frontend infrastructure engineer does. The analogy I usually fall back on is that of a restaurant: the meal is a fully formed web page, the chefs are product engineers, and the kitchen is the infrastructure. A good kitchen should make it easy to cook a bunch of different meals quickly and deliciously. Recently, my team and I spent over a year swapping out our home-grown, RequireJS-based JavaScript build system for Webpack. Running with this analogy a bit, this project is like trading out our kitchen without customers noticing, and without bothering the chefs too much. Large projects tend to be full of unique problems and unexpected hurdles, and this one was no exception. This post is the second in a short series on all the things that we learned during the migration, and is adapted in part from a talk I gave at JSConf 2019. The first post can be found here._
-------------------------

THE STATE OF JAVASCRIPT AT ETSY LAST YEAR.

At Etsy, we have a whole lot of JavaScript. This alone doesn't make us very unique, but we have something that not every other company has: a monorepo. When we deploy our web code, we need to build and deploy over 1,200 different JavaScript assets made up of over twelve thousand different JavaScript files, or modules. Like the rest of the industry, we find ourselves relying more and more on JavaScript, which means that a good bit more of our codebase ends in ".js" this year than last. When we started adopting Webpack, one of the first places we saw an early win was in our development experience. Up until this point, our engineers had been using a development server that we had written in-house. We ran a copy of it on every developer machine, where it built files as they were requested. This approach meant that you could reliably navigate around Etsy.com in development without needing to think about a build system at all. It also meant that we could start and restart an instance of the development server without worrying about losing state or interrupting developers much. Conceptually, this made things very simple to maintain. You truly couldn't have asked for a simpler diagram.
In practice, however, developers were asking for more from JavaScript _and_ from their build systems. We had started adopting React a few years prior using the then-available JSXTransform tool, which we added to our build system with a fair amount of wailing and gnashing of teeth. The result was a server that successfully, yet sluggishly, supported JSX. Because it wasn't designed with large applications in mind, our development server didn't do things like cache transpiled JSX between builds. Building some of our weightier JavaScript code often took the better part of a minute, and most of our developers grew increasingly frustrated with the long iteration cycles it produced. Worse yet, because we were using JSXTransform rather than something like Babel, our developers could use JSX but weren't able to use any ES6 syntax like arrow functions or classes.

BENDING WEBPACK TO OUR WILL.

Clearly, there was a lot about our development environment that could be improved. To be worth the effort of adopting, any new build system would at least have to support transpiled syntaxes like JSX, while still allowing for fast rebuild times for developers. Webpack seemed like a pretty safe bet: it was widely adopted; it was actively developed and funded; and everyone who had experience with it seemed to like it (in spite of its intimidating configuration). So, we spent a good bit of time configuring Webpack to work with our codebase (and vice versa). This involved writing some custom loaders for things like templates and translations, and it meant updating some of the older parts of our codebase that relied on the specifics of Require.js to work properly. After a lot of planning, testing, and editing, we were able to get Webpack to fully build our entire codebase. It took half an hour, and that was only when it didn't fill all 16 gigabytes of our development server's memory. Clearly, we had a lot more work on our plates.

_This is one of our beefiest machines maxing out all 32 of its processors and eating up over 20 gigs of memory trying to run Webpack once._
When Webpack runs in development mode, it typically behaves very differently from our old development server. It starts by compiling all your code as it would for a production build, leaving out optimizations that don't make sense in development (like minification and compression). It then switches to "watch mode", where it listens to your source files for changes and kicks off partial recompilations when any of your source code updates. This keeps it from starting from scratch every time an asset updates, and watching the filesystem lets builds start a few seconds before the assets are requested by the browser. Webpack is very good at partial rebuilds, which is how it remains fast, even for larger projects.

…AND MAYBE BENDING OUR WILL TO WEBPACK'S.

Although Webpack was designed for large projects, it wasn't designed for a whole company's worth of large projects. Our monorepo contains JavaScript code from every part of Etsy. Making Webpack try to build everything at once was a fool's errand, even after playing with plugins like HardSource, cache-loader, and HappyPack to either speed up the build time or reduce its resource footprint. We ended up admitting to ourselves that building everything at once was impossible. If your solution to a problem just barely works today, it's not going to be very useful when your problem doubles in size in a few years' time. A pretty straightforward next step would be to split up our codebase into logical regions and make a Webpack config for each one, rather than using one big config to build everything. Splitting things up would allow each individual build to be reasonably sized, cutting back on both build times and resource utilization. Plus, production builds wouldn't need to change much, since Webpack is perfectly happy accepting either a single configuration or an array of them.
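As a sketch, the split might look like a handful of named configs exported together (the area names and entrypoints here are hypothetical):

// Hypothetical per-area configs; each slice of the monorepo gets its own
// reasonably sized build. Webpack happily accepts the exported array.
const baseConfig = require("./webpack.base.config.js");

module.exports = [
    { ...baseConfig, name: "search", entry: { search: "./search/index.js" } },
    { ...baseConfig, name: "seller-tools", entry: { dashboard: "./seller/dashboard.js" } },
    { ...baseConfig, name: "internal-tools", entry: { admin: "./internal/admin.js" } },
];

Naming each config also turns out to matter shortly: Kevin, introduced below, relies on config names to identify compilers.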
There was one problem with this approach though: if we only built one slice of the site at a time, developers wouldn't be able to easily browse around Etsy.com in development without manually starting and stopping multiple instances of Webpack. There are a lot of features in Etsy that touch multiple parts of the site; a change to how a listing appears could mean a change for our search page, the seller dashboard, and our internal tools as well. We needed a solution that would both build only the parts of the site that made sense and maintain the "it just works!" behavior of our old system. So, we wrote something we're calling Kevin.

THIS IS KEVIN.
Kevin (technically "kevin-middleware") is an Express-style middleware that manages multiple instances of Webpack for you. Its job is to make it easier to build a monorepo's worth of JavaScript while maintaining the resource footprint of something much smaller. It was both inspired by and meant as a replacement for webpack-dev-middleware, which is what Webpack's own development server uses to manage a single instance of Webpack under the hood. If you happen to be using that, Kevin will probably feel a bit familiar. Kevin works by reading in a list of Webpack configurations and determining all of the assets that each one could be responsible for. It then listens for requests for those assets, determines the config that is responsible for a given asset, and starts an instance of Webpack with that config. It keeps a few instances around in memory based on a simple frecency algorithm, and monitors your source files in order to eagerly rebuild any changes. When there are more instances than a configured limit, the least-used compiler is shut down and cleaned up. While otherwise being a lot cooler in every respect, Kevin has an objectively more complicated diagram. Webpack's first build often takes a while. As I mentioned before, it has to do a first pass of all the assets it needs to build before it's able to do fast, iterative rebuilds. If a developer requests an asset from a config that isn't being built by an active compiler, that request might time out before a fresh compiler finishes its first build. Kevin tries to offset this problem by serving some static code that renders an overlay whenever an asset is requested from a compiler that's still running its first build. The overlay code communicates back with your development server to check on the status of your builds, and automatically reloads the page once everything is complete.
Using Kevin is meant to be really straightforward. If you don't already have a development server of some sort, creating one with Kevin and Express is maybe a dozen lines of code. Here's a snippet taken from Kevin's documentation:

const express = require("express");
const Kevin = require("kevin-middleware");

// This is an array of webpack configs. Each config **must** be named so that we
// can uniquely identify each one consistently. A regular ol' webpack config
// should work just fine as well.
const webpackConfigs = require("path/to/webpack.config.js");

// Set up your server and configure Kevin
const app = express();
const kevin = new Kevin(webpackConfigs, {
    kevinPublicPath: "http://localhost:3000"
});
app.use(kevin.getMiddleware());

// Serve static files as needed. This is required if you generate async chunks;
// Kevin only knows about the entrypoints in your configs, so it has to assume
// that everything else is handled by a different middleware.
app.use("/ac/webpack/js", express.static(webpackConfigs[0].output.path));

// Let 'er rip
app.listen(9275);
We've also made a bunch of Kevin's internals accessible through Webpack's own tapable plugin system. At Etsy, we use these hooks to integrate with our monitoring system, and to gracefully restart active compilers that have pending updates to their configurations. In this way, we can keep our development server up to date while keeping developer interruptions to a minimum.

SOMETIMES, A LITTLE CUSTOM CODE GOES A LONG WAY.

In the end, we were able to greatly improve the development experience. Rebuilding our seller tools, which previously took almost a minute on every request, now takes under 30 seconds when we're starting a fresh compiler, and subsequent requests take only a second or two. Navigating around Etsy.com in development still takes very little interaction with the build system from our engineers. Plus, we can now support all the other things that Webpack enables for us, like ES6, better asset analysis, and even TypeScript. This is the part where I should mention that Kevin is officially open-source software. Check out the source on GitHub, and install it from npm as kevin-middleware. If you have any feedback about it, we would welcome an issue on GitHub. I really hope you get as much use out of it as we did.

-------------------------

_This post is the second in a two-part series on our migration to a modern JavaScript build system. The first part can be found here._
THE CAUSAL ANALYSIS OF CANNIBALIZATION IN ONLINE PRODUCTS

Posted by Xuan Yin and Ercan Yildiz on February 24, 2020 / 2 Comments
> This article mainly draws on our published paper in KDD 2019 (oral presentation, selection rate 6%, 45 out of 700).

INTRODUCTION
Nowadays an internet company typically has a wide range of online products to fulfill customer needs. It is common for users to interact with multiple online products on the same platform at the same time. Consider, for example, Etsy's marketplace. There are organic search, recommendation modules (recommendations), and promoted listings enabling users to find interesting items. Although each of them offers a unique opportunity for users to interact with a portion of the overall inventory, they are functionally similar and compete for users' limited time, attention, and monetary budgets. To optimize users' overall experiences, instead of understanding and improving these products separately, it is important to gain insights into the evidence of CANNIBALIZATION: an improvement in one product induces users to decrease their engagement with other products. Cannibalization is very difficult to detect in offline evaluation, but it frequently shows up in online A/B tests. Consider the following example, an A/B test for a recommendation module. A typical A/B test of a recommendation module commonly involves a change in the underlying machine learning algorithm, its user interface, or both. The recommendation change significantly increased users' clicks on the recommendations while significantly decreasing users' clicks on organic search results.

Table 1: A/B Test Results for Recommendation Module (Simulated Experiment Data to Imitate the Real A/B Test)

% change = effect / mean of control
Recommendation Clicks    +28%***
Search Clicks            -1%***
Conversion               +0.2%
GMS                      -0.3%
Note: '***' p<0.001, '**' p<0.01, '*' p<0.05, '.' p<0.1. The two-tailed p-value is derived from the z-test for H0: the effect is zero, which is based on asymptotic normality.

There is an intuitive explanation for the drop in search clicks: users might not need to search as much as usual because they could find what they were looking for through recommendations. In other words, improved recommendations effectively diverted users' attention away from search and thus cannibalized the user engagement in search. Note that increased recommendation clicks did not translate into observed gains in key performance indicators: conversion and Gross Merchandise Sales (GMS). Conversion and GMS are typically measured at the sitewide level because the ultimate goal of improving any product on our platform is to facilitate a better user experience on etsy.com. The launch decision for a new algorithm is usually based on a significant gain in conversion/GMS from A/B tests. The insignificant conversion/GMS gain combined with the significant lift in recommendation clicks puts product owners in a difficult position when deciding whether to launch or terminate the new algorithm. They wonder whether the cannibalization in search clicks could, in turn, cannibalize the conversion/GMS gain from the recommendations. In other words, it is plausible that the improved recommendations should have brought a more significant increase in conversion/GMS than the A/B test shows, with the positive impact partially offset by the negative impact of the cannibalized user engagement in search. If there is cannibalization in the conversion/GMS gain, then, instead of terminating the new recommendation algorithm, it is advisable to launch it and revise the search algorithm to work better with it; otherwise, the development of recommendation algorithms would be hurt. The challenge calls for separating the revenue loss (through search) from the original revenue gain (from the recommendation module change). Unfortunately, from the A/B tests, we can only observe the cannibalization in user engagement (the induced reduction in search clicks).
FLAWS OF PURCHASE-FUNNEL BASED ATTRIBUTION METRICS

Product-specific revenue is commonly attributed based on a purchase-funnel/user-journey. For example, the purchase-funnel of recommendations could be defined as a sequence of user actions: "click A in recommendations → purchase A". To compute a recommendation-attributed conversion rate, we have to segment all the converted users into two groups: those who follow the pattern of the purchase-funnel and those who do not. Only the first segment is used for counting the number of conversions. However, the validity of the attribution is questionable. In many A/B tests of new recommendation algorithms, it is common for the recommendation-attributed revenue change to be over +200% and the search-attributed revenue change to be around -1%. It is difficult to see how the conversion lift could be cannibalized from +200% down to the observed +0.2%. These peculiar numbers remind us that attribution metrics based on purchase-funnels are unexplainable and unreliable for at least two reasons. First, users usually take more complicated journeys than a heuristically-defined purchase-funnel can capture. Here are two examples:

* If the recommendations make users stay longer on Etsy, and users click listings on other pages and modules to make purchases, then the recommendation-attributed metrics fail to capture the contribution of the recommendations to these conversions. The purchase-funnel is based on "click", and there is no way to incorporate "dwell time" into the purchase-funnel.
* Suppose the true user journey is "click A in recommendations → search A → click A in search results → click A in many other places → purchase A". Should the conversion be attributed to recommendations or search? Should all the visited pages and modules share the credit for this conversion? Any answer would be too heuristic to be convincing.
Second, attribution metrics cannot measure any causal effects. The random assignment of users in an A/B test makes the treatment and control buckets comparable and thus enables us to calculate the average treatment effect (ATE). The segments of users who follow the pattern of the purchase-funnel may not be comparable between the two buckets, because the segmentation criterion (i.e., the user journey) is observed after random assignment, and thus the segments are not randomized between the two buckets. In causal inference terms, factors that cause users to follow the pattern of the purchase-funnel are introduced by the segmentation and thus confound the causality between treatment and outcome. Any post-treatment segmentation can break the ignorability assumption of causal identification and invalidate the causal interpretation of experiment analysis (see, e.g., Montgomery et al., 2018).
CAUSAL MEDIATION ANALYSIS

We treat search clicks as a mediator on the causal path between the recommendation improvement and conversion/GMS, and extend a formal framework of causal inference, causal mediation analysis (CMA), to separate the cannibalized effect from the original effect of the recommendation module change. CMA splits the observed conversion/GMS gains (the average treatment effect, ATE) in A/B tests into the gains from the recommendation improvement (the direct effect) and the losses due to cannibalized search clicks (the indirect effect). In other words, the framework allows us to measure the impact of the recommendation improvement on conversion/GMS directly, as well as indirectly through a mediator such as search (Figure 1). The significant drop in search clicks makes it a good candidate for the mediator. In practice, we can try different candidate mediators and use the analysis to confirm which one is a true mediator.

Figure 1: Directed Acyclic Graph (DAG) to illustrate the causal mediation in a recommendation A/B test.

However, it is challenging to implement CMA as described in the literature directly in practice. An internet platform typically has tons of online products, and all of them could be mediators on the causal path between the tested product and the final business outcomes. Figure 2 shows multiple mediators (M1, M0, and M2) on the causal path between treatment T and the final business outcome Y. In practice, it is very difficult to measure user engagement in all of these mediators. Multiple unmeasured, causally-dependent mediators in A/B tests break the sequential ignorability assumption in CMA and invalidate it (see Imai et al. (2010) for the assumptions in CMA).

Figure 2: DAG of multiple mediators. Note: M0 and M2 are upstream and downstream mediators of the mediator M1, respectively.

We define the generalized average causal mediation effect (GACME) and the generalized average direct effect (GADE) to analyze this second form of cannibalization. GADE captures the average causal effect of the treatment T that goes through all the channels that do not involve M1. GACME captures the average causal effect of the treatment T that goes through all the channels that do involve M1. We proved that, under some assumptions, GADE and GACME are identifiable even when there are numerous unmeasured, causally-dependent mediators. If there is no unmeasured mediator, then GADE and GACME collapse to ADE and ACME. If there is, then ADE and ACME cannot be identified, while GADE and GACME can.
Table 2 shows the sample results. The recommendation improvement led to a 0.5% conversion lift, but the cannibalized search clicks resulted in a 0.3% conversion loss, and the observed gain was not statistically significant. When the outcome is GMS, we can see the loss through cannibalized search clicks as well. The results justify the cannibalization in the conversion lift, and serve as evidence to support the launch of the new recommendation module.

Table 2: Causal Mediation Analysis on Simulated Experiment Data

% change = effect / mean of control            Conversion    GMS
The Original Gain from Recommendation
GADE(0) (Direct Component)                     0.5%*         0.2%
The Loss Through Search
GACME(1) (Indirect Component)                  -0.3%***      -0.4%***
The Observed Gain
ATE (Total Effect)                             0.2%          -0.3%

Note: '***' p<0.001, '**' p<0.01, '*' p<0.05, '.' p<0.1. The two-tailed p-value is derived from the z-test for H0: the effect is zero, which is based on asymptotic normality.

The implementation follows a causal mediation-based methodology we recently developed and published at KDD 2019.
We also made a fun video describing the intuition behind the methodology. It is easy to implement and only requires solving two linear regression equations simultaneously (Section 4.4 of the paper). We simply need the treatment assignment indicator, search clicks, and the observed revenue for each experimental unit. Interested readers can refer to our paper for more details and to our GitHub repo for the analysis code.
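For intuition, the estimation boils down to jointly fitting a pair of linear equations of roughly this shape (the notation here is ours for illustration; see the paper for the exact model and its assumptions):

\begin{aligned}
M_i &= \alpha_0 + \alpha_1 T_i + \varepsilon_i \\
Y_i &= \beta_0 + \beta_1 T_i + \beta_2 M_i + \eta_i
\end{aligned}

where $T_i$ is the treatment assignment indicator, $M_i$ the mediator (search clicks), and $Y_i$ the outcome (revenue) for experimental unit $i$; the direct component is captured by $\beta_1$ and the component mediated through search by $\alpha_1 \beta_2$.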
We have successfully deployed our model to identify products that are prone to cannibalization. In particular, it has helped product and engineering teams understand the tradeoffs between search and recommendations, and focus on the right opportunities. The direct effect on revenue is a more informative key performance indicator than the observed average treatment effect for measuring the true contribution of a product change to the marketplace and for guiding the decision on launching new product features.
THE JOURNEY TO FAST PRODUCTION ASSET BUILDS WITH WEBPACK

Posted by Jonathan Lai on February 3, 2020 / 2 Comments

Etsy has switched from using a RequireJS-based JavaScript build system to using Webpack. This has been a crucial cornerstone in the process of modernizing Etsy's JavaScript ecosystem. We've learned a lot from addressing our multiple use-cases during the migration, and this post is the first of two parts documenting our learnings. Here, we specifically cover the production use-case: how we set up Webpack to build production assets in a way that meets our needs. The second post can be found here.
We're proud to say that our Webpack-powered build system, responsible for over 13,200 assets and their source maps, finishes in FOUR MINUTES on average. This fast build time is the result of countless hours of optimizing. What follows is our journey to achieving such speed, and what we've discovered along the way.

PRODUCTION EXPECTATIONS

One of the biggest challenges of migrating to Webpack was achieving production parity with our pre-existing JavaScript build system, named Builda. It was built on top of the RequireJS optimizer, a build tool predating Webpack, with extensive customization to speed up builds and support then-nascent standards like JSX. Supporting Builda became more and more untenable, though, with each custom patch we added to support JavaScript's evolving standards. By early 2018, we consequently decided to switch to Webpack; its community support and tooling offered a sustainable means to keep up with JavaScript modernization. However, we had spent many years optimizing Builda to accommodate our production needs. Assets built by Webpack would need to have 100% functional parity with assets built by Builda in order for us to have confidence in switching. Our primary expectation for any new build system is that it takes less than five minutes on average to build our production assets. Etsy is one of the earliest and biggest proponents of continuous deployment, where our fast build/deploy pipeline allows us to ship changes up to 30 times a day. Builda was already meeting this expectation, and we would negatively impact our deployment speed and frequency if a new system took longer to build our JavaScript. Build times, however, tend to increase as a codebase grows, productionization becomes more complex, and available computing resources are maxed out. At Etsy, our frontend consists of over 12,000 modules that eventually get bundled into over 1,200 static assets. Each asset needs to be localized and minified, both of which are time-consuming tasks. Furthermore, our production asset builds were limited to using 32 CPU cores and 64GB of RAM. Etsy had not yet moved to the cloud when we started migrating to Webpack, and these specs were among the beefiest of the on-premise hosts available. This meant we couldn't just add more CPU/RAM to achieve faster builds.
So, to recap:

* Our frontend consists of over 1,200 assets made up of over 12,000 modules.
* Each asset needs to be localized and minified as part of productionization.
* We are limited to 32 CPU cores and 64GB of RAM.
* Production asset builds need to finish in less than five minutes on average.

We got this.
LOCALIZATION

From the start, we knew that localization would be a major obstacle to achieving sub-five-minute build times. Localization strings are embedded in our JavaScript assets, and at Etsy we officially support eleven locales. This means we need to produce eleven copies of each asset, where each copy contains the localization strings of a specific locale. Suddenly, building over 1,200 assets balloons into building over 1,200 × 11 = 13,200 assets. General caching solutions help reduce build times, independent of localization's multiplicative factor. After we solved the essential problems of resolving our module dependencies and loading our custom code with Webpack, we incorporated community solutions like cache-loader and babel-loader's caching options. These solutions cache intermediary artifacts of the build process, which can be time-consuming to calculate. As a result, asset builds after the initial one finish much faster. Still, we needed more than caching to build localized assets quickly. One of the first search results for Webpack localization was the now-deprecated i18n-webpack-plugin. It expects a separate Webpack configuration for each locale, leading to a separate production asset build per locale. Even though Webpack supports multiple configurations via its MultiCompiler mode, the documentation crucially points out that "each configuration is only processed after the previous one has finished processing." At this stage in our process, we measured that a single production asset build without minification took ~3.75 minutes with no change to modules and a hot cache (a no-op build). It would take us ~3.75 × 11 = ~41.25 minutes to process all localized configurations for a no-op build. We also ruled out using this plugin with a common solution like parallel-webpack to process configurations in parallel. Each parallel production asset build requires additional CPU and RAM, and the sum far exceeded the 32 CPU cores and 64GB of RAM available. Even when we limited the parallelism to stay under our resource limits, we were met with overall build times of ~15 minutes for a no-op build. It was clear we needed to approach localization differently.

LOCALIZATION INLINING

To localize our assets, we took advantage of two characteristics of our localization. First, we localize our JavaScript code through a module abstraction. An engineer defines a module that contains only key-value pairs, where the value is the US-English version of the text that needs to be localized and the key is a succinct description of the text. To use the localized strings, the engineer imports the module in their source code. They then have access to a function that, when passed a string corresponding to one of the keys, returns the localized value of the text.

[image: example of how we include localizations in our JavaScript]

For a different locale, the message catalog contains the analogous localization strings for that locale. We programmatically handle generating analogous message catalogs with a custom Webpack loader that applies whenever Webpack encounters an import for localizations. If we wanted to build Spanish assets, for example, the loader would look something like this:
[image: example of how we would load Spanish localizations into our assets]

Second, once we build the localized code and output localized assets, the only lines that differ between copies of the same asset for different locales are the lines with localization strings; the rest are identical. When we build the above example with English and Spanish localizations, the diff of the resulting assets confirms this:

[image: diff of the localized copies of an asset]

Even when caching intermediary artifacts, our Webpack configuration would spend over 50% of the overall build time constructing the bundled code of an asset. If we provided separate Webpack configurations for each locale, we would repeat this expensive asset construction process eleven times.

[image: diagram of running Webpack for each locale]

We could never finish this amount of work within our build-time constraints, and, as we saw before, the resulting localized variants of each asset would be identical except for the few lines with localizations. What if, rather than locking ourselves into loading a specific locale's localizations and repeating the asset build for each locale, we returned a placeholder where the localizations should go?

[image: code to load placeholders in place of localizations]

We tried this placeholder loader approach, and as long as it returned syntactically valid JavaScript, Webpack could continue with no issue and generate assets containing these placeholders, which we call "sentinel assets". Later on in the build process, a custom plugin takes each sentinel asset, finds the placeholders, and replaces them with the corresponding message catalogs to generate a localized asset.

[image: diagram of our build process with localization inlining]

We call this approach "localization inlining", and it was actually how Builda localized its assets too. Although our production asset builds write these sentinel assets to disk, we do not serve them to users. They are only used to derive the localized assets. With localization inlining, we were able to generate all of our localized assets from one production asset build. This allowed us to stay within our resource limits; most of Webpack's CPU and RAM usage is tied to calculating and generating assets from the modules it has loaded. Adding additional files to be written to disk does not increase resource utilization as much as running an additional production asset build does. Now that a single production asset build was responsible for over 13,200 assets, though, we noticed that simply writing this many assets to disk substantially increased build times. It turns out Webpack only uses a single thread to write a build's assets to disk. To address this bottleneck, we included logic to write a new localized asset only if the localizations or the sentinel asset have changed; if neither has changed, then the localized asset hasn't changed either. This optimization greatly reduced the amount of disk writing after the initial production asset build, allowing subsequent builds with a hot cache to finish up to 1.35 minutes faster. A no-op build without minification consistently finished in ~2.4 minutes.
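To make the placeholder idea above concrete, a stripped-down loader along these lines would do the trick (an illustrative sketch, not our actual loader):

// Sketch of a placeholder-emitting Webpack loader. Instead of inlining a
// locale's message catalog, it returns syntactically valid JavaScript
// containing a unique sentinel token for this catalog.
const path = require("path");

module.exports = function placeholderLoader() {
    // Identify which message catalog was imported.
    const catalogId = path.relative(this.rootContext, this.resourcePath);
    // A later build step scans the output ("sentinel") asset for these
    // tokens and splices in each locale's real message catalog.
    return `module.exports = "__MESSAGES_PLACEHOLDER__:${catalogId}";`;
};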
With a comprehensive solution for localization in place, we then focused on adding minification.

MINIFICATION

Out of the box, Webpack includes the terser-webpack-plugin for asset minification. Initially, this plugin seemed to perfectly address our needs. It offered the ability to parallelize minification, cache minified results to speed up subsequent builds, and even extract license comments into a separate file. When we added this plugin to our Webpack configuration, though, our initial asset build suddenly took over 40 minutes and used up to 57GB of RAM at its peak. We expected the initial build to take longer than subsequent builds and that minification would be costly, but this was alarming. Enabling any form of production source maps dramatically increased the initial build time as well. Without the terser-webpack-plugin, the initial production asset build with localizations would finish in ~6 minutes. It seemed like the plugin was adding an unknown bottleneck to our builds, and ad hoc monitoring with htop during the initial production asset build seemed to confirm our suspicions:

[image: htop during minification]

At some points during the minification phase, we appeared to be using only a single CPU core. This was surprising to us because we had enabled parallelization in terser-webpack-plugin's options. To get a better understanding of what was happening, we tried running strace on the main thread to profile the minification phase:
[image: strace during minification]

At the start of minification, the main thread spent a lot of time making memory syscalls (mmap and munmap). Upon closer inspection of terser-webpack-plugin's source code, we found that the main thread needed to load the contents of every asset to generate parallelizable minification jobs for its worker threads. If source maps were enabled, the main thread also needed to calculate each asset's corresponding source map. These lines explained the flood of memory syscalls we noticed at the start.
Further into minification, the main thread started making recvmsg and write syscalls to communicate between threads. We corroborated these syscalls when we found that the main thread needed to serialize the contents of each asset (and its source map, if enabled) to send them to a worker thread to be minified. After receiving and deserializing a minification result from a worker thread, the main thread was also solely responsible for caching the result to disk. This explained the stat, open, and other write syscalls we observed, because the Node.js code uses promises to write the contents; the underlying epoll_wait syscalls then poll to check when the writing finishes so that the promises can be resolved. The main thread becomes a bottleneck when it has to perform these tasks for a lot of assets, and considering our production asset build could produce over 13,200 assets, it was no wonder we hit this bottleneck. To minify our assets, we would have to think of a different way.
POST-PROCESSING

We opted to minify our production assets outside of Webpack, in what we call "post-processing". We split our production asset build into two stages: a Webpack stage and a post-processing stage. The former is responsible for generating and writing localized assets to disk, and the latter is responsible for performing additional processing on these assets, like minification.

[image: running Webpack with a post-processing stage]
[image: diagram of our build process with localization inlining and post-processing]

For minification, we use the same terser library the terser-webpack-plugin uses. We also baked parallelization and caching into the post-processing stage, albeit in a different way than the plugin. Where Webpack's plugin reads the file contents on the main thread and sends the whole contents to the worker threads, our parallel-processing jobs send just the file path to the workers. A worker is then responsible for reading the file, minifying it, and writing it to disk. This reduces memory usage and facilitates more efficient parallel processing. To implement caching, the Webpack stage passes along the list of assets written by the current build to tell the post-processing stage which files are new. Sentinel assets are excluded from post-processing because they aren't served to users.
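Here's a minimal sketch of that path-based division of labor, using Node's worker_threads and terser directly (the file handling and pool details are simplified and hypothetical, not our production code):

// minify-pool.js: the main thread hands workers file paths only; each
// worker reads, minifies, and writes its own files.
const { Worker, isMainThread, parentPort } = require("worker_threads");
const { minify } = require("terser");
const fs = require("fs");
const os = require("os");

if (isMainThread) {
    // Paths of freshly written assets, e.g. passed along by the Webpack stage.
    const assetPaths = process.argv.slice(2);
    const poolSize = Math.max(1, Math.min(os.cpus().length, assetPaths.length));
    let next = 0;
    for (let i = 0; i < poolSize; i++) {
        const worker = new Worker(__filename);
        const sendNext = () => {
            if (next < assetPaths.length) {
                worker.postMessage(assetPaths[next++]); // just the path
            } else {
                worker.unref(); // no more work; let the worker wind down
            }
        };
        worker.on("message", sendNext); // worker signals "done", gets more work
        sendNext();
    }
} else {
    parentPort.on("message", async (assetPath) => {
        // The worker owns the I/O, keeping large strings off the main thread.
        const code = fs.readFileSync(assetPath, "utf8");
        const { code: minified } = await minify(code);
        fs.writeFileSync(assetPath.replace(/\.js$/, ".min.js"), minified);
        parentPort.postMessage("done");
    });
}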
Splitting our production asset builds into two stages does have a potential downside: our Webpack configuration is now expected to output un-minified text for assets. Consequently, we need to audit any third-party plugins to ensure they do not transform the outputted assets in a format that breaks post-processing. Nevertheless, post-processing is well worth it because it allows us to achieve the fast build times we expect for production asset builds.

BONUS: SOURCE MAPS
We don't just generate assets in under five minutes on average; we also generate corresponding source maps for all of our assets. Source maps allow engineers to reconstruct the original source code that went into an asset. They do so by maintaining a mapping of the output lines of a transformation, like minification or Webpack bundling, to the input lines. Maintaining this mapping during the transformation process inherently adds time. Coincidentally, the same localization characteristics that enable localization inlining also enable faster source map generation. As we saw earlier, the only differences between localized assets are the lines containing localization strings. Consequently, these lines are also the only differences between the _source maps_ for these localized assets. For the rest of the lines, the source map for one localized asset is equally accurate for another, because each line sits at the same line number across localized assets.
If we were to generate source maps for each localized asset, we would end up repeating resource-intensive work only to produce nearly identical source maps across locales. Instead, we only generate source maps for the sentinel assets the localized assets are derived from. We then use the sentinel asset's source map for each localized asset derived from it, accepting that the mapping for the lines with localization strings will be incorrect. This greatly speeds up source map generation because we are able to reuse a single source map across many assets. For the minification transformation that occurs during post-processing, terser accepts a source map alongside the input to be minified. This input source map allows terser to account for prior transformations when generating source mappings for its minification, so the source map for its minified result still maps back to the original source code before Webpack bundling. In our case, we pass terser the sentinel asset's source map for each localized asset derived from it. This is only possible because we aren't using terser-webpack-plugin, which (understandably) doesn't allow mismatched asset/source map pairings.

[image: diagram of our complete build process with localization inlining, post-processing, and source map optimizations]

Through these source map optimizations, we are able to maintain source maps for all assets while adding only ~1.7 minutes to our average build time. Our approach can result in up to a 70% speedup in source map generation compared to the out-of-the-box options offered by Webpack.
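In terser terms, reusing the sentinel map looks roughly like this (illustrative file names; a sketch rather than our exact code):

const { minify } = require("terser");
const fs = require("fs");

// One localized asset plus the source map of the sentinel asset it was
// derived from (hypothetical paths).
const localizedCode = fs.readFileSync("asset.es-ES.js", "utf8");
const sentinelMap = JSON.parse(fs.readFileSync("asset.sentinel.js.map", "utf8"));

minify(localizedCode, {
    sourceMap: {
        // Hand terser the sentinel's map so the output map still points at
        // the original pre-bundling sources. Lines holding localized
        // strings will map slightly wrong, which we accept.
        content: sentinelMap,
        url: "asset.es-ES.js.map",
    },
}).then(({ code, map }) => {
    fs.writeFileSync("asset.es-ES.min.js", code);
    fs.writeFileSync("asset.es-ES.js.map", map);
});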
CONCLUSION

Our journey to achieving fast production builds can be summed up in three principles: reduce, reuse, recycle.

* REDUCE: Reduce the workload on Webpack's single thread. This goes beyond applying parallelization plugins and implementing better caching. Investigating our builds led us to discover single-threaded bottlenecks like minification, and after implementing our own parallelized post-processing we observed significantly faster build times.
* REUSE: The more existing work our production build can reuse, the less it has to do. Thanks to the convenient circumstances of our production setup, we are able to reuse source maps and apply each of them to more than one asset. This avoids a significant amount of unnecessary work when generating source maps, a time-intensive process.
* RECYCLE: When we can't reuse existing work, figuring out how to recycle it is equally valuable. Deriving localized assets from sentinel assets allows us to recycle the expensive work of producing an asset from an entrypoint, further speeding up builds.

While some implementation details may become obsolete as Webpack and the frontend evolve, these principles will continue to guide us towards faster production builds.

_This post is the first in a two-part series on our migration to a modern JavaScript build system. The second part can be found here._
The engineers who make Etsy make our living with a craft we love: software. This is where we'll write about our craft and our collective experience building and running the world's most vibrant handmade marketplace.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.