Could Google's passage indexing be leveraging BERT?


It has been 12 months since the announcement of the new update known as BERT in production search, so it is no surprise that the recent Search On event, falling almost on the eve of BERT's first production birthday, included many big advances and improvements built on AI and BERT over the past year.

A recap of what BERT is

In short, the October 2019 Google BERT update was a machine learning update said to help Google better understand queries and content by gaining more of a word's "meaning" (context) from the differences between polysemous words. The initial update affected only 10% of English queries, as well as featured snippets in the regions where they appeared.

Importantly, that initial BERT search update was mostly about disambiguation, plus text extraction and summarization for featured snippets. The disambiguation aspect applied mainly to sentences and phrases.

Within a month or so of the production search announcement, BERT rolled out in many more countries, although still affecting only 10% of queries in all regions.

Initially the October 2019 announcement caused quite a stir in the SEO world, because according to Google, when announcing BERT, the update represented "the biggest leap forward in the past five years, and one of the biggest leaps forward in the history of Search."

It was quite rightly the most significant announcement since RankBrain, and not just for the world of web search. The developments around BERT over the past 12 months in the field of natural language understanding (an area of half a century of study) have arguably pushed learning further forward in one year than in the previous fifty.

The reason for this was another BERT: the 2018 academic paper by Google researchers Devlin et al, entitled "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." Note that I will refer to several academic papers here. You will find a list of sources and references at the end of this article.

BERT (the paper) was subsequently made available to the rest of the machine learning community and undoubtedly made a significant contribution to the world's impressive progress in computational linguistics.

The core idea of BERT is that it pre-trains bidirectionally on a context window of words from a large text corpus (English Wikipedia and BookCorpus), using a transformer "attention" mechanism so it can see all the words to the left and to the right of a target word in a sliding window at the same time, for fuller context.

Once trained, BERT can be used as a foundation and then fine-tuned on other, more granular tasks, with much of the research focus on downstream natural language understanding and question answering.
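
As a rough illustration of that pre-train-then-fine-tune pattern, here is a minimal sketch using the Hugging Face transformers library. The checkpoint name, the toy (query, passage) dataset and the hyper-parameters are illustrative assumptions of my own, not anything Google has described.

```python
# Minimal sketch: fine-tuning a pre-trained BERT checkpoint on a toy
# relevance-classification task (assumed setup, not Google's pipeline).
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # 2 classes: relevant / not relevant

# Tiny illustrative dataset: (query, passage, label) triples.
pairs = [("river bank erosion", "The river bank was eroded by the flood.", 1),
         ("river bank erosion", "Open a savings account at the bank.", 0)]

class PairDataset(torch.utils.data.Dataset):
    def __init__(self, rows):
        self.rows = rows
    def __len__(self):
        return len(self.rows)
    def __getitem__(self, i):
        query, passage, label = self.rows[i]
        enc = tokenizer(query, passage, truncation=True, padding="max_length",
                        max_length=128, return_tensors="pt")
        item = {k: v.squeeze(0) for k, v in enc.items()}
        item["labels"] = torch.tensor(label)
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-finetune-demo",
                           num_train_epochs=1, per_device_train_batch_size=2),
    train_dataset=PairDataset(pairs),
)
trainer.train()  # fine-tunes all BERT weights plus the new classifier head
```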

An example to clarify the "context window" for "a word's meaning"

Since the scope of a context window is an important concept, I have provided an example for illustration:

If a context window is 10 words long and the target word is at position 6 in a sliding 10-word "context window," BERT can see not only words 1-5 to the left but also words 7-10 to the right, at the same time, using quadratic "word pair" attention.

This is a big improvement. Earlier models were unidirectional, meaning they only saw words 1-5 on the left, but not words 7-10 until the sliding window reached those words. Using this bidirectional nature and simultaneous attention provides the full context for a given word (within the limits of the window length, of course).

For example, the word "bank" is understood differently if other words in the context window include "river" or "money." The co-occurring words in the context window add meaning, and suddenly "bank" is recognised as a "financial bank" or a "river bank."
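
A minimal sketch of that effect with the Hugging Face transformers library: the two example sentences, the bert-base-uncased checkpoint and the cosine-similarity comparison are my own illustrative assumptions, not part of Google's system.

```python
# Minimal sketch: the contextual embedding of "bank" shifts with its context window.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    """Return the last-hidden-state vector for the token 'bank'."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]          # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return hidden[tokens.index("bank")]

river = bank_vector("we sat on the bank of the river")
money = bank_vector("she deposited money at the bank")
money2 = bank_vector("the bank approved the loan for the money")

cos = torch.nn.functional.cosine_similarity
print(cos(river, money, dim=0))    # typically lower: different senses of "bank"
print(cos(money, money2, dim=0))   # typically higher: same financial sense
```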

Back to the October 2019 Google BERT update announcement

The production search announcement on October 25, 2019 followed a busy year of focus on BERT in the language research community.

In the 2018-2019 timeframe, all manner of BERT-type models named after Sesame Street characters appeared, including ERNIE from Baidu. Facebook and Microsoft were also busy building BERT-like models and improving on BERT wherever they could. Facebook claimed its RoBERTa model was simply a more robustly trained version of BERT. (Microsoft says it has been using BERT in Bing since April 2019.)

Big AI teams leapfrogged one another on the various machine learning language task leaderboards, the most popular among them being SQuAD (Stanford Question Answering Dataset), GLUE (General Language Understanding Evaluation) and RACE (ReAding Comprehension from Examinations), beating human language understanding benchmarks as they went.

But what about 2020?

While the SEO world had been quieter on the subject of BERT of late (until this month), enthusiasm around BERT in the deep learning and natural language processing world has accelerated into 2020 rather than waned.

The 2019/2020 developments in AI and natural language understanding definitely warrant SEOs brushing up on their BERT-stalking game once more, particularly given this week's developments, and especially following the announcements at Google's Search On online event.

BERT doesn't always mean BERT

An important note before continuing:

"BERT-like" is a descriptive term for pre-training a model on a large unlabelled text corpus to learn "language," and then using transfer learning via transformer technologies to fine-tune the model on a range of more granular tasks.

While Google's 2019 update was called BERT, it is more likely a reference to the method now employed in parts of search, and in the machine learning language field generally, than to any single algorithm, since BERT, and "BERT-like," had become almost an adjective in the machine learning language world even back in 2019.

Back to Google's AI in search announcements

"With recent advancements in AI, we're making bigger leaps forward in improvements to Google than we've seen over the last decade, so it's even easier for you to find just what you're looking for," said Prabhakar Raghavan during his talk at the recent Search On event.

And he was not exaggerating, as Google showed off some exciting new features coming soon to search, including improvements to spelling-correction algorithms, conversational agents, image technology and hum-to-search via Google Assistant.

The big news was in the area of BERT usage: a huge increase from use on just 10% of queries to almost every query in English.

"Today we're excited to share that BERT is now used in almost every query in English, helping you get higher quality results for your questions."

(Prabhakar Raghavan, 2020)

Passage indexing

Aside from the news of BERT's expanded use, one other announcement in particular got the SEO world excited.

That was "passage indexing," whereby Google will rank and surface specific passages of pages and documents in response to certain queries.

Google's Raghavan explains:

"Very specific searches can be the hardest to get right, since sometimes the single sentence that answers your question might be buried deep in a web page. We've recently made a breakthrough in ranking and are now able to not just index web pages, but individual passages from the pages. By better understanding the relevancy of specific passages, not just the overall page, we can find that needle-in-a-haystack information you're looking for. This technology will improve 7 percent of search queries across all languages as we roll it out globally."

(Prabhakar Raghavan, 2020)

An example was provided to illustrate the impact of the change ahead.

"With this new technology, we'll be able to better identify and understand key passages on a web page. This will help us surface content that might otherwise not be seen as relevant when considering a page only as a whole…," Google explained last week.

In other words, a good answer might be found in a single passage or paragraph within an otherwise broad-topic document, or on a random, obscure page without much focus. Think, for example, of the many blog posts and opinion pieces out there, many with largely unrelated content or mixed topics, on a fairly unstructured website, quite different from well-organised, ever-growing content.

It's called passage indexing, but not as we know it

The "passage indexing" announcement caused some confusion in the SEO community, since the changes were initially interpreted as being about "indexing."

A natural assumption to make, since the name "passage indexing" implies both "passages" and "indexing."

Naturally, some in the SEO community asked whether individual passages would be added to the index instead of individual pages, but that is not the case, it seems, since Google has clarified that the forthcoming update actually relates to a passage ranking issue, not an indexing one.

"We've recently made a breakthrough in ranking and are now able to not just index web pages, but individual passages from the pages," Raghavan explained. "By better understanding the relevancy of specific passages, not just the overall page, we can find that needle-in-a-haystack information you're looking for."

The change relates to ranking rather than to indexing.

What might this breakthrough be, and where does it lead?

While only 7% of queries are affected in the initial release, a wider rollout of this new passage indexing system could have far greater implications than might at first be suspected.

Without exaggeration, once you begin to dig through the past year's literature in natural language research, you become aware that this change, while relatively minor at first (it only affects 7% of queries, after all), could genuinely change how search ranking works overall going forward.

We will explore what these breakthroughs might be and what the consequences could be going forward.

Passage indexing is probably about BERT plus several friends… plus newer breakthroughs

Hopefully more will become clear as we survey the landscape below, because we need to go deeper into BERT, into the progress in NLP AI around BERT's close relations, and into the world of ranking research over the past year.

Much of what follows is drawn from recent research papers and conference proceedings (including research by Google search engineers, either before or while working at Google) from across the world of information retrieval (the foundational field of which web search is a part).

Where a paper is referenced, I have added the author and year, and linked to the paper where it is available online, to avoid any accusation of mere rhetoric. This also clearly shows that some big changes have taken place, by providing a kind of timeline of the progress leading up to, and through, 2019 and 2020.

Big BERT is everywhere

Since the October 2019 announcement, BERT has been everywhere among the various industry research leaders in deep learning. And not just BERT, but the many BERT-like models that build on, or use, a BERT-like transformer architecture.

There is, however, a problem.

BERT and BERT-like models, while hugely impressive, are usually computationally, and therefore financially, expensive to train and to include in production environments with full-scale ranking, making the 2018 version of BERT an unrealistic option for large-scale commercial search engines.

The main reason is that BERT uses transformer technology, which relies on a self-attention mechanism so that each word can gain context from seeing the words around it simultaneously.

"In the case of a text of 100K words, this would require assessing 100K x 100K word pairs, or 10 billion pairs for each step," per Google this year. Transformer systems are becoming pervasive in the BERT world, but this quadratic dependency issue with BERT's attention mechanism is well known.

Put more simply: the more words added to a sequence, the more word combinations must be attended to at once during training in order to gain a word's full context.
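
In other words, the cost of full self-attention grows with the square of the sequence length; Google's 10-billion-pairs figure is just n² evaluated at n = 100,000:

```latex
\text{pairs}(n) = n^2, \qquad \text{pairs}(10^{5}) = (10^{5})^2 = 10^{10}\ \text{word pairs per step}
```

so doubling the sequence length roughly quadruples the attention work.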

But the thing is, "bigger is better" when it comes to training these models.

Indeed, even Jacob Devlin, one of the original authors of BERT, acknowledged the impact of model size with a slide in this Google presentation on BERT: "Big models help a lot."

It seems that big BERT-type models improve on SOTA (state of the art) benchmarks simply by being bigger than the previous contenders. Rather like the "Skyscraper SEO" technique we know, which is about identifying what a competitor already has and "throwing another floor on top of it (another dimension or feature)," beating it simply by doing something bigger or better. In the same way, bigger and bigger BERT-like models have been built simply by adding more parameters and training on more data in order to beat the previous models.

Huge models from huge companies

The most impressive of these huge models (i.e. the models beating SOTA (state of the art) on the various machine learning leaderboards) are the work of research teams at the big tech companies, primarily the likes of Microsoft (MT-DNN, Turing-NLG), Google (BERT, T5, XLNet), Facebook (RoBERTa), Baidu (ERNIE) and OpenAI (GPT, GPT-2, GPT-3).

Microsoft's Turing-NLG recently dwarfed all previous models as a 17-billion-parameter language model. It is used in Bing autosuggest and other search features. Parameter counts are illustrated in the image below, showing Turing-NLG compared with some of the industry's other models.

(Image credit: Hugging Face)

GPT-3

Even 17 billion parameters is nothing, though, compared with OpenAI's 175-billion-parameter GPT-3 model.

Who could forget The Guardian's attention-grabbing September 2020 article about GPT-3, designed to shock, entitled "A robot wrote this entire article. Are you scared yet, human?"

In reality it was simply next-sentence prediction at enormous scale, but for lay readers unaware of the developments taking place in the natural language space, it is no surprise the article got the reaction it did.

Google T5

Google's T5 (Text-To-Text Transfer Transformer), a more recent transformer-based model than BERT, released in February 2020, has a mere 11 billion parameters.

That is despite a Google research team pre-training it on a text corpus built from a huge, petabyte-scale web crawl of billions of web pages going back to 2011 from The Common Crawl, aptly named C4 because of the four Cs in its name, "Colossal Clean Crawled Corpus," on account of its sheer size.

But big, impressive models come at a cost.

BERT is expensive (financially and computationally)

The staggering cost of training SOTA AI models

In an article entitled "The Staggering Cost of Training SOTA (State of the Art) AI Models," Synced Review explored the likely costs involved in training some of the latest SOTA NLP AI models, with figures ranging from hundreds of dollars per hour (and training can take many hours) up to hundreds of thousands of dollars in total to train a single model.

These costs have been much debated, but it is widely accepted that, regardless of the accuracy of third-party estimates, the costs involved are extortionate.

Elliot Turner, founder of AlchemyAPI (acquired by IBM Watson), estimated the cost of training XLNet (Yang et al, 2019), a collaboration between the Google Brain team and Carnegie Mellon published in January 2020, at in the region of $245,000.

This sparked quite a discussion on Twitter, to the point that even Google AI's Jeff Dean weighed in, pointing to a tweet to show that Google offsets the cost in the form of renewable energy:

And herein lay the issue, and probably why, despite the international expansion, BERT was used by Google on only 10% of queries in the 2019 production rollout.

At production level, BERT-like models were simply too expensive, computationally and financially.

Challenges with long-form content and BERT-like models

Transformer limitations

There is another challenge with scaling BERT-like models practically, and it relates to the length of the sequences available for maintaining a word's context. Much of this comes down to how big the context window can be in a transformer architecture.

The size of a transformer's context window for a word matters because "context" can only take into account the words that fall within the boundaries of that window.

Enter the "Reformer"

To help improve the sizes available in transformer context windows, in January 2020 Google launched "Reformer: The Efficient Transformer."

From a VentureBeat article in early 2020 entitled "Google's AI model Reformer can process entire novels": "…the Transformer isn't perfect by any stretch — extending it to larger contexts makes apparent its limitations. Applications that use large windows have memory requirements ranging from gigabytes to terabytes in size, meaning models can only ingest a few paragraphs of text or generate short pieces of music. That's why Google today introduced Reformer, an evolution of the Transformer that's designed to handle context windows of up to 1 million words."

Google explained the fundamental shortcoming of transformers in relation to the context window in a blog post this year: "The power of Transformer comes from attention, the process by which it considers all possible pairs of words within the context window to understand the connections between them. So, in the case of a text of 100K words, this would require assessment of 100K x 100K word pairs, or 10 billion pairs for each step, which is impractical."

Google's AI chief Jeff Dean has said that larger context is a principal focus of Google's work going forward. "We'd still like to be able to do much more contextual kinds of models," he said. "Like right now BERT and other models work well on hundreds of words, but not 10,000 words as context. So that's kind of [an] interesting direction," Dean told VentureBeat in December.

Google also acknowledged the weakness in current ranking systems (even aside from transformer- or Reformer-based models) in its follow-up clarification tweets about the new passage indexing rollout last week:

"Typically, we evaluate all the content on a web page to determine if it is relevant to a query. But sometimes web pages can be very long, or on multiple topics, which may dilute how relevant parts of a page are for particular queries…," the company said.

BERT's computational limit is currently 512 tokens, making BERT-like models unworkable for anything much longer than passages.

BERT wasn't feasible for large-scale production in 2018/2019

So, while BERT may have been a "nice to have," in its 2018/2019 format it was unrealistic as a solution for large-scale natural language understanding and full ranking in web search, and was mostly only used on queries with multiple possible meanings in sentences and phrases, and certainly not at any real scale.

But it's not all bad news for BERT

During 2019 and 2020, big leaps forward have been made aimed at making BERT technologies much more useful than an impressive "nice to have."

The long-form document content issue is already being addressed

Big Bird, Longformer and Cluster-Former

Since most of the performance issues seem to be around this quadratic dependency in transformers and its impact on performance and cost, more recent work has tried to turn the quadratic dependency into a linear one, most notably Longformer (Beltagy et al, 2020) and Google's Big Bird (Zaheer et al, 2020).

The Big Bird paper abstract states: "The proposed sparse attention can handle sequences of length up to 8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context, BigBird drastically improves performance on various NLP tasks such as question answering and summarization."

Not to be outdone, in mid-October Microsoft researchers (Wang et al, 2020) presented their paper on Cluster-Former. Cluster-Former is the SOTA model on the "long answer" Google Natural Questions leaderboard. Both of these models also look to overcome the limitations around long-form content.

And now "Performers" rethink transformers

Also just recently (October 2020), a collaboration between Google, Cambridge, DeepMind and The Alan Turing Institute, addressing the efficiency and scale issues with transformer architectures in general, was published in a paper entitled "Rethinking Attention with Performers" (Choromanski et al, 2020), proposing a complete overhaul of the fundamental way the attention mechanism works, designed to reduce the costs of transformer-type models.

Synced Review reported on this on October 2, 2020.

But all of this is very, very new work, and probably too new to have any imminent impact on the passage indexing situation (for now), so these are "probably" not the breakthroughs Google referred to when announcing passage indexing.

There will certainly be a lag between long-form content models such as Big Bird and Cluster-Former, and significant long-document improvements from BERT and friends, arriving in production search.

So, for now, it appears natural language researchers and search engines have had to work with sequences shorter than long-form content (i.e. passages).

So, back to the current situation.

Addressing the unsolved areas of NLP models

Much of the focus in 2019 and 2020 appears to have turned to addressing the unsolved areas of NLP models that Jacob Devlin referred to in the presentation I mentioned earlier. These are:

  • Models that minimize total training cost vs. accuracy on modern hardware.
  • Models that are very parameter efficient (e.g. for mobile deployment).
  • Models that represent knowledge/context in latent space.
  • Models that represent structured data (e.g. knowledge graphs).
  • Models that jointly represent vision and language.

While work has been done around BERT in several of the areas on this list, particularly knowledge graphs, for the focus of this article we should look at the training cost and parameter efficiency points.

Making BERT more efficient and more useful

The first item on Devlin's list has seen good progress, with plenty of research going into building models that could be used economically, and potentially in production environments.

More efficient models

While 2020 has seen a wave of huge models, almost simultaneously a wave of more efficient, distilled BERT-like models has appeared across the research community, aiming to retain as much effectiveness as possible while cutting the associated costs.

DistilBERT, ALBERT, TinyBERT and ELECTRA: minimal loss for maximum gain

Notable examples of efficiency improvements include Hugging Face's DistilBERT, Google's ALBERT (a lite BERT) and TinyBERT (a teacher/student type of BERT model in which knowledge is transferred from a big teacher BERT to a small student BERT, TinyBERT). Google also introduced ELECTRA, which uses a different type of masking technique to considerably improve performance while retaining most of the efficiency gains.

According to Google AI, "ELECTRA matches the performance of RoBERTa and XLNet on the GLUE natural language understanding benchmark when using less than ¼ of their compute and achieves state-of-the-art results on the SQuAD question answering benchmark. These gains come from using methods more efficient than masking 15% of the words when training a BERT model, which is computationally very expensive."

Each of the aforementioned adaptations is far more efficient than the original BERT model while sacrificing minimal effectiveness.
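
As a rough illustration of the teacher/student distillation idea behind DistilBERT and TinyBERT, here is a minimal PyTorch sketch of a distillation loss; the temperature, weighting and toy logits are my own assumptions, not the published training recipes.

```python
# Minimal sketch of knowledge distillation: a small "student" is trained to
# match the softened output distribution of a large "teacher".
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target KL loss (teacher) with hard-label cross entropy."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between softened distributions, scaled by T^2 as is standard.
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy example: batch of 2 examples, 3 classes.
teacher = torch.tensor([[4.0, 1.0, 0.5], [0.2, 3.5, 0.1]])
student = torch.tensor([[2.0, 1.5, 0.5], [0.5, 2.0, 0.3]], requires_grad=True)
labels = torch.tensor([0, 1])
loss = distillation_loss(student, teacher, labels)
loss.backward()  # gradients flow only into the student
print(loss.item())
```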

An army of research engineers and free data

Another boost to progress comes in the form of a research community that has once again risen to the challenge (quite literally) of improving machine language understanding.

But would-be participants need data to train better models.

As Devlin stated in his talk, he believes "near-term improvements in NLP will be mostly about making clever use of 'free' data."

While there are growing resources offering a large number of free datasets for data scientists to use (think Kaggle, acquired by Google in 2017 and arguably the largest community of data scientists, with millions of users registered for its machine learning competitions), "real world" data for "real" natural language research, based on genuine everyday web use and queries in particular, is scarcer.

Nevertheless, the sources of "free" natural language data are growing, and while there are now several, much of the data donated to the natural language research community comes from the search engines themselves, to stimulate research.

MS MARCO (Microsoft)

Since 2016, the MS MARCO dataset has been one of the dominant training grounds for fine-tuning models.

Microsoft's MS MARCO was originally a collection of 100,000 questions and answers from real Bing search engine and Cortana assistant submissions, but it has since grown tenfold to over 1,000,000 questions and answers. In addition, MS MARCO's features have expanded to include additional training tasks beyond general natural language understanding and question answering.

Google Natural Questions (Google)

Like MS MARCO, Google has its own natural language question-and-answer dataset, consisting of real user queries to the Google search engine, along with a leaderboard and tasks to complete, called "Google Natural Questions."

"The questions consist of real anonymized, aggregated queries issued to the Google search engine. Simple heuristics are used to filter questions from the query stream. Thus the questions are 'natural' in that they represent real queries from people seeking information."

(Kwiatkowski et al, 2019)

In Google Natural Questions, researchers must train their models to read an entire page before finding both a long answer and a short answer within a Wikipedia paragraph. (Visualization below.)

The TensorFlow C4 dataset: a colossal clean crawl

A newer dataset is C4 (the "Colossal Clean Crawled Corpus," built from Common Crawl), mentioned earlier when introducing T5. While BERT's original language pre-training was on the 2.5 billion words of English Wikipedia plus BookCorpus (800 million words), Wikipedia language is not representative of everyday natural language, since little of the web consists of the same semi-structured, well-linked format. C4 takes pre-training via real-world natural language to something closer to reality, and was used to pre-train Google's T5 model.

The C4 Colossal Clean Crawled Corpus dataset was built from a "colossal" petabyte-sized crawl of billions of pages from The Common Crawl (huge samples of the "real web" from 2011 onward), cleaned of boilerplate (swear words, JavaScript notices, code and other distractions were removed to reduce the "noise"). Again, this dataset was made available after cleaning so that others could learn from it.

Much NLP research has shifted to passages and ranking


Passage retrieval and ranking has become one of the favourite areas of research over the past couple of years.

Retrieving parts of documents, aka passage retrieval or sub-document retrieval, is not new in information retrieval. See, for example, the image below from a 1999 patent for an information retrieval sub-document retrieval system. (Evans, 1999)

Evans, D.A., Claritech Corp, 1999. Information retrieval based on use of sub-documents. U.S. Patent 5,999,925.

We can also find IR research papers on passage ranking from 2008 and earlier, for example "Re-ranking Search Results Using Document-Passage Graphs" (Bendersky et al, 2008), and there will certainly be others.

We can also see that passage retrieval was an active research area as early as 2018, with videos on YouTube:

You will see all the "passage ranking features" in the image above, but they are based entirely on "counts" of entities, n-grams, query terms (keywords) and words, words, words. Keywords everywhere.

But that was June 2018, so there could be a big difference between the feature weights that mattered in June 2018 and those that matter now.

… and that was before BERT.

BERT has contributed hugely to the enthusiasm for research into passage ranking, probably because of the aforementioned inefficiencies and length limitations of BERT's transformer architecture.

"As we've already discussed extensively, BERT has trouble with input sequences longer than 512 tokens for a number of reasons. The obvious solution, of course, is to split texts into passages." (Lin et al, 2020)

But there is another reason why passage ranking has become a popular machine learning activity for BERT-obsessed researchers.

MS MARCO's Passage Ranking Task and leaderboard

Since October 2018, there has been a Passage Ranking task on MS MARCO with an accompanying leaderboard, attracting a large number of entries from language researchers, including those from big tech companies such as Facebook, Google, Baidu and Microsoft.

Indeed, just this past week, MS MARCO announced on Twitter that they will soon retire their Question and Answering leaderboards, since limited progress is now being made in that area, and emphasised where the focus now lies.

The MS MARCO passage ranking task provides a dataset of 8.8 million passages.

According to the MS MARCO website:

"The context passages, from which the answers in the dataset are derived, are extracted from real web documents using the most advanced version of the Bing search engine. The answers to the queries are human generated if they could summarize the answer."

The Passage Ranking Task is split into two parts:

  1. Passage Re-Ranking: given the top 1000 passages retrieved via BM25, re-rank the passages by relevance.
  2. Passage Full Ranking: given a corpus of 8.8 million passages, generate a candidate ranking of the top 1000 passages ordered by relevance.

Some breakthroughs

And this now brings us nicely to what the ranking breakthroughs mentioned by Google in search last week might be.

It’s probably not just the passage ranking itself which is the breakthrough Google refers to, but rather breakthroughs in passage ranking and other “novel” findings discovered as a by-product of much activity in the passage retrieval research space, as well as new innovations from this research combined with current Google approaches to ranking (e.g. Learning to Rank (LeToR) with TensorFlow for example), plus plenty of developments within their own research teams separate to passage ranking specifically, and the industry improvements in AI overall.

For example, ROBERTA (more robustly trained BERT), and ELECTRA (Google, 2020) with its more efficient masking technique. There are other big breakthroughs too, which we will come to shortly.

In the same way the research community jumped on board with question answering and natural language understanding overall, with iterative improvements resulting in BERT and friends, so too is there now a big focus on improving efficiency and effectiveness in ranking, with a particular emphasis on passages.

Passages are smaller after all and within BERT’s constraints since it’s easy to chop a longer document up into several pieces.
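
A minimal sketch of that chopping-up step, assuming a Hugging Face BERT tokenizer and a simple fixed-stride window; the stride and passage length below are arbitrary illustrative choices, not anything Google has published.

```python
# Minimal sketch: split a long document into overlapping passages that each
# fit inside BERT's 512-token input limit.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def split_into_passages(text: str, max_tokens: int = 450, stride: int = 200):
    """Yield passage strings whose token count stays under max_tokens.

    max_tokens is kept below 512 to leave room for [CLS]/[SEP] and a query
    when the passage is later paired with one for re-ranking.
    """
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    for start in range(0, len(token_ids), stride):
        window = token_ids[start:start + max_tokens]
        if not window:
            break
        yield tokenizer.decode(window)
        if start + max_tokens >= len(token_ids):
            break

long_document = "Passage ranking research " * 500  # stand-in for a long page
passages = list(split_into_passages(long_document))
print(len(passages), "passages, each within BERT's sequence limit")
```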

And it does look like there are very significant developments.

In order to understand progress more fully we need to look at how ranking systems work as an industry standard overall, because it’s not quite as simple as a single fetch from the index it seems.

Two-stage ranking system

In two stage ranking there is first full ranking (the initial ranking of all the documents), and then re-ranking (the second stage of just a selection of top results from the first stage).

In information retrieval (and web search), two stage ranking is about firstly retrieving a large collection of documents using either a simple, classical retrieval algorithm such as BM25, or a query-expansion algorithm, a learning to rank algorithm, or a simple classifier approach.

A second stage is then carried out with greater precision and more resources over a list of top retrieved results from the first stage, likely using a neural re-ranker.
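
To make the two stages concrete, here is a minimal sketch using the rank_bm25 package for the first-stage fetch and a sentence-transformers cross-encoder as the neural re-ranker. The tiny corpus, the cut-off sizes and the public MS MARCO cross-encoder checkpoint are illustrative assumptions, not a description of any production system.

```python
# Minimal two-stage ranking sketch: cheap BM25 recall over the whole collection,
# then a neural re-ranker applied only to the top-k candidates.
from rank_bm25 import BM25Okapi                   # pip install rank-bm25
from sentence_transformers import CrossEncoder   # pip install sentence-transformers

corpus = [
    "The river bank was eroded by heavy flooding last spring.",
    "Open a savings account at the bank to earn interest on deposits.",
    "Passage ranking splits long documents into smaller retrievable units.",
]
query = "how do banks pay interest on savings"

# Stage 1: full ranking of the whole collection with a bag-of-words model.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
scores = bm25.get_scores(query.lower().split())
top_k = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:2]

# Stage 2: re-rank only the top-k candidates with a (much more expensive)
# cross-encoder; the checkpoint below is an assumed public MS MARCO model.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
rerank_scores = reranker.predict([(query, corpus[i]) for i in top_k])
reranked = [corpus[i] for _, i in sorted(zip(rerank_scores, top_k), reverse=True)]
print(reranked[0])  # best passage after the precision-focused second stage
```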

We do not have to go far through the research literature to find many confirmations of two (or multi stage) stage ranking systems as an industry standard.

“State-of-the-art search engines use ranking pipelines in which an efficient first-stage uses a query to fetch an initial set of documents from the document collection, and one or more re-ranking algorithms improve and prune the ranking.”

(Dai, 2019)

“Two step document ranking, where the initial retrieval is done by a classical information retrieval method, followed by a neural re-ranking model, is the new standard. The best performance is achieved by using transformer-based models as re-rankers, e.g., BERT.”

(Sekulic et al, 2020)

“Prior to two stage learning to rank a document set was often retrieved from the collection using a classical and simple unsupervised bag-of-words method, such as BM25.”

(Dang, Bendersky & Croft, 2013)

Note that BM25 stands for "Best Match 25" and is often favoured over the much talked-about TF-IDF; it is so named because it was the 25th attempt at a particular type of ranking function and the best match for the task at the time (trivia).
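
For reference, the standard BM25 scoring function (in its common textbook form; the usual parameter conventions are roughly k1 ≈ 1.2–2.0 and b ≈ 0.75, and nothing here is specific to any one search engine) is:

```latex
\mathrm{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i)\cdot
\frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
```

where f(q_i, D) is the frequency of query term q_i in document D, |D| is the document length in words and avgdl is the average document length in the collection.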

Whilst we can't be sure Google and other search engines use BM25, of course, in any capacity, for those interested to learn more, ElasticSearch provides a good overview of the BM25 algorithm. It is still taught in many information retrieval lectures, so it remains relevant to some extent.

In the case of production search, it is likely something much more advanced than simply BM25 overall, but likely the more advanced and expensive resources are used in the second stage, rather than the initial fetch. Frederic Dubut from Bing confirmed Bing uses LambdaMART which is a Learning To Rank algorithm in much of its search engine (although he did not comment on whether this was in the first stage or second stage of ranking, or all ranking stages). Papers authored by researchers from Google state: “LambdaRank or its tree-based variant LambdaMART has been one of the most effective algorithms to incorporate ranking metrics in the learning procedure.” (Wang et al, 2018)

The main point is that it’s likely more powerful than systems used in research due to more resources (capacity / financial), however, the principles (and foundational algorithms) remain the same.

One caveat is that some commercial search engines may also be using “multi-stage” ranking neural models.

Referring to multi-stage ranking pipelines, Nogueira et al wrote in 2019: "Known production deployments include the Bing web search engine (Pedersen, 2010) as well as Alibaba's e-commerce search engine."

They further explained: "Although often glossed over, most neural ranking models today . . . are actually re-ranking models, in the sense that they operate over the output of a list of candidate documents, typically produced by a "bag of words" query. Thus, document retrieval with neural models today already uses multi-stage ranking, albeit an impoverished form with only a single re-ranking stage."

Two stage indexing is not two stage ranking

A further clarification.  We know of two stage indexing / rendering and Google has provided plenty of information on the two stage indexing situation, but that is not two stage ranking nor is it really two stages of indexing.

Two stage ranking is entirely different.

First stage of two stage ranking: full ranking

In Two Stage Learning to Rank (Dang et al, 2013), a list of documents is first ranked based on a learned "model of relevance" containing a number of features and query expansions; the model is then trained to recall documents based on this "model of relevance" in the first recall phase.

The first stage of two stage ranking is really about retrieving as many potentially relevant pages as possible. This first stage likely expands something like BM25, a tf (term frequency) based approach, with various query expansion terms and perhaps classification features since, according to Dang et al, 2013, "it is better to fetch more documents in the initial retrieval so as to avoid missing important and highly relevant documents in the second stage." (Dang et al, 2013).

On the topic of "Learning to Rank" and expanding the query set to include query expansion, Dang et al write: "This query expanded model is thought to outperform simple bag-of-words algorithms such as BM25 significantly due to including more documents in the initial first stage recall." (Dang et al, 2013).

Two stage learning to rank for information retrieval

On “Learning to Rank”:

“We first learn a ranking function over the entire retrieval collection using a limited set of textual features including weighted phrases, proximities and expansion terms. This function is then used to retrieve the best possible subset of documents over which the final model is trained using a larger set of query- and document-dependent features.”

(Dang et al, 2013)

Whilst the 2013 paper is older, that is all the more reason to expect progress has improved upon it, since the two stage system is still "the industry standard."

Second stage of two stage ranking: Reranking

From this list of retrieved documents a second pass is performed on a specified top-X number of documents, known as the top-K of the retrieved document list, and fine-tuned for precision using machine learning techniques.  You'll often see in information retrieval papers the term P@K (Precision at K), which refers to the level of precision in the top K against a "gold standard" or "ground truth" of relevance (K being a number, e.g. P@10 would mean the number of accurate results judged to meet the user's information needs in relation to a query in the top 10 results retrieved).
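
A minimal sketch of the metric, assuming binary relevance judgments and illustrative document IDs:

```python
# Minimal sketch: Precision at K over binary relevance judgments.
def precision_at_k(ranked_doc_ids, relevant_doc_ids, k=10):
    """Fraction of the top-k retrieved documents judged relevant."""
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_doc_ids)
    return hits / k

ranked = ["d3", "d7", "d1", "d9", "d4", "d8", "d2", "d5", "d6", "d0"]
relevant = {"d3", "d1", "d4", "d5"}
print(precision_at_k(ranked, relevant, k=10))  # 0.4
```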

A good explanation of evaluation metrics such as P@K (and there are a number of others) is provided in this information retrieval lecture slide.

The second stage of two stage ranking is where precision is much more important, and much more resource is expended, whilst also possibly adding further measures of relevance to really separate the gold in top ranks.

The importance of ranking more precisely those documents selected for inclusion in stage 2 is key, and precision in the highly ranked results, even more so, since the probability of these results being seen by search engine users is high.

As the adage goes, “only SEOs look beyond page two of search results”.

In “Two Stage Learning to Rank for Information Retrieval” Dang et al say:

“At run-time, in response to user queries, the Stage A model is used again to retrieve a small set of highly ranked documents, which are then re-ranked by the Stage B model. Finally, the re-ranked results are presented to the user”

(Dang et al, 2013)

To summarize, efficiency and effectiveness combined are the main driver for two stage ranking processes. Use the most computationally expensive resources on the most important documents to get the greater precision because that’s where it matters most. Full ranking is stage one with reranking as stage two for improvements on the top-K retrieved from the full collection.

As an aside, it is also probably why Google’s Danny Sullivan said in a May tweet, “If you are in the top 10 you are doing things right.”

The top 10 is likely the most important part of the top-K in the re-ranked "precision" stage, and maximum features and precision "learning" will have been undertaken for those results.

Improving the second stage of ranking (precision) has been the focus

Given the importance of the second stage of ranking for precision the majority of research into ranking improvements focuses on this stage – the reranking stage.

Making the BEST use of BERT, for now

We know BERT in its 2018 / 2019 format was limited. Not least by sequence length / context window limitations, as well as expense, despite smaller models appearing.

How could BERT be made into something better than a "nice to have" dealing only with the most nuanced disambiguation needs in web search at sentence level, and into something usable in a meaningful capacity?  Something which many researchers could jump on board with too?

BERT repurposed as a passage ranker and re-ranker

Aha… BERT as a passage ranker.

Once more to reinforce BERT’s limitations and ideal current use: “BERT has trouble with input sequences longer than 512 tokens for a number of reasons. The obvious solution, of course, is to split texts into passages,” per Lin et al this year.

One of the biggest breakthrough areas of research and development has been in the repurposing of BERT as a re-ranker, initially by Nogueira and Cho in 2019, in their paper "Passage Re-ranking with BERT," and then by others.

As Dai points out in a 2019 paper: "BERT has received a lot of attention for IR, mainly focused on using it as a black-box re-ranking model to predict query-document relevance scores."

In their 2019 paper "Passage Re-ranking with BERT," Nogueira & Cho said they "describe a simple re-implementation of BERT for query-based passage re-ranking. Our system is the state of the art on the TREC-CAR dataset and the top entry in the leaderboard of the MS MARCO passage retrieval task, outperforming the previous state of the art by 27% (relative) in MRR@10."

“We have described a simple adaptation of BERT as a passage re-ranker that has become the state of the art on two different tasks, which are TREC-CAR and MS MARCO.”
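
The core of that adaptation is simple: feed the query and a candidate passage into BERT as a single sequence and read a relevance score off the classification head. A minimal sketch, assuming a publicly fine-tuned MS MARCO cross-encoder checkpoint (my own assumption, not the authors' exact model):

```python
# Minimal sketch of BERT-as-re-ranker: score "query [SEP] passage" pairs and
# sort candidate passages by the predicted relevance score.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # assumed public checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

query = "what causes river bank erosion"
candidates = [
    "Bank erosion is the wearing away of the banks of a river by water flow.",
    "Most banks charge an overdraft fee when an account goes negative.",
]

enc = tokenizer([query] * len(candidates), candidates,
                padding=True, truncation=True, max_length=512,
                return_tensors="pt")
with torch.no_grad():
    scores = model(**enc).logits.squeeze(-1)   # one relevance logit per pair

for score, passage in sorted(zip(scores.tolist(), candidates), reverse=True):
    print(f"{score:.2f}  {passage}")
```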

I spoke to Dr Mohammad Aliannejadi, author of several papers in the field of information retrieval and a post-doctoral researcher in Information Retrieval at The University of Amsterdam, exploring natural language, mobile search and conversational search.

“At the moment, BERT as a reranker is more practical, because full ranking is very hard and expensive,” Dr Aliannejadi said. “And, the improvements in effectiveness does not justify the loss of efficiency.”

He continued, “One would need a lot of computational resources to run full-ranking using BERT.”

BERT and passages

Subsequently, passage re-ranking (and increasingly passage re-ranking with BERT), is now amongst the favourite 2020 topics of the information retrieval and machine learning language research world, and an area where significant progress is being made, particularly when combined with other AI research improvements around efficiency, scale and two stage ranking improvements.

Passages and BERT (for the moment) go hand in hand

One only has to look at the table of contents in Lin et al’s recently published book “Pretrained Transformers for Text Ranking: BERT and Beyond” (Lin et al, 2020) to see the impact passage ranking is having on the recent “world of BERT,” with 291 mentions of passages, as Juan Gonzalez Villa pointed out:

Google research and passage ranking / reranking

Naturally, Google Research have a team which has joined the challenge to improve ranking and reranking with passages (Google TF-Ranking Team), competing on MSMARCO’s leaderboard, with an iteratively improving model (TFR-BERT), revised a number of times.

TFR-BERT is based around a paper entitled “Learning-to-Rank with BERT in TF-Ranking” (Han et al, 2020), published in April and with its latest revision in June 2020. “In this paper, we are focusing on passage ranking, and particularly the MS MARCO passage full ranking and re-ranking tasks,” the authors wrote.

“…we propose the TFR-BERT framework for document and passage ranking. It combines state-of-the-art developments from both pretrained language models, such as BERT, and learning-to-rank approaches. Our experiments on the MS MARCO passage ranking task demonstrate its effectiveness,” they explained.

TFR-BERT – BERT-ensemble model — Google’s ensemble of BERTs

Google Research’s latest BERT’ish model has evolved into an ensemble of BERTs and other blended approaches – a combination of parts of other models or even different full models, methods and enhancements grouped.

Many BERTs as passage rankers and rerankers are actually ‘SuperBERT’s

Since much of the code in the BERT research space is open source, including plenty from major tech companies such as Google, Microsoft and Facebook, those seeking to improve can build ensemble models to make “SuperBERT.”

2020 has seen a wave of such “SuperBERT” models emerge in the language model space, and across the leaderboards.

The use of BERT in this way is probably not like the BERT that was used in just 10% of queries. That was probably for simple tasks such as disambiguation and named entity determination on very short pieces of text and sentences to understand the difference between two possible meanings in the words in queries. There is actually a BERT called SentenceBERT from a paper entitled “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks” (Reimers & Gurevych, 2019), but this does not mean that SentenceBERT was used in the 10% of queries mentioned in 2019 of course.

The main point is, passage ranking with BERT is BERT completely repurposed to add contextual meanings to a training set of passages in two stages. Full ranking and then re-ranking, and probably quite different in use to the 10% implementation in production search from 2019.

On the subject of “SuperBERTs” an SEO community friend (Arnout Hellemans) refers to my dog BERT as “SuperBERT,” so it seemed an appropriate excuse to add a picture of her.

Learning-to-rank with BERT in TF-Ranking (Han et al, 2020)

Han et al, 2020, explain the additional integrations which take the original BERT and TF-Ranking model to an ensemble which combines ELECTRA and RoBERTa with BERT and TF-Ranking approaches through five different submissions to the MS MARCO passage ranking leaderboard.

TF-Ranking is described in "TF-Ranking: A Scalable TensorFlow Library for Learning-to-Rank" (Pasumarthi et al, 2019).

“This paper describes a machine learning algorithm for document (re)ranking, in which queries and documents are firstly encoded using BERT, and on top of that a learning-to-rank (LTR) model constructed with TF-Ranking (TFR) is applied to further optimize the ranking performance. This approach is proved to be effective in a public MS MARCO benchmark.”

(Han et al, 2020)

“To leverage the lately development of pre-trained language models, we recently integrated RoBERTa and ELECTRA.”

(Han, Wang, Bendersky, Najork, 2020)

Whilst ELECTRA was published by Google, as you may recall, RoBERTa was published by Facebook.

But we can also see an additional element mentioned in the paper extract alongside RoBERTa, BERT, TF-Ranking and ELECTRA: DeepCT.

According to the "Learning-to-rank with BERT in TF-Ranking" paper:

“The 5 Submissions of Google’s TFR-BERT to the MS MARCO Passage Ranking Leaderboard were as follows:

  • Submission #1 (re-ranking): TF-Ranking + BERT (Softmax Loss, List size 6, 200k steps)
  • Submission #2 (re-ranking): TF-Ranking + BERT (Ensemble of pointwise, pairwise and listwise losses)
  • Submission #3 (full ranking): DeepCT Retrieval + TF-Ranking BERT Ensemble
  • Submission #4 (re-ranking): TF-Ranking Ensemble of BERT, RoBERTa and ELECTRA
  • Submission #5 (full ranking): DeepCT + TF-Ranking Ensemble of BERT, RoBERTa and ELECTRA

Whilst the early submissions were simply BERT and TF-Ranking (TensorFlow Ranking), with RoBERTa and ELECTRA added to a later leaderboard submission, the biggest gains appear to come from the addition of DeepCT, with sharp improvements between submissions 3 and 5 on the Full Ranking passage ranking task, although DeepCT is not mentioned in the paper abstract.

Google's SuperBERT ensemble model (evolved from TFR-BERT) is performing well on both the MS MARCO full ranking and re-ranking passage ranking leaderboards.

You can see it here currently (October 2020) at position 5 in the image below entitled:

DeepCT + TF-Ranking Ensemble of BERT, ROBERTA and ELECTRA (1) Shuguang Han, (2) Zhuyun Dai, (1) Xuanhui Wang, (1) Michael Bendersky and (1) Marc Najork – 1) Google Research, (2) Carnegie Mellon – Paper and Code

Also note Dai has now been added to the Google TF-Ranking team members on the leaderboard submission from April onwards, although not listed on the original paper.

Digging in to the body of the “Learning-to-rank with BERT in TF-Ranking” paper we see the following: “We discovered that DeepCT helps boost the re-ranking of BM25 results by a large margin, and a further combination of both BM25 and DeepCT re-ranked lists brings additional gains.”

Looking at the model revisions which include DeepCT specifically, Han et al continue: “With Submission #3, we achieved the second best overall performance on the leaderboard as of April 10, 2020. With the recent Submission #5, we further improved our previous performance, and obtained the third best performance on the leaderboard as of June 8, 2020 (with tens of new leaderboard submissions in between)”

Also, it’s important to remember the sharp improvements are on the Full Ranking task, rather than the ReRanking task.  Note both of the Full Ranking tasks include DeepCT, but the ReRanking tasks do not.

  • 5 – DeepCT + TF-Ranking Ensemble of BERT, ROBERTA and ELECTRA (1) Shuguang Han, (2) Zhuyun Dai, (1) Xuanhui Wang, (1) Michael Bendersky and (1) Marc Najork – 1) Google Research, (2) Carnegie Mellon – Paper and Code. Full Ranking June 2, 2020
  • 11 – DeepCT Retrieval + TF-Ranking BERT Ensemble 1) Shuguang Han, (2) Zhuyun Dai, (1) Xuanhui Wang, (1) Michael Bendersky and (1) Marc Najork – (1) Google Research, (2) Carnegie Mellon University – Paper [Han, et al. ’20] Code. Full Ranking April 10, 2020
  • 14 – TF-Ranking Ensemble of BERT, ROBERTA and ELECTRA (1) Shuguang Han, (2) Zhuyun Dai, (1) Xuanhui Wang, (1) Michael Bendersky and (1) Marc Najork – 1) Google Research, (2) Carnegie Mellon – Paper and Code. ReRanking June 2, 2020
  • 25 – TF-Ranking + BERT (Ensemble of pointwise, pairwise and listwise losses) TF-Ranking team (Shuguang Han, Xuanhui Wang, Michael Bendersky and Marc Najork) of Google Research – Paper [Han, et al. ’20] and [Code]. ReRanking March 30, 2020

DeepCT

DeepCT appears to be a secret sauce ingredient responsible for some significant gains in quick succession in the MS MARCO full ranking task leaderboard for the Google TF-Ranking Research team. Recall the full ranking stage relates to the first stage of the two stage task.

In the case of MS MARCO it’s the ranking of the 8.8 million passages provided, with re-ranking relating to fine-tuning the top 1000 results retrieved from that initial first ranking stage.

So DeepCT is what makes the difference in the first stage full ranking here.

So just what is DeepCT and could it be significant to more than just passage ranking leaderboards?

DeepCT stands for “Deep Contextualized Term Weighting Framework” and was proposed in a paper entitled “Context Aware Term Weighting For First Stage Passage Retrieval.” (Dai, 2020)

The Inventor of DeepCT, Dai, describes the framework as: “DeepCT, a novel context-aware term weighting approach that better estimates term importance for first-stage bag-of-words retrieval systems.”

But that doesn’t really do it justice since there is plenty more to DeepCT than one first suspects.

Greater context in passages, an alternative to tf (term frequency) and improved first stage ranking with DeepCT

Dai, DeepCT's inventor, shows that DeepCT not only improves first stage ranking results and adds context-awareness to terms in passages, but, when combined with a BERT re-ranker in the second stage (BERT repurposed as a re-ranker by Nogueira and Cho, 2019), is also very effective at improving precision in "intent-aligned" ranking results for passages, coupled with efficiency, and shows potential to scale to production environments without much modification to existing architectures.

Indeed, DeepCT seems very effective in passage indexing, which is a ranking process, although in DeepCT's case there is an "index" element involved, just not as we know it in the SEO space (and papers on the topic of DeepCT do reference passage indexing).

At the moment DeepCT's use is limited to the default BERT 512 tokens, but that is ideal for passages, and passages are parts of documents anyway, since they really are just chopped-up documents. Therefore, normal documents become a group of passages, with sequences usually well within the 512-token scope limitations of BERT.

To reiterate Lin’s quote from earlier: “As we’ve already discussed extensively, BERT has trouble with input sequences longer than 512 tokens for a number of reasons. The obvious solution, of course, is to split texts into passages.”

Why is DeepCT so significant?

Whilst DeepCT is limited currently within the constraints of the 512 token limitations of BERT, and therefore passages, DeepCT could constitute a ranking “breakthrough.”

Importantly, DeepCT not only seeks to provide a context-aware passage ranking solution but also begins to address some long standing information retrieval industry-wide issues around long established ranking and retrieval models, and systems. These developments could extend far beyond the limited focus of DeepCT and the passage indexing update we are concerned with today, particularly as other improvements around efficiency and context windows begin to be addressed in BERT-like systems and transformers.

The problem with term frequency (tf) in passages

The first issue DeepCT seeks to address relates to the use of tf (term frequency) in first stage ranking systems.

As Dai points out: “State-of-the-art search engines use ranking pipelines in which an efficient first stage uses a query to fetch an initial set of documents, and one or more re-ranking algorithms to improve and prune the ranking. Typically the first stage ranker is a bag-of-words retrieval model that uses term frequency (tf ) to determine the document specific importance of terms. However, tf does not necessarily indicate whether a term is essential to the meaning of the document, especially when the frequency distribution is flat, e.g., passages. In essence, tf ignores the interactions between a term and its text context, which is key to estimating document-specific term weights.”

Dai suggests a word “being frequent” does not mean “being relevant” in a given passage content, whilst also confirming the fundamental role bag-of-words approaches has had in legacy and at the same time highlighting the shortcomings of current systems.

“The bag-of-words plays a fundamental role in modern search engines due to its efficiency and ability to produce detailed term matching signals,” says Dai. “Most bag-of-words representations and retrieval models use term weights based on term frequency (tf ), for example tf.idf and BM25. However, being frequent does not necessarily lead to being semantically important. Identifying central words in a text also requires considering the meaning of each word and the role it plays in a specific context.”

Dai describes frequency-based term weights as a "crude tool" (albeit they have been a huge success), since tf does not differentiate between words which are central to the overall text meaning and words which are not, particularly so in passages and sentences, and proposes the need to understand a word's meaning within the context of text content as a "critical problem."

“Frequency-based term weights have been a huge success, but they are a crude tool,” Dai and Callan wrote in 2019. “Term frequency does not necessarily indicate whether a term is important or central to the meaning of the text, especially when the frequency distribution is flat, such as in sentences and short passages”

Dai further noted, “To estimate the importance of a word in a specific text, the most critical problem is to generate features that characterize a word’s relationships to the text context.”

The problem with multi stage ranking systems

The second problem relates to efficiency and computational cost in first stage ranking systems. Because of the computational expense of deep learning, research focus has in recent times been concentrated on re-ranking (the fine-tuning, second, or later stages of ranking in the case of multi-stage ranking systems), rather than on full ranking (the initial first stage).

“Most first-stage rankers are older-but-efficient bag-of-words retrieval models that use term frequency signals, and much of the research work on ranking has been focused on the later stages of ranking – the fine-tuning stages,” said Dai in 2019.

Dai suggests the computational (and subsequently financial) costs associated with first stage ranking limits the use of complex deep learning which might otherwise overcome the “lack of central” focus on terms in relation to other surrounding text in passages (word’s context).

“Classic term frequency signals cannot tell whether the text is centered around a term or just mentions that term when discussing some topic. This issue is especially difficult in first-stage full-collection ranking, where complex features and models are too expensive to apply,” Dai wrote.

We know improvements to the first stage of ranking were a primary rationale for the research undertaken in "Two Stage Learning to Rank for Information Retrieval." Even then, the authors acknowledged that the vast majority of ranking research was on the second stage (re-ranking); hence their work was motivated by improving the first stage with a better initial yield, using for example query expansion techniques, for better fine-tuning later (Dang et al, 2013).

There are likely many others who have sought to address first stage ranking improvements further as well, but the primary focus has certainly been on stage two, for the aforementioned reasons: the high probability of the highly ranked top-K results actually being seen, combined with computational and financial expense.

This focus on second stage results has also continued even as BERT was repurposed as a passage re-ranker and researchers were enthused to follow the BERT re-ranking path for passages.

Improving the first stage of ranking AND gaining word’s context in passages too

DeepCT seeks to make inroads to solve both of these issues simultaneously.

First stage ranking improvements with DeepCT

Dai’s work with DeepCT focuses on the first stage of retrieval, whilst also aiding downstream re-ranking stages significantly.

“Most of the prior neural-IR research, including recent research on leveraging BERT for IR, focused on re-ranking stages due to the complexity of neural models. Our work adds the ability to improve existing first-stage rankers. More accurate first stage document rankings provide better candidates for downstream re-ranking, which improves end-to-end accuracy and/or efficiency.”

“Although much progress has been made toward developing better neural ranking models for IR, computational complexity often limits these models to the re-ranking stage. DeepCT successfully transfers the text understanding ability from a deep neural network into simple signals that can be efficiently consumed by early-stage ranking systems and boost their performance.”

(Dai, 2020)

A new alternative to term frequency using BERT – tfDeepCT


In this first stage of ranking, Dai also focuses on moving toward a more contextual understanding of words in passages rather than merely their counts (tf).

Dai proposes an alternative to term frequency (tf) with a part of the Deep Contextualized Term Weighting framework called “tfDeepCT.”

Instead of merely counting term frequency, tfDeepCT identifies a deep contextual meaning and context for the words in a passage.

Using BERT representations, DeepCT assigns an importance score to words based on their centrality and importance to the topic given their context in a passage.  DeepCT assigns a higher weight to important terms and suppresses low importance or off-topic terms in the passage.

These weights are then stored in an ordinary inverted index with no new postings added, but with tf replaced by tfDeepCT (term weights based on their contextual importance in a passage, as determined by BERT’s transformer attention architecture).
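
To make the idea more tangible, here is a minimal Python sketch of what a tfDeepCT-style weighting step could look like. This is my own illustration, not Dai's code: it assumes the Hugging Face transformers library, the regression head is untrained here (DeepCT trains it to predict how likely each term is to appear in relevant queries), subword merging is omitted, and the function name tf_deepct is mine.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
# In DeepCT this head is trained on query-term relevance labels; untrained here.
importance_head = torch.nn.Linear(bert.config.hidden_size, 1)

def tf_deepct(passage, scale=100):
    inputs = tokenizer(passage, return_tensors="pt", truncation=True)
    with torch.no_grad():
        token_vectors = bert(**inputs).last_hidden_state[0]          # [seq_len, 768]
        scores = torch.sigmoid(importance_head(token_vectors)).squeeze(-1)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    weights = {}
    for token, score in zip(tokens, scores.tolist()):
        if token in ("[CLS]", "[SEP]"):
            continue
        # Keep the highest score seen for a term; scale to a tf-like integer
        # that an ordinary inverted index can store.
        weights[token] = max(weights.get(token, 0), round(score * scale))
    return weights

print(tf_deepct("The stomach flu is usually caused by norovirus."))
```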

DeepCT-Index

This is called DeepCT-Index.

“tfDeepCT is used to replace the original tf in the inverted index. The new index, DeepCT-Index, can be searched by mainstream bag-of-words retrieval models like BM25 or query likelihood models. The context-aware term weight tfDeepCT is expected to bias the retrieval models to central terms in the passage, preventing off-topic passages being retrieved. The main difference between DeepCT-Index and a typical inverted index is that the term weight is based on tfDeepCT instead of tf. This calculation is done offline.”

(Dai, 2020)
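
As a purely illustrative sketch (toy passages, made-up weights), this is roughly what it means for a standard bag-of-words scorer like BM25 to consume tfDeepCT from an ordinary inverted index in place of tf: the scoring code is unchanged, only the stored weight differs, which is why nothing is added at query time.

```python
import math

# Toy DeepCT-style index: postings store tfDeepCT (context-aware integer
# weights) in place of raw term counts.
index = {                      # term -> {passage_id: tfDeepCT}
    "norovirus": {"p1": 93, "p2": 4},
    "stomach":   {"p1": 61},
    "flu":       {"p1": 70, "p2": 12},
}
passage_len = {"p1": 42, "p2": 55}
avg_len = sum(passage_len.values()) / len(passage_len)
N = len(passage_len)

def bm25(query_terms, k1=1.2, b=0.75):
    scores = {}
    for term in query_terms:
        postings = index.get(term, {})
        idf = math.log(1 + (N - len(postings) + 0.5) / (len(postings) + 0.5))
        for pid, w in postings.items():            # w = tfDeepCT, not a raw count
            norm = w + k1 * (1 - b + b * passage_len[pid] / avg_len)
            scores[pid] = scores.get(pid, 0.0) + idf * w * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(bm25(["stomach", "flu", "norovirus"]))   # passages biased toward central terms
```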

IMPORTANT – This does not mean a new document indexing situation. Passage indexing is about passage ranking. Google has made it clear the forthcoming passage indexing changes relate to a ranking change, not an indexing change to documents. Passages are not going to be indexed separately, whether in addition to or instead of documents, according to Google’s recent clarifications.

DeepCT-Index (if used) appears simply to add alternative ranking weights to the existing index, replacing tf with tfDeepCT for passages.

Dai also makes it clear in the literature around DeepCT that “No new posting lists are created.”

But she also refers to the use of DeepCT for passage indexing: “Section 3 describes the Deep Contextualized Term Weighting framework (DeepCT), its use for passage indexing (DeepCT-Index).”

IMPORTANT — I’d like to caveat this by saying DeepCT-Index is a central piece of the DeepCT framework in the literature.  Google Research has acknowledged the use of DeepCT in their research paper “Learning to Rank with BERT in TF-Ranking,” in both the acknowledgements section and throughout the paper.

“We would like to thank Zhuyun Dai from Carnegie Mellon University for kindly sharing her DeepCT retrieval results.”

(Han et al, 2020)

DeepCT is also part of the research models for full ranking currently submitted to the MS MARCO passage ranking leaderboard.

However, that does not mean it is in production, nor that it will be. But it does show promise and a new and interesting direction, not only for the use of BERT in passage ranking for greater contextual search, but for more efficient and effective “context-aware” search overall, since, if implemented, it would likely free up far greater resources at scale across the whole end-to-end ranking system.

Even more so given the significant results achieved lately on the passage ranking leaderboards and those reported in Dai’s papers on the DeepCT framework. The inventor of DeepCT has also now joined the Google TF-Ranking team and is listed on the latest model submissions on the MS MARCO passage ranking leaderboard.

Overcoming some of these legacy challenges, as DeepCT appears to do in the results of both the current TFR-BERT research model and Dai’s papers, could be seen as “a breakthrough in ranking.”

Recall Google’s Prabhakar Raghavan at last week’s Search On event, announcing “passage indexing” and saying, “We’ve recently made a breakthrough in ranking.”

DeepCT kind of sounds like it could perhaps be quite a significant breakthrough in ranking.

So how does DeepCT work?

In the first stage of information retrieval / ranking, DeepCT appears to propose replacing term frequency (tf) with tfDeepCT. With DeepCT, a word’s contextual meaning is identified, using deep contextualized representations from BERT transformers, as an alternative to simply counting the number of times a keyword is mentioned in a passage.

Important words are weighted more heavily even if they are mentioned less often; an importance score is assigned based on the word’s context within the passage, since words have different meanings in different scenarios.  Words that are more important to the passage and its topic (central terms) receive a higher importance score, whereas less important words are given a lower score, or suppressed entirely if they are off-topic or contribute nothing to the meaning of the passage.

A strong bias is generated towards words which are “on-topic” with a suppression of “off-topic” words.

To quote Ludwig Wittgenstein in 1953, “The meaning of a word is its use in the language.”

Whilst I have added some commentary to the content that follows, I did not want to distort the technical explanations of DeepCT given my limited understanding of this new and complex topic, so the explanations of DeepCT are primarily quotes from Dai’s papers.

DeepCT, tfDeepCT and DeepCT-Index

The fundamental parts of DeepCT seem to be:

  • tfDeepCT – An alternative to term frequency which replaces tf with tfDeepCT
  • DeepCT-Index – Alternative weights added to an original index, with no additional postings. Weighting is carried out offline, and therefore does not add any latency to search engine online usage
  • DeepCT-Query – An updated bag-of-words query, adapted using the deep contextual features from BERT to identify important terms in a given query context (a hypothetical sketch of its output follows this list).
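
A hypothetical sketch of what a DeepCT-Query-style weighted bag-of-words query might look like; the weights below are hard-coded purely for illustration (in the framework they would come from the contextual importance model), and repeating terms in proportion to weight is just one simple way a bag-of-words engine could consume them.

```python
# Weighted bag-of-words query (weights invented for illustration only).
weighted_query = {"treatment": 0.9, "norovirus": 0.8, "the": 0.05}

# Repeat each term in proportion to its importance so a plain
# bag-of-words retrieval engine implicitly honours the weights.
expanded = []
for term, weight in weighted_query.items():
    expanded.extend([term] * round(weight * 10))
print(expanded)   # heavily weighted terms dominate the query representation
```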

According to Dai:

“We develop a novel DeepCT-Index that offline weights and indexes terms in passage-long documents.  It trains a DeepCT model to predict whether a passage term is likely to appear in relevant queries. The trained model is applied to every passage in the collection. This inference step is query-independent, allowing it to be done offline during indexing. The context-based passage term weights are scaled to tf -like integers that are stored in an ordinary inverted index that can be searched efficiently by common first-stage retrieval models”

“Analysis shows the main advantage of DeepCT over classic term weighting approaches: DeepCT finds the most central words in a text even if they are mentioned only once. Non-central words, even if mentioned frequently in the text, are suppressed. Such behavior is uncommon in previous term weighting approaches. We view DeepCT as an encouraging step from “frequencies” to “meanings.”

(Dai, 2020)

Dai highlights the novel nature and effectiveness of DeepCT:

“Analysis shows that DeepCT’s main advantage is the ability to differentiate between key terms and other frequent but non-central terms.”… “DeepCT-Index aggressively emphasizes a few central terms and suppresses the others.”

“When applied to passages, DeepCT-Index produces term weights that can be stored in an ordinary inverted index for passage retrieval. When applied to query text, DeepCT-Query generates a weighted bag-of-words query. Both types of term weight can be used directly by typical first-stage retrieval algorithms. This is novel because most deep neural network based ranking models have higher computational costs, and thus are restricted to later-stage rankers.

“This paper presents a novel approach that runs DeepCT at offline index time, making it possible to use it in first-stage retrieval where efficiency is crucial. Our approach applies DeepCT over each passage in the corpus, and stores the context-aware term weights in an ordinary inverted index to replace tf. The index can be searched efficiently using common bag-of-words retrieval models such as BM25 or statistical query likelihood models.”

(Dai, 2020)

To emphasise the efficiency of DeepCT, tfDeepCT and DeepCT-Index:

“No new posting lists are created, thus the query latency does not become longer. To the contrary, a side-effect …is that tfDeepCT of some terms becomes negative, which may be viewed as a form of index pruning.”

(Dai, 2020)
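
A trivial illustration of that pruning side-effect (my own example, not from the paper): terms whose predicted weight comes out non-positive simply never make it into the postings.

```python
# Terms with non-positive tfDeepCT are dropped, a natural form of index pruning.
raw_weights = {"norovirus": 93, "usually": 0, "the": -3}
postings = {term: w for term, w in raw_weights.items() if w > 0}
print(postings)   # {'norovirus': 93} - off-topic / filler terms are pruned away
```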

DeepCT-Index could make neural ranking practical “end-to-end”?

It seems computational expense, even when using BERT only at the re-ranking stage, and the latency issues it causes, are a significant bottleneck to using such models at scale in production environments.  Dai stresses the huge benefit of improving the first stage with DeepCT, thereby reducing the burden at the re-ranking stage.

The main point is that improving the first stage has the potential to dramatically improve both the first and second stages. Indeed, Dai claims a greatly improved first stage might well reduce the need for second and later stages dramatically, and compares DeepCT’s performance to a standard BM25 first-stage ranking system.

“The high computational cost of deep neural-based re-rankers is one of the biggest concerns about adopting them in online services. Nogueira et al. reported that adding a BERT Re-Ranker, with a re-ranking depth of 1000, introduces 10× more latency to a BM25 first-stage ranking even using GPUs or TPUs. DeepCT-Index reduces the re-ranking depth by 5× to 10×, making deep neural based re-rankers practical in latency-/resource-sensitive systems”

(Dai, 2019)
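
As rough, back-of-envelope arithmetic using only the figures quoted above (the 50 ms first-stage latency is an assumed number for illustration, not a measurement):

```python
# If re-ranking 1,000 candidates with BERT adds ~10x the latency of the BM25
# first stage, cutting the re-ranking depth 5-10x cuts that added cost roughly
# in proportion.
bm25_latency_ms = 50                      # assumed first-stage latency
rerank_overhead = 10 * bm25_latency_ms    # BERT re-ranker at depth 1000 (per the quote)
for reduction in (5, 10):
    print(f"depth reduced {reduction}x: re-ranking overhead ~"
          f"{rerank_overhead / reduction:.0f} ms vs {rerank_overhead} ms at full depth")
```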

This is a result of DeepCT not adding any latency to the search system, since nothing further is added to the index per se.

“DeepCT-Index does not add latency to the search system. The main difference between DeepCT-Index and a typical inverted index is that the term importance weight is based on TFDeepCT instead of TF.”

(Dai, 2020)

DeepCT results

Dai emphasises the uncommon results achieved using DeepCT, particularly as an alternative to the term frequency measures that have been in use for many years, and makes the case that the results clearly show term importance signals can be generated beyond term frequency.

“It is uncommon in prior research for a non-tf term weighting method to generate such substantially better rankings. These results show that tf is no longer sufficient, and that better term importance signals can be generated with deep document understanding.”

(Dai, 2020)

Not only is DeepCT impressive for first-stage ranking, but the improved first-stage results naturally feed forward to better second-stage rankings, whilst also finding the central meanings in passages using tfDeepCT and DeepCT-Index.

“Experimental results show that DeepCT improves the accuracy of popular first-stage retrieval algorithms by up to 40%. Running BM25 on DeepCT-Index can be as effective as several previous state-of-the-art rankers that need to run slow deep learning models at the query time. The higher-quality ranking enabled by DeepCT-Index improves the accuracy/efficiency tradeoff for later-stage re-rankers. Analysis shows that DeepCT is capable of finding the central words in a text even if they are mentioned only once. We view DeepCT as an encouraging step from “frequencies” to “meanings.”

(Dai, 2020)

Here are some of the results and capabilities from the DeepCT experiments as a context-aware first-stage ranker, curated from various parts of the DeepCT papers (Dai, 2019; 2020):

  • A BM25 retrieval on DeepCT-Index can be 25% more accurate than classic tf-based indexes, and more accurate than some widely-used multi-stage retrieval systems.
  • These results indicate that it is possible to replace some pipelined ranking systems with a single-stage retrieval using DeepCT-Index.
  • A single-stage BM25 retrieval from DeepCT-Index was better than several reranking pipelines
  • It is more accurate than feature based LeToR (Learning to Rank), a widely used reranking approach in modern search engines
  • The improved first stage ranking further benefits the effectiveness and efficiency of downstream re-rankers.
  • DeepCT-Index reduces the re-ranking depth by 5× to 10×, making deep neural based re-rankers practical in latency-/resource-sensitive systems
  • Ranking with DeepCT in the first stage provided more relevant passages to a reranker, for better end-to-end ranking.
  • DeepCT had higher recall at all depths, meaning a ranking from DeepCT provided more relevant passages to a reranker.
  • For the BERT re-ranker, DeepCT enabled similar accuracy using far fewer passages, meaning the reranker can be 5-10× more efficient.
  • DeepCT puts relevant passages at the top, so that downstream rerankers can achieve similar or higher accuracy with much smaller candidate sets, leading to lower computational cost in the retrieval pipeline

A breakthrough in first-stage ranking using a word’s context rather than just keyword frequencies or similar?

The results achieved with DeepCT could be seen as “a breakthrough in ranking.”  Certainly DeepCT represents a step toward improved “end-to-end ranking” (albeit for passages at the moment), particularly when coupled with the ability to identify contextual meaning using deep learning representations and simply replace the current tf weights with tfDeepCT.

And Dai does seem to shake things up, effectively making the case that tf is no longer sufficient and that it is time to revisit the systems of old:

“Results from this paper indicate that tf is no longer sufficient. With recent advances in deep learning and NLP, it is time to revisit the indexers and retrieval models, towards building new deep and efficient first stage rankers.”

(Dai, 2020)

She summarises her case as follows:

“The higher-quality ranking enabled by DeepCT-Index improves the accuracy/efficiency tradeoff for later-stage re-rankers. A state-of-the-art BERT-based re-ranker achieved similar accuracy with 5× fewer candidate documents, making such computation-intensive re-rankers more practical in latency-/resource-sensitive systems. Although much progress has been made toward developing better neural ranking models for IR, computational complexity often limits these models to the re-ranking stage. DeepCT successfully transfers the text understanding ability from a deep neural network into simple signals that can be efficiently consumed by early-stage ranking systems and boost their performance. Analysis shows the main advantage of DeepCT over classic term weighting approaches: DeepCT finds the most central words in a text even if they are mentioned only once. Non-central words, even if mentioned frequently in the text, are suppressed. Such behavior is uncommon in previous term weighting approaches. We view DeepCT as an encouraging step from “frequencies” to “meanings.”

“There is much prior research about passage term weighting, but it has not been clear how to effectively model a word’s syntax and semantics in specific passages. Our results show that a deep, contextualized neural language model is able to capture some of the desired properties, and can be used to generate effective term weights for passage indexing. A BM25 retrieval on DeepCT-Index can be 25% more accurate than classic tf -based indexes, and are more accurate than some widely-used multi-stage retrieval systems. The improved first stage ranking further benefits the effectiveness and efficiency of downstream re-rankers.”

(Dai, 2020)

Back to Google’s passage indexing announcement

Let’s just revisit the key message from Google during the Search On event about passage-indexing: “With our new technology, we’ll be able to better identify and understand key passages on a web page. This will help us surface content that might otherwise not be seen as relevant when considering a page only as a whole….”

Which sounds similar to Dai: “A novel use of DeepCT is to identify terms that are central to the meaning of a passage, or a passage-long document, for efficient and effective passage/short-document retrieval.”

Back to the Search On event: “This change doesn’t mean we’re indexing individual passages independently of pages. We’re still indexing pages and considering info about entire pages for ranking. But now we can also consider passages from pages as an additional ranking factor….”

Which may map to something like this (on the same index): a weighted contextual ranking factor applied at passage level within the current document index.

Remember, Dai (2020) makes it clear no further postings are created in DeepCT-Index. Nothing changes in the structure of the index, but perhaps different contextual measures are added using BERT, with tfDeepCT providing that context. (Note: I have no proof of this beyond the literature and the current TFR-BERT model submissions):

“This paper also presents a novel approach that runs DeepCT at offline index time, making it possible to use it in first-stage retrieval where efficiency is crucial. Our approach applies DeepCT over each passage in the corpus, and stores the context-aware term weights in an ordinary inverted index to replace tf. The index can be searched efficiently using common bag-of-words retrieval models such as BM25 or statistical query likelihood models.”

(Dai, 2019)

What could be the significance of DeepCT to passage-indexing?

Well, if DeepCT were used, it may mean that the “counts of keywords” and “counts of x, y and z” features referred to in the 2018 videos on passage retrieval may not be quite as important as SEOs hoped when passage indexing rolls out later this year, since DeepCT (if it is used) might take a different approach from the one described in those videos.

I mean, seriously, how many entities and keywords could one stuff into a passage in text anyway without it being spammy?

That’s not to say the work from 2018 is not important, because there is also ongoing work with BERT and knowledge bases which might have an impact, and furthermore Google’s work on T5 explored whether models like BERT could augment the knowledge in their parameters simply from a large crawl of the web. So too does some of Dai’s other work on HDCT (Dai, 2019), another framework for passage retrieval and indexing. There, Dai does appear to weight the positions of passages within a document, as well as the passage deemed the “best” in a document. Titles and inlinks are also seen as indicators of importance in HDCT.

But Google has not chosen to include HDCT in their submitted TFR-BERT, and I suspect (though this is just my opinion) that it relates to the potential for spam in models which weight terms simply by how many inlinks and page-title keywords they have.

If DeepCT is used, it really will be about providing a rich depth of compelling and authoritative content with focus and structure in sections on a page. The semantic headings and page title will likely also help of course, but after all there is only so much one can do with those features to differentiate oneself from competitors.

One other point

You’ll also notice many of the 2018 videos on passage retrieval are around the topic of “Factoid Search,” which is not the same as “open domain answers,” which are longer, less simple to provide answers to, and much more nuanced.

The answers to factoid questions are easy to find in knowledge bases compared with nuanced, complex open-domain questions such as the one in the passage indexing example provided by Google. Those types of questions require understanding the true context of each word, and are likely only met by contextual term understanding models such as BERT, which did not appear until late 2018 in the first place. Answering more complex open-domain questions might well constitute the 7% of queries mentioned as the starting point during the Search On event, since that figure is not high.

If DeepCT (or a future iteration of it) is used in production passage ranking, it has the potential to bring huge efficiencies to first-stage ranking and improved second-stage ranking overall in search engines (particularly since, as with all things, it will be built upon and improved further by the research world).

DeepCT, or innovations similar to it, could also be the secret sauce which takes search engines truly from “keyword counts (tf)” in first-stage retrieval to something far more capable of understanding a word’s meaning.  Initially in passages, but then… who knows?

We’ve heard already about the efficiency problems involved with first stage ranking and the need to only use deep learning at the later stages as a re-ranker, but things may be about to change.  Furthermore, search engines have relied on first stage rankings involving systems such as term frequency for many years according to plenty of literature, and that too may be about to change.

That’s not to say a passage, or document without a single relevant word is going to rank easily, because it “probably” won’t, although we do know now that it’s not just words on a page which add value.

BERT everywhere

Whilst we now know BERT is used on nearly all queries, its use for passage indexing (initially on 7% of queries) could become increasingly prevalent if and when passage indexing expands to impact more queries.

BERT everywhere would likely be a prerequisite if DeepCT were used in order to build the tfDeepCT embeddings in the index.

That said, BERT and other neural networks are likely not always needed anyway on very short or navigational queries.

There’s not a lot of natural language understanding needed for the query “red shoes” or “ASOS dresses,” after all, since the intent is usually quite clear, aside from whether the query requires different media to a simple ten blue links (e.g. images).

But, as mentioned, DeepCT may not even be in the production mix

At this stage however, Google may simply be happy enough with BERT as a re-ranker on long open-domain questions rather than factoid questions which are easier to answer, but that doesn’t really feel like “a breakthrough in ranking” since passage ranking has been around for quite some time, albeit the re-ranking element is fairly recent.

In any event, even without DeepCT, given the overwhelming use of BERT and BERT-like systems in passage re-ranking it is “probably” part of the forthcoming passage update.

So, where to next and why just 7% of queries?

So, we know BERT was being used, at least initially, on 10% of queries, probably in the second stage of ranking (re-ranking) because of computational costs, probably only on the most nuanced queries, and probably not as a passage ranker or re-ranker but as a sentence-level disambiguation tool and for text summarization (featured snippets).

We know that neural ranking approaches with BERT and other deep neural networks have been too computationally expensive to run at the first stage of search across the search industry, and that there has been a limit on the number of tokens BERT can work with – 512 tokens.  But 2020 has been a big year, and developments to scale natural-language attention systems have included such innovations as Big Bird, Reformer, Performers and ELECTRA, plus T5 to test the limits of transfer learning, making huge inroads. And those are just projects Google is involved with in some capacity, not to mention the other big tech search companies.
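
For illustration, here is a minimal sketch of the common workaround for that 512-token limit: slide an overlapping window over a long document and treat each window as a passage. Whitespace tokenization is used purely for simplicity; a production system would use BERT's own WordPiece tokenizer and reserve room for special tokens and the query, and the window and stride values below are assumptions.

```python
# Split a long document into overlapping passages that fit a BERT-style
# token budget (simplified: whitespace "tokens" stand in for WordPiece).
def split_into_passages(text, window=450, stride=225):
    tokens = text.split()
    passages = []
    for start in range(0, max(len(tokens) - stride, 1), stride):
        passages.append(" ".join(tokens[start:start + window]))
    return passages

doc = "word " * 2000
print(len(split_into_passages(doc)))   # one long page becomes several passages
```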

Whilst much of this work is very new, a year is a long time in the AI NLP research space so expect huge changes by this time next year.

Regardless of whether DeepCT is used in the forthcoming production search passage indexing feature, it is highly likely BERT has a strong connection to the change, given the overwhelming use of BERT (and friends) as a passage reranker in the research of the past 12 months or so.

Passages, with their limited number of tokens, if taken as standalone pieces, arguably limit by their nature the effectiveness of keywords alone without contextual representation; and certainly, stuffing a passage with keywords to overcome this would be a step backwards, back toward the keyword-esque language search engines are trying to move away from.

By using contextual representations to understand a word’s meaning in a given context, detection of searcher intent is greatly improved.

What does this mean for SEOs?

As you may recall from the Frederic Dubut of Bing video from early 2020, Bing have been using BERT since last April and also claim to be using something BERT-like everywhere in their search engine systems.  Whilst Bing may not have the same search market share as Google, they do have an impressive natural language understanding research team, well respected in their space.

Frederic said it was time for SEOs to focus on intent research practices, but I do not believe that meant we should not consider words, since after all, language is built on words. Even DeepCT does not claim to be able to understand intent without words. But Frederic was perhaps advising SEOs to move away from the keyword-esque type “x number of keyword mentions on a page” approaches and more toward aligning increasingly with truly understanding the intent behind information needs.

That said, structure and focus in content have ALWAYS mattered, and never more so than now, when contextual clarity will be even more important in writing. Subtopics throughout a long-form document will be an important part of that, since passages will likely be these long documents chopped into parts.

Clear section headings and focus to meet an information need at each stage are undoubtedly always going to be useful, despite this not necessarily being an SEO ‘thing’. I’d certainly be revisiting those spurious blog posts with little topical centrality and improving them to add further value as a first point of advice.

Plus, the use of the <section> element in HTML5 is not there for no reason, after all.

The Mozilla Foundation provides a great example of the use of this ‘standalone’ section markup and content combined.

Also, don’t just rely on rank trackers to understand intent. The SERPs and the types of sites ranking and the content within them are undoubtedly the best measure of what you should be talking about in your passages to meet informational needs.  It’s not always what you expect.

These developments with BERT everywhere (and with passages, if BERT and DeepCT are used) reinforce that further.

As Google’s Prabhakar Raghavan said, “This is just the start.”

He is not wrong.

Whilst there are currently limitations for BERT in long documents, passages seem an ideal place to start toward a new ‘intent-detection’ led search. This is particularly so, when search engines begin to ‘Augment Knowledge’ from queries and connections to knowledge bases and repositories outside of standard search, and there is much work in this space ongoing currently.

But that is for another article.

References and sources

Beltagy, I., Peters, M.E. and Cohan, A., 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

Bendersky, M. and Kurland, O., 2008, July. Re-ranking search results using document-passage graphs. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval (pp. 853-854).

Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, L. and Belanger, D., 2020. Rethinking Attention with Performers. arXiv preprint arXiv:2009.14794.

Clark, K., Luong, M.T., Le, Q.V. and Manning, C.D., 2020. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.

Dai, Z. and Callan, J., 2019. An Evaluation of Weakly-Supervised DeepCT in the TREC 2019 Deep Learning Track. In TREC.

Dai, Z. and Callan, J., 2019, July. Deeper text understanding for IR with contextual neural language modeling. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 985-988).

Dai, Z. and Callan, J., 2019. Context-aware sentence/passage term importance estimation for first stage retrieval. arXiv preprint arXiv:1910.10687.

Dai, Z. and Callan, J., 2020, July. Context-Aware Term Weighting For First Stage Passage Retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 1533-1536).

Dai, Z. and Callan, J., 2020, April. Context-Aware Document Term Weighting for Ad-Hoc Search. In Proceedings of The Web Conference 2020 (pp. 1897-1907).

Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Evans, D.A., Claritech Corp, 1999. Information retrieval based on use of sub-documents. U.S. Patent 5,999,925.

Han, S., Wang, X., Bendersky, M. and Najork, M., 2020. Learning-to-Rank with BERT in TF-Ranking. arXiv preprint arXiv:2004.08476.

Joshi, M., Choi, E., Weld, D.S. and Zettlemoyer, L., 2017. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.

Karpukhin, V., Oğuz, B., Min, S., Wu, L., Edunov, S., Chen, D. and Yih, W.T., 2020. Dense Passage Retrieval for Open-Domain Question Answering. arXiv preprint arXiv:2004.04906.

Kitaev, N., Kaiser, Ł. and Levskaya, A., 2020. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451.

Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K. and Toutanova, K., 2019. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7, pp.453-466.

Lin, J., Nogueira, R. and Yates, A., 2020. Pretrained Transformers for Text Ranking: BERT and Beyond. arXiv preprint arXiv:2010.06467.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L. and Stoyanov, V., 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R. and Deng, L., 2016. Ms marco: A human-generated machine reading comprehension dataset.

Nogueira, R. and Cho, K., 2019. Passage Re-ranking with BERT. arXiv preprint arXiv:1901.04085.

Nogueira, R., Yang, W., Cho, K. and Lin, J., 2019. Multi-stage document ranking with BERT. arXiv preprint arXiv:1910.14424.

Pasumarthi, R.K., Wang, X., Li, C., Bruch, S., Bendersky, M., Najork, M., Pfeifer, J., Golbandi, N., Anil, R. and Wolf, S., 2018. TF-Ranking: Scalable TensorFlow Library for Learning-to-Rank. arXiv preprint arXiv:1812.00073.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W. and Liu, P.J., 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Search On with Google. 2020. Search On with Google 2020. [ONLINE] Available at: https://searchon.withgoogle.com/. [Accessed 25 October 2020].

Sekulić, I., Soleimani, A., Aliannejadi, M. and Crestani, F., 2020. Longformer for MS MARCO Document Re-ranking Task. arXiv preprint arXiv:2009.09392.

seroundtable.com. 2020. Google Says Being On The First Page Of Search Means You Are Doing Things Right. [ONLINE] Available at: https://www.seroundtable.com/google-first-page-doing-things-right-29431.html. [Accessed 25 October 2020].

Liu, T.Y., 2009. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3), pp.225-331.

Wang, S., Zhou, L., Gan, Z., Chen, Y.C., Fang, Y., Sun, S., Cheng, Y. and Liu, J., 2020. Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding. arXiv preprint arXiv:2009.06097.

Wang, X., Li, C., Golbandi, N., Bendersky, M. and Najork, M., 2018, October. The lambdaloss framework for ranking metric optimization. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (pp. 1313-1322).

Zaheer, M., Guruganesh, G., Dubey, A., Ainslie, J., Alberti, C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang, L. and Ahmed, A., 2020. Big bird: Transformers for longer sequences. arXiv preprint arXiv:2007.14062.

